zhang gang 2012.9.27. big data high scalability one time write, multi times read …….(to be add )

NoSQL DB Comparison

Zhang Gang, 2012.9.27

DIRAC Accounting system needs

Big data
High scalability
Write once, read many times
……. (to be added)

Features

Riak
Written in: Erlang & C, some JavaScript
Main point: Fault tolerance
Principles from Amazon's Dynamo paper
Tunable trade-offs for distribution and replication (N, R, W)
Map/reduce in JavaScript or Erlang
Masterless multi-site replication
Language support: includes Python
Supports full-text search, indexing, and querying with the Riak Search server
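The (N, R, W) trade-off can be illustrated with a small sketch. N is the number of replicas of each key, R the number of replicas that must answer a read, and W the number that must acknowledge a write. In Dynamo-style systems, choosing R + W > N makes every read quorum overlap every write quorum. The function names below are hypothetical and are not part of Riak's API.

```python
def quorums_overlap(n, r, w):
    """Return True when every read quorum intersects every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def tolerated_failures(n, r, w):
    """Number of replicas that may be down while reads/writes still succeed."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return {"read": n - r, "write": n - w}


if __name__ == "__main__":
    # Compare a quorum setting with a fast but weakly consistent one.
    for n, r, w in [(3, 2, 2), (3, 1, 1)]:
        print(n, r, w, quorums_overlap(n, r, w), tolerated_failures(n, r, w))
```

Lower R and W give faster operations and more tolerance of failed replicas, at the price of possibly reading stale data.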

Riak

Best used: If you want something Cassandra-like (Dynamo-like), but no way you're gonna deal with the bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication.

For example: Point-of-sale data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.

CouchDB
Written in: Erlang
Main point: embrace the web, ease of use
Document-oriented
Data format: JSON
Bi-directional replication and off-line operation in mind
MVCC: write operations do not block reads
Needs compacting from time to time
Views: embedded map/reduce
Built for off-line use
Automatically replicates all the data to all servers
Supports ACID semantics for single-document updates
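The MVCC point can be illustrated with a minimal in-memory sketch. In CouchDB every document carries a revision, and an update must name the revision it was based on; an update based on a stale revision is rejected as a conflict instead of blocking readers. The class below is a hypothetical toy, not CouchDB's API, and it uses integer revisions where CouchDB uses revision strings.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class TinyDocStore:
    """Toy document store with optimistic, revision-checked updates."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # copy, so readers never see later edits

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


if __name__ == "__main__":
    store = TinyDocStore()
    rev = store.put("job-1", {"state": "queued"})
    rev = store.put("job-1", {"state": "running"}, rev=rev)
    print(store.get("job-1"))
```

A writer that loses the race re-reads the document, reapplies its change, and retries with the new revision.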

CouchDB

Best used: CouchDB's replication and synchronization capabilities make it ideal for mobile devices, where a network connection is not guaranteed but the application must keep working offline.

For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.

For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.

Cassandra
Written in: Java
Main point: Best of BigTable and Dynamo
Tunable trade-offs for distribution and replication (N, R, W)
Querying by column, range of keys
BigTable-like features: columns, column families
Has secondary indices
Writes are much faster than reads (!)
Map/reduce possible with Apache Hadoop
All nodes are similar, as opposed to Hadoop/HBase
Gossip protocol, multi data center, no single point of failure

Cassandra

Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

For example: Banking, financial industry. Writes are faster than reads, so one natural niche is real-time data analysis.

Hadoop HBase
Written in: Java
Main point: Billions of rows X millions of columns
Modeled after Google's BigTable
Uses Hadoop's HDFS as storage
Map/reduce with Hadoop
Optimizations for real-time queries
A high-performance Thrift gateway (access interface)
Cascading, Hive, and Pig source and sink modules
Random access performance is like MySQL
A cluster consists of several different types of nodes (Master/RegionServer)
Does not scale down to small installations
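BigTable-style stores keep rows sorted by row key, which is what makes scans over a key range cheap. The sketch below imitates that idea in memory with the standard `bisect` module. The class name and the `site|date` row-key layout are hypothetical illustrations, not HBase's client API.

```python
import bisect


class SortedRows:
    """Toy table that keeps row keys sorted, as BigTable-like stores do."""

    def __init__(self):
        self._keys = []  # row keys in sorted order
        self._rows = {}  # row key -> {column name: value}

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


if __name__ == "__main__":
    table = SortedRows()
    table.put("siteB|2012-06-01", {"cpu": 4.0})
    table.put("siteA|2012-06-02", {"cpu": 2.0})
    table.put("siteA|2012-06-01", {"cpu": 1.0})
    # All rows for siteA: keys from "siteA|" up to, but excluding, "siteB".
    print(table.scan("siteA|", "siteB"))
```

Putting the most frequently filtered attribute first in the row key lets one contiguous scan answer the query.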

Hadoop HBase

Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.

For example: Analysing log data.

Comparison

Cassandra VS CouchDB

Points that favor CouchDB:
A document store
Offline replication
Embraces the web

CouchDB automatically replicates all the data to all servers, which is impractical for a very large number of replicas and very large databases.

This feature may be unsuitable for the DIRAC Accounting System.

So, compared with CouchDB, I think Cassandra wins.

Cassandra VS Riak

Both are architecturally strongly influenced by Dynamo.
Both also go beyond Dynamo in providing a "richer than pure K/V" data model.

Points that favor Cassandra:
Speed
Support for clusters spanning multiple data centers
Big names using it (digg, twitter, facebook, webex, ... )

Points that favor Riak:
Map/reduce support out of the box (Cassandra can do it with Hadoop map/reduce)

So I think Cassandra may win again.

HBase VS Cassandra (C = Cassandra, H = HBase)

C has only one type of node; all nodes are similar. H consists of several different types of nodes (Master/RegionServer).
H must be deployed on top of HDFS; compared with this, C is much simpler.
Data consistency of C is tunable (N, W, R).
H has better support for map/reduce.
H provides the developer with row-locking facilities, whereas C does not; C just uses timestamps.
C has better I/O performance and better scalability, but is not good at range scans.
CAP: C focuses on AP and H focuses on CP.
H has an SQL compatibility interface (Hive), so H supports SQL-like queries.
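The remark that Cassandra relies on timestamps rather than row locks can be sketched as a last-write-wins merge: each column value carries a write timestamp, and when two replicas disagree the value with the larger timestamp is kept. The function names and the `(timestamp, value)` layout below are hypothetical, and on equal timestamps this toy simply keeps the first argument, which is an assumption of the sketch.

```python
def merge_column(a, b):
    """Each version is (timestamp, value); keep the later write."""
    return a if a[0] >= b[0] else b


def merge_row(replica_a, replica_b):
    """Merge two replica views of a row, column by column."""
    merged = dict(replica_a)
    for column, version in replica_b.items():
        if column in merged:
            merged[column] = merge_column(merged[column], version)
        else:
            merged[column] = version
    return merged


if __name__ == "__main__":
    a = {"cpu": (10, 5.0)}
    b = {"cpu": (12, 7.5), "disk": (3, 1.0)}
    print(merge_row(a, b))
```

No lock is taken at any point, which is why concurrent writers never block each other but can silently overwrite one another.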

HBase VS Cassandra

The structure of Cassandra is simple, so deployment and maintenance are simple (saving money and time). By comparison, HBase is much more complex to deploy and maintain. But we already have a Hadoop cluster here.

HBase may be more suitable for data warehousing and for large-scale data processing and analysis, while Cassandra is more suitable for real-time transaction processing and the serving of interactive data.

HBase VS Cassandra

HBase has been recognized by the WLCG Database Technical Evolution Group as having the greatest potential impact in the LHC experiments out of all NoSQL technologies. The CERN IT organization is setting up a cluster to try it.

So, for an accounting system, I think HBase may be a good choice.

A few company use cases


Use in production for CMS and ATLAS

CouchDB: CMS uses CouchDB in production for parts of its Data and Workflow Management systems, in particular for some queues and for the job state machine. The installation has 3 replicas of a CouchDB database at CERN and 4 replicas of the same database at Fermilab.


HBase: HBase is used in production by ATLAS in its Distributed Data Manager, called DQ2, for both log analysis and accounting on a 12-node cluster. Their original accounting summary method ran 8 to 20 times faster there than the same method on their shared Oracle system, depending on the HDFS replication level.


Cassandra: Cassandra is used in production by ATLAS PanDA monitoring. They chose to host it at BNL on only 3 nodes that are quite high-powered: each node has 24 cores and 1 terabyte of RAID0 solid-state disks (SSDs).

Script

Use the records in the type_* tables to draw some pie plots

Script

FUNCTION:

    def generatePlotByTime(groupby, generate, keyTableName, startTime, endTime):
        "the main function, generate a plot by parameters"

    def getTrueValue(keyTableName, index):
        "select the key tables to get the true value by index"

Calling like this:

    generatePlotByTime('Site', 'CPUTime', 'ac_key_Lhcb-Production_job_Site', '2010-6-20', '2012-6-20')
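The kind of aggregation behind such a pie plot can be sketched as follows. This is not the script from the slides: the record layout, the function names `group_totals` and `pie_fractions`, and the sample values are all hypothetical, chosen only to show grouping by one key over a time window.

```python
from collections import defaultdict
from datetime import datetime


def group_totals(records, groupby, field, start, end):
    """Sum `field` per `groupby` value for records with start <= time < end."""
    totals = defaultdict(float)
    for rec in records:
        if start <= rec["time"] < end:
            totals[rec[groupby]] += rec[field]
    return dict(totals)


def pie_fractions(totals):
    """Turn per-group totals into the fractions a pie plot would show."""
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {key: value / grand_total for key, value in totals.items()}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"Site": "A", "CPUTime": 1.0, "time": datetime(2011, 1, 1)},
        {"Site": "A", "CPUTime": 2.0, "time": datetime(2011, 6, 1)},
        {"Site": "B", "CPUTime": 1.0, "time": datetime(2012, 1, 1)},
    ]
    totals = group_totals(records, "Site", "CPUTime",
                          datetime(2010, 6, 20), datetime(2012, 6, 20))
    print(pie_fractions(totals))
```

The resulting fractions can be handed to any plotting library to draw the pie.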

Script

DiskSpace grouped by Site took about 97.39 s
CPU time grouped by User took about 97.03 s
CPU time grouped by UserGroup took about 93.62 s
DiskSpace grouped by ProcessingType took 94.82 s

Script-bytime

Processing time: 86.64 s
Processing time: 97.69 s


Thanks
