NoSQL DB Comparison Zhang Gang 2012.9.27

Upload: teresa-berry

Post on 29-Dec-2015


Page 1: Zhang Gang 2012.9.27. Big data High scalability One time write, multi times read …….(to be add )

NoSQL DB Comparison

Zhang Gang, 2012.9.27

Page 2:

DIRAC Accounting System needs

- Big data
- High scalability
- Write once, read many times
- ……. (to be added)

Page 3:

Features

Page 4:

Riak

- Written in: Erlang & C, some JavaScript
- Main point: Fault tolerance
- Principle from Amazon's Dynamo paper
- Tunable trade-offs for distribution and replication (N, R, W)
- Map/reduce in JavaScript or Erlang
- Masterless multi-site replication
- Language support: includes Python
- Supports full-text search, indexing, and querying with the Riak Search server
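The (N, R, W) trade-off can be illustrated with a small sketch. This is not Riak's API; the function name and the returned keys are hypothetical. The only rule assumed is the usual quorum condition: with N replicas, a read that consults R replicas always overlaps a write acknowledged by W replicas when R + W > N.

```python
def quorum_properties(n: int, r: int, w: int) -> dict:
    """Describe the trade-offs of an (N, R, W) setting (illustrative only)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return {
        # Every read set intersects every write set when R + W > N.
        "read_sees_latest_write": r + w > n,
        # Reads still succeed with this many replicas down.
        "read_fault_tolerance": n - r,
        # Writes still succeed with this many replicas down.
        "write_fault_tolerance": n - w,
    }


# Two settings for N = 3: a balanced quorum and a fast, weakly consistent one.
balanced = quorum_properties(3, 2, 2)
fast = quorum_properties(3, 1, 1)
```

Lowering R and W makes each operation tolerate more failed replicas, at the price of reads that may miss the most recent write.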

Page 5:

Riak

Best used: If you want something Cassandra-like (Dynamo-like), but no way you're gonna deal with the bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication.

For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.

Page 6:

CouchDB

- Written in: Erlang
- Main point: embrace the web, ease of use
- Document-oriented
- Data format: JSON
- Bi-directional replication and off-line operation in mind
- MVCC: write operations do not block reads
- Needs compacting from time to time
- Views: embedded map/reduce
- Built for off-line
- Automatically replicates all the data to all servers
- Supports ACID semantics
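The MVCC point can be sketched with a tiny in-memory store. This is an analogy, not CouchDB's API: the class and method names are hypothetical, and integer revisions stand in for CouchDB's revision strings. The idea shown is that every document carries a revision, readers get a consistent copy without locking, and an update based on a stale revision is rejected as a conflict.

```python
class RevisionConflict(Exception):
    """Raised when an update is based on a stale revision."""


class TinyDocStore:
    """In-memory stand-in for a document store with revision checks."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # hand out a copy; readers never block writers

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise RevisionConflict(
                f"{doc_id}: expected rev {current_rev}, got {rev}"
            )
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))  # new version replaces old
        return new_rev


store = TinyDocStore()
first_rev = store.put("job-1", {"state": "queued"})
second_rev = store.put("job-1", {"state": "running"}, rev=first_rev)
```

A second writer still holding `first_rev` would now get a `RevisionConflict` and has to re-read the document before retrying.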

Page 7:

CouchDB

Best used: The replication and synchronization capabilities of CouchDB make it ideal for mobile devices, where a network connection is not guaranteed but the application must keep working offline.

For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.

For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.

Page 8:

Cassandra

- Written in: Java
- Main point: Best of BigTable and Dynamo
- Tunable trade-offs for distribution and replication (N, R, W)
- Querying by column, range of keys
- BigTable-like features: columns, column families
- Has secondary indices
- Writes are much faster than reads (!)
- Map/reduce possible with Apache Hadoop
- All nodes are similar, as opposed to Hadoop/HBase
- Gossip protocol, multi data center, no single point of failure
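The "columns, column families" model and "querying by column, range of keys" can be pictured with a toy structure. This is only a sketch of the data model, not Cassandra's client API; the class name, method names and sample keys are hypothetical, and it assumes row keys are kept in sorted order, which in a real cluster depends on how keys are partitioned.

```python
from bisect import bisect_left, bisect_right


class TinyColumnFamily:
    """Toy column family: sorted row keys, each row a dict of columns."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row key -> {column name: value}

    def insert(self, row_key, columns):
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)  # rows may have different columns

    def get_range(self, start_key, end_key, column=None):
        """Return rows with start_key <= key <= end_key, optionally one column."""
        lo = bisect_left(self._keys, start_key)
        hi = bisect_right(self._keys, end_key)
        result = {}
        for key in self._keys[lo:hi]:
            row = self._rows[key]
            if column is None:
                result[key] = dict(row)
            elif column in row:
                result[key] = {column: row[column]}
        return result


jobs = TinyColumnFamily()
jobs.insert("2012-09-03", {"Site": "B", "CPUTime": 7})
jobs.insert("2012-09-01", {"Site": "A", "CPUTime": 10})
jobs.insert("2012-09-02", {"Site": "A"})
cpu_by_day = jobs.get_range("2012-09-01", "2012-09-02", column="CPUTime")
```

Rows without the requested column are simply skipped, which mirrors the sparse nature of column-family stores.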

Page 9:

Cassandra

Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")

For example: Banking, financial industry. Writes are faster than reads, so one natural niche is real-time data analysis.

Page 10:

Hadoop HBase

- Written in: Java
- Main point: Billions of rows X millions of columns
- Modeled after Google's BigTable
- Uses Hadoop's HDFS as storage
- Map/reduce with Hadoop
- Optimizations for real-time queries
- A high-performance Thrift gateway (access interface)
- Cascading, Hive, and Pig source and sink modules
- Random access performance is like MySQL
- A cluster consists of several different types of nodes (Master/RegionServer)
- Does not scale down to small installations

Page 11:

Hadoop HBase

Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already.

For example: Analysing log data.
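The map/reduce style used for log analysis can be sketched in plain Python. This runs in a single process and is not Hadoop's API; the function names and the log line layout (timestamp, status, message) are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status, 1) for one log line of the form '<time> <status> <msg>'."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def reduce_phase(key, values):
    """Combine all counts emitted for one key."""
    return key, sum(values)


def run_map_reduce(lines):
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)  # the "shuffle" step: group by key
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_log = [
    "2012-09-27T10:00 Done job-1",
    "2012-09-27T10:01 Failed job-2",
    "2012-09-27T10:02 Done job-3",
]
status_counts = run_map_reduce(sample_log)
```

On a real cluster the map and reduce calls run in parallel on different nodes, but the contract between the two phases is the same.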

Page 12:

Comparison

Page 13:

Cassandra VS CouchDB

Points that favor CouchDB:
- A document store
- Offline replication
- Embraces the web
- Automatically replicates all the data to all servers, which is impractical for a very large number of replicas and for very large databases.

These features may be unsuitable for the DIRAC Accounting System.

So, compared with CouchDB, I think Cassandra wins.

Page 14:

Cassandra VS Riak

- Both are architecturally strongly influenced by Dynamo
- Both also go beyond Dynamo in providing a "richer than pure K/V" data model

Points that favor Cassandra:
- Speed
- Support for clusters spanning multiple data centers
- Big names using it (Digg, Twitter, Facebook, WebEx, ...)

Points that favor Riak:
- Map/reduce support out of the box (Cassandra can do it with Hadoop map/reduce)

So, I think Cassandra may win again.

Page 15:

HBase VS Cassandra

- Cassandra has only one type of node; all nodes are similar. HBase consists of several different types of nodes (Master/RegionServer).
- HBase must be deployed on top of HDFS; by comparison, Cassandra is much simpler.
- Data consistency in Cassandra is tunable (N, W, R).
- HBase has better support for map/reduce.
- HBase provides the developer with row-locking facilities, whereas Cassandra does not. Cassandra just uses timestamps.
- Cassandra has better I/O performance and better scalability, but is not good at range scans.
- CAP: Cassandra focuses on AP and HBase focuses on CP.
- HBase has an SQL-compatible interface (Hive), so HBase supports SQL-style queries.
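The remark that Cassandra relies on timestamps instead of row locks can be illustrated with a last-write-wins rule. This is a minimal sketch, not Cassandra's implementation; the function name is hypothetical and the tie-break is whatever `max` does, whereas a real system needs a deterministic rule for equal timestamps.

```python
def resolve_last_write_wins(versions):
    """Pick the value with the highest timestamp from (timestamp, value) pairs."""
    if not versions:
        raise ValueError("no versions to resolve")
    # No lock is taken: replicas may hold different versions of a column,
    # and the reader keeps the most recent one.
    # On equal timestamps, max() returns the first pair it encountered.
    return max(versions, key=lambda pair: pair[0])[1]


# Three replicas answered with different versions of the same column.
replica_answers = [(1348700000, "queued"), (1348700300, "done"), (1348700100, "running")]
winner = resolve_last_write_wins(replica_answers)
```

With row locking, concurrent writers are serialized up front; with timestamps, both writes are accepted and the conflict is settled when the versions are compared.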

Page 16:

HBase VS Cassandra

- Cassandra's structure is simple, so deployment and maintenance are simple (saving money and time). By comparison, HBase is much more complex to deploy and maintain. But we already have a Hadoop cluster here.
- HBase may be more suitable for data warehousing and large-scale data processing and analysis, while Cassandra is more suitable for real-time transaction processing and serving interactive data.

Page 17:

HBase VS Cassandra

HBase has been recognized by the WLCG Database Technical Evolution Group as having the greatest potential impact in the LHC experiments out of all NoSQL technologies. The CERN IT organization is setting up a cluster to try it.

So, for an accounting system, I think HBase may be a good choice.

Page 18:

A few company use cases


Page 19:

Use in production for CMS and ATLAS

CouchDB: CMS uses CouchDB in production for parts of its Data and Workflow Management systems, in particular for some queues and for the job state machine. The installation has 3 replicas of a CouchDB database at CERN and 4 replicas of the same database at Fermilab.

Page 20:

Use in production for CMS and ATLAS

HBase: HBase is used in production by ATLAS in its Distributed Data Manager, called DQ2, for both log analysis and accounting on a 12-node cluster. Run on HBase, their original accounting-summary method was 8 to 20 times faster than the same method on the shared Oracle system they had, depending on the HDFS replication level.

Page 21:

Use in production for CMS and ATLAS

Cassandra: Cassandra is used in production by ATLAS PanDA monitoring. They chose to host it at BNL on only 3 nodes that were quite high-powered: each node has 24 cores and 1 terabyte of RAID 0 solid-state disks (SSDs).

Page 22:

Script

Use the records in the type_* tables to draw some pie plots.

Page 23:

Script

FUNCTION:

def generatePlotByTime(groupby, generate, keyTableName, startTime, endTime):
    "the main function, generates a plot from the parameters"

def getTrueValue(keyTableName, index):
    "looks up the key table to get the true value for an index"

Calling like this:

generatePlotByTime('Site', 'CPUTime', 'ac_key_Lhcb-Production_job_Site', '2010-6-20', '2012-6-20')
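The slides give only the signatures, so the following is a guess at the aggregation step behind such a pie plot, not the author's implementation. The function names, the record layout and the `day` field are hypothetical; the real script reads the accounting tables and resolves keys through getTrueValue, which is not shown here.

```python
from collections import defaultdict
from datetime import datetime


def parse_day(text):
    """Parse dates written like '2010-6-20'."""
    return datetime.strptime(text, "%Y-%m-%d")


def aggregate_for_pie(records, groupby, generate, start_time, end_time):
    """Sum the `generate` field per `groupby` value and return pie shares."""
    start, end = parse_day(start_time), parse_day(end_time)
    totals = defaultdict(float)
    for rec in records:
        if start <= parse_day(rec["day"]) <= end:  # keep records in the window
            totals[rec[groupby]] += rec[generate]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing to plot
    return {key: value / grand_total for key, value in totals.items()}


# Made-up records, only to show the shape of the input.
sample_records = [
    {"day": "2011-1-5", "Site": "A", "CPUTime": 30.0},
    {"day": "2011-2-1", "Site": "B", "CPUTime": 10.0},
    {"day": "2013-1-1", "Site": "A", "CPUTime": 100.0},  # outside the window
]
shares = aggregate_for_pie(sample_records, "Site", "CPUTime", "2010-6-20", "2012-6-20")
```

The returned fractions can be handed to any plotting library as the slice sizes of the pie.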

Page 24:

Script

DiskSpace grouped by Site took about 97.39 s

Page 25:

Script

CPU time grouped by User took about 97.03 s

Page 26:

Script

CPU time grouped by UserGroup took about 93.62 s

Page 27:

Script

DiskSpace grouped by ProcessingType took 94.82 s

Page 28:

Script-bytime

Processing time: 86.64 s

Page 29:

Script-bytime

Processing time: 97.69 s

Page 30:


Thanks