Download - Overiew of Cassandra and Doradus
Overview of Cassandra and The Doradus OSS Project
Randy Guck
Principal Engineer, Dell Software
Overview
• What is No SQL? – Common RDB roadblocks – NoSQL database types
• Overview of Cassandra – What's unique – Limitations
• Doradus – Architecture – Features – The OLAP and Spider storage managers – What each is good for – Where to get Doradus
Why RDB Apps Look for Something Else
• Performance – B-trees – Locking – One writable copy of each record
• Scaling costs – RDBs scale "up" – Big boxes, SANs, fiber channel, etc.
• What if you want... – Distributed access – No single points of failure – Instant failover – Sharding – Replication
NoSQL Data Models
Data Model Examples Elastic? Queries? Relationships?
Key–Value LevelDB, Kyoto Cabinet, Redis
No No No
Distributed Key–Value
Dynamo, MemcacheDB, Riak, Voldemort
Yes No No
Column-Oriented Accumulo, Cassandra, HBase
Yes Some No
Document-Oriented
Couchbase, Elasticsearch, MongoDB
Yes Yes Some
Graph Neo4J, OrientDB, Titan No Yes Yes
Sharding + replication
AND/OR/ranges/etc.
Built-in support
NoSQL Data Models
Data Model Examples Elastic? Queries? Relationships?
Key–Value LevelDB, Kyoto Cabinet, Redis
No No No
Distributed Key–Value
Dynamo, MemcacheDB, Riak, Voldemort
Yes No No
Column-Oriented Accumulo, Cassandra, HBase
Yes Some No
Document-Oriented
Couchbase, Elasticsearch, MongoDB
Yes Yes Some
Graph Neo4J, OrientDB, Titan No Yes Yes
Sharding + replication
AND/OR/ranges/etc.
Built-in support
Doradus goals
NoSQL Common Traits
• Distributed cluster of nodes – Commodity, shared-nothing servers – Scales horizontally – Expands elastically
• Replication – Performant local access – Automatic failover
• De-normalized data model
• Schemaless/dynamic columns
• Eventual consistency
N=5, RF=3
Is NoSQL Catching On?
Source: db-engines.com
Overview of Cassandra
• Wide column NoSQL database
• Open sourced by Facebook
• Apache Project with active community
• Commercially support by DataStax, Acunu, others
• Used by 1,500+ companies
• "Pure peer" architecture
• Largest known Cassandra cluster: 300+ TB data and 400+ machines.
What is Cassandra best for?
• Continuous data streams – Logs, events, audit records, measurements, ... – Fast data ingestion – Predictable read performance
• Partitionable data – "1,000's of little databases in one"
• Elastic scalability – Expand/upgrade/repair without downtime
• Not good for: – Blob store – Persistent queue – OLTP transactions
CQL Static Table
CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text, data blob ); CREATE INDEX ON songs (artist);
Row Key Columns: "<column name>"="<column value>"
62c36... "album"="90125" "artist"="Yes" "data"=<audio> "title"="Changes"
837a2... "album"="Crystal Ball" "artist"="Styx" "data"=<audio> "title"="Put Me On"
2de83... "album"="Nevermind" "artist"="Nirvana" "data"=<audio> "title"="Breed"
...
CQL Clustered Table
CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, // copied from songs.id title text, // copied from songs.title album text, // copied from songs.album artist text, // copied from songs.artist PRIMARY KEY (id, song_order) // compound key );
Row Key Columns: "<song_order>:<column name>"="<column value>"
28d23...
"1:"="" "1:album"="90125" "1:artist"="Yes" "1:song_id"="62c36..."
"1:title"="Changes" "2:"="" "2:album"="Nevermind" "2:artist"="Nirvana"
"2:song_id"="2de83..." "2:title"="Breed" "3:"="" ...
2ed91... "1:"="" "1:album"="Crystal Ball" "1:artist"="Styx" "1:song_id"="837a2..."
"1:title"="Put Me On" "2:"="" ...
...
Row Key Columns: "<song_order>:<column name>"="<column value>"
28d23...
"1:"="" "1:album"="90125" "1:artist"="Yes" "1:song_id"="62c36..."
"1:title"="Changes" "2:"="" "2:album"="Nevermind" "2:artist"="Nirvana"
"2:song_id"="2de83..." "2:title"="Breed" "3:"="" ...
2ed91... "1:"="" "1:album"="Crystal Ball" "1:artist"="Styx" "1:song_id"="837a2..."
"1:title"="Put Me On" "2:"="" ...
...
CQL Clustered Table (cont.)
CQL "Rows"
CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, // copied from songs.id title text, // copied from songs.title album text, // copied from songs.album artist text, // copied from songs.artist PRIMARY KEY (id, song_order) // compound key );
Can we make Cassandra more appealing?
• Data Model – No direct support for relationships
• Indexing – Secondary indexes: single column only – Hash table only: no range searching
• Searching – No joins, embedded queries – No aggregate queries – Limited equalities (e.g., SELECT * WHERE <key> IN (<list>)) – No full text search – No OR clauses – ...
What is Doradus?
• Java service that enhances Cassandra
• Adds features: – REST API (JSON and XML) – Multi-tenancy – Graph model – Multi-field/full text query language – Automatic data aging – OLAP and Spider storage services
• Compatible with NoSQL tenets such as idempotent updates
• Under development for ~3 years
• Open source: Apache 2.0 License
Doradus Graph Model
• A cluster hosts one of more applications • An application own tables which store objects • An object consists of single- and multi-valued fields • A pair of link fields form a bi-directional relationship
Message {Size, SendDate}
Participant {ReceiptDate}
Address {Name}
Person {Name, Department}
Attachment {Size, Extension} Managerè
çEmployees
êPerson
Address é
êAttachments
Messageé
Recipientsè
çMessageAsRecipient
Addressè
çParticipants
Senderè
çMessageAsSender
Example Object and Aggregate Queries
• Lucene full text query GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]
• Link path with filtering GET /Email/Message/_query?q=
Sender.WHERE(ReceiptDate>'2010-‐06-‐01').Address.Name="*.com"
• Quantifiers GET /Email/Message/_aggregate?m=COUNT(*)
&q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))
• Transitive links GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson
&f=DirectReports(Name,DirectReports(Name))
Doradus: Architecture
Application
Doradus
Cassandra
REST API
Thrift or CQL
Data and Log files
Doradus: Multi-Data Center Clusters
Cassandra
Doradus
Cassandra Cassandra
Doradus
Cassandra
Doradus
Cassandra Cassandra
Doradus
Node 1 Node 2 Node 3 Node 4 Node 5 Node 6
Rack 1, Data Center 1 Rack 1, Data Center 2
Applications Applications
DC=2, N=6, RF=3
Doradus: Internal Architecture
App App App Monitor
App
Spider Storage Service
OLAP Storage Service
Cassandra Cluster
JMX
REST: Embedded Jetty Server
Cassandra Interface do
rad
us.
yam
l
REST
Doradus OLAP Service
• Borrows from online analytical processing – Sharding as data "cubes" – Columnar storage
• Very dense storage – No indexes! – Value arrays are compressed
• Fast load time – Up to 500,000 objects/second/node – Small "data lag" time
• Very fast queries – Searches millions of objects/second – Full DQL object and aggregate query support
OLAP Data Loading
Events Events Events
Events Events People
Events Events Computers
Events Events Domains
Sources
OLAP Data Loading
T1 Events Events Events
Events Events People
Events Events Computers
Events Events Domains
T2
T3
T4
T4
Sources Segments
…
Changes in last n minutes
OLAP Data Loading
T1 Events Events Events
Events Events People
Events Events Computers
Events Events Domains
T2
T3
T4
T4
2013-03-01
2013-02-28
2013-02-27
Sources Segments Shards
… …
Changes in last n minutes
Date-based shards
OLAP Data Loading
T1 Events Events Events
Events Events People
Events Events Computers
Events Events Domains
T2
T3
T4
T4
2013-03-01
2013-02-28
2013-02-27
Sources Segments Shards OLAP Store
… …
Changes in last n minutes
Date-based shards
OLAP Use Case
• Data: Windows Events – 115M events
• Test parameters – Server: Quad Xeon CPUs, 32GB memory, 3 disks – Cassandra memory: 1GB – Load app/embedded Doradus memory: 4GB – Load threads: 5 – Batch size: 5,000 events – Shard size: 1 day (860 shards total)
• Test results – Total objects loaded: ~1 billion – Total time: 32 minutes, 56 seconds – Load rate: 502,991 objects/second – Final database size: ~2GB
Doradus Spider Service
• Analogous to Lucene + NoSQL
• Fully inverted field indexing – Configurable analyzers – Stored-only (non-indexed) fields
• Unique features: – Automatic table-level sharding – Statistics
– Pre-computed aggregate queries – Refreshed in background
– Object-level data aging
• Use case example: – Indexing a massive number of documents
OLAP and Spider: When to Use
• Spider is best for: – Unstructured/variable-
structure data – Configurable indexing – Fine-grained updates with
immediate indexing – Document storage and
searching – Emphasis on full-text/multi-
field searching
• OLAP is best for: – High-volume data streams – High performance analytic
queries – Dense data storage – Immutable/semi-mutable
data – Data that can be loaded in
batches – Data that can be partitioned
(e.g., time-sharded)
Summary
• What's cool about Doradus? – Bi-directional links with referential integrity – Link paths: simpler than joins – Idempotent updates – Partial object updates – Simple transitive searching – OLAP: dense storage and fast queries – It's free!
Thank you ! Doradus is available at: https://github.com/dell-oss/Doradus Contact me: [email protected]