Download - Overiew of Cassandra and Doradus

Overview of Cassandra and The Doradus OSS Project

Randy Guck

Principal Engineer, Dell Software

Overview

•  What is No SQL? – Common RDB roadblocks – NoSQL database types

•  Overview of Cassandra – What's unique – Limitations

•  Doradus – Architecture – Features – The OLAP and Spider storage managers – What each is good for – Where to get Doradus

Why RDB Apps Look for Something Else

•  Performance – B-trees – Locking – One writable copy of each record

•  Scaling costs – RDBs scale "up" – Big boxes, SANs, fiber channel, etc.

•  What if you want... – Distributed access – No single points of failure – Instant failover – Sharding – Replication

NoSQL Data Models

Data Model Examples Elastic? Queries? Relationships?

Key–Value LevelDB, Kyoto Cabinet, Redis

No No No

Distributed Key–Value

Dynamo, MemcacheDB, Riak, Voldemort

Yes No No

Column-Oriented Accumulo, Cassandra, HBase

Yes Some No

Document-Oriented

Couchbase, Elasticsearch, MongoDB

Yes Yes Some

Graph Neo4J, OrientDB, Titan No Yes Yes

Sharding + replication

AND/OR/ranges/etc.

Built-in support

NoSQL Data Models

Data Model Examples Elastic? Queries? Relationships?

Key–Value LevelDB, Kyoto Cabinet, Redis

No No No

Distributed Key–Value

Dynamo, MemcacheDB, Riak, Voldemort

Yes No No

Column-Oriented Accumulo, Cassandra, HBase

Yes Some No

Document-Oriented

Couchbase, Elasticsearch, MongoDB

Yes Yes Some

Graph Neo4J, OrientDB, Titan No Yes Yes

Sharding + replication

AND/OR/ranges/etc.

Built-in support

Doradus goals

NoSQL Common Traits

•  Distributed cluster of nodes – Commodity, shared-nothing servers – Scales horizontally – Expands elastically

•  Replication – Performant local access – Automatic failover

•  De-normalized data model

•  Schemaless/dynamic columns

•  Eventual consistency

N=5, RF=3

Is NoSQL Catching On?

Source: db-engines.com

Overview of Cassandra

•  Wide column NoSQL database

•  Open sourced by Facebook

•  Apache Project with active community

•  Commercially support by DataStax, Acunu, others

•  Used by 1,500+ companies

•  "Pure peer" architecture

•  Largest known Cassandra cluster: 300+ TB data and 400+ machines.

What is Cassandra best for?

•  Continuous data streams – Logs, events, audit records, measurements, ... – Fast data ingestion – Predictable read performance

•  Partitionable data – "1,000's of little databases in one"

•  Elastic scalability – Expand/upgrade/repair without downtime

•  Not good for: – Blob store – Persistent queue – OLTP transactions

CQL Static Table

CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text, data blob ); CREATE INDEX ON songs (artist);

Row Key Columns: "<column name>"="<column value>"

62c36... "album"="90125" "artist"="Yes" "data"=<audio> "title"="Changes"

837a2... "album"="Crystal Ball" "artist"="Styx" "data"=<audio> "title"="Put Me On"

2de83... "album"="Nevermind" "artist"="Nirvana" "data"=<audio> "title"="Breed"

...

CQL Clustered Table

CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, // copied from songs.id title text, // copied from songs.title album text, // copied from songs.album artist text, // copied from songs.artist PRIMARY KEY (id, song_order) // compound key );

Row Key Columns: "<song_order>:<column name>"="<column value>"

28d23...

"1:"="" "1:album"="90125" "1:artist"="Yes" "1:song_id"="62c36..."

"1:title"="Changes" "2:"="" "2:album"="Nevermind" "2:artist"="Nirvana"

"2:song_id"="2de83..." "2:title"="Breed" "3:"="" ...

2ed91... "1:"="" "1:album"="Crystal Ball" "1:artist"="Styx" "1:song_id"="837a2..."

"1:title"="Put Me On" "2:"="" ...

...

Row Key Columns: "<song_order>:<column name>"="<column value>"

28d23...

"1:"="" "1:album"="90125" "1:artist"="Yes" "1:song_id"="62c36..."

"1:title"="Changes" "2:"="" "2:album"="Nevermind" "2:artist"="Nirvana"

"2:song_id"="2de83..." "2:title"="Breed" "3:"="" ...

2ed91... "1:"="" "1:album"="Crystal Ball" "1:artist"="Styx" "1:song_id"="837a2..."

"1:title"="Put Me On" "2:"="" ...

...

CQL Clustered Table (cont.)

CQL "Rows"

CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, // copied from songs.id title text, // copied from songs.title album text, // copied from songs.album artist text, // copied from songs.artist PRIMARY KEY (id, song_order) // compound key );

Can we make Cassandra more appealing?

•  Data Model – No direct support for relationships

•  Indexing – Secondary indexes: single column only – Hash table only: no range searching

•  Searching – No joins, embedded queries – No aggregate queries – Limited equalities (e.g., SELECT * WHERE <key> IN (<list>)) – No full text search – No OR clauses – ...

What is Doradus?

•  Java service that enhances Cassandra

•  Adds features: – REST API (JSON and XML) – Multi-tenancy – Graph model – Multi-field/full text query language – Automatic data aging – OLAP and Spider storage services

•  Compatible with NoSQL tenets such as idempotent updates

•  Under development for ~3 years

•  Open source: Apache 2.0 License

Doradus Graph Model

•  A cluster hosts one of more applications •  An application own tables which store objects •  An object consists of single- and multi-valued fields •  A pair of link fields form a bi-directional relationship

Message {Size, SendDate}

Participant {ReceiptDate}

Address {Name}

Person {Name, Department}

Attachment {Size, Extension} Managerè

çEmployees

êPerson

Address é

êAttachments

Messageé

Recipientsè

çMessageAsRecipient

Addressè

çParticipants

Senderè

çMessageAsSender

Example Object and Aggregate Queries

•  Lucene full text query GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]

•  Link path with filtering GET /Email/Message/_query?q=

Sender.WHERE(ReceiptDate>'2010-‐06-‐01').Address.Name="*.com"

•  Quantifiers GET /Email/Message/_aggregate?m=COUNT(*)

&q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))

•  Transitive links GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson

&f=DirectReports(Name,DirectReports(Name))

Doradus: Architecture

Application

Doradus

Cassandra

REST API

Thrift or CQL

Data and Log files

Doradus: Multi-Data Center Clusters

Cassandra

Doradus

Cassandra Cassandra

Doradus

Cassandra

Doradus

Cassandra Cassandra

Doradus

Node 1 Node 2 Node 3 Node 4 Node 5 Node 6

Rack 1, Data Center 1 Rack 1, Data Center 2

Applications Applications

DC=2, N=6, RF=3

Doradus: Internal Architecture

App App App Monitor

App

Spider Storage Service

OLAP Storage Service

Cassandra Cluster

JMX

REST: Embedded Jetty Server

Cassandra Interface do

rad

us.

yam

l

REST

Doradus OLAP Service

•  Borrows from online analytical processing – Sharding as data "cubes" – Columnar storage

•  Very dense storage – No indexes! – Value arrays are compressed

•  Fast load time – Up to 500,000 objects/second/node – Small "data lag" time

•  Very fast queries – Searches millions of objects/second – Full DQL object and aggregate query support

OLAP Data Loading

Events Events Events

Events Events People

Events Events Computers

Events Events Domains

Sources

OLAP Data Loading

T1 Events Events Events




T2

T3

T4

T4

Sources Segments

…

Changes in last n minutes

OLAP Data Loading





T2

T3

T4

T4

2013-03-01

2013-02-28

2013-02-27

Sources Segments Shards

… …


Date-based shards

OLAP Data Loading





T2

T3

T4

T4

2013-03-01

2013-02-28

2013-02-27

Sources Segments Shards OLAP Store

… …


Date-based shards

OLAP Use Case

•  Data: Windows Events – 115M events

•  Test parameters – Server: Quad Xeon CPUs, 32GB memory, 3 disks – Cassandra memory: 1GB – Load app/embedded Doradus memory: 4GB – Load threads: 5 – Batch size: 5,000 events – Shard size: 1 day (860 shards total)

•  Test results – Total objects loaded: ~1 billion – Total time: 32 minutes, 56 seconds – Load rate: 502,991 objects/second – Final database size: ~2GB

Doradus Spider Service

•  Analogous to Lucene + NoSQL

•  Fully inverted field indexing – Configurable analyzers – Stored-only (non-indexed) fields

•  Unique features: – Automatic table-level sharding – Statistics

– Pre-computed aggregate queries – Refreshed in background

– Object-level data aging

•  Use case example: – Indexing a massive number of documents

OLAP and Spider: When to Use

•  Spider is best for: – Unstructured/variable-

structure data – Configurable indexing – Fine-grained updates with

immediate indexing – Document storage and

searching – Emphasis on full-text/multi-

field searching

•  OLAP is best for: – High-volume data streams – High performance analytic

queries – Dense data storage – Immutable/semi-mutable

data – Data that can be loaded in

batches – Data that can be partitioned

(e.g., time-sharded)

Summary

•  What's cool about Doradus? – Bi-directional links with referential integrity – Link paths: simpler than joins – Idempotent updates – Partial object updates – Simple transitive searching – OLAP: dense storage and fast queries – It's free!

Thank you ! Doradus is available at: https://github.com/dell-oss/Doradus Contact me: [email protected]

Download - Overiew of Cassandra and Doradus

Top Related