Cassandra: A Decentralized Structured Storage System
By Sameera Nelson
Outline …
Introduction
Data Model
System Architecture
Failure Detection & Recovery
Local Persistence
Performance
Statistics
What is Cassandra?
Distributed Storage System
Manages Structured Data
Highly available, no single point of failure (SPoF)
Not a Relational Data Model
Handles high write throughput
◦ No impact on read efficiency
Motivation
Operational Requirements at Facebook
◦ Performance
◦ Reliability / Dealing with Failures
◦ Efficiency
◦ Continuous Growth
Application
◦ Inbox Search Problem, Facebook
Similar Work
Google File System
◦ Distributed FS, single master/slave
Ficus / Coda
◦ Distributed FS
Farsite
◦ Distributed FS, no centralized server
Bayou
◦ Distributed relational DB system
Dynamo
◦ Distributed storage system
Data Model
Data Model
Figure from Eben Hewitt’s slides.
Supported Operations
insert(table; key; rowMutation)
get(table; key; columnName)
delete(table; key; columnName)
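The three calls above can be modeled with a minimal in-memory sketch. This is illustrative only (the `Store` class and its dict layout are assumptions, not Cassandra internals); in the real system these calls are routed through a coordinator node to replicas.

```python
# Illustrative in-memory model of the slide's three-call API.
# A rowMutation is modeled as a dict of column name -> value.
class Store:
    def __init__(self):
        self.tables = {}  # table -> key -> row (dict of columns)

    def insert(self, table, key, row_mutation):
        """Merge the mutation's columns into the row for `key`."""
        row = self.tables.setdefault(table, {}).setdefault(key, {})
        row.update(row_mutation)

    def get(self, table, key, column_name):
        """Return one column of one row, or None if absent."""
        return self.tables.get(table, {}).get(key, {}).get(column_name)

    def delete(self, table, key, column_name):
        """Remove one column of one row (no-op if absent)."""
        self.tables.get(table, {}).get(key, {}).pop(column_name, None)
```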
Query Language
CREATE TABLE users (
  user_id int PRIMARY KEY,
  fname text,
  lname text
);

INSERT INTO users (user_id, fname, lname)
VALUES (1745, 'john', 'smith');

SELECT * FROM users;
Data Structure
Log-Structured Merge Tree
System Architecture
Architecture
Fully Distributed … No Single Point of Failure
Cassandra Architecture
Partitioning
◦ Data distribution across nodes
Replication
◦ Data duplication across nodes
Cluster Membership
◦ Node management in the cluster (adding/removing nodes)
Partitioning
The Token Ring
Partitioning
Partitions data using consistent hashing
Each key is assigned to the relevant partition on the token ring
Partitioning, Vnodes
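The partitioning scheme above can be sketched as a token ring with virtual nodes. The `Ring` class and node names are illustrative assumptions; Cassandra's default partitioner uses Murmur3, while MD5 is used here only to keep the sketch dependency-free.

```python
import hashlib
from bisect import bisect_right

def token(s):
    """Hash a string onto a 128-bit ring position (illustrative;
    real Cassandra defaults to a Murmur3-based partitioner)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node claims several positions (vnodes),
        # which spreads load more evenly than one token per node.
        self.ring = sorted((token(f"{n}#{v}"), n)
                           for n in nodes for v in range(vnodes))
        self.tokens = [t for t, _ in self.ring]

    def owner(self, key):
        """A key belongs to the first node clockwise from its token."""
        i = bisect_right(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]
```

With consistent hashing, adding or removing a node only moves the keys adjacent to that node's tokens, rather than rehashing everything.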
Replication
Based on the configured replication factor
Replication
Different Replication Policies
◦ Rack Unaware
◦ Rack Aware
◦ Datacenter Aware
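The simplest of these policies ("Rack Unaware") can be sketched as walking the ring clockwise from the key's coordinator and picking the next distinct nodes. The `replicas` helper and the ring representation are assumptions for illustration; rack- and datacenter-aware policies additionally skip nodes that share a rack or DC.

```python
from bisect import bisect_right

def replicas(ring, key_token, rf=3):
    """Rack-unaware placement sketch.
    ring: sorted list of (token, node) pairs; rf: replication factor.
    Returns the rf distinct nodes found walking clockwise from the key."""
    tokens = [t for t, _ in ring]
    distinct = len({n for _, n in ring})
    i = bisect_right(tokens, key_token) % len(ring)
    chosen = []
    while len(chosen) < min(rf, distinct):
        node = ring[i % len(ring)][1]
        if node not in chosen:      # skip vnodes of already-chosen nodes
            chosen.append(node)
        i += 1
    return chosen
```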
Cluster Membership
Based on Scuttlebutt
◦ An efficient gossip-based mechanism
◦ Inspired by real-life rumor spreading
Anti-Entropy protocol
◦ Repairs replicated data by comparing & reconciling differences
Cluster Membership
Gossip Based
Failure Detection & Recovery
Failure Detection
Tracks node state
◦ Directly, Indirectly
Accrual detection mechanism
Permanent node changes
◦ An admin must explicitly add or remove nodes
Hints
◦ Writes saved for a down replica, replayed when it recovers
◦ Saved in the system.hints table
Accrual Failure Detector
• If a node is faulty, its suspicion level Φ(t) increases monotonically: Φ(t) → ∞ as t → ∞
• The node is suspected once Φ(t) ≥ k, where k is a threshold variable
• If the node is correct, Φ(t) = 0
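A minimal phi-accrual sketch, assuming (as Cassandra's detector does) that heartbeat inter-arrival times follow an exponential distribution: Φ(t) = -log₁₀ P(a heartbeat arrives later than t) = t / (mean · ln 10). The class name and window size are illustrative.

```python
import math

class AccrualFailureDetector:
    """Sketch of a phi-accrual failure detector under an
    exponential model of heartbeat inter-arrival times."""
    def __init__(self, threshold=8.0):
        self.threshold = threshold   # the 'k' from the slide
        self.intervals = []          # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-1000:]   # bounded window
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10(exp(-t/mean)) = t / (mean * ln 10)."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        t = now - self.last_heartbeat
        return t / (mean * math.log(10))

    def is_faulty(self, now):
        return self.phi(now) >= self.threshold
```

Because Φ grows continuously instead of flipping a binary alive/dead bit, the threshold k lets operators trade detection speed against false positives.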
Local Persistence
Write Request
Write Operation
Write Operation
Logging data in the commit log and the memtable
Flushing data from the memtable
◦ Flushed to disk when a configured threshold is reached
Storing data on disk in immutable SSTables; deletes are marked with a tombstone
Compaction
◦ Removes deleted data, sorts and merges SSTable data, consolidates files
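The write path above can be sketched end to end: append to the commit log first, then update the memtable, and flush to an immutable sorted run once a threshold is crossed. The `WritePath` class, the list standing in for the log file, and the tiny threshold are all illustrative assumptions.

```python
import json

class WritePath:
    """Sketch of the slide's write path:
    commit log -> memtable -> flush to SSTable-like sorted runs."""
    def __init__(self, threshold=3):
        self.commit_log = []   # durable append-only log (a list stands in for a file)
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # immutable sorted runs on "disk", newest last
        self.threshold = threshold

    def write(self, key, value):
        self.commit_log.append(json.dumps({key: value}))  # 1. log first, for recovery
        self.memtable[key] = value                        # 2. then the memtable
        if len(self.memtable) >= self.threshold:          # 3. flush on threshold
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then SSTables newest-to-oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None
```

Every disk write is sequential (log append or flush of a whole sorted run), which is what gives the log-structured design its write throughput.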
Write Operation
Compaction
Read Request
Direct read / background read (read repair)
Read Operation
Delete Operation
Data is not removed immediately
Only a tombstone is written
Data is removed later, during compaction
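The tombstone-then-compact flow above can be sketched as a merge of sorted runs, newest first, where a key whose newest version is a tombstone is purged. This sketch is an assumption-level simplification: it ignores Cassandra's grace period (`gc_grace_seconds`) before tombstones themselves are dropped.

```python
TOMBSTONE = object()   # deletion marker written instead of removing data

def compact(runs):
    """Merge sorted (key, value) runs, ordered newest first, into one run.
    Keeps only the newest version of each key; keys whose newest
    version is a tombstone are dropped entirely."""
    merged = {}
    for run in runs:                       # newest run wins per key
        for key, value in run:
            merged.setdefault(key, value)  # older versions are ignored
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
```

Because SSTables are immutable, this rewrite-and-drop pass is the only point where deleted data physically leaves the disk.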
Additional Features
Data compression (Snappy)
Secondary index support
SSL support
◦ Client/ Node
◦ Node/ Node
Rolling commit logs
SSTable data file merging
Performance
Performance
High Throughput & Low Latency
◦ No in-place on-disk data modification (sequential writes only)
◦ Eliminates erase-block cycles
◦ No locking for concurrency control
◦ No disk I/O required to maintain row integrity
High Availability
Linear Scalability
Fault Tolerant
Statistics
Stats from Netflix
Linear scalability
Stats from Netflix
Some users
Thank you