Cassandra – A Decentralized Structured Storage System Lecturer : Prof. Kyungbaek Kim Presenter : I Gde Dharma Nugraha




Outline

• Introduction
• History
• Data Model
• System Architecture
• Cassandra Configuration
• CQL = Cassandra Query Language
• Cassandra Driver
• Practical Example


Introduction

• Apache Cassandra™ is a massively scalable open source NoSQL database.
• Cassandra is well suited to managing large amounts of data across multiple data centers and the cloud.
• Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no Single Point of Failure (SPOF), along with a powerful data model designed for maximum flexibility and fast response times.


Introduction (Cont’d)

• Cassandra has a “masterless” architecture.
• Cassandra provides customizable replication, storing redundant copies of data across nodes that participate in a Cassandra ring.


History

• Cassandra was created to power the Facebook Inbox Search.
• Facebook open-sourced Cassandra in 2008, and it became an Apache Incubator project.
• In 2010, Cassandra graduated to a top-level project; regular updates and releases followed.


General Design Features

• Emphasis on performance over analysis
  • Still supports analysis tools such as Hadoop.
• Organization
  • Rows are organized into tables.
  • The first component of a table’s primary key is the partition key.
  • Rows are clustered by the remaining columns of the key.
  • Columns may be indexed separately from the primary key.
  • Tables may be created, dropped, and altered at runtime without blocking queries.
• Language
  • CQL (Cassandra Query Language) was introduced; it is similar to SQL, which flattens the learning curve.


Data Model

• A table is a multidimensional map indexed by a key (the row key).
• Columns are grouped into column families.
• There are two types of column families:
  • Simple
  • Super (nested column families)
• Each column has:
  • Name
  • Value
  • Timestamp
• A row is a collection of columns, each labeled with a name.


Data Model

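[Figure: a keyspace (with its settings) contains column families (each with its settings); a column family contains columns, each with a name, a value, and a timestamp.]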


Data Model

• Cassandra Row
  • The value of a row is itself a sequence of key-value pairs.
  • These nested key-value pairs are the columns.
  • Key = column name.
  • A row must contain at least one column.
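The row structure described above can be pictured as a nested map. The following is a minimal Python sketch, not Cassandra's storage code: the `Column` class, the `put` helper, and the sample row key `"alice"` are hypothetical names chosen for illustration, and keeping the value with the newest timestamp is an assumption about how the timestamp field is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # assumed to be a monotonically increasing write time


def put(column_family, row_key, column):
    """Store a column under a row key, keeping the newest timestamp (assumption)."""
    row = column_family.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


# keyspace -> column family -> row key -> column name -> Column
users = {}
put(users, "alice", Column("bio", "RoomMate", timestamp=1))
put(users, "alice", Column("bio", "Engineer", timestamp=2))
put(users, "alice", Column("bio", "stale write", timestamp=1))  # older, ignored
keyspace = {"users": users}
```

Looking up `keyspace["users"]["alice"]["bio"]` walks the same path as the slides: keyspace, column family, row key, column name.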


Data Model

• Example of Column


Data Model

• Keyspace
  • A keyspace groups column families together. It is only a logical grouping of column families and provides an isolated scope for names.


System Architecture

• The ring represents a cyclic range of token values (i.e., the token space).

• Each node is assigned a position on the ring based on its token.

• Nodes communicate with one another using the Gossip protocol.

• Data is first written to the commit log for durability.

• The data is then written to the memtable; once the memtable is full, its contents are flushed to an SSTable on disk.

[Figure: a ring with four nodes labeled A, B, C, and D.]


System Architecture

• Important keywords
  • Node
    • The place where data is stored. It is the basic infrastructure component of Cassandra.
  • Data Center
    • A collection of related nodes. A data center can be a physical data center or a virtual data center.
  • Cluster
    • A cluster contains one or more data centers. It can span physical locations.
  • Commit log
    • All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
  • Table
    • A collection of ordered columns fetched by row. A row consists of columns and has a primary key. The first part of the key is a column name.
  • SSTable
    • A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append-only, stored on disk sequentially, and maintained for each Cassandra table.


System Architecture

• The architecture involves:
  • Partitioning
    • How data is partitioned across nodes.
  • Replication
    • How data is duplicated across nodes.
  • Cluster Membership
    • How nodes are added to and removed from the cluster.


System Architecture

• Partitioning
  • Nodes are logically structured in a ring topology.
  • The hashed value of the key associated with a data partition is used to assign it to a node in the ring.
  • Hash values wrap around after a maximum value to support the ring structure.
  • Cassandra has three types of partitioner:
    • Murmur3Partitioner
    • RandomPartitioner
    • ByteOrderedPartitioner
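The ring assignment described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cassandra's partitioner code: the `Ring` class, the 32-bit `RING_SIZE`, and the node tokens are hypothetical, and MD5 is used only because it ships with the standard library.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # hypothetical token space; tokens wrap around at this value


def token_for(key):
    """Hash a partition key to a token on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


class Ring:
    def __init__(self, node_tokens):
        # node_tokens maps a node name to its position (token) on the ring.
        self._entries = sorted((tok, name) for name, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner_of_token(self, token):
        # First node whose token is >= the given token; past the last node,
        # the index wraps to the first node, which closes the ring.
        idx = bisect.bisect_left(self._tokens, token)
        return self._entries[idx % len(self._entries)][1]

    def owner(self, key):
        return self.owner_of_token(token_for(key))


ring = Ring({"A": 1000, "B": 2000, "C": 3000, "D": 4000})
```

With these sample tokens, a key hashing to 1500 lands on node B, and any token above 4000 wraps around to node A.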


System Architecture

• Replication
  • Each data item is replicated at N nodes, where N is the replication factor.
  • Different replication policies:
    • Rack Unaware: replicate data at the N-1 successive nodes after its coordinator.
    • Rack Aware: uses ZooKeeper to choose a leader, which tells nodes the range they are replicas for.
    • Datacenter Aware: similar to Rack Aware, but the leader is chosen at the datacenter level instead of the rack level.
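The Rack Unaware policy above amounts to walking the ring clockwise from the coordinator. A minimal sketch, assuming the ring order is already known as a list; the function name `rack_unaware_replicas` and the node labels are hypothetical.

```python
def rack_unaware_replicas(ring_order, coordinator, n):
    """Return the coordinator plus the N-1 nodes that follow it on the ring."""
    if not 1 <= n <= len(ring_order):
        raise ValueError("replication factor must be between 1 and the node count")
    start = ring_order.index(coordinator)
    # The modulo makes the walk wrap from the last node back to the first.
    return [ring_order[(start + i) % len(ring_order)] for i in range(n)]


replicas = rack_unaware_replicas(["A", "B", "C", "D"], coordinator="C", n=3)
```

Here the coordinator C keeps one copy and the two successive nodes, D and then A, hold the remaining replicas.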


System Architecture

• Gossip Protocol
  • A network communication protocol inspired by real-life rumour spreading.
  • Periodic, pairwise, inter-node communication.
  • Low-frequency communication ensures low cost.
  • Random selection of peers.
  • Example: Node A wishes to search for a pattern in the data.
    • Round 1: Node A searches locally and then gossips with node B.
    • Round 2: Nodes A and B gossip with C and D.
    • Round 3: Nodes A, B, C, and D gossip with 4 other nodes, and so on.
  • Round-by-round doubling makes the protocol very robust.
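The round-by-round spreading above can be simulated with a short Python sketch. It is a simplified model, not Cassandra's gossip implementation: every informed node contacts one randomly chosen peer per round, so the informed count can at most double each round and may grow more slowly when a peer has already been reached. The function name `gossip_rounds` and the seed parameter are hypothetical.

```python
import random


def gossip_rounds(node_count, seed=0):
    """Return the number of informed nodes after each round, starting with one."""
    rng = random.Random(seed)
    informed = {0}          # node 0 starts with the information
    history = [1]
    while len(informed) < node_count:
        contacted = set()
        for node in informed:
            # Pick a random peer other than the node itself.
            peer = rng.randrange(node_count - 1)
            if peer >= node:
                peer += 1
            contacted.add(peer)
        informed |= contacted
        history.append(len(informed))
    return history


history = gossip_rounds(64, seed=1)
```

Because each round can at most double the informed set, reaching 64 nodes takes at least 6 rounds in this model; random peer selection usually adds a few more.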


System Architecture

• Cluster Membership
  • Uses Scuttlebutt (a Gossip protocol) to manage nodes.
  • Uses gossip for node membership and to transmit system control state.
  • The failure state of a node is given by a variable ‘phi’, which expresses how likely it is that the node has failed (a suspicion level) instead of a simple binary value (up/down).
  • This type of system is known as an Accrual Failure Detector.


System Architecture

• Accrual Failure Detector
  • If a node is faulty, the suspicion level increases monotonically with time: Φ(t) → k as t → k, where k is a threshold (dependent on system load) at which the node is declared dead.
  • If a node is correct, Φ stays at a constant set by the application; generally Φ(t) = 0.
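One way to compute a suspicion level of this kind is sketched below. It assumes heartbeat inter-arrival times follow an exponential distribution, so that Φ = -log10(P(next heartbeat arrives later than the current silence)); that modeling choice, the class name `AccrualFailureDetector`, and the default threshold of 8 are assumptions for illustration rather than details taken from the slides. In this sketch Φ keeps growing while a node stays silent, and the node is treated as dead once Φ crosses the threshold k.

```python
import math


class AccrualFailureDetector:
    def __init__(self, threshold=8.0):
        self.threshold = threshold      # the threshold k from the slide (value assumed)
        self.intervals = []             # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: 0 right after a heartbeat, rising with silence."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(later) = exp(-elapsed / mean), so -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now):
        return self.phi(now) >= self.threshold


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
```

After heartbeats one second apart, a short silence keeps Φ near zero, while a silence of roughly twenty seconds pushes Φ past the threshold.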


System Architecture

• Local Persistence
  • Relies on the local file system for data persistence.
  • A write operation happens in 2 steps:
    • Write to the commit log on the local disk of the node.
    • Update the in-memory data structure.
  • Read operation
    • Looks up the in-memory data structure first before looking up files on disk.
    • Uses a Bloom filter (a summary of the keys in a file, kept in memory) to avoid looking up files that do not contain the key.
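The two-step write and the Bloom-filtered read above can be sketched as a toy in-memory model. Nothing here is Cassandra's actual code: the `Node` and `BloomFilter` classes, the memtable limit of two entries, and the use of Python lists and dicts in place of files are all assumptions made to keep the example self-contained.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, hash_count=3):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []    # stands in for the on-disk commit log
        self.memtable = {}      # in-memory data structure
        self.sstables = []      # (bloom filter, immutable sorted data), oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # step 1: durability
        self.memtable[key] = value             # step 2: in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:                   # memory first
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):   # newest file first
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None


node = Node()
node.write("a", 1)
node.write("b", 2)   # memtable reaches its limit and is flushed
node.write("c", 3)   # stays in the memtable
```

A Bloom filter can report a key as possibly present when it is not, but never the reverse, which is why a negative answer lets the read skip a file safely.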


System Architecture

• Write Path


System Architecture

• Read Path


System Architecture

• Example write and read process
  • Data Model


System Architecture

• Write Process


System Architecture

• Replication Process


System Architecture

• Read Process


Cassandra Configuration

• Key components for configuring Cassandra
  • Gossip
    • A peer-to-peer communication protocol to discover and share location and state information about the other nodes in a cluster. Gossip information is also persisted locally by each node so it can be used immediately when a node restarts.
  • Partitioner
    • A partitioner determines how to distribute the data across the nodes in the cluster and which node to place the first copy of data on.
  • Replication factor
    • The total number of replicas across the cluster.


Cassandra Configuration

• Key components for configuring Cassandra
  • Replica placement strategy
    • Cassandra stores copies (replicas) of data on multiple nodes to ensure reliability and fault tolerance.
  • Snitch
    • A snitch groups machines into data centers and racks (the topology) that the replication strategy uses to place replicas.
  • The cassandra.yaml configuration file
    • The main configuration file for setting the initialization properties for a cluster, caching parameters for tables, properties for tuning and resource utilization, timeout settings, client connections, backups, and security.


CQL = Cassandra Query Language

• The default and primary interface into the Cassandra DBMS.
• Provides SQL-like commands.
• CQL and SQL share the same abstract idea of a table constructed of columns and rows. The main difference from SQL is that CQL does not support joins or subqueries.
• Run cqlsh in a terminal window. The command is inside the bin directory.


CQL = Cassandra Query Language

• Creating and updating a keyspace
  • A Cassandra keyspace is a namespace that defines how data is replicated on nodes.
  • To create a keyspace:
    cqlsh> CREATE KEYSPACE demodb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
  • To update a keyspace:
    cqlsh> ALTER KEYSPACE demodb WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'replication_factor' : 2};
    • Cassandra releases before 4.0 expect per-data-center options ('<data center name>' : <replica count>) with NetworkTopologyStrategy instead of 'replication_factor'.
  • To use a keyspace:
    cqlsh> USE demodb;


CQL – Cassandra Query Language

• Creating Tables:

CREATE TABLE users (
    email varchar,
    bio varchar,
    birthday timestamp,
    active boolean,
    PRIMARY KEY (email)
);


CQL – Cassandra Query Language

• Inserting Data:
  • Timestamp fields are specified in milliseconds since the epoch.

INSERT INTO users (email, bio, birthday, active)
VALUES ('[email protected]', 'RoomMate', 516513600312, true);
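To make the timestamp convention concrete, the following Python sketch converts between a calendar date and milliseconds since the epoch. The helper names `to_epoch_millis` and `from_epoch_millis` are hypothetical, and the sketch assumes the epoch is 1970-01-01 00:00:00 UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Whole milliseconds between the epoch and a timezone-aware datetime."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis):
    """Timezone-aware UTC datetime for a millisecond timestamp."""
    return EPOCH + timedelta(milliseconds=millis)


# The birthday value from the INSERT statement above.
birthday = from_epoch_millis(516513600312)
```

Under that assumption, the value 516513600312 corresponds to 15 May 1986, 04:00:00.312 UTC.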


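CQL – Cassandra Query Language

• Querying Tables:
  • A SELECT statement reads one or more records from a Cassandra column family and returns a result set of rows.

SELECT * FROM users;

SELECT email FROM users WHERE active = true;

• Filtering on a non-key column such as active requires a secondary index on that column or the ALLOW FILTERING clause.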


Cassandra Driver

• To connect from a programming language, Cassandra provides driver packages.
• The programming languages supported by Cassandra drivers are:
  • C#
  • Java
  • Node.js
  • Python
• Cassandra driver download URL:
  • http://www.datastax.com/download
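As an illustration of what driver usage might look like, here is a minimal sketch with the DataStax Python driver. It assumes the `cassandra-driver` package is installed, a node is reachable at 127.0.0.1, and the `demodb` keyspace and `users` table from the CQL slides exist; the function names `add_user` and `list_users` are hypothetical. The driver is imported inside the helper so the module itself loads even where the driver is missing.

```python
from datetime import datetime, timezone

INSERT_USER = (
    "INSERT INTO users (email, bio, birthday, active) "
    "VALUES (%s, %s, %s, %s)"
)
SELECT_USERS = "SELECT email, bio FROM users"


def _connect(contact_points, keyspace):
    # Requires the third-party cassandra-driver package (assumption).
    from cassandra.cluster import Cluster

    cluster = Cluster(list(contact_points))
    return cluster, cluster.connect(keyspace)


def add_user(email, bio, birthday, active,
             contact_points=("127.0.0.1",), keyspace="demodb"):
    """Insert one row; `birthday` should be a datetime."""
    cluster, session = _connect(contact_points, keyspace)
    try:
        session.execute(INSERT_USER, (email, bio, birthday, active))
    finally:
        cluster.shutdown()


def list_users(contact_points=("127.0.0.1",), keyspace="demodb"):
    """Return (email, bio) pairs for every row in the users table."""
    cluster, session = _connect(contact_points, keyspace)
    try:
        return [(row.email, row.bio) for row in session.execute(SELECT_USERS)]
    finally:
        cluster.shutdown()
```

With a running node, calling `add_user("someone@example.org", "RoomMate", datetime(1986, 5, 15, tzinfo=timezone.utc), True)` followed by `list_users()` would exercise both statements; the email address here is a placeholder.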


Reference

• Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
• Hewitt, Eben. Cassandra: The Definitive Guide. O'Reilly Media, 2010.
• http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html
• http://www.datastax.com/documentation/cql/3.1/cql/cql_intro_c.html
• http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/whatsNew2.html
• http://planetcassandra.org/getting-started-with-apache-cassandra-and-java/


• Installation guide and practical example.