
Cassandra and SIGMOD contest

Cloud computing group

Haiping Wang

2009-12-19

Outline

Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write

SIGMOD contest 2009
SIGMOD contest 2010

Cassandra overview

• Highly scalable, distributed
• Eventually consistent
• Structured key-value store
• Dynamo + Bigtable, P2P
• Random reads and random writes
• Written in Java

Data Model

A row, addressed by KEY, contains the following column families:

• ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name)
  – Columns (Name / Value / TimeStamp): tid1 / <Binary> / t1, tid2 / <Binary> / t2, tid3 / <Binary> / t3, tid4 / <Binary> / t4
• ColumnFamily2 (Name: WordList, Type: Super, Sort: Time)
  – SuperColumn "aloha" with columns (C1, V1, T1), (C2, V2, T2), (C3, V3, T3), (C4, V4, T4)
  – SuperColumn "dude" with columns (C2, V2, T2), (C6, V6, T6)
• ColumnFamily3 (Name: System, Type: Super, Sort: Name)
  – SuperColumns hint1, hint2, hint3, hint4, each holding a <Column List>

• Column Families are declared upfront
• Columns are added and modified dynamically
• SuperColumns are added and modified dynamically
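The nested structure above can be pictured as a map of maps. Below is a minimal Java sketch of that logical layout (class and field names are illustrative, not Cassandra's actual types): a simple column family maps column names to (value, timestamp) pairs, and a super column family adds one more level of nesting.

import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative types only, not Cassandra's actual classes.
public class DataModelSketch {

    // A column is a (value, timestamp) pair addressed by its name.
    record Column(byte[] value, long timestamp) {}

    // Simple column family (e.g. MailList, sorted by column name):
    // column name -> column
    static final SortedMap<String, Column> mailList = new TreeMap<>();

    // Super column family (e.g. WordList): super column name -> columns
    static final SortedMap<String, SortedMap<String, Column>> wordList = new TreeMap<>();

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Only the column family is declared upfront; columns and
        // super columns are added dynamically.
        mailList.put("tid1", new Column("<Binary>".getBytes(), now));
        wordList.computeIfAbsent("aloha", k -> new TreeMap<>())
                .put("C1", new Column("V1".getBytes(), now));
    }
}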

Cassandra Architecture

Cassandra API

• Data structures
• Exceptions
• Service API
• ConsistencyLevel (4)
• Retrieval methods (5)
• Range query: returns matching keys (1)
• Modification methods (3)

• Others
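To make the shape of the service API listed above concrete, here is a hedged Java sketch with four consistency levels, five retrieval methods, one range query and three modification methods. The method names and signatures are simplified approximations for illustration only and do not match the exact Thrift API of any Cassandra release.

import java.util.List;
import java.util.Map;

interface CassandraServiceSketch {

    // Four consistency levels, as counted on the slide.
    enum ConsistencyLevel { ZERO, ONE, QUORUM, ALL }

    // Retrieval methods (5)
    byte[] get(String table, String key, String columnPath, ConsistencyLevel cl);
    List<byte[]> getSlice(String table, String key, String columnParent, ConsistencyLevel cl);
    Map<String, byte[]> multiget(String table, List<String> keys, String columnPath, ConsistencyLevel cl);
    Map<String, List<byte[]>> multigetSlice(String table, List<String> keys, String columnParent, ConsistencyLevel cl);
    int getCount(String table, String key, String columnParent, ConsistencyLevel cl);

    // Range query: returns matching keys (1)
    List<String> getKeyRange(String table, String columnFamily, String startKey, String endKey, int limit);

    // Modification methods (3)
    void insert(String table, String key, String columnPath, byte[] value, long timestamp, ConsistencyLevel cl);
    void batchInsert(String table, String key, Map<String, List<byte[]>> cfToColumns, ConsistencyLevel cl);
    void remove(String table, String key, String columnPath, long timestamp, ConsistencyLevel cl);
}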

Cassandra commands

Partitioning and replication (1)

• Consistent hashing (DHT)
  – Balance
  – Monotonicity
  – Spread
  – Load
• Virtual nodes
• Coordinator
• Preference list

[Diagram: consistent hashing ring over the hash space (positions 0, 1/2, 1). Nodes A, B, C, D, E, F sit on the ring; a key is placed at h(key1) or h(key2) and, with N = 3, is stored on the next N distinct nodes walking clockwise from that position.]
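A minimal sketch of the ring above, assuming an MD5-based hash and the node names A through F from the diagram: nodes are placed on a ring kept in a TreeMap, and a key's preference list is the next N distinct nodes found walking clockwise from h(key).

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node) { ring.put(hash(node), node); }

    // Preference list: the first N distinct nodes encountered clockwise from h(key).
    List<String> preferenceList(String key, int n) {
        List<String> prefs = new ArrayList<>();
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        for (String node : tail.values()) { if (prefs.size() < n) prefs.add(node); }
        for (String node : ring.values()) { if (prefs.size() < n && !prefs.contains(node)) prefs.add(node); }
        return prefs;
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        for (String node : new String[] {"A", "B", "C", "D", "E", "F"}) r.addNode(node);
        System.out.println(r.preferenceList("key1", 3)); // N = 3 replicas for key1
    }
}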

Partitioning and replication (2)

Data Versioning

• Always writeable
• Multiple versions
  – put() may return before all replicas have been updated
  – get() may return many versions

• Vector clocks
• Reconciliation during reads by clients

Vector clock

• List of (node, counter) pairs
  – E.g. [x,2][y,3] vs. [x,3][y,4][z,1]: every counter in the first clock is less than or equal to its counterpart in the second, so the first version is an ancestor and can be discarded
  – E.g. [x,1][y,3] vs. [z,1][y,3]: neither clock dominates the other, so the versions are concurrent and both must be kept for reconciliation
• A timestamp is stored with each entry, e.g. D([x,1]:t1, [y,1]:t2)
• The oldest entry is removed when the clock reaches a size threshold
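A small sketch of such a vector clock, with the dominance test used to discard ancestors and the pairwise-maximum merge used when reconciled versions are written back (class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class VectorClock {
    // (node, counter) pairs
    private final Map<String, Integer> counters = new HashMap<>();

    void increment(String node) { counters.merge(node, 1, Integer::sum); }

    // true if every counter in this clock is <= its counterpart in other,
    // i.e. this version is an ancestor of other and can be discarded.
    boolean isAncestorOf(VectorClock other) {
        for (Map.Entry<String, Integer> e : counters.entrySet()) {
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0)) return false;
        }
        return true;
    }

    // Neither clock dominates the other: the versions are concurrent and
    // both are returned for reconciliation.
    boolean isConcurrentWith(VectorClock other) {
        return !this.isAncestorOf(other) && !other.isAncestorOf(this);
    }

    // Pairwise maximum, e.g. used after client-side reconciliation.
    VectorClock merge(VectorClock other) {
        VectorClock m = new VectorClock();
        m.counters.putAll(this.counters);
        other.counters.forEach((node, c) -> m.counters.merge(node, c, Integer::max));
        return m;
    }
}

With this sketch, the clock [x,2][y,3] is an ancestor of [x,3][y,4][z,1], while [x,1][y,3] and [z,1][y,3] are concurrent.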

Vector clock

[Diagram: Dynamo-style version history. All objects at the leaves are returned to the reader, e.g. the concurrent versions D3 and D4 with combined clock ([Sx,2],[Sy,1],[Sz,1]); the client reconciles them into a single new version.]

Execution of operations

• Two strategies for routing a request
  – Through a generic load balancer that selects a node based on load information
    • Easy: the client does not have to link any node-specific code
  – Directly to the node that owns the key
    • Achieves lower latency

Put() operation

[Diagram: put() path. The client sends the request to the coordinator, which forwards the object together with its vector clock to the replicas P1 … PN-1 and acknowledges the write after receiving W-1 responses.]

Cluster Membership

• Gossip protocol
• State disseminated in O(log N) rounds
• Every T seconds, each node increases its heartbeat counter and sends its membership list to another node
• The receiver merges the incoming list with its own
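A rough sketch of the heartbeat bookkeeping behind those bullets, assuming the state gossiped is just a per-node heartbeat counter (the real membership state carries more than this):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GossipSketch {
    final String self;
    final Map<String, Long> heartbeats = new ConcurrentHashMap<>();

    GossipSketch(String self) {
        this.self = self;
        heartbeats.put(self, 0L);
    }

    // Called every T seconds before gossiping to a randomly chosen peer.
    Map<String, Long> tick() {
        heartbeats.merge(self, 1L, Long::sum);
        return Map.copyOf(heartbeats); // the list sent to the peer
    }

    // Called on the receiving node: merge the incoming list, keeping the
    // freshest (highest) heartbeat counter seen for each node.
    void onGossip(Map<String, Long> remote) {
        remote.forEach((node, hb) -> heartbeats.merge(node, hb, Long::max));
    }
}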

Failure

• Data center(s) failure
  – Multiple data centers
• Temporary failure
• Permanent failure
  – Merkle tree

Temporary failure

Merkle tree

Bloom filter

A space-efficient probabilistic data structure used to test whether an element is a member of a set; it can report false positives but never false negatives.
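A minimal Bloom filter sketch along those lines (the bit-array size, number of hash functions and hash mixing are illustrative choices):

import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilterSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(index(key, i));
    }

    // false -> the key is definitely absent (no false negatives)
    // true  -> the key is probably present (false positives possible)
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }

    private int index(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9; // simple double-hashing style mix
        return Math.floorMod(h, size);
    }
}

In the read path shown later, the in-memory Bloom filter lets a lookup skip data files that definitely do not contain the key.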

Compactions

[Diagram: several sorted data files, e.g. (K1, K2, K3, …), (K2, K10, K30, …), (K4, K5, K10, …), each holding <key, serialized data> entries, are combined with a MERGE SORT into one sorted file (K1, K2, K3, K4, K5, K10, K30, …). The output consists of a data file, an index file of <key, offset> entries (K1 Offset, K5 Offset, K30 Offset) and a Bloom filter loaded in memory; the old input files are then DELETED.]
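A compaction of this kind is essentially a k-way merge over already-sorted inputs. The sketch below uses in-memory sorted maps to stand in for the on-disk data files and keeps the newest value for duplicate keys; a real compaction would also rebuild the block index and Bloom filter for the merged file.

import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class CompactionSketch {

    record Cursor(Iterator<Map.Entry<String, String>> it,
                  Map.Entry<String, String> current, int fileIdx) {}

    // Inputs are sorted by key; a larger fileIdx means a newer file.
    static TreeMap<String, String> compact(List<TreeMap<String, String>> inputs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int c = a.current().getKey().compareTo(b.current().getKey());
            return c != 0 ? c : Integer.compare(b.fileIdx(), a.fileIdx()); // newer file first
        });
        for (int i = 0; i < inputs.size(); i++) {
            Iterator<Map.Entry<String, String>> it = inputs.get(i).entrySet().iterator();
            if (it.hasNext()) heap.add(new Cursor(it, it.next(), i));
        }
        TreeMap<String, String> merged = new TreeMap<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.putIfAbsent(c.current().getKey(), c.current().getValue()); // newest version wins
            if (c.it().hasNext()) heap.add(new Cursor(c.it(), c.it().next(), c.fileIdx()));
        }
        return merged;
    }
}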

Write

[Diagram: write path. A write for a key touching column families CF1, CF2, CF3 is binary-serialized and appended to the commit log, which sits on a dedicated disk, and is then applied to the per-column-family memtables (CF1, CF2, CF3). A memtable is flushed when it crosses a threshold on:]

• Data size
• Number of objects
• Lifetime

[A flush writes a data file on disk whose entries have the form <key name><size of key data><index of columns/supercolumns><serialized column family>, together with a block index of <key name, offset> pairs (K128 Offset, K256 Offset, K384 Offset, …) and a Bloom filter; the block index and Bloom filter are kept in memory.]
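A write-path sketch following that flow, with the commit-log record format, the flush threshold and the file layout reduced to placeholders (the flush step itself is only indicated):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class WritePathSketch {
    private final DataOutputStream commitLog;
    private final Map<String, TreeMap<String, byte[]>> memtables = new ConcurrentHashMap<>();
    private static final int FLUSH_THRESHOLD = 100_000; // e.g. flush by number of objects

    WritePathSketch(String commitLogPath) throws IOException {
        this.commitLog = new DataOutputStream(new FileOutputStream(commitLogPath, true));
    }

    synchronized void write(String key, String columnFamily, String column, byte[] value) throws IOException {
        // 1. Durability: binary-serialize and append to the commit log first.
        commitLog.writeUTF(key + "/" + columnFamily + "/" + column);
        commitLog.writeInt(value.length);
        commitLog.write(value);
        commitLog.flush();

        // 2. Apply to the memtable of this column family (kept sorted).
        TreeMap<String, byte[]> memtable =
                memtables.computeIfAbsent(columnFamily, cf -> new TreeMap<>());
        memtable.put(key + ":" + column, value);

        // 3. Flush check (by data size, number of objects, or lifetime).
        if (memtable.size() > FLUSH_THRESHOLD) {
            // flush to a sorted data file with block index and Bloom filter (not shown)
        }
    }
}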

Read

[Diagram: read path. The client sends a query to the Cassandra cluster; the query is forwarded to the closest replica (Replica A) for the full result, and digest queries are sent to the other replicas (Replica B, Replica C). The result is returned to the client, and a read repair is performed if the digests differ.]
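A sketch of that read interaction, where Replica is a hypothetical interface standing in for the remote replicas and an MD5 digest stands in for whatever digest the real system uses:

import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class ReadPathSketch {

    interface Replica {
        byte[] read(String key);                       // full data query
        default byte[] digest(String key) {            // digest query
            return ReadPathSketch.digest(read(key));
        }
        void repair(String key, byte[] freshValue);    // read repair write
    }

    static byte[] readWithRepair(String key, Replica closest, List<Replica> others) {
        byte[] value = closest.read(key);
        byte[] expected = digest(value);
        for (Replica r : others) {
            if (!Arrays.equals(expected, r.digest(key))) {
                r.repair(key, value); // read repair if digests differ
            }
        }
        return value; // result returned to the client
    }

    static byte[] digest(byte[] value) {
        try {
            return MessageDigest.getInstance("MD5").digest(value == null ? new byte[0] : value);
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}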

Outline

Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write

SIGMOD contest 2009
SIGMOD contest 2010

SIGMOD contest 2009

• Task overview
• API
• Data structure
• Architecture
• Test

Task overview

• Index system for main-memory data
• Runs on a multi-core machine
• Many threads operating over multiple indices
• Serializes execution of user-specified transactions
• Basic functions: exact-match queries, range queries, updates, inserts, deletes

API

[Diagram: the Record and HashTable structures. A HashTable holds hsize, size, hashTab, average(64), deviation, nbEl, domain and warpMode(bool); hashTab is an array of buckets 0 … size-1, each bucket pointing to a chain of key entries linked by a *next pointer. Each Record stores: dataType key; int64_t hashKey; char *payload; plus the *next link.]
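The contest code itself is C; below is a Java rendering of the same chained hash table for illustration, with field names following the diagram and a placeholder hash mix (the real hash function is not specified here).

public class HashTableSketch {

    static class Record {
        long key;         // dataType key in the original
        long hashKey;     // int64_t hashKey
        byte[] payload;   // char *payload
        Record next;      // *next link in the chain
    }

    final Record[] hashTab;   // buckets 0 .. size-1
    final int size;
    long nbEl;                // number of elements currently stored

    HashTableSketch(int size) {
        this.size = size;
        this.hashTab = new Record[size];
    }

    void insert(long key, byte[] payload) {
        long h = hash(key);
        int bucket = (int) Math.floorMod(h, (long) size);
        Record r = new Record();
        r.key = key;
        r.hashKey = h;
        r.payload = payload;
        r.next = hashTab[bucket];   // push onto the head of the chain
        hashTab[bucket] = r;
        nbEl++;
    }

    Record lookup(long key) {
        long h = hash(key);
        for (Record r = hashTab[(int) Math.floorMod(h, (long) size)]; r != null; r = r.next) {
            if (r.hashKey == h && r.key == key) return r;
        }
        return null;
    }

    private static long hash(long key) {
        key ^= key >>> 33;          // simple mix; placeholder for the real hash
        key *= 0xff51afd7ed558ccdL;
        key ^= key >>> 33;
        return key;
    }
}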

[Diagram: HashShared holds nbNameIndex and ni, a table of integer slot ids with columns 0 … 999 and rows 0 … 199 (slot values 0, 100, 200, …). Each slot holds int-type data and NameIndex-type data: an idx and a NUL-terminated string (str\0) referring to an object of type IdxState. A TxnState holds state, indexActive, th, iNbR, indexToReset and nbIndex, together with two arrays indexed 0 … 199.]

• Keeps track of an index
• Created by openIndex()
• Destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers to
  – a hashtable
  – a FixedAllocator
  – an Allocator
  – an array with the type of action

Architecture

• IndexManager
• Allocator
• Transactor
• DeadLockDetector

IndexManager

[Diagram: an IndexManager holds hs, nbIndexTab and indexTab, an array of index slots indexTab[0], indexTab[1], indexTab[2], indexTab[3], …, indexTab[i], …]

DeadLockDetector

Transactor

• A HashOnlyGet object with type TxnState

[Diagram: the HashOnlyGet structure holds nbNameIndex, id, mutex, iThread and nbElement, with the same table of slot ids as HashShared (columns 0 … 999, rows 0 … 199); each slot holds a pointer pt to data of type T.]

Allocator

• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized by payload length; the maximum payload length is 100
• Payloads of the same length are kept in the same list
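A sketch of that pooling scheme, assuming one free list per payload length up to the 100-byte maximum (in the C original these would be raw buffers on linked lists; here byte arrays stand in):

import java.util.ArrayDeque;
import java.util.Deque;

public class PayloadAllocatorSketch {
    private static final int MAX_PAYLOAD = 100;

    @SuppressWarnings("unchecked")
    private final Deque<byte[]>[] pools = new Deque[MAX_PAYLOAD + 1];

    PayloadAllocatorSketch() {
        for (int len = 0; len <= MAX_PAYLOAD; len++) pools[len] = new ArrayDeque<>();
    }

    byte[] allocate(int length) {
        byte[] buf = pools[length].pollFirst();     // reuse from the pool if possible
        return buf != null ? buf : new byte[length];
    }

    void free(byte[] payload) {
        pools[payload.length].addFirst(payload);    // return to the list for this length
    }
}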

Unit Tests
• Three threads, running over three indices
• The primary thread
  – creates the primary index
  – inserts, deletes and accesses data in the primary index
• The second thread
  – simultaneously runs some basic tests over a separate index
• The third thread
  – ensures the transactional guarantees
  – continuously queries the primary index

Outline

Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write

SIGMOD contest 2009
SIGMOD contest 2010

Task overview

• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the right results
• Data is stored on disk; the indexes are all in memory
• The total time cost is measured

SQL query form

SELECT alias_name.field_name, ...

FROM table_name AS alias_name,…

WHERE condition1 AND ... AND conditionN

Condition forms:
• alias_name.field_name = fixed value
• alias_name.field_name > fixed value
• alias_name.field_name1 = alias_name.field_name2

Initialization phase

Connection phase

Query phase

Closing phase

Tests

• An initial computation
• On synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Submissions that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD

Benchmarks (stage 1)

• Assumes a partition always covers the entire table; the data is not replicated
• Unit tests
• Benchmarks
  – On a single node, selects with an equal condition on the primary key
  – On a single node, selects with an equal condition on an indexed field
  – On a single node, 2 to 5 joins on tables of different sizes
  – On a single node, 1 join and a "greater than" condition on an indexed field
  – On three nodes, one join on two tables of different sizes, the two tables being on two different nodes

Benchmarks (stage 2)

• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, up to 50 simultaneous connections
• Benchmarks
  – Selects with an equal condition on the primary key, the values being uniformly distributed
  – Selects with an equal condition on the primary key, the values being non-uniformly distributed
  – Multiple joins on tables spread across different nodes

Important Dates

Thank you!!!
