TRANSCRIPT
Cassandra and Sigmod contest
Cloud computing group
Haiping Wang
2009-12-19
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
Cassandra overview
• Highly scalable, distributed
• Eventually consistent
• Structured key-value store
• Dynamo + Bigtable, P2P
• Random reads and random writes
• Java
Data Model
A row KEY maps to a set of column families (the slide's figure shows one key with three column families):
• ColumnFamily1 — Name: MailList, Type: Simple, Sort: Name. Columns tid1 … tid4, each a (Name, Value: <Binary>, TimeStamp) triple with timestamps t1 … t4.
• ColumnFamily2 — Name: WordList, Type: Super, Sort: Time. SuperColumn "aloha" holds columns (C1,V1,T1) … (C4,V4,T4); SuperColumn "dude" holds columns (C2,V2,T2) and (C6,V6,T6).
• ColumnFamily3 — Name: System, Type: Super, Sort: Name. SuperColumns hint1 … hint4, each holding a <Column List>.
Column Families are declared upfront; Columns are added and modified dynamically; SuperColumns are added and modified dynamically.
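The layout above can be sketched as nested maps in Python (an illustration of the model only, not Cassandra's actual storage format; the column names and values are made up):

```python
# row key -> column family -> column name -> (binary value, timestamp);
# a *super* column family inserts one extra level of super-column names.
row = {
    "MailList": {                      # simple CF, columns sorted by name
        "tid1": (b"...", 1),           # column = name -> (value, timestamp)
        "tid2": (b"...", 2),
    },
    "WordList": {                      # super CF: super column -> columns
        "aloha": {"c1": (b"v1", 1), "c2": (b"v2", 2)},
        "dude":  {"c2": (b"v2", 2), "c6": (b"v6", 6)},
    },
}

def get_column(row, cf, name):
    """Look up one column in a simple column family; None if absent."""
    return row[cf].get(name)
```

The one-extra-level nesting is what distinguishes a Super column family from a Simple one.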
Cassandra Architecture
Cassandra API
• Data structures
• Exceptions
• Service API
  – ConsistencyLevel (4)
  – Retrieval methods (5)
  – Range query: returns matching keys (1)
  – Modification methods (3)
• Others
Cassandra commands
Partitioning and replication(1)
• Consistent hashing
  – DHT
  – Balance
  – Monotonicity
  – Spread
  – Load
• Virtual nodes
• Coordinator
• Preference list
[Figure: consistent hashing ring (positions 0 … 1, with 1/2 opposite) carrying nodes A–F; h(key1) and h(key2) fall onto the ring, and each key is stored on the next N = 3 nodes clockwise]
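The ring in the figure can be sketched in a few lines of Python (a minimal illustration; MD5 is assumed as the hash, and virtual nodes are omitted):

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    """Position of a key (or node name) on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        # sorted (position, node) pairs form the ring
        self.ring = sorted((h(node), node) for node in nodes)

    def preference_list(self, key):
        """The N nodes clockwise from h(key) that replicate the key."""
        i = bisect_right(self.ring, (h(key),))
        return [self.ring[(i + j) % len(self.ring)][1] for j in range(self.n)]
```

With `Ring(["A", "B", "C", "D", "E", "F"])`, `preference_list("key1")` returns the three successive nodes responsible for key1, matching the N = 3 walk in the figure.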
Partitioning and replication(2)
Data Versioning
• Always writeable
• Multiple versions
  – put() returns before reaching all replicas
  – get() may return many versions
• Vector clocks
• Reconciliation during reads, by clients
Vector clock
• A list of (node, counter) pairs
  e.g. [x,2][y,3] vs. [x,3][y,4][z,1] — the second dominates the first;
  [x,1][y,3] vs. [z,1][y,3] — neither dominates the other, so they conflict
• Timestamps are kept per entry, e.g. D([x,1]:t1, [y,1]:t2)
• Remove the oldest entry when the clock reaches a size threshold
Vector clock
[Figure: version branching — concurrent writes at different replicas produce sibling versions; a read returns all the objects at the leaves, e.g. D3,4([Sx,2],[Sy,1],[Sz,1]), and the next write collapses them into a single new version]
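The dominance test and reconciliation described above can be sketched directly (clocks as plain dicts of node → counter; an illustration, not Cassandra/Dynamo code):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a dominates b: a has seen at least as many
    events from every node b has seen."""
    return all(a.get(node, 0) >= cnt for node, cnt in b.items())

def conflict(a: dict, b: dict) -> bool:
    """Siblings: neither clock dominates the other."""
    return not descends(a, b) and not descends(b, a)

def merge(a: dict, b: dict) -> dict:
    """Reconciled clock: element-wise maximum of the counters."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
```

On the slide's examples: [x,3][y,4][z,1] dominates [x,2][y,3], while [x,1][y,3] and [z,1][y,3] conflict and must be reconciled by the reader.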
Execution of operations
• Two strategies for routing a request:
  – Through a generic load balancer that selects a node based on load (easy: the client does not have to link any system-specific code)
  – Directly to the node coordinating the key (achieves lower latency)
Put() operation
[Figure: the client sends the object with its vector clock to the coordinator, which forwards it to replicas P1 … PN-1 and waits for W-1 responses]
Cluster Membership
• Gossip protocol
• State disseminated in O(log N) rounds
• Every T seconds, each node increases its heartbeat counter and sends its list to another node
• Merge operations combine the received list with the local one
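The merge step of one gossip exchange can be sketched as follows (heartbeat state as a dict of node → counter; a simplification that ignores generation numbers and anti-entropy details):

```python
def gossip_round(state_a: dict, state_b: dict) -> None:
    """One exchange between two nodes: each side keeps, per node,
    the highest heartbeat counter either side has seen."""
    for node, beat in state_a.items():
        if beat > state_b.get(node, -1):
            state_b[node] = beat
    for node, beat in state_b.items():
        if beat > state_a.get(node, -1):
            state_a[node] = beat
```

After enough random pairwise rounds, every counter reaches every node, which is why dissemination takes O(log N) rounds.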
Failure
• Data center(s) failure
  – Multiple data centers
• Temporary failure
• Permanent failure
  – Merkle tree
Temporary failure
Merkle tree
Bloom filter
A space-efficient probabilistic data structure used to test whether an element is a member of a set; it can return false positives, but never false negatives.
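A minimal Bloom filter sketch (illustrative only — the bit-array size, hash count, and MD5-based hashing are arbitrary choices, not Cassandra's):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)      # one byte per bit, for clarity

    def _positions(self, key: str):
        """k bit positions derived from k salted hashes of the key."""
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key: str):
        # All k bits set: "maybe present" (possible false positive).
        # Any bit clear: definitely absent (no false negatives).
        return all(self.bits[p] for p in self._positions(key))
```

Cassandra consults such a filter before touching a data file, so most lookups for absent keys skip the disk entirely.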
Compactions
[Figure: three sorted data files — (K1, K2, K3, …), (K2, K10, K30, …), (K4, K5, K10, …) — are MERGE-SORTed into one sorted file (K1, K2, K3, K4, K5, K10, K30, …), and the superseded files are marked DELETED. For the new file, an index file of (key, offset) entries (K1, K5, K30, …) and a Bloom filter are built and loaded in memory; the data file stays on disk]
Write
A write of Key (CF1, CF2, CF3) is binary-serialized and appended to the Commit Log (kept on a dedicated disk), then applied to the per-column-family Memtables: Memtable(CF1), Memtable(CF2), Memtable(CF3).
A Memtable is flushed to disk when one of these thresholds is reached:
• Data size
• Number of objects
• Lifetime
The data file on disk stores, per key: <Key name><Size of key data><Index of columns/supercolumns><Serialized column family>. A block index of (<Key name>, offset) entries (K128, K256, K384, …) and a Bloom filter form the index file, which is loaded in memory; the data file stays on disk.
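The write path above can be sketched end to end (a toy model: a single object-count threshold `flush_at` stands in for the size/count/lifetime triggers, and serialization is omitted):

```python
class WritePath:
    """Commit-log-first write path: append to the log for durability,
    update the in-memory Memtable, and flush it to a sorted data file
    once it holds flush_at objects."""
    def __init__(self, flush_at=4):
        self.flush_at = flush_at
        self.commit_log = []     # binary-serialized in the real system
        self.memtable = {}
        self.sstables = []       # each flush produces one sorted run

    def put(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
```

Because the flushed runs are sorted, they are exactly the inputs the compaction step later merge-sorts together.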
Read
[Figure: the client sends a query to the Cassandra cluster; the full query goes to the closest replica (Replica A), while digest queries go to Replica B and Replica C; the digest responses are compared with the result, which is returned to the client, and read repair is performed if the digests differ]
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
Sigmod contest 2009
• Task overview
• API
• Data structure
• Architecture
• Test
Task overview
• An index system for main-memory data
• Runs on a multi-core machine
• Many threads operating over multiple indices
• Serializes execution of user-specified transactions
• Basic functions: exact-match queries, range queries, updates, inserts, deletes
API
Record
[Figure: the HashTable structure — fields size, hashTab, average(64), deviation, nbEl, domain, warpMode(bool); hashTab is an array of buckets 0 … size-1, each bucket chaining Record entries of the form { dataType key; int64_t hashKey; char *payload; next }]
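The chained hash table in the figure can be sketched as follows (a Python illustration of the structure only — the contest implementation is in C, and the field names mirror the slide):

```python
class Entry:
    """One chain node, mirroring the slide's { key, payload, next }."""
    def __init__(self, key, payload, nxt=None):
        self.key, self.payload, self.next = key, payload, nxt

class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.buckets = [None] * size   # hashTab[0 .. size-1]
        self.nb_el = 0                 # nbEl on the slide

    def insert(self, key, payload):
        """Prepend a new entry to its bucket's chain."""
        i = hash(key) % self.size
        self.buckets[i] = Entry(key, payload, self.buckets[i])
        self.nb_el += 1

    def get(self, key):
        """Walk the chain for this bucket; None if the key is absent."""
        node = self.buckets[hash(key) % self.size]
        while node is not None:
            if node.key == key:
                return node.payload
            node = node.next
        return None
```

Collisions simply extend a bucket's chain, which keeps inserts O(1) at the cost of longer lookups when the table is overloaded.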
HashShared
[Figure: fields nbNameIndex and ni — a table of slots (0 … 999) holding int-typed data and NameIndex-typed data ({ idx, str'\0' }), each leading to an object of type IdxState]
TxnState
[Figure: fields state, indexActive, th, iNbR, indexToReset, nbIndex; indexActive and indexToReset are arrays of 200 slots (0 … 199)]
• Keeps track of an index
• Created by openIndex()
• Destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers to
  – a hashtable
  – a FixedAllocator
  – an Allocator
  – an array with the type of action
Architecture
• indexManager
• Allocator
• transactor
• DeadLockDetector
IndexManager
[Figure: fields hs, nbIndexTab, indexTab — an array whose slots indexTab[0], indexTab[1], … point to the individual indices]
DeadLockDetector
Transactor
• A HashOnlyGet object with type TxnState
[Figure: fields nbNameIndex, id, mutex, iThread, nbElement; a table of slots (0 … 999), each entry pt pointing to data of type T]
Allocator
• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized by payload length — the max length of a payload is 100
• Payloads of the same length are kept in the same list
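The pool scheme above can be sketched as size-keyed free lists (an illustration of the idea only; the contest allocator is C code managing raw memory, not Python objects):

```python
class PoolAllocator:
    """Free-list pools: freed buffers of the same length share a list
    and are handed back out before any new memory is allocated."""
    MAX_PAYLOAD = 100   # the slide's cap on payload length

    def __init__(self):
        self.free_lists = {}   # length -> list of recycled buffers

    def alloc(self, size: int) -> bytearray:
        assert size <= self.MAX_PAYLOAD
        pool = self.free_lists.setdefault(size, [])
        return pool.pop() if pool else bytearray(size)

    def free(self, buf: bytearray) -> None:
        self.free_lists.setdefault(len(buf), []).append(buf)
```

Recycling same-sized buffers avoids both fragmentation and per-allocation bookkeeping, which matters when payloads are small and churn is high.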
Unit Tests
• Three threads, running over three indices
• The primary thread
  – creates the primary index
  – inserts, deletes and accesses data in the primary index
• The second thread
  – simultaneously runs some basic tests over a separate index
• The third thread
  – continuously queries the primary index
  – ensures the transactional guarantees
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
Task overview
• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the right results
• Data is stored on disk; the indexes are all in memory
• The total time cost is measured
SQL query form
SELECT alias_name.field_name, ...
FROM table_name AS alias_name,…
WHERE condition1 AND ... AND conditionN
Condition forms:
alias_name.field_name = fixed value
alias_name.field_name > fixed value
alias_name.field_name1 = alias_name.field_name2
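The query form above can be illustrated with a tiny nested-loop evaluator (all names and structures here are hypothetical; the contest executor is distributed and index-driven, not a product scan):

```python
from itertools import product

def run_query(tables, select, conditions):
    """Evaluate SELECT/FROM/WHERE over in-memory tables.
    tables     - {alias: list of row dicts}         (FROM ... AS alias)
    select     - [(alias, field), ...]              (SELECT list)
    conditions - predicates over {alias: row}       (WHERE ... AND ...)"""
    aliases = list(tables)
    out = []
    for combo in product(*(tables[a] for a in aliases)):
        row = dict(zip(aliases, combo))             # one combined row
        if all(cond(row) for cond in conditions):   # conjunctive WHERE
            out.append(tuple(row[a][f] for a, f in select))
    return out
```

An equality condition between two aliases (the third form above) is what turns the cross product into a join.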
Initialization phase
Connection phase
Query phase
Closing phase
Tests
• An initial computation on synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Entrants who pass a collection of unit tests are provided with an Amazon Web Services account of 100 USD value
Benchmarks (stage 1)
• Assume a partition always covers the entire table; the data is not replicated
• Unit tests
• Benchmarks
  – On a single node, selects with an equal condition on the primary key
  – On a single node, selects with an equal condition on an indexed field
  – On a single node, 2 to 5 joins on tables of different sizes
  – On a single node, 1 join and a "greater than" condition on an indexed field
  – On three nodes, one join on two tables of different sizes, the two tables being on two different nodes
Benchmarks (stage 2)
• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, up to 50 simultaneous connections
• Benchmarks
  – Selects with an equal condition on the primary key, the values being uniformly distributed
  – Selects with an equal condition on the primary key, the values being non-uniformly distributed
  – Multiple joins on tables separated on different nodes
Important Dates
Thank you!!!