TRANSCRIPT
Cassandra and Sigmod contest
Cloud computing group
Haiping Wang
2009-12-19
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
Cassandra overview
• Highly scalable, distributed
• Eventually consistent
• Structured key-value store
• Dynamo + Bigtable, P2P
• Random reads and random writes
• Java
Data Model
A row KEY maps to a set of column families (the slide's figure shows one key with three column families):
• ColumnFamily1 — Name: MailList, Type: Simple, Sort: Name. Columns tid1 … tid4, each a (Name, Value: <Binary>, TimeStamp) triple with timestamps t1 … t4.
• ColumnFamily2 — Name: WordList, Type: Super, Sort: Time. SuperColumn "aloha" holds columns (C1,V1,T1) … (C4,V4,T4); SuperColumn "dude" holds columns (C2,V2,T2) and (C6,V6,T6).
• ColumnFamily3 — Name: System, Type: Super, Sort: Name. SuperColumns hint1 … hint4, each holding a <Column List>.
Column Families are declared upfront; Columns are added and modified dynamically; SuperColumns are added and modified dynamically.
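The layout above can be sketched as nested maps in Python (an illustration of the model only, not Cassandra's actual storage format; the column names and values are made up):

```python
# row key -> column family -> column name -> (binary value, timestamp);
# a *super* column family inserts one extra level of super-column names.
row = {
    "MailList": {                      # simple CF, columns sorted by name
        "tid1": (b"...", 1),           # column = name -> (value, timestamp)
        "tid2": (b"...", 2),
    },
    "WordList": {                      # super CF: super column -> columns
        "aloha": {"c1": (b"v1", 1), "c2": (b"v2", 2)},
        "dude":  {"c2": (b"v2", 2), "c6": (b"v6", 6)},
    },
}

def get_column(row, cf, name):
    """Look up one column in a simple column family; None if absent."""
    return row[cf].get(name)
```

The one-extra-level nesting is what distinguishes a Super column family from a Simple one.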
Cassandra Architecture
Cassandra API
• Data structures
• Exceptions
• Service API
  – ConsistencyLevel (4)
  – Retrieval methods (5)
  – Range query: returns matching keys (1)
  – Modification methods (3)
• Others
Cassandra commands
Partitioning and replication(1)
• Consistent hashing
  – DHT
  – Balance
  – Monotonicity
  – Spread
  – Load
• Virtual nodes
• Coordinator
• Preference list
[Figure: consistent hashing ring (positions 0 … 1, with 1/2 opposite) carrying nodes A–F; h(key1) and h(key2) fall onto the ring, and each key is stored on the next N = 3 nodes clockwise]
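The ring in the figure can be sketched in a few lines of Python (a minimal illustration; MD5 is assumed as the hash, and virtual nodes are omitted):

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    """Position of a key (or node name) on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        # sorted (position, node) pairs form the ring
        self.ring = sorted((h(node), node) for node in nodes)

    def preference_list(self, key):
        """The N nodes clockwise from h(key) that replicate the key."""
        i = bisect_right(self.ring, (h(key),))
        return [self.ring[(i + j) % len(self.ring)][1] for j in range(self.n)]
```

With `Ring(["A", "B", "C", "D", "E", "F"])`, `preference_list("key1")` returns the three successive nodes responsible for key1, matching the N = 3 walk in the figure.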
Partitioning and replication(2)
Data Versioning
• Always writeable
• Multiple versions
  – put() returns before reaching all replicas
  – get() may return many versions
• Vector clocks
• Reconciliation during reads, by clients
Vector clock
• A list of (node, counter) pairs
  e.g. [x,2][y,3] vs. [x,3][y,4][z,1] — the second dominates the first;
  [x,1][y,3] vs. [z,1][y,3] — neither dominates the other, so they conflict
• Timestamps are kept per entry, e.g. D([x,1]:t1, [y,1]:t2)
• Remove the oldest entry when the clock reaches a size threshold
Vector clock
[Figure: version branching — concurrent writes at different replicas produce sibling versions; a read returns all the objects at the leaves, e.g. D3,4([Sx,2],[Sy,1],[Sz,1]), and the next write collapses them into a single new version]
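The dominance test and reconciliation described above can be sketched directly (clocks as plain dicts of node → counter; an illustration, not Cassandra/Dynamo code):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a dominates b: a has seen at least as many
    events from every node b has seen."""
    return all(a.get(node, 0) >= cnt for node, cnt in b.items())

def conflict(a: dict, b: dict) -> bool:
    """Siblings: neither clock dominates the other."""
    return not descends(a, b) and not descends(b, a)

def merge(a: dict, b: dict) -> dict:
    """Reconciled clock: element-wise maximum of the counters."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
```

On the slide's examples: [x,3][y,4][z,1] dominates [x,2][y,3], while [x,1][y,3] and [z,1][y,3] conflict and must be reconciled by the reader.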
Execution of operations
• Two strategies for routing a request:
  – Through a generic load balancer that selects a node based on load (easy: the client does not have to link any system-specific code)
  – Directly to the node coordinating the key (achieves lower latency)
Put() operation
[Figure: the client sends the object with its vector clock to the coordinator, which forwards it to replicas P1 … PN-1 and waits for W-1 responses]
Cluster Membership
• Gossip protocol
• State disseminated in O(log N) rounds
• Every T seconds, each node increases its heartbeat counter and sends its list to another node
• Merge operations combine the received list with the local one
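The merge step of one gossip exchange can be sketched as follows (heartbeat state as a dict of node → counter; a simplification that ignores generation numbers and anti-entropy details):

```python
def gossip_round(state_a: dict, state_b: dict) -> None:
    """One exchange between two nodes: each side keeps, per node,
    the highest heartbeat counter either side has seen."""
    for node, beat in state_a.items():
        if beat > state_b.get(node, -1):
            state_b[node] = beat
    for node, beat in state_b.items():
        if beat > state_a.get(node, -1):
            state_a[node] = beat
```

After enough random pairwise rounds, every counter reaches every node, which is why dissemination takes O(log N) rounds.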
Failure
• Data center(s) failure
  – Multiple data centers
• Temporary failure
• Permanent failure
  – Merkle tree
Temporary failure
Merkle tree
Bloom filter
A space-efficient probabilistic data structure used to test whether an element is a member of a set; it can return false positives, but never false negatives.
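A minimal Bloom filter sketch (illustrative only — the bit-array size, hash count, and MD5-based hashing are arbitrary choices, not Cassandra's):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)      # one byte per bit, for clarity

    def _positions(self, key: str):
        """k bit positions derived from k salted hashes of the key."""
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key: str):
        # All k bits set: "maybe present" (possible false positive).
        # Any bit clear: definitely absent (no false negatives).
        return all(self.bits[p] for p in self._positions(key))
```

Cassandra consults such a filter before touching a data file, so most lookups for absent keys skip the disk entirely.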
Compactions
[Figure: three sorted data files — (K1, K2, K3, …), (K2, K10, K30, …), (K4, K5, K10, …) — are MERGE-SORTed into one sorted file (K1, K2, K3, K4, K5, K10, K30, …), and the superseded files are marked DELETED. For the new file, an index file of (key, offset) entries (K1, K5, K30, …) and a Bloom filter are built and loaded in memory; the data file stays on disk]
Write
A write of Key (CF1, CF2, CF3) is binary-serialized and appended to the Commit Log (kept on a dedicated disk), then applied to the per-column-family Memtables: Memtable(CF1), Memtable(CF2), Memtable(CF3).
A Memtable is flushed to disk when one of these thresholds is reached:
• Data size
• Number of objects
• Lifetime
The data file on disk stores, per key: <Key name><Size of key data><Index of columns/supercolumns><Serialized column family>. A block index of (<Key name>, offset) entries (K128, K256, K384, …) and a Bloom filter form the index file, which is loaded in memory; the data file stays on disk.
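The write path above can be sketched end to end (a toy model: a single object-count threshold `flush_at` stands in for the size/count/lifetime triggers, and serialization is omitted):

```python
class WritePath:
    """Commit-log-first write path: append to the log for durability,
    update the in-memory Memtable, and flush it to a sorted data file
    once it holds flush_at objects."""
    def __init__(self, flush_at=4):
        self.flush_at = flush_at
        self.commit_log = []     # binary-serialized in the real system
        self.memtable = {}
        self.sstables = []       # each flush produces one sorted run

    def put(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
```

Because the flushed runs are sorted, they are exactly the inputs the compaction step later merge-sorts together.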
Read
[Figure: the client sends a query to the Cassandra cluster; the full query goes to the closest replica (Replica A), while digest queries go to Replica B and Replica C; the digest responses are compared with the result, which is returned to the client, and read repair is performed if the digests differ]
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
Sigmod contest 2009
• Task overview
• API
• Data structure
• Architecture
• Test
Task overview
• An index system for main-memory data
• Runs on a multi-core machine
• Many threads operating over multiple indices
• Serializes execution of user-specified transactions
• Basic functions: exact-match queries, range queries, updates, inserts, deletes
API
Record
[Figure: the HashTable structure — fields size, hashTab, average(64), deviation, nbEl, domain, warpMode(bool); hashTab is an array of buckets 0 … size-1, each bucket chaining Record entries of the form { dataType key; int64_t hashKey; char *payload; next }]
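The chained hash table in the figure can be sketched as follows (a Python illustration of the structure only — the contest implementation is in C, and the field names mirror the slide):

```python
class Entry:
    """One chain node, mirroring the slide's { key, payload, next }."""
    def __init__(self, key, payload, nxt=None):
        self.key, self.payload, self.next = key, payload, nxt

class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.buckets = [None] * size   # hashTab[0 .. size-1]
        self.nb_el = 0                 # nbEl on the slide

    def insert(self, key, payload):
        """Prepend a new entry to its bucket's chain."""
        i = hash(key) % self.size
        self.buckets[i] = Entry(key, payload, self.buckets[i])
        self.nb_el += 1

    def get(self, key):
        """Walk the chain for this bucket; None if the key is absent."""
        node = self.buckets[hash(key) % self.size]
        while node is not None:
            if node.key == key:
                return node.payload
            node = node.next
        return None
```

Collisions simply extend a bucket's chain, which keeps inserts O(1) at the cost of longer lookups when the table is overloaded.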
HashShared
[Figure: fields nbNameIndex and ni — a table of slots (0 … 999) holding int-typed data and NameIndex-typed data ({ idx, str'\0' }), each leading to an object of type IdxState]
TxnState
[Figure: fields state, indexActive, th, iNbR, indexToReset, nbIndex; indexActive and indexToReset are arrays of 200 slots (0 … 199)]
• Keeps track of an index
• Created by openIndex()
• Destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers to
  – a hashtable
  – a FixedAllocator
  – an Allocator
  – an array with the type of action
Architecture
• indexManager
• Allocator
• transactor
• DeadLockDetector
IndexManager
[Figure: fields hs, nbIndexTab, indexTab — an array whose slots indexTab[0], indexTab[1], … point to the individual indices]
DeadLockDetector
Transactor
• A HashOnlyGet object with type TxnState
[Figure: fields nbNameIndex, id, mutex, iThread, nbElement; a table of slots (0 … 999), each entry pt pointing to data of type T]
Allocator
• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized by payload length — the max length of a payload is 100
• Payloads of the same length are kept in the same list
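The pool scheme above can be sketched as size-keyed free lists (an illustration of the idea only; the contest allocator is C code managing raw memory, not Python objects):

```python
class PoolAllocator:
    """Free-list pools: freed buffers of the same length share a list
    and are handed back out before any new memory is allocated."""
    MAX_PAYLOAD = 100   # the slide's cap on payload length

    def __init__(self):
        self.free_lists = {}   # length -> list of recycled buffers

    def alloc(self, size: int) -> bytearray:
        assert size <= self.MAX_PAYLOAD
        pool = self.free_lists.setdefault(size, [])
        return pool.pop() if pool else bytearray(size)

    def free(self, buf: bytearray) -> None:
        self.free_lists.setdefault(len(buf), []).append(buf)
```

Recycling same-sized buffers avoids both fragmentation and per-allocation bookkeeping, which matters when payloads are small and churn is high.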
Unit Tests
• Three threads, running over three indices
• The primary thread
  – creates the primary index
  – inserts, deletes and accesses data in the primary index
• The second thread
  – simultaneously runs some basic tests over a separate index
• The third thread
  – continuously queries the primary index
  – ensures the transactional guarantees
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
Sigmod contest 2009
Sigmod contest 2010
Task overview
• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the right results
• Data is stored on disk; the indexes are all in memory
• The total time cost is measured
SQL query form
SELECT alias_name.field_name, ...
FROM table_name AS alias_name,…
WHERE condition1 AND ... AND conditionN
Condition forms:
alias_name.field_name = fixed value
alias_name.field_name > fixed value
alias_name.field_name1 = alias_name.field_name2
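The query form above can be illustrated with a tiny nested-loop evaluator (all names and structures here are hypothetical; the contest executor is distributed and index-driven, not a product scan):

```python
from itertools import product

def run_query(tables, select, conditions):
    """Evaluate SELECT/FROM/WHERE over in-memory tables.
    tables     - {alias: list of row dicts}         (FROM ... AS alias)
    select     - [(alias, field), ...]              (SELECT list)
    conditions - predicates over {alias: row}       (WHERE ... AND ...)"""
    aliases = list(tables)
    out = []
    for combo in product(*(tables[a] for a in aliases)):
        row = dict(zip(aliases, combo))             # one combined row
        if all(cond(row) for cond in conditions):   # conjunctive WHERE
            out.append(tuple(row[a][f] for a, f in select))
    return out
```

An equality condition between two aliases (the third form above) is what turns the cross product into a join.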
Initialization phase
Connection phase
Query phase
Closing phase
Tests
• An initial computation on synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Entrants who pass a collection of unit tests are provided with an Amazon Web Services account of 100 USD value
Benchmarks (stage 1)
• Assume a partition always covers the entire table; the data is not replicated
• Unit tests
• Benchmarks
  – On a single node, selects with an equal condition on the primary key
  – On a single node, selects with an equal condition on an indexed field
  – On a single node, 2 to 5 joins on tables of different sizes
  – On a single node, 1 join and a "greater than" condition on an indexed field
  – On three nodes, one join on two tables of different sizes, the two tables being on two different nodes
Benchmarks (stage 2)
• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, up to 50 simultaneous connections
• Benchmarks
  – Selects with an equal condition on the primary key, the values being uniformly distributed
  – Selects with an equal condition on the primary key, the values being non-uniformly distributed
  – Multiple joins on tables separated on different nodes
Important Dates
Thank you!!!