
Page 1: Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19

Cassandra and Sigmod contest

Cloud computing group

Haiping Wang

2009-12-19

Page 2

Outline

Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write

Sigmod contest 2009

Sigmod contest 2010

Page 3

Cassandra overview

• Highly scalable, distributed
• Eventually consistent
• Structured key-value store
• Dynamo + Bigtable
• P2P
• Random reads and random writes
• Implemented in Java

Page 4

Data Model

KEY

ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name)
  Name: tid1, Value: <Binary>, TimeStamp: t1
  Name: tid2, Value: <Binary>, TimeStamp: t2
  Name: tid3, Value: <Binary>, TimeStamp: t3
  Name: tid4, Value: <Binary>, TimeStamp: t4

ColumnFamily2 (Name: WordList, Type: Super, Sort: Time)
  Name: aloha, columns: (C1, V1, T1), (C2, V2, T2), (C3, V3, T3), (C4, V4, T4)
  Name: dude, columns: (C2, V2, T2), (C6, V6, T6)

ColumnFamily3 (Name: System, Type: Super, Sort: Name)
  Name: hint1, <Column List>
  Name: hint2, <Column List>
  Name: hint3, <Column List>
  Name: hint4, <Column List>

• Column families are declared upfront
• SuperColumns are added and modified dynamically
• Columns are added and modified dynamically
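The data model above can be pictured as nested maps. The following is a minimal sketch, not Cassandra's actual storage code: `new_row` and `insert` are hypothetical helper names, and the rule that the newest timestamp wins is an assumption made for illustration.

```python
def new_row(declared_families):
    """Create an empty row; column families are fixed when the row is declared."""
    return {family: {} for family in declared_families}


def insert(row, family, column, value, timestamp, super_column=None):
    """Add or overwrite a column; an older timestamp never replaces a newer one."""
    if family not in row:
        # Column families must be declared upfront.
        raise KeyError(f"column family {family!r} was not declared upfront")
    columns = row[family]
    if super_column is not None:
        # Super columns are created dynamically on first use.
        columns = columns.setdefault(super_column, {})
    current = columns.get(column)
    if current is None or timestamp >= current[1]:
        columns[column] = (value, timestamp)


row = new_row(["MailList", "WordList", "System"])
insert(row, "MailList", "tid1", b"<binary>", 1)
insert(row, "WordList", "C1", "V1", 1, super_column="aloha")
```

A simple column family maps column names to (value, timestamp) pairs, while a super column family adds one more level of nesting.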

Page 5

Cassandra Architecture

Page 6

Cassandra API

• Data structures
• Exceptions
• Service API
• ConsistencyLevel (4)
• Retrieval methods (5)
• Range query: returns matching keys (1)
• Modification methods (3)
• Others

Page 7

Cassandra commands

Page 8

Partitioning and replication (1)

• Consistent hashing
– DHT
– Balance
– Monotonicity
– Spread
– Load
• Virtual nodes
• Coordinator
• Preference list

Page 9

Partitioning and replication (2)

[Figure: consistent-hashing ring over the range 0 to 1, with 1/2 at the opposite point; nodes A to F are placed on the ring, keys are positioned at h(key1) and h(key2), and the replication factor is N=3]
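The ring with virtual nodes and a preference list of N=3 replicas can be sketched as follows. This is an illustrative sketch only: the `Ring` class, the MD5-based `h` function, the 32-bit ring size, and the `vnodes` parameter are assumptions, not the partitioner used by Cassandra or Dynamo.

```python
import bisect
import hashlib


def h(value):
    """Map a string to a position on the ring [0, 2**32)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several virtual positions (tokens) on the ring.
        self.tokens = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.tokens]

    def preference_list(self, key, n=3):
        """Walk clockwise from h(key) and collect the first n distinct nodes."""
        start = bisect.bisect_right(self.positions, h(key))
        owners = []
        for step in range(len(self.tokens)):
            node = self.tokens[(start + step) % len(self.tokens)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners


ring = Ring(["A", "B", "C", "D", "E", "F"])
replicas = ring.preference_list("key1", n=3)
```

The first node in the returned list plays the role of the coordinator for the key; the remaining nodes hold the other replicas.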

Page 10

Data Versioning

• Always writeable
• Multiple versions
– put() may return before the update reaches all replicas
– get() may return many versions
• Vector clocks
• Reconciliation during reads by clients

Page 11

Vector clock

• List of (node, counter) pairs
E.g. [x,2][y,3] vs. [x,3][y,4][z,1] (the first version is an ancestor of the second)
[x,1][y,3] vs. [z,1][y,3] (concurrent versions, in conflict)
• Use timestamp
E.g. D([x,1]:t1,[y,1]:t2)
• Remove the oldest pair when the list reaches a threshold
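Comparing two vector clocks can be sketched as below. This is a minimal sketch that represents a clock as a dict from node name to counter; the function names `descends` and `compare` are hypothetical, and a missing node is assumed to have counter 0.

```python
def descends(a, b):
    """True if clock a has seen everything clock b has (pairwise a >= b)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clock a vs. clock b."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if b_ge_a:
        return "before"      # a is an ancestor of b and can be discarded
    if a_ge_b:
        return "after"       # b is an ancestor of a
    return "concurrent"      # neither dominates: the client must reconcile


first = compare({"x": 2, "y": 3}, {"x": 3, "y": 4, "z": 1})
second = compare({"x": 1, "y": 3}, {"z": 1, "y": 3})
```

With the two example pairs from the slide, `first` is "before" and `second` is "concurrent".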

Page 12

Vector clock

Return all the objects at the leaves

D3,4([Sx,2],[Sy,1],[Sz,1])

Single new version

Page 13

Execution of operations

• Two strategies for choosing the node that handles a request
– Go through a generic load balancer that picks a node based on load information
  • Easy: the client does not have to link any store-specific code
– Send the request directly to the coordinator node
  • Achieves lower latency

Page 14

Put() operation

[Figure: the client sends put() to the coordinator; the coordinator sends the object with its vector clock to replicas P1, P2, ..., PN-1 and waits for W-1 responses]

Page 15

Cluster Membership

• Gossip protocol
• State disseminated in O(log N) rounds
• Every T seconds, each node increments its heartbeat counter and sends its membership list to another node
• Merge operations
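One gossip round and the merge step can be sketched as follows. This is an illustrative sketch under simplifying assumptions: `views` is a hypothetical in-process dict standing in for the network, each view maps a member name to the highest heartbeat seen, and the peer is picked uniformly at random.

```python
import random


def merge(local, remote):
    """Keep the highest heartbeat counter seen for every member."""
    for member, heartbeat in remote.items():
        if heartbeat > local.get(member, -1):
            local[member] = heartbeat


def tick(node, views):
    """One gossip round for `node`: bump its own heartbeat, send its list to one peer."""
    view = views[node]
    view[node] = view.get(node, 0) + 1
    peers = [p for p in views if p != node]
    if peers:
        merge(views[random.choice(peers)], view)


# Each node starts out knowing only itself.
views = {name: {name: 0} for name in ["A", "B", "C", "D"]}
for _ in range(10):
    for name in views:
        tick(name, views)
```

In a real deployment `tick` would run on a timer every T seconds on each node, and a member whose heartbeat stops increasing would eventually be suspected as failed.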

Page 19

Failure

• Data center(s) failure
– Multiple data centers
• Temporary failure
• Permanent failure
– Merkle tree

Page 20

Temporary failure

Page 21

Merkle tree

Page 22

Bloom filter

A space-efficient probabilistic data structure used to test whether an element is a member of a set. False positives are possible; false negatives are not.
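The membership test can be sketched as below. This is a minimal sketch, not Cassandra's implementation: the `BloomFilter` class, its default sizes, and the use of a salted SHA-256 to derive the bit positions are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["K1", "K2", "K3"]:
    bf.add(key)
```

A key that was added always reports True, so a False answer lets a read skip a data file without touching the disk.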

Page 23

Compactions

Input data files, each sorted by key and DELETED after the merge:
  File 1 (sorted): K1 <Serialized data>, K2 <Serialized data>, K3 <Serialized data>, ...
  File 2 (sorted): K2 <Serialized data>, K10 <Serialized data>, K30 <Serialized data>, ...
  File 3 (sorted): K4 <Serialized data>, K5 <Serialized data>, K10 <Serialized data>, ...

MERGE SORT

Output:
  Data File (sorted): K1, K2, K3, K4, K5, K10, K30, each with <Serialized data>
  Index File: K1 Offset, K5 Offset, K30 Offset
  Bloom Filter
  The Index File and the Bloom Filter are loaded in memory
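The merge step of a compaction can be sketched as follows. This is an illustrative sketch, assuming each input file is a list of (key, timestamp, value) tuples sorted by key and that the entry with the newest timestamp wins when a key appears in several files; `compact` is a hypothetical name, and integer keys stand in for K1, K2, and so on.

```python
import heapq


def compact(sstables):
    """Merge sorted runs of (key, timestamp, value); keep the newest entry per key."""
    merged = []
    for key, timestamp, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            # heapq.merge yields equal keys in timestamp order, so this
            # entry is at least as new as the one already kept: replace it.
            merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    return merged


file1 = [(1, 1, "a1"), (2, 1, "a2"), (3, 1, "a3")]
file2 = [(2, 2, "b2"), (10, 2, "b10"), (30, 2, "b30")]
file3 = [(4, 3, "c4"), (5, 3, "c5"), (10, 3, "c10")]
data_file = compact([file1, file2, file3])
```

Because every input is already sorted, the merge reads each file sequentially and never needs to hold more than one entry per file in memory.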

Page 24

Write

• A write to Key (CF1, CF2, CF3) is first appended to the commit log (binary serialized) on a dedicated disk
• The write is then applied to the memtable of each column family: Memtable (CF1), Memtable (CF2), Memtable (CF3)
• Memtables are flushed to a data file on disk, depending on
– Data size
– Number of objects
– Lifetime
• Data file on disk, one entry per key:
<Key name><Size of key data><Index of columns/supercolumns><Serialized column family>
• Block index: <Key name> Offset, <Key name> Offset (e.g. K128 Offset, K256 Offset, K384 Offset)
• Bloom filter
• The index is kept in memory
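The write path can be sketched as below. This is a simplified sketch, not Cassandra's code: `ColumnFamilyStore` here is a hypothetical class, Python lists stand in for the commit log and the on-disk data files, and the flush is triggered only by the number of objects.

```python
class ColumnFamilyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the append-only log on a dedicated disk
        self.memtable = {}     # in-memory table for one column family
        self.sstables = []     # each flush produces one immutable sorted run
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log first
        self.memtable[key] = value             # 2. then update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out sorted by key and start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}


store = ColumnFamilyStore(flush_threshold=2)
store.write("b", 1)
store.write("a", 2)
```

Writes never modify an existing data file, which is why the flushed files later need to be merged by compaction.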

Page 25

Read

[Figure: the client sends a query to the Cassandra cluster. The query goes to the closest replica (Replica A), which returns the result. Digest queries go to Replica B and Replica C, which return digest responses. The result is returned to the client. Read repair is performed if the digests differ.]
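The read path with digest comparison can be sketched as follows. This is an illustrative sketch under simplifying assumptions: each replica is a plain dict mapping a key to a (timestamp, value) pair, the first replica in the list is treated as the closest one, MD5 is used only as a stand-in digest, and the newest timestamp wins during repair. The names `digest` and `read` are hypothetical.

```python
import hashlib


def digest(value):
    """Short fingerprint of a stored (timestamp, value) pair, or of None."""
    return hashlib.md5(repr(value).encode("utf-8")).hexdigest()


def read(key, replicas):
    """Read from the closest replica and repair the others if digests differ."""
    result = replicas[0].get(key)                         # full data from closest replica
    digests = [digest(r.get(key)) for r in replicas[1:]]  # digests from the rest
    if any(d != digest(result) for d in digests):
        # Digests differ: fetch the full versions, keep the newest, push it everywhere.
        versions = [r[key] for r in replicas if key in r]
        result = max(versions, key=lambda version: version[0])
        for r in replicas:
            r[key] = result
    return result


replica_a = {"k": (1, "old")}
replica_b = {"k": (2, "new")}
replica_c = {}
value = read("k", [replica_a, replica_b, replica_c])
```

Sending digests instead of full values keeps the common case cheap, since full data is transferred from the other replicas only when a mismatch is detected.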

Page 26

Outline

Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write

Sigmod contest 2009

Sigmod contest 2010

Page 27

Sigmod contest 2009

• Task overview
• API
• Data structure
• Architecture
• Test

Page 28

Task overview

• Index system for main memory data
• Running on a multi-core machine
• Many threads with multiple indices
• Serialize execution of user-specified transactions
• Basic functions
– exact match queries, range queries, updates, inserts, deletes

Page 29

API

Page 30

Record

Page 31

HashTable

Fields: Hsize, size, hashTab, average(64), deviation, nbEl, domain, warpMode(bool)

[Figure: hashTab is an array of buckets indexed 0, 1, ..., size-1; each bucket points to a chain of key nodes]

Node fields: dataType key; int64_t hashKey; char *payload; *next

Page 32

HashShared

Fields: nbNameIndex, ni

[Figure: grid of numbered slots with column headers 0, 1, 2, 3, ..., 999]

Figure labels:
• data of type int
• data of type NameIndex (idx, str terminated by \0)
• object of type IdxState

Page 33

TxnState

Fields: state, indexActive, th, iNbR, indexToReset, nbIndex

[Figure: two arrays, each with slots 0, 1, 2, 3, ..., 199]

Page 34

IdxState

• Keeps track of an index
• Created by openIndex()
• Destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers to
– a hashtable
– a FixedAllocator
– an Allocator
– an array with the type of action

Page 35

Architecture

indexManager

Allocator

transactor

DeadLockDetector

Page 36

IndexManager

Fields: hs, nbIndexTab, indexTab

[Figure: the indexTab array with entries indexTab[0], indexTab[1], indexTab[2], indexTab[3], ..., indexTab[i], ...]

Page 37

DeadLockDetector

Page 38

Transactor

• a HashOnlyGet object with type TxnState

Fields: nbNameIndex, id, mutex, iThread, nbElement, pt, data T

[Figure: grid of numbered slots with column headers 0, 1, 2, 3, ..., 999]

Page 39

Allocator

• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized by payload length; the maximum payload length is 100
• Payloads of the same length are kept in the same list
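A pool allocator with one free list per payload length can be sketched as follows. This is a loose illustration in Python rather than the contest code: the `PoolAllocator` class and its methods are hypothetical, Python lists stand in for linked lists, and the only figure taken from the slide is the maximum payload length of 100.

```python
class PoolAllocator:
    MAX_PAYLOAD = 100  # maximum payload length stated on the slide

    def __init__(self):
        # free_lists[n] holds recycled buffers of exactly n bytes.
        self.free_lists = {n: [] for n in range(1, self.MAX_PAYLOAD + 1)}

    def allocate(self, payload):
        """Return a buffer holding payload, reusing a freed one of the same length."""
        size = len(payload)
        if not 1 <= size <= self.MAX_PAYLOAD:
            raise ValueError("payload length must be between 1 and 100")
        free = self.free_lists[size]
        buf = free.pop() if free else bytearray(size)
        buf[:] = payload
        return buf

    def release(self, buf):
        """Put a buffer back on the free list for its length."""
        self.free_lists[len(buf)].append(buf)


allocator = PoolAllocator()
first_buf = allocator.allocate(b"abc")
allocator.release(first_buf)
second_buf = allocator.allocate(b"xyz")  # reuses the 3-byte buffer
```

Grouping buffers by length means a released buffer always fits the next request of that length exactly, which avoids fragmentation and repeated calls to the system allocator.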

Page 40

Unit Tests

• Three threads run over three indices
• The primary thread
– creates the primary index
– inserts, deletes and accesses data in the primary index
• The second thread
– simultaneously runs some basic tests over a separate index
• The third thread
– ensures the transactional guarantees
– continuously queries the primary index

Page 41

Outline

Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write

Sigmod contest 2009

Sigmod contest 2010

Page 42

Task overview

• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the right results
• Data is stored on disk; the indexes are all in memory
• Measure the total time costs

Page 43

SQL query form

SELECT alias_name.field_name, ...
FROM table_name AS alias_name, ...
WHERE condition1 AND ... AND conditionN

Condition

alias_name.field_name = fixed value
alias_name.field_name > fixed value
alias_name.field_name1 = alias_name.field_name2

Page 44

Initialization phase

Page 45

Connection phase

Page 46

Query phase

Page 47

Closing phase

Page 48

Tests

• An initial computation
• On synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Teams that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD

Page 49

Benchmarks (stage 1)

• Assume a partition always covers the entire table; the data is not replicated
• Unit tests
• Benchmarks
– On a single node, selects with an equal condition on the primary key
– On a single node, selects with an equal condition on an indexed field
– On a single node, 2 to 5 joins on tables of different size
– On a single node, 1 join and a "greater than" condition on an indexed field
– On three nodes, one join on two tables of different size, the two tables being on two different nodes

Page 50

Benchmarks (stage 2)

• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, with up to 50 simultaneous connections
• Benchmarks
– Selects with an equal condition on the primary key, the values being uniformly distributed
– Selects with an equal condition on the primary key, the values being non-uniformly distributed
– Multiple joins on tables separated on different nodes

Page 51

Important Dates

Page 52

Thank you!!!