top five questions to ask when choosing a big data solution

Five factors to consider when choosing a big data solution Jonathan Ellis CTO, DataStax Project Chair, Apache Cassandra

Upload: jbellis

Post on 26-Jan-2015


Page 1: Top five questions to ask when choosing a big data solution

Five factors to consider when choosing a big data solution
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra

Page 2: Top five questions to ask when choosing a big data solution

©2012 DataStax

How do I model my application?

Page 3: Top five questions to ask when choosing a big data solution


Popular options

• Key/value
• Tabular
• Document
• Graph?

Page 4: Top five questions to ask when choosing a big data solution


Schema is your friend

{
  "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
  "name": "jbellis",
  "state": "TX",
  "birthdate": "1/1/1976",
  "email_addresses": ["jbellis@gmail", "[email protected]"]
}

Page 5: Top five questions to ask when choosing a big data solution


SQL can be your friend too

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE INDEX ON users(state);

SELECT * FROM users
WHERE state = 'Texas' AND birth_date > '1950-01-01';

Page 6: Top five questions to ask when choosing a big data solution


Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;

Page 7: Top five questions to ask when choosing a big data solution


Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;

X

Page 8: Top five questions to ask when choosing a big data solution


Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date,
  email_addresses set<text>
);

UPDATE users
SET email_addresses = email_addresses + {'[email protected]', '[email protected]'};

Page 9: Top five questions to ask when choosing a big data solution


Joins don’t scale

• No joins
• No subqueries
• No aggregation functions* or GROUP BY
• ORDER BY?

Page 10: Top five questions to ask when choosing a big data solution


SELECT * FROM tweets
WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 'driftx')

[Diagram: followers ? tweets]

Page 11: Top five questions to ask when choosing a big data solution


Clustering in Cassandra

CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,
  tweet_author uuid,
  tweet_body text,
  PRIMARY KEY (user_id, tweet_id)
);

user_id   tweet_id     _author    _body
jbellis   3290f9da..   rbranson   lorem
jbellis   3895411a..   tjake      ipsum
...       ...          ...
driftx    3290f9da..   rbranson   lorem
driftx    71b46a84..   yzhang     dolor
...       ...          ...
yukim     3290f9da..   rbranson   lorem
yukim     e451dd42..   tjake      amet
...       ...          ...

Page 12: Top five questions to ask when choosing a big data solution



SELECT * FROM timeline
WHERE user_id = 'driftx';
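The clustering layout of the timeline table can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the `Timeline` class, the `fan_out` helper, and the integer tweet ids standing in for timeuuids are all hypothetical and are not Cassandra's API. It shows rows grouped by partition key (`user_id`) and returned in clustering-key order (`tweet_id`), so reading one user's timeline touches a single partition and needs no join or subquery.

```python
from collections import defaultdict


class Timeline:
    """Toy model of PRIMARY KEY (user_id, tweet_id); not Cassandra's API."""

    def __init__(self):
        # partition key (user_id) -> {clustering key (tweet_id): (author, body)}
        self._partitions = defaultdict(dict)

    def insert(self, user_id, tweet_id, author, body):
        # Writing the same (user_id, tweet_id) twice overwrites the row.
        self._partitions[user_id][tweet_id] = (author, body)

    def select(self, user_id):
        # Single-partition read, rows returned in tweet_id order.
        rows = self._partitions.get(user_id, {})
        return [(tid, author, body) for tid, (author, body) in sorted(rows.items())]


def fan_out(timeline, follower_ids, tweet_id, author, body):
    """Hypothetical helper: copy one tweet into each follower's partition."""
    for follower in follower_ids:
        timeline.insert(follower, tweet_id, author, body)


timeline = Timeline()
fan_out(timeline, ["jbellis", "driftx", "yukim"], 1, "rbranson", "lorem")
fan_out(timeline, ["driftx"], 3, "yzhang", "dolor")
fan_out(timeline, ["jbellis"], 2, "tjake", "ipsum")
driftx_rows = timeline.select("driftx")
```

The design choice illustrated is denormalization: the work of finding followers is paid at write time, so the read is a lookup by partition key.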

Page 13: Top five questions to ask when choosing a big data solution


How does it perform?

Page 14: Top five questions to ask when choosing a big data solution


Larger than memory datasets

Page 15: Top five questions to ask when choosing a big data solution


Locking

Page 16: Top five questions to ask when choosing a big data solution


Efficiency

Page 17: Top five questions to ask when choosing a big data solution


UPDATE users
SET email_addresses = email_addresses + {...}
WHERE user_id = 'jbellis';

Page 18: Top five questions to ask when choosing a big data solution


Durability

Page 19: Top five questions to ask when choosing a big data solution


C* storage engine very briefly

[Diagram: write(k1, c1:v1) arrives. Memory holds the Memtable; the hard drive holds the Commit log.]

Page 20: Top five questions to ask when choosing a big data solution


[Diagram: write(k1, c1:v1) is recorded twice: as k1 c1:v1 in the Memtable (memory) and as k1 c1:v1 in the Commit log (hard drive).]

Page 21: Top five questions to ask when choosing a big data solution


[Diagram: write(k1, c2:v2). In memory, the k1 row becomes k1 c1:v1 c2:v2; on the hard drive, k1 c2:v2 is appended after k1 c1:v1.]

Page 22: Top five questions to ask when choosing a big data solution


[Diagram: write(k2, c1:v1 c2:v2). Memory holds k1 c1:v1 c2:v2 and k2 c1:v1 c2:v2; on the hard drive, k2 c1:v1 c2:v2 is appended after k1 c1:v1 and k1 c2:v2.]

Page 23: Top five questions to ask when choosing a big data solution


[Diagram: write(k1, c1:v4 c3:v3). Memory holds k1 c1:v4 c2:v2 c3:v3 and k2 c1:v1 c2:v2; on the hard drive, k1 c1:v4 c3:v3 is appended after the three earlier entries.]

Page 24: Top five questions to ask when choosing a big data solution


[Diagram: flush. The contents of memory are written to the hard drive as an SSTable with an index, containing k1 c1:v4 c2:v2 c3:v3 and k2 c1:v1 c2:v2; cleanup follows.]
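The write path walked through on the preceding slides can be sketched as a toy model. Everything here is an assumption made for illustration: `ToyStorageEngine` and its methods are hypothetical names, the "disk" is just Python lists, and the SSTable index shown on the slide is not modeled. The sketch replays the four writes from the slides: each write is appended to the commit log and merged into the memtable row, and a flush writes the memtable out as a sorted, immutable snapshot before clearing the log.

```python
class ToyStorageEngine:
    """Illustrative model of commit log + memtable + SSTable flush."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the on-disk log
        self.memtable = {}    # row key -> {column: value}, held in memory
        self.sstables = []    # immutable sorted snapshots, stand-in for disk

    def write(self, key, columns):
        # Sequential append for durability, then an in-memory merge.
        self.commit_log.append((key, dict(columns)))
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write rows out sorted by key, columns sorted by name.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        # Cleanup: flushed data no longer needs the memtable or its log entries.
        self.memtable = {}
        self.commit_log = []
        return sstable


engine = ToyStorageEngine()
engine.write("k1", {"c1": "v1"})
engine.write("k1", {"c2": "v2"})
engine.write("k2", {"c1": "v1", "c2": "v2"})
engine.write("k1", {"c1": "v4", "c3": "v3"})
log_length_before_flush = len(engine.commit_log)
sstable = engine.flush()
```

Note that no step rewrites data in place: the log only grows by appending, and the SSTable is written once.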

Page 25: Top five questions to ask when choosing a big data solution


No random writes

Page 26: Top five questions to ask when choosing a big data solution


[Bar chart: reads/s and writes/s for Cassandra 0.6 and Cassandra 1.0; y-axis from 0 to 35,000]

Page 27: Top five questions to ask when choosing a big data solution


How does it handle failure?

Page 28: Top five questions to ask when choosing a big data solution


Classic partitioning with SPOF

[Diagram: client, router, partition 1, partition 2, partition 3, partition 4]

Page 29: Top five questions to ask when choosing a big data solution


Availability

• “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston: DataStax
• “The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.” -- Rick Branson: Instagram

Page 30: Top five questions to ask when choosing a big data solution


Fully distributed, no SPOF

[Diagram: client and cluster nodes; visible labels: p1, p1, p1, p3, p6]
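One way to picture "no SPOF" is a cluster in which every key has several replica nodes and no router sits in front of them. The slides do not spell out the placement scheme, so the hash ring below is an assumption chosen for illustration; the `Ring` class, the node names, and the replication factor of 3 are hypothetical, and real partitioners and replication strategies differ in detail. The sketch shows that losing any one replica still leaves live copies of the key.

```python
import bisect
import hashlib


def token(value):
    """Deterministic position on the ring (MD5 used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: a key is stored on the next N nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key):
        # First node clockwise from the key's token, wrapping around the ring.
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def live_replicas(self, key, down=()):
        # Replicas that can still serve the key when some nodes are down.
        return [n for n in self.replicas(key) if n not in down]


ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"])
owners = ring.replicas("jbellis")
survivors = ring.live_replicas("jbellis", down={owners[0]})
```

Because any node can compute `replicas(key)` locally, a client can contact any of them, in contrast with the single router in the classic partitioning slide.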

Page 31: Top five questions to ask when choosing a big data solution


Multiple datacenters

Page 32: Top five questions to ask when choosing a big data solution


Page 33: Top five questions to ask when choosing a big data solution


How does it scale?

Page 34: Top five questions to ask when choosing a big data solution


Scaling antipatterns

• Metadata servers
• Router bottlenecks
• Overloading existing nodes when adding capacity

Page 35: Top five questions to ask when choosing a big data solution


Page 36: Top five questions to ask when choosing a big data solution


How flexible is it?

Page 37: Top five questions to ask when choosing a big data solution


Page 38: Top five questions to ask when choosing a big data solution


Data model: Realtime

Portfolios

user      stock   shares
jbellis   GOOG    80
jbellis   LNKD    20
yukim     AMZN    100

StockHist

stock   date         price
GOOG    2011-01-01   $8.23
GOOG    2011-01-02   $6.14
GOOG    2011-01-03   $7.78

LiveStocks

stock   last
GOOG    $95.52
AAPL    $186.10
AMZN    $112.98

Page 39: Top five questions to ask when choosing a big data solution


Data model: Analytics

HistLoss

             worst_date   loss
Portfolio1   2011-07-23   -$34.81
Portfolio2   2011-03-11   -$11432.24
Portfolio3   2011-05-21   -$1476.93

Page 40: Top five questions to ask when choosing a big data solution


Data model: Analytics

10dayreturns

stock   rdate        return
GOOG    2011-07-25   $8.23
GOOG    2011-07-24   $6.14
GOOG    2011-07-23   $7.78
AAPL    2011-07-25   $15.32
AAPL    2011-07-24   $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.stock, b.date as rdate, b.price - a.price
FROM StockHist a JOIN StockHist b
ON (a.stock = b.stock AND date_add(a.date, 10) = b.date);
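The self-join in the 10dayreturns query pairs each price with the price of the same stock ten days later and subtracts the two. A small Python sketch of that arithmetic follows; the function name `ten_day_returns` is hypothetical and the sample prices are invented for illustration, they are not the values behind the slide's table.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """Mimic: JOIN ON a.stock = b.stock AND date_add(a.date, 10) = b.date."""
    # Index prices by (stock, date) so the "join" is a dictionary lookup.
    price_on = {(stock, day): price for stock, day, price in stock_hist}
    returns = []
    for stock, day, price in stock_hist:
        later = day + timedelta(days=days)
        if (stock, later) in price_on:
            # rdate is the later date; return is b.price - a.price.
            returns.append((stock, later, price_on[(stock, later)] - price))
    return sorted(returns)


# Invented sample rows in StockHist's (stock, date, price) shape.
stock_hist = [
    ("GOOG", date(2011, 7, 13), 100.0),
    ("GOOG", date(2011, 7, 14), 101.0),
    ("GOOG", date(2011, 7, 23), 107.5),
    ("GOOG", date(2011, 7, 24), 107.0),
]
rows = ten_day_returns(stock_hist)
```

Rows with no counterpart ten days later simply drop out, as they would in the inner join.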

Page 41: Top five questions to ask when choosing a big data solution


Data model: Analytics

portfolio_returns

portfolio    rdate        preturn
Portfolio1   2011-07-25   $118.21
Portfolio1   2011-07-24   $60.78
Portfolio1   2011-07-23   -$34.81
Portfolio2   2011-07-25   $2143.92
Portfolio3   2011-07-24   -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.stock = b.stock)
GROUP BY portfolio, rdate;

Page 42: Top five questions to ask when choosing a big data solution


Data model: Analytics

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a JOIN portfolio_returns b
ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss

             worst_date   loss
Portfolio1   2011-07-23   -$34.81
Portfolio2   2011-03-11   -$11432.24
Portfolio3   2011-05-21   -$1476.93
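The HistLoss query finds each portfolio's minimum return and joins back to recover the date on which it occurred. A rough Python equivalent is sketched below; `worst_loss` is a hypothetical name, and the input is only the five portfolio_returns rows visible on the earlier slide, so the results for Portfolio2 and Portfolio3 differ from the HistLoss table, which was presumably computed over the full data. One deliberate simplification: on ties the HiveQL join would return every matching date, while this sketch keeps the first.

```python
def worst_loss(portfolio_returns):
    """Per portfolio, keep the (rdate, preturn) pair with the smallest preturn."""
    worst = {}
    for portfolio, rdate, preturn in portfolio_returns:
        # Strict '<' keeps the first row seen when two returns are equal.
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# The rows shown on the portfolio_returns slide (a partial view of the table).
portfolio_returns = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]
hist_loss = worst_loss(portfolio_returns)
```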

Page 43: Top five questions to ask when choosing a big data solution


Page 44: Top five questions to ask when choosing a big data solution


Some Cassandra users