scaling ancestrydna using hadoop and hbase

48
1 Scaling AncestryDNA Using Hadoop and HBase April 10th, 2014 Jeremy Pollack

Upload: ancestrycom

Post on 27-Jan-2015

104 views

Category:

Technology


1 download

DESCRIPTION

Learn more about how our team scales our AncestryDNA technology to match you within our database's DNA pool.

TRANSCRIPT

Page 1: Scaling AncestryDNA Using Hadoop and HBase

1

Scaling AncestryDNAUsing Hadoop and HBase

April 10th, 2014Jeremy Pollack

Page 2: Scaling AncestryDNA Using Hadoop and HBase

Ancestry.com Mission

2

Page 3: Scaling AncestryDNA Using Hadoop and HBase

• Over 30,000 historical content collections

• 13 billion records and images

• Records dating back to 16th century

• 10 petabytes

We are the world's largest online family history resource

Discoveries are the Key

3

Page 4: Scaling AncestryDNA Using Hadoop and HBase

The “eureka” moment drives our business

Discoveries in Detail

4

Page 5: Scaling AncestryDNA Using Hadoop and HBase

Spit in a tube, pay $99, learn your past

Autosomal DNA testsOver 200,000+ DNA samples700,000 SNPs for each sample10,000,000+ cousin matches

DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism)

(http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)

MayJune

July

August

Septem

ber

October

November

December

January

Febru

aryMarc

hApril

May -

50,000

100,000

150,000 Genotyped Samples

Discoveries with DNA

5

Page 6: Scaling AncestryDNA Using Hadoop and HBase

Network Effect – Cousin Matches

6

2000 10053 21205 40201 60240 80405 1157560

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

Cous

in M

atch

es

Database Size

Page 7: Scaling AncestryDNA Using Hadoop and HBase

So how do we do it?

7

Page 8: Scaling AncestryDNA Using Hadoop and HBase

• GERMLINE is an algorithm that finds hidden relationships within a pool of DNA

• GERMLINE also refers to the reference implementation of that algorithm written in C++

• You can find it here :

http://www1.cs.columbia.edu/~gusev/germline/

Introducing … GERMLINE!

8

Page 9: Scaling AncestryDNA Using Hadoop and HBase

• GERMLINE (the implementation) was not meant to be used in an industrial setting

Stateless Single threaded Prone to swapping (heavy memory usage)

• GERMLINE performs poorly on large data sets

• Our metrics predicted exactly where the process would slow to a crawl

• Put simply: GERMLINE couldn't scale

So What’s the Problem?

9

Page 10: Scaling AncestryDNA Using Hadoop and HBase

GERMLINE Run Times (in hours)

2500

5000

7500

10000

12500

15000

17500

20000

22500

25000

27500

30000

32500

35000

37500

40000

42500

45000

47500

50000

52500

55000

57500

60000

0

5

10

15

20

25H

ours

Samples

10

Page 11: Scaling AncestryDNA Using Hadoop and HBase

Projected GERMLINE Run Times (in hours)

2500

7500

12500

17500

22500

27500

32500

37500

42500

47500

52500

57500

62500

67500

72500

77500

82500

87500

92500

97500

102500

107500

112500

117500

122500

127500

132500

0

100

200

300

400

500

600

700

GERMLINE run times

Projected GERMLINE run times

Samples

Hou

rs

11

Page 12: Scaling AncestryDNA Using Hadoop and HBase

The Mission : Create a Scalable Matching Engine

... and thus was born

(aka "Jermline with a J")

12

Page 13: Scaling AncestryDNA Using Hadoop and HBase

What is Hadoop?

• Hadoop is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion, using commodity hardware

• Hadoop specifies a distributed file system called HDFS

• Hadoop supports a processing methodology known as MapReduce

• Many tools are built on top of Hadoop, such as HBase, Hive, and Flume

13

Page 14: Scaling AncestryDNA Using Hadoop and HBase

What is HDFS?

Page 15: Scaling AncestryDNA Using Hadoop and HBase

What happens when HDFS loses a server

Page 16: Scaling AncestryDNA Using Hadoop and HBase

What is MapReduce?

16

Page 17: Scaling AncestryDNA Using Hadoop and HBase

What is HBase?

• HBase is an open-source NoSQL data store that runs on top of HDFS

• HBase is columnar; you can think of it as a weird amalgam of a hashtable and a spreadsheet

• HBase supports unlimited rows and columns

• HBase stores data sparsely; there is no penalty for empty cells

• HBase is gaining in popularity: Salesforce, Facebook, and Twitter have all invested heavily in the technology, as well as many others

17

Page 18: Scaling AncestryDNA Using Hadoop and HBase

Game of Thrones Characters, in an HBase Table

KEY gender hair_color family_name is_evilJoffrey male blonde Baratheon yesCersei female blonde Lannister kinda

18

Page 19: Scaling AncestryDNA Using Hadoop and HBase

Adding a Row to an HBase Table

KEY gender hair_color family_name is_evilJoffrey male blonde Baratheon yesCersei female blonde Lannister kindaSansa female red Stark no

19

Page 20: Scaling AncestryDNA Using Hadoop and HBase

Adding a Column to an HBase Table

KEY gender hair_color family_name is_evil titleJoffrey male blonde Baratheon yes kingCersei female blonde Lannister kindaSansa female red Stark no

20

Page 21: Scaling AncestryDNA Using Hadoop and HBase

Cersei : ACTGACCTAGTTGACJoffrey : TTAAGCCTAGTTGAC

The Input

Cersei Baratheon

• Former queen of Westeros

• Machiavellian manipulator

• Mostly evil, but occasionally sympathetic

Joffrey Baratheon

• Pretty much the human embodiment of evil

• Needlessly cruel • Kinda looks like

Justin Bieber

DNA Matching : How it Works

21

Page 22: Scaling AncestryDNA Using Hadoop and HBase

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC

Separate into words

DNA Matching : How it Works

22

Page 23: Scaling AncestryDNA Using Hadoop and HBase

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC

ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey

Build the hash table

DNA Matching : How it Works

23

Page 24: Scaling AncestryDNA Using Hadoop and HBase

Iterate through genome and find matches

Cersei and Joffrey match from position 1 to position 2

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC

ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey

DNA Matching : How it Works

24

Page 25: Scaling AncestryDNA Using Hadoop and HBase

Does that mean they're related?

...maybe

?25

Page 26: Scaling AncestryDNA Using Hadoop and HBase

IBD to Relationship Estimation

• We use the total length of all shared segments to estimate the relationship between to genetic relatives

• This is basically a classification problem

26

Page 27: Scaling AncestryDNA Using Hadoop and HBase

Jaime : TTAAGCCTAGGGGCG

But Wait...What About Jaime?

Jaime Lannister

• Kind of a has-been• Killed the Mad King• Has the hots for his

sister, Cersei

27

Page 28: Scaling AncestryDNA Using Hadoop and HBase

Adding a new sample, the GERMLINE way

28

Page 29: Scaling AncestryDNA Using Hadoop and HBase

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG

ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime

Step one: Rebuild the entire hash table from scratch, including the new sample

The GERMLINE Way

29

Page 30: Scaling AncestryDNA Using Hadoop and HBase

Cersei and Joffrey match from position 1 to position 2Joffrey and Jaime match from position 0 to position 1Cersei and Jaime match at position 1

Step two: Find everybody's matches all over again, including the new sample. (n x n comparisons)

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG

ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime

The GERMLINE Way

30

Page 31: Scaling AncestryDNA Using Hadoop and HBase

Cersei and Joffrey match from position 1 to position 2Joffrey and Jaime match from position 0 to position 1Cersei and Jaime match at position 1

Step three : Now, throw away the evidence!

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG

ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime

The GERMLINE Way

31

Page 32: Scaling AncestryDNA Using Hadoop and HBase

Step one: Update the hash tableCersei Joffrey

2_ACTGA_0 1

2_TTAAG_0 1

2_CCTAG_1 1 1

2_TTGAC_2 1 1

Already stored in HBase

Jaime : TTAAG CCTAG GGGCG New sample to add

Key : [CHROMOSOME]_[WORD]_[POSITION]Qualifier : [USER ID]Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome

The Way

32

Page 33: Scaling AncestryDNA Using Hadoop and HBase

Jaime and Joffrey match from position 0 to position 1Jaime and Cersei match at position 1

Already stored in HBase

2_Cersei 2_Joffrey

2_Cersei { (1, 2), ...}

2_Joffrey { (1, 2), ...}

New matches to add

Key : [CHROMOSOME]_[USER ID]Qualifier : [CHROMOSOME]_[USER ID]Cell value : A list of ranges where the two users match on a chromosome

Step two: Find matches, update the results table

The Way

33

Page 34: Scaling AncestryDNA Using Hadoop and HBase

Results Table2_Cersei 2_Joffrey 2_Jaime

2_Cersei { (1, 2), ...} { (1), ...}

2_Joffrey { (1, 2), ...} { (0,1), ...}

2_Jaime { (1), ...} { (0,1), ...}

Hash TableCersei Joffrey Jaime

2_ACTGA_0 1

2_TTAAG_0 1 1

2_CCTAG_1 1 1 1

2_TTGAC_2 1 1

2_GGGCG_2 1

The Way

34

Page 35: Scaling AncestryDNA Using Hadoop and HBase

But wait ... what about Daenerys, Tyrion, Arya, and Jon Snow?

35

Page 36: Scaling AncestryDNA Using Hadoop and HBase

Run them in parallel with Hadoop!

36

Page 37: Scaling AncestryDNA Using Hadoop and HBase

Parallelism with Hadoop

• Batches are usually about a thousand people

• Each mapper takes a single chromosome for a single person

• MapReduce Jobs :Job #1 : Match Words

o Updates the hash tableJob #2 : Match Segments

o Identifies areas where the samples match

37

Page 38: Scaling AncestryDNA Using Hadoop and HBase

How does perform?

A 1700% performance improvement over GERMLINE!

38

Page 39: Scaling AncestryDNA Using Hadoop and HBase

2500

7500

12500

17500

22500

27500

32500

37500

42500

47500

52500

57500

62500

67500

72500

77500

82500

87500

92500

97500

102500

107500

112500

117500

0

5

10

15

20

25

Samples

Hou

rsRun times for Matching (in hours)

39

Page 40: Scaling AncestryDNA Using Hadoop and HBase

25007500

1250017500

2250027500

3250037500

4250047500

5250057500

6250067500

7250077500

8250087500

9250097500

102500

107500

112500

1175000

20

40

60

80

100

120

140

160

180

GERMLINE run times

Projected GERMLINE run times

Jermline run times

Run times for Matching (in hours)

Samples

Hou

rs

40

Page 41: Scaling AncestryDNA Using Hadoop and HBase

Bottom line: Without Hadoop and HBase, this would have been expensive and difficult.

Dramatically Increased our Capacity

41

Page 42: Scaling AncestryDNA Using Hadoop and HBase

And now for everybody's favorite part ....

Lessons Learned

42

Page 43: Scaling AncestryDNA Using Hadoop and HBase

Lessons Learned

What went right?

43

Page 44: Scaling AncestryDNA Using Hadoop and HBase

Lessons Learned : What went right?

44

• This project would not have been possible without TDD• Two sets of test data : generated and public domain• 89% coverage

• Corrected a bug in the reference implementation

• Has never failed in production

Page 45: Scaling AncestryDNA Using Hadoop and HBase

Lessons Learned

What would we do differently?

45

Page 46: Scaling AncestryDNA Using Hadoop and HBase

Lessons Learned : What would we do differently?

• Front-load some performance tests HBase and Hadoop can have odd performance profiles HBase in particular has some strange behavior if you're not

familiar with its inner workings

• Allow a lot of time for live tests, dry runs, and deployment

These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"

46

Page 47: Scaling AncestryDNA Using Hadoop and HBase

Questions?

47

Page 48: Scaling AncestryDNA Using Hadoop and HBase

48

And yes, we are hiring!You can contact me at [email protected]

or just talk to me here!