big data at ancestry.com

DNA Learning from Data:

Who Do You Think You Are? Scott Sorensen and Leonid Zhukov

Ancestry.com Mission

Discoveries

It’s the “aha” moment of a discovery that drives our business!

World’s largest online family history resource

Historical Content
•  Over 30,000 historical content collections
•  11 billion records and images
•  Records dating back to the 16th century

User Contributed Content
•  45 million family trees
•  More than 4 billion profiles
•  200 million stories and photos

DNA Data

•  Over 120,000 DNA samples
•  700,000 SNPs for each sample
•  2,000,000 4th cousin matches

“Spit in a tube, pay $99, learn your past” (Derrick Harris, GigaOm)

DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)
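
As a small illustration of the caption above, a single-nucleotide polymorphism can be located by comparing two aligned sequences position by position. This is only a sketch: the sequences and the function name `snp_positions` are made up for the example and are not Ancestry data or code.

```python
def snp_positions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (index, base_a, base_b) for every position where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical aligned fragments that differ at one base (a C/T polymorphism).
molecule_1 = "AAGCCTA"
molecule_2 = "AAGCTTA"
print(snp_positions(molecule_1, molecule_2))
```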

User Behavior Data

•  40 million searches / day
•  10 million people added to trees / day
•  5 million Hints accepted / day
•  3.5 million Records attached / day

[Charts: time axis from 1/12 to 12/12]

Real-time data feed

Technology

Machine Learning

Person and record search

•  Search query

Hint suggestions system

•  Hints – suggestions to attach a record

Record linkage

•  Record linkage – finding and matching records in multiple data sets with non-unique identifiers

•  Goal: bring together information about the same person

•  Some non-unique identifiers:
    –  Names: first name, last name (John Smith – 300,000 records)
    –  Dates: date of birth, date of death
    –  Places: place of birth, residence, place of death
    –  Extra: family members, life events

•  Records are often incomplete

•  Records contain mistakes

•  Exact and fuzzy match
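
The difference between exact and fuzzy matching on a name can be sketched with an edit distance. This is a minimal illustration, assuming Levenshtein distance as the fuzzy measure; the slides do not say which measure Ancestry uses, and `edit_distance` and `names_match` are hypothetical names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def names_match(query: str, candidate: str, max_edits: int = 0) -> bool:
    """Exact match when max_edits is 0, fuzzy match otherwise (case-insensitive)."""
    return edit_distance(query.lower(), candidate.lower()) <= max_edits
```

With `max_edits=0` the spelling "Jon Smith" is rejected for the query "John Smith"; with `max_edits=1` it is accepted, which is how a record with a spelling mistake can still be linked.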

Life events in collections

•  Life events
    –  Birth: 2.59 bln
    –  Marriage: 114 mln
    –  Census: 2.74 bln
    –  Death: 467 mln

•  Total: 5.91 bln events

Candidate set funnel: exact match

John Smith: 300,000

John Smith, 1870: 2,200

John Smith, 1870, Boston, MA: 10

Search: high precision

Candidate set funnel: fuzzy match

John Smith: 380,000

John Smith, 1870: 97,000

John Smith, 1870, Boston, MA: 1,400

Exploration: high recall
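
The two funnels can be imitated on a toy data set: each added field narrows the candidate set, and the fuzzy variant keeps more candidates at every stage. This is a sketch under stated assumptions: the records are invented, the name similarity uses `difflib.SequenceMatcher` with an assumed cut-off of 0.85, and the fuzzy date rule (missing year or within 2 years) is a guess at what "extended dates" and "missing fields" could mean.

```python
from difflib import SequenceMatcher

# Toy records; None marks a missing field. All values are invented.
RECORDS = [
    {"name": "John Smith", "birth_year": 1870, "birth_place": "Boston, MA"},
    {"name": "John Smith", "birth_year": 1870, "birth_place": "Albany, NY"},
    {"name": "John Smith", "birth_year": 1902, "birth_place": "Boston, MA"},
    {"name": "Jon Smith", "birth_year": 1871, "birth_place": "Boston, MA"},
    {"name": "John Smyth", "birth_year": None, "birth_place": "Boston, MA"},
    {"name": "Mary Jones", "birth_year": 1870, "birth_place": "Boston, MA"},
]


def name_ok(query, value, fuzzy):
    if not fuzzy:
        return value == query
    # Assumed similarity cut-off of 0.85; a stand-in for a name edit distance.
    return SequenceMatcher(None, query.lower(), value.lower()).ratio() >= 0.85


def year_ok(query, value, fuzzy):
    if not fuzzy:
        return value == query
    # Fuzzy mode: accept missing years and an assumed +/- 2 year window.
    return value is None or abs(value - query) <= 2


def funnel(records, name, year, place, fuzzy=False):
    """Return the candidate count after each successive filter."""
    by_name = [r for r in records if name_ok(name, r["name"], fuzzy)]
    by_year = [r for r in by_name if year_ok(year, r["birth_year"], fuzzy)]
    by_place = [r for r in by_year if r["birth_place"] == place]
    return [len(by_name), len(by_year), len(by_place)]


exact_counts = funnel(RECORDS, "John Smith", 1870, "Boston, MA")
fuzzy_counts = funnel(RECORDS, "John Smith", 1870, "Boston, MA", fuzzy=True)
print(exact_counts, fuzzy_counts)
```

The exact funnel favours precision (few, safe candidates); the fuzzy funnel favours recall and leaves the final ordering to a later ranking step.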

Results set

•  Name edit distance
•  Extended dates
•  Missing fields
•  Short names
•  Initials
•  Exact match

Hints suggestion system

•  User feedback loop:
    –  Accept suggestion
    –  Reject suggestion

•  Supervised machine learning

•  Learn similarity measure (how to combine identifiers)

•  Training & testing sets:
    –  User accepts, rejects

•  Features (> 500):
    –  First and last name, DOB, POB, DOD, POD
    –  Parents, children, siblings, spouses
    –  Fuzzy matches

•  Similar to “learning to rank” problem
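
How user feedback becomes supervised training data can be sketched as follows: each (tree person, record) pair is turned into a feature vector, and the user's accept or reject becomes the label. This is an illustration only. The four features, the 0.5 value for missing fields, and the names `Person`, `pair_features` and `training_row` are assumptions; the real system is described as using more than 500 features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    first: str
    last: str
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None


def pair_features(tree_person: Person, record: Person) -> list[float]:
    """Turn one (tree person, record) pair into a numeric feature vector."""
    def same(a, b):
        # 1.0 on agreement, 0.0 on disagreement, 0.5 when either side is missing.
        if a is None or b is None:
            return 0.5
        return 1.0 if a == b else 0.0

    # Absolute gap between birth years, or -1.0 when a year is missing.
    if tree_person.birth_year is None or record.birth_year is None:
        year_gap = -1.0
    else:
        year_gap = float(abs(tree_person.birth_year - record.birth_year))

    return [
        same(tree_person.first.lower(), record.first.lower()),
        same(tree_person.last.lower(), record.last.lower()),
        same(tree_person.birth_place, record.birth_place),
        year_gap,
    ]


def training_row(tree_person: Person, record: Person, accepted: bool):
    """A user accept becomes label 1, a reject becomes label 0."""
    return pair_features(tree_person, record), int(accepted)


person = Person("John", "Smith", 1870, "Boston, MA")
record = Person("John", "Smith", 1871, None)
print(training_row(person, record, accepted=True))
```

Rows of this shape are what a classifier such as a random forest would be trained on.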

A place for machine learning

ML suggest

Candidate k-set

Person Record ?

Similarity measure learning

[Diagram: similarity measure learning pipeline, built on Hadoop and Hive]

•  Training: Ancestry collections and member trees → feature generation (person ID, record ID, label) → ML (random forest) → model
•  Scoring: person ID → index → top-k records candidate set → feature generation → model → ranked list
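
The scoring path (candidate set, features, model, ranked list) can be sketched in a few lines. A weighted sum stands in for the trained model here, which is an assumption made to keep the example self-contained; the slides name a random forest. The weights, record IDs and feature names are invented.

```python
import heapq


def score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Stand-in for the trained model: a weighted sum of pair features."""
    return sum(weights[name] * value for name, value in features.items())


def rank_candidates(candidates, weights, k=3):
    """Score each (record_id, features) candidate and return the top k, best first."""
    scored = ((score(feats, weights), record_id) for record_id, feats in candidates)
    return [(record_id, s) for s, record_id in heapq.nlargest(k, scored)]


# Hypothetical weights and candidate features, invented for illustration.
WEIGHTS = {"name": 0.5, "birth_year": 0.3, "birth_place": 0.2}
CANDIDATES = [
    ("rec-1", {"name": 1.0, "birth_year": 1.0, "birth_place": 0.0}),
    ("rec-2", {"name": 1.0, "birth_year": 1.0, "birth_place": 1.0}),
    ("rec-3", {"name": 0.0, "birth_year": 1.0, "birth_place": 1.0}),
    ("rec-4", {"name": 1.0, "birth_year": 0.0, "birth_place": 0.0}),
]
ranked = rank_candidates(CANDIDATES, WEIGHTS, k=2)
print(ranked)
```

Only the top of the ranked list would be surfaced as hints, so the candidate set can stay generous without flooding the user.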

Large scale machine learning

[Diagram: data in Hadoop HDFS → Hadoop streaming → four parallel random forest (R) jobs → model]

Big Data – Big Picture

Family tree

•  User-generated family trees:
    –  45 mln family trees
    –  4.9 bln profiles

Family tree as a graph (DAG)

•  2,020 nodes
•  572 marriage edges
•  2,910 family edges
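
A family tree with two kinds of edges can be held in a small graph structure. This is a sketch, assuming that "family edges" means directed parent-to-child links and that marriage edges are undirected; the class `FamilyGraph` and the names in the example are hypothetical.

```python
from collections import defaultdict


class FamilyGraph:
    """People are nodes; marriage edges are undirected, family edges point parent -> child."""

    def __init__(self):
        self.people = set()
        self.marriages = set()             # each marriage is a frozenset({a, b})
        self.children = defaultdict(set)   # parent -> {child, ...}

    def add_marriage(self, a, b):
        self.people.update((a, b))
        self.marriages.add(frozenset((a, b)))

    def add_parent(self, parent, child):
        self.people.update((parent, child))
        self.children[parent].add(child)

    def stats(self):
        family_edges = sum(len(kids) for kids in self.children.values())
        return {"nodes": len(self.people),
                "marriage_edges": len(self.marriages),
                "family_edges": family_edges}


g = FamilyGraph()
g.add_marriage("Anna", "Boris")
for parent in ("Anna", "Boris"):
    g.add_parent(parent, "Clara")
    g.add_parent(parent, "Dmitri")
print(g.stats())
```

Because a child never precedes a parent, the parent-to-child edges form a directed acyclic graph, which is what makes generation counting and time ordering possible.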

Family trees

Family trees statistics

“Power law” distribution, 44 mln trees

History from family trees

500 nodes, 700 edges

55 generations

[Figure axis: time]
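
The number of generations in a tree can be read off as the longest parent-to-child chain in the graph. A minimal sketch, assuming the tree is available as a parent-to-children map; the names and the function `generations` are invented for the example.

```python
from functools import lru_cache

# Hypothetical parent -> children map; names are invented.
CHILDREN = {
    "Great-grandmother": ["Grandfather"],
    "Grandfather": ["Mother", "Uncle"],
    "Mother": ["Me"],
    "Uncle": [],
    "Me": [],
}


def generations(children: dict[str, list[str]]) -> int:
    """Number of generations = node count on the longest parent -> child chain."""

    @lru_cache(maxsize=None)
    def depth(person: str) -> int:
        # A person with no children contributes a chain of length 1.
        return 1 + max((depth(c) for c in children.get(person, [])), default=0)

    return max((depth(p) for p in children), default=0)


print(generations(CHILDREN))
```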

Historical immigration to the US

•  Immigration is the movement of people into a country or region to which they are not native in order to settle there

•  Immigrants are those who were born outside the US and died in the US

•  Based on family tree profiles:
    –  Birth/death dates range 1500-1990
    –  Select only complete profiles with FLN, POB, DOB, POD, DOD
    –  Perform de-duplication, remove same ancestors from different family trees
    –  Select only those with POB != US, POD == US

•  15 mln profiles (0.3% of 4.9 bln profiles)
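
The selection steps above can be sketched as a filter over profile records. This is an illustration under assumptions: profiles are plain dicts, dates are reduced to years, places are reduced to country codes, and two profiles count as the same ancestor when all five fields agree. The field names and sample profiles are invented.

```python
REQUIRED = ("name", "pob", "dob", "pod", "dod")


def is_immigrant(p: dict) -> bool:
    """Complete profile, dates within 1500-1990, born outside the US, died in the US."""
    if any(p.get(field) is None for field in REQUIRED):
        return False
    if p["dob"] < 1500 or p["dod"] > 1990:
        return False
    return p["pob"] != "US" and p["pod"] == "US"


def select_immigrants(profiles):
    """Filter profiles, then drop duplicates of the same ancestor across trees."""
    seen, result = set(), []
    for p in filter(is_immigrant, profiles):
        key = tuple(p[field] for field in REQUIRED)  # assumed identity key
        if key not in seen:
            seen.add(key)
            result.append(p)
    return result


# Invented sample profiles.
PROFILES = [
    {"name": "Anna Berg", "pob": "SE", "dob": 1850, "pod": "US", "dod": 1920},
    {"name": "Anna Berg", "pob": "SE", "dob": 1850, "pod": "US", "dod": 1920},    # same ancestor, other tree
    {"name": "John Smith", "pob": "US", "dob": 1870, "pod": "US", "dod": 1940},   # born in the US
    {"name": "Liam Kelly", "pob": "IE", "dob": 1830, "pod": "US", "dod": None},   # incomplete
    {"name": "Paolo Rossi", "pob": "IT", "dob": 1960, "pod": "US", "dod": 2005},  # outside date range
]
immigrants = select_immigrants(PROFILES)
share = 15e6 / 4.9e9  # 15 mln selected out of 4.9 bln profiles
print(len(immigrants), f"{share:.1%}")
```

The last line reproduces the slide's arithmetic: 15 mln out of 4.9 bln is roughly 0.3%.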

Immigration to the USA 1500-1990

Immigration map

Ports of arrival (1800-1980)

Data Science

• Ancestry is building a data science team

• We work on product data and BI

• We are hiring

•  Special thanks to Mercator Group for infographics
