Big Data at Ancestry.com
Presentation at Big Data Summit, April 2013, SF
DNA Learning from Data:
Who Do You Think You Are?
Scott Sorensen and Leonid Zhukov
Ancestry.com Mission
Discoveries
It’s the “aha” moment of a discovery that drives our business!
World’s largest online family history resource
Historical Content
• Over 30,000 historical content collections
• 11 billion records and images
• Records dating back to the 16th century
World’s largest online family history resource
User Contributed Content
• 45 million family trees
• More than 4 billion profiles
• 200 million stories and photos
DNA Data
• Over 120,000 DNA samples
• 700,000 SNPs for each sample
• 2,000,000 4th cousin matches
“Spit in a tube, pay $99, learn your past” - Derrick Harris, GigaOm
DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism). (http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)
User Behavior Data
• 40 million searches / day
• 10 million people added to trees / day
• 5 million Hints accepted / day
• 3.5 million Records attached / day
Real-time data feed
Technology
Machine Learning
Person and record search
• Search query
Hint suggestions system
• Hints - suggestions to attach a record
Record linkage
• Record linkage – finding and matching records in multiple data sets with non-unique identifiers
• Goal: bring together information about the same person
• Some non-unique identifiers:
– Names: first name, last name (John Smith – 300,000 records)
– Dates: date of birth, date of death
– Places: place of birth, residence, place of death
– Extra: family members, life events
• Records are often incomplete
• Records contain mistakes
• Exact and fuzzy match
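To make the difference between exact and fuzzy matching concrete, here is a minimal sketch based on Levenshtein edit distance. The slides do not say which string metric Ancestry uses, so the metric, the function names, and the `max_edits` threshold are all assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def names_match(a: str, b: str, max_edits: int = 0) -> bool:
    """Exact match when max_edits is 0 (hypothetical default),
    fuzzy match when a small number of edits is tolerated."""
    return edit_distance(a.strip().lower(), b.strip().lower()) <= max_edits


exact = names_match("Smith", "Smyth")               # spelling variant rejected
fuzzy = names_match("Smith", "Smyth", max_edits=1)  # spelling variant accepted
```

Allowing even one edit widens the candidate set considerably, which is why a fuzzy match is better suited to exploration than to precise search.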
Life events in collections
• Life events
– Birth: 2.59 bln
– Marriage: 114 mln
– Census: 2.74 bln
– Death: 467 mln
• Total: 5.91 bln events
Candidate set funnel: exact match
John Smith: 300,000
John Smith, 1870: 2,200
John Smith, 1870, Boston, MA: 10
Search: high precision
Candidate set funnel: fuzzy match
John Smith: 380,000
John Smith, 1870: 97,000
John Smith, 1870, Boston, MA: 1,400
Exploration: high recall
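The two funnels can be sketched as successive filters over a candidate set, counting survivors after each stage. This is only an illustration: the `Record` fields, the two-year date window, and the rule that a missing field passes in fuzzy mode are assumptions, not details given in the slides.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    first: str
    last: str
    birth_year: Optional[int]   # None means the field is missing
    birth_place: Optional[str]  # None means the field is missing


def first_name_ok(query: str, value: str, fuzzy: bool) -> bool:
    if value == query:
        return True
    # Fuzzy mode also accepts a bare initial such as "J" or "J." for "John".
    return fuzzy and value.rstrip(".") == query[0]


def year_ok(query: int, value: Optional[int], fuzzy: bool, window: int = 2) -> bool:
    if value is None:
        return fuzzy  # a missing date passes only in fuzzy mode
    return abs(value - query) <= (window if fuzzy else 0)


def place_ok(query: str, value: Optional[str], fuzzy: bool) -> bool:
    if value is None:
        return fuzzy  # a missing place passes only in fuzzy mode
    return value == query


def funnel(records, first, last, year, place, fuzzy=False):
    """Return the candidate-set size after the name, date, and place filters."""
    by_name = [r for r in records
               if r.last == last and first_name_ok(first, r.first, fuzzy)]
    by_year = [r for r in by_name if year_ok(year, r.birth_year, fuzzy)]
    by_place = [r for r in by_year if place_ok(place, r.birth_place, fuzzy)]
    return [len(by_name), len(by_year), len(by_place)]


records = [
    Record("John", "Smith", 1870, "Boston, MA"),   # exact at every stage
    Record("John", "Smith", 1871, "Boston, MA"),   # year off by one
    Record("J.", "Smith", 1870, None),             # initial, missing place
    Record("John", "Smith", 1902, "Boston, MA"),   # wrong year
    Record("Mary", "Smith", 1870, "Boston, MA"),   # different first name
    Record("John", "Smith", None, "Chicago, IL"),  # missing year, other place
]

exact_sizes = funnel(records, "John", "Smith", 1870, "Boston, MA")
fuzzy_sizes = funnel(records, "John", "Smith", 1870, "Boston, MA", fuzzy=True)
```

On this toy data the exact funnel narrows to a single record, while the fuzzy funnel keeps three, mirroring the precision versus recall trade-off on the slides.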
Results set
Names edit distance
Extended dates
Missing fields
Short names
initials
Exact match
Hints suggestion system
• User feedback loop:
– Accept suggestion
– Reject suggestion
• Supervised machine learning
• Learn similarity measure (how to combine identifiers)
• Training & testing sets:
– User accepts, rejects
• Features (> 500):
– First last name, DOB, POB, DOD, POD
– Parents, children, siblings, spouses
– Fuzzy matches
• Similar to “learning to rank” problem
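As a rough illustration of learning a similarity measure from accept/reject feedback, the sketch below fits weights over a handful of match features and then ranks candidate records by predicted acceptance. Note the substitution: the deck's pipeline uses a random forest over more than 500 features, whereas this dependency-free sketch swaps in a hand-rolled logistic regression over four hypothetical binary features (name, birth year, birth place, spouse). The training rows are invented toy data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=1000, lr=0.5):
    """Stochastic gradient descent on the logistic loss.
    Returns (weights, bias): the learned way to combine identifiers."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the loss w.r.t. the linear score
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def score(model, x) -> float:
    """Predicted probability that a user would accept this hint."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def rank(model, candidates):
    """Sort (record_id, features) pairs from best to worst match."""
    return sorted(candidates, key=lambda c: score(model, c[1]), reverse=True)


# Hypothetical features: [name, birth year, birth place, spouse] agree (1) or not (0).
train_x = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 1],  # user accepted
    [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0],  # user rejected
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

model = train(train_x, train_y)
ranked = rank(model, [("rec_a", [1, 0, 0, 0]),
                      ("rec_b", [1, 1, 1, 1]),
                      ("rec_c", [0, 1, 0, 0])])
```

The ranking step is what makes this resemble a learning-to-rank setup: the model is trained on individual accept/reject labels, but it is used to order a candidate set.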
A place for machine learning
ML suggest
Candidate k-set
Person Record ?
Similarity measure learning
[Diagram. Training: Ancestry collections, member trees, feature generation (person ID, record ID, label), ML random forest, model. Scoring: person ID, index, top-k records candidate set, feature generation, ranked list. Platform: Hadoop, Hive.]
Large scale machine learning
[Diagram: data in Hadoop HDFS, Hadoop streaming, four parallel Random forest (R) workers, model]
Big Data – Big Picture
Family tree
• User generated family trees:
– 45 mln family trees
– 4.9 bln profiles
Family tree as a graph (DAG)
2020 nodes 572 marriage edges 2910 family edges
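A family tree of this kind can be modelled as a graph with two edge types: directed family (parent to child) edges, which form a DAG, and undirected marriage edges. The sketch below is an assumed representation, not Ancestry's data model; the class and method names are hypothetical. It also shows how a generation count falls out as the longest parent-to-child chain.

```python
from collections import defaultdict


class FamilyTree:
    def __init__(self):
        self.people = set()
        self.children = defaultdict(set)  # parent -> children (directed family edges)
        self.marriages = set()            # unordered pairs (marriage edges)

    def add_family_edge(self, parent: str, child: str) -> None:
        self.people.update((parent, child))
        self.children[parent].add(child)

    def add_marriage(self, a: str, b: str) -> None:
        self.people.update((a, b))
        self.marriages.add(frozenset((a, b)))

    def family_edge_count(self) -> int:
        return sum(len(kids) for kids in self.children.values())

    def generations(self) -> int:
        """Number of people on the longest parent-to-child chain.
        Assumes the family edges are acyclic, as in a DAG."""
        memo = {}

        def depth(person: str) -> int:
            if person not in memo:
                kids = self.children.get(person, ())
                memo[person] = 1 + max((depth(k) for k in kids), default=0)
            return memo[person]

        return max((depth(p) for p in self.people), default=0)


tree = FamilyTree()
tree.add_marriage("Ann", "Bob")
tree.add_family_edge("Ann", "Carl")
tree.add_family_edge("Bob", "Carl")
tree.add_marriage("Carl", "Dina")
tree.add_family_edge("Carl", "Eve")
tree.add_family_edge("Dina", "Eve")

summary = (len(tree.people), len(tree.marriages),
           tree.family_edge_count(), tree.generations())
```

For this toy tree the summary is 5 nodes, 2 marriage edges, 4 family edges, and 3 generations, the same kind of counts reported on the slide.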
Family trees
Family trees statistics
“Power law” distribution, 44 mln trees
History from family trees
500 nodes 700 edges
55 generations
time
Historical immigration to the US
• Immigration is the movement of people into a country or region to which they are not native in order to settle there
• Immigrants are those who were born outside the US and died in the US
• Based on family tree profiles:
– Birth/death dates range 1500-1990
– Select only complete profiles with FLN, POB, DOB, POD, DOD
– Perform de-duplication, remove same ancestors from different family trees
– Select only those with POB != US, POD == US
• 15 mln profiles (0.3% of 4.9 bln profiles)
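The selection steps above can be sketched as a single pass over profile records. The field names, the use of plain years and country codes, and de-duplication by an exact key are assumptions made for illustration; the slides do not describe how duplicates across trees are actually detected.

```python
REQUIRED = ("name", "pob", "dob", "pod", "dod")  # FLN, POB, DOB, POD, DOD


def is_complete(profile: dict) -> bool:
    return all(profile.get(field) is not None for field in REQUIRED)


def select_immigrants(profiles, country="US", first_year=1500, last_year=1990):
    """Keep complete, in-range, de-duplicated profiles of people
    born outside `country` who died in `country`."""
    seen = set()
    selected = []
    for p in profiles:
        if not is_complete(p):
            continue
        if p["dob"] < first_year or p["dod"] > last_year:
            continue
        if p["pob"] == country or p["pod"] != country:
            continue
        # Hypothetical de-duplication key: same person copied into several trees.
        key = (p["name"].lower(), p["pob"], p["dob"], p["pod"], p["dod"])
        if key in seen:
            continue
        seen.add(key)
        selected.append(p)
    return selected


profiles = [
    {"name": "Anna Berg", "pob": "SE", "dob": 1850, "pod": "US", "dod": 1920},
    {"name": "anna berg", "pob": "SE", "dob": 1850, "pod": "US", "dod": 1920},  # duplicate
    {"name": "Tom Lee", "pob": "US", "dob": 1860, "pod": "US", "dod": 1930},    # born in US
    {"name": "Ida Rossi", "pob": "IT", "dob": 1880, "pod": None, "dod": 1950},  # incomplete
    {"name": "Li Wei", "pob": "CN", "dob": 1970, "pod": "US", "dod": 2005},     # out of range
    {"name": "Sean Kelly", "pob": "IE", "dob": 1830, "pod": "US", "dod": 1899},
]

immigrants = [p["name"] for p in select_immigrants(profiles)]
```

Each filter discards a large share of profiles, which is consistent with only about 0.3% of all profiles surviving the full selection.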
Immigration to the USA 1500-1990
Immigration map
Ports of arrival (1800-1980)
Data Science
• Ancestry is building a data science team
• We work on product data and BI
• We are hiring
• Special thanks to Mercator Group for infographics