berkeley dataproduct talk
Data Products Deep Dive
Pete Skomoroch @peteskomoroch 3/31/14 Berkeley CS194-16: Intro to Data Science
Some Background
• Physics/Math BS Undergrad
• Analyst / Software Engineer @ ProfitLogic - 3.5 years
• Biodefense Engineer / ML Student @ MIT - 3.5 years
• Sr. Research Engineer @ AOL Search - 1 year
• Director @ Juice Analytics - 1 year
• Consulting @ Cloudera, Amazon etc. - 1 year
• Principal Data Scientist @ LinkedIn - 4 years
Four types of data scientist (at least)
source: "Analyzing the Analyzers" O'Reilly Media
Data Scientists create data products
The data product process
• Verify you are solving the right problem
• Theory + model design
• Measurement: data collection and cleaning
• Feature engineering & model development
• Error analysis and investigation
• Iterate and improve each step in the process
• Leverage derived data to build new products
Data factories & flywheels
Source: http://www.linkedin.com/channels/disrupt2013
Steve Jennings/Getty Images Entertainment
Data Product Example: LinkedIn Skills
• Skill Extraction and Standardization Pipeline
• Skill Pages
• Skills Section on Member Profiles
• Suggested Skills Algorithm and Email
• Skill Endorsements
Skill Discovery: Unsupervised Topics from Profile Specialties Section
Extract
Topic Clustering & Phrase Sense Disambiguation
Deduplication Signals from Mechanical Turk
Sample Task for Mechanical Turk Workers
Mechanical Turk Standardization
Skill Phrase Deduplication
Lead designer and engineer for the implementation of a user-centric, fully-configurable UI for data aggregation and reporting. Developed over 20 SaaS custom applications using Python, Javascript and RoR.
Tagging Skill Phrases
• Tagging: Extract potential skill phrases from text
• Standardize unambiguous phrase variants
JavaScript RoR SaaS Python
ror rubyonrails ruby on rails development ruby rails ruby on rail
Ruby on Rails
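The variant list above collapses several surface forms into one canonical skill. A minimal sketch of that standardization step follows, assuming a simple lookup table; the table contents, the way the variant list is split into phrases, and the function names normalize and standardize are illustrative assumptions, not the pipeline LinkedIn actually used.

```python
import re

# Hypothetical variant table: surface forms seen in profiles -> canonical skill.
# The Ruby on Rails entries follow one reading of the slide's variant list;
# a production table would be far larger and built from data.
VARIANTS = {
    "ror": "Ruby on Rails",
    "rubyonrails": "Ruby on Rails",
    "ruby on rails": "Ruby on Rails",
    "ruby on rails development": "Ruby on Rails",
    "ruby rails": "Ruby on Rails",
    "ruby on rail": "Ruby on Rails",
    "javascript": "JavaScript",
    "saas": "SaaS",
    "python": "Python",
}


def normalize(phrase):
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return re.sub(r"\s+", " ", phrase.strip().lower())


def standardize(phrase):
    """Return the canonical skill name, or None if the phrase is unknown."""
    return VARIANTS.get(normalize(phrase))
```

Calling standardize("RoR") or standardize("ruby on rail") returns "Ruby on Rails", while an unknown phrase returns None. Ambiguous phrases would need the disambiguation step named earlier in the deck, which this sketch does not attempt.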
Document (ex: Profile)
Tokenization
Skills Tagger
Phrases (up to 6 words)
Skills Classifier
Skills (unordered)
Skills (ranked by relevance)
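The flow above (document, tokenization, tagging of phrases up to 6 words, ranked skills) can be sketched roughly as follows. This is a toy under stated assumptions: the SKILLS dictionary, the greedy longest-match strategy, and ranking by mention count are all stand-ins chosen for illustration, and the Skills Classifier stage that filters and disambiguates candidates is omitted.

```python
import re

# Hypothetical dictionary of known skill phrases (normalized) -> canonical name.
SKILLS = {
    "python": "Python",
    "javascript": "JavaScript",
    "ror": "Ruby on Rails",
    "ruby on rails": "Ruby on Rails",
    "saas": "SaaS",
    "data aggregation": "Data Aggregation",
}
MAX_PHRASE_WORDS = 6  # the slide caps candidate phrases at 6 words


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def tag_skills(text):
    """Greedy longest-match tagging over n-grams of up to MAX_PHRASE_WORDS tokens."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_PHRASE_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SKILLS:
                found.append(SKILLS[phrase])
                i += n
                break
        else:
            i += 1  # no phrase starts here; move on one token
    return found


def rank_skills(skills):
    """Rank by mention count (ties broken alphabetically); a stand-in for relevance."""
    counts = {}
    for skill in skills:
        counts[skill] = counts.get(skill, 0) + 1
    return sorted(counts, key=lambda s: (-counts[s], s))
```

For example, rank_skills(tag_skills(profile_text)) yields an ordered list of canonical skills for a profile summary. A real relevance score would presumably use more signal than raw counts.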
Skills Related to “Big Data”
Skills Correlated with the Job Title “Data Scientist”
SkillRank: Algorithm for Top People
How do we get more people into the skill graphs?
Suggested Skills Inference
• How suggested/inferred skills work:
– The skill likelihood is a conditional model
– Probabilities are combined using a Naïve Bayes Classifier
• If you are an engineer at Apple, you probably know about iPhone Development.
Profile
Extract attributes
- Company ID
- Title ID
- Groups ID
- Industry ID
- …
Skills Classifier
Skills (ranked by likelihood)
Feature Vectors
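The inference flow above (profile attributes in, skills ranked by likelihood out) can be illustrated with a small Naive Bayes sketch. Everything here is an assumption for illustration: the class name SkillNaiveBayes, the attribute keys, the Laplace smoothing, and the toy training data. It scores each skill by log P(skill) plus the sum of log P(attribute | skill), treating attributes as conditionally independent given the skill.

```python
import math
from collections import defaultdict


class SkillNaiveBayes:
    """Naive Bayes over categorical profile attributes, one score per skill."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant
        self.n_profiles = 0
        self.skill_counts = defaultdict(int)  # skill -> number of profiles with it
        # skill -> (attribute, value) -> number of profiles with both
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def fit(self, profiles):
        """profiles: iterable of (attributes dict, set of skills) pairs."""
        for attrs, skills in profiles:
            self.n_profiles += 1
            for skill in skills:
                self.skill_counts[skill] += 1
                for feat in attrs.items():
                    self.feat_counts[skill][feat] += 1

    def rank(self, attrs):
        """Return skills sorted by descending Naive Bayes log score."""
        scores = {}
        for skill, n in self.skill_counts.items():
            score = math.log(n / self.n_profiles)  # log prior P(skill)
            for feat in attrs.items():
                # Smoothed estimate of P(attribute value present | skill).
                num = self.feat_counts[skill].get(feat, 0) + self.alpha
                den = n + 2 * self.alpha
                score += math.log(num / den)
            scores[skill] = score
        return sorted(scores, key=scores.get, reverse=True)
```

Trained on a handful of made-up profiles where Apple engineers list iPhone Development, rank({"company": "Apple", "title": "Engineer"}) puts that skill first, matching the intuition in the slide. Because the evidence term P(attributes) is the same for every skill, ranking by this score is equivalent to ranking by the posterior under the independence assumption.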
Skill Recommendations for Your LinkedIn Profile
49% Conversion
4% Conversion
Reputation: Build Endorsements Product to Collect More Graph Edges
PYMK + Suggested Skills
Viral Growth: 1 Billion Endorsements in 5 Months
Social Viral Tagging = Lots of Data
Suggested endorsements
Skill recommendations
Skill marketing
Virality only
How Did We Gather this Data?
1. Desire + Social Proof
2. Viral Loops + Network Effects
3. Data Foundation + Recommendation Algorithms
Recap: Data Product Evolution
• Skill Extraction and Standardization Pipeline
• Skill Pages
• Skills Section on Member Profiles
• Suggested Skills Algorithm and Email > 20M members
• Skill Endorsements > 60M members, 3B+ Edges
• Big product wins in engagement, recall, relevance
• SkillRank & Reputation integration…
• Sets stage for next generation of products
Questions?
@peteskomoroch
http://datawrangling.com
http://www.linkedin.com/in/peterskomoroch