Scale, Structure, and Semantics
DESCRIPTION
Keynote at 2012 Semantic Technology and Business Conference
Scale, Structure, and Semantics
Daniel Tunkelang, LinkedIn

Science fiction has a mixed track record when it comes to anticipating technological innovations. While Jules Verne fared well with his predictions of submarine and space technology, artificial intelligence hasn't produced anything like Arthur C. Clarke's HAL 9000. Instead, we've managed to elicit intelligence from machines through unexpected means. Search engines have achieved remarkable success in organizing the world's information by crawling the web, indexing documents, and exploiting link structure to establish authoritativeness. At LinkedIn, we apply large-scale analytics to terabytes of semistructured data to deliver products and insights that serve our 150M+ members. Semantics emerge when we apply the right analytical techniques to a sufficient quality and quantity of data.

In this talk, I will describe how LinkedIn's huge and rich graph of relationship data powers the products our users love. I believe that the lessons we have learned apply broadly to other semantic applications. While quantity and quality of data are the key challenges to delivering a semantically rich experience, the key is to create the right ecosystem that incents people to give you good data, which then forms the basis for great data products.

TRANSCRIPT
Scale, Structure, and Semantics Daniel Tunkelang Principal Data Scientist at LinkedIn
1
Take-Aways
2
Communication trumps knowledge representation.
Communication is the problem and the solution.
Overview
1. Knowledge representation is overrated.
2. Computation is underrated.
3. We have a communication problem.
3
The Bad News
1. Knowledge representation is overrated.
2. Computation is underrated.
3. We have a communication problem.
4
AI: a dream deferred.
5
Memex: the Computer Science Version
6
Cyc
7
Freebase
8
Wolfram Alpha
9
Knowledge representation is overrated.
Today’s knowledge repositories are:
§ incomplete
§ inconsistent
§ inscrutable
§ and not sustained by economic incentives.
1986 estimate of effort to complete Cyc:
§ 250,000 rules + 350 person-years
10
The Good News
1. Knowledge representation is overrated.
2. Computation is underrated.
3. We have a communication problem.
11
Deep Blue
12
vs.
Watson
13
Plain Old Search Engines are Pretty Good Too
14
http://blog.stephenwolfram.com/2011/01/jeopardy-ibm-and-wolframalpha/
The Unreasonable Effectiveness of Data
§ simple models + lots of data >> elaborate models + less data
§ machine translation: parallel corpora >> elaborate rules for syntactic and semantic patterns
§ semantic web formalism just means semantic interpretation on shorter strings between angle brackets
Alon Halevy, Peter Norvig, and Fernando Pereira (2009)
15
Today’s Challenge
1. Knowledge representation is overrated.
2. Computation is underrated.
3. We have a communication problem.
16
Semi-structured Data
17
Michael K. Bergman, http://www.mkbergman.com/
Semi-structured Data at LinkedIn
<person>
  <id />
  <first-name />
  <last-name />
  <location>
    <name />
    <country>
      <code />
    </country>
  </location>
  <industry />
  …
</person>
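As a rough illustration of what "semi-structured" means for a record like the one above, here is a minimal Python sketch that reads the structured fields by path while treating free text as an opaque string. The sample values, the `<summary>` element, and the `parse_person` helper are hypothetical and are not taken from the talk or from any LinkedIn API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record: element names follow the slide,
# the values and the <summary> element are invented for illustration.
PROFILE_XML = """
<person>
  <id>abc123</id>
  <first-name>Ada</first-name>
  <last-name>Example</last-name>
  <location>
    <name>San Francisco Bay Area</name>
    <country><code>us</code></country>
  </location>
  <industry>Internet</industry>
  <summary>I lead a data science team.</summary>
</person>
"""


def parse_person(xml_text):
    """Flatten a <person> record into a dict of structured fields plus free text."""
    root = ET.fromstring(xml_text.strip())
    return {
        # Structured fields: usable for exact matching and faceting.
        "id": root.findtext("id"),
        "first_name": root.findtext("first-name"),
        "last_name": root.findtext("last-name"),
        "location": root.findtext("location/name"),
        "country_code": root.findtext("location/country/code"),
        "industry": root.findtext("industry"),
        # Unstructured field: free text that needs analysis to interpret.
        "summary": root.findtext("summary", default=""),
    }


person = parse_person(PROFILE_XML)
```

The structured fields can drive filters and facets directly, while the free-text field is where similarity and other text analysis would apply.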
Summary
I lead a data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn’s members. Prior to LinkedIn, I led a local search quality team at Google and was a founding employee of faceted search pioneer Endeca (acquired by Oracle in 2010), where…
Semi-structured Search is a Killer App
19
Another Example: Helping a Friend
Dear Daniel, I'm attaching the resume of an old friend who just moved up to the Bay Area.
He has a very strong background in:
§ mobile / wireless applications
§ start-ups and new product launches
§ international expansion
Best regards, XXX
20
Company Search
21
Semi-structured Data Empowers Users
22
Data-Driven Recommendations
23
Data-Driven Computation Serves Communication
24
for i in [1..n]
    B[i] ← {}
    s ← w1 w2 … wi
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
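The segmentation pseudocode above can be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not the production implementation: `pc` is assumed to be a plain dict from a candidate segment to its probability (the slide's Pc), the names `Segmentation` and `segment` are hypothetical, and each new segment is taken to start at word j+1 so that it does not overlap the segmentation of the first j words being extended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segmentation:
    segs: tuple   # segments chosen so far, in order
    prob: float   # product of the segment probabilities


def segment(words, pc, k=3):
    """Return up to k most probable segmentations of `words`.

    `pc` maps a space-joined segment to a probability; segments
    missing from `pc` are treated as having probability 0.
    """
    n = len(words)
    # beams[i] holds the best segmentations of the first i words.
    beams = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        candidates = []
        # Case 1: the first i words form one segment.
        s = " ".join(words[:i])
        p = pc.get(s, 0.0)
        if p > 0:
            candidates.append(Segmentation((s,), p))
        # Case 2: extend a segmentation of the first j words
        # with a new segment covering words j+1..i.
        for j in range(1, i):
            s = " ".join(words[j:i])
            p = pc.get(s, 0.0)
            if p > 0:
                for b in beams[j]:
                    candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        # Keep only the k most probable segmentations (the beam).
        candidates.sort(key=lambda a: a.prob, reverse=True)
        beams[i] = candidates[:k]
    return beams[n]


# Toy probabilities, invented for illustration.
toy_pc = {"new": 0.1, "york": 0.1, "new york": 0.5, "jobs": 0.4, "york jobs": 0.01}
best = segment(["new", "york", "jobs"], toy_pc, k=3)
```

With the toy table, the top-ranked segmentation groups "new york" as one segment followed by "jobs". Truncating each beam to size k keeps the work bounded at the cost of possibly discarding a prefix that would have led to the best overall segmentation.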
Recommendations Leverage Semi-structured Data
25
Corpus Stats
User Base, Filtered
Job: title, geo, company, industry, description, functional area, …
Candidate
§ General: expertise, specialties, education, headline, geo, experience
§ Current Position: title, summary, tenure length, industry, functional area, …
Similarity (candidate expertise, job description): 0.56
Similarity (candidate specialties, job description): 0.2
Transition probability (candidate industry, job industry): 0.43
Title Similarity: 0.8
Similarity (headline, title): 0.7
. . .
Matching
§ Binary: exact matches (geo, industry, …)
§ Soft: transition probabilities, similarity, …
§ Text
Derived
§ transition probabilities
§ connectivity
§ yrs of experience to reach title
§ education needed for this title
§ …
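To make the matching features on this slide concrete, here is a minimal Python sketch that computes a few binary and soft features for a candidate and a job. The slide does not say how similarity is computed; bag-of-words cosine similarity is an assumption made here for illustration, and the field names and the `match_features` helper are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def match_features(candidate, job):
    """Build a feature dict mixing binary (exact) and soft (similarity) matches."""
    return {
        # Binary features: exact matches on structured fields.
        "geo_exact": float(candidate["geo"] == job["geo"]),
        "industry_exact": float(candidate["industry"] == job["industry"]),
        # Soft features: text similarity between field pairs.
        "sim_expertise_description": cosine_similarity(
            candidate["expertise"], job["description"]
        ),
        "sim_headline_title": cosine_similarity(candidate["headline"], job["title"]),
    }


# Invented example records.
candidate = {
    "geo": "us:84",
    "industry": "Internet",
    "expertise": "search data mining",
    "headline": "Data Scientist",
}
job = {
    "geo": "us:84",
    "industry": "Software",
    "description": "data mining and search",
    "title": "Senior Data Scientist",
}
features = match_features(candidate, job)
```

A feature dictionary like this would then feed whatever model ranks jobs for a candidate; transition probabilities (for example, candidate industry to job industry) would need corpus statistics and are left out of the sketch.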
Skills: A Practical Knowledge Representation
26
Data-Driven Query Expansion for Recall
27
Data-Driven Query Refinement for Precision
28
There is no perfect schema or vocabulary.
§ And even if there were, not everyone would use it.
§ Knowledge representation has only succeeded within narrow scope.
§ Brute force is surprisingly effective but does not leverage the user as an intelligent partner.
29
Communication is the problem and the solution.
§ Rich communication channel fills gaps in system’s knowledge representation and in user’s knowledge.
§ Use data science to make the system smart, but be humble and empower the human user.
You've got the brawn
I've got the brains
Let's make lots of money
Pet Shop Boys, “Opportunities”
30
The Future is Upon Us
31
One More Thing
“More data beats clever algorithms but better data beats more data.”
Monica Rogati @ Strata 2012
32