On Beyond Hypertext: Searching in Graphs Containing
Documents, Words, and Data
William W. Cohen
Center for Automated Learning and Discovery + Language Technology Institute
+ Center for Bioimage Informatics + Joint CMU-Pitt Program in Bioinformatics
Carnegie Mellon University
On Beyond Hypertext: Searching in Graphs Containing
Documents, Words, and Data
William W. Cohen
Machine Learning Department + Language Technology Institute
+ Center for Bioimage Informatics + Joint CMU-Pitt Program in Bioinformatics
Carnegie Mellon University
On Beyond Hypertext: Searching in Graphs Containing
Documents, Words, and Data
William W. Cohen
Carnegie Mellon University
joint work with: Einat Minkov (CMU), Andrew Ng (Stanford)
Outline
• Motivation: why I’m interested in
  – structured data that is partly text;
  – structured data represented as graphs;
  – measuring similarity of nodes in graphs
• Contributions:
  – a simple query language for graphs;
  – experiments on natural types of queries;
  – techniques for learning to answer queries of a certain type better
“A Little Knowledge is A Dangerous Thing” [A. Pope, 1709]
• Three centuries later, we’ve learned that a lot of knowledge is also sort of dangerous....
... so how do we deal with information overload?
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
One approach: adding structure to unstructured information
... by recognizing entity names ... and relationships between them (IE):

| NAME | TITLE | ORGANIZATION |
|---|---|---|
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | founder | Free Soft.. |
One approach: adding structure to unstructured information
[Carvalho, Cohen SIGIR05; Cohen, Carvalho, Mitchell EMNLP 04]
One approach: adding structure to unstructured information
[Mitchell et al CEAS 2004]
One approach: adding structure to unstructured information
One approach: adding structure to unstructured information
[McCallum et al IJCAI05]
Is converting unstructured data to structured data enough?
Limitations of structured data
• Diversity: many different types of information, from many different sources, arising to fill many different needs.
• Uncertainty: information from many sources (like IE programs or the web) need not be correct.
• Complexity of interaction: formulating ‘information needs’ as queries to a DB can be difficult... especially a heterogeneous DB, with a complex/changing schema.
– What is the email address for the person named “Halevy” mentioned in this presentation?
– What files from my home machine will I need for this meeting?
– What people will attend this meeting?
– ...

How can you include many diverse sources of information in a single database?
How do you discover & access the tens or hundreds of structured databases?
How do you understand & combine the hundreds of schemata, with thousands of fields?
How do you relate the thousands or millions or ... of entity identifiers from the different databases?
When are two entities the same? When is referent(oid1) = referent(oid2)?

• Bell Labs
• Bell Telephone Labs
• AT&T Bell Labs
• AT&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations
History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com]
Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]
(figure) Pairwise judgments over the names above: some pairs denote the same entity (=), others do not (≠).

Is there a definition of ‘entity identity’ that is user- and purpose-independent?
“Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)…
King Milinda and Nagasena (the Buddhist sage) discuss … personal identity… Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: … not … the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ..." etc… Milinda concludes that "Nagasena" doesn't stand for anything… If we can't say what a person is, then how do we know a person is the same person through time? …
There's really no you, and if there's no you, there are no beliefs or desires for you to have… The folk psychology picture is profoundly misleading and believing it will make you miserable.” -S. LaFave
When are two entities the same?
Linkage Queries

Traditional approach: uncertainty about what to link must be decided by the integration system, not the end user.
WHIRL vision: link items as needed by query Q

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a=S.a and S.b=T.b

| R.a | S.a | S.b | T.b | |
|---|---|---|---|---|
| Anhai | Anhai | Doan | Doan | strongest links: those agreeable to most users |
| Dan | Dan | Weld | Weld | |
| William | Will | Cohen | Cohn | weaker links: those agreeable to some users |
| Steve | Steven | Minton | Mitton | |
| William | David | Cohen | Cohn | even weaker links… |
WHIRL vision: link items as needed by query Q

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b  (~ means TFIDF-similar)

Incrementally produce a ranked list of possible links, with “best matches” first. The user (or downstream process) decides how much of the list to generate and examine (a sketch follows below).

| R.a | S.a | S.b | T.b |
|---|---|---|---|
| Anhai | Anhai | Doan | Doan |
| Dan | Dan | Weld | Weld |
| William | Will | Cohen | Cohn |
| Steve | Steven | Minton | Mitton |
| William | David | Cohen | Cohn |

DB1 + DB2 ≠ DB
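To make the soft-join idea concrete, here is a minimal sketch (not WHIRL’s actual implementation; the tokenization, weighting details, and example data are assumptions): rank candidate links by TFIDF/cosine similarity and emit the best matches first.

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """Unit-length TFIDF vectors for a list of strings (whitespace tokens)."""
    docs = [s.lower().split() for s in strings]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({t: x / norm for t, x in v.items()})
    return vecs

def soft_join(r_keys, s_keys):
    """Rank all (i, j) pairs by cosine similarity: best matches first; the
    caller decides how far down the ranked list to read."""
    rv, sv = tfidf_vectors(r_keys), tfidf_vectors(s_keys)
    scored = [(i, j, sum(w * v2.get(t, 0.0) for t, w in v1.items()))
              for i, v1 in enumerate(rv) for j, v2 in enumerate(sv)]
    return sorted((p for p in scored if p[2] > 0), key=lambda p: -p[2])

# e.g. soft_join(["Anhai Doan", "Dan Weld"], ["A. Doan", "Daniel Weld"])
```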
Outline
• Motivation: why I’m interested in
  – structured data that is partly text: similarity!
  – structured data represented as graphs;
  – measuring similarity of nodes in graphs
• Contributions:
  – a simple query language for graphs;
  – experiments on natural types of queries;
  – techniques for learning to answer queries of a certain type better

There are general-purpose, fast, robust similarity measures for text, which are useful for data integration... and hence for combining information from multiple sources.
Limitations of structured data
• Diversity: many different types of information, from many different sources, arising to fill many different needs.
• Uncertainty: information from many sources (like IE programs or the web) need not be correct.
• Complexity of interaction: formulating ‘information needs’ as queries to a DB can be difficult...especially a heterogeneous one.
– What is the email address for the person named “Halevy” mentioned in this presentation?
– What files from my home machine will I need for this meeting?
– What people will attend this meeting?
– ...
How can you exploit structure without understanding the structure?
Schema-free structured search
• DataSpot (DTL) / Mercado Intuifind [VLDB 1998]
• Proximity Search [VLDB 1998]
• Information units (linked Web pages) [WWW10]
• Microsoft DBExplorer; Microsoft English Query
• BANKS (Browsing ANd Keyword Search) [Chakrabarti and others, VLDB 2002, VLDB 2005]
BANKS: Basic Data Model
• Database is modeled as a graph
  – Nodes = tuples
  – Edges = references between tuples
    • edges are directed
    • foreign keys, inclusion dependencies, ...

(figure) Example subgraph: a paper node “MultiQuery Optimization” with writes/author edges linking it to author nodes S. Sudarshan, Prasan Roy, and Charuta.

BANKS: Keyword search…
The user need not know the organization of the database to formulate queries.
BANKS: Answer to Query
Query: “sudarshan roy” Answer: subtree from graph
(figure) The answer subtree: the paper “MultiQuery Optimization” joined to S. Sudarshan and Prasan Roy through writes/author edges.
BANKS: Basic Data Model (not quite so basic)

• All information (not just database tuples) is modeled as a graph (sketched below)
  – Nodes = tuples, or documents, or strings, or words
  – Edges = relations between nodes
    • edges are directed, labeled, and weighted
    • foreign keys, inclusion dependencies, ...
    • doc/string D to each word contained by D (TFIDF-weighted, perhaps)
    • word W to each doc/string containing W (an inverted index)
    • [string S to strings ‘similar to’ S]
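One minimal way to realize such a graph (illustrative only; GHIRL’s real representation, and the edge weights, are omitted here): typed nodes plus labeled directed edges, with inverse edges added so walks can run both ways, and doc-to-word edges created automatically.

```python
import re
from collections import defaultdict

class TypedGraph:
    """Typed nodes; directed, labeled edges. A node id is a (type, name) pair.
    self.edges[x][label] is the list of nodes reachable from x by that label."""
    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_edge(self, src, label, dst, inverse=True):
        self.edges[src][label].append(dst)
        if inverse:  # keep walks reversible, like the *_inv edges on later slides
            self.edges[dst][label + "_inv"].append(src)

    def add_text(self, doc_id, text):
        """Doc/string -> word edges, plus the inverted-index edges back."""
        for w in set(re.findall(r"\w+", text.lower())):
            self.add_edge(("doc", doc_id), "has_term", ("term", w))

g = TypedGraph()
g.add_text("d1", "William W. Cohen, CMU")
g.add_text("d2", "Dr. W. W. Cohen")   # d1, d2 now share the term nodes w, cohen
```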
Outline
• Motivation: why I’m interested in
  – structured data that is partly text – similarity!
  – structured data represented as graphs; all sorts of information can be poured into this model
  – measuring similarity of nodes in graphs
• Contributions:
  – a simple query language for graphs;
  – experiments on natural types of queries;
  – techniques for learning to answer queries of a certain type better
Yet another schema-free query language
• Assume data is encoded in a graph with:
  – a node for each object x
  – a type T(x) for each object x
  – an edge for each binary relation r: x → y
• Queries are of this form:
  – Given type t* and node x, find y : T(y)=t* and y~x
• Node similarity: we’d like to construct a general-purpose similarity function x~y for objects in the graph
• We’d also like to learn many such functions for different specific tasks (like “who should attend a meeting?”)
Similarity of Nodes in Graphs
Given type t* and node x, find y:T(y)=t* and y~x.
• Similarity defined by a “damped” version of PageRank
• Similarity between nodes x and y, via a “random surfer model”: from a node z,
  – with probability α, stop and “output” z
  – otherwise, pick an edge label r using Pr(r | z) ... e.g., uniform
  – pick y uniformly from { y’ : z → y’ with label r }
  – repeat from node y ...
• Similarity x~y = Pr( “output” y | start at x )
• Intuitively, x~y is a summation of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length. (A sketch follows below.)
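A small power-iteration sketch of this damped walk (illustrative; it truncates at max_steps, and the edge-probability details are assumptions). Starting from a distribution over nodes rather than a single node x also covers the Vq generalization used later; graphs use the node → {label → [destinations]} layout from the sketch above.

```python
from collections import defaultdict

def walk_similarity(graph, start_dist, alpha=0.5, max_steps=10):
    """Return sim[y] ~ Pr(output y | start ~ start_dist) for a damped walk.

    graph: dict node -> dict label -> list of destination nodes.
    start_dist: dict node -> probability (a single node x is {x: 1.0}).
    """
    cur = dict(start_dist)
    out = defaultdict(float)
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for z, p in cur.items():
            out[z] += alpha * p                    # stop and "output" z
            labels = graph.get(z, {})
            if not labels:
                continue
            pl = (1 - alpha) * p / len(labels)     # pick a label r uniformly
            for dests in labels.values():
                for y in dests:                    # pick y uniformly among r-edges
                    nxt[y] += pl / len(dests)
        cur = nxt                                  # mass left after max_steps is dropped
    return sorted(out.items(), key=lambda kv: -kv[1])
```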
The optional [string S to strings ‘similar to’ S] edges can be dropped: strings that are similar in TFIDF/cosine distance will already be “nearby” in the graph, connected by many length-2 paths through shared words.

(figure) Example: “William W. Cohen, CMU” and “Dr. W. W. Cohen” are linked through the shared word nodes cohen and w (while dr, cmu, and william are unshared).
Similarity of Nodes in Graphs
• Random surfer on graphs:
  – a natural extension of PageRank
  – closely related to Lafferty’s heat-diffusion kernel, but generalized to directed graphs
  – somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics):
    • Toutanova, Manning & Ng, ICML 2004
    • Nie et al., WWW 2005
    • Xi et al., SIGIR 2005
  – can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g., Lewis & E. Cohen, SODA 1998), similar to particle filtering; see the sketch below
  – our current implementation (GHIRL): Lucene + Sleepycat, with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
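A Monte-Carlo version in the spirit of those sampling approaches (a sketch, not the GHIRL implementation): simulate many independent surfers and count where they stop.

```python
import random
from collections import Counter

def sampled_walk(graph, x, alpha=0.5, max_steps=10, n_walkers=10000, seed=0):
    """Monte-Carlo estimate of the damped-walk similarity from node x.
    Walkers that reach max_steps or a dead end are forced to stop there."""
    rng = random.Random(seed)
    stops = Counter()
    for _ in range(n_walkers):
        z = x
        for _ in range(max_steps):
            if rng.random() < alpha:
                break                              # stop and "output" z
            labels = graph.get(z)
            if not labels:
                break
            r = rng.choice(list(labels))           # uniform edge label
            z = rng.choice(labels[r])              # uniform destination
        stops[z] += 1
    return [(y, c / n_walkers) for y, c in stops.most_common()]
```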
Query: “sudarshan roy” Answer: subtree from graph
As a graph query: { y : paper(y) & y ~ “sudarshan” }  AND  { w : paper(w) & w ~ “roy” }
Evaluation on Personal Information Management Tasks
Such as:
• Person Name Disambiguation in Email (novel) [cf. Diehl, Getoor & Namata, 2006]
• Threading [cf. Lewis & Knowles, 1997]
• Finding email-address aliases given a person’s name (novel)
• Finding relevant meeting attendees (novel)

Many tasks can be expressed as simple, non-conjunctive search queries in this framework.
Also consider a generalization: replace the single start node x with Vq, a distribution over nodes [Minkov et al., SIGIR 2006].
Email as a graph

(figure) Nodes: files (messages), email addresses, person names, dates, and terms. Edges: sent_from, sent_to, sent_date, in_file, in_subj (each with an inverse: sf_inv, st_inv, sd_inv, if_inv, is_inv), alias / a_inv between person names and email addresses, and +1_day between consecutive date nodes.
Person Name Disambiguation
Q: “who is Andy?”
• Given: a term that is known to be a personal name, but is not mentioned ‘as is’ in a header (otherwise, the task is easy)
• Output: ranked person nodes.

(figure) Example: the walk leads from term:andy through shared files to the node Person: Andrew Johns.

* This task is complementary to person-name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005)
Corpora and Datasets
a. Corpora (table)
b. Types of names (table)

Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing
Person Name Disambiguation
1. Baseline: string matching (& common nicknames)
   Find persons whose names are similar to the name term (Jaro similarity).
   • Successful in many cases
   • Not successful for some nicknames
   • Cannot handle ambiguity (matches are arbitrary)

2. Graph walk: term
   Vq = the name-term node (2 steps).
   • Models co-occurrences
   • Cannot handle ambiguity (the dominant person wins)

3. Graph walk: term + file
   Vq = the name-term node + the file node (2 steps); see the sketch below.
   • The file node is naturally available context
   • Solves the ambiguity problem!
   • But incorporates additional noise.

4. Graph walk: term + file, reranked using learning
   Re-rank the output of (3), using:
   – path-describing features
   – ‘source count’: whether the paths originate from a single source node or from both
   – string similarity
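A hedged sketch of variant (3) on top of the earlier walk code (the node naming and the example output are hypothetical):

```python
def disambiguate_name(g, term, file_id, steps=2, alpha=0.5):
    """Variant (3): walk from Vq = {name term, message file}, keep person nodes."""
    vq = {("term", term): 0.5, ("doc", file_id): 0.5}   # equal mass on both sources
    ranked = walk_similarity(g.edges, vq, alpha=alpha, max_steps=steps)
    return [(node, p) for node, p in ranked if node[0] == "person"]

# hypothetical: disambiguate_name(g, "andy", "msg-1017")
#   -> [(("person", "Andrew Johns"), 0.12), ...]
```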
Results

(figure) Name-disambiguation accuracy for the four methods: baseline string match with nicknames; graph walk from the name; graph walk from {name, file}; and after learning-to-rank.
Results: Enron execs

(figure) Results for the Enron executives corpus.
Learning
• There is no single “best” measure of similarity: how can you learn to better rank graph nodes for a particular task?
• Learning methods for graph walks:
  – The walk parameters can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005)
  – We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning)
• Features of a candidate answer y describe the set of paths from query x to y
Re-ranking overview
Boosting-based reranking, following Collins & Koo (Computational Linguistics, 2005):

A training example includes:
– a ranked list of l_i nodes
– each node represented through m binary features
– at least one known correct node

Scoring function: combine the original graph-walk score of candidate y with a linear combination of its features,

  F(y, \bar{w}) = w_0 \log p(y) + \sum_{k=1}^{m} w_k f_k(y)

where \log p(y) is the original score y~x. Find \bar{w} minimizing the exponential loss (boosted version),

  ExpLoss(\bar{w}) = \sum_i \sum_{j=2}^{l_i} \exp\left( F(y_{i,j}, \bar{w}) - F(y_{i,1}, \bar{w}) \right)

where y_{i,1} is the correct node of example i. This requires binary features, and there is a closed-form formula for the best feature and update δ in each iteration; a sketch follows below.
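A compact sketch of that boosting loop (the candidate representation, the smoothing constant, and keeping w_0 fixed are assumptions of this sketch, not the paper’s exact procedure):

```python
import math
from collections import defaultdict

def rerank_train(examples, n_rounds=100, eps=1e-4):
    """Boosted reranking sketch, after Collins & Koo (2005).

    examples: list of candidate lists; each candidate is (log_p, feats), with
    feats a frozenset of binary feature ids; the FIRST candidate of each list
    is the known-correct node.  Returns (w0, w); rank by score() at test time.
    """
    w0, w = 1.0, defaultdict(float)          # w0 scales the original walk score

    def score(c):
        log_p, feats = c
        return w0 * log_p + sum(w[k] for k in feats)

    for _ in range(n_rounds):
        # W+ / W-: loss mass where raising feature k helps / hurts the margin
        plus, minus = defaultdict(float), defaultdict(float)
        for ex in examples:
            correct, rest = ex[0], ex[1:]
            sc = score(correct)
            for c in rest:
                loss = math.exp(score(c) - sc)          # exp(-margin)
                for k in correct[1] ^ c[1]:             # features that differ
                    (plus if k in correct[1] else minus)[k] += loss
        feats = set(plus) | set(minus)
        if not feats:
            break
        # reduction in ExpLoss for feature k is (sqrt(W+) - sqrt(W-))^2
        best = max(feats, key=lambda k: (math.sqrt(plus[k]) - math.sqrt(minus[k])) ** 2)
        # closed-form optimal update, smoothed to avoid log(0)
        w[best] += 0.5 * math.log((plus[best] + eps) / (minus[best] + eps))
    return w0, w
```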
Path-describing Features

• The set of paths to a target node within k steps is recovered in full.

(figure) A small graph over nodes x1…x5, unrolled for steps K=0,1,2. Paths reaching x3 at k=2:
  x2 → x1 → x3
  x4 → x1 → x3
  x2 → x2 → x3
  x2 → x3

‘Edge unigram’ features: was edge type l used in reaching x from Vq?
‘Edge bigram’ features: were edge types l1 and l2 used, in that order, in reaching x from Vq?
‘Top edge bigram’ features: were edge types l1 and l2 used, in that order, among the top two highest-scoring paths?

(A feature-extraction sketch follows below.)
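A sketch of extracting these binary features from recovered paths (the path representation is an assumption: paths are taken to be lists of edge labels, sorted best-first):

```python
def path_features(paths):
    """Binary features describing how a target node was reached.

    paths: lists of edge labels, sorted best-first. Emits edge-unigram and
    edge-bigram features; 'top bigram' features come only from the two
    highest-scoring paths.  Returns a frozenset, as the reranker expects."""
    feats = set()
    for rank, labels in enumerate(paths):
        for l in labels:
            feats.add(("unigram", l))
        for l1, l2 in zip(labels, labels[1:]):
            feats.add(("bigram", l1, l2))
            if rank < 2:
                feats.add(("top_bigram", l1, l2))
    return frozenset(feats)

# e.g. path_features([["sent_from", "sent_to_inv"], ["has_term", "has_term_inv"]])
```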
Results
Threading
Threading is an interesting problem, because:
• There are often irregularities in thread structural information, so thread discourse should be captured using an intelligent approach (D.D. Lewis & K.A. Knowles, Threading email: A preliminary study, Information Processing and Management, 1997)
• Threading information can improve message categorization into topical folders (B. Klimt & Y. Yang, The Enron corpus: A new dataset for email classification research, ECML, 2004)
• Adjacent messages in a thread can be assumed to be the most similar messages to each other in the corpus; threading is therefore related to the general problem of finding similar messages.
The task: given a message, retrieve adjacent messages in the thread
Some intuition?

(figure, built up across several slides) What ties a message file_x to the adjacent messages in its thread: shared content, the social network of senders and recipients, and the timeline.
Threading: experiments
1. Baseline: TF-IDF similarity — consider all the available information (header & body) as text
2. Graph walk: uniform — start from the file node, 2 steps, uniform edge weights
3. Graph walk: random — start from the file node, 2 steps, random edge weights (best of 10)
4. Graph walk: reranked — rerank the output of (3) using the path-describing features
Results
Highly-ranked edge bigrams (l⁻¹ denotes the inverse of edge type l):
• sent-from → sent-to⁻¹
• date-of → date-of⁻¹
• has-term → has-term⁻¹
Finding Meeting Attendees
Extended graph contains 2 months of calendar data (table).
Main Contributions
• Presented an extended similarity measure incorporating non-textual objects
• Finite lazy random walks to perform typed search
• A re-ranking paradigm to improve on graph walk results
• Instantiation of this framework for email
• Defined and evaluated novel tasks for email
Another Task that Can be Formulated as a Graph Query: GeneId-Ranking
• Given:
  – a biomedical paper abstract
• Find:
  – the geneId for every gene mentioned in the abstract
• Method:
  – from paper x, produce a ranked list of geneIds y by the similarity x~y
• Background resources:
  – a “synonym list”: geneId → { name1, name2, ... }
  – one or more protein NER systems
  – training/evaluation data: pairs of (paper, {geneId1, ..., geneIdn})
Sample abstracts and synonyms

(figure) Abstracts, their true geneId labels, and NER-extractor output, alongside synonym-list entries such as:
• MGI:96273: Htr1a; 5-hydroxytryptamine (serotonin) receptor 1A; 5-HT1A receptor
• MGI:104886: Gpx5; glutathione peroxidase 5; Arep
• ...

52,000+ entries for mouse, 35,000+ for fly
Graph for the task....

(figure) The graph layers abstracts, protein mentions, terms, synonym strings, and geneIds: abstracts (e.g., file:doc115) connect to extracted protein mentions (“HT1A”, “HT1”, “CA1”) via hasProtein edges; mentions connect to term nodes (term:HT, term:1, term:A, term:CA, term:hippocampus) via hasTerm edges; synonym strings (“5-HT1A receptor”, “Htr1a”, “eIF-1A”) connect to geneIds (MGI:95298, MGI:46273, ...) via synonym edges; and inFile edges tie terms back to the documents containing them. A second version of the graph also includes noisy training abstracts (file:doc214, file:doc523, file:doc6273, ...).
Experiments
• Data: BioCreative Task 1B
  – mouse: 10,000 training abstracts, 250 devtest (using the first 150 for now); 50,000+ geneIds; the graph has 525,000+ nodes
• NER systems:
  – likelyProtein: trained on yapex.train using off-the-shelf NER learners (Minorthird)
  – possibleProtein: same, but modified (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision)
Experiments with NER
| Corpus | Extractor | Token Precision | Token Recall | Span Precision | Span Recall | F1 |
|---|---|---|---|---|---|---|
| yapex.test | likely | 94.9 | 64.8 | 87.2 | 62.1 | 72.5 |
| yapex.test | possible | 49.0 | 97.4 | 47.2 | 82.5 | 60.0 |
| mouse | likely | 81.6 | 31.3 | 66.7 | 26.8 | 45.3 |
| mouse | possible | 43.9 | 88.5 | 30.4 | 56.6 | 39.6 |
| mouse | dictionary | 50.1 | 46.9 | 24.5 | 43.9 | 31.4 |
Experiments with Graph Search
• Baseline method (sketched below):
  – extract entities of type x
  – for each string of type x, find the best-matching synonym, and then its geneId
    • consider only synonyms sharing >=1 token
    • Soft/TFIDF distance
    • break ties randomly
  – rank geneIds by the number of times they are reached
    • rewards multiple mentions (even via alternate synonyms)
• Evaluation: average, over 50 test documents, of
  – non-interpolated average precision (plausible for curators)
  – max F1 over all cutoffs
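A sketch of that baseline (illustrative: difflib’s ratio stands in for the Soft/TFIDF distance, and the tiny example data is hypothetical):

```python
from collections import Counter
from difflib import SequenceMatcher

def baseline_gene_ranking(mentions, synonyms):
    """Rank geneIds by how often they are reached from extracted mentions.
    synonyms maps synonym string -> geneId; ties in max() fall arbitrarily,
    approximating the 'break ties randomly' step."""
    counts = Counter()
    for m in mentions:
        toks = set(m.lower().split())
        # consider only synonyms sharing at least one token with the mention
        cands = [s for s in synonyms if toks & set(s.lower().split())]
        if not cands:
            continue
        best = max(cands,
                   key=lambda s: SequenceMatcher(None, m.lower(), s.lower()).ratio())
        counts[synonyms[best]] += 1    # multiple mentions reward a geneId
    return counts.most_common()

# hypothetical: baseline_gene_ranking(["Htr1a"], {"Htr1a": "MGI:96273"})
```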
Experiments with Graph Search
| mouse dataset | MAP | maxF1 |
|---|---|---|
| likelyProtein + softTFIDF | 45.0 | 58.1 |
| possibleProtein + softTFIDF | 62.6 | 74.9 |
| graph walk | 51.3 | 64.3 |
Baseline vs Graphwalk
• Baseline includes:
  – softTFIDF distances from NER entity to gene synonyms
  – knowledge that the “shortcut” path doc → entity → synonym → geneId is important
• Graph includes:
  – IDF effects, correlations, training data, etc.
• Proposed graph extension:
  – add softTFIDF and “shortcut” edges
• Learning and reranking (sketched below):
  – start with “local” features f_i(e) of edges e = u → v
  – for answer y, compute expectations E( f_i(e) | start=x, end=y )
  – use the expectations as feature values, with voted perceptron (Collins, 2002) as the learning-to-rank method
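A sketch of the feature-expectation step (assuming paths to candidate y have already been enumerated with their probabilities, as in the path-recovery slide above):

```python
from collections import defaultdict

def edge_feature_expectations(paths_to_y, edge_features):
    """E( f_i(e) | start=x, end=y ).

    paths_to_y: list of (prob, [edges]) for the walk paths from x ending at y.
    edge_features: maps an edge to the ids of its local features f_i.
    Returns a feature vector for candidate y, usable by a perceptron ranker."""
    total = sum(p for p, _ in paths_to_y) or 1.0
    exp = defaultdict(float)
    for p, edges in paths_to_y:
        counts = defaultdict(float)
        for e in edges:
            for fi in edge_features.get(e, ()):
                counts[fi] += 1.0
        for fi, c in counts.items():
            exp[fi] += (p / total) * c   # expectation under Pr(path | x ->* y)
    return dict(exp)
```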
Experiments with Graph Search
| mouse dataset | MAP | maxF1 |
|---|---|---|
| likelyProtein + softTFIDF | 45.0 | 58.1 |
| possibleProtein + softTFIDF | 62.6 | 74.9 |
| graph walk | 51.3 | 64.3 |
| walk + extra links | 73.0 | 80.7 |
| walk + extra links + learning | 79.7 | 83.9 |
Hot off the presses
• Ongoing work: learn an NER system from pairs of (document, geneIdList)
  – much easier to obtain such training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
  – obtains F1 of 71.4 on the mouse data (vs. 45.3 by training on YAPEX data, which is from a different distribution)
  – joint work with Richard Wang, Bob Frederking, Anthony Tomasic
Experiments with Graph Search
| mouse dataset | MAP (Yapex-trained NER) | MAP (MGI-trained NER) |
|---|---|---|
| likelyProtein + softTFIDF | 45.0 | 72.7 |
| possibleProtein + softTFIDF | 62.6 | 65.7 |
| graph walk | 51.3 | 54.4 |
| walk + extra links | 73.0 | 76.7 |
| walk + extra links + learning | 79.7 | 84.2 |
Summary
• Contributions:
  – a very simple query language for graphs, based on a diffusion-kernel (damped PageRank, ...) similarity metric
  – experiments on natural types of queries:
    • finding likely meeting attendees
    • finding related documents (email threading)
    • disambiguating person and gene/protein entity names
  – techniques for learning to answer queries:
    • reranking using expectations of simple, local features
    • tune performance to a particular “similarity”
Summary
• Some open problems:
  – scalability & efficiency:
    • a K-step walk on a node-node graph with fan-out b is O(KbN)
    • accurate sampling takes on the order of a minute for 10-step walks on graphs with ~10^6 nodes
  – faster, better learning methods:
    • combine re-ranking with learning the parameters of the graph walk
  – adding language modeling, topic modeling:
    • extend the graph to include models as well as data