UMass and Learning for CALO
Andrew McCallum
Information Extraction & Synthesis Laboratory
Department of Computer Science
University of Massachusetts
Outline
• CC-Prediction
  – Learning in the wild from user email usage
• DEX
  – Learning in the wild from user correction... as well as KB records filled by other CALO components
• Rexa
  – Learning in the wild from user corrections to coreference... propagating constraints in a Markov-Logic-like system that scales to ~20 million objects
• Several new topic models
  – Discover interesting useful structure without the need for supervision... learning from newly arrived data on the fly
CC Prediction Using Various Exponential Family
Factor Graphs
Learning to keep an org. connected & avoid stove-piping.
First steps toward ad-hoc team creation.
Learning in the wild from user’s CC behavior, and from other parts of the CALO ontology.
Graphical Models for Email
[Figure: factor graph for an email. y: Recipient of Email (plate Nr); xb: Body Words (plate Nb); xs: Subject Words (plate Ns); xr: Other Recipients (plate Nr-1). Legend: function; random variable; N replications (plate).]
• Compute P(y|x) for CC prediction
• Local functions facilitate system engineering through modularity
Email Model: Nb words in the body, Ns words in the subject, Nr recipients
The graph describes the joint distribution of the random variables in terms of a product of local functions.
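A minimal sketch of the "product of local functions" idea, assuming one exponential-family factor per input group (body words, subject words, other recipients). The function names, weight tables, and toy addresses are hypothetical and are not taken from the CALO system.

```python
import math

def local_factor(weights, label, features):
    """One local function: exp of the summed weights for (label, feature) pairs."""
    return math.exp(sum(weights.get((label, f), 0.0) for f in features))

def cc_posterior(candidates, body, subject, others, w_body, w_subj, w_rcpt):
    """P(y | x): product of the three local functions, normalized over candidates."""
    scores = {
        y: local_factor(w_body, y, body)
           * local_factor(w_subj, y, subject)
           * local_factor(w_rcpt, y, others)
        for y in candidates
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Toy weights (hypothetical values, for illustration only).
w_body = {("alice", "budget"): 1.2, ("bob", "parser"): 0.9}
w_subj = {("alice", "review"): 0.5}
w_rcpt = {("bob", "carol"): 0.7}

posterior = cc_posterior(
    ["alice", "bob"],
    body=["budget", "meeting"],
    subject=["review"],
    others=["carol"],
    w_body=w_body, w_subj=w_subj, w_rcpt=w_rcpt,
)
print(posterior)
```

Because each factor is a separate function with its own weight table, a new factor (for example a relational one) can be multiplied in without touching the others, which is the modularity point made on the slide.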
Document Models
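[Figure: factor graph for a document. y: Author of Document (plate Na). Inputs: Title, Abstract, Body, Co-authors, References. Variable and plate labels on the slide: xt, xs, xb, xr; Nt, Ns, Nb, Nr, Na-1.]
• Models may include relational attributes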
• We can optimize P(y|x) for classification performance and P(x|y) for model interpretability and parameter transfer (to other models)
CC Prediction and Relational Attributes
[Figure: factor graph for CC prediction with relational attributes. y: Target Recipient (plate Nr). Inputs: Thread Relation, Body Words, Subject Words, Other Recipients, Relation. Variable and plate labels on the slide: xtr, xb, xs, xr, xr’; Ntr, Nb, Ns, Nr-1.]
Thread Relations – e.g. Was a given recipient ever included on this email thread?
Recipient Relationships – e.g. Does one of the other recipients report to the target recipient?
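The two example relations above can be written as boolean feature functions. This is only a sketch: the message dictionaries and the `reports_to` table are hypothetical stand-ins for whatever the CALO ontology actually provides.

```python
def thread_relation(target, thread_messages):
    """True if `target` was ever a recipient of a message on this thread."""
    return any(target in msg["recipients"] for msg in thread_messages)

def recipient_relation(target, other_recipients, reports_to):
    """True if at least one of the other recipients reports to `target`."""
    return any(reports_to.get(r) == target for r in other_recipients)

# Hypothetical toy data.
thread = [
    {"recipients": ["alice", "bob"]},
    {"recipients": ["bob", "carol"]},
]
reports_to = {"bob": "dana", "carol": "dana"}

features = {
    "on_thread": thread_relation("alice", thread),
    "manages_other_recipient": recipient_relation("dana", ["bob"], reports_to),
}
print(features)
```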
CC-Prediction Learning in the Wild
• As documents are added to Rexa, models of expertise for authors grow
• As DEX obtains more contact information and keywords, organizational relations emerge
• Model parameters can be adapted on-line
• Priors on parameters can be used to transfer learned information between models
• New relations can be added on-line
• Modular model construction and intelligent model optimization enable these goals
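One way to picture on-line adaptation with a transferred prior is a single stochastic gradient step on a logistic model, with a Gaussian prior centered on parameters learned elsewhere. This is an illustrative sketch under those assumptions, not the training procedure used for CC prediction; all names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(weights, prior_mean, features, label, lr=0.1, prior_strength=0.01):
    """One gradient-ascent step on log P(label | features) plus a Gaussian log-prior.

    The prior pulls each weight toward `prior_mean` (parameters transferred
    from another model). Unseen feature names start at 0, so a new relation
    can be introduced simply by passing a new key.
    """
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    error = label - sigmoid(score)
    updated = dict(weights)
    for f in set(weights) | set(features) | set(prior_mean):
        w = weights.get(f, 0.0)
        grad = error * features.get(f, 0.0) - prior_strength * (w - prior_mean.get(f, 0.0))
        updated[f] = w + lr * grad
    return updated

# Hypothetical example: one observed CC decision (label 1 = user did CC the person).
prior_mean = {"reports_to_target": 1.0}            # transferred from another model
observed = {"reports_to_target": 1.0, "new_relation": 1.0}
weights = online_update({}, prior_mean, observed, label=1)
print(weights)
```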
CC Prediction Upcoming work on
Multi-Conditional Learning
A discriminatively-trained topic model,
discovering low-dimensional representations for
transfer learning and improved regularization & generalization.
Objective Functions for Parameter Estimation
Traditional
• Traditional, joint training (e.g. naive Bayes, most topic models)
• Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)
• Conditional mixtures (e.g. Jebara’s CEM, McCallum CRF string edit distance, ...)
• Traditional mixture model (e.g. LDA)
New, multi-conditional
• Multi-conditional (mostly conditional, generative regularization)
• Multi-conditional (for semi-sup)
• Multi-conditional (for transfer learning, 2 tasks, shared hiddens)
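The formulas for these objectives did not survive in the slide text. As a rough illustration of "mostly conditional, generative regularization", the sketch below assumes the objective is a weighted sum of the two conditional log-likelihoods, alpha * log P(y|x) + beta * log P(x|y). The weighting scheme, the parameter names `alpha` and `beta`, and the toy joint table are assumptions.

```python
import math

def multi_conditional_objective(joint, data, alpha=1.0, beta=0.5):
    """Weighted sum of log P(y|x) and log P(x|y) over the data.

    `joint` maps (x, y) to a nonnegative score; both conditionals are
    obtained by normalizing it over y or over x. Setting beta = 0 gives
    purely conditional training.
    """
    total = 0.0
    for x, y in data:
        p_xy = joint[(x, y)]
        p_x = sum(p for (xi, _), p in joint.items() if xi == x)
        p_y = sum(p for (_, yi), p in joint.items() if yi == y)
        total += alpha * math.log(p_xy / p_x) + beta * math.log(p_xy / p_y)
    return total

# Toy joint distribution over two inputs and two labels (hypothetical numbers).
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
data = [("a", 0), ("b", 1)]

mcl = multi_conditional_objective(joint, data, alpha=1.0, beta=0.5)
conditional_only = multi_conditional_objective(joint, data, alpha=1.0, beta=0.0)
print(mcl, conditional_only)
```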
“Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]
Predictive Random Fields: mixture of Gaussians on synthetic data
[Figure panels: Data, classify by color; Generatively trained; Conditionally-trained [Jebara 1998]; Multi-Conditional]
[McCallum, Wang, Pal, 2005]
Multi-Conditional Mixtures vs. Harmonium
on document retrieval task
Harmonium, joint with words, no labels
Harmonium, joint, with class labels and words
Conditionally-trained, to predict class labels
Multi-Conditional, multi-way conditionally trained
[McCallum, Wang, Pal, 2005]
DEX
Beginning with a review of previous work,
then new work on record extraction,
with the ability to leverage new KBs in the wild, and for transfer
System Overview
[Diagram components: Contact Info and Person Name Extraction; Person Name Extraction; Name Coreference; Homepage Retrieval; Social Network Analysis; Keyword Extraction. Other diagram labels: CRF, WWW, names, Email.]
An Example
To: “Andrew McCallum” [email protected]
Subject ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…
Search for new people
Summary of Results
Contact info and name extraction performance (25 fields)
        Token Acc   Field Prec   Field Recall   Field F1
CRF     94.50       85.73        76.33          80.76

Example keywords extracted
Person               Keywords
William Cohen        Logic programming; Text categorization; Data integration; Rule learning
Daphne Koller        Bayesian networks; Relational models; Probabilistic models; Hidden variables
Deborah McGuinness   Semantic web; Description logics; Knowledge representation; Ontologies
Tom Mitchell         Machine learning; Cognitive states; Learning apprentice; Artificial intelligence
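As a quick check of the CRF row in the extraction results, field F1 is the harmonic mean of field precision and field recall. The helper name below is ours; the numbers are those reported on the slide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2.0 * precision * recall / (precision + recall)

crf_f1 = f1_score(85.73, 76.33)
print(round(crf_f1, 2))  # 80.76, matching the reported Field F1
```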
1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
• Information about
  – people
  – contact information
  – email
  – affiliation
  – job title
  – expertise
  – ...
are key to answering many CALO questions... both directly, and as supporting inputs to higher-level questions.
Importance of accurate DEX fields in IRIS
Learning Field Compatibilities in DEX
Professor Jane Smith
University of California
209-555-5555
Professor Smith chairs the Computer Science Department. She hails from Boston, …her administrative assistant …
John Doe
Administrative Assistant
University of California
209-444-4444
Name: Jane Smith, John Doe
JobTitle: Professor, Administrative Assistant
Company: U of California
Department: Computer Science
Phone: 209-555-5555, 209-444-4444
City: Boston
Extracted Record
Compatibility Graph
[Figure: graph over the extracted field values: Jane Smith; University of California; 209-555-5555; Computer Science; Boston; John Doe; Administrative Assistant; University of California; 209-444-4444; Professor. Edge weights shown: -.5, -.4, -.6, .4, .8, .4, -.5.]
Learning Field Compatibilities in DEX
• ~35% error reduction over transitive closure
• Qualitatively better than heuristic approach
• Mine Knowledge Bases from other parts of IRIS for learning compatibility rules among fields
  – “Professor” job title co-occurs with “University” company
  – Area code / city compatibility
  – “Senator” job title co-occurs with “Washington, D.C.” location
• In the wild
  – As the user adds new fields & makes corrections, DEX learns from this KB data
• Transfer learning
  – between departments/industries
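To make the compatibility-graph idea concrete, here is a small sketch that partitions extracted field values using signed pairwise compatibility scores. It uses a simple greedy agglomerative heuristic, which is our stand-in and not necessarily the partitioning method used in DEX. The scores are hypothetical, loosely echoing the kind of weights shown in the figure.

```python
from itertools import combinations

def partition_fields(fields, compat):
    """Greedily merge the pair of clusters with the largest positive total
    compatibility; stop when no merge has positive total score."""
    def score(a, b):
        return compat.get((a, b), compat.get((b, a), 0.0))

    clusters = [{f} for f in fields]
    while True:
        best, best_pair = 0.0, None
        for i, j in combinations(range(len(clusters)), 2):
            s = sum(score(a, b) for a in clusters[i] for b in clusters[j])
            if s > best:
                best, best_pair = s, (i, j)
        if best_pair is None:
            return clusters
        i, j = best_pair          # i < j, so deleting j leaves i valid
        clusters[i] |= clusters[j]
        del clusters[j]

# Hypothetical learned compatibilities (positive = belong in the same record).
compat = {
    ("Jane Smith", "Professor"): 0.8,
    ("Jane Smith", "209-555-5555"): 0.4,
    ("John Doe", "Administrative Assistant"): 0.8,
    ("John Doe", "209-444-4444"): 0.4,
    ("Jane Smith", "John Doe"): -0.6,
    ("Professor", "Administrative Assistant"): -0.5,
    ("John Doe", "Professor"): -0.4,
}
fields = ["Jane Smith", "Professor", "209-555-5555",
          "John Doe", "Administrative Assistant", "209-444-4444"]

records = partition_fields(fields, compat)
print(records)
```

Negative edges are what keep the two people apart here; an approach that only follows positive links, such as transitive closure, has no way to use that evidence.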
Rexa
A knowledge base of publications, grants, people, their expertise, topics, and inter-connections
Learning for information extraction and coreference.
Incrementally leveraging multiple sources of information for improved coreference
Gathering information about people’s expertise and co-author, citation relations
First a tour of Rexa, then slides about learning
Previous Systems
[Diagram: entity ResearchPaper with relation Cites]
More Entities and Relations
[Diagram: ResearchPaper, Cites, Person, University, Venue, Grant, Groups, Expertise]
Learning in Rexa
Extraction, coreference
In the wild: Re-adjusting KB after corrections from a user
Also, learning research topics/expertise, and their interconnections
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001] (500 citations)
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence
[Figure: finite state model and graphical model. FSM states (output seq): ..., y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... Observations (input seq): ..., x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, ...
Example input seq: said Jones a Microsoft VP …
Example output seq: OTHER PERSON OTHER ORG TITLE …]

p(y | x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t)
where
\Phi(y_t, y_{t-1}, x, t) = \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \right)

Wide-spread interest, positive experimental results in many applications.
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ‘04], …
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ‘04]
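The linear-chain CRF definition, p(y|x) as a normalized product over positions of exp(sum_k lambda_k f_k(y_t, y_{t-1}, x, t)), can be sketched directly. The feature functions, weights, and the convention that y_{t-1} is `None` at the first position are assumptions made for this toy example; Z_x is computed with the standard forward recursion.

```python
import itertools
import math

def log_potential(weights, features, y_t, y_prev, x, t):
    """log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k * f_k(y_t, y_{t-1}, x, t)."""
    return sum(w * f(y_t, y_prev, x, t) for w, f in zip(weights, features))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(weights, features, labels, x):
    """log Z_x via the forward algorithm (sum over all label sequences)."""
    alpha = {y: log_potential(weights, features, y, None, x, 0) for y in labels}
    for t in range(1, len(x)):
        alpha = {
            y: log_sum_exp([alpha[p] + log_potential(weights, features, y, p, x, t)
                            for p in labels])
            for y in labels
        }
    return log_sum_exp(list(alpha.values()))

def conditional_prob(weights, features, labels, y, x):
    """p(y | x) for one complete label sequence y."""
    score, prev = 0.0, None
    for t, y_t in enumerate(y):
        score += log_potential(weights, features, y_t, prev, x, t)
        prev = y_t
    return math.exp(score - log_partition(weights, features, labels, x))

# Hypothetical binary features and weights.
features = [
    lambda y_t, y_prev, x, t: float(y_t == "PERSON" and x[t][0].isupper()),
    lambda y_t, y_prev, x, t: float(y_prev == "OTHER" and y_t == "PERSON"),
    lambda y_t, y_prev, x, t: float(y_t == "OTHER" and x[t].islower()),
]
weights = [2.0, 1.0, 1.5]
labels = ["OTHER", "PERSON"]
x = ["said", "Jones"]

probs = {y: conditional_prob(weights, features, labels, list(y), x)
         for y in itertools.product(labels, repeat=len(x))}
best = max(probs, key=probs.get)
print(best, probs[best])
```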
IE from Research Papers [McCallum et al ‘99]
Field-level F1
Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
(40% error reduction)
(Word-level accuracy is >99%)
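The 40% figure is consistent with the relative drop in field-level error from the SVM result to the CRF result, assuming error is taken as 100 minus F1. That reading is our assumption; the slide does not state which pair of systems is compared.

```python
def relative_error_reduction(baseline_f1, new_f1):
    """Fraction of the baseline error (100 - F1) removed by the new system."""
    baseline_error = 100.0 - baseline_f1
    new_error = 100.0 - new_f1
    return (baseline_error - new_error) / baseline_error

svm_to_crf = relative_error_reduction(89.7, 93.9)
print(f"{svm_to_crf:.1%}")
```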
Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations.
[Figure: graphical model with nodes labeled p, c, o, s, y. Annotations: Database field values; Citation attributes; Segmentation; Co-reference decisions; World Knowledge.]
Example citations:
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
Inference: Variant of Iterated Conditional Modes [Besag, 1986]
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
see also [Marthi, Milch, Russell, 2003]
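For readers unfamiliar with Iterated Conditional Modes, the basic (non-variant) procedure is coordinate-wise maximization: each variable in turn is set to its best value with all others held fixed, until nothing changes. The sketch below shows plain ICM on a toy three-variable model with hypothetical scores; it is not the specific variant used in the cited work.

```python
def icm(variables, domains, local_score, init, max_sweeps=20):
    """Plain ICM: update one variable at a time, holding the others fixed."""
    assignment = dict(init)
    for _ in range(max_sweeps):
        changed = False
        for v in variables:
            current = local_score(v, assignment[v], assignment)
            best_val = max(domains[v], key=lambda val: local_score(v, val, assignment))
            if local_score(v, best_val, assignment) > current:
                assignment[v] = best_val
                changed = True
        if not changed:      # converged: no single-variable change helps
            break
    return assignment

# Toy model (hypothetical): unary preferences plus a bonus for agreeing neighbors.
unary = {"a": {0: 0.0, 1: 2.0}, "b": {0: 0.5, 1: 0.0}, "c": {0: 0.0, 1: 0.0}}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
agree_bonus = 1.0

def local_score(v, val, assignment):
    """Score terms that involve variable v when it takes value val."""
    return unary[v][val] + sum(agree_bonus for u in neighbors[v] if assignment[u] == val)

result = icm(["a", "b", "c"], {v: [0, 1] for v in unary},
             local_score, init={"a": 0, "b": 0, "c": 0})
print(result)
```

ICM only guarantees a local optimum, so the result can depend on the initial assignment and the update order.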
Rexa Learning in the Wild from User Feedback
• Coreference will never be perfect.
• Rexa allows users to enter corrections to coreference decisions
• Rexa then uses this feedback to
  – re-consider other inter-related parts of the KB
  – automatically make further error corrections by propagating constraints
• (Our coreference system uses underlying ideas very much like Markov Logic, and scales to ~20 million mention objects.)
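A toy picture of how one user correction can propagate: if the user says two mentions are different, any later merge that would place them in the same entity must be rejected, even when it is proposed through a third mention. The sketch uses hard constraints on a union-find structure, which is a simplification of our own; the class, its methods, and the mention IDs are hypothetical, and the weighted Markov-Logic-like model in Rexa is not reproduced here.

```python
class CoreferenceKB:
    """Mention clusters with user-supplied cannot-link corrections."""

    def __init__(self, mentions):
        self.parent = {m: m for m in mentions}
        self.cannot_link = set()

    def find(self, m):
        """Representative of m's cluster (with path halving)."""
        while self.parent[m] != m:
            self.parent[m] = self.parent[self.parent[m]]
            m = self.parent[m]
        return m

    def forbid(self, a, b):
        """Record a user correction: a and b are different entities."""
        self.cannot_link.add((a, b))

    def merge(self, a, b):
        """Merge two clusters unless a correction forbids it. Returns success."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        for u, v in self.cannot_link:
            if {self.find(u), self.find(v)} == {ra, rb}:
                return False
        self.parent[ra] = rb
        return True

kb = CoreferenceKB(["m1", "m2", "m3"])
kb.forbid("m1", "m3")                 # user correction: m1 is not m3
first = kb.merge("m1", "m2")          # allowed
second = kb.merge("m2", "m3")         # rejected: would join m1 and m3
print(first, second)
```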
Finding Topics in 1 million CS papers
200 topics & keywords automatically discovered.
Topical Transfer
Citation counts from one topic to another. Map “producers and consumers”
Topical Diversity
Find the topics that are cited by many other topics---measuring diversity of impact.
Entropy of the topic distribution among papers that cite this paper (this topic).
Low Diversity
High Diversity
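The diversity measure described above is the entropy of the topic distribution over citing papers. A minimal sketch, assuming each citing paper has been assigned a single topic label (the topic names are made up):

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in bits) of the topic distribution among citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

low = topical_diversity(["nlp", "nlp", "nlp", "nlp"])          # cited by one topic only
high = topical_diversity(["nlp", "vision", "theory", "db"])    # cited evenly by four topics
print(low, high)
```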
Some New Work on Topic Models
Robustly capturing topic correlations: Pachinko Allocation Model
Capturing phrases in topic-specific ways: Topical N-Gram Model
Pachinko Machine
Pachinko Allocation Model [Li, McCallum, 2005]
(Model structure, not the graphical model)
[Figure: directed acyclic graph with nodes labeled 11; 21, 22; 31, 32, 33; 41, 42, 43, 44, 45; and leaves word1 ... word8.]
• Distributions over words (like “LDA topics”)
• Distributions over topics; mixtures, representing topic correlations
• Distributions over distributions over topics...
Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).
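The model structure can be read as a generative walk: for each word, start at the root and repeatedly pick a child according to the current node's multinomial until a word leaf is reached. The sketch below assumes a small hypothetical DAG, per-document multinomials drawn from Dirichlets at some nodes, and one shared multinomial at others (the "very peaked Dirichlet" case). Node names only loosely follow the figure.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_document(dag, root, alphas, shared, n_words, rng):
    """Generate one document by walking the DAG from the root for each word."""
    # Per-document multinomials for nodes with a Dirichlet; shared ones are fixed.
    theta = {node: sample_dirichlet(alphas[node], rng) for node in alphas}
    theta.update(shared)
    words = []
    for _ in range(n_words):
        node = root
        while node in dag:                      # interior node: choose a child
            node = rng.choices(dag[node], weights=theta[node])[0]
        words.append(node)                      # leaf: a word
    return words

# Hypothetical structure: root -> super-topics -> word-level topics -> words.
dag = {
    "11": ["21", "22"],
    "21": ["31", "32"],
    "22": ["32", "33"],
    "31": ["word1", "word2"],
    "32": ["word3", "word4"],
    "33": ["word5", "word6"],
}
alphas = {"11": [1.0, 1.0], "21": [1.0, 1.0], "22": [1.0, 1.0]}
shared = {"31": [0.7, 0.3], "32": [0.5, 0.5], "33": [0.2, 0.8]}

doc = sample_document(dag, "11", alphas, shared, n_words=10, rng=random.Random(0))
print(doc)
```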
Topic Coherence Comparison
LDA 20 (“models, estimation, stopwords”): models, model, parameters, distribution, bayesian, probability, estimation, data, gaussian, methods, likelihood, em, mixture, show, approach, paper, density, framework, approximation, markov
LDA 100 (“estimation, some junk”): estimation, likelihood, maximum, noisy, estimates, mixture, scene, surface, normalization, generated, measurements, surfaces, estimating, estimated, iterative, combined, figure, divisive, sequence, ideal
PAM 100 (“estimation”): estimation, bayesian, parameters, data, methods, estimate, maximum, probabilistic, distributions, noise, variable, variables, noisy, inference, variance, entropy, models, framework, statistical, estimating
Example super-topic
  33  input hidden units function number
  27  estimation bayesian parameters data methods
  24  distribution gaussian markov likelihood mixture
  11  exact kalman full conditional deterministic
  1   smoothing predictive regularizers intermediate slope
Topic Correlations in PAM
5000 research paper abstracts, from across all CS
Numbers on edges are supertopics’ Dirichlet parameters
Likelihood Comparison
Varying number of topics
Want to Model Trends over Time
• Is prevalence of topic growing or waning?
• Pattern appears only briefly
  – Capture its statistics in focused way
  – Don’t confuse it with patterns elsewhere in time
• How do roles, groups, influence shift over time?
Topics over Time (TOT)
[Figure: two graphical-model views of TOT. Variables: word w, time stamp t, topic index z; plates Nd, D, T. Labels: multinomial over topics (Dirichlet prior); multinomial over words (Dirichlet prior); Beta over time (Uniform prior); distribution on time stamps.]
[Wang, McCallum 2006]
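Reading the labels in the figure, each token has a topic index drawn from the document's multinomial over topics, a word drawn from that topic's multinomial over words, and a time stamp drawn from that topic's Beta distribution over (normalized) time. A sketch of that generative step follows; the parameter values, vocabulary, and function name are hypothetical.

```python
import random

def generate_tot_tokens(theta, phi, psi, vocab, n_tokens, rng):
    """Generate (word, time stamp, topic) triples for one document.

    theta: multinomial over topics for this document
    phi[z]: multinomial over words for topic z
    psi[z]: (a, b) parameters of topic z's Beta distribution over time in [0, 1]
    """
    tokens = []
    for _ in range(n_tokens):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=phi[z])[0]
        t = rng.betavariate(*psi[z])
        tokens.append((w, t, z))
    return tokens

vocab = ["tariff", "treasury", "network", "internet"]
theta = [0.6, 0.4]
phi = [[0.5, 0.4, 0.05, 0.05],    # topic 0: mostly early-era words
       [0.05, 0.05, 0.5, 0.4]]    # topic 1: mostly late-era words
psi = [(2.0, 8.0),                # topic 0: mass near the start of the time range
       (8.0, 2.0)]                # topic 1: mass near the end

tokens = generate_tot_tokens(theta, phi, psi, vocab, n_tokens=8, rng=random.Random(0))
print(tokens)
```

Because time stamps are generated per topic, a topic that is prominent only briefly gets a narrow Beta, which keeps its word statistics from being mixed with patterns elsewhere in time.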
State of the Union Address
208 Addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopword removal was applied.
• 17156 ‘documents’
• 21534 words
• 669,425 tokens
Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.
1910
Comparing TOT against LDA
Topic Distributions Conditioned on Time
[Figure: topic mass (in vertical height) vs. time; NIPS vol. 1-14]