
Information Extraction, Conditional Random Fields, and Social Network Analysis
Andrew McCallum
Computer Science Department, University of Massachusetts Amherst
Joint work with Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li, Andres Corrada, Xuerui Wang

Slide 2 Goal: Mine actionable knowledge from unstructured text.

Slide 3 Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

Slide 4 A Portal for Job Openings

Slide 5 Job Openings: Category = High Tech, Keyword = Java, Location = U.S.

Slide 6 Data Mining the Extracted Job Information

Slide 7 IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

Slide 8 IE from Research Papers [McCallum et al 99]

Slide 9 IE from Research Papers

Slide 10 Mining Research Papers [Giles et al] [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

Slide 11 What is Information Extraction
As a family of techniques: Information Extraction = segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

Slide 14 What is Information Extraction
As a family of techniques: Information Extraction = segmentation + classification + association + clustering

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Slide 15 Larger Context
[Diagram: Document collection; Spider; Filter; IE (Segment, Classify, Associate, Cluster); Database; Data Mining (Discover patterns: entity types, links / relations, events); Actionable knowledge (Prediction, Outlier detection, Decision support)]

Slide 16 Outline
Examples of IE and Data Mining.
Brief review of Conditional Random Fields
Joint inference: Motivation and examples
- Joint Labeling of Cascaded Sequences (Belief Propagation)
- Joint Labeling of Distant Entities (BP by Tree Reparameterization)
- Joint Co-reference Resolution (Graph Partitioning)
- Joint Segmentation and Co-ref (Iterated Conditional Samples)
Interactive IE
Two example projects
- Email, contact management, and Social Network Analysis
- Research Paper search and analysis

Slide 17 Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
[Figure: finite state model (transitions, observations) and graphical model over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]
Generates: a state sequence and an observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8
Parameters, for all states S = {s_1, s_2, ...}:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: Maximize probability of training observations (w/ prior)

Slide 18 IE with Hidden Markov Models
Given a sequence of observations: Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi).
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name: Rich Caruana

Slide 19 We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
[Figure: graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]
Example features:
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"
Example word: is "Wisniewski", part of noun phrase, ends in "-ski"

Slide 20 Problems with Richer Representation and a Joint Model
These arbitrary features are not independent.
- Multiple levels of granularity (chars, words, phrases)
- Multiple dependent modalities (words, formatting, layout)
- Past & future
Two choices:
- Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
- Ignore the dependencies. This causes "over-counting" of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!

Slide 21 Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't "waste modeling effort" trying to generate what we are given at test time anyway.

Slide 22 From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
[Figure: Joint and Conditional models, each drawn over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]
(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using optimization method on L.

Slide 23 Conditional Random Fields
[Figure: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} over observations O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]
1. FSM special-case: linear chain among unknowns, parameters tied across time steps. [Lafferty, McCallum, Pereira 2001]
2. In general: CRFs = "Conditionally-trained Markov Network", arbitrary structure among unknowns.
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: Parameters tied across hits from SQL-like queries ("clique templates").

Slide 24 (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of outputs given inputs.
[Figure: finite state model (FSM states, observations) and graphical model over outputs y_{t-1}, ..., y_{t+3} and inputs x_{t-1}, ..., x_{t+3}]
input seq:  said   Veghte  a      Microsoft  VP
output seq: OTHER  PERSON  OTHER  ORG        TITLE
Fast-growing, wide-spread interest, many positive experimental results:
- Noun phrase, Named entity [HLT03], [CoNLL03]
- Protein structure prediction [ICML04]
- IE from Bioinformatics text [Bioinformatics 04]
- Asian word segmentation [COLING04], [ACL04]
- IE from Research papers [HLT04]
- Object classification in images [CVPR 04]

Slide 25 Training CRFs
Gradient of the log-likelihood = (feature count using correct labels) - (feature count using predicted labels) - (smoothing penalty)

Slide 26 Linear-chain CRFs vs. HMMs
- Comparable computational efficiency for inference
- Features may be arbitrary functions of any or all observations
- Parameters need not fully specify generation of observations; can require less training data
- Easy to incorporate domain knowledge

Slide 27 Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
        :   Number   :        Production of Milk and Milkfat 2/
 Year   :     of     :----------------------------------------------------------
        :Milk Cows 1/:   Per Milk Cow    :  Percentage   :       Total
        :            :-------------------: of Fat in All :---------------------
        :            :  Milk  : Milkfat  : Milk Produced :   Milk   : Milkfat
--------------------------------------------------------------------------------
        : 1,000 Head    --- Pounds ---       Percent        Million Pounds
        :
 1993   :    9,589     15,704     575         3.66         150,582    5,514.4
 1994   :    9,500     16,175     592         3.66         153,664    5,623.7
 1995   :    9,461     16,451     602         3.66         155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Slide 28 Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
CRF Labels (12 in all):
- Non-Table
- Table Title
- Table Header
- Table Data Row
- Table Section Data Row
- Table Footnote
- ...
Features:
- Percentage of digit chars
- Percentage of alpha chars
- Indented
- Contains 5+ consecutive spaces
- Whitespace in this line aligns with prev.
- ...
- Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
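
The per-line features listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `line_features`, the dictionary keys, and the rule used for "whitespace aligns with prev." (two lines share a column where a run of two or more spaces ends) are hypothetical choices, not the feature definitions used in the cited work.

```python
import re

def line_features(line, prev_line=""):
    """Layout features for one text line, loosely following the slide's list."""
    stripped = line.strip()
    n = len(stripped) or 1  # avoid division by zero on blank lines
    digits = sum(ch.isdigit() for ch in stripped)
    alphas = sum(ch.isalpha() for ch in stripped)
    # Columns where a run of 2+ spaces ends, i.e. where a new column starts.
    # Assumption: sharing any such column with the previous line counts as aligned.
    gaps = {m.end() for m in re.finditer(r" {2,}", line.rstrip())}
    prev_gaps = {m.end() for m in re.finditer(r" {2,}", prev_line.rstrip())}
    return {
        "pct_digit": digits / n,
        "pct_alpha": alphas / n,
        "indented": line.startswith((" ", "\t")),
        "has_5_spaces": " " * 5 in stripped,
        "aligns_with_prev": bool(gaps & prev_gaps),
    }

# Two data rows in the style of the milk table, and one prose line.
row_a = "1993 :    9,589     15,704"
row_b = "1994 :    9,500     16,175"
prose = "Marketings totaled 154 billion pounds."

table_feats = line_features(row_b, prev_line=row_a)
prose_feats = line_features(prose)
```

A data row scores high on digit percentage and alignment, while a prose line scores high on alpha percentage; a CRF would consume such features (and their conjunctions across neighboring lines) rather than the raw text.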
100+ documents from www.fedstats.gov

Slide 29 Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
                   Line labels, percent correct   Table segments, F1
HMM                65 %                           64 %
Stateless MaxEnt   85 %                           -
CRF                95 %                           92 %

Slide 30 IE from Research Papers [McCallum et al 99]

Slide 31 IE from Research Papers
                                   Field-level F1
Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
Error reduction: 40%

Slide 32 Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels   Examples
PER      Yayuk Basuki, Innocent Butare
ORG      3M, KDP, Cleveland
LOC      Cleveland, Nirmal Hriday, The Oval
MISC     Java, Basque, 1,000 Lakes Rally

Slide 33 Automatically Induced Features [McCallum & Li, 2003, CoNLL]
Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (firstmention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)

Slide 34 Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]
Method                                                      F1
HMMs, BBN's Identifinder                                    73%
CRFs w/out Feature Induction                                83%
CRFs with Feature Induction (based on LikelihoodGain)       90%

Slide 35 Outline
Examples of IE and Data Mining.
Brief review of Conditional Random Fields
Joint inference: Motivation and examples
- Joint Labeling of Cascaded Sequences (Belief Propagation)
- Joint Labeling of Distant Entities (BP by Tree Reparameterization)
- Joint Co-reference Resolution (Graph Partitioning)
- Joint Segmentation and Co-ref (Iterated Conditional Samples)
Interactive IE
Two example projects
- Email, contact management, and Social Network Analysis
- Research Paper search and analysis

Slide 36 Larger Context
[Diagram: Document collection; Spider; Filter; IE (Segment, Classify, Associate, Cluster); Database; Data Mining (Discover patterns: entity types, links / relations, events); Actionable knowledge (Prediction, Outlier detection, Decision support)]

Slide 37 Problem:
Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
1) DM begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Slide 38 Solution:
[Diagram: the Slide 36 pipeline, with Uncertainty Info passed from IE to Data Mining and Emerging Patterns passed from Data Mining back to IE]

Slide 39 Solution: Unified Model
[Diagram: the Slide 36 pipeline, with the Database replaced by a Probabilistic Model shared by IE and Data Mining]
Discriminatively-trained undirected graphical models:
- Conditional Random Fields [Lafferty, McCallum, Pereira]
- Conditional PRMs [Koller], [Jensen], [Getoor], [Domingos]
Complex Inference and Learning: Just what we researchers like to sink our teeth into!

Slide 40 Larger-scale Joint Inference for IE
- What model structures will capture salient dependencies?
- Will joint inference improve accuracy?
- How to do inference in these large graphical models?
- How to efficiently train these models, which are built from multiple large components?

Slide 41 1. Jointly labeling cascaded sequences
Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
[Figure: stacked label sequences, bottom to top: English words, Part-of-speech, Noun-phrase boundaries, Named-entity tag]

Slide 43 1. Jointly labeling cascaded sequences
But errors cascade--must be perfect at every stage to do well.

Slide 44 1. Jointly labeling cascaded sequences
Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.
Inference: Tree reparameterization BP [Wainwright et al, 2002]

Slide 45 2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
Senator Joe Green said today . Green ran for
Dependency among similar, distant mentions ignored.

Slide 46 2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
14% reduction in error on most repeated field in email seminar announcements.
Inference: Tree reparameterization BP [Wainwright et al, 2002]

Slide 47 3. Joint co-reference among all pairs
Affinity Matrix CRF [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
[Figure: mentions "... Mr Powell ...", "... Powell ...", "... she ..." joined pairwise by Y/N co-reference decisions, with affinities 45, 99, 11]
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]
Entity resolution, Object correspondence

Slide 48 Coreference Resolution
AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty"
Input: News article, with named-entity "mentions" tagged
Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ...
Output: Number of entities, N = 3
#1 Secretary of State Colin Powell, he, Mr. Powell, Powell
#2 Condoleezza Rice, she, Rice
#3 President Bush, Bush

Slide 49 Inside the Traditional Solution
Pair-wise Affinity Metric: Mention (3) "... Mr Powell ..." vs. Mention (4) "... Powell ...": Y/N?
Fires?  Feature                                           Weight
N       Two words in common                               29
Y       One word in common                                13
Y       "Normalized" mentions are string identical        39
Y       Capitalized word in common                        17
Y       > 50% character tri-gram overlap                  19
N       < 25% character tri-gram overlap                  -34
Y       In same sentence                                  9
Y       Within two sentences                              8
N       Further than 3 sentences apart                    -1
Y       "Hobbs Distance" < 3                              11
N       Number of entities in between two mentions = 0    12
N       Number of entities in between two mentions > 4    -3
Y       Font matches                                      1
Y       Default                                           -19
OVERALL SCORE = 98 > threshold=0

Slide 50 The Problem
... Mr Powell ... Powell ... she ...
affinity = 98, affinity = 11, affinity = 104 (pair-wise decisions: Y, Y, N)
Pair-wise merging decisions are being made independently from each other. They should be made in relational dependence with each other.
Affinity measures are noisy and imperfect.

Slide 51 A Generative Model Solution [Russell 2001], [Pasula et al 2002], [Milch et al 2003], [Marthi et al 2003]
(Applied to citation matching, and object correspondence in vision)
[Figure: graphical model with nodes id, words, context, distance, fonts; id, surname, age, gender; N]

Slide 52 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003, ICML]
[Figure: mentions "... Mr Powell ...", "... Powell ...", "... she ..." joined pairwise by Y/N decisions, with edge weights 45, 30, 11]
Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.

Slide 54 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
[Figure: decisions Y, N, N; score shown: 44; edge terms 45), 30), (11)]

Slide 55 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
[Figure: decisions Y, Y, N; score shown: infinity; edge terms 45), 30), (11)]

Slide 56 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003]
[Figure: decisions N, Y, N; score shown: 64; edge terms 45), 30), (11)]

Slide 57 Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
[Figure: mentions "... Mr Powell ...", "... Powell ...", "... she ...", "... Condoleezza Rice ..." with edge weights 45, 11, 30, 134, 10, 106]

Slide 58 Inference in these MRFs = Graph Partitioning
[Figure: same graph and edge weights; partition score shown: = 22]

Slide 59 Inference in these MRFs = Graph Partitioning
[Figure: same graph and edge weights; partition score shown: = 314]

Slide 60 Co-reference Experimental Results [McCallum & Wellner, 2003]
Proper noun co-reference
DARPA ACE broadcast news transcripts, 117 stories
                           Partition F1   Pair F1
Single-link threshold      16 %           18 %
Best prev match [Morton]   83 %           89 %
MRFs                       88 %           92 %
Error reduction            30%            28%
DARPA MUC-6 newswire article corpus, 30 stories
                           Partition F1   Pair F1
Single-link threshold      11 %           7 %
Best prev match [Morton]   70 %           76 %
MRFs                       74 %           80 %
Error reduction            13%            17%

Slide 61 Joint Co-reference Decisions, Discriminative Model [Culotta & McCallum 2005]
[Figure: People mentions "Stuart Russell" and "S. Russel" joined by a Y/N decision]

Slide 62 Co-reference for Multiple Entity Types [Culotta & McCallum 2005]
[Figure: People mentions "Stuart Russell" and "S. Russel"; Organizations mentions "University of California at Berkeley" and "Berkeley"; each pair joined by a Y/N decision]

Slide 63 Joint Co-reference of Multiple Entity Types [Culotta & McCallum 2005]
Reduces error by 22%

Slide 64 Joint Co-reference Experimental Results [Culotta & McCallum 2005]
CiteSeer Dataset: 1500 citations, 900 unique papers, 350 unique venues
                  Paper            Venue
                  indep   joint    indep   joint
constraint        88.9    91.0     79.4    94.1
reinforce         92.2    92.2     56.5    60.1
face              88.2    93.7     80.9    82.8
reason            97.4    97.0     75.6    79.5
Micro Average     91.7    93.4     73.1    79.1
Error reduction   20% (Paper), 22% (Venue)

Slide 66 4. Joint segmentation and co-reference
Extraction from and matching of research paper citations.
[Figure: graphical model with nodes o, s, c, y, p; labels: Segmentation, Citation attributes, Co-reference decisions, Database field values, World Knowledge]
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: Variant of Iterated Conditional Modes [Besag, 1986]
[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

Slide 67 4. Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates)
Paper database, with fields, clean, duplicates collapsed:
AUTHORS             TITLE      VENUE
Cowell, Dawid       Probab     Springer
Montemerlo, Thrun   FastSLAM   AAAI
Kjaerulff           Approxi    Technic

Slide 68 Citation Segmentation and Coreference
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990.

Slide 69 Citation Segmentation and Coreference
1) Segment citation fields

Slide 70 Citation Segmentation and Coreference
1) Segment citation fields
2) Resolve coreferent citations (Y ? N)

Slide 71 Citation Segmentation and Coreference
1) Segment citation fields
2) Resolve coreferent citations (Y ? N)
Segmentation Quality   Citation Co-reference (F1)
No Segmentation        78%
CRF Segmentation       91%
True Segmentation      93%

Slide 72 Citation Segmentation and Coreference
1) Segment citation fields
2) Resolve coreferent citations (Y ? N)
3) Form canonical database record (resolving conflicts)
AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Slide 73 Citation Segmentation and Coreference
Steps 1) to 3) as above. Perform jointly.
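
Step 3, forming a canonical record from citations already judged coreferent, can be illustrated with a small sketch. The merge rule here (per field, most frequent value, ties broken by the longest string) is a simple stand-in chosen for illustration; it is an assumption, not the joint model the talk advocates. The field segmentation of the two example citations is also done by hand for this example.

```python
from collections import Counter

def canonical_record(mentions):
    """Merge field dicts of coreferent citations into one record.

    Heuristic (an assumption, not the talk's model): per field, take the
    most frequent value and break ties by preferring the longest string.
    """
    record = {}
    fields = {f for m in mentions for f in m}
    for field in sorted(fields):
        counts = Counter(m[field] for m in mentions if field in m)
        record[field] = max(counts, key=lambda v: (counts[v], len(v)))
    return record

# Hand-segmented versions of the two citations shown on the slides.
mentions = [
    {"AUTHOR": "Laurel, B.",
     "TITLE": "Interface Agents: Metaphors with Character",
     "BOOKTITLE": "The Art of Human-Computer Interface Design",
     "EDITOR": "T. Smith", "PUBLISHER": "Addison-Wesley", "YEAR": "1990"},
    {"AUTHOR": "Brenda Laurel",
     "TITLE": "Interface Agents: Metaphors with Character",
     "BOOKTITLE": "The Art of Human-Computr Interface Design",
     "EDITOR": "Smith", "PAGES": "355-366", "YEAR": "1990"},
]

record = canonical_record(mentions)
```

On this pair the heuristic happens to recover the record shown on Slide 72, but only because the fuller variant of each field is also the correct one. A pipeline like this cannot feed co-reference evidence back into segmentation, which is the motivation for performing the steps jointly.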

Slide 74 IE + Coreference Model
Observed citation (x): J Besag 1986 On the
CRF Segmentation (s): AUT AUT YR TITL TITL

Slide 75 IE + Coreference Model
Observed citation (x): J Besag 1986 On the
CRF Segmentation (s)
Citation mention attributes (c): AUTHOR = J Besag, YEAR = 1986, TITLE = On the

Slide 76 IE + Coreference Model
Structure (x, s, c) for each citation mention:
- J Besag 1986 On the
- Smyth. 2001 Data Mining
- Smyth, P Data mining

Slide 77 IE + Coreference Model
Binary coreference variables for each pair of mentions

Slide 78 IE + Coreference Model
Binary coreference variables for each pair of mentions: y, n, n

Slide 79 IE + Coreference Model
Coreference variables: y, n, n
Research paper entity attribute nodes: AUTHOR = P Smyth, YEAR = 2001, TITLE = Data Mining, ...

Slide 80 IE + Coreference Model
Coreference variables: y, y, y
Research paper entity attribute node

Slide 81 IE + Coreference Model
Coreference variables: y, n, n

Slide 82 Such a highly connected graph makes exact inference intractable

Slide 83 Approximate Inference 1: Loopy Belief Propagation
[Figure: nodes v1 ... v6 with messages m1(v2), m2(v3), m3(v2), m2(v1)]
messages passed between nodes

Slide 84 Loopy Belief Propagation vs. Generalized Belief Propagation
[Figure: Loopy BP on nodes v1 ... v6 with messages m1(v2), m2(v3), m3(v2), m2(v1); Generalized BP on nodes v1 ... v9 grouped into regions]
Loopy BP: messages passed between nodes
Generalized BP: messages passed between regions
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with the size of the overlap between regions!
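
The node-to-node message passing described above can be sketched as sum-product loopy belief propagation on a small pairwise Markov network. This is a generic textbook-style sketch, not the implementation used in the work described here: the function name `loopy_bp`, the dictionary-based graph encoding, the parallel update schedule, and the potentials in the example are all assumptions made for illustration.

```python
def loopy_bp(unary, pairwise, iters=50):
    """Sum-product loopy BP for a pairwise MRF over discrete variables.

    unary:    {node: [phi(x) for each state x]}
    pairwise: {(i, j): table} with table[x_i][x_j] = psi(x_i, x_j),
              one entry per undirected edge
    Returns approximate marginals {node: [belief(x) for each state x]}.
    """
    nbrs = {n: [] for n in unary}
    for i, j in pairwise:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def psi(i, j, xi, xj):
        # Read the edge table regardless of the direction it was stored in.
        if (i, j) in pairwise:
            return pairwise[(i, j)][xi][xj]
        return pairwise[(j, i)][xj][xi]

    # msgs[(i, j)][xj] is the message from node i to node j; start uniform.
    msgs = {(i, j): [1.0] * len(unary[j]) for i in nbrs for j in nbrs[i]}

    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    prod = unary[i][xi] * psi(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:  # exclude the message that came from j
                            prod *= msgs[(k, i)][xi]
                    total += prod
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]  # normalize for stability
        msgs = new

    beliefs = {}
    for n in unary:
        b = list(unary[n])
        for k in nbrs[n]:
            b = [b[x] * msgs[(k, n)][x] for x in range(len(b))]
        z = sum(b)
        beliefs[n] = [v / z for v in b]
    return beliefs

# A three-node loop with made-up potentials that favor equal neighbors.
agree = [[2.0, 1.0], [1.0, 2.0]]
unary = {"v1": [1.0, 3.0], "v2": [1.0, 1.0], "v3": [2.0, 1.0]}
pairwise = {("v1", "v2"): agree, ("v2", "v3"): agree, ("v1", "v3"): agree}
beliefs = loopy_bp(unary, pairwise)
```

On a tree-shaped graph these updates give exact marginals; on a loopy graph such as the triangle above they are only an approximation, and convergence is not guaranteed in general.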
Approximate Inference 1 Slide 85 Iterated Conditional Modes (ICM) [Besag 1986] v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 6 i+1 = argmax P(v 6 i | v \ v 6 i ) v6iv6i = held constant Approximate Inference 2 Slide 86 Iterated Conditional Modes (ICM) [Besag 1986] v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 5 j+1 = argmax P(v 5 j | v \ v 5 j ) v5jv5j = held constant Approximate Inference 2 Slide 87 Iterated Conditional Modes (ICM) [Besag 1986] v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 4 k+1 = argmax P(v 4 k | v \ v 4 k ) v4kv4k = held constant Approximate Inference 2 but greedy, and easily falls into local minima. Structured inference scales well here, Slide 88 Iterated Conditional Modes (ICM) [Besag 1986] Iterated Conditional Sampling (ICS) (our name) Instead of selecting only argmax, sample of argmaxes of P(v 4 k | v \ v 4 k ) e.g. an N-best list (the top N values) v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 4 k+1 = argmax P(v 4 k | v \ v 4 k ) v4kv4k = held constant v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 Approximate Inference 2 Can use Generalized Version of this; doing exact inference on a region of several nodes at once. Here, a message grows only linearly with overlap region size and N! Slide 89 Features of this Inference Method 1)Structured or factored representation (ala GBP) 2)Uses samples to approximate density 3)Closed-loop message-passing on loopy graph (ala BP) Beam search Forward-only inference Particle filtering, e.g. [Doucet 1998] Usually on tree-shaped graph, or feedforward only. MC SamplingEmbedded HMMs [Neal, 2003] Sample from high-dim continuous state space; do forward-backward Sample Propagation [Paskin, 2003] Messages = samples, on a junction tree Fields to Trees [Hamze & de Freitas, UAI 2003] Rao-Blackwellized MCMC, partitioning G into non-overlapping trees Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002] Combination of Particle Filtering and Boyan-Koller for DBNs Related Work Slide 90 IE + Coreference Model Exact inference on these linear-chain regions J Besag 1986 On the Smyth. 
2001 Data Mining; Smyth, P Data mining. From each chain pass an N-best list into coreference
Slide 91 IE + Coreference Model. Mentions: J Besag 1986 On the; Smyth. 2001 Data Mining; Smyth, P Data mining. Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction. Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000]
Slide 92 Inference: Sample = N-best List from CRF Segmentation. Coreference values y ? n.
Candidate segmentations (Name | Title):
Laurel, B | Interface Agents: Metaphors with Character The
Laurel, B. | Interface Agents: Metaphors with Character
Laurel, B. Interface Agents Metaphors with Character
When calculating similarity with another citation, have more opportunity to find correct, matching fields.
Candidate segmentations (Name | Title | Book Title | Year):
Laurel, B. Interface | Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
Laurel, B. | Interface Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
Laurel, B. Interface | Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
Slide 93 IE + Coreference Model. Coreference values y n n. Mentions: J Besag 1986 On the; Smyth. 2001 Data Mining; Smyth, P Data mining. Exact (exhaustive) inference over entity attributes
Slide 94 IE + Coreference Model. Coreference values y n n. Mentions: J Besag 1986 On the; Smyth. 2001 Data Mining; Smyth, P Data mining. Revisit exact inference on IE linear chain, now conditioned on entity attributes
Slide 95 Parameter Estimation. Coreference values y n n. Separately for different regions. IE linear-chain: exact MAP. Coref graph edge weights: MAP on individual edges. Entity attribute potentials: MAP, pseudo-likelihood. In all cases: climb MAP gradient with quasi-Newton method
Slide 96 Experimental Results. Set of citations from CiteSeer. 1500 citation mentions to 900 paper entities. Hand-labeled for coreference and field-extraction. Divided into 4 subsets, each on a different topic: RL, Face detection, Reasoning, Constraint Satisfaction. Within each subset many citations share authors, publication venues, publishers, etc.
70% of the citation mentions are singletons
Slide 97 4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]. Extraction from and matching of research paper citations. Figure: nodes o, s, c, p with coreference values y y y; labels: Segmentation, Citation attributes, Co-reference decisions, Database field values, World Knowledge. Example citations: Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990. Inference: Variant of Iterated Conditional Modes [Besag, 1986]. 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference.
Slide 98 Coreference Results. Coreference cluster recall. Columns: N, Reinforce, Face, Reason, Constraint.
1 (Baseline): 0.946, 0.96, 0.94, 0.96
3: 0.95, 0.98, 0.96
7: 0.95, 0.98, 0.95, 0.97
9: 0.982, 0.97, 0.96, 0.97
Optimal: 0.99
Average error reduction is 35%. Optimal makes best use of N-best list by using true labels. Indicates that even more improvement can be obtained
Slide 99 Information Extraction Results. Segmentation F1. Columns: Reinforce, Face, Reason, Constraint.
Baseline: .943, .908, .929, .934
w/ Coref: .949, .914, .935, .943
Err. Reduc.: .101, .062, .090, .142
P-value: .0442, .0014, .0001
Error reduction ranges from 6-14%. Small, but significant at 95% confidence level (p-value < 0.05). Biggest limiting factor in both sets of results: data set is small, and does not have large coreferent sets.
Slide 100 Parameter Estimation. Coreference values y n n. Separately for different regions. IE linear-chain: exact MAP. Coref graph edge weights: MAP on individual edges. Entity attribute potentials: MAP, pseudo-likelihood. In all cases: climb MAP gradient with quasi-Newton method
Slide 101 Outline. Examples of IE and Data Mining.
Brief review of Conditional Random Fields. Joint inference: Motivation and examples. Joint Labeling of Cascaded Sequences (Belief Propagation). Joint Labeling of Distant Entities (BP by Tree Reparameterization). Joint Co-reference Resolution (Graph Partitioning). Joint Segmentation and Co-ref (Iterated Conditional Samples). Interactive IE. Two example projects: Email, contact management, and Social Network Analysis; Research Paper search and analysis
Slide 102 Interactive Information Extraction with End-Users. Correction for Classification: Easy. Often found in user interfaces, e.g. Apple Mail. Example message: Seminar: How to Organize your Life by Jane Smith, Stevenson & Smith Mezzanine Level, Papadopoulos Sq 3:30 pm Thursday March 31 In this seminar we will learn how to use CALO to... Choices: Seminar announcement, Todo request, Other. Correction for Extraction: Painful. Difficult even for paid labelers. Complex tools. Example message: Seminar: How to Organize your Life by Jane Smith, Stevenson & Smith Mezzanine Level, Papadopoulos Sq 3:30 pm Thursday March 31 In this seminar we will learn how to use CALO to... Click, drag, adjust, label,...
Slide 103 Multiple-choice Annotation for Interactive IE with End-Users [Culotta, McCallum 2005]. Task: Information Extraction. Fields: NAME COMPANY ADDRESS (and others). Candidates: Jane Smith, Stevenson & Smith, Mezzanine Level, Papadopoulos Sq.; Jane Smith, Stevenson & Smith Mezzanine Level, Papadopoulos Sq. User corrects labels, not segmentations. Interface presents top hypothesized segmentations
Slide 104 Multiple-choice Annotation for Interactive IE with End-Users [Culotta, McCallum 2005]. Candidates: Jane Smith, Stevenson & Smith, Mezzanine Level, Papadopoulos Sq.; Jane Smith, Stevenson & Smith Mezzanine Level, Papadopoulos Sq. User corrects labels, not segmentations. Interface presents top hypothesized segmentations. Task: Information extraction.
Fields: NAME COMPANY ADDRESS (and others)
Slide 105 Multiple-choice Annotation for Interactive IE with End-Users [Culotta, McCallum 2005]. Candidates: Jane Smith, Stevenson & Smith, Mezzanine Level, Papadopoulos Sq.; Jane Smith, Stevenson & Smith Mezzanine Level, Papadopoulos Sq. 29% reduction in user actions needed to train. Interface presents top hypothesized segmentations. Task: Information extraction. Fields: NAME COMPANY ADDRESS (and others)
Slide 106 Outline. Examples of IE and Data Mining. Brief review of Conditional Random Fields. Joint inference: Motivation and examples. Joint Labeling of Cascaded Sequences (Belief Propagation). Joint Labeling of Distant Entities (BP by Tree Reparameterization). Joint Co-reference Resolution (Graph Partitioning). Joint Segmentation and Co-ref (Iterated Conditional Samples). Interactive IE. Two example projects: Email, contact management, and Social Network Analysis; Research Paper search and analysis
Slide 107 Managing and Understanding Connections of People in our Email World. Workplace effectiveness ~ Ability to leverage network of acquaintances. But filling Contacts DB by hand is tedious, and incomplete. Figure: Email Inbox, Contacts DB, WWW; label: Automatically
Slide 108 System Overview. Contact Info and Person Name Extraction; Person Name Extraction; Name Coreference; Homepage Retrieval; Social Network Analysis; Keyword Extraction. Figure labels: CRF, WWW, names, Email
Slide 109 An Example. To: Andrew McCallum [email protected] Subject... First Name: Andrew Middle Name: Kachites Last Name: McCallum Job Title: Associate Professor Company: University of Massachusetts Street Address: 140 Governors Dr.
City: Amherst State: MA Zip: 01003 Company Phone: (413) 545-1323 Links: Fernando Pereira, Sam Roweis, Key Words: Information extraction, social network, Search for new people
Slide 110 Summary of Results.
Contact info and name extraction performance (25 fields). Columns: Token Acc, Field Prec, Field Recall, Field F1.
CRF: 94.50, 85.73, 76.33, 80.76
Example keywords extracted (Person: Keywords).
William Cohen: Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller: Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness: Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell: Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid stove-piping in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
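As a quick check on the results table for Slide 110, the reported Field F1 is consistent with the usual harmonic mean of field precision and field recall. The sketch below assumes that standard definition (the slide does not spell it out), and `f1_score` is a hypothetical helper name used only here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field precision and recall for the CRF row, in percent, as given on the slide.
crf_f1 = f1_score(85.73, 76.33)
print(round(crf_f1, 2))  # 80.76
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, which is why the gap between precision and recall pulls the score toward recall.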
Slide 111 Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Slide 112 Example topics induced from a large collection of text [Tenenbaum et al]
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
Slide 113
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE
ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
Example topics induced from a large collection of text [Tenenbaum et al]
Slide 114 From LDA to Author-Recipient-Topic (ART)
Slide 115 Inference and Estimation. Gibbs Sampling: - Easy to implement - Reasonably fast
Slide 116 Enron Email Corpus. 250k email messages, 23k people. Example message: Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT) From: [email protected] To: [email protected] Subject: Enron/TransAltaContract dated Jan 1, 2001 Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP Debra Perlingiere Enron North America Corp.
Legal Department 1400 Smith Street, EB 3885 Houston, Texas 77002 [email protected]
Slide 117 Topics, and prominent sender/receivers discovered by ART
Slide 118 Beck = Chief Operations Officer. Dasovich = Government Relations Executive. Shapiro = Vice President of Regulatory Affairs. Steffes = Vice President of Government Affairs
Slide 119 Comparing Role Discovery. Connection strength (A,B) = Traditional SNA: distribution over recipients. Author-Topic: distribution over authored topics. ART: distribution over authored topics
Slide 120 Comparing Role Discovery: Tracy Geaconne and Dan McCarty. Methods: Traditional SNA, Author-Topic, ART. Results shown: Similar roles; Different roles. Geaconne = Secretary. McCarty = Vice President
Slide 121 Comparing Role Discovery: Tracy Geaconne and Rod Hayslett. Traditional SNA: Different roles. Author-Topic: Very similar. ART: Not very similar. Geaconne = Secretary. Hayslett = Vice President & CTO
Slide 122 Comparing Role Discovery: Lynn Blair and Kimberly Watson. Traditional SNA: Different roles. Author-Topic: Very different. ART: Very similar. Blair = Gas pipeline logistics. Watson = Pipeline facilities planning
Slide 123 Comparing Group Discovery: Enron TransWestern Division. Methods: Traditional SNA, Author-Topic, ART. Results shown: Block structured; Not
Slide 124 McCallum Email Corpus 2004. January - October 2004. 23k email messages, 825 people. Example message: From: [email protected] Subject: NIPS and.... Date: June 14, 2004 2:27:41 PM EDT To: [email protected] There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate
Slide 125 McCallum Email Blockstructure
Slide 126 Four most prominent topics in discussions with ____?
Slide 127
Slide 128 Two most prominent topics in discussions with ____?
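The role-discovery comparisons above rest on comparing, for two people, the distributions each method assigns them (over recipients or over topics). One common way to compare two such distributions is the Jensen-Shannon divergence; the slides do not name the measure used, so treat this as an assumption. The function names and the toy numbers below are hypothetical and are not taken from the Enron or McCallum corpora.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy topic distributions for three hypothetical people (illustrative only).
person_a = [0.70, 0.20, 0.10]
person_b = [0.65, 0.25, 0.10]
person_c = [0.05, 0.15, 0.80]

# A smaller divergence suggests more similar roles under this measure.
print(js_divergence(person_a, person_b) < js_divergence(person_a, person_c))
```

A symmetric measure is a natural fit here because "A has a role similar to B" should not depend on which person is listed first.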
Slide 129 Topic 37
Slide 130 Topic 40
Slide 131
Slide 132 Pairs with highest rank difference between ART & SNA. 5 other professors. 3 other ML researchers
Slide 133 Role-Author-Recipient-Topic Models
Slide 134 Outline. Examples of IE and Data Mining. Brief review of Conditional Random Fields. Joint inference: Motivation and examples. Joint Labeling of Cascaded Sequences (Belief Propagation). Joint Labeling of Distant Entities (BP by Tree Reparameterization). Joint Co-reference Resolution (Graph Partitioning). Joint Segmentation and Co-ref (Iterated Conditional Samples). Interactive IE. Two example projects: Email, contact management, and Social Network Analysis; Research Paper search and analysis
Slide 135 Previous Systems
Slide 136
Slide 137 Previous Systems: Research Paper, Cites
Slide 138 More Entities and Relations: Research Paper, Cites, Person, University, Venue, Grant, Groups, Expertise
Slide 139
Slide 140
Slide 141
Slide 142
Slide 143
Slide 144
Slide 145
Slide 146 Summary. Conditional Random Fields: conditional probability models of structured data. Data mining complex unstructured text suggests the need for joint inference IE + DM. Early examples: Factorial finite state models; Jointly labeling distant entities; Coreference analysis; Segmentation uncertainty aiding coreference. Interactive IE: Bring IE to the masses! Current projects: Email, contact management, expert-finding, SNA; Mining the scientific literature
Slide 147 End of Talk