
Web Scale Information Extraction: 'Traditional' Methods and Bootstrapping Methods

Information Extraction at the Web scale with a minimal amount of training data

Andrei Lopatenko

Google

The Free University of Bozen-Bolzano, 2012


Outline

1 Web Scale Information Extraction

2 ’Traditional’ methods

3 Bootstrapping Methods
    Is extracted information correct
    Semantic Drift
    Synonyms
    Open Relation Extraction
    Inference


Large Scale Information Extraction

Can a machine read? Web Scale Information Extraction is about converting the web into a database.

Can a machine extract information about real-world objects and the relations between them from the web?

Can a machine do it with minimal involvement of humans?


Large Scale Information Extraction

IE systems developed in academia

TextRunner, KnowItAll, Sherlock, Holmes: fact extractors, learning systems, etc. from the University of Washington (Etzioni et al.), aimed at automatic, unsupervised understanding of text

NELL, the Never-Ending Language Learner [1], is a coupled semi-supervised learning system developed at CMU which learns objects, their categories, and relations between objects from web text.

WOE, REVERB, R2A2


Hearst’s Acquisition of Hyponyms

proposed in 1992 [2]

acquires hyponyms only (is-a), such as "Enrico Franconi is a Professor of Computer Science"

it is a pattern-based approach

based on the observation that some relations, such as hypernymy and meronymy, are expressed using a small number of lexico-syntactic patterns

does not handle noise in the input data


Hearst’s patterns

Patterns

such NP as (NP ,)* (or | and) NP

NP (, NP)* or other NP

NP (, NP)* , and other NP

NP (,) including (NP ,)* (or | and) NP

NP (,) especially (NP ,)* (or | and) NP
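To make the idea concrete, here is a minimal sketch (not Hearst's implementation) of one such pattern, "NP such as NP, NP and NP". A crude capitalized-phrase approximation stands in for a real NP chunker:

import re

# hypothetical NP approximation: a run of capitalized words
NP = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"
SUCH_AS = re.compile(rf"({NP})\s+such as\s+({NP}(?:\s*,\s*{NP})*(?:\s*,?\s*(?:and|or)\s+{NP})?)")

def hearst_such_as(sentence):
    # yield (hyponym, hypernym) pairs matched by the "such as" pattern
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in re.split(r"\s*,\s*|\s+(?:and|or)\s+", match.group(2)):
            if hyponym:
                yield (hyponym, hypernym)

print(list(hearst_such_as("Countries such as Italy, France and Spain joined.")))
# [('Italy', 'Countries'), ('France', 'Countries'), ('Spain', 'Countries')]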


Difficulties

nouns frequently occur in plural forms: "Enrico Franconi, ... and other professors of Computer Science"

nouns are often modified by comparatives ('important') and other modifiers ('five', 'certain', 'many')

synonyms (to be discussed later in the RESOLVER section)

too-generic extractions ("is an item"), context-dependent extractions ("aircraft is a target")


Patterns

Problems: Hearst patterns are intended to extract the hyponymy (is-a) relation.

Other patterns are required to extract other relations.

Patterns must be specified manually per relation; a lot of work is required to provide a comprehensive set of patterns per relation.

Idea: starting with a minimal set of patterns, mine further patterns from text -> bootstrapping


What are bootstrapping methods

Bootstrapping methods start with a small number of examples ("seeds") and iteratively grow the collection of labels.

They have proved effective in information extraction, in classification problems such as page classification [3] and named entity classification [4], in machine translation, and in parsing [5].


How bootstrapping works: An example

Input

A labeled set Λ of companies and the locations of their headquarters: Apple in Cupertino, Google in Mountain View, Facebook in Palo Alto

Step 1: Collect instances from the web.

shareholders meetings at Apple's Cupertino headquarters

the iPhone lineup has momentum and Apple, based in Cupertino

Google, based in Mountain View, California, plans to sell the company's stake

Chrome at Google's head office in Mountain View

speaks at a press event at Facebook headquarters in Palo Alto


How bootstrapping works: An example

Step 2: Extract Patterns

COMPANYNAME’s LOCATION headquarters .

COMPANYNAME, based in LOCATION

COMPANYNAME’s head office in LOCATION

COMPANYNAME headquarters in LOCATION

Add them to the pattern set Ψ

Step 3. Run extraction using patterns

EnerG2, based in Seattle

Riverstone, based in New York

New Energy Cities, a nonprofit group based in Seattle

Peacocks’ head office in Cardiff


How bootstrapping works: An example

Step 4. Update labeled set Λ using new extractions

add EnerG2 in Seattle, Riverstone in New York, Peacocks in Cardiff

Step 5. Loop

go back to step 2
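A minimal, self-contained sketch of the Steps 1-5 loop above, run over a toy list of sentences rather than the web; matching seed strings literally stands in for real entity recognition:

import re

def bootstrap(seeds, sentences, iterations=2):
    labeled = set(seeds)   # the labeled set Lambda
    patterns = set()       # the pattern set Psi
    for _ in range(iterations):
        # Steps 1-2: find sentences containing a known pair, induce a surface pattern
        for company, location in list(labeled):
            for s in sentences:
                if company in s and location in s:
                    patterns.add(re.escape(s)
                                 .replace(re.escape(company), r"(?P<c>[A-Z][\w ]+)")
                                 .replace(re.escape(location), r"(?P<l>[A-Z][\w ]+)"))
        # Steps 3-4: apply every pattern to every sentence, grow the labeled set
        for p in patterns:
            for s in sentences:
                m = re.search(p, s)
                if m:
                    labeled.add((m.group("c").strip(), m.group("l").strip()))
        # Step 5: loop
    return labeled, patterns

sentences = ["Apple, based in Cupertino", "EnerG2, based in Seattle"]
print(bootstrap({("Apple", "Cupertino")}, sentences))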


Bootstrapping

Problems: Noise in the data; how many extractions of a fact f do we need before we start believing that f is true?

Semantic drift: an infection of semantic classes with erroneous terms or patterns/contexts

The same relations and entities might be expressed in various forms


Is extracted information correct

Correctness

Why incorrect

Web data are often incorrect because of opinions, etc. ("Humans are created by the Intelligent Design")

Parsers splitting a sentence might produce an incorrect parse ("US World and News Report published X"): the parser produces two NPs, "US World" and "News Report", so we get two erroneous facts: "US World published X" and "News Report published X"

we need a certain number of extractions of a fact f to be confident that it is true.


Is extracted information correct

Assessing correctness of extraction

Methods: Riloff and Jones (1999) count the number of distinct patterns generating an extraction and compute the reliability of patterns.

Culotta and McCallum (2004) use a CRF (Conditional Random Field) model to assess the probability that an extraction from a particular sentence is correct.

[6] proposes an urn-based combinatorial model to assess the correctness of an extraction, which performs better than many other models.


Is extracted information correct

Urn Model for Information Extraction

Basic definitions: The urn model used by TextRunner proved very productive in assessing correctness in many extraction problems, such as the correctness of an extracted fact or of two strings being synonyms.

Each extraction is modeled as a labeled ball in an urn, with one urn per relation.

A label represents either an instance of the target relation or an error.

Extraction is drawing a ball from an urn with replacement.


Is extracted information correct

Urn Model

Computation

p(x ∈ C | x appears k times in n draws) = 1 / (1 + A · (p_E / p_C)^k · e^{n(p_C − p_E)})    (1)

A is a parameter; p_C and p_E are the probabilities of extracting correct and incorrect facts, and usually p_C > p_E

The urn model allows estimating expected recall and precision based on the sample size

this model is generalized to encompass multiple rules (= multiple patterns per relation)
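A small numerical sketch of equation (1); the parameter values below (A, p_C, p_E, n) are made-up illustrations, not the estimates from [6]:

import math

def p_correct(k, n, A=10.0, p_c=0.1, p_e=0.01):
    # probability that a label is correct given it was drawn k times in n draws
    return 1.0 / (1.0 + A * (p_e / p_c) ** k * math.exp(n * (p_c - p_e)))

for k in (1, 3, 10):
    print(k, round(p_correct(k, n=20), 3))
# repeated extractions of the same fact push the probability toward 1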


Is extracted information correct

Urn Model

Results: [6] compares URNS with noisy-or, logistic regression, and SVM for supervised IE

URNS outperforms noisy-or by 19 percent, logistic regression by 10 percent, and SVM by 0.4 percent


Semantic Drift

Semantic Drift

What is semantic drift? Bootstrapping tends to produce generic patterns which extract unintended information

extraction patterns are underconstrained

"Google, based in California": California is a state, not a city.

Approaches to fix it

Mutual Exclusion Bootstrapping [7]

Constraint-Driven Learning [8]

Coupled Semi-Supervised Learning [1]


Semantic Drift

Coupled Learning

Use ontology knowledge to infer which information is wrong. Couple the semi-supervised learning of many functions to constrain learning [1]

Relation Argument Type Checking

Items of type A cannot be items of type B.

Use it to reject learned labels: "Google based in California" vs. "states such as Arizona, California, Texas"

Use it to generate more precise patterns: reject "based in", use "cities such as"

Soft constraints are used rather than hard constraints: [1] rejects a learned label if the number of positive examples does not exceed the number of negative examples by at least three times.


Semantic Drift

Coupled Learning

Relation Argument Type Checking (cont.)

This is a case of a compositional constraint: given two functions f1 : X1 → Y1 and f2 : X1 × X2 → Y2, the pair (x1, x2) constrains (y1, y2). The type check is ∀x1, x2 : f2(x1, x2) → f1(x1).


Semantic Drift

Coupled Learning

Use ontology knowledge to infer which information is wrong. Couple the semi-supervised learning of many functions to constrain learning [1]

Mutual Exclusion: items of type A and type B cannot be in a relation with the same item x,

or only one item might be in a relation with an item of type A.

This is a case of output constraints: for two functions fa : X → Ya and fb : X → Yb, the coupling implies constraints on the values ya, yb. A small sketch of both kinds of constraint follows.
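A minimal sketch (not NELL's implementation) of the two coupling constraints, over a tiny made-up ontology:

# made-up illustrative ontology: relation argument types and exclusive categories
ARG_TYPES = {"headquarteredIn": ("company", "city")}     # Domain / Range
MUTUALLY_EXCLUSIVE = {frozenset({"city", "state"})}      # a city is never a state
CATEGORY = {"Apple": "company", "Cupertino": "city", "California": "state"}

def type_check(relation, subj, obj):
    # reject a candidate fact whose arguments violate the relation's argument types
    dom, rng = ARG_TYPES[relation]
    return CATEGORY.get(subj) == dom and CATEGORY.get(obj) == rng

def mutually_exclusive(cat_a, cat_b):
    return frozenset({cat_a, cat_b}) in MUTUALLY_EXCLUSIVE

print(type_check("headquarteredIn", "Apple", "Cupertino"))    # True
print(type_check("headquarteredIn", "Apple", "California"))   # False: a state, not a city
print(mutually_exclusive("city", "state"))                    # True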


Synonyms

Synonyms

"D. C. is capital of United States"

"Washington is capital city of United States"

D. C. = Washington

"is capital of" = "is capital city of"

other names: cross-document entity coreference, paraphrase discovery


Synonyms

Spelling variation synonyms

spelling variations: acronyms, abbreviations, variants, spelling errors

RESOLVER [9] uses a simple string-similarity approach to evaluate the probability of co-reference:

P = (α · sim(s1, s2) + 1) / (α + β)

for entities, the Monge-Elkan string distance is used

the Levenshtein distance is used for relations

special solutions might be applied to deal with abbreviations and acronyms
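A small sketch of this probability using a Levenshtein-based similarity (the measure mentioned above for relation strings); the values of α and β are made-up illustrations, not RESOLVER's tuned values:

def levenshtein(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(s1, s2):
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2), 1)

def coreference_probability(s1, s2, alpha=20.0, beta=5.0):
    return (alpha * similarity(s1, s2) + 1.0) / (alpha + beta)

print(round(coreference_probability("is capital of", "is capital city of"), 2))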


Synonyms

Synonyms

RESOLVER: The majority of work on synonym discovery used distributional similarity metrics to find synonyms, based on the assumption that if the contexts of words are similar, then the words are similar.

RESOLVER [9] uses an urn combinatorial model to build a similarity metric.

The Extracted Shared Property Model (ESP) takes as input a set of extractions for two strings, computes the similarity of assertions, and outputs the probability that the two strings co-refer to the same entity.


Synonyms

Synonyms

RESOLVER: a pair of strings (r, s) is a property of a string o if there is an assertion (r, o, s) or (r, s, o)

a pair of strings s1, s2 is an instance of r if there is an assertion (r, s1, s2)

let k be the number of observed properties which are the same for s1 and s2

let n1 be the number of properties extracted for s1, and n2 the number of properties extracted for s2


Synonyms

Synonyms

Some combinatorial computations

P(R_{i,j} | D_i, D_j, P_i, P_j) = Σ_{S_{i,j} = k}^{P_min} P(k | n_i, n_j, P_i, P_j, S_{i,j})

where P(k | n_i, n_j, P_i, P_j, S_{i,j}) is obtained from Count(k, n_i, n_j | P_i, P_j, S_{i,j}), a combinatorial count of the ways the observed extractions can arise (the full counting argument is in [9])

crucial: some hidden parameters P_i, P_j are unknown and require experimental estimates


Synonyms

Synonyms

RESOLVER results: for objects, 0.7 precision and 0.66 recall are reported, significantly higher than the precision and recall of other methods (0.5, 0.4)

for relations, 0.9 precision and 0.35 recall are reported vs. (0.6, 0.3)


Synonyms

Synonyms

Problems: extraction errors (an entity might be split by the extraction system into <b> US NEWS <e> and <b> World Report <e>) and then synonymized because of similar contexts; fixed as extraction is fixed

similar contexts for similar entities (Asia and Africa)

multiple word senses (Apple, President, even President Bush)


Synonyms

Synonyms

Solutions (partial) for similar entities

two names seen in too narrow a context (a sentence, for example) too many times

functional predicates to prove that terms are not synonyms

weighting of predicates


Synonyms

Synonyms

ConceptResolver (NELL) [10]

finds multiple senses for noun phrases: apple -> [apple computer, apple fruit]

finds synonyms for noun phrases, i.e. maps a set of noun phrases into concepts

developed as a part of the NELL project

takes as input a set of extracted relation and category instances and produces a set of concepts and the noun phrases associated with these concepts

ex.: [kaspersky labs], [kaspersky], [kaspersky lab]

ex.: [nielsen media research], [nielsen company]


Synonyms

NELL Synonyms

Word Sense Induction: uses is-a relations extracted from the text to learn the categories to which instances belong.

apple is a fruit, apple is a company

creates senses, which are (instance, category) pairs, for each instance.

categories might be learned from is-a relations directly mentioned in the text (Hearst patterns), or from the relations in which an entity participates together with derived information about the Domain and Range of those relations (Domain(ceoOfCompany) = person, Range(ceoOfCompany) = Company)


Synonyms

NELL Synonyms

Synonym Resolution algorithm

for each category C:
    initialize labeled data L with 10 positive and 50 negative examples (pairs of senses)
    initialize unlabeled data U by running canopies on all senses of C
    repeat:
        train the string similarity classifier on L
        train the relation classifier on L
        label U with each classifier
        add the 5 most confident positive and 25 negative predictions to L
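A compact, runnable sketch of this co-training style loop (not ConceptResolver's code); the two classifiers are stand-in scoring functions over toy sense pairs, and the canopy step is omitted:

def string_score(pair):
    # stand-in for the string similarity classifier: character-set Jaccard overlap
    a, b = pair
    return len(set(a) & set(b)) / len(set(a) | set(b))

def relation_score(pair):
    # stand-in for the relation classifier: fraction of shared tokens
    a, b = pair
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def co_train(labeled, unlabeled, rounds=10, n_pos=5, n_neg=25):
    # each round, both classifiers score the unlabeled pool and the most
    # confident predictions are moved into the labeled set L
    for _ in range(rounds):
        if not unlabeled:
            break
        ranked = sorted(unlabeled, reverse=True,
                        key=lambda p: max(string_score(p), relation_score(p)))
        positives, rest = ranked[:n_pos], ranked[n_pos:]
        negatives = rest[len(rest) - n_neg:] if rest else []
        labeled += [(p, True) for p in positives] + [(p, False) for p in negatives]
        unlabeled = rest[:len(rest) - len(negatives)]
    return labeled

pairs = [("kaspersky labs", "kaspersky lab"), ("nielsen media research", "nielsen company"),
         ("apple", "apple computer"), ("apple inc", "kaspersky lab")]
print(co_train([], pairs, n_pos=1, n_neg=1))   # batch sizes shrunk for the toy pool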


Synonyms

NELL Synonyms

Results: precision 0.7-0.9, recall 0.3-0.9


Open Relation Extraction

Open Relation Extraction

The information extraction described on the previous slides extracted information for a predefined set of relations, given by an ontology and training labels/seed examples. Open Relation Extraction is the extraction of relations which are not known in advance from a text corpus, together with learning patterns to extract these relations.

Open RE methods

CRF-based OpenIE [11]

Unsupervised clustering [12]

Ontology Extension System (OntExt) [13]


Open Relation Extraction

TextRunner’s O-CRF

CRFs (Conditional Random Fields) are undirected graphical models trained to maximize the conditional probability of a finite set of labels given a set of input observations.

[11] reduces the relation extraction problem to a sequence labeling problem by making a first-order Markov assumption about the dependencies among the output variables and arranging the variables sequentially on a linear chain.

CRFs are commonly applied to sequential labelling problems such as NER and POS tagging.


Open Relation Extraction

TextRunner’s O-CRF

Set of features: POS tags (predicted by a separate maximum entropy model), capitalization, punctuation, context words (in their case prepositions and determiners only), and conjunctions of features occurring in adjacent positions within 6 words of the current word.

O-CRF first applies a phrase chunker to identify noun phrases as candidates for extraction.

Generated entity pairs anchor the ends of a linear-chain CRF.
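A minimal sketch (not TextRunner's code) of the kind of token-level features listed above; POS tags are taken as given here rather than predicted by a maximum entropy model:

def token_features(tokens, pos_tags, i):
    # features for the i-th token: POS tag, capitalization, punctuation,
    # and a closed-class context word to its left
    word = tokens[i]
    return {
        "pos": pos_tags[i],
        "is_capitalized": word[:1].isupper(),
        "is_punct": not word.isalnum(),
        "prev_closed_class": i > 0 and tokens[i - 1].lower() in {"of", "in", "to", "the", "a"},
    }

print(token_features(["Apple", "is", "based", "in", "Cupertino"],
                     ["NNP", "VBZ", "VBN", "IN", "NNP"], 4))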


Open Relation Extraction

TextRunner’s O-CRF

Obtaining training data. Observation: the majority of relations are described in one of the following forms

Relation patterns (with examples):

E1 Verb E2 (X established Y)

E1 NP Prep E2 (X settlement with Y)

E1 to Verb E2 (X moved to Y)

E1 Verb E2 Noun (X is Y winner)

E1 (and | , | - | :) E2 NP (X-Y deal)

E1 (and | ,) E2 Verb (X, Y merge)


Open Relation Extraction

TextRunner’s O-CRF

Results: O-CRF was trained on 500 sentences

precision 0.9; recall 0.65 for Verb, 0.36 for Noun + Prep, 0.5 for Verb + Prep

compared to a closed-relation CRF, which needs thousands of labeled sentences to achieve a comparable level of precision

the recall of O-CRF is 3-4 times below the recall of traditional RE.


Open Relation Extraction

OntExt

Preprocessing

input: a category list with lists of entities (City, Country; Ottawa ∈ City, Canada ∈ Country)

tokenize and POS-tag sentences

find sentences which contain a pair of known category instances, and group them by category pair

the text between the two instances is called the context pattern: "Ottawa is a capital of Canada"

only frequent patterns are retained (> 5 occurrences in the experiment)

remove patterns with few instances of either category type (3 instances)

remove patterns which do not satisfy certain lexico-syntactic patterns (described on the "O-CRF training data" slide)


Open Relation Extraction

OntExt

Relation Extraction: for each pair of categories, build a context-to-context co-occurrence matrix

find clusters in the matrix (K-means clustering was applied)

rank the known instance pairs by their distance to the cluster center and take the top instances as seed instances for the relation (a small sketch of these two steps follows)
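A small sketch of the clustering step, over an illustrative co-occurrence matrix that is not from the paper:

import numpy as np
from sklearn.cluster import KMeans

# contexts observed between (City, Country) instance pairs;
# co_occurrence[i][j] = how often contexts i and j connect the same instance pair
contexts = [", capital of", ", the capital city of", ", located in", ", situated in"]
co_occurrence = np.array([[5, 4, 1, 0],
                          [4, 6, 0, 1],
                          [1, 0, 7, 5],
                          [0, 1, 5, 6]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(co_occurrence)
for cluster in range(2):
    members = [c for c, label in zip(contexts, kmeans.labels_) if label == cluster]
    print("candidate relation", cluster, ":", members)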


Open Relation Extraction

OntExt

Invalid relations: many invalid relations are generated because of errors in categories, semantic ambiguity, semantically incomplete relations, and illogical relations

normalized frequency counts (the frequency count for each category instance divided by the frequency count of the category instance with the maximum count)

for a pattern P, measure with how many categories it occurs (for category pairs without a subtype relation)

the number of connections per instance is used to detect non-informative patterns


Open Relation Extraction

OntExt

Results: ran over 500 million web pages, with 22,000 instances of 122 categories as input data

generated 781 relations

developed a classifier to classify relations as valid/invalid; 115/252 relations are valid by manual classification; classifier performance is 0.7 precision, 0.7 recall


Open Relation Extraction

O-CRF vs OntExt

OntExt learns the category types of the entities involved in relations

O-CRF is a single-pass system, whereas OntExt is multi-pass

O-CRF does not check the validity of extracted relations


Inference

Inference

We described methods to extract ground facts directly given in text. Can we infer additional information?

Inference Example 1

PlaysFor(John, NewYorkGiants)

PlaysInLeague(NewYorkGiants, NFL)

⇒ AthletePlaysInLeague(John,NFL)

Inference Example 2

Socrates is a man

all men are mortal

all men are Socrates


Inference

Inference Rules

Learning Inference Rules

Learning Horn clauses: a high-precision but low-coverage method.

Random walks on graphs [14].

The described methods learn probabilistic/soft inference rules rather than hard Horn clauses. The learned inference predicts the probability of a relationship rather than deducing that two entities are in the relationship.


Inference

Graph Random Walks [14]

Basic idea: Given a training relationship R(x, y), consider walks on a graph of entities and relationships which reach y starting from x.

With each walk, associate the probability of that walk considered as a random walk on the graph.

Consider the walks as features predicting R(x, y).

Train a predictor to predict R(x, y) given a vector of walks from x to y.


Inference

Graph Random Walks

Probability of the walk (ex. from [14])

isa(x1, ProfessionalAthlete), ..., isa(xn, ProfessionalAthlete),

AthletePlaysInLeague(xn, NFL)

a rule: isa(x1, c) ∧ isa⁻¹(c, x2) ∧ AthletePlaysInLeague(x2, y) ⇒ AthletePlaysInLeague(x1, y)

a walk P: isa(HinesWard, ProfessionalAthlete), isa⁻¹(ProfessionalAthlete, xn), AthletePlaysInLeague(xn, NFL)

the probability of P is 1 · 1/n · 1


Inference

Graph Random Walks

A probability of the walk

A relation path P between entities x and y is a sequence of relations R1, ..., Rl such that there exist x1, ..., x_{l-1} with R1(x, x1), R2(x1, x2), ..., Rl(x_{l-1}, y).

If P is the empty path, h_{x,P}(y) = 1 if x = y, and 0 otherwise.

If P is non-empty, define P' = R1, ..., R_{l-1}; then

h_{x,P}(y) = Σ_{y' ∈ P'(x)} h_{x,P'}(y') · P(y | y', Rl)

where P(y | y', Rl) is the probability of reaching y from y' by Rl, assuming a uniform distribution over the edges to walk.

Given a set of paths P1, ..., Pk from x to y, treat each h_{x,Pi}(y) as a path feature for a linear model Σ_i θ_i h_{x,Pi}(y),

and train a model on this feature set to predict R(x, y).
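A small, self-contained sketch of the path feature h_{x,P}(y) computed by this recursion over a toy knowledge graph; the graph and the path are made-up illustrations, not NELL's data:

# GRAPH[relation][node] -> neighbors reachable from node via relation
GRAPH = {
    "isa":     {"HinesWard": ["ProfessionalAthlete"]},
    "isa_inv": {"ProfessionalAthlete": ["HinesWard", "EliManning", "TomBrady"]},
    "playsInLeague": {"HinesWard": ["NFL"], "EliManning": ["NFL"], "TomBrady": ["NFL"]},
}

def h(x, path):
    # distribution over end nodes after walking `path` from `x`,
    # choosing uniformly among the edges of each relation
    dist = {x: 1.0}
    for rel in path:
        nxt = {}
        for node, p in dist.items():
            neighbors = GRAPH[rel].get(node, [])
            for nb in neighbors:
                nxt[nb] = nxt.get(nb, 0.0) + p / len(neighbors)
        dist = nxt
    return dist

# feature value h_{x,P}(NFL) for the path isa / isa^{-1} / playsInLeague
print(h("HinesWard", ["isa", "isa_inv", "playsInLeague"]).get("NFL", 0.0))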


Inference

Graph Random Walks

Results as reported in [14]:

discovers rules such as TeamHomeStadium(x, y) ⇐ teamPlaysInCity(x, z), cityStadiums(z, y)

TeamHomeStadium(x, y) ⇐ teamMember(x, z), athletePlaysforTeam(z, w), teamHomeStadium(w, y)

p@10 of 0.6 on average, p@100 of 0.6 on average

this inference technique can be applied to non-functional predicates, unlike N-FOIL


Inference

N-FOIL

Definition: NELL uses a variation of FOIL to extract Horn clauses, taking a set of positive and negative examples in the training set as input. But FOIL is computationally hard; N-FOIL simplifies it by assuming that the consequent predicates are functional. For a derived N-FOIL rule, the estimated probability is P = (N+ + m · prior) / (N+ + N− + m), where m = 5, prior = 0.5, and N+, N− are the numbers of positive and negative examples. N-FOIL learns a small number of high-precision rules.
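For example, with illustrative counts N+ = 10 and N− = 2, a rule's estimate is P = (10 + 5 · 0.5) / (10 + 2 + 5) = 12.5 / 17 ≈ 0.74.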


Inference

Sherlock Horn Clauses Learning

proposed as an alternative to ILP in [15]

learns statistically significant rules

learns minimal rules (containing no irrelevant terms in the body)

based on discriminative weight learning adapted to noisy data, such as data extracted from the web


Inference

Sherlock algorithm

Given a target relation R, a set E of observed examples of R, a maximum clause length k, a minimum support s, and a threshold t:

generate all first-order definite clauses up to length k where R appears as the head of the clause

retain clauses which contain no unbound variables

retain clauses which infer at least s examples and score at least t according to the score function


Inference

The score function of Sherlock

Statistical relevance: p(H|C) / p(H)

Statistical significance: Σ_{H ∈ {Head, ¬Head}} p(H|Body) · log( P(H|Body) / P(H|B′) )
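A tiny sketch of the two scores, computed from illustrative probabilities (the numbers are made up, not from [15]); the significance term is a KL-divergence-like sum over Head and ¬Head, with the second argument standing for P(H|B′):

import math

def statistical_relevance(p_head_given_body, p_head):
    return p_head_given_body / p_head

def statistical_significance(p_head_given_body, p_head_given_other_body):
    score = 0.0
    for p, q in ((p_head_given_body, p_head_given_other_body),
                 (1 - p_head_given_body, 1 - p_head_given_other_body)):
        score += p * math.log(p / q)
    return score

print(round(statistical_relevance(0.6, 0.1), 3))
print(round(statistical_significance(0.6, 0.1), 3))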


Inference

Evaluation of Sherlock

[15] reports a 5x increase in the number of facts deduced

while the number of high-quality facts (at 0.8 precision) increased 3x

56 percent of new facts are produced by multiple-relation clauses

Sources of errors: 1/3 of errors are due to metonymy and word sense ambiguity

1/3 of errors are due to inferences based on incorrectly extracted facts


Inference

Bibliography I

A. Carlson, J. Betteridge, R. Wang, E. Hruschka, and T. Mitchell, “Coupled semi-supervised learning for information extraction,” Proc. of WSDM, 2010.

M. A. Hearst, “Automatic acquisition of hyponyms from large text corpora,” Proc. of COLING, 1992.

A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” Proc. of COLT, 1998.

M. Collins and Y. Singer, “Unsupervised models for named entity classification,” Proc. of EMNLP, 1999.

D. McClosky, E. Charniak, and M. Johnson, “Effective self-training for parsing,” Proc. of NAACL, 2006.


Inference

Bibliography II

D. Downey, O. Etzioni, and S. Soderland, “A probabilistic model for redundancy in information extraction,” Proc. of IJCAI, 2004.

J. Curran, T. Murphy, and B. Scholz, “Minimizing semantic drift with mutual exclusion bootstrapping,” Proc. of PACLING, 2007.

M.-W. Chang, L.-A. Ratinov, and D. Roth, “Guiding semi-supervision with constraint-driven learning,” Proc. of ACL, 2007.

A. Yates and O. Etzioni, “Unsupervised methods for determining object and relation synonyms on the web,” Journal of Artificial Intelligence Research, 2009.


Inference

Bibliography III

J. Krishnamurthy and T. M. Mitchell, “Which noun phrases denote which concepts?,” Proc. of the 49th Annual Meeting of the ACL, 2011.

M. Banko and O. Etzioni, “The tradeoffs between open and traditional relation extraction,” Proc. of ACL, 2008.

T. Hasegawa, S. Sekine, and R. Grishman, “Discovering relations among named entities from large corpora,” Proc. of ACL, 2004.

T. Mohamed, E. Hruschka, and T. Mitchell, “Discovering relations between noun categories,” Proc. of EMNLP, 2011.

N. Lao, T. Mitchell, and W. W. Cohen, “Random walk inference and learning in a large scale knowledge base,” Proc. of EMNLP, 2011.


Inference

Bibliography IV

S. Schoenmackers, J. Davis, O. Etzioni, and D. Weld, “Learning first-order horn clauses from web text,” Proc. of EMNLP, 2010.