Finding frequent and interesting triples in text
Janez Brank, Dunja Mladenić, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia


Page 1: Finding frequent and interesting triples in text

Janez Brank, Dunja Mladenić, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia

Page 2: Motivation

• Help with populating a knowledge base / ontology (e.g. something like Cyc) with common-sense “facts” that would help with reasoning or querying
  – We’ll be interested in ⟨concept1, relation, concept2⟩ triples
  – E.g. ⟨person, inhabit, country⟩ tells us that a country is something that can be inhabited by a person, which is potentially useful
• We’d like to automatically extract such triples from a corpus of text
  – They are likely to contain slightly abstract concepts and aren’t mentioned directly in the text, but their specializations are
  – We will use WordNet to generalize concepts

Page 3: Overview of the approach

Corpus of text
  ↓ (parser + some heuristics)
List of ⟨subject, predicate, object⟩ triples
  ↓ (WordNet)
List of concept triples
  ↓ (generalization, minimum support threshold)
List of frequent triples
  ↓ (measures of interest)
List of frequent, interesting triples

Page 4: Associating input triples with WordNet concepts

• Our input was a list of ⟨subject, predicate, object⟩ triples
  – Each component is a phrase in natural language:
    ⟨European Union finance ministers, approved, convergence plans⟩
  – But we’d like each component to be a WordNet concept so that we’ll be able to use WordNet for generalization
• We use a simple heuristic approach:
  – Look for the longest subsequence of words that also happens to be the name of a WordNet concept
    • Thus “finance minister”, not “minister”
  – Break ties by selecting the rightmost such sequence
    • Thus “finance minister”, not “European Union”
  – Be prepared to normalize words when matching
    • “ministers” → “minister”
  – Use only the nouns in WordNet when processing the subject and object, and only the verbs when processing the predicate
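The matching heuristic above (longest subsequence, rightmost tie-break, crude normalization) can be sketched as follows. This is an illustrative sketch only: the toy `nouns` set and the `normalize` stub stand in for a real WordNet lookup and proper lemmatization.

```python
def normalize(word):
    # Crude normalization stand-in for proper lemmatization
    # ("ministers" -> "minister"); a real system would use WordNet's morphy.
    return word[:-1] if word.endswith("s") else word

def match_concept(phrase, concept_names):
    """Return the longest word subsequence of `phrase` that names a concept,
    breaking ties by taking the rightmost such subsequence."""
    words = [normalize(w.lower()) for w in phrase.split()]
    for length in range(len(words), 0, -1):          # longest first
        for start in range(len(words) - length, -1, -1):  # rightmost first
            candidate = " ".join(words[start:start + length])
            if candidate in concept_names:
                return candidate
    return None

# Toy stand-in for the set of WordNet noun-concept names:
nouns = {"finance minister", "minister", "european union", "plan"}
print(match_concept("European Union finance ministers", nouns))  # finance minister
```

Both tie-breaking rules from the slide are visible here: “finance minister” beats the shorter “minister”, and among the length-2 candidates the rightmost one beats “european union”.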

Page 5: Identifying frequent triples

• Now we have a list of concept triples, each of which corresponds roughly to one clause in the input textual corpus
• Let u ≼ v denote that v is a hypernym (direct or indirect) of u in WordNet (including u = v)
• support(s, v, o) := the number of concept triples ⟨s', v', o'⟩ such that s' ≼ s, v' ≼ v, o' ≼ o
  – Thus, a triple that supports ⟨finance minister, approve, plan⟩ also supports ⟨executive, approve, idea⟩
• We want to identify all ⟨s, v, o⟩ whose support exceeds a certain threshold
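A minimal sketch of this support computation, assuming a toy child→parent hypernym map in place of the full WordNet hierarchy:

```python
# Toy child -> parent hypernym map; WordNet supplies the real hierarchy.
TOY_HYPERNYMS = {
    "finance minister": "executive", "executive": "person",
    "approve": "act",
    "plan": "idea", "idea": "entity",
}

def ancestors(concept):
    """All hypernyms of `concept`, direct or indirect, including itself."""
    out = {concept}
    while concept in TOY_HYPERNYMS:
        concept = TOY_HYPERNYMS[concept]
        out.add(concept)
    return out

def support(triple, data):
    """Count concept triples whose components are specializations of `triple`."""
    s, v, o = triple
    return sum(1 for (s2, v2, o2) in data
               if s in ancestors(s2) and v in ancestors(v2) and o in ancestors(o2))

data = [("finance minister", "approve", "plan")]
print(support(("executive", "approve", "idea"), data))  # 1
```

This reproduces the example on the slide: the single input triple supports ⟨executive, approve, idea⟩ because each component is a (possibly indirect) hypernym of the corresponding input component.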

Page 6: Identifying frequent triples

• We use an algorithm inspired by Apriori
• However, we have to adapt it to prevent the generation of an intractably large number of candidate triples (most of which would turn out to be infrequent)
• We use the depth of concepts in the WordNet hierarchy to order the search space
• Process triples in increasing order of the sum of the depths of their concepts
  – Each depth-sum requires one pass through the data
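The pass ordering can be illustrated as follows. The depth values are made up for illustration (depth = distance from the hierarchy root, so more general concepts have smaller depth), and the sketch only shows the grouping of candidates into per-depth-sum passes, not the Apriori-style candidate generation itself.

```python
# Hypothetical depths in the concept hierarchy (root concepts at depth 0).
TOY_DEPTH = {"entity": 0, "act": 0, "person": 1, "idea": 1, "approve": 1,
             "executive": 2, "plan": 2, "finance minister": 3}

def depth_sum(triple):
    return sum(TOY_DEPTH[c] for c in triple)

candidates = [
    ("entity", "act", "entity"),              # depth-sum 0: most general
    ("person", "act", "idea"),                # depth-sum 2
    ("executive", "approve", "plan"),         # depth-sum 5
    ("finance minister", "approve", "plan"),  # depth-sum 6
]

# One pass over the data per depth-sum, most general triples first:
for d in sorted({depth_sum(t) for t in candidates}):
    batch = [t for t in candidates if depth_sum(t) == d]
    print(d, batch)
```

Processing general triples first matches the Apriori intuition: a triple can only be frequent if its generalizations (which have smaller depth-sums) were already found frequent in earlier passes.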

Page 7: Identifying interesting triples

• Not all frequent triples are interesting
  – Generalizing one or more components of a triple leads to a higher (or at least equal) support
  – Thus the most general triples are also the most frequent, but they aren’t interesting
    • E.g. ⟨entity, act, entity⟩
• We are investigating heuristics to identify which triples are likely to be interesting
  – Let s be a concept and s' its hypernym.
  – Every input triple that supports s in its subject also supports s', but the other way around is usually not true.
  – We can think of the ratio support(s) / support(s') as a “conditional probability” P(s|s').
  – So we might naively expect that P(s|s') · support(s', v, o) input triples will support the triple ⟨s, v, o⟩.
  – But the actual support(s, v, o) can be quite different. If it is significantly higher, we conclude that s fits well together with v and o.
  – Thus, interestingnessS(s, v, o) = support(s, v, o) / (P(s|s') · support(s', v, o)).
  – Analogous measures can be defined for v and o as well.
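A numeric sketch of interestingnessS, following the definition above. The support counts here are made up for illustration only.

```python
def interestingness_s(sup_svo, sup_s, sup_sprime, sup_sprime_vo):
    """interestingnessS(s,v,o) = support(s,v,o) / (P(s|s') * support(s',v,o))."""
    p_s_given_sprime = sup_s / sup_sprime          # P(s|s') = support(s)/support(s')
    expected = p_s_given_sprime * sup_sprime_vo    # naively expected support of <s,v,o>
    return sup_svo / expected

# Hypothetical counts: if 10% of subjects supporting "executive" also support
# "finance minister" (P(s|s') = 100/1000), and <executive, approve, plan> has
# support 200, we'd naively expect support 20 for <finance minister, approve,
# plan>; an actual support of 80 is 4x higher than expected:
print(interestingness_s(80, 100, 1000, 200))  # 4.0
```

A score well above 1 signals that s co-occurs with v and o more often than its share of s' would predict, i.e. that s “fits well together” with v and o.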

Page 8: Identifying interesting triples

• But this measure of interestingness turns out to be too sensitive to outliers and quirks in the WordNet hierarchy
• Define the sv-neighbourhood of a triple ⟨s, v, o⟩ as the set of all (frequent) triples with the same s and v.
  – The so- and vo-neighbourhoods can be defined analogously.
• Possible criteria to select interesting triples now include:
  – A triple is interesting if it is the most interesting in two (or even all three) of its neighbourhoods (sv-, so- and vo-).
  – We might also require that the neighbourhoods be large enough.
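The two criteria above can be combined in a short sketch: a triple is kept if it has the top score in at least two of its three neighbourhoods, counting only neighbourhoods of a minimum size. The scored triples below are made-up examples.

```python
from collections import defaultdict

def neighbourhoods(triple):
    s, v, o = triple
    return (("sv", (s, v)), ("so", (s, o)), ("vo", (v, o)))

def select_interesting(scored, wins_needed=2, min_size=2):
    """Keep triples that are the most interesting in >= `wins_needed` of their
    neighbourhoods, ignoring neighbourhoods smaller than `min_size`."""
    best, size = {}, defaultdict(int)
    for triple, score in scored:
        for nb in neighbourhoods(triple):
            best[nb] = max(best.get(nb, 0.0), score)
            size[nb] += 1
    selected = []
    for triple, score in scored:
        wins = sum(size[nb] >= min_size and score >= best[nb]
                   for nb in neighbourhoods(triple))
        if wins >= wins_needed:
            selected.append(triple)
    return selected

scored = [(("person", "inhabit", "country"), 3.0),
          (("person", "inhabit", "town"), 1.5),
          (("animal", "inhabit", "country"), 1.0)]
print(select_interesting(scored))  # [('person', 'inhabit', 'country')]
```

The `min_size` guard matters: without it, a triple that is alone in a neighbourhood wins that neighbourhood trivially, which is exactly why the slide suggests requiring the neighbourhoods to be large enough.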

Page 9: Experiments: Frequent triples

• Input: 15.9 million ⟨subject, predicate, object⟩ triples extracted from the Reuters (RCV1) corpus
• For 11.8 million of them, we were able to associate them with WordNet concepts. These are the basis of further processing.
• Frequent triple discovery:
  – Found 40 million frequent triples (at various levels of generalization) in about 60 hours of CPU time
  – Required 35 passes through the data (one for each depth-sum)
  – At no pass was the number of candidates generated greater than the number of actually frequent triples by more than 60%

Page 10: Experiments: Interesting triples

• We manually evaluated the interestingness of all the frequent triples that are specializations of ⟨person, inhabit, location⟩ (there were 1321 of them)
  – On a scale of 1–5, we consider 4 and 5 as being interesting
  – If, instead of looking at all these triples, we select a smaller group of them on the basis of our interestingness measures, does the percentage of triples scored 4 or 5 increase?

Page 11: Conclusions and future work

• Frequent triples
  – Our frequent triple algorithm successfully handles large amounts of data
  – Its memory footprint only minimally exceeds the amount needed to store the actual frequent triples themselves
• Interesting triples
  – Our measure of interestingness has some potential, but it remains to be seen what the right way to use it is
  – Evaluation involving a larger set of triples is planned
• Ideas for future work: covering approaches
  – Suppose we fix s and v, and look where the corresponding o’s (i.e. those for which ⟨s, v, o⟩ is frequent) fall in the WordNet hypernym tree
  – We want to identify nodes whose subtrees cover a lot of these concepts but not too many other concepts (combined with an MDL criterion)
  – Alternative: think of the input concept triples as positive examples, and generate random triples of concepts as negative examples. Use this as the basis for a coverage problem similar to those used in learning association rules.