TRANSCRIPT
Arnd Christian König, Venkatesh Ganti, Rares Vernica
Microsoft Research
Entity Categorization Over Large Document Collections
Relationship Extraction from Text
Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
  "… Donald Knuth works in research …" ⇒ is-a-researcher(Donald_Knuth)
  "… Yao Ming plays for the Houston Rockets …" ⇒ works-for(Yao_Ming, Houston_Rockets)
Motivation: Going from unstructured data to structured data; applications in search, business intelligence, etc.
Focus: Open relationship extraction vs. targeted extraction; large document collections (> 10^7 documents).
Using Aggregate Context
Single-context extraction:
  Extraction logic: '[E] works … research' ⇒ ([Entity], is-a-researcher)
Multi-context extraction:
  "…[Entity] works in research…"
  "…[Entity] published…"
  "…[Entity]'s paper…"
  "…[Entity] gave a talk…"
  } ⇒ ([Entity], is-a-researcher)
Multi-Feature Relation Extractor: we track an entity across contexts, which allows us to combine less predictive features:
  ([Entity], 'published'), ([Entity], 'paper'), ([Entity], 'talk')
Aggregate Context Features
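As an illustration, aggregating weak per-context features into one classification decision can be sketched as follows. The feature weights and threshold here are hypothetical stand-ins for a trained linear model; no single feature crosses the threshold alone, but the aggregated evidence does:

```python
from collections import defaultdict

# Hypothetical weights a trained linear model might assign; each
# individual context feature is only weakly predictive on its own.
WEIGHTS = {"works in research": 0.4, "published": 0.3, "paper": 0.2, "talk": 0.1}
THRESHOLD = 0.5

def classify_entities(context_pairs):
    """context_pairs: (entity, feature) pairs gathered across ALL
    documents. Returns entities labeled is-a-researcher."""
    features = defaultdict(set)
    for entity, feature in context_pairs:
        features[entity].add(feature)  # aggregate contexts per entity
    labeled = set()
    for entity, feats in features.items():
        # binary features: each distinct feature contributes its weight once
        score = sum(WEIGHTS.get(f, 0.0) for f in feats)
        if score > THRESHOLD:
            labeled.add(entity)
    return labeled

pairs = [("Donald_Knuth", "works in research"),
         ("Donald_Knuth", "published"),
         ("Yao_Ming", "talk")]
print(classify_entities(pairs))  # {'Donald_Knuth'}
```

Neither "works in research" (0.4) nor "published" (0.3) would suffice alone; only the aggregate context pushes Donald_Knuth over the threshold.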
Using Co-occurrence Features
Leverage co-occurrence of entity classes (e.g., directors likely co-occur with actors) for extraction.
Example: extraction of the is-a-director relation:
  Actor list: Alan Alda, Richard Gere, Julia Roberts, …
  "… Julia Roberts starred in a Robert Altman film in 1994 …"
  ⇒ (Robert_Altman, co-occurs with actor name), …
Co-occurrence features can be between entities of different classes or entities of one class. Combination with text features is possible, e.g., '[Entity] plays for [Team_Name]'.
Aggregate Context Features
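A sentence-level sketch of this idea, assuming entity mentions and list members are already recognized as plain strings (the names, list contents, and feature label are illustrative):

```python
def cooccurrence_features(sentences, candidates, member_list,
                          feature="co-occurs with actor name"):
    """Emit (candidate, feature) pairs for candidate entities that
    appear in the same sentence as a member of the given list."""
    pairs = set()
    for s in sentences:
        # does any list member (e.g. a known actor) occur here?
        if not any(m in s for m in member_list):
            continue
        for c in candidates:
            if c in s and c not in member_list:
                pairs.add((c, feature))
    return pairs

sents = ["Julia Roberts starred in a Robert Altman film in 1994"]
print(cooccurrence_features(sents, {"Robert Altman"}, {"Julia Roberts"}))
# {('Robert Altman', 'co-occurs with actor name')}
```

A real system would co-occur within a token window rather than a whole sentence, but the emitted (entity, feature) pairs feed the same aggregate classifier either way.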
Two questions:
(a) What difference do the aggregate contexts make for extraction accuracy?
(b) This means keeping track of contexts across documents: can we make this efficient?
Processing Large Document Collections

Architecture (single-context extraction)
[Pipeline diagram: Document Corpus D → Context Feature Extraction → Entity-Feature Pairs → Aggregation (COUNT(entity, relation) > Δ) → Agg. Feature Extraction → Classification → Entity-Relation Pairs. In addition, Co-Occurrence Detection over the Co-Occurrence List corpus L runs as a separate pass for each extraction task.]
⇒ Duplicated overhead from document scanning, document processing, and entity extraction.
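The aggregation step with its COUNT(entity, relation) > Δ filter can be sketched with a simple counter (Δ and the pair stream below are illustrative):

```python
from collections import Counter

def aggregate(pair_stream, delta=1):
    """Keep only (entity, relation) pairs extracted more than `delta`
    times, mirroring the COUNT(entity, relation) > Δ step."""
    counts = Counter(pair_stream)
    return {pair for pair, c in counts.items() if c > delta}

stream = ([("Donald_Knuth", "is-a-researcher")] * 3
          + [("Yao_Ming", "is-a-researcher")])
print(aggregate(stream))  # {('Donald_Knuth', 'is-a-researcher')}
```

The threshold discards spurious single-occurrence extractions before the more expensive feature extraction and classification stages.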
New Architecture
Challenges:
1. Fast & accurate co-occurrence detection using the synopsis.
2. Pruning of redundant output.
[Pipeline diagram: Document Corpus D → Context Feature Extraction and Co-Occurrence Detection (against a Synopsis of L built from the Co-Occurrence List corpus L) in a single scan → Entity-Candidate Context Pairs → Delete False Positives / List-Member Extraction → Entity-List Pairs and Entity-Feature Pairs → Aggregation → Agg. Feature Extraction / Rule-based Extraction → Classification.]
Fast identification of candidate matches through 2-stage filtering.
Use of Bloom-Filters to trade off memory footprint with false positive rate.
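A minimal Bloom filter of the kind used for the first, cheap filtering stage might look like this; the sizes m and k are illustrative, and a second, exact stage would re-check surviving candidates against the full list:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, and a false-positive
    rate that is traded off against the bit-array size m."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # k independent positions derived from salted hashes
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("Julia Roberts")
print("Julia Roberts" in bf)  # True (added items are always found)
```

Because misses are definitive, most non-list tokens are rejected without touching the list itself; only the (rare) positives pay for an exact membership check.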
Frequency distribution of entities is very skewed.
Pruning based on retaining the most frequent entities and list members in memory.
Challenge: determining frequencies online.
⇒ Compact hash synopses of frequencies (CM-Sketch) perform well.
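A minimal Count-Min sketch of the kind referred to above (the width/depth parameters are illustrative):

```python
import hashlib

class CountMinSketch:
    """Compact hash synopsis of frequencies: estimates may over-count
    (hash collisions) but never under-count, so the most frequent
    entities are reliably identified online."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _pos(self, item, row):
        h = hashlib.sha1(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._pos(item, row)] += count

    def estimate(self, item):
        # minimum over rows bounds the over-count from collisions
        return min(self.table[row][self._pos(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(100):
    cms.add("George Bush")
print(cms.estimate("George Bush"))  # 100 (exact here; over-estimates under collisions)
```

Memory is fixed at width x depth counters regardless of how many distinct entities stream by, which is what makes tracking frequencies over >10^7 documents feasible.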
Potentially very large output:
  Duplication via very many co-occurrences, e.g. actor-actor.
  Duplication of entity-feature pairs, e.g. entity "George Bush" with feature 'President'.
Experiments
Experimental Evaluation
Task: categorization of entities into professions (actor, writer, painter, etc.)
Document corpus: 3.2 million Wikipedia pages
Training data generated using Wikipedia lists of famous painters, writers, etc.
Aggregate-context classifier: linear SVM using text n-gram & co-occurrence features (binary)
Single-context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD'06])
Co-occurrence list: contains 10% of entity strings in training data
Experimental Evaluation: Accuracy
[Precision-recall plot (recall 0-100% on x-axis, precision 75-100% on y-axis), comparing: painters, aggregate n-gram features only; painters, n-gram and list-membership features; rule-based extraction (60% conf.); rule-based extraction (80% conf.).]
Experimental Evaluation: Overhead
[Bar chart (y-axis: percentage of baseline overhead, 0-100%), showing Aggregation, Verification, and Write-Overhead for synopsis sizes of 1%, 2%, 5%, and 10% against the baseline.]
Main remaining overhead: writing of entity-feature pairs.
A simple caching strategy reduces this overhead by an order of magnitude.
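One such caching strategy can be sketched as a bounded LRU cache that suppresses repeated (entity, feature) writes; the capacity and the write interface below are hypothetical:

```python
from collections import OrderedDict

class WriteCache:
    """Bounded LRU cache of recently written (entity, feature) pairs;
    a pair already in the cache is not written to the output again."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.seen = OrderedDict()
        self.writes = 0

    def write(self, pair):
        if pair in self.seen:
            self.seen.move_to_end(pair)  # cache hit: suppress the write
            return
        self.writes += 1                 # cache miss: emit to output
        self.seen[pair] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # evict least recently used

cache = WriteCache()
for _ in range(1000):
    cache.write(("George Bush", "President"))
print(cache.writes)  # 1
```

Since the entity frequency distribution is highly skewed, a small cache absorbs most repeats of pairs like ("George Bush", 'President'), which is consistent with the order-of-magnitude reduction reported above.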
Conclusions
Studied the effect of aggregate context in relation extraction.
Proposed efficient processing techniques for large text corpora.
Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.
The use of pruning techniques and approximate filters results in a significant reduction in overall extraction overhead.
Questions?