Toward Best-Effort Information Extraction

by Warren Shen, Pedro DeRose, Robert McCann, AnHai Doan, and Raghu Ramakrishnan, SIGMOD'08, June 2008, Vancouver, British Columbia, Canada, pp. 1031-1042

Presented by Andrew Zitzelberger

Many solutions exist for extracting structured data from raw data pages.

But … virtually all of these solutions focus on precise information extraction (IE) programs that output exact results.

Generally cannot execute a partially specified version of the program.

It generally takes a long time (days or weeks) to obtain the first meaningful results (partially due to the first limitation). This is not acceptable for time-sensitive applications.

Writing precise IE programs can be a waste of time in some instances.

Given 500 pages, find all houses that cost more than $500,000 and whose high school is Lincoln.

Case 1: Define price as a numeric value and run the approximate program. 9 pages are returned containing a number greater than 500,000 and the word Lincoln. Search these 9 pages manually.

Case 2: Instead, 120 pages are returned. The program is underspecified, so the Next-Effort assistant is consulted, which asks whether price tags are always bolded. After discovering they are, this is added to the specification, and this time 35 pages are returned.
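The approximate program in Case 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not the iFlex implementation: the function name candidate_pages, the regular expression, and the sample pages are all assumptions made for this sketch.

```python
import re

def candidate_pages(pages, min_price=500_000, school="Lincoln"):
    """Keep pages that mention the school and contain any number above min_price."""
    hits = []
    for page in pages:
        # Treat every digit run (with optional thousands separators) as a possible price.
        numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", page)]
        if school in page and any(n > min_price for n in numbers):
            hits.append(page)
    return hits

# Hypothetical sample pages.
pages = [
    "Price: $619,000. High school: Lincoln.",
    "Price: $350,000. High school: Lincoln.",
    "Price: $700,000. High school: Franklin.",
]
print(candidate_pages(pages))
```

The filter deliberately over-approximates: any large number on the page counts as a price, so the developer still inspects the returned pages by hand.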

iFlex allows the developer to quickly develop an approximate extraction program (Alog).

The approximate program can then be run to quickly retrieve approximate results (Compact Tables)

To improve results the developer can enlist the aid of the Next-Effort assistant

Xlog is a variant of Datalog.

An Xlog program consists of a number of rules of the form p :- q1, …, qn, where p and the qi are predicates, p is the head, and the qi’s form the body.

Xlog does not allow rules with negated predicates or recursion.

Xlog can accommodate the procedural steps of real-world IE using p-predicates and p-functions.

p-predicate: A p-predicate takes the form q(a1, …, an, b1, …, bm), where the ai and bi are variables and q is associated with some procedural code module.

The associated procedural code module takes as input a tuple (u1, …, un), where ui is bound to ai, i ∈ [1, n], and produces as output a set of tuples (u1, …, un, v1, …, vm).

p-function: A p-function f(a1, …, an) takes as input a tuple (u1, …, un) and returns a scalar value.
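As a rough illustration, a p-predicate can be thought of as a function from one input tuple to a set of extended output tuples, and a p-function as a function that returns a scalar. The sketch below assumes a hypothetical page layout ("School: <name>" lines) and uses a string-similarity threshold as the matching logic. Neither comes from the paper.

```python
import re
from difflib import SequenceMatcher

def extract_schools(y):
    """p-predicate extractSchools(y, s): input (y,), output a set of tuples (y, s)."""
    # Hypothetical extraction logic: one school name per "School: ..." line.
    return {(y, name.strip()) for name in re.findall(r"School:\s*(.+)", y)}

def approx_match(h, s, threshold=0.8):
    """p-function approxMatch(h, s): returns a scalar (here a boolean)."""
    return SequenceMatcher(None, h.lower(), s.lower()).ratio() >= threshold

page = "School: Lincoln High\nSchool: Franklin High"
print(extract_schools(page))
print(approx_match("Lincoln High", "lincoln high"))
```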

Example: Extract houses with a price above $500,000, more than 4500 square feet, and with a top high school.

p-predicates: extractHouses(x, p, a, h), extractSchools(y, s)

p-function: approxMatch(h, s)

Query:

R1: houses(x,p,a,h) :- housePages(x), extractHouses(x,p,a,h)

R2: schools(s) :- schoolPages(y), extractSchools(y,s)

R3: Q(x,p,a,h) :- houses(x,p,a,h), schools(s), p>500000, a>4500, approxMatch(h,s)
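To make the semantics of R1 to R3 concrete, the sketch below evaluates the three rules over toy data with Python set comprehensions. The page contents and the stub extraction modules are invented for illustration. Real procedural modules would parse page text.

```python
# Toy stand-ins for the pages and procedural modules (hypothetical data).
HOUSE_PAGES = {
    "x1": (619000, 4700, "Lincoln"),
    "x2": (350000, 5200, "Lincoln"),
    "x3": (700000, 4800, "Franklin"),
}
SCHOOL_PAGES = {"y1": ["Lincoln", "Washington"]}

def extract_houses(x):
    p, a, h = HOUSE_PAGES[x]
    return {(x, p, a, h)}

def extract_schools(y):
    return {(y, s) for s in SCHOOL_PAGES[y]}

def approx_match(h, s):
    return h.strip().lower() == s.strip().lower()

# R1: houses(x,p,a,h) :- housePages(x), extractHouses(x,p,a,h)
houses = {t for x in HOUSE_PAGES for t in extract_houses(x)}
# R2: schools(s) :- schoolPages(y), extractSchools(y,s)
schools = {s for y in SCHOOL_PAGES for (_, s) in extract_schools(y)}
# R3: join houses with schools and apply the three conditions
Q = {(x, p, a, h) for (x, p, a, h) in houses for s in schools
     if p > 500000 and a > 4500 and approx_match(h, s)}
print(Q)
```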

To write an Xlog program the developer must first decompose the program into smaller tasks.

Then p-predicates and p-functions are designed to reflect the decomposition.

Next, procedural modules that implement the p-predicates and p-functions must be designed and written. This takes a lot of time, and the modules must be fairly complete before testing can begin.

Finally, the modules must be linked in.

IE predicate – a p-predicate that extracts one or more output spans from a single input document or span.

The procedure-writing stage of Xlog is replaced by the ability to write description rules that do “good enough.”

The developer can also attach procedural modules if desired.

The developer also specifies the type of approximation to use with annotations.

Description rules are written in the same form as traditional Xlog rules, except that the head of the rule must be an IE predicate.

They can be used to define domain constraints in the form f(a) = v (example: numeric(a) = yes).

Values can be yes, distinct-yes, no, distinct-no, and unknown.

They can also describe text features such as bold-font, followed-by, underlined, hyperlinked, etc.

iFlex provides a rich set of built-in features and provides an interface for the user to add more.

Each feature f is implemented through two procedures:

Verify(s, f, v) checks whether f(s) = v.

Refine(s, f, v) returns all sub-spans t of s such that f(t) = v.

This implementation is done once and stored so that all future Alog programs can make use of it.
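A minimal sketch of Verify and Refine for a numeric feature follows. It assumes spans are plain Python strings (so position information is lost) and uses brute-force sub-span enumeration. The regular expression for "numeric" is also an assumption.

```python
import re

def numeric(span):
    """Feature f: 'yes' if the span looks like a number such as 619,000, else 'no'."""
    return "yes" if re.fullmatch(r"\d[\d,]*(\.\d+)?", span) else "no"

def verify(s, f, v):
    """Verify(s, f, v): check whether f(s) = v."""
    return f(s) == v

def refine(s, f, v):
    """Refine(s, f, v): all sub-spans t of s with f(t) = v (brute force)."""
    return {s[i:j]
            for i in range(len(s))
            for j in range(i + 1, len(s) + 1)
            if f(s[i:j]) == v}

print(verify("619,000", numeric, "yes"))
print(refine("Price: 42", numeric, "yes"))
```

Brute-force enumeration is quadratic in the span length. A practical implementation would presumably compute matching sub-spans directly.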

Description rules must be safe – meaning that they don’t produce an infinite relation.

extractHouses(x, p, a, h) :- numeric(p)=yes, numeric(a)=yes is not safe because it does not specify where p, a, and h are extracted from.

iFlex provides a built-in predicate from(x, y) that conceptually extracts all sub-spans y from document x.

This predicate can be used to easily make rules safe:

extractHouses(x, p, a, h) :- from(x, p), from(x, a), from(x, h), numeric(p)=yes, numeric(a)=yes
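The safe rule can be read as "enumerate all sub-spans, then filter." The sketch below assumes token-level spans and a simple digit test for numeric. The names from_ and is_numeric are hypothetical (from is a reserved word in Python).

```python
from itertools import product

def from_(x):
    """from(x, y): all contiguous token sub-spans y of document x."""
    tokens = x.split()
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]

def is_numeric(span):
    return span.replace(",", "").isdigit()

def extract_houses(x):
    """extractHouses(x,p,a,h) :- from(x,p), from(x,a), from(x,h), numeric(p)=yes, numeric(a)=yes"""
    spans = from_(x)
    nums = [s for s in spans if is_numeric(s)]
    return [(x, p, a, h) for p, a, h in product(nums, nums, spans)]

doc = "Price 619000 Area 4700"
print(len(extract_houses(doc)))
```

Even for this four-token document the rule yields 40 candidate tuples. The relation is finite, so the rule is safe, but it is large.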

Existence annotation: Indicates that a tuple in the relation may or may not exist.

schools(s)? :- schoolPages(y), extractSchools(y,s)

Attribute annotation: Indicates that an attribute takes a value from a given set, but we do not know which value.

houses(x,<p>,<a>,<h>) :- housePages(x), extractHouses(x,p,a,h)

Suppose we determine that school names are in bold font.

It is not likely that every bold word in the document is a school name.

Thus we can use the existence annotation to specify that each tuple found may or may not be in the actual relation.

Every tuple found is added to a relation and the power set is returned to specify the set of relations that are possibly correct.

Suppose that each document x in housePages describes exactly one house (so x is a key in the relation).

Then we can specify that price, area, and high school come from some matching values we found on the page.

All possible relations are constructed for houses where one value is selected for each attribute.

Need a way to store the set of relations an Alog program produces.

An a-table is a multiset of a-tuples.

An a-tuple is a tuple (V1,…,Vn), where each Vi is a multiset of possible values.

An a-tuple may be annotated with a ‘?’, in which case it is also called a maybe a-tuple.

An a-table represents the set of all possible relations that can be constructed by: (a) selecting a subset of the maybe a-tuples and all non-maybe a-tuples, then (b) selecting one possible value for each attribute in each a-tuple in (a).
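Steps (a) and (b) can be sketched as a small enumeration of possible relations. The representation is an assumption made for illustration: an a-table is a list of (cells, maybe) pairs, each cell is a list of possible values, and each possible relation is returned as a sorted list of tuples.

```python
from itertools import combinations, product

def possible_relations(a_table):
    """Enumerate every relation an a-table represents."""
    certain = [cells for cells, maybe in a_table if not maybe]
    maybes = [cells for cells, maybe in a_table if maybe]
    worlds = []
    # (a) keep all non-maybe a-tuples plus any subset of the maybe a-tuples
    for k in range(len(maybes) + 1):
        for chosen in combinations(maybes, k):
            a_tuples = certain + list(chosen)
            # (b) pick one possible value per attribute in each selected a-tuple
            per_tuple = [list(product(*cells)) for cells in a_tuples]
            for combo in product(*per_tuple):
                worlds.append(sorted(combo))
    return worlds

# Hypothetical a-table: one certain a-tuple with two candidate prices, one maybe a-tuple.
a_table = [
    ([["x1"], [500000, 619000]], False),
    ([["x2"], [350000]], True),
]
for world in possible_relations(a_table):
    print(world)
```

The number of possible relations grows exponentially with the number of maybe a-tuples and multiplicatively with the cell sizes, so enumeration like this is only useful for understanding the semantics.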

A-tables are typically not succinct enough because an Alog rule may produce a huge number of extracted values.

iFlex employs compact tables, which exploit the sequential nature of text to “pack” the set of values in each cell into a much smaller set of so-called assignments.

A compact table is a multiset of compact tuples. A compact tuple is a tuple of cells (c1,…,cn) where each cell ci is a multiset of assignments or an expansion cell. A compact tuple may optionally be designated as a maybe compact tuple, denoted with a ‘?’.

exact: exact(s) – encodes a value that is exactly span s.

contain: contain(s) – encodes all values that are sub-spans of s on the page (example: contain(“Cherry”) includes {“C”, “Ch”, …, “Cherry”}).

Consider a tuple t with cells (c1, …, ci, …, cn), where ci = expand(v1, …, vk).

Then t can be expanded into a set of compact tuples obtained by replacing cell ci with an assignment exact(vj): (c1, …, exact(vj), …, cn), where 1 ≤ j ≤ k.
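The assignments and the expansion step can be sketched as follows. The tagged-tuple representation and the character-level reading of "sub-span" are assumptions for illustration. An actual system may work with token spans and document offsets.

```python
def exact(s):
    return ("exact", s)

def contain(s):
    return ("contain", s)

def values(assignment):
    """The set of values an assignment encodes."""
    kind, s = assignment
    if kind == "exact":
        return {s}
    # contain: every non-empty contiguous sub-span of s (character level here)
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def expand_tuple(cells, i, vs):
    """Replace expansion cell i = expand(v1..vk) with exact(vj): one compact tuple per vj."""
    return [cells[:i] + [[exact(v)]] + cells[i + 1:] for v in vs]

print(sorted(values(contain("Cherry")), key=len)[:3])
cells = [[exact("x1")], None, [contain("Cherry Hills")]]  # None marks the expansion cell
print(expand_tuple(cells, 1, ["92", "93"]))
```

A single contain assignment stands in for a number of values that is quadratic in the span length, which is where the compaction comes from.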

Compact tables are not a complete model for approximate data (they cannot express mutual exclusion).

Compact tables are not closed under traditional relational operators.

Ensure superset semantics – the result is always a superset of the actual results.

Projection – ignore duplicate detection.

Selection – if any of the possible tuples in a compact tuple meet the selection condition, the tuple is retained. If only some of the possible tuples meet the condition, it becomes a maybe compact tuple.

θ-join – evaluate the θ condition on all compact tuples in the Cartesian product using the selection criteria.
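The selection rule can be sketched as below. For simplicity the sketch assumes each cell is already a plain set of possible values (rather than assignments) and that the condition tests a single attribute. Both are simplifications for illustration.

```python
def select(compact_table, attr, cond):
    """Selection with superset semantics over a list of (cells, maybe) compact tuples."""
    result = []
    for cells, maybe in compact_table:
        outcomes = [cond(v) for v in cells[attr]]
        if not any(outcomes):
            continue  # no possible tuple satisfies the condition: drop it
        # Retained; if only some possibilities pass, the tuple becomes a maybe tuple.
        result.append((cells, maybe or not all(outcomes)))
    return result

# Hypothetical compact table: (page id, candidate prices).
table = [
    ([{"x1"}, {619000, 700000}], False),
    ([{"x2"}, {350000, 619000}], False),
    ([{"x3"}, {100, 200}], False),
]
print(select(table, 1, lambda p: p > 500000))
```

The tuple for x2 is kept but marked as maybe, so no correct answer is lost, which is the point of superset semantics.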

Unfold all rules (unifying variables if necessary) until only IE predicates remain that are associated with procedures in the program.

Construct a logical plan fragment

The Next-Effort assistant suggests ways to refine the current information extraction program by asking the developer questions. Example: “Is price in bold font?”

iFlex adds new constraints to the program based on the developer’s responses.

If the number of tuples does not change for k iterations, the assistant can notify the developer that the results have converged.

Sequential: Rank attributes in decreasing importance (using various heuristics). Always ask questions about the most important attribute.

Simulation: Ask questions whose answers will eliminate the most possible answers.

The results of each stage of the execution plan are stored, so that only the changes have to be rerun.
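One plausible reading of the simulation strategy is sketched below: for each candidate question, simulate every possible answer, count the candidates that would survive, and ask the question with the smallest worst-case count. The worst-case scoring rule, the function name pick_question, and the sample data are assumptions, not the method described in the paper.

```python
def pick_question(candidates, questions):
    """questions maps question text to {answer: predicate that keeps a candidate}."""
    def worst_case(answers):
        # Largest number of candidates left over any simulated answer.
        return max(sum(1 for c in candidates if keep(c)) for keep in answers.values())
    return min(questions, key=lambda q: worst_case(questions[q]))

# Hypothetical candidate price spans with precomputed text features.
candidates = [
    {"text": "619,000", "bold": True, "underlined": True},
    {"text": "4,700", "bold": True, "underlined": False},
    {"text": "1998", "bold": False, "underlined": False},
    {"text": "53705", "bold": False, "underlined": False},
]
questions = {
    "Is price underlined?": {
        "yes": lambda c: c["underlined"],
        "no": lambda c: not c["underlined"],
    },
    "Is price in bold font?": {
        "yes": lambda c: c["bold"],
        "no": lambda c: not c["bold"],
    },
}
print(pick_question(candidates, questions))
```

Here the bold-font question is chosen because either answer leaves at most two candidates, while a "no" to the underline question would leave three.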

Domains: Movies, DBLP, Books.

Comparisons in performance are based on the time it takes to write the program for extraction (or do the extraction in the manual case).

Times are averaged over 1-3 volunteers for each task.

Time stops when the correct result is obtained or the program converges.

iFlex reduced time by 25-98% in all 27 scenarios

• iFlex converged correctly in 23 out of 27 of the scenarios (not shown due to space limitations)

• The four remaining cases were 170%, 161%, 114%, and 102%.

• Two of those cases had a small number of tuples.

Tasks took 104, 351, and 107 seconds to run; iFlex running time is comparable to Perl extraction programs.

iFlex is a best-effort information extraction program that can be used to quickly obtain approximate results.

iFlex significantly reduces the developer time in creating information extraction programs.

iFlex is efficient enough to run with speed comparable to Perl extraction programs.

Simulated question patterns from the Next-Effort assistant outperform the sequential pattern.
