454 project ideas


Upload: ori-dale

Post on 03-Jan-2016


Page 1: 454 Project Ideas

454 Project Ideas

Page 2: 454 Project Ideas

Administrivia

Office Hours: 11-noon, Fridays in 588, or by email

Project proposals due today
Not binding (at least not yet)
To be elaborated
In-person project reviews next week

HW 1 – due next Tues @ noon

Page 3: 454 Project Ideas

Autonomously Semantifying Wikipedia

Fei Wu

Dept. Computer Science & Eng.

University of Washington

(Joint work with Dan Weld)

Page 4: 454 Project Ideas

Motivation

Semantic Web [Berners-Lee 01] is great.
Web content machine readable
Software agents find, share and integrate information

Page 5: 454 Project Ideas

Motivation

Semantic Web [Berners-Lee 01] is great.
Web content machine readable
Software agents find, share and integrate information

Semantic Data Applications

Chicken-egg problem:

Page 6: 454 Project Ideas

Motivation

Semantic Web [Berners-Lee 01]
Web content machine readable
Software agents find, share and integrate information

Semantic Data Applications

Bootstrapping:

Automatically Semantifying Data

Chicken-egg problem:

Page 7: 454 Project Ideas

Idea: “Semantify” Wikipedia

Wikipedia [http://wikipedia.org]
Comprehensive (1.7 million English articles)
High-quality
Important: 6th most popular web-site & growing

Benefits:
User-tagged data (links, infobox, lists, categories, etc.)
Large, but not too large

Page 8: 454 Project Ideas

Wikipedia Challenges

Much natural-language text

Missing data

Inconsistency

Low information redundancy

Page 9: 454 Project Ideas

Kylin: Autonomously Semantifying Wikipedia

Totally autonomous with no additional human effort

Information extraction from both semi-structured and unstructured data

“Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage.” (Wikipedia)

[Wu & Weld CIKM-07]

Page 10: 454 Project Ideas

Outline

Semantics in Wikipedia: Opportunities, Challenges

Kylin System: Infobox Generation, Link Creation

Conclusion

Page 11: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation ……

{{Infobox U.S. County
| county = Clearfield County
| state = Pennsylvania
| seal =
| map = Map of Pennsylvania highlighting Clearfield County.svg
| map size = 225
| founded = [[March 26]], [[1804]]
| seat = [[Clearfield, Pennsylvania|Clearfield]]
| area = 2,988 [[km²]] (1,154 [[square mile|mi²]])
| area water = 17 km² (6 mi²)
| area percentage = 0.56%
| census yr = 2000
| pop = 83,382
| density = 28||}}
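To make the attribute/value structure of an infobox concrete, here is a minimal Python sketch that turns a flat template like the one above into a dictionary. The function name `parse_infobox` and the shortened `SAMPLE` string are illustrative assumptions, not part of Kylin; the sketch only handles `[[...]]` links and does not attempt nested templates.

```python
def parse_infobox(wikitext):
    """Parse a flat {{Infobox ...}} template into (template_name, attributes).

    Assumption: no nested {{...}} templates; only [[...]] links may contain '|'.
    """
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template")
    body = body[2:-2]

    # Split on '|' only outside [[...]] links, which use '|' for link labels.
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two == "[[":
            depth += 1
            current.append(two)
            i += 2
        elif two == "]]":
            depth = max(depth - 1, 0)
            current.append(two)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    name = parts[0].strip()
    attributes = {}
    for part in parts[1:]:
        if "=" not in part:
            continue  # skip empty fields such as the trailing "||"
        key, value = part.split("=", 1)
        attributes[key.strip()] = value.strip()
    return name, attributes


# Shortened version of the slide's example (illustrative only).
SAMPLE = ("{{Infobox U.S. County| county = Clearfield County| state = Pennsylvania "
          "| seat = [[Clearfield, Pennsylvania|Clearfield]] | pop = 83,382 "
          "| density = 28||}}")

name, attrs = parse_infobox(SAMPLE)
print(name)
print(attrs)
```

Values are kept as raw wikitext (for example the `seat` value still contains its link markup), since cleaning them is a separate step.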

Page 12: 454 Project Ideas


Self-Supervised Learning of Infoboxes

Page 13: 454 Project Ideas

Infobox Challenges

Incompleteness
US County: ~50% of articles have infoboxes

Inconsistency
Manual process -> contradictions between text & infobox
16% of US County articles had an error (revision)

Schema Drift
U.S. County (1428), US County (574), Counties (50), County (19)
Attribute drift & duplication
Rare attributes: only 29% used by 30% or more articles

Page 14: 454 Project Ideas

Infobox Challenges (Continued)

Type-free System
Deliberate low-tech design
“King county” has the following attributes:
Land area = 2126 sq miles
Land area (km) = 5506 sq km

Irregular lists
Some separate information in items
Others use tables with different schemata
Others are hierarchical

List of cities & towns in US
Places in Florida
List of counties in Florida

Page 15: 454 Project Ideas

Infobox Challenges (Continued)

Infoboxes are themselves hierarchical

Country leader: instead of a name, has a nested element listing the title (“king”), with the name at a lower level

Page 16: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation

Why are these useful?

Page 17: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation

Why useful?

Why challenging?

Page 18: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation

“Seattle, Washington”

Challenge: crappy

• flattened

• “to be merged” since 3/06

Page 19: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation

Why useful?


Page 21: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation

Opportunities

Semantic source

Training dataset

Challenges

Missing data

Inconsistency

Page 22: 454 Project Ideas

Semantics in Wikipedia

Infobox Link List Category Redirection Disambiguation

Opportunities

Semantic source

Training dataset

Challenges

Missing data

Inconsistency

Kylin: Autonomously Semantifying Wikipedia

Page 23: 454 Project Ideas

Outline

Semantics in Wikipedia: Opportunities, Challenges

Kylin System: Infobox Generation, Link Creation

Conclusion

Page 24: 454 Project Ideas

Infobox Generation

Page 25: 454 Project Ideas

Preprocessor: Schema Refinement

Free edit -> schema drift

Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)

Low usage of attribute

Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”

Kylin:

Strict name match

????

>15% occurrences
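One possible reading of the two Kylin rules named above (strict name match on the template, and a 15% usage cutoff on attributes) can be sketched in Python as follows. The function names, the `(template_name, attributes)` input shape, and the toy data are assumptions made for illustration; they are not taken from Kylin's code.

```python
from collections import Counter


def select_infoboxes(articles, template_name):
    """Strict name match: keep infoboxes whose template name is exactly template_name."""
    return [attrs for name, attrs in articles if name == template_name]


def frequent_attributes(infoboxes, min_fraction=0.15):
    """Keep attributes used by more than min_fraction of the selected infoboxes."""
    total = len(infoboxes)
    if total == 0:
        return []
    counts = Counter()
    for attrs in infoboxes:
        counts.update(set(attrs))  # count each attribute once per article
    return sorted(a for a, c in counts.items() if c / total > min_fraction)


# Toy data (made up): ten "U.S. County" infoboxes and one "US County" duplicate template.
articles = [("U.S. County", {"county": f"County {i}", "state": "Pennsylvania"})
            for i in range(10)]
for _, attrs in articles[:5]:
    attrs["seat"] = "Some Town"          # used by 5 of 10 articles
articles[0][1]["motto"] = "Some motto"   # used by 1 of 10 articles, below the cutoff
articles.append(("US County", {"county": "Other", "nickname": "x"}))

selected = select_infoboxes(articles, "U.S. County")
schema = frequent_attributes(selected)
print(len(selected), schema)
```

Note that strict matching leaves the "US County" article out entirely, which is the cost of avoiding any template-merging heuristics.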

[Chart: U.S. County Infobox; axis from 0 to 1]

[Pipeline diagram: Preprocessor, Classifier, Extractor, Infobox]

Page 26: 454 Project Ideas

Preprocessor: Training Dataset Construction

Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.

Its county seat is Clearfield.

2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.

Problems:

Missing data

Noise

As of 2005, the population density was 28.2/km².


Steps:

1. Segment into sentences

2. Find unique match (heuristics)
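The two steps above can be sketched in Python roughly as follows. This is a simplified illustration, not Kylin's preprocessor: the regex-based sentence splitter, the plain substring match, and the cleaned-up attribute values in `INFOBOX` are all assumptions. A sentence becomes a training example for an attribute only when it is the single sentence containing that attribute's value.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after . ! ? when followed by space and a capital or digit."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [p.strip() for p in pieces if p.strip()]


def build_training_set(article_text, infobox):
    """Return (attribute, value, sentence) triples for values with a unique matching sentence."""
    sentences = split_sentences(article_text)
    labeled = []
    for attribute, value in infobox.items():
        matches = [s for s in sentences if value in s]
        if len(matches) == 1:
            labeled.append((attribute, value, matches[0]))
        # No match means missing data; several matches are ambiguous (noise), so both are skipped.
    return labeled


ARTICLE = (
    "Clearfield County was created on 1804 from parts of Huntingdon and Lycoming "
    "Counties but was administered as part of Centre County until 1812. "
    "Its county seat is Clearfield. "
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."
)

# Attribute values simplified by hand from the infobox wikitext (assumption).
INFOBOX = {"seat": "Clearfield", "area water": "17 km²", "founded": "1804", "pop": "83,382"}

labeled = build_training_set(ARTICLE, INFOBOX)
for attribute, value, sentence in labeled:
    print(attribute, "->", sentence)
```

In this example "seat" is dropped because "Clearfield" occurs in two sentences, and "pop" is dropped because its value never appears in the text, which mirrors the noise and missing-data problems listed on the slide.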

Page 27: 454 Project Ideas

Classifier

Document Classifiers (1 per article type)

Sentence Classifier (1 per article type x attribute)

Trained on preprocessor output
Features: bag of words, POS tags
Maximum Entropy Classifier with Bagging: multi-class, multi-label, missing data

List & Category
Fast
Precision (98.5%) – with no learning!
Recall (68.8%)
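As a rough illustration of a bagged maximum-entropy sentence classifier, the Python sketch below trains several binary logistic-regression models (the two-class case of maximum entropy) on bootstrap resamples and averages their probabilities. It is a toy under stated assumptions: one binary classifier for a single attribute ("county seat"), bag-of-words features only (no POS tags), a small hand-picked stopword list, and made-up training sentences. None of the names come from Kylin.

```python
import math
import random
import re
from collections import Counter

STOPWORDS = {"the", "is", "its", "a", "of", "was", "in", "as"}  # assumed list


def featurize(sentence):
    """Bag-of-words counts; POS-tag features are left out of this sketch."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def score(model, features):
    """Probability that the sentence mentions the attribute."""
    weights, bias = model
    z = bias + sum(weights.get(w, 0.0) * c for w, c in features.items())
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(examples, epochs=100, rate=0.5):
    """Binary maximum-entropy model trained by stochastic gradient ascent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - score((weights, bias), features)
            bias += rate * error
            for w, c in features.items():
                weights[w] = weights.get(w, 0.0) + rate * error * c
    return weights, bias


def train_bagged(examples, n_models=11, seed=0):
    """Bagging: train each model on a bootstrap resample of the examples."""
    rng = random.Random(seed)
    return [train_maxent([rng.choice(examples) for _ in examples])
            for _ in range(n_models)]


def predict_bagged(models, sentence):
    """Average the member models' probabilities."""
    features = featurize(sentence)
    return sum(score(m, features) for m in models) / len(models)


TRAINING = [(featurize(s), y) for s, y in [
    ("Its county seat is Clearfield.", 1),
    ("The county seat is Abbeville.", 1),
    ("Bellefonte is the county seat.", 1),
    ("The population density was 28.2 per square kilometer.", 0),
    ("As of 2005 the population was 83,382.", 0),
    ("The population is mostly rural.", 0),
]]
MODELS = train_bagged(TRAINING)

seat_score = predict_bagged(MODELS, "The county seat is Lancaster.")
other_score = predict_bagged(MODELS, "The population was 12,000 in 2000.")
print(seat_score > other_score)
```

A multi-class, multi-label version as described on the slide would train one such model per attribute and let a sentence receive several labels.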

Page 28: 454 Project Ideas

Extractor


Input: a sentence predicted to contain an attribute:
“After considerable debate, the county was incorporated on September 13, 1852”

Output: <founding date, September 13, 1852>

Page 29: 454 Project Ideas

Landscape of Extraction Techniques

Any of these models can be used to capture words, formatting or both.

Lexicons
Alabama, Alaska, …, Wisconsin, Wyoming
Abraham Lincoln was born in Kentucky.
member?

Classify Pre-segmented Candidates
Abraham Lincoln was born in Kentucky.
Classifier: which class?

Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier: which class?
Try alternate window sizes

Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier: which class? (BEGIN / END boundaries)

Context Free Grammars
Abraham Lincoln was born in Kentucky.
POS tags: NNP NNP V V P NP; parse nodes: NP, PP, VP, VP, S
Most likely parse?

Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?

…and beyond

Slides from Cohen & McCallum

Page 30: 454 Project Ideas

Extraction by Sliding Window

E.g. looking for seminar location

CMU UseNet Seminar Announcement:

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell, School of Computer Science, Carnegie Mellon University

3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Slides from Cohen & McCallum


Page 34: 454 Project Ideas

A “Naïve Bayes” Sliding Window Model [Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …

w_{t-m} … w_{t-1} (prefix) | w_t … w_{t+n} (contents) | w_{t+n+1} … w_{t+n+m} (suffix)

If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
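Written out, the Bayes-rule estimate with the independence assumptions listed above takes roughly the following form, using the slide's indexing (prefix w_{t-m} … w_{t-1}, contents w_t … w_{t+n}, suffix w_{t+n+1} … w_{t+n+m}). This is a sketch of the general naive Bayes factorization; the exact terms and smoothing in [Freitag 1997] may differ.

```latex
\[
\Pr(\text{LOCATION} \mid \text{window})
  \;\propto\;
  \Pr(\text{LOCATION})\,
  \Pr(\text{length} = n+1 \mid \text{LOCATION})
  \prod_{i=t-m}^{t-1} \Pr(w_i \text{ in prefix} \mid \text{LOCATION})
  \prod_{i=t}^{t+n} \Pr(w_i \text{ in contents} \mid \text{LOCATION})
  \prod_{i=t+n+1}^{t+n+m} \Pr(w_i \text{ in suffix} \mid \text{LOCATION})
\]
```

Each factor is a quantity of the kind the slide mentions, such as Pr(“Place” in prefix | LOCATION), estimated by counting over labeled announcements.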

Slides from Cohen & McCallum

Page 35: 454 Project Ideas

“Naïve Bayes” Sliding Window Results


Domain: CMU UseNet Seminar Announcements

Field: F1
Person Name: 30%
Location: 61%
Start Time: 98%
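For reference, F1 is the harmonic mean of precision and recall. The short Python sketch below computes it; the function name `f1_score` is an illustrative choice, and the example input reuses the precision and recall figures quoted earlier in the deck for the List & Category document classifier (the deck itself does not report an F1 for that case).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 98.5% and recall 68.8%, the figures quoted for List & Category.
example_f1 = f1_score(0.985, 0.688)
print(round(example_f1, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, high precision cannot compensate for low recall.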

Slides from Cohen & McCallum

Page 36: 454 Project Ideas

State of the Art Performance

Named entity recognition
Person, Location, Organization, …
F1 in high 80’s or low- to mid-90’s

Binary relation extraction
Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
F1 in 60’s or 70’s or 80’s

Wrapper induction
Extremely accurate performance obtainable
Human effort (~30min) required on each site

Slides from Cohen & McCallum

Page 37: 454 Project Ideas

CRF Extractor

Conditional Random Fields Model [Lafferty 01]
Attribute value extraction: sequential data labeling
CRF model for each attribute independently

Relabel – filter false negative training examples
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Preprocessor: Water_area
Classifier: Water_area; Land_area

Pipeline – prune irrelevant sentences

Precision +, Recall -
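To show what "sequential data labeling" means for one attribute, the Python sketch below turns a sentence and a known attribute value into per-token labels of the kind a CRF is trained on. The B-/I-/O tagging scheme, whitespace tokenization, and the function name `label_tokens` are assumptions for illustration; the slides do not say which label set Kylin uses.

```python
def label_tokens(sentence, value, attribute):
    """Tag the first occurrence of value in sentence with B-/I- labels, everything else O."""
    tokens = sentence.split()
    target = value.split()
    labels = ["O"] * len(tokens)
    for start in range(len(tokens) - len(target) + 1):
        if tokens[start:start + len(target)] == target:
            labels[start] = "B-" + attribute
            for i in range(start + 1, start + len(target)):
                labels[i] = "I-" + attribute
            break  # label only the first occurrence
    return list(zip(tokens, labels))


SENTENCE = "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."
labeled_tokens = label_tokens(SENTENCE, "17 km²", "Water_area")
for token, label in labeled_tokens:
    print(token, label)
```

Since there is one CRF per attribute, the same sentence would be labeled again, with a different value and tag, when training the Land_area model.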

Page 38: 454 Project Ideas

Infobox Generation Experiments

Dataset: 2007.02.06 Wikipedia dump

4 popular classes:
U.S. County (1245)
Actor (3819)
Airline (791)
University (4025)

50 random test articles per class

Page 39: 454 Project Ideas

Kylin performance

Page 40: 454 Project Ideas

Kylin performance (detailed view)

U.S. County (better than manual labeling)

Strict expression

Number-typed

Abbeville County is a county located in the U.S. state of South Carolina.

The county has a total area of 2,988 square kilometers (1,154 mi²). 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.

Page 41: 454 Project Ideas

Kylin performance (detailed view)

Former U.S. President Dwight D. Eisenhower served as President of the University.

The College began first in 1855 as a one room schoolhouse.

UCL was founded in 1826 under the name “University of London”.

The college opened in 1973 with the Charlestown campus.

University (worse than manual labeling)

Flexible expression

Global context

Implicit, e.g. student counts at 3 campuses sum to the total student number

Page 42: 454 Project Ideas

Effect of Relabel, Pipeline

Page 43: 454 Project Ideas

Default Project

Reimplement Kylin (or build on Fei’s code)
Improve it
See how much information we can extract

Post on web: DBpedia
Merge back into Wikipedia?
Bot issues
Associate JavaScript

Extraction from the Greater WWW
Self-verify accuracy by external extraction
Add infobox facts which are missing from articles

Page 44: 454 Project Ideas

Extensions

Semi-automated bot interface
Firefox plugin
Displays improved infobox – user checks & says ok
Safer than a bot

For general Wikipedia authors
Extraction in real-time & error checking
Attribute values
Guide towards best schema & attribute
Typing & microformats

Page 45: 454 Project Ideas

Extensions

Other Wikipedia issues
Learn author reputation
Watch for changes
Look for framing or biased language
Recognize vandalism

Auto-generate disambiguation pages
Extract events & create a timeline view
Citation assistance: identify correspondence between text and citation
Semiautomatic article generation

Page 46: 454 Project Ideas

Extensions

Where could this be applied besides Wikipedia?

Broader Questions
Internet enables generation of structured content
How to integrate methods? Overwrite, training data, ???