mianwei zhou, tao cheng, kevin chen-chuan chang wsdm 2010, new york, usa 1

14
anwei Zhou, Tao Cheng, Kevin Chen-Chuan Chan WSDM 2010, New York, US 1

Upload: reginald-gray

Post on 13-Jan-2016

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan ChangWSDM 2010, New York, USA

1

Page 2: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Web Info Extractio

n

Web Info Extractio

n

Typed Entity Search

Typed Entity Search

Web-based

Q/A

Web-based

Q/A

In most cases, what we really want are not pages, but the information units inside.

? ? ? ? ? ? ? ?

2

Page 3: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Specialized Information Specialized Information ExtractorsExtractors

Specialized Information Specialized Information ExtractorsExtractors

Web Information Extraction (WIE)(Marius 2006, Cafarella 2005, Etzioni 2004)

Pattern: “X is CEO

of Y”

Company

CEO

Google Eric Schmidt

IBM S. Palmisano

… …Limitation

•Focus on simple patterns.

•Lack of interactivity.

3

Page 4: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Web-based Question Answering (WQA)(Wu 2007, Lin 2003, Brill 2002)

Who is CEO of Dell?

Who is CEO of Dell?

Keywords:“CEO Dell”

Parse Top-k results

Michael Dell

Limitation

•Only rely on top-k pages to

retrieve the answer.

4

Page 5: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Typed-Entity Search (TES)(Cheng 2007, Cafarella 2007, Chakrabarti 2006)

Amazon PhoneAmazon Phone

……

0.60

0.80

0.90

Ranked Entity List

But … Where is CEO

Limitation

•Limited Number of Data Type

•Lack of Flexibility

5

Page 6: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

? ?? ?? ?? ?Data-oriented Content Data-oriented Content Query SystemQuery System

Data-oriented Content Data-oriented Content Query SystemQuery System

Web Info Extractio

n

Web Info Extractio

n

Typed Entity Search

Typed Entity Search

Web-based QA

Web-based QA

Requirements1. Extensible Data Types2. Flexible Contextual

Patterns3. Customizable Scoring 6

Page 7: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Input: CQL (Content Query Language)

Output

Entity Searc

h

Entity Searc

h

Web QA

Web QA

Data-oriented Content Data-oriented Content Query SystemQuery System

Data-oriented Content Data-oriented Content Query SystemQuery System

7

Page 8: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

8

http://wisdm.cs.uiuc.edu/demos/docqs

Page 9: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

What we need Relational Model

Person Organization

LocationNumber

Number

Person

Organization

Location

9

Page 10: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

What we need Relational Model

Find the population of China

WHERE pattern(…)

GROUP BY #number

ORDER BY conf()

FROM #number

10

China has a population of 1.3 billion

China with its population of 1.3 billion people

China is established in 1949.

Shanghai is the largest city with 15 million inhabitants in China

1.3 billion 15 million

1. 1.3 billion

2. 15 million

1. 1.3 billion

2. 15 million

Page 11: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

What we need Relational Model

Number

Location

Person

PopulationPhonePrice

CapitalHeadquarter

ProfessorCEOPresident

Table

View

Number

population

price

phone

11

Page 12: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Index LayerIndex Layer

Parsing Layer

Parsing Layer

Index Selection Module

Index Selection Module

Execution Tree

Execution Tree

INPUTSELECT …FROM …WHERE …

OUTPUT

Index DesignSpecial Inverted

Index•Contextual Index•Join Index

Index DesignSpecial Inverted

Index•Contextual Index•Join Index

Query OptimizationGraph Coverage

Problem

Query OptimizationGraph Coverage

Problem

Data Type

Repository

Data Type

Repository

Data Type Definition

12

Experimental Result

• Speed improvement: 6-10

times•Space overhead: Around 2

times original corpus size.

Experimental Result

• Speed improvement: 6-10

times•Space overhead: Around 2

times original corpus size.

Page 13: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

Data-oriented Content Query Data-oriented Content Query SystemSystem

Data-oriented Content Query Data-oriented Content Query SystemSystem

Web Info ExtractionWeb Info

Extraction

Typed Entity Search

Typed Entity Search

Web-based Q/A

Web-based Q/A

13

Page 14: Mianwei Zhou, Tao Cheng, Kevin Chen-Chuan Chang WSDM 2010, New York, USA 1

14