The Cimple Project on Community Information Management

AnHai Doan
University of Wisconsin-Madison

The CIM Problem

Numerous online communities
– database researchers, movie fans, legal professionals, bioinformatics, enterprise intranets, tech support groups

Each community = many data sources + many members

Database community
– home pages, project pages, DBworld, DBLP, conference pages, ...

Movie fan community
– review sites, movie home pages, theatre listings, ...

Legal profession community
– law firm home pages

The CIM Problem

Members often want to discover, query, and monitor information in the community

Database community
– what is new in the past week in the database community?
– any interesting connection between researchers X and Y?
– find all citations of this paper in the past week on the Web
– what are the current hot topics? who has moved where?

Legal profession community
– which lawyers have moved where?
– which law firms have taken on which cases?

The CIM Problem

To address such needs, build data portals
Starting out topic-based, now structured data portals
– DBLP, Citeseer, IMDB, GlobalSpec, etc.

Limitations of current solutions
– mostly by hand, labor intensive, error prone
– hard-to-port solutions
– few services other than browsing and keyword search

Cimple Project @ Wisconsin / Yahoo! Research

[Figure: data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents) → entity-relationship graph (e.g., Jim Gray, give-talk, SIGMOD-04) → services (keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary). Users personalize the system and provide feedback.]

Develop generic solutions to create structured data portals via extraction + integration + mass collaboration

The Research Team

Faculty / Vice President
– AnHai Doan
– Raghu Ramakrishnan

Current students
– Pedro DeRose
– Warren Shen
– Fei Chen
– Yoonkyong Lee
– Doug Burdick
– Mayssam Sayyadian
– Xiaoyong Chai
– Ting Chen

Prototype System: DBLife

Integrate data of the DB research community
1164 data sources
Crawled daily, 11000+ pages = 160+ MB / day

Data Extraction

Data Integration

Raghu Ramakrishnan

co-authors = A. Doan, Divesh Srivastava, ...

Resulting ER Graph

[Figure: ER graph with nodes Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, the paper “Proactive Re-optimization”, and SIGMOD 2005; edge labels: coauthor, advise, write, PC-Chair, PC-member]

Querying The ER Graph

Query: “David DeWitt Jennifer Widom”

[Figure: three ranked results, each a subgraph of the ER graph that links Jennifer Widom and David DeWitt: (1) a coauthor edge; (2) a connection through SIGMOD 2005 (PC-Chair, PC-member); (3) a connection through Shivnath Babu (coauthor, advise)]
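A minimal sketch of how a two-name keyword query could be answered over such a graph: enumerate short paths that connect the two entities, shortest first. The node and edge-label names come from the slides, but which nodes each edge connects is an assumption made for illustration, and `connecting_paths` is a hypothetical helper, not DBLife's actual ranking method.

```python
from collections import deque

# Hypothetical edges: names come from the slides, but the exact endpoints
# of each edge are an assumption made for illustration.
EDGES = [
    ("Jennifer Widom", "coauthor", "David DeWitt"),
    ("Jennifer Widom", "PC-Chair", "SIGMOD 2005"),
    ("David DeWitt", "PC-member", "SIGMOD 2005"),
    ("Jennifer Widom", "advise", "Shivnath Babu"),
    ("Shivnath Babu", "coauthor", "David DeWitt"),
]


def build_adjacency(edges):
    """Treat every labeled edge as undirected for the purpose of search."""
    adj = {}
    for src, label, dst in edges:
        adj.setdefault(src, []).append((label, dst))
        adj.setdefault(dst, []).append((label, src))
    return adj


def connecting_paths(edges, start, goal, max_hops=2):
    """Return simple paths [node, label, node, ...] from start to goal,
    with at most max_hops edges, in breadth-first (shortest-first) order."""
    adj = build_adjacency(edges)
    results = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            results.append(path)
            continue
        if len(path) // 2 == max_hops:  # path holds 2*hops + 1 items
            continue
        for label, nxt in adj.get(node, []):
            if nxt not in path[::2]:  # path[::2] are the visited nodes
                queue.append(path + [label, nxt])
    return results


paths = connecting_paths(EDGES, "David DeWitt", "Jennifer Widom")
for rank, path in enumerate(paths, start=1):
    print(rank, " ".join(path))
```

Breadth-first order is used here only as a stand-in for ranking: shorter connections are listed before longer ones.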

Provide Services

DBLife system

http://sapa.cs.uiuc.edu/cgi-bin/dblife/index.cgi

Mass Collaboration: Example 1

Picture is removed if enough users vote “no”.
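A minimal sketch of such a voting rule. The slide does not say how many votes count as "enough", so the thresholds `min_no_votes` and `min_no_fraction`, and the `PictureVotes` type itself, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PictureVotes:
    """Tally of user votes on whether a picture shows the right person."""
    yes: int = 0
    no: int = 0

    def vote(self, is_correct: bool) -> None:
        if is_correct:
            self.yes += 1
        else:
            self.no += 1


def should_remove(votes: PictureVotes,
                  min_no_votes: int = 3,
                  min_no_fraction: float = 0.5) -> bool:
    """Remove only if enough users voted "no" AND they form a majority.
    Both thresholds are assumptions, not values from the slides."""
    total = votes.yes + votes.no
    if total == 0:
        return False
    return votes.no >= min_no_votes and votes.no / total > min_no_fraction


votes = PictureVotes()
for is_correct in (False, False, True, False):
    votes.vote(is_correct)
print(should_remove(votes))
```

Requiring both an absolute count and a majority is one way to keep a single dissenting user from removing a picture.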

Mass Collaboration Meets Jeff Naughton

Jeffrey F. Naughton swears that this is David J. DeWitt

Mass Collaboration: Example 2

Community Wikipedia backed up by a structured underlying database

What We Have Done

Define the CIM problem / understand it a little bit
– start to talk about it in the DB community [SIGMOD-06 tutorial, IEEE DEB-06, CIDR-07]

Build DBLife / helps clarify research issues
– live at dblife.cs.wisc.edu
– latest stuff at dblife-labs.cs.wisc.edu

Start some preliminary research
– ICDE-07a, ICDE-07b, ICDE-07b

What We Would Like to Do Next

Release DBLife
– as a research / education tool
– possible service to the DB community
– demo of CIM systems
– benchmark / challenge for data integration / extraction

Develop and release a generic Cimple platform
– anyone can use it to build structured data portals

Build CimBase: a hosting service
– anyone can specify a structured portal on CimBase
– we will build and host it

Continue research / expand team / build alliance

Research Challenges (1)

Information extraction
Data integration
Mass collaboration

[Figure: Cimple overview diagram (data sources → ER graph → services)]


[Figure: Cimple overview diagram (data sources → ER graph → services)]

Exploiting extracted data
Handling uncertainty / provenance / explanation
Dealing with evolving data, versioning, temporal data

[Figure: Cimple overview diagram (data sources → ER graph → services)]

What is the right architecture?
What is the right data model / storage?
How to build continuously running systems?
How to build massively scalable hosting services?
How to build a generic CIM platform?

Rest of the Talk

The CIM problem
The Cimple solution approach
What we have done / plan to do
Research challenges
– information extraction
– data integration (focus on entity matching)
– mass collaboration
Broader perspectives

Declarative IE

Current IE research
– develops learning- & rule-based solutions [SIGMOD-06 tutorial]
– focuses largely on improving accuracy

Real-world IE applications
– glue multiple such solutions together, using Perl

Serious problems
– hard to develop, understand, debug, and optimize

DECLARATIVE IE

Dr. R. Ramakrishnan

This is a fun topic ...

Example in DBLife

Find conference name in raw text

#############################################################################
# Regular expressions to construct the pattern to extract conference names
#############################################################################

# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g., "International Conference ..." or the conference name for workshops (e.g., "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

##############################################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);

#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern {
  my ($file,$pattern) = @_;

Example in DBLife (cont.)

  # Only look for conference names in the top 20 lines of the file
  my $maxLines=20;
  my $topOfFile=getTopOfFile($file,$maxLines);

  # Look for the match in the top 20 lines - case insensitive, allow matches spanning multiple lines
  if($topOfFile=~/(.*?)$pattern/is) {
    my ($prefix,$name)=($1,$2);

    # If it matches, do a sanity check and clean up the match
    # Get the first letter
    # Verify that the first letter is a capital letter or number
    if(!($name=~/^\W*?[A-Z0-9]/)) { return (); }

    # If there is an abbreviation, cut off whatever comes after that
    if($name=~/^(.*?$abbreviations)/s) { $name=$1; }

    # If the name is too long, it probably isn't a conference
    # (count the non-whitespace characters by forcing list context)
    if(scalar(() = $name=~/[^\s]/g) > 100) { return (); }

    # Get the first letter of the last word (need to do this after chopping off parts of it due to abbreviation)
    my ($letter,$nonLetter)=("[A-Za-z]","[^A-Za-z]");
    " $name"=~/$nonLetter($letter)$letter*$nonLetter*$/; # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in name
    my $lastLetter=$1;
    if(!($lastLetter=~/[A-Z]/)) { return (); } # Verify that the first letter of the last word is a capital letter

    # Passed test, return a new crutch
    return newCrutch(length($prefix),length($prefix)+length($name),$name,"Matched pattern in top $maxLines lines","conference name",getYear($name));
  }
  return ();
}

Solution: Declarative, Compositional IE

Treat each solution as a “black box”
Glue black boxes using a Datalog-like language
– author(y,d) :- docs(d), name(y,d), title(x,d), distance-line(x,y)<3
– name(y,d) :- docs(d), seeds(s), namepatterns(s,p), match(p,d,y)
– title(x,d) :- docs(d), lines(x,n,d), allcaps(x), (n<5)

[Figure: sample document d with the lines “DECLARATIVE IE” and “Dr. R. Ramakrishnan”; seeds(s) = (Raghu, Ramakrishnan), (Divesh, Srivastava), ...; name patterns p for a seed = Raghu Ramakrishnan, R. Ramakrishnan, Dr. Ramakrishnan, etc.]
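The three rules can be read as a composition of small black boxes. Below is a minimal Python sketch of that reading, not the Cimple implementation: each predicate becomes a function, and the particular name variants produced by `namepatterns`, the use of plain substring matching for `match`, and the 0-based line numbering are all assumptions made for illustration.

```python
import re

DOC = "DECLARATIVE IE\nDr. R. Ramakrishnan\nThis is a fun topic ..."
SEEDS = [("Raghu", "Ramakrishnan"), ("Divesh", "Srivastava")]


def lines(doc):
    """lines(x,n,d): yield (text, line_number) for each non-empty line."""
    for n, text in enumerate(doc.splitlines()):
        if text.strip():
            yield text.strip(), n


def title(doc):
    """title(x,d) :- docs(d), lines(x,n,d), allcaps(x), (n<5)"""
    return [(x, n) for x, n in lines(doc) if x.isupper() and n < 5]


def namepatterns(seed):
    """namepatterns(s,p): hypothetical name variants for a (first, last) seed."""
    first, last = seed
    return [f"{first} {last}", f"{first[0]}. {last}", f"Dr. {last}"]


def name(doc, seeds):
    """name(y,d) :- docs(d), seeds(s), namepatterns(s,p), match(p,d,y)"""
    found = []
    for seed in seeds:
        for pattern in namepatterns(seed):
            for text, n in lines(doc):
                hit = re.search(re.escape(pattern), text)
                if hit:
                    found.append((hit.group(0), n))
    return found


def author(doc, seeds):
    """author(y,d) :- docs(d), name(y,d), title(x,d), distance-line(x,y)<3"""
    titles = title(doc)
    return sorted({y for y, ny in name(doc, seeds)
                   for _, nx in titles if abs(nx - ny) < 3})


print(author(DOC, SEEDS))
```

Because each predicate is an ordinary function, any one of them can be swapped for a learned or rule-based extractor without touching the others.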

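IE Execution Plan

[Figure: execution plan for the author rule over the sample document (“DECLARATIVE IE”, “Dr. R. Ramakrishnan”). Title branch: docs(d) → lines(x,n,d) → SELECT_[allcaps(x) and (n<5)]. Name branch: seeds(s) → namepatterns(p,s), combined with docs(d) → match(y,p,d). The branches are joined on distance-line(x,y)<3, followed by PROJECT_[y,d].]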


Sample Optimization: Push Down Selections

[Figure: execution plan over docs(d), lines(x,n,d), seeds(s), match(y,p,d), and PROJECT_[y,d], with selections pushed down]
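A minimal sketch of what pushing a selection down buys, using the title conditions allcaps(x) and n<5 from the author rule. The slide's figure is not recoverable, so this only illustrates the general relational optimization; the function names and the pair count used as a cost measure are assumptions.

```python
# (text, line_number) tuples; the sample lines mirror the slide's document.
LINES = [
    ("DECLARATIVE IE", 0),
    ("Dr. R. Ramakrishnan", 1),
    ("This is a fun topic ...", 2),
]
NAMES = [("R. Ramakrishnan", 1)]  # output of the name branch


def naive_plan(lines, names):
    """Join every line with every name first, then apply all conditions."""
    pairs = [(x, y) for x in lines for y in names]
    kept = [(x, y) for x, y in pairs
            if x[0].isupper() and x[1] < 5 and abs(x[1] - y[1]) < 3]
    return sorted({y[0] for _, y in kept}), len(pairs)


def pushed_down_plan(lines, names):
    """Apply the title selection before the join, so fewer pairs are built."""
    titles = [x for x in lines if x[0].isupper() and x[1] < 5]
    pairs = [(x, y) for x in titles for y in names]
    kept = [(x, y) for x, y in pairs if abs(x[1] - y[1]) < 3]
    return sorted({y[0] for _, y in kept}), len(pairs)


print(naive_plan(LINES, NAMES))
print(pushed_down_plan(LINES, NAMES))
```

Both plans return the same authors; the second simply builds fewer intermediate pairs, which matters when each document has many lines and many candidate names.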


Sample Optimization: Order Operations

[Figure: execution plan over docs(d), lines(x,n,d), seeds(s), match(y,p,d), and PROJECT_[y,d], with operations reordered]

Sample Optimization: Efficient Large-Scale Pattern Matching

[Figure: execution plan over docs(d), lines(x,n,d), seeds(s), match(y,p,d), and PROJECT_[y,d], with efficient large-scale pattern matching]


Related Project: Avatar @ IBM Almaden

Person followed by ContactPattern followed by PhoneNumber
– ContactPattern ← RegularExpression(Email.body, “can be reached at”)
– PersonPhone ← Precedes(Precedes(Person, ContactPattern, D), Phone, D)

Example: Person can be reached at PhoneNumber

Declarative Query Language

Information Extraction: Another Example

[Figure: snapshots of a document as it changes over time]
time 0: “DECLARATIVE IE / Dr. R. Ramakrishnan”
time 1: “DECLARATIVE IE / Dr. R. Ramakrishnan / This is a great topic ...”
time 2: “DECLARATIVE IE / Dr. R. Ramakrishnan / More will follow soon ...”

How to efficiently extract information over text streams?

Data Integration Research: Setting the Context

Past and current work
– build the foundation: TSIMMIS, Information Manifold, UPenn, P2P, etc.
– develop solutions for specific integration tasks: wrapping, schema matching, entity matching, adaptive QP, etc.
– branching into many app. domains: bioinformatics, PIM (e.g., semex, iMemex), etc.
– top-k, topX query processing

Our work in Cimple
– compositional solutions for schema matching, entity matching, etc. [VLDB-05a, VLDBJ-06, ICDE-07a, Tech Report-07a]
– best-effort data integration: e.g., keyword search + automatic schema matching + automatic entity matching over relational databases [ICDE-07b]
– data integration for masses [Tech Report-07b]

Sample Data Integration Challenge in Cimple: Matching Mentions of Entities

[Figure: Cimple overview diagram (data sources → ER graph → services); mentions such as “Jim Gray” and “SIGMOD-04” appear in multiple places and must be matched]


Extremely Important Problem!

Appears in numerous real-world contexts
Plagues many applications that we have seen
– Citeseer, Rexa, DBLP, InfoZoom, etc.

Why so important?
Many services rely on correct mention matching
Incorrect matching propagates errors

An Example

DBLife incorrectly matches this mention “J. Han” with “Jiawei Han”, but it actually refers to “Jianchao Han”.

Discover related organizations using occurrence analysis:

“J. Han ... Centrum voor Wiskunde en Informatica”

Classical Mention Matching

Applies just a single “matcher”
Focuses mainly on improving matcher accuracy

Our key observation: A single matcher often has limited utility

Illustrating Example

d1: Luis Gravano’s Homepage
– L. Gravano, K. Ross. Text Databases. SIGMOD 03
– L. Gravano, J. Sanz. Packet Routing. SPAA 91
– L. Gravano, J. Zhou. Text Retrieval. VLDB 04

d2: Columbia DB Group Page
– Members: L. Gravano, K. Ross, J. Zhou

d3: DBLP
– Luis Gravano, Kenneth Ross. Digital Libraries. SIGMOD 04
– Luis Gravano, Jingren Zhou. Fuzzy Matching. VLDB 01
– Luis Gravano, Jorge Sanz. Packet Routing. SPAA 91
– Chen Li, Anthony Tung. Entity Matching. KDD 03
– Chen Li, Chris Brown. Interfaces. HCI 99

d4: Chen Li’s Homepage
– C. Li. Machine Learning. AAAI 04
– C. Li, A. Tung. Entity Matching. KDD 03

Only one Luis Gravano
Two Chen Li-s
What is the best way to match mentions here?

A liberal matcher: good for matching Luis Gravano, bad for matching Chen Li

s0 matcher: two mentions match if they share the same name.

A conservative matcher: good for matching Chen Li, bad for matching Luis Gravano

s1 matcher: two mentions match if they share the same name and at least one co-author name.

Better solution: apply both matchers in a workflow

[Figure: matching workflow over d1, d2, d3, and d4 that combines union operators with the matchers s0 and s1]

s0 matcher: two mentions match if they share the same name.
s1 matcher: two mentions match if they share the same name and at least one co-author name.
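A minimal sketch of the two matchers and of one possible workflow that combines them. Several things here are assumptions: names are compared by first initial plus last name, the `workflow` function (s0 inside each homepage, s1 across all sources) is a hypothetical composition and not necessarily the one in the slide's figure, and mention 7 is assumed to be the second Chen Li.

```python
from itertools import combinations

# A subset of the mentions in the illustrating example, one per (paper, author).
MENTIONS = [
    {"id": 1, "src": "d1", "name": "L. Gravano", "coauthors": ["K. Ross"]},
    {"id": 2, "src": "d1", "name": "L. Gravano", "coauthors": ["J. Sanz"]},
    {"id": 3, "src": "d3", "name": "Luis Gravano", "coauthors": ["Jorge Sanz"]},
    {"id": 4, "src": "d4", "name": "C. Li", "coauthors": []},
    {"id": 5, "src": "d4", "name": "C. Li", "coauthors": ["A. Tung"]},
    {"id": 6, "src": "d3", "name": "Chen Li", "coauthors": ["Anthony Tung"]},
    {"id": 7, "src": "d3", "name": "Chen Li", "coauthors": ["Chris Brown"]},
]


def key(name):
    """Reduce 'Luis Gravano' and 'L. Gravano' to ('L', 'Gravano') (an assumption)."""
    parts = name.replace(".", "").split()
    return parts[0][0], parts[-1]


def s0(a, b):
    """Liberal matcher: same name."""
    return key(a["name"]) == key(b["name"])


def s1(a, b):
    """Conservative matcher: same name and at least one shared co-author."""
    shared = {key(c) for c in a["coauthors"]} & {key(c) for c in b["coauthors"]}
    return s0(a, b) and bool(shared)


def match(mentions, matcher):
    """Return the set of (id, id) pairs that the matcher declares a match."""
    return {(a["id"], b["id"])
            for a, b in combinations(mentions, 2) if matcher(a, b)}


def workflow(mentions):
    """Hypothetical workflow: s0 within each homepage, s1 over everything."""
    pairs = set()
    for homepage in ("d1", "d4"):
        pairs |= match([m for m in mentions if m["src"] == homepage], s0)
    return pairs | match(mentions, s1)


print(sorted(match(MENTIONS, s0)))   # liberal: also links mention 7
print(sorted(match(MENTIONS, s1)))   # conservative: misses (1, 2) and (4, 5)
print(sorted(workflow(MENTIONS)))
```

On this sample, s0 alone links every Chen Li mention to every other, s1 alone fails to link mentions on the same homepage that share no co-author, and the combined workflow avoids both problems.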

Key Challenges

How to compose matchers, to form a space of workflows?
How to estimate the accuracy of each workflow?
How to efficiently find one with high accuracy?

[Figure: matching workflow over d1, d2, d3, and d4 that combines union operators with the matchers s0 and s1]

[See ICDE-07a]
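A minimal sketch of one way to score a workflow, assuming a small hand-labeled set of correct mention pairs is available. That assumption, the `pairwise_scores` helper, and the sample pairs below are all hypothetical; the slides do not say how accuracy is estimated, and the cited paper may use a different method.

```python
def pairwise_scores(predicted, gold):
    """Return (precision, recall, F1) of predicted match pairs against
    hand-labeled gold pairs; both arguments are sets of (id, id) tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical output of a workflow and hypothetical gold labels.
PREDICTED = {(1, 2), (2, 3), (4, 5), (5, 6)}
GOLD = {(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6)}

precision, recall, f1 = pairwise_scores(PREDICTED, GOLD)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With such a score in hand, candidate workflows can be compared and the search for a good one becomes an optimization over the space of compositions.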

Mass Collaboration: The General Idea

Many applications have multiple developers / users
– how to exploit feedback from all of them?

Variants of this are known as
– collective development of system, mass collaboration, collective curation, Web 2.0 applications, social software, etc.

Has been applied to many applications
– open-source software, bug detection, tech support group, Yahoo! Answers, Google Co-op, and many more

Studied in some academic contexts, e.g., ESP Game

Little has been done in extraction / integration contexts
– except in industry, e.g., epinions.com

Sample Mass Collaboration in DBLife

[Figure: raw data, IE, and W1, W2, ..., Wn]

Key Challenges

What types of extraction / integration tasks are most amenable to mass collaboration?
– e.g., see MOBS project at Illinois [WebDB-03, ICDE-05]

How to entice people to contribute?
What can they contribute?
What is the underlying data model?
How to handle the Naughton effect?
How to propagate user contributions?
How to undo?
How to reconcile multiple conflicting editions?
– e.g., see ORCHESTRA project at Penn [Taylor & Ives, SIGMOD-06]

Sample Research: Summary

Information extraction
– how to do it in a declarative / compositional fashion?
– how to apply database-like optimization techniques?

Data integration
– how to do it incrementally (best effort, pay-as-you-go)? an example of a Data Space?
– how to do it in a compositional fashion?

Human computation / mass collaboration
– new! (Though industry has been doing it for years.)
– how to do it for data management tasks?

Conclusions

Community Information Management
– increasingly crucial problem

The Cimple project
– sample challenges: information extraction, data integration, human computation
– extends the footprints of DB technologies to Web data
– develops new DB technologies

DBLife prototype
– research/education tool, community service, benchmark

Search “cimple wisc” for project homepage

Broader Perspectives [speculation mode]

Current Web: keyword search over text
Future Web
– should have increasingly more structure
– should have more ways to exploit structure
– should be more “social”

This future Web should be great for our community
– we are the “Structure King”
– if the Web remains text-centric, it is not as good for us

How to accelerate the coming of this future Web?
– Cimple and many current projects can contribute
– but as a community we need more efforts in this direction!
