AnHai Doan, University of Wisconsin-Madison. Data Quality Challenges in Community Systems. Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron Gao, Fei Chen, Yoonkyong Lee, Raghu Ramakrishnan, Jeff Naughton


Post on 03-Jan-2016

TRANSCRIPT

Page 1:

AnHai Doan
University of Wisconsin-Madison

Data Quality Challenges in Community Systems

Joint work with Pedro DeRose, Warren Shen, Xiaoyong Chai, Byron Gao, Fei Chen, Yoonkyong Lee, Raghu Ramakrishnan, Jeff Naughton

Page 2:

Numerous Web Communities

Academic domains
– database researchers, bioinformaticians

Infotainment
– movie fans, mountain climbers, fantasy football

Scientific data management
– biomagnetic databank, E. coli community

Business
– enterprise intranets, tech support groups, lawyers

CIA / homeland security
– Intellipedia

Page 3:

Much Effort to Build Community Portals

Initially taxonomy based (e.g., Yahoo style)

But now many structured data portals
– capture key entities and relationships of community

No general solution yet on how to build such portals

Page 4:

Cimple Project @ Wisconsin / Yahoo! Research

[Figure: Cimple architecture. Data sources (researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents) feed an entity-relationship graph (e.g., Jim Gray, give-talk, SIGMOD-04), which supports services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Other labels: maintain and add more sources; mass collaboration.]

Develops such a general solution using extraction + integration + mass collaboration

Page 5:

Prototype System: DBLife

Integrate data of the DB research community

1164 data sources

Crawled daily, 11000+ pages = 160+ MB / day

Page 6:

Data Extraction

Page 7:

Data Integration

Raghu Ramakrishnan

co-authors = A. Doan, Divesh Srivastava, ...

Page 8:

Resulting ER Graph

[Figure: ER graph. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, Pedro Bizarro, David DeWitt, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]

Page 9:

Provide Services

DBLife system

Page 10:

Mass Collaboration: Voting

Picture is removed if enough users vote “no”.

Page 11:

Mass Collaboration via Wiki

Page 12:

Summary: Community Systems

Data integration systems + extraction + Web 2.0
– manage both data and users in a synergistic fashion

In sync with current trends
– manage unstructured data (e.g., text, Web pages)
– get more structure (IE, Semantic Web)
– engage more people (Web 2.0)
– best-effort data integration, data spaces, pay-as-you-go

Numerous potential applications

But raises many difficult data quality challenges

Page 13:

Rest of the Talk

Data quality challenges in
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration

Conclusions & ways forward

Page 14:

1. Source Selection

[Figure: Cimple architecture diagram, repeated from Page 4]

Page 15:

Current Solutions vs. Cimple

Current solutions
– find all relevant data sources (e.g., using focused crawling, search engines)
– maximize coverage
– have a lot of noisy sources

Cimple
– starts with a small set of high-quality “core” sources
– incrementally adds more sources
   – only from “high-quality” places
   – or as suggested by users (mass collaboration)

Page 16:

Start with a Small Set of “Core” Sources

Key observation: communities often follow 80-20 rules
– 20% of sources cover 80% of interesting activities

Initial portal over these 20% is often already quite useful

How to select these 20%
– select as many sources as possible
– evaluate and select most relevant ones

Page 17:

Evaluate the Relevancy of Sources

Use PageRank + virtual links across entities + TF/IDF

[Figure: virtual link example with the mentions “Gerhard Weikum” and “G. Weikum”]

See [VLDB-07a]
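The combination named on this slide can be pictured with a minimal sketch. It is not the method of [VLDB-07a]: the function names, the rule for adding virtual links (two sources are linked when they mention a common entity), and the way the PageRank and TF/IDF scores are multiplied together are all assumptions made for illustration, and the sample sources are hypothetical.

```python
import math

def add_virtual_links(links, mentions):
    """Build a graph from real links plus a virtual link between any two
    sources that mention a common entity (assumed linking rule)."""
    graph = {}
    for src, outs in links.items():
        graph.setdefault(src, set()).update(outs)
        for dst in outs:
            graph.setdefault(dst, set())
    for src in mentions:
        graph.setdefault(src, set())
    for a in mentions:
        for b in mentions:
            if a != b and mentions[a] & mentions[b]:
                graph[a].add(b)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; a source with no out-links spreads its rank evenly."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {s: 1.0 / n for s in nodes}
    for _ in range(iterations):
        new = {s: (1.0 - damping) / n for s in nodes}
        for s in nodes:
            targets = graph[s] or nodes
            for t in targets:
                new[t] += damping * rank[s] / len(targets)
        rank = new
    return rank

def tfidf(term, tokens, corpus):
    """Term frequency in one source times a smoothed inverse document frequency."""
    tf = tokens.count(term) / len(tokens) if tokens else 0.0
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log((1 + len(corpus)) / (1 + df))

def relevance(links, mentions, texts, topic_terms):
    """Assumed combination: PageRank score times summed TF/IDF of community terms."""
    rank = pagerank(add_virtual_links(links, mentions))
    corpus = list(texts.values())
    return {s: rank.get(s, 0.0) * sum(tfidf(t, texts[s], corpus) for t in topic_terms)
            for s in texts}

# Hypothetical sources; name forms such as "G. Weikum" are assumed already resolved.
links = {"dbworld": ["sigmod"], "sigmod": [], "recipes": []}
mentions = {"dbworld": {"Gerhard Weikum"}, "sigmod": {"Gerhard Weikum"}, "recipes": set()}
texts = {
    "dbworld": ["call", "for", "papers", "database"],
    "sigmod": ["database", "conference", "query"],
    "recipes": ["pasta", "recipe"],
}
scores = relevance(links, mentions, texts, ["database", "query"])
```

In this toy setup the source that mentions no community entity and uses no community terms ends up with a score of zero, which is the behavior a relevance measure for source selection would need.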

Page 18:

Add More Sources over Time

Key observation: most important sources will eventually be mentioned within the community
– so monitor certain “community channels” to find them

Message type: conf. ann.
Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data

Call for Participation Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007

http://mud.cs.utwente.nl ...

Also allow users to suggest new sources
– e.g., the Silicon Valley Database Society
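Monitoring a community channel could start from something as simple as the sketch below, which pulls URLs out of a message and keeps those not already crawled. The function name, the URL pattern, and the `known_sources` set are assumptions for illustration; the slides do not describe how DBLife actually does this.

```python
import re

# Deliberately simple URL pattern (an assumption, not a full URL grammar).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def candidate_sources(message, known_sources):
    """Return URLs mentioned in a message that are not yet known, in order of appearance."""
    found = []
    for url in URL_PATTERN.findall(message):
        url = url.rstrip(".,;:)")  # drop trailing punctuation picked up from prose
        if url not in known_sources and url not in found:
            found.append(url)
    return found

announcement = (
    'Call for Participation Workshop on "Management of Uncertain Data" '
    "in conjunction with VLDB 2007\n"
    "http://mud.cs.utwente.nl ..."
)
new_sources = candidate_sources(announcement, known_sources=set())
```

Each candidate would still need the relevance check described earlier before being added, so that expansion does not bring in noisy sources.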

Page 19:

Summary: Source Selection

Sharp contrast to current work
– start with highly relevant sources
– expand carefully
– minimize “garbage in, garbage out”

Need a notion of source relevance

Need a way to compute this

Page 20:

2. Extraction and Integration

[Figure: Cimple architecture diagram, repeated from Page 4]

Page 21:

Extracting Entity Mentions

Key idea: reasonable plan, then patch

Reasonable plan:
– collect person names, e.g., David Smith
– generate variations, e.g., D. Smith, Dr. Smith, etc.
– find occurrences of these variations

[Plan diagram: ExtractMbyName over Union of sources s1 … sn]

Works well, but can’t handle certain difficult spots
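The three steps of the reasonable plan can be sketched as follows. This is an illustration only: the function names and the exact set of variations (including the "Last, F." form) are assumptions, not the actual ExtractMbyName operator.

```python
import re

def name_variations(full_name):
    """Generate simple surface forms for a dictionary name (assumed variation set)."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return [full_name, f"{first[0]}. {last}", f"Dr. {last}", f"{last}, {first[0]}."]

def extract_mentions_by_name(text, dictionary):
    """Find every occurrence of every variation; returns (offset, surface, name) tuples."""
    mentions = []
    for full_name in dictionary:
        for variation in name_variations(full_name):
            # Lookarounds keep a match from starting or ending inside a longer word.
            pattern = r"(?<!\w)" + re.escape(variation) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                mentions.append((match.start(), match.group(), full_name))
    return sorted(mentions)

text = "Talk by Dr. Smith and D. Smith; see David Smith's page."
found = extract_mentions_by_name(text, ["David Smith"])
```

The "Last, F." variation is what makes this loose plan misfire on comma-separated name lists such as "R. Miller, D. Smith, B. Jones", the difficult spot taken up on the next slide.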

Page 22:

Handling Difficult Spots

Example
– R. Miller, D. Smith, B. Jones
– if “David Miller” is in the dictionary, the plan will flag “Miller, D.” as a person name

Solution: patch such spots with stricter plans

[Plan diagram: ExtractMbyName over Union of sources s1 … sn, plus FindPotentialNameLists and ExtractMStrict]
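A minimal sketch of the patching idea, using the slide's own example. The function names echo the operator names in the diagram, but their bodies are assumptions: here a "potential name list" is any comma-separated line whose items all look like "F. Last", and the strict plan only accepts an item that matches a dictionary name as a whole.

```python
import re

NAME_ITEM = re.compile(r"[A-Z]\.\s+[A-Z][a-z]+")  # e.g. "R. Miller"

def looks_like_name_list(line):
    """Assumed detector: at least two comma-separated items, all shaped like 'F. Last'."""
    items = [part.strip() for part in line.split(",")]
    return len(items) >= 2 and all(NAME_ITEM.fullmatch(item) for item in items)

def extract_loose(line, dictionary):
    """Loose plan: accept any variation anywhere in the line, including 'Last, F.'."""
    hits = []
    for full in dictionary:
        first, last = full.split()[0], full.split()[-1]
        for variation in (full, f"{first[0]}. {last}", f"{last}, {first[0]}."):
            if variation in line:
                hits.append((variation, full))
    return hits

def extract_strict(line, dictionary):
    """Strict plan for name lists: each comma-separated item must match as a whole."""
    abbreviated = {f"{n.split()[0][0]}. {n.split()[-1]}": n for n in dictionary}
    items = [part.strip() for part in line.split(",")]
    return [(item, abbreviated[item]) for item in items if item in abbreviated]

def extract(line, dictionary):
    """Route detected name lists to the strict plan, everything else to the loose plan."""
    plan = extract_strict if looks_like_name_list(line) else extract_loose
    return plan(line, dictionary)

line = "R. Miller, D. Smith, B. Jones"
dictionary = ["David Miller", "David Smith"]
```

On this line the loose plan reports the spurious "Miller, D." mention, while the patched plan reports only "D. Smith".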

Page 23:

Matching Entity Mentions

Key idea: reasonable plan, then patch

Reasonable plan
– if mention names are the same (modulo some variation), they match
– e.g., David Smith and D. Smith

[Plan diagram: MatchMbyName over Extract Plan over Union of sources s1 … sn]

Works well, but can’t handle certain difficult spots
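Name-based matching "modulo some variation" can be sketched by reducing each mention to a key and grouping on it. The key used here (first initial plus last name) is an assumption chosen to make "David Smith" and "D. Smith" agree; it is not the actual MatchMbyName operator.

```python
from collections import defaultdict

def name_key(mention):
    """Reduce 'David Smith' and 'D. Smith' to the same key: ('d', 'smith')."""
    parts = mention.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())

def match_by_name(mentions):
    """Group mentions whose keys agree; each group is treated as one entity."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[name_key(mention)].append(mention)
    return list(groups.values())

groups = match_by_name(["David Smith", "D. Smith", "Chen Li", "C. Li"])
```

A key this loose also merges different people, for instance "Daniel Smith" with "David Smith", which is the kind of difficult spot a stricter matcher has to handle.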

Page 24:

Handling Difficult Spots

Estimate the semantic ambiguity of data sources
– use social networking techniques [see ICDE-07a]

Apply stricter matchers to more ambiguous sources

[Plan diagram: operators MatchMStrict, MatchMbyName, Extract Plan, and Union over the inputs {s1 … sn} \ DBLP and DBLP]

DBLP: Chen Li

· · ·
41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
· · ·
38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
· · ·
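As a minimal sketch of applying a stricter matcher only where ambiguity is high: the stricter rule below requires a shared co-author in addition to an equal name. The ambiguity scores, the threshold, and the third mention are hypothetical, and the rule itself is an assumption rather than the actual MatchMStrict operator or the technique of [ICDE-07a].

```python
def match_by_name(a, b):
    """Loose matcher: names agree, ignoring case."""
    return a["name"].lower() == b["name"].lower()

def match_strict(a, b, min_shared=1):
    """Stricter matcher: names agree and the mentions share at least one co-author."""
    shared = set(a["coauthors"]) & set(b["coauthors"])
    return match_by_name(a, b) and len(shared) >= min_shared

def choose_matcher(source, ambiguity, threshold=0.5):
    """Use the strict matcher for sources whose estimated ambiguity exceeds the threshold."""
    return match_strict if ambiguity.get(source, 0.0) > threshold else match_by_name

# The two DBLP entries from the slide, plus one hypothetical mention for contrast.
vldb_entry = {"name": "Chen Li", "coauthors": ["Bin Wang", "Xiaochun Yang"]}
amc_entry = {"name": "Chen Li", "coauthors": ["Ping-Qi Pan", "Jian-Feng Hu"]}
hypothetical_entry = {"name": "Chen Li", "coauthors": ["Bin Wang"]}

ambiguity = {"DBLP": 0.9, "homepages": 0.1}  # hypothetical scores
dblp_matcher = choose_matcher("DBLP", ambiguity)
```

With these inputs the strict matcher keeps the two "Chen Li" entries from the slide apart, whereas the name-only matcher would merge them.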

Page 25:

Going Beyond Sources: Difficult Data Spots Can Cover Any Portion of Data

[Plan diagram: operators MatchMStrict, MatchMbyName, Extract Plan, and Union over the inputs {s1 … sn} \ DBLP and DBLP, with an additional MatchMStrict2 applied to mentions that match “J. Han”]

Page 26:

Summary: Extraction and Integration

Most current solutions
– try to find a single good plan, applied to all of data

Cimple solution: reasonable plan, then patch

So the focus shifts to:
– how to find a reasonable plan?
– how to detect problematic data spots?
– how to patch those?

Need a notion of semantic ambiguity

Different from the notion of source relevance

Page 27:

3. Detecting Problems and Providing Feedback

[Figure: Cimple architecture diagram, repeated from Page 4]

Page 28:

How to Detect Problems?

After extraction and matching, build services
– e.g., superhomepages

Many such homepages contain minor problems
– e.g., X graduated in 19998; X chairs SIGMOD-05 and VLDB-05; X published 5 SIGMOD-03 papers

Intuitively, something is semantically incorrect

To fix this, let's build a Semantic Debugger
– learns what is a normal profile for researcher, paper, etc.
– alerts the builder to potentially buggy superhomepages
– so feedback can be provided
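One very simple way to picture "learning a normal profile" is to learn a numeric range per attribute from values believed correct and flag anything outside it. The sketch below does that with a mean plus or minus three standard deviations; the attribute names, the training values, and the choice of statistic are all assumptions, not the Semantic Debugger's actual design.

```python
from statistics import mean, pstdev

def learn_normal_range(values, k=3.0):
    """Learn a 'normal' range as mean +/- k standard deviations of trusted values."""
    mu, sigma = mean(values), pstdev(values)
    return (mu - k * sigma, mu + k * sigma)

def suspicious_attributes(profile, normal):
    """Return the attributes of a profile whose values fall outside the learned range."""
    return [attr for attr, value in profile.items()
            if attr in normal and not (normal[attr][0] <= value <= normal[attr][1])]

# Hypothetical trusted observations used to learn what is normal.
trusted = {
    "graduation_year": [1985, 1992, 1999, 2003],
    "pc_chair_roles_per_year": [0, 0, 1, 0, 0, 1],
    "papers_per_venue_year": [0, 1, 1, 2, 0, 1],
}
normal = {attr: learn_normal_range(values) for attr, values in trusted.items()}

# The three problems from the slide, encoded as one profile for researcher X.
profile_x = {"graduation_year": 19998, "pc_chair_roles_per_year": 2, "papers_per_venue_year": 5}
alerts = suspicious_attributes(profile_x, normal)
```

The returned attribute names are what would be shown to the portal builder so that feedback can be provided on those specific items.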

Page 29:

What Types of Feedback?

Say that a certain data item Y is wrong

Provide correct value for Y, e.g., Y = SIGMOD-06

Add domain knowledge
– e.g., no researcher has ever published 5 SIGMOD papers in a year

Add more data
– e.g., X was advised by Z
– e.g., here is the URL of another data source

Modify the underlying algorithm
– e.g., pull out all data involving X, match using names and co-authors, not just names

Page 30:

How to Make Providing Feedback Very Easy?

“Providing feedback” for the masses
– in sync with current trends of empowering the masses

Extremely crucial in DBLife context

If feedback can be provided easily
– can get more feedback
– can leverage the mass of users

But this turned out to be very difficult

Page 31:

How to Make Providing Feedback Very Easy?

Say that a certain data item Y is wrong
Provide correct value for Y, e.g., Y = SIGMOD-06
Add domain knowledge
Add more data
Modify the underlying algorithm

Provide form interfaces
Provide a Wiki interface
Critical in our experience, but unsolved
Unsolved, some recent interest on how to mass customize software

See our IEEE Data Engineering Bulletin paper on user-centric challenges, 2007

Page 32:

What Feedback Would Make the Most Impact?

I have one hour of spare time and would like to “teach” DBLife
– what problems should I work on?
– what feedback should I provide?

Need a Feedback Advisor
– define a notion of system quality Q(s)
– define questions q1, ..., qn that DBLife can ask users
– for each qi, evaluate its expected improvement in Q(s)
– pick question with highest expected quality improvement

Observations
– a precise notion of system quality is now crucial
– this notion should model the expected usage
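The Feedback Advisor loop described above can be sketched in a few lines, assuming each question comes with a list of possible answers, their probabilities, and the system quality Q(s) that would result. How those probabilities and qualities are estimated is exactly the open problem the slide points to; the numbers and question texts below are hypothetical.

```python
def expected_gain(current_quality, outcomes):
    """Expected improvement in Q(s); outcomes is a list of (probability, quality_after) pairs."""
    return sum(p * q for p, q in outcomes) - current_quality

def pick_question(current_quality, questions):
    """Pick the question with the highest expected quality improvement."""
    return max(questions, key=lambda name: expected_gain(current_quality, questions[name]))

current_quality = 0.70  # hypothetical value of Q(s)
questions = {
    "q1: did X graduate in 1998?": [(0.9, 0.72), (0.1, 0.70)],
    "q2: are these two 'Chen Li' mentions the same person?": [(0.5, 0.80), (0.5, 0.78)],
}
best = pick_question(current_quality, questions)
```

Because the choice depends entirely on Q(s), this is where a precise notion of system quality, one that models expected usage, becomes crucial.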

Page 33:

Summary: Detection and Feedback

How to detect problems?
– Semantic Debugger

What types of feedback & how to easily provide them?
– critical, largely unsolved

What feedback would make most impact?
– crucial in large-scale systems
– need a Feedback Advisor
– need a precise notion of system quality

Page 34:

4. Mass Collaboration

[Figure: Cimple architecture diagram, repeated from Page 4]

Page 35:

Mass Collaboration: Voting

Can be applied to numerous problems

Page 37:

Challenges

How to detect and remove noisy users?
– evaluate them using questions with known answers

How to combine user feedback?
– # of yes votes vs. # of no votes

See [ICDE-05a, ICDE-08a]
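Both ideas on this slide can be pictured with a minimal sketch: grade users on questions with known answers, then compare yes and no votes among the users who pass. The accuracy threshold, the vote margin, and the user names are assumptions; the cited papers describe the actual techniques.

```python
def accuracy(answers, gold):
    """Fraction of known-answer questions that a user answered correctly."""
    graded = [answers[q] == truth for q, truth in gold.items() if q in answers]
    return sum(graded) / len(graded) if graded else 0.0

def reliable_users(answers_by_user, gold, min_accuracy=0.75):
    """Keep users whose accuracy on the known-answer questions meets the threshold."""
    return {user for user, answers in answers_by_user.items()
            if accuracy(answers, gold) >= min_accuracy}

def keep_item(votes, reliable, min_no_margin=2):
    """Keep an item unless 'no' votes from reliable users exceed 'yes' votes by the margin."""
    yes = sum(1 for user, vote in votes.items() if user in reliable and vote)
    no = sum(1 for user, vote in votes.items() if user in reliable and not vote)
    return (no - yes) < min_no_margin

# Hypothetical known-answer questions and user responses.
gold = {"g1": True, "g2": False, "g3": True, "g4": True}
answers_by_user = {
    "alice": {"g1": True, "g2": False, "g3": True, "g4": True},
    "bob": {"g1": True, "g2": False, "g3": True, "g4": False},
    "mallory": {"g1": False, "g2": True, "g3": False, "g4": True},
}
reliable = reliable_users(answers_by_user, gold)

# Votes on "is this picture of the right person?" (True = yes, False = no).
picture_votes = {"alice": False, "bob": False, "mallory": True}
```

Here the noisy user's vote is ignored, and the picture is removed because the remaining "no" votes outnumber "yes" votes by the required margin.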

Page 38:

Mass Collaboration: Wiki

Community wikipedia
– built by machine + human
– backed up by a structured database

[Figure: wiki architecture diagram with labels Data Sources, G, T, M, V1, V2, V3, W1, W2, W3, user u1, and edited versions V3’, W3’, T3’]

Page 39:

Mass Collaboration: Wiki

<# person(id=1){name}=David J. DeWitt #>

<# person(id=1){title}=Professor #>

<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>

David J. DeWitt

Professor

Interests: Parallel Database

<# person(id=1){name}=David J. DeWitt #>

<# person(id=1){title}=John P. Morgridge Professor #>

<# person(id=1) {organization}=UW #> since 1976

<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>

<# person(id=1){name}=David J. DeWitt #>

<# person(id=1){title}=John P. Morgridge Professor #>

<# person(id=1){organization}=UW-Madison #> since 1976

<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>

<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>

David J. DeWitt

John P. Morgridge Professor

UW-Madison since 1976

Interests: Parallel Database Privacy

Machine

Human
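The markup on this slide ties each displayed value to a path in the structured database. As a minimal sketch, the tags can be parsed back into (path, attribute, value) triples and rendered to plain text with one regular expression. The tag grammar is inferred only from the examples shown here, so the pattern and function names are assumptions, not the system's actual parser.

```python
import re

# Tag shape inferred from the slide: <# path{attribute}=value #>
TAG = re.compile(r"<#\s*(?P<path>[^{]+?)\s*\{(?P<attr>\w+)\}\s*=\s*(?P<value>.*?)\s*#>")

def parse_tags(markup):
    """Return (path, attribute, value) triples that could be written to the database."""
    return [(m["path"], m["attr"], m["value"]) for m in TAG.finditer(markup)]

def render(markup):
    """Replace each tag with its value, leaving the surrounding text and HTML untouched."""
    return TAG.sub(lambda m: m["value"], markup)

page = (
    "<# person(id=1){name}=David J. DeWitt #>\n"
    "<# person(id=1) {organization}=UW #> since 1976\n"
    "<strong>Interests:</strong>"
    "<# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>"
)
triples = parse_tags(page)
text = render(page)
```

Comparing the triples parsed before and after a user's edit is one way the structured database behind the wiki could be kept in sync with human changes.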

Page 40:

Sample Data Quality Challenges

How to detect noisy users?
– no clear solution yet
– for now, limit editing to trusted editors
– modify notion of system quality to account for this

How to combine feedback, handle inconsistent data?
– user vs. user
– user vs. machine

How to verify claimed ownership of data portions?
– e.g., this superhomepage is about me
– only I can edit it

See [ICDE-08b]

Page 41:

Summary: Mass Collaboration

What can users contribute?

How to evaluate user quality?

How to reconcile inconsistent data?

Page 42:

Additional Challenges

Dealing with evolving data (e.g., matching)
Iterative code development
Lifelong quality improvement
Querying over inconsistent data
Managing provenance and uncertainty
Generating explanations
Undo

Page 43:

Conclusions

Community systems:
– data integration + IE + Web 2.0
– potentially very useful in numerous domains

Such systems raise myriad data quality challenges
– subsume many current challenges
– suggest new ones

Can provide a unifying context for us to make progress
– building systems has been a key strength of our field
– we need a community effort, as always

See “cimple wisc” for more detail

Let us know if you want code/data