Cimple: Building Community Portal Sites through Crawling & Extraction
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 4, 2008
Slides based on content by AnHai Doan, used with permission
Administrivia
By next Tuesday: a rough schedule and division of duties for your project
Please read the Halevy et al. paper on Piazza
The Web Is Full of Special-Interest Portal Sites for Communities
Academia: certain bioinformatics topics, citations, etc.
Medicine: WebMD
Infotainment: Rotten Tomatoes, IMDB, fantasy football
Business: enterprise intranets, tech support groups, lawyers
CIA / homeland security: Intellipedia
Some of these gather information from the Web
Cimple Project @ Wisconsin (+ Yahoo)
Develops a general solution for community Web portals using extraction + integration + mass collaboration
[Figure: Cimple architecture. Data sources: researcher homepages, conference pages, group pages, the DBworld mailing list, DBLP, Web pages, text documents. ER graph example: Jim Gray, give-talk, SIGMOD-04. Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Other labels: maintain and add more sources; mass collaboration.]
The Basic Ideas
Architecture mainly consists of extractors and ER-graphs
The key challenge is to extract “good” information for the ER-graphs, and to allow all of the information to be extended and repaired
Prototype System: DBLife
Integrate data of the DB research community
1164 data sources
Crawled daily, 11000+ pages = 160+ MB / day
Data Integration
Raghu Ramakrishnan
co-authors = A. Doan, Divesh Srivastava, ...
Resulting ER Graph
[Figure: ER graph. Entities: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]
Provide Services
DBLife system
Mass Collaboration via Wiki
Issues Addressed by Cimple
Cimple addresses challenges in
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration
1. Source Selection
Current Solutions vs. Cimple
Current solutions: topic-specific crawlers
find all relevant data sources (e.g., using focused crawling, search engines)
maximize coverage, which results in many “noisy” sources
Cimple allows for incremental development and deployment
starts with a small set of high-quality “core” sources
incrementally adds more sources
only from “high-quality” places or as suggested by users (mass collaboration)
Start with a Small Set of “Core” Sources
Key observation: communities often follow the 80-20 rule
20% of sources cover 80% of interesting activities
Initial portal over these 20% is often already quite useful
How do we select these 20%?
select as many sources as possible
then evaluate and select the most relevant ones
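The 80-20 observation can be pictured as a greedy cut over ranked sources. The sketch below is only an illustration: it assumes a hypothetical per-source activity count (`activity_counts`) is already known, which the slides do not specify, and the function name and the 0.8 threshold are likewise assumptions.

```python
def select_core_sources(activity_counts, coverage=0.8):
    """Pick the highest-activity sources until `coverage` of all activity is covered.

    `activity_counts` maps a source name to a hypothetical count of
    interesting activities observed at that source.
    """
    total = sum(activity_counts.values())
    ranked = sorted(activity_counts.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0
    for source, count in ranked:
        if covered >= coverage * total:
            break  # the target share of activity is already covered
        selected.append(source)
        covered += count
    return selected


counts = {"dblp": 55, "dbworld": 30, "home_a": 10, "home_b": 3, "home_c": 2}
core = select_core_sources(counts)
```

Here two of the five sources already cover 85% of the counted activity, so the remaining three are left out of the core set.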
Evaluate the Relevance of Sources
Use PageRank + virtual links across entities + TF/IDF
[Figure: two pages mentioning the same entity, as “Gerhard Weikum” and “G. Weikum”]
See [VLDB-07a]
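The actual relevance measure is described in [VLDB-07a]. As a rough illustration of the "PageRank + virtual links" part only, the sketch below runs a plain power-iteration PageRank over hyperlinks extended with virtual links between pages that mention the same entity. The linking rule, the function names, and the example pages are assumptions, and the TF/IDF component is omitted.

```python
from collections import defaultdict
from itertools import combinations


def add_virtual_links(links, mentions):
    """Copy `links` and connect, in both directions, pages mentioning the same entity."""
    graph = {page: set(targets) for page, targets in links.items()}
    pages_by_entity = defaultdict(set)
    for page, entities in mentions.items():
        graph.setdefault(page, set())
        for entity in entities:
            pages_by_entity[entity].add(page)
    for pages in pages_by_entity.values():
        for a, b in combinations(sorted(pages), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling pages spread their rank evenly."""
    nodes = set(graph)
    for targets in graph.values():
        nodes |= targets
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node) or nodes  # dangling page: link to everyone
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {"weikum_home": {"dblp"}, "dblp": set(), "group_page": set()}
mentions = {"weikum_home": {"Gerhard Weikum"}, "group_page": {"Gerhard Weikum"}}
graph = add_virtual_links(links, mentions)
ranks = pagerank(graph)
```

In this toy graph `group_page` has no incoming hyperlinks, so the virtual link from `weikum_home` is the only thing that raises its score.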
Add More Sources over Time
Key observation: most important sources will eventually be mentioned within the community
so monitor certain “community channels” to find them
Message type: conf. ann.
Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
Call for Participation
Workshop on "Management of Uncertain Data"
in conjunction with VLDB 2007
http://mud.cs.utwente.nl
...
Also allow users to suggest new sources, e.g., the Silicon Valley Database Society
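Monitoring a community channel could start with something as simple as pulling unseen URLs out of each message. The following is a minimal sketch under that assumption; the regular expression, the function name `find_new_sources`, and the idea of a `known_sources` set are illustrative and not taken from Cimple.

```python
import re

# Deliberately simple URL pattern; good enough for plain-text announcements.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_new_sources(message, known_sources):
    """Return URLs in `message` that are not yet in `known_sources`, in order."""
    found = []
    for url in URL_PATTERN.findall(message):
        url = url.rstrip(".,;:)")  # drop trailing punctuation from the sentence
        if url not in known_sources and url not in found:
            found.append(url)
    return found


message = (
    "Call for Participation: Workshop on Management of Uncertain Data, "
    "in conjunction with VLDB 2007. http://mud.cs.utwente.nl ..."
)
candidates = find_new_sources(message, known_sources={"http://dblp.uni-trier.de"})
```

Candidates found this way would still need to be evaluated for relevance before being added to the portal.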
Summary: Source Selection
Incremental approach:
start with highly relevant sources
expand carefully
minimize “garbage in, garbage out”
Need a notion of source relevance
Need a way to compute this
2. Extraction and Integration
Extracting Entity Mentions
Key idea: reasonable plan, then “patch”
Reasonable basic plan:
collect person names, e.g., David Smith
generate variations, e.g., D. Smith, Dr. Smith, etc.
find occurrences of these variations
[Plan diagram with operators ExtractMbyName and Union over sources s1 … sn]
Works well, but can’t handle certain difficult spots
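The basic plan above (dictionary of names, generated variations, occurrence search) can be sketched as follows. This is not the ExtractMbyName operator itself; the variation rules and the function names are assumptions chosen to mirror the slide's examples.

```python
import re


def name_variations(full_name):
    """Generate a few variants of a person name, as in the slide's examples."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {full_name, f"{first[0]}. {last}", f"Dr. {last}"}


def extract_mentions_by_name(text, dictionary):
    """Return (offset, matched variant, dictionary name) for every occurrence."""
    mentions = []
    for person in dictionary:
        for variant in name_variations(person):
            # Require non-word characters on both sides of the variant.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                mentions.append((match.start(), variant, person))
    return sorted(mentions)


text = "Talk by D. Smith; Dr. Smith joins David Smith."
found = extract_mentions_by_name(text, ["David Smith"])
variants_in_order = [variant for _, variant, _ in found]
```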
Handling Difficult Spots
Example: R. Miller, D. Smith, B. Jones
if “David Miller” is in the dictionary, the basic plan will flag “Miller, D.” as a person name
Solution: patch such spots with stricter plans
[Plan diagram with operators ExtractMbyName, FindPotentialNameLists, ExtractMStrict, and Union over sources s1 … sn]
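To make the "Miller, D." problem concrete, the sketch below contrasts a naive substring extractor with a stricter one that is applied only inside detected name lists. The operator names come from the slide, but the regular expression and the strict rule (split on commas) are assumptions for illustration.

```python
import re

INITIAL_LAST = r"[A-Z]\. [A-Z][a-z]+"
# A "name list" here means three or more "X. Lastname" items separated by commas.
NAME_LIST = re.compile(rf"{INITIAL_LAST}(?:, {INITIAL_LAST}){{2,}}")


def extract_by_name(text, variants):
    """Naive plan: report every dictionary variant that occurs as a substring."""
    return [variant for variant in variants if variant in text]


def find_potential_name_lists(text):
    """Return (start, end) spans that look like comma-separated name lists."""
    return [match.span() for match in NAME_LIST.finditer(text)]


def extract_strict(text, span):
    """Stricter plan for a name list: each comma-separated item is one name."""
    start, end = span
    return [item.strip() for item in text[start:end].split(",")]


text = "Authors: R. Miller, D. Smith, B. Jones"
david_miller_variants = ["David Miller", "D. Miller", "Miller, D."]
naive = extract_by_name(text, david_miller_variants)
strict = [name for span in find_potential_name_lists(text)
          for name in extract_strict(text, span)]
```

The naive plan reports "Miller, D." as a mention of David Miller, while the strict plan splits the list correctly and never produces that string.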
Matching Entity Mentions
Key idea: reasonable plan, then patch
Reasonable plan:
if mention names are the same (modulo some variation), match
e.g., David Smith and D. Smith
[Plan diagram with operators Extract Plan, Union, and MatchMbyName over sources s1 … sn]
Works well, but can’t handle certain difficult spots
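Matching "modulo some variation" can be approximated by reducing each mention to a coarse key. The key used here (first initial plus last name) is an assumption, not the actual MatchMbyName rule; it is deliberately loose, which is what makes ambiguous sources a difficult spot.

```python
def name_key(mention):
    """Reduce a mention to (first initial, last name), ignoring case and periods."""
    parts = mention.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


def match_by_name(mentions):
    """Group mentions that share the same coarse name key."""
    groups = {}
    for mention in mentions:
        groups.setdefault(name_key(mention), []).append(mention)
    return list(groups.values())


groups = match_by_name(["David Smith", "D. Smith", "B. Jones"])
```

Note that this key would also merge "David Smith" with "Daniel Smith", so it is only reasonable where such collisions are rare.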
Handling Difficult Spots
Estimate the semantic ambiguity of data sources
use social networking techniques related to cohesion of graphs [see ICDE-07a]
Apply stricter matchers to more ambiguous sources
[Plan diagram with operators Extract Plan, MatchMbyName, MatchMStrict, and Union; DBLP is handled separately from the remaining sources {s1 … sn} \ DBLP]
DBLP: Chen Li
· · ·
41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
· · ·
38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
· · ·
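One possible stricter matcher, in the spirit of "match using names and co-authors, not just names", is sketched below. It is not the MatchMStrict operator from the slides: the rule (same name and at least one shared co-author) and the greedy, order-dependent clustering are assumptions for illustration.

```python
def match_strict(mentions):
    """Cluster (name, coauthors) mentions; same name AND a shared co-author required.

    Returns a list of clusters, each a list of indexes into `mentions`.
    """
    clusters = []
    for index, (name, coauthors) in enumerate(mentions):
        for cluster in clusters:
            if cluster["name"] == name and cluster["coauthors"] & coauthors:
                cluster["coauthors"] |= coauthors  # grow the known co-author set
                cluster["members"].append(index)
                break
        else:
            clusters.append({"name": name,
                             "coauthors": set(coauthors),
                             "members": [index]})
    return [cluster["members"] for cluster in clusters]


mentions = [
    ("Chen Li", {"Bin Wang", "Xiaochun Yang"}),
    ("Chen Li", {"Ping-Qi Pan", "Jian-Feng Hu"}),
    ("Chen Li", {"Bin Wang"}),
]
clusters = match_strict(mentions)
```

With the two DBLP entries from the slide plus one extra hypothetical mention, name-only matching would merge all three, while this rule keeps the "Feasible region contraction" author separate.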
Summary: Extraction and Integration
Most current solutions
try to find a single good plan, applied to all of the data
Cimple solution: reasonable plan, then patch
So the focus shifts to:
how to find a reasonable plan?
how to detect problematic data spots?
how to patch those?
Need a notion of semantic ambiguity
Different from the notion of source relevance
3. Detecting Problems and Making Corrections
How to Detect Problems?
After extraction and matching, build services
e.g., superhomepages
Many such homepages contain minor problems
e.g., X graduated in 19998
X chairs SIGMOD-05 and VLDB-05
X published 5 SIGMOD-03 papers
Intuitively, something is semantically incorrect
To fix this, build a Semantic Debugger
learns what is a normal profile for a researcher, paper, etc.
alerts the builder to potentially buggy superhomepages
so corrections / feedback can be provided
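The slides do not say how the Semantic Debugger learns a "normal profile". As one simple possibility, the sketch below learns a mean and standard deviation per numeric attribute and flags values far from the mean. The attribute names, the training records, and the 3-sigma threshold are all assumptions.

```python
from statistics import mean, pstdev


def learn_profile(records):
    """Learn (mean, standard deviation) for each numeric attribute."""
    profile = {}
    for attribute in records[0]:
        values = [record[attribute] for record in records]
        profile[attribute] = (mean(values), pstdev(values))
    return profile


def find_suspicious(record, profile, threshold=3.0):
    """Return attributes whose value lies more than `threshold` deviations out."""
    flagged = []
    for attribute, value in record.items():
        if attribute not in profile:
            continue
        center, spread = profile[attribute]
        if abs(value - center) > threshold * max(spread, 1e-9):
            flagged.append(attribute)
    return flagged


# Hypothetical training data for researcher superhomepages.
training = [
    {"graduation_year": 1995, "papers_per_venue_year": 1},
    {"graduation_year": 2001, "papers_per_venue_year": 2},
    {"graduation_year": 1988, "papers_per_venue_year": 1},
    {"graduation_year": 2004, "papers_per_venue_year": 0},
]
profile = learn_profile(training)
alerts = find_suspicious({"graduation_year": 19998, "papers_per_venue_year": 5}, profile)
```

Both of the slide's numeric examples (graduating in 19998, five papers at one venue in a year) fall outside this toy profile and would be shown to the builder for review.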
What Types of Feedback?
Say that a certain data item Y is wrong
Provide correct value for Y, e.g., Y = SIGMOD-06
Add domain knowledge
e.g., no researcher has ever published 5 SIGMOD papers in a year
Add more data
e.g., X was advised by Z
e.g., here is the URL of another data source
Modify the underlying algorithm
e.g., pull out all data involving X
match using names and co-authors, not just names
How to Make Providing Feedback Very Easy?
Extremely crucial in DBLife context
If feedback can be provided easily
can get more feedback
can leverage the mass of users
How to Make Providing Feedback Very Easy?
Say that a certain data item Y is wrong
Provide correct value for Y, e.g., Y = SIGMOD-06
Add domain knowledge
Add more data
Modify the underlying algorithm
[Slide callouts: “Provide form interfaces”; “Provide a Wiki interface”; “Critical but unsolved”; “Unsolved: some recent interest on how to mass customize software”]
Summary: Detection and Feedback
How to detect problems?
Semantic Debugger
What types of feedback & how to easily provide them?
critical, largely unsolved
What feedback would make most impact?
crucial in large-scale systems
need a notion of a Feedback Advisor
need a precise notion of system quality
4. Mass Collaboration
Mass Collaboration: Voting
Can be applied to numerous problems
Example: Matching
Hard for machine, but easy for human
Mouse for Dell laptop 200 series ...
Dell X200; mouse at reduced price ...
Dell laptop X200 with mouse ...
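A matching question such as "do these two listings describe the same product?" could be put to several users and settled by vote. The slides do not describe the voting scheme, so the rule below (strict majority, with a minimum number of votes) and the function name are assumptions.

```python
from collections import Counter


def majority_vote(votes, min_votes=3):
    """Return the strict-majority answer, or None if there is no clear winner."""
    if len(votes) < min_votes:
        return None  # not enough users have answered yet
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count > len(votes) / 2 else None


# Hypothetical answers to: does "Dell X200; mouse at reduced price"
# match "Mouse for Dell laptop 200 series"?
decision = majority_vote([True, True, False])
```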
Mass Collaboration: Wiki
Community wikipedia built by machine + human
backed up by a structured database
[Figure: diagram with labels DataSources, G, T, V1, V2, V3, W1, W2, W3, u1, V3’, W3’, T3’, M]
Mass Collaboration: Wiki
[Wiki text]
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
[Rendered page]
David J. DeWitt
Professor
Interests: Parallel Database
[Wiki text]
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
[Wiki text]
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW-Madison #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
[Rendered page]
David J. DeWitt
John P. Morgridge Professor
UW-Madison since 1976
Interests: Parallel Database
Privacy
[Slide labels: Machine, Human]
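The `<# path{attribute}=value #>` annotations are what tie the wiki text back to the structured database. The sketch below shows one way such annotations could be parsed and rendered. The grammar is inferred only from the examples on this slide, so the regular expression and function names are assumptions rather than Cimple's actual parser.

```python
import re

# <# entity-path {attribute} = value #>, inferred from the slide's examples.
ANNOTATION = re.compile(
    r"<#\s*(?P<path>[^{}]+?)\s*\{(?P<attr>\w+)\}\s*=\s*(?P<value>.*?)\s*#>"
)


def parse_annotations(wiki_text):
    """Return (path, attribute, value) triples found in the wiki text."""
    return [(m.group("path"), m.group("attr"), m.group("value"))
            for m in ANNOTATION.finditer(wiki_text)]


def render(wiki_text):
    """Replace each annotation with its value, leaving other text untouched."""
    return ANNOTATION.sub(lambda m: m.group("value"), wiki_text)


wiki_text = (
    "<# person(id=1){title}=John P. Morgridge Professor #>\n"
    "<# person(id=1){organization}=UW-Madison#>since 1976\n"
    "<strong>Interests:</strong>"
    "<# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>"
)
triples = parse_annotations(wiki_text)
page = render(wiki_text)
```

Parsing edits back into triples is what would let a human change to the page (for example the new title) be stored as structured data rather than as free text.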
Summary: Mass Collaboration
What can users contribute?
How to evaluate user quality?
How to reconcile inconsistent data?
Summary: Cimple
A very interesting attempt to rethink Web crawling and information extraction
Based on a “best-effort” notion
One of many concurrent efforts in that vein
“Dataspaces”
Simple building blocks, progressive refinement
Open Questions and Issues
Incorporating uncertain data
Can we really scale to, and manage the complexity of, many, many extractors that might produce data of different quality?
How do we provide feedback?
Particularly as there are many learners, and data and feedback are very sparse
Others?