Semex: A Platform for Personal Information Management and Integration
Xin (Luna) Dong, University of Washington
June 24, 2005
Is Your Personal Information a Mine or a Mess?
[Figure labels: Intranet, Internet]
Questions Hard to Answer
Where are my SEMEX papers and presentation slides (maybe in an attachment)?
Index Data from Different Sources
  E.g., Google, MSN desktop search
Who is working on SEMEX?
What are the emails sent by my PKU alumni?
What are the phone numbers and emails of my coauthors?
Organize Data in a Semantically Meaningful Way
Which of the SIGMOD’05 authors do I know?
Integrate Organizational and Public Data with Personal Data
[Figure: personal data on the desktop, intranet, and Internet, linked by associations. Association labels: OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]
SEMEX (SEMantic EXplorer) – I. Provide a Logical View of Data
[Figure: domain model diagram. Classes: Person, Paper, Message, Document, Web Page, Presentation, Event. Associations: Cites, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, Participants. Data sources: HTML, mail and calendar, papers, files, presentations]
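A minimal sketch of what such a logical view could look like in code, assuming instances are (class, name) pairs and associations are labeled pairs of instances. The class and association names come from the diagram; the `LogicalView` class, its methods, and the sample instances are hypothetical and do not describe Semex's actual data structures.

```python
from collections import defaultdict

class LogicalView:
    """Toy store of object instances and labeled associations (hypothetical)."""

    def __init__(self):
        # association label -> set of (source, target) instance pairs
        self.assoc = defaultdict(set)

    def add(self, source, label, target):
        self.assoc[label].add((source, target))

    def targets(self, source, label):
        """Instances reachable from `source` via `label`."""
        return sorted(t for s, t in self.assoc[label] if s == source)

    def sources(self, label, target):
        """Instances that point to `target` via `label`."""
        return sorted(s for s, t in self.assoc[label] if t == target)

# Made-up sample data, for illustration only.
view = LogicalView()
view.add(("Paper", "Semex CIDR paper"), "Author", ("Person", "Luna"))
view.add(("Paper", "Semex CIDR paper"), "Author", ("Person", "Alon"))
view.add(("Message", "demo slides"), "Sender", ("Person", "Alon"))

papers_by_alon = view.sources("Author", ("Person", "Alon"))
```

A question such as "which papers did Alon write?" then becomes a lookup over one association rather than a keyword search over files.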
SEMEX (SEMantic EXplorer) – II. On-the-fly Data Integration
[Figure: the domain model diagram again. Classes: Person, Paper, Message, Document, Web Page, Presentation, Event. Associations: Cites, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, Participants]
How to Find Alon’s Papers on My Desktop?
How to Find Alon’s Papers on My Desktop? – Google Search Results
Search: Alon Halevy
[Screenshot: the results include emails such as “Send me the semex demo slides again?” and “Ignore previous request, I found them”]
Semex Goal
Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information
Build the logical view automatically
  Extract object instances and associations
  Remove instance duplications
Leverage the logical view for on-the-fly data integration
Exploit the logical view for information search and browsing to improve people’s productivity
Be resilient to the evolution of the logical view
An Ideal PIM is a Magic Wand
Outline
Problem definition and project goals
Technical issues:
  System architecture and instance extraction [CIDR’05]
  Reference reconciliation [Sigmod’05]
  On-the-fly data integration
  Association search and browsing
  Domain model personalization and evolution [WebDB’05]
  Interleaved with Semex demo [Best demo in Sigmod’05]
Overarching PIM Themes
System Architecture
[Figure: architecture diagram. Modules: Data Collection Module, Data Analysis Module, Domain Management Module. Components: Extractors, Integrator, Reference Reconciliator, Indexer, Searcher, Browser, Analyzer, Domain Manager. Stores: Domain Model, Association DB (objects, associations), Index. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB]
Reference Reconciliation in Semex
Xin (Luna) Dong
xin dong
xinluna dong
luna
dongxin
x. dong
Lab-#dong xin
dong xin luna
Names
Emails
Semex Without Reference Reconciliation
Search results for luna: 23 persons
  luna dong: SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94)
  Xin (Luna) Dong: AuthorOfArticles(49), MentionedIn(20)
Article: A Platform for Personal Information Management and Integration
  9 Persons: dong xin, xin dong
Semex NEEDS Reference Reconciliation
Reference Reconciliation
A very active area of research in Databases, Data Mining and AI (surveyed in [Cohen et al., 2003])
Traditional approaches assume matching tuples from a single table
  Based on pair-wise comparisons
Harder in our context
Challenges
Article:
  a1=(“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1,p2,p3}, c1)
  a2=(“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4,p5,p6}, c2)
Venue:
  c1=(“Computational learning theory”, “1992”, “Austin, Texas”)
  c2=(“COLT”, “1992”, null)
Person:
  p1=(“David Haussler”, null)
  p2=(“Michael Kearns”, null)
  p3=(“Robert Schapire”, null)
  p4=(“Haussler, D.”, null)
  p5=(“Kearns, M. J.”, null)
  p6=(“Schapire, R.”, null)
  p7=(“Robert Schapire”, “[email protected]”)
  p8=(null, “[email protected]”)
  p9=(“mike”, “[email protected]”)
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes
Intuition
Complex information spaces can be considered as networks of instances and associations between the instances
Key: exploit the network, specifically, the clues hidden in the associations
I. Exploiting Richer Evidence
Cross-attribute similarity – Name & email
  p5=(“Stonebraker, M.”, null)
  p8=(null, “[email protected]”)
Context Information I – Contact list
  p5=(“Stonebraker, M.”, null, {p4, p6})
  p8=(null, “[email protected]”, {p7})
  p6=p7
Context Information II – Authored articles
  p2=(“Michael Stonebraker”, null)
  p5=(“Stonebraker, M.”, null)
  p2 and p5 authored the same article
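The three kinds of evidence above can be sketched as simple predicates. This is a rough illustration under stated assumptions, not the algorithm from the talk: the helper names are hypothetical, the name parsing only handles the "Last, F." and "First Last" forms, and the email address is a placeholder because the addresses are redacted in the slides.

```python
def parse_name(name):
    """Return (first, last) in lowercase; handles 'Last, F.' and 'First Last'."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    return first.lower().rstrip("."), last.lower()

def names_compatible(a, b):
    """Attribute-wise evidence: same last name, and one first name prefixes the other."""
    (fa, la), (fb, lb) = parse_name(a), parse_name(b)
    return la == lb and (fa.startswith(fb) or fb.startswith(fa))

def name_matches_email(name, email):
    """Cross-attribute evidence: the last name occurs in the email account."""
    _, last = parse_name(name)
    return last in email.split("@", 1)[0].lower()

def shared_context(contacts_a, contacts_b, same):
    """Context evidence: a contact of one reference is already known to equal
    a contact of the other; `same` holds pairs of reconciled references."""
    return any(x == y or (x, y) in same or (y, x) in same
               for x in contacts_a for y in contacts_b)

# p5 vs. p8 from the slide, with a placeholder address (assumption).
cross = name_matches_email("Stonebraker, M.", "stonebraker@example.org")
context = shared_context({"p4", "p6"}, {"p7"}, same={("p6", "p7")})
```

Each predicate is weak on its own; the point of the slide is that combining them reaches pairs that attribute-wise comparison alone cannot.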
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: #(Person Partitions) by Evidence]
Attribute-wise comparison: 3159 partitions
Chart annotation: 1409 (= 3159 − 1750)
Person references: 24076; real-world persons (gold standard): 1750
Considering Richer Evidence Improves the Recall
[Chart: #(Person Partitions) by Evidence]
Evidence: Attr-wise, Name&Email, Article, Contact
Partitions: 3159, 2169, 2169, 2096
Chart annotations: 1409, 346
Person references: 24076; real-world persons: 1750
II. Propagate Information between Reconciliation Decisions
Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)
Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)
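One way to picture propagation on this example is a work queue: once the two articles are merged, their venues and authors are queued as new candidate pairs and can be accepted on weaker evidence than they would need in isolation. The sketch below is a simplified illustration, not the algorithm from the talk; the queue design, the last-name-only person test, and the year-only venue test are all assumptions.

```python
from collections import deque

articles = {
    "a1": {"title": "Distributed Query Processing", "pages": "169-180",
           "authors": ["p1", "p2", "p3"], "venue": "c1"},
    "a2": {"title": "Distributed query processing", "pages": "169-180",
           "authors": ["p4", "p5", "p6"], "venue": "c2"},
}
persons = {"p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
           "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E."}
venues = {"c1": ("ACM Conference on Management of Data", "1978"),
          "c2": ("ACM SIGMOD", "1978")}

def last_name(name):
    return (name.split(",")[0] if "," in name else name.split()[-1]).lower()

def same_article(x, y):
    return x["title"].lower() == y["title"].lower() and x["pages"] == y["pages"]

merged = set()
queue = deque([("article", "a1", "a2")])
while queue:
    kind, x, y = queue.popleft()
    if kind == "article" and same_article(articles[x], articles[y]):
        merged.add((x, y))
        # Propagate: venues and authors of a merged article pair become candidates.
        queue.append(("venue", articles[x]["venue"], articles[y]["venue"]))
        for px in articles[x]["authors"]:
            for py in articles[y]["authors"]:
                queue.append(("person", px, py))
    elif kind == "person" and last_name(persons[x]) == last_name(persons[y]):
        # Last name alone is weak; it is accepted here only because the pair
        # was reached through a merged article (assumption of this sketch).
        merged.add((x, y))
    elif kind == "venue" and venues[x][1] == venues[y][1]:
        # The venue names differ, but the shared article and the year agree.
        merged.add((x, y))
```

Here the venue pair c1/c2 would not match on names alone ("ACM Conference on Management of Data" vs. "ACM SIGMOD"); it is merged only because the article decision was propagated to it.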
Propagating Information between Reconciliation Decisions Further Improves Recall
[Chart: #(Person Partitions) by Evidence, Traditional vs. Propagation]
Evidence: Attr-wise, Name&Email, Article, Contact
Traditional: 3159, 2169, 2169, 2096
Propagation: 3159, 2146, 2135, 2022
Person references: 24076; real-world persons: 1750
III. Reference Enrichment
p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)
p8-9=(“mike”, “[email protected]”, {p7})
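Enrichment can be sketched as pooling the attribute values of two references once they are judged equal, so that later comparisons see more evidence. This is a minimal illustration under assumptions: references are modeled as dictionaries of sets, the `enrich` helper is hypothetical, and the email address is a placeholder because the real addresses are redacted in the slides.

```python
def enrich(a, b):
    """Merge two references judged equal by pooling their attribute values."""
    return {key: a[key] | b[key] for key in ("names", "emails", "contacts")}

def has_contact_evidence(ref, other_contacts):
    """True if the reference shares at least one contact with another reference."""
    return bool(ref["contacts"] & other_contacts)

# p8 and p9 from the slide; the address is a placeholder (assumption).
p8 = {"names": set(), "emails": {"mike@example.org"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"mike@example.org"}, "contacts": set()}

p8_9 = enrich(p8, p9)
```

Before the merge, p9 has no contact list and p8 has no name; the enriched reference p8-9 carries both, so name-based and context-based comparisons against other references become possible.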
Reference Enrichment Improves Recall More than Information Propagation
[Chart: #(Person Partitions) by Evidence. Legend: Traditional, Enrichment, Propagation]
Evidence: Attr-wise, Name&Email, Article, Contact
Traditional: 3159, 2169, 2169, 2096
Enrichment: 3169, 2036, 2036, 1910
Person references: 24076; real-world persons: 1750
Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
[Chart: #(Person Partitions) by Evidence. Legend: Traditional, Enrichment, Propagation, Full]
Evidence: Attr-wise, Name&Email, Article, Contact
Traditional: 3159, 2169, 2169, 2096
Full: 3169, 2002, 1990, 1873
Chart annotations: 1409, 125, 346
Person references: 24076; real-world persons: 1750
Importing External Data Sources
[Figure: the domain model diagram. Classes: Person, Paper, Message, Document, Web Page, Presentation, Event. Associations: Cites, Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, Participants]
Intuition: Explore associations in schema mapping
Traditional approaches: proceed in two steps
  Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001])
    Generate term matching candidates
    E.g., “paperTitle” in table Author matches “title” in table Article
  Step 2. Query discovery [Miller et al., 2000]
    Take term matching as input, generate mapping expressions (typically queries)
    E.g., SELECT Article.title as paperTitle, Person.name as author
          FROM Article, Person
          WHERE Article.author = Person.id
User’s input is needed to fill in the gap between Step 1 output and Step 2 input
Our approach: check association violations to filter inappropriate matching candidates
Integration Example
Domain model classes: Person(name, email), Book(title, year), Article(title, page), Conference(name, year)
External source: Webpage-item(title, author, conf, year)
Associations in the diagram: authoredBy, publishedIn
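The filtering idea can be sketched on this example: enumerate combinations of term-matching candidates and discard any combination whose matched classes are not linked by an association. The association endpoints, the candidate lists, and the choice of the `title` column as the anchor are assumptions made for this sketch; the slides only name the associations.

```python
from itertools import product

# Assumed endpoints for the associations named on the slide.
ASSOCIATIONS = {("Article", "Person"),      # authoredBy
                ("Book", "Person"),         # authoredBy
                ("Article", "Conference")}  # publishedIn

# Hypothetical Step 1 output: candidates for each column of Webpage-item.
CANDIDATES = {
    "title": ["Book.title", "Article.title"],
    "author": ["Person.name"],
    "conf": ["Conference.name"],
    "year": ["Book.year", "Conference.year"],
}

def associated(c1, c2):
    return c1 == c2 or (c1, c2) in ASSOCIATIONS or (c2, c1) in ASSOCIATIONS

def consistent(mapping):
    """Keep a mapping only if the class matched to `title` is associated
    with every other matched class (no association violation)."""
    classes = {col: target.split(".")[0] for col, target in mapping.items()}
    anchor = classes["title"]
    return all(associated(anchor, cls) for cls in classes.values())

columns = list(CANDIDATES)
mappings = [dict(zip(columns, combo)) for combo in product(*CANDIDATES.values())]
valid = [m for m in mappings if consistent(m)]
```

Of the four combinations, only the one that maps the item to Article survives under these assumptions: a Book has no association to a Conference, so any mapping that pairs `Book.title` with `Conference.name` is filtered out without asking the user.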
Explore the association network – 1. Find the relationship between two instances
Example: How did I know this person?
Solution: Lineage
  Find an association chain between two object instances
  Shortest chain? “Earliest” chain OR “Latest” chain
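For the shortest-chain variant, a breadth-first search over the association network is one straightforward option. The sketch below assumes associations are stored as undirected (instance, label, instance) triples; the sample instances are made up, and the "earliest" and "latest" variants would additionally need timestamps, which are not modeled here.

```python
from collections import deque

def shortest_chain(edges, start, goal):
    """Breadth-first search for a shortest association chain.
    `edges` is a list of (instance, association, instance) triples,
    treated as undirected. Returns a list of triples, or None."""
    neighbors = {}
    for a, label, b in edges:
        neighbors.setdefault(a, []).append((label, b))
        neighbors.setdefault(b, []).append((label, a))
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            chain = []
            while parent[node] is not None:
                prev, label = parent[node]
                chain.append((prev, label, node))
                node = prev
            return chain[::-1]
        for label, nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, label)
                queue.append(nxt)
    return None

# Hypothetical network, for illustration only.
edges = [("Me", "CoAuthor", "Alon"), ("Alon", "CoAuthor", "Jayant"),
         ("Me", "FrequentEmailer", "Michelle"), ("Michelle", "CoAuthor", "Yuhan"),
         ("Yuhan", "CoAuthor", "Jayant")]
chain = shortest_chain(edges, "Me", "Jayant")
```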
Explore the association network – 2. Find all instances related to a given keyword
Example: Who is working on “Schema Matching”?
Solution:
  Naive approach: index object instances on attribute values
  A list of papers on schema matching
  A list of emails on schema matching
  A list of persons working on schema matching
  A list of conferences for schema-matching papers
  A list of institutes that conduct schema-matching research
  Our approach: index objects on the attributes of associated objects
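The difference between the two indexing approaches can be sketched with a small inverted index. This is an illustration under assumptions: each object is reduced to a single text string, only directly associated objects are considered, and the object ids and texts are made up.

```python
from collections import defaultdict

def tokens(text):
    return text.lower().split()

def build_index(objects, associations, include_neighbors):
    """Inverted index keyword -> object ids. With include_neighbors=True, an
    object is also indexed on the text of its directly associated objects."""
    index = defaultdict(set)
    for oid, text in objects.items():
        for tok in tokens(text):
            index[tok].add(oid)
    if include_neighbors:
        for a, b in associations:
            for tok in tokens(objects[a]):
                index[tok].add(b)
            for tok in tokens(objects[b]):
                index[tok].add(a)
    return index

def search(index, query):
    """Objects matching every keyword in the query."""
    hits = [index.get(tok, set()) for tok in tokens(query)]
    return set.intersection(*hits) if hits else set()

# Hypothetical objects and associations.
objects = {"paper:1": "A survey of schema matching",
           "person:1": "Some Author",
           "conf:1": "Some Conference"}
associations = [("paper:1", "person:1"), ("paper:1", "conf:1")]

naive = search(build_index(objects, associations, False), "schema matching")
ours = search(build_index(objects, associations, True), "schema matching")
```

With the naive index only the paper is returned; indexing on associated objects also returns the person and the conference linked to it.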
Explore the association network – 3. Rank returned instances in a keyword search
Example: What are important papers on “schema matching”?
Solution:
  Naive approach: rank by TF/IDF metric
  Our approach: ranking by
    Significance score: PageRank measure
    Relevance score: TF/IDF metric
    Usage score: last visit time and modification time
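How the three scores are combined is not stated on the slide. The sketch below assumes a plain weighted sum with made-up weights, computes the significance score with a textbook power-iteration PageRank over a hypothetical citation graph, and treats the relevance and usage scores as precomputed inputs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict node -> list of linked nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node without outgoing links spreads its rank evenly.
            targets = links.get(n) or list(nodes)
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three scores; the weights are an assumption."""
    w_s, w_r, w_u = weights
    return w_s * significance + w_r * relevance + w_u * usage

# Hypothetical citation graph: p1 and p2 both cite p3.
cites = {"p1": ["p3"], "p2": ["p3"], "p3": []}
rank = pagerank(cites)
most_significant = max(rank, key=rank.get)
```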
Explore the association network – 4. Fuzzy Queries
Queries we pose today: something we can describe
  Find me something with (related to) keyword X
  Find me the co-authors of Person Y
Fuzzy queries:
  Q: What do I want to know?
  A: In this webpage, 5 papers are written by your friends
  Q: What significant things have happened today?
  A: The President wrote an email to you!!
The Domain Model
[Figure: the default domain model. Classes: Person, Paper, Message, Document, Web Page, Presentation, Event. Associations: Author, Homepage, Cached, Softcopy, Sender, Recipients, Organizer, Participants]
The logical view is described with a domain model
Semex provides very basic classes and associations as a default domain model
Users can personalize the domain model
Problems in Domain Model Personalization
Problem: hard to precisely model a domain
  At a certain point we are not able to give a precise domain model
    Not enough knowledge of the domain
    Inherent evolution of a domain
    Non-existence of a precise model
  Overly detailed models may be a burden to users
    Modeling every detail of the information on one’s desktop is often overwhelming
    We may want to leave part of the domain unstructured
  Extract descriptions at different levels of granularity
    Address vs. street, city, state, zip
Malleable Schemas
[Figure labels: Clean Schema, Malleable Schema, Structured data sources, Unstructured data sources]
Key idea: capture the important aspects of the domain model without committing to a strict schema
Malleable Schema
Introduce “text” into schemas
  Phrases as element names
    E.g., “InitialPlanningPhaseParticipant”
  Regular expressions as element names
    E.g., “*Phone”, “State|Province”
  Chains as element names
    E.g., “name/firstName”
Introduce imprecision into queries
  SELECT S.~name, S.~phone
  FROM Student as S, ~Project as P
  WHERE (S ~initialParticipant P) AND (P.name = “Semex”)
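To make the "regular expressions as element names" idea concrete, the sketch below resolves a pattern such as “*Phone” or “State|Province” against concrete attribute names. It assumes that `*` stands for any run of characters and `|` separates alternatives, and that matching ignores case; the function names and the attribute list are hypothetical, and the `~` operators of the query are not modeled.

```python
import re

def element_regex(pattern):
    """Translate a malleable element name into a compiled regular expression.
    Assumption: "*" matches any run of characters, "|" separates alternatives."""
    alternatives = []
    for alt in pattern.split("|"):
        alternatives.append(".*".join(re.escape(piece) for piece in alt.split("*")))
    return re.compile("(?:" + "|".join(alternatives) + ")", re.IGNORECASE)

def matching_attributes(pattern, attributes):
    """Concrete attribute names that a malleable element name covers."""
    regex = element_regex(pattern)
    return [a for a in attributes if regex.fullmatch(a)]

# Hypothetical attribute names found in a personal data source.
attributes = ["homePhone", "officePhone", "phoneBill", "State", "Province", "city"]
phones = matching_attributes("*Phone", attributes)
regions = matching_attributes("State|Province", attributes)
```

A query over `~phone` could then be answered from every attribute the pattern covers, without the user committing to a single attribute name in advance.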
Overarching PIM Themes
It is PERSONAL data!
  How to build a system supporting users in their own habitat?
  How to create an ‘AHA!’ browsing experience and increase user’s productivity?
There can be any kind of INFORMATION
  How to combine structured and unstructured data?
We are pursuing life-long data MANAGEMENT
  What is the right granularity for modeling personal data?
  How to manage data and schema that evolve over time?
Related Work
Personal Information Management Systems
  Indexing
    Stuff I’ve Seen (MSN Desktop Search) [Dumais et al., 2003]
    Google Desktop Search [2004]
  Richer relationships
    MyLifeBits [Gemmell et al., 2002]
    Placeless Documents [Dourish et al., 2000]
    LifeStreams [Freeman and Gelernter, 1996]
  Objects and associations
    Haystack [Karger et al., 2005]
Summary
60 years have passed since the personal Memex was envisioned
  It’s time to get serious
  Great challenges for data management
Deliverables of the project
  An approach to automatically build a database of objects and associations from personal data
  An algorithm for on-the-fly integration
  Algorithms for data analysis for association search and browsing
  The concept of malleable schema as a modeling tool
  A PIM system incorporating the above
Association Network for Semex
[Figure: association network around Project: Semex. Persons: Luna, Alon, Jayant, Michelle, Yuhan. Association labels: participant, advisor, co-worker, projectLeader, Advice-giver, ArticleAbout, publishedIn. Venue: CIDR]