Ralf Schenkel
joint work with Jens Graupmann and Gerhard Weikum
The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents
VLDB 2005, Trondheim, Norway
Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Summary
Example query #1
Which professors from Saarbrücken do research on XML?
Different terminology in query and Web pages
Director of Department 5 DBS & IS
Professor at Saarland University
Abstraction Awareness
Example query #2
Conferences about XML in Norway 2005?
Context Awareness
Information is not present on a single page, but distributed across linked pages
VLDB Conference 2005, Trondheim, Norway
Call for Papers…XML…
Example query #3
What are the publications of Max Planck?
Max Planck should be instance of concept person, not of concept institute
Concept Awareness
SphereSearch Concepts
• Unified search for unstructured, semistructured, structured data from heterogeneous sources
• Graph-based model, including links
• Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries
• Flexible yet simple abstraction-aware query language with context-aware scoring
• Compactness-based scores
Goal: Increase recall & precision for hard queries on linked and heterogeneous data
Some Related Work
• Web Query Languages, e.g., W3QS [VLDB95], WebOQL [ICDE95], …
• Web IR with thesauri, e.g., Qiu et al. [SIGIR93], Liu et al. [SIGIR04], …
• XML IR, e.g., XXL [WebDB00], XIRQL [SIGIR01], XSearch [VLDB03], XRank [SIGMOD03], …
• Information extraction, e.g., Lixto, KnowItAll, …
• Advanced Web graph IR, e.g., BANKS [ICDE02], Hristidis et al. [VLDB03], …
Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work
Unifying Search on Heterogeneous Data
Sources: Web, Intranet, Databases, Enterprise Information Systems, …
→ Heuristics, type-specific transformations
→ XML
Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
• Headlines
<h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system...
→ <Experiments><Settings>...</Settings><Results>...</Results></Experiments>
• Patterns
<b>Topic:</b>XML → <Topic>XML</Topic>
• Rules for tables, lists, …
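The headline and pattern heuristics can be illustrated with a minimal Python sketch. It is not the SphereSearch implementation: the function names headings_to_xml and bold_label_to_xml are hypothetical, and the sketch assumes well-formed, attribute-free <h1>..<h6> tags and single-word values after a bold label.

```python
import re
from xml.sax.saxutils import escape


def headings_to_xml(html: str) -> str:
    """Turn <h1>..<h6> headlines into nested elements named after the headline text."""
    # re.split with two groups yields: text, level, title, text, level, title, ...
    parts = re.split(r"<h([1-6])>(.*?)</h\1>", html, flags=re.S)
    out = [escape(parts[0])]
    stack = []  # (level, tag) of the sections that are currently open
    for i in range(1, len(parts), 3):
        level, title, body = int(parts[i]), parts[i + 1], parts[i + 2]
        # Assumption: the headline text, with non-word characters replaced, is a usable tag name.
        tag = re.sub(r"\W+", "_", title.strip()) or "section"
        # A new headline closes every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        out.append(f"<{tag}>")
        stack.append((level, tag))
        out.append(escape(body))
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)


def bold_label_to_xml(html: str) -> str:
    """Rewrite '<b>Label:</b>value' into '<Label>value</Label>' (single-word values only)."""
    return re.sub(r"<b>(\w+):</b>\s*(\w+)", r"<\1>\2</\1>", html)
```

Applied to the slide's headline example, headings_to_xml nests Settings and Results inside Experiments. A real transformer would also need to handle tag attributes and headline texts that are not valid XML names.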
(Almost) Generic XML Data Model
<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research></Professor>
Element nodes:
1: docid=1 tag="Professor" content="Gerhard Weikum Saarbrücken"
2: docid=1 tag="Course" content="IR"
3: docid=1 tag="Research" content="XML"
Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction
Tags annotate content with corresponding concept (e.g., person, location)
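A minimal sketch of how the example document could be mapped to element nodes with docid, tag, and content, using Python's xml.etree.ElementTree. The names Node and xml_to_nodes are hypothetical, and numbering nodes in document order is an assumption taken from the slide's example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    docid: int
    tag: str
    content: str


def xml_to_nodes(xml_text: str, docid: int):
    """Return (nodes, edges): one node per element, one edge per parent-child pair."""
    root = ET.fromstring(xml_text)
    nodes, edges = [], []

    def visit(elem, parent_id):
        node_id = len(nodes) + 1  # ids follow document order
        # An element's own content is its leading text plus the text after each child.
        pieces = [elem.text or ""] + [child.tail or "" for child in elem]
        content = " ".join(" ".join(pieces).split())
        nodes.append(Node(node_id, docid, elem.tag, content))
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return nodes, edges
```

For the Professor example this yields three nodes, with the Professor node holding only the text that is not inside a child element.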
Information Extraction (IE)
The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm.
The <company> Pelican Hotel </company> in <location> Salvador </location>, operated by <person> Roberto Cardoso </person>, offers comfortable rooms starting at <price> $100 </price> a night, including breakfast. Please check in before <time> 7pm </time>.
• Named Entity Recognition (NER)
• Named Entity ~ abstract datatype, concept (location, person, …, IP-address)
• Mature (out-of-the-box products, e.g., GATE/ANNIE)
• Extensible
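To make the idea of an annotator concrete, here is a toy regex-based sketch for two of the concepts in the example (price and time). It is not GATE/ANNIE and does not use their APIs; the patterns, the PATTERNS table, and the annotate function are illustrative assumptions, and overlapping matches are not handled.

```python
import re

# Hypothetical, deliberately simple patterns; real NER tools use far richer rules and models.
PATTERNS = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "time": re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b"),
}


def annotate(text: str) -> str:
    """Wrap every pattern match in a tag named after its concept."""
    spans = []
    for concept, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), concept))
    # Insert tags from right to left so that earlier offsets stay valid.
    for start, end, concept in sorted(spans, reverse=True):
        text = f"{text[:start]}<{concept}> {text[start:end]} </{concept}>{text[end:]}"
    return text
```

Concepts such as person, company, or location need gazetteers or trained models, which is where out-of-the-box IE tools come in.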
Unifying Search on Heterogeneous Data
Sources: Web, Intranet, Databases, Enterprise Information Systems, …
→ Heuristics, type-specific transformations
→ XML
→ Annotation of named entities with IE tools (e.g., GATE)
→ Annotated XML
Annotation-Aware Data Model
<Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor>
Before annotation:
1: docid=1 tag="Professor" content="Gerhard Weikum Saarbrücken"
2: docid=1 tag="Course" content="IR"
3: docid=1 tag="Research" content="XML"
Annotation with GATE: "Saarbrücken" of type "location"
After annotation:
1: docid=1 tag="Professor" content="Gerhard Weikum"
2: docid=1 tag="Course" content="IR"
3: docid=1 tag="Research" content="XML"
4: docid=1 tag="location" content="Saarbrücken"
Annotation introduces new tags
Architecture
[Figure: architecture in four layers: Sources, Adapters, Annotators, Search Engine]
• Sources: Tourist Guide (XML), Hotel Website, Flight Schedule, SIGIR Website, Web Portal, Homepage Graupmann
• Adapters: XML Adapter, Web Adapter, and other adapters
• Annotators: IE Processor with Annotation Modules (DATE, PRICE, LOCATION, …)
• Search Engine with INDEX
• Example annotations: Location=Salvador, Price=89 $, Date=15-18 August, Event=SIGIR, Location=Frankfurt, Time=13:15, Person=Schenkel, FROM=SIGIR, SUBJECT=Notification
Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work
SphereSearch Queries
Extended keyword queries:
• similarity conditions: ~professor, ~Saarbrücken
• concept-based conditions: person=Max Planck, location=Trondheim
• grouping
• join conditions
Ranked results with context-aware scoring
Score Aggregation: SphereScore
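Local score sL(e) for each element e (tf/idf, BM25, …)
Weighted aggregation of local scores in environment of element (sphere score):
$$s(e) = \sum_{d=0}^{D} \alpha^{d} \sum_{e' :\, dist(e,e') = d} s_L(e'), \qquad 0 < \alpha < 1$$
Rewards proximity of terms and compactness of term distribution
[Figure: example graph for the terms "research XML", showing the spheres at distances 1 and 2 around element 1 and the computation of s(1)]
Context awareness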
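The sphere score can be sketched as a breadth-first traversal that sums local scores, damped by distance. This is an illustrative reading of the slide, not the engine's code: the function name sphere_score, the adjacency-dict graph, and the parameter names max_dist (for D) and alpha (for the damping factor) are assumptions.

```python
from collections import deque


def sphere_score(graph, local_score, e, max_dist, alpha):
    """Sum over d = 0..max_dist of alpha**d times the local scores of all elements at distance d from e."""
    dist = {e: 0}
    queue = deque([e])
    total = 0.0
    while queue:
        node = queue.popleft()
        d = dist[node]
        total += alpha ** d * local_score.get(node, 0.0)
        if d == max_dist:
            continue  # do not expand beyond the outermost sphere
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:  # first visit gives the shortest distance in BFS
                dist[neighbour] = d + 1
                queue.append(neighbour)
    return total
```

With alpha below 1, a matching element two links away contributes less than one directly adjacent, which is how proximity is rewarded.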
Similarity Conditions
Similarity conditions like ~professor, ~Saarbrücken
Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia
Relationships quantified by statistical co-occurrence measures
[Figure: thesaurus graph around "professor" with related terms: lecturer, teacher, educator, scholar, academic/academician/faculty member, scientist, researcher, investigator, mentor, intellectual, wizard, artist, alchemist, director, primadonna; edges carry relationship types and weights, e.g., HYPONYM (0.7)]
Query expansion with disambiguation:
δ-exp(x) = {w | sim(x,w) > δ}
Local score: weighted max over all expansion terms
sL(e,~professor) = max over t ∈ δ-exp(professor) of {sim(professor,t) * sL(e,t)}
Abstraction awareness
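The expansion rule and the weighted-max local score can be sketched as follows. The similarity table, the function names expand and local_score_similar, and the choice to include the query term itself with similarity 1.0 are assumptions made for illustration.

```python
def expand(term, sim, delta):
    """delta-exp(term): all thesaurus neighbours whose similarity to term exceeds delta."""
    expansion = {w: s for w, s in sim.get(term, {}).items() if s > delta}
    expansion[term] = 1.0  # assumption: a term is fully similar to itself
    return expansion


def local_score_similar(element_scores, term, sim, delta):
    """Local score for ~term: maximum over expansion terms t of sim(term, t) * local score for t."""
    return max(
        s * element_scores.get(t, 0.0)
        for t, s in expand(term, sim, delta).items()
    )
```

Taking the maximum rather than the sum keeps an element that matches many weak synonyms from outranking one that matches the query term itself.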
Concept-based conditions
Concept awareness
Goal: Exploit explicit (tags) and automatic annotations in documents
Example: location=Trondheim (concept = location, value = Trondheim) matches element e:
docid=1 tag="location" content="Trondheim"
Allows similarity and range queries (for annotated concepts) like location~Trondheim or 1970<date<1980, with concept-specific distance measures
sL(e,c=v) = score for concept-tag match + score for value-content match (concept-specific)
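A hedged sketch of a concept-based condition: the local score is the sum of a tag-match score and a value-content score. The slides do not specify how either part is computed, so the 0/1 scores, the dict-based element, and the names exact_match, concept_score, and in_date_range are assumptions chosen to keep the example small.

```python
def exact_match(value, content):
    # Assumption: 1.0 for a case-insensitive exact match, otherwise 0.0.
    return 1.0 if value.strip().lower() == content.strip().lower() else 0.0


def concept_score(element, concept, value, value_sim=exact_match):
    """Score for concept-tag match plus score for value-content match."""
    tag_score = 1.0 if element["tag"].lower() == concept.lower() else 0.0
    return tag_score + value_sim(value, element["content"])


def in_date_range(element, low, high):
    """Range condition such as 1970<date<1980, assuming the content is a plain year."""
    if element["tag"].lower() != "date":
        return False
    try:
        year = int(element["content"])
    except ValueError:
        return False
    return low < year < high
```

A concept-specific measure, for example a geographic distance for locations, could be passed in as value_sim in place of exact_match.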
Query Groups
Group conditions that relate to the same "entity"
Without groups: professor teaching IR research XML
With groups: professor T(teaching IR) R(research XML)
SphereScore computed for each group
Find compact sets with one result for each group
Goal: Related terms should occur in the same context
Scores for Query Results
Query result R: one result per query group
$$score(R) = \beta \sum_{e_i \in R} s(e_i) + (1 - \beta) \cdot compactness(R)$$
compactness ~ 1/size of a minimal spanning tree
[Figure: example graph with nodes A, X, B and weighted edges; four candidate results N1 to N4 with compactness C(N1) = 1/3, C(N2) = 1/4, C(N3) = 1/5, C(N4) = 1/6]
Context awareness
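The combination of sphere scores and compactness can be sketched with Prim's algorithm over the nodes of a candidate result. This is an assumption-laden illustration: the pairwise distance table keyed by frozenset, the weight name beta, and the function names mst_size and result_score are hypothetical, and the sketch expects at least two connected result nodes.

```python
def mst_size(nodes, dist):
    """Total weight of a minimum spanning tree over `nodes` (Prim's algorithm)."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects a node inside the tree with one outside it.
        weight, nxt = min(
            (dist[frozenset((u, v))], v)
            for u in in_tree
            for v in nodes
            if v not in in_tree and frozenset((u, v)) in dist
        )
        total += weight
        in_tree.add(nxt)
    return total


def result_score(sphere_scores, dist, beta):
    """beta * sum of sphere scores + (1 - beta) * compactness, with compactness = 1 / MST size."""
    compactness = 1.0 / mst_size(sphere_scores, dist)
    return beta * sum(sphere_scores.values()) + (1 - beta) * compactness
```

For a result whose nodes A, X, B are linked by edges of weight 2 and 1, the spanning tree has size 3, so the compactness is 1/3.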
Join conditions
Goal: Connect results of different query groups
Example: A(research, XML) B(VLDB 2005 paper) A.person=B.person
Join links, dependent on database size and application:
• Precomputed
• Computed during query execution
[Figure: example graph with nodes "research", "XML", "Ralf Schenkel", "2004", "2005", "R.Schenkel", "VLDB", "2005"; results for groups A and B are connected by join links with weights 1.0 and 0.9]
• Join conditions do not change the score for a node
• Join conditions create a new link with a specific weight
Score for Join Conditions
Join condition A.T=B.S:
• For all nodes n1 with type T, n2 with type S, add edge (n1,n2) with weight (sim(n1,n2))^-1
• sim(n1,n2): content-based similarity
[Figure: example graph with nodes A, X, B, weighted edges, and an added join edge; a compactness value of 1/3 is shown]
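A minimal sketch of how a join condition A.T=B.S could add weighted edges, with weight 1/sim so that more similar contents are closer. The slides do not name the content-based similarity, so difflib.SequenceMatcher from the Python standard library is used here purely as a stand-in; the node dicts, the min_sim threshold, and the function names are assumptions.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    # Stand-in similarity in [0, 1]; the engine's actual measure is not given in the slides.
    return SequenceMatcher(None, a, b).ratio()


def add_join_edges(edges, nodes, type_t, type_s, sim=string_sim, min_sim=0.0):
    """For every node pair (n1 of type T, n2 of type S), add an edge with weight 1 / sim(n1, n2)."""
    for n1 in nodes:
        if n1["tag"] != type_t:
            continue
        for n2 in nodes:
            if n2["tag"] != type_s or n1 is n2:
                continue
            s = sim(n1["content"], n2["content"])
            if s > min_sim:  # skip dissimilar pairs and avoid division by zero
                edges[frozenset((n1["id"], n2["id"]))] = 1.0 / s
    return edges
```

Identical contents give an edge of weight 1, and weaker matches give longer edges, so a compactness computation over the extended graph will prefer results joined by near-identical values.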
Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work
Setup for Experiments
Three corpora:
• Wikipedia
• extended Wikipedia with links to IMDB
• extended DBLP corpus with links to homepages
50 queries like
• A(actor birthday 1970<date<1980) western
• G(California,governor) M(movie)
• A(Madonna,husband) B(director) A.person=B.director
Opponent: keyword queries with standard TF/IDF-based score ("simplified Google")
No existing benchmark (INEX, TREC, …) fits
Incremental Language Levels
• SSE-basic (keywords, SphereScores)
• SSE-CV (concept-based conditions)
• SSE-QG (query groups)
• SSE-Join (join conditions)
Experimental Results on Wiki++ and DBLP++
• SphereScores better than local scores
• New SSE features nearly double precision
Current and Future Work
• Improve graphical user interface
• Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]
• Deep Web search through automatic portal queries
• Parameter tuning with relevance feedback
• Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon)