The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking
Anja Theobald and Gerhard Weikum
University of the Saarland
Saarbrücken, Germany
http://www-dbs.cs.uni-sb.de
Conclusion
Problem:
• diversity of Web / Intranet data
• despite XML, a global schema is a myth
• users are swamped with results or are looking for needles in haystacks
Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with the XXL search engine prototype
Outline
• Adding relevance to XML
• The XXL search engine:index-based query processing
• Experiments
XML Data Graph
<Uni> ETH Zürich
  <Fak> Nat.-Techn. Fak. I
    <FR> Fachrichtung Informatik
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Leistungsanalyse
            <Dozent> ... </>
            <Inhalt> ... Warteschlangen ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Sprachverarbeitung
            <Inhalt> ... Markovketten ... </>
          </Vorlesung>
          ...
      </Lehre>
      ...
    </FR>
    ...
  </Fak>
  ...
</Uni>
<Uni> Uni Stuttgart
  <Fak> Nat.-Techn. Fak. I
    <FR> Fachrichtung Informatik
      <Lehre> ...
        <Hauptstudium>
          <Vorlesung> Leistungsanalyse
            <Dozent> ... </>
            <Inhalt> ... Warteschlangen ... </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Vorlesung>
          <Vorlesung> Sprachverarbeitung
            <Inhalt> ... Markovketten ... </>
          </Vorlesung>
          ...
      </Lehre>
      ...
    </FR>
    ...
  </Fak>
  ...
</Uni>
...
<Uni> Uni Saarland
  <School> Math & Engineering
    <Dept> CS
      <Teaching> ...
        <GradStudies>
          <Course> Performance analysis
            <Lecturer> ... </>
            <Content> Queueing models .. </>
            <Lit href=springer/nelson.xml >
            <Lit href=... >
          </Course>
          <Course> Speech processing
            <Content> ... Markov chains... </>
          </Course>
          ...
      </Teaching>
      ..
    </Dept>
    ..
  </School>
  ...
</Uni>
[Figure: XML data graph with the following node labels]
Uni: Uni Saarland; School: ...; School: ...; Dept: ... CS ...; Teaching; GradStudies; Course: Performance analysis; Content: ... Queueing models; Lit: ...; Lit: ...; Course: Speech processing; Content: ... Markov chains ...
Book; Title: Stochastic...; Author: R. Nelson; Review: ... Chapter on Markov chains
Uni: Uni Stuttgart; School: CS; Course: Mobile Comm.; Prerequisites: ... Markov processes
Uni: Uni Augsburg; Curriculum: E Commerce; Weekend: Data Mining
Semistructured data: elements, attributes, links organized as labeled graph
XML Querying
[Figure: the XML data graph from the "XML Data Graph" slide; the Uni Augsburg subtree additionally shows Outline: ... statistical methods for classification ...]
Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"
Regular expressions over path labels + logical conditions over element contents
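As a rough illustration of how such a Boolean query combines a path pattern with content conditions, the following Python sketch evaluates the query on a hand-made miniature of the data graph. The tuple layout, the helper names (walk, subtree_text, boolean_query), and the hand-translated regular expression are assumptions made for this example; they are not part of XXL.

```python
import re

# Hypothetical miniature of the data graph: (label, content, children).
TREE = ("Uni", "Uni Saarland", [
    ("School", "Math & Engineering", [
        ("Dept", "CS", [
            ("Teaching", "", [
                ("Course", "Performance analysis",
                 [("Content", "Queueing models", [])]),
                ("Course", "Speech processing",
                 [("Content", "Markov chains", [])]),
            ]),
        ]),
    ]),
])

# Hand-translated form of Uni.#.School?.#.(Inst | Dept)+ over "/"-joined
# label paths; "#" is read as "any sequence of labels" (an assumption).
DEPT_PATH = re.compile(
    r"Uni/(?:[^/]+/)*(?:School/)?(?:[^/]+/)*(?:(?:Inst|Dept)/)*(?:Inst|Dept)"
)

def walk(node, path=()):
    """Yield (label_path, node) for the node and all of its descendants."""
    label, _content, children = node
    path = path + (label,)
    yield path, node
    for child in children:
        yield from walk(child, path)

def subtree_text(node):
    """Join the contents of a node and all of its descendants."""
    return " ".join(n[1] for _path, n in walk(node))

def boolean_query(root):
    """Return (Uni content, Course content) pairs satisfying all conditions.

    The root is assumed to be the Uni element bound to U.
    """
    results = []
    for path, dept in walk(root):
        if not DEPT_PATH.fullmatch("/".join(path)):
            continue                              # path condition for D
        if "CS" not in dept[1]:
            continue                              # D Like "%CS%"
        for cpath, course in walk(dept):          # D.#.Course As C
            if cpath[-1] == "Course" and "Markov chain" in subtree_text(course):
                results.append((root[1], course[1]))
    return results

print(boolean_query(TREE))
```

Every condition here either holds or fails, so a course filed under a differently named element would simply be missed.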
XML Querying
[Figure: the XML data graph for www.allunis.de/unis.xml, as on the previous slide]
Nodes marked in the graph for each condition:
• Uni As U: Uni, Uni, Uni
• U.#.School?.#.(Inst | Dept)+ As D: School, School, School, Dept
• D Like "%CS%": CS, CS
• D.#.Course As C: Course, Course, Course
• C.# Like "%Markov chain%": Markov chains, Markov chains
• Result: U, C
Boolean vs. Ranked Retrieval
There is no global schema for Intranets or the Web.
Relevance ranking of results is absolutely crucial!
Ranked Retrieval with XXL
[Figure: the XML data graph, unchanged from the "XML Querying" slide]
Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "CS"
And D.#.~Course As C
And C.# ~~ "Markov chain"
Ranked Retrieval with XXL
[Figure: the XML data graph, unchanged from the "XML Querying" slide]
Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "Computer Science"
And D.#.~Course As C
And C.# ~~ "Markov chain"
Result ranking of XML data based on semantic similarity
Outline
Adding relevance to XML
• The XXL search engine:index-based query processing
• Experiments
XXL: Flexible XML Search Language
Where clause: conjunction of regular path expressions with binding of variables
Extensible, simple core language
Select F, D, S
From www.allunis.de/unis.xml
Where Uni.#.School?.#.(Inst|Dept) As F
And F.#.Lecturer As D
And F.#.Student As S
And D.Name = S.Name
And D.Area Like "%XML%"
Elementary conditions on element/attribute names and contents
Semantic similarity conditions on names and contents
Based on tf*idf similarity of contents, ontological similarity of names, and probabilistic combination of conditions
... F.#.~Lecturer As D And D.~Area ~~ "XML"
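To make the tf*idf part of a content condition such as D.~Area ~~ "XML" more concrete, here is a minimal Python sketch that scores element contents against a query term. The slides do not give XXL's exact weighting, so the idf formula log(1 + n/df), the whitespace tokenization, and the normalization by the best score are assumptions, as is the function name tfidf_scores.

```python
import math
from collections import Counter

def tfidf_scores(query, contents):
    """Score each element content against the query terms, scaled to [0, 1].

    Assumed weighting: tf * log(1 + n / df), divided by the best score.
    """
    docs = [Counter(text.lower().split()) for text in contents]
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)   # document frequency
            if df:
                score += doc[term] * math.log(1 + n / df)
        scores.append(score)
    best = max(scores, default=0.0)
    return [s / best if best else 0.0 for s in scores]

# Hypothetical element contents.
contents = [
    "XML query languages and XML indexing",
    "relational databases",
    "XML retrieval",
]
print(tfidf_scores("XML", contents))
```

Unlike a Like condition, every element receives a score, so near misses stay in the result list with a lower rank instead of disappearing.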
XXL Result Ranking
Query:
Where Uni.#.School?.#.(Inst|Dept)+ As D
And D.#.~Lecturer As D And D.~Area ~~ "XML"
[Figure: data graph with nodes Uni: Uni Saarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Course: IR; Seminar: XML; Project: Digital libraries. Result graph with nodes Uni: Uni Saarland; Dept: CS; Dept: Math; Prof: GW; Project: IR for semistruct. data, annotated with the scores 1.0, 1.0, 0.9, 0.8, 0.6]
Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6
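The relevance score above is the product of the individual scores along a result path. A minimal Python sketch of that combination follows; reading the product as a combination of independent probabilities is an assumption, and the function name path_relevance is hypothetical.

```python
import math

def path_relevance(scores):
    """Combine per-condition scores in [0, 1] by multiplication."""
    return math.prod(scores)

# Scores taken from the result graph on the slide.
print(round(path_relevance([1.0, 1.0, 0.9, 0.8, 0.6]), 3))
```

Because every factor is at most 1.0, a single weak match lowers the score of the whole result path.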
F.#.~Course.# ~~ "Markov Chains"
F.#.~Seminar.# ~~ "Markov Chains"
XXL Search Engine
[Figure: architecture with the components WWW, XXL applet, XXL servlets, Query processor, Path indexer, Content indexer, Ontology]
Select ...
Where Uni.#.(Inst|Dept) As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"
Uni.#.(Inst|Dept) As F
F ~~ "Computer Science"
• Query decomposition into index-supported subexpressions
• wide range of optimizations
Index Structures
Element Path Index:
Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>}}
School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... }}
materializes all (parent, child) element name pairs and dynamically checks transitive connectivity
Element Content Index:
Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
precomputes all term occurrences in element contents, with frequency statistics
Element Ontology Index:
Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>}, {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}
contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
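The element path index can be pictured as nested maps from a parent name to its instances and their children, with connectivity between distant elements checked at query time. The Python sketch below assumes a small hypothetical element list and a breadth-first search for the transitive check; the names ELEMENTS, build_path_index, and is_connected are illustrative and do not come from the XXL implementation.

```python
from collections import deque

# Hypothetical sample: (element id, element name, parent id or None).
ELEMENTS = [
    (1, "Uni", None), (13, "School", 1), (111, "Prof", 1),
    (27, "Dean", 13), (31, "Dept", 13),
]
NAMES = {eid: name for eid, name, _parent in ELEMENTS}

def build_path_index(elements):
    """Map parent name -> parent id -> child name -> set of child ids."""
    index = {}
    for eid, name, parent in elements:
        if parent is not None:
            per_parent = index.setdefault(NAMES[parent], {}).setdefault(parent, {})
            per_parent.setdefault(name, set()).add(eid)
    return index

def is_connected(index, ancestor, descendant):
    """Check transitive connectivity with a breadth-first search."""
    queue, seen = deque([ancestor]), {ancestor}
    while queue:
        current = queue.popleft()
        for child_ids in index.get(NAMES[current], {}).get(current, {}).values():
            for child in child_ids:
                if child == descendant:
                    return True
                if child not in seen:      # links may introduce cycles
                    seen.add(child)
                    queue.append(child)
    return False

PATH_INDEX = build_path_index(ELEMENTS)
print(PATH_INDEX["Uni"][1])
print(is_connected(PATH_INDEX, 1, 31))
```

Only direct (parent, child) pairs are stored in this sketch; longer paths are discovered by the search, which mirrors the split between materialized pairs and dynamic connectivity checks described above.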
Query Decomposition & Evaluation
• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph
Example query:
Uni.#.(Inst|Dept)+ As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"
Example of subquery NFSA: Uni.#.(Inst|Dept)+
[Figure: NFSA with the states Uni, %, Inst, Dept]
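The NFSA representation of a subquery such as Uni.#.(Inst|Dept)+ can be sketched as a small transition table that is simulated over the labels of a candidate path. The state numbering, the use of "%" as an any-label symbol, and the function name accepts are assumptions for this example; the slides only show the automaton as a figure.

```python
ANY = "%"   # wildcard symbol: matches any element name

# Assumed encoding of Uni.#.(Inst|Dept)+ as state -> [(symbol, next state)].
TRANSITIONS = {
    0: [("Uni", 1)],
    1: [(ANY, 1), ("Inst", 2), ("Dept", 2)],
    2: [("Inst", 2), ("Dept", 2)],
}
ACCEPTING = {2}

def accepts(labels):
    """Simulate the NFSA on a label path by tracking the set of active states."""
    states = {0}
    for label in labels:
        states = {
            target
            for state in states
            for symbol, target in TRANSITIONS.get(state, [])
            if symbol == ANY or symbol == label
        }
        if not states:
            return False
    return bool(states & ACCEPTING)

print(accepts(["Uni", "School", "Dept"]))
print(accepts(["Uni", "School"]))
```

A top-down strategy would run such an automaton from the Uni elements downwards, while a bottom-up strategy would start from the Inst or Dept elements and follow the transitions in reverse.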
The Role of Ontologies
<Uni> Univ. Saarland
  <School> Engineering
    <Dept> Computer Science
      <Faculty> Prof. Dr. GW
        <Project> Semistructured Data ... XML </>
...
[Figure: concept graph with the nodes University, Institute, Dept, Prof, Course, Seminar, Teaching, Research, Project, Publication, Conference, Journal]
$\forall c\,(\mathit{Course}(c) \Rightarrow \exists s\,((\mathit{Dept}(s) \lor \mathit{Inst}(s)) \land \mathit{Curriculum}(c,s)))$
Observation: Information becomes better searchable when it is more explicitly structured and canonically annotated
"Poor man's ontology":
• graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet)
• quantitative reasoning ("semantic similarity" measures)
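A "poor man's ontology" of this kind can be used to expand a similarity condition on an element name, such as ~Course, into a set of related names with scores. The Python sketch below uses the Course entries shown for the element ontology index; the cut-off threshold of 0.5 and the function name expand are assumptions.

```python
# Similarity scores for Course as listed in the element ontology index.
ONTOLOGY = {
    "Course": {
        "Seminar": 0.9, "Project": 0.7, "Teaching": 0.9,
        "Telecourse": 0.9, "Video lecture": 0.7, "Meditation": 0.1,
    },
}

def expand(name, threshold=0.5):
    """Return {element name: score} for the name itself and similar names.

    The threshold is a hypothetical cut-off, not a value from the slides.
    """
    similar = {name: 1.0}
    for other, score in ONTOLOGY.get(name, {}).items():
        if score >= threshold:
            similar[other] = score
    return similar

print(expand("Course"))
```

Each expanded name can then be evaluated like an exact name, with its score carried into the relevance of the result path.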
Outline
Adding relevance to XML
The XXL search engine:index-based query processing
• Experiments
Example Data
Example Query
SELECT *
FROM INDEX
WHERE ~drama.#.scene AS C
AND C.speech AS S
AND (S.speaker ~ "Woman")
AND S.line AS L
AND (L.CONTENT ~ "leader")
AND C.speech AS M
AND (M.speaker = "MACBETH")
Example Ontology
thane -- (a feudal lord or baron in Scotland)
  => lord, noble, nobleman -- (a titled peer of the realm)
    => male aristocrat -- (a man who is an aristocrat)
      => leader -- (a person who rules or guides or inspires others)
Example Ontology
woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)
Example Results
Relevance = 0.0070400005
<scene>
  <speech>
    <speaker> Second Witch </speaker>
    <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
  </speech>
  <speech>
    <speaker> MACBETH </speaker>
    <line> ... </line>
  </speech>
</scene>
XXL Runtime Measurements
Q1:
Select * From Index
Where #.publication AS A
And A.~headline ~~ "XML"
And A.author% AS B
Q2:
Select * From Index
Where #.play AS A
And A.#.personae AS B
And B.~figure ~~ "King"
And B.title AS C
    | #results, top-down | bottom-up | w/ optimization
Q1  | 13114.3 sec | 694 sec | 2.68 sec (incl. 0.37 sec), 2bu 1bu 3td
Q2  | 588.5 sec | 3.7 sec | 4.64 sec (incl. 0.33 sec), 1bu 2td 3td 4td
Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)
Conclusion
Goal: should be able to find results for every search in one day (computer time) with < 1 min intellectual effort that the best human experts can find with infinite time
Research avenue: explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)
pursued in CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)