Inference of Concise DTDs from XML data
Geert Jan Bex (1)
Frank Neven (1)
Thomas Schwentick (2)
Karl Tuyls (3)
(1) Hasselt University and Transnational University of Limburg
(2) Dortmund University
(3) Maastricht University and Transnational University of Limburg
Outline
• Goals & motivation
• Problem setting
• iDTD: Sample → SOA → SORE
• CRX: Sample → CHARE
• Experiments
• Extensions
• Conclusions
Aims & requirements
• Problem: infer a DTD from an XML corpus
• Requirements:
– Concise: humans can interpret/validate
– Works on large data sets
– Works on small data sets
– Robust to noise
[Figure: XML corpus → DTD]
Why DTD inference?
• Schema inference
– ≈ 50 % of XML documents: no schema [Barbosa et al. 2005]
– ≈ 66 % of DTDs and XSDs: not valid [Bex et al. 2005]
– Improving existing schemas
– “Noisy” XML documents: ≈ 90 % of XHTML docs not valid
• Related work
– Fails on real-world, large data sets
– Results not concise
Why schemas?
• Validation: efficiency, security
• Optimization: search, processing
• Static analysis, type checking (e.g., XQuery)
• Software development: modeling, OR-mapping
• Integration: (meta-)data sources
• Schema matching
• Semantics
XML documents
[Figure: two example XML trees]
• book: title author author year
• book: title editor year isbn
Learning a regular expression from a set of strings:
title (author+ + editor+) year isbn?
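The reduction from XML documents to strings can be sketched as follows. This is only an illustration, assuming Python's standard `xml.etree.ElementTree` parser; the function name `child_sequences` and the two toy documents are hypothetical and not part of the talk.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_sequences(xml_documents):
    """Map each element name to the child-name sequences observed under it."""
    samples = defaultdict(list)
    for text in xml_documents:
        root = ET.fromstring(text)
        # root.iter() visits the root and all of its descendants.
        for element in root.iter():
            samples[element.tag].append([child.tag for child in element])
    return samples


# Toy documents mirroring the two "book" trees on the slide.
docs = [
    "<book><title/><author/><author/><year/></book>",
    "<book><title/><editor/><year/><isbn/></book>",
]
samples = child_sequences(docs)
print(samples["book"])
```

Each list under `samples["book"]` is one string of the sample from which a content model for `book` would be learned.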
Learning automata?
Well studied, but…
Learning automata ≠ learning regular expressions
((b?(a+c))+ d)+ e
Learning regular languages?
S = { abbb, abbd, acd, ac }
• abbb + abbd + acd + ac: most specific regex for S
• (a + b + c + d)*: most general regex for S
• In between (most specific << ??? << most general): a (b* + c) d? ?
• Generalization vs. specificity
• Positive examples only!
• Impossible… in general
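The trade-off between generalization and specificity can be made concrete by counting how many strings each candidate accepts. The sketch below assumes Python's `re` syntax, where the slide's union `+` is written `|`; the length bound of 4 and the names `CANDIDATES` and `accepted_count` are illustrative choices.

```python
import re
from itertools import product

ALPHABET = "abcd"
S = ["abbb", "abbd", "acd", "ac"]

# The three candidates from the slide, translated to Python regex syntax.
CANDIDATES = {
    "most specific": r"abbb|abbd|acd|ac",
    "in between": r"a(b*|c)d?",
    "most general": r"[abcd]*",
}


def accepted_count(pattern, max_len):
    """Count the strings over ALPHABET of length <= max_len that match pattern."""
    compiled = re.compile(pattern)
    return sum(
        1
        for n in range(max_len + 1)
        for chars in product(ALPHABET, repeat=n)
        if compiled.fullmatch("".join(chars))
    )


# Every candidate accepts the whole sample S ...
all_accept_S = all(re.fullmatch(p, s) for p in CANDIDATES.values() for s in S)
# ... but they differ widely in how many other strings they accept.
counts = {name: accepted_count(p, 4) for name, p in CANDIDATES.items()}
print(all_accept_S, counts)
```

All three expressions are consistent with the positive examples, so the sample alone cannot decide between them.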
Subclasses
• Single Occurrence Regular Expressions (SOREs)
– 99 % of regular expressions in DTDs/XSDs
– Infer with iDTD
• CHAin Regular Expressions (CHAREs)
– 90 % of regular expressions in DTDs/XSDs
– Infer with CRX
SOREs
• What’s a SORE
header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
title . (author . affiliation?)+ . abstract
• … and what’s not
title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
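The single-occurrence condition is purely syntactic and easy to test. A minimal sketch, assuming element names are plain identifiers and the expression is given as a string in the slide's notation; the helper names are hypothetical.

```python
import re
from collections import Counter


def repeated_names(expression):
    """Return the element names that occur more than once in the expression."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expression)
    return sorted(name for name, count in Counter(names).items() if count > 1)


def is_single_occurrence(expression):
    """True if every element name occurs at most once."""
    return not repeated_names(expression)


sore = "title . (author . affiliation?)+ . abstract"
not_sore = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"
print(is_single_occurrence(sore))   # True
print(repeated_names(not_sore))     # ['affiliation']
```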
Sample → SOA
W = {bacacdacde, cbacdbacde, abccaadcde}
[Figure: Single Occurrence Automaton over the states a, b, c, d, e, built from W with 2T-Inf [Garcia & Vidal 1990]]
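A rough sketch of how such an automaton can be read off the sample in one pass, in the spirit of 2T-Inf: record which symbols start a string, which end one, and which pairs occur next to each other. The representation as three sets and the name `two_t_inf` are assumptions made for illustration, not the authors' implementation.

```python
def two_t_inf(sample):
    """Build a single occurrence automaton as (initial, final, edges).

    Each symbol is its own state; `initial` holds the symbols reachable from
    the source state, `final` the symbols with an edge to the sink state, and
    `edges` the symbol-to-symbol transitions.
    """
    initial, final, edges = set(), set(), set()
    for word in sample:
        if not word:
            continue  # simplification: empty strings are ignored in this sketch
        initial.add(word[0])
        final.add(word[-1])
        # Every pair of adjacent symbols becomes a transition.
        edges.update(zip(word, word[1:]))
    return initial, final, edges


W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
initial, final, edges = two_t_inf(W)
print(sorted(initial), sorted(final), len(edges))
```

For this sample the sketch yields the initial symbols a, b, c, the final symbol e, and 14 transitions. Each string is scanned once, which fits a streaming setting.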
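Sample → SOA
• SOA size
– $|\Sigma| + 2$ states
– $O(|\Sigma|^2)$ transitions
• Complexity of algorithm
– $O(\|W\|)$, where $\|S\| = \sum_{s \in S} |s|$ for a sample $S$
– streaming
• Algorithm sound
– $W \subseteq L(\mathrm{SOA})$
– in general: $|S| \ll |L(\mathrm{SOA})|$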
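SOA → SORE: REWRITE
[Figure: the SOA over a, b, c, d, e is rewritten step by step]
• optional b: b becomes b?
• disjunction a, c: a and c become a+c
• concatenation b?, a+c: b? and a+c become b? (a+c)
• self-loop b? (a+c): b? (a+c) becomes (b? (a+c))+
• Result: ((b? (a+c))+ d)+ e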
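As a sanity check on the result, the final expression can be translated to Python's `re` syntax (union `+` becomes `|`, concatenation is implicit) and run against the three sample strings from the earlier slide. The extra test strings `ade` and `bde` are illustrative choices.

```python
import re

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]

# ((b? (a+c))+ d)+ e in Python regex syntax.
SORE = re.compile(r"((b?(a|c))+d)+e")

# The expression accepts every string of the sample ...
covers_sample = all(SORE.fullmatch(word) for word in W)
# ... generalizes to strings that were never observed ...
accepts_unseen = SORE.fullmatch("ade") is not None
# ... yet still rejects strings that violate the structure (b needs a or c after it).
rejects_invalid = SORE.fullmatch("bde") is None
print(covers_sample, accepts_unseen, rejects_invalid)
```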
REWRITE: properties
• Theorem
– REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete)
– Complexity: $O(|\Sigma|^4)$
• SORE size
– $|\Sigma|$ symbols
– $O(|\Sigma|)$ operators
REWRITE + repairs = iDTD
W = {bacacdacde, cbacdbacde}
[Figure: SOA for W over the states a, b, c, d, e]
• No rules apply!!!
• Almost disjunction a, c
• Fix: enable-disjunction, enable-optional
• Result: ((b? (a+c))+ d)+ e
iDTD: properties
• Theorem
– iDTD transforms SOA into SORE such that $L(\mathrm{SOA}) \subseteq L(\mathrm{SORE})$
• iDTD can be parameterized for performance
CHAREs
• Definition: a chain regular expression is a sequence of factors f1,…,fn such that no alphabet symbol occurs more than once and each factor is one of
– (a1 + … + ak)
– (a1 + … + ak)?
– (a1 + … + ak)+
– (a1 + … + ak)*
• CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions (CHAREs)
CHAREs
• What’s a chain
header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
• … and what’s not
title . (author . affiliation?)+ . abstract (not a factor)
title . ((author . affiliation)+ + (editor . affiliation)+) . abstract (duplicate element names)
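The two conditions in the definition (every top-level factor is a plain disjunction with an optional `?`, `+` or `*`, and no name repeats) can be checked mechanically. A minimal sketch for expressions written in the slide's notation with `.` for concatenation; the names `FACTOR`, `split_top_level` and `is_chare` are hypothetical.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
# A factor: a single name or a parenthesized disjunction of names,
# optionally followed by one of ?, + or *.
FACTOR = re.compile(rf"(?:{NAME}|\(\s*{NAME}(?:\s*\+\s*{NAME})*\s*\))[?+*]?")


def split_top_level(expression):
    """Split on '.' separators that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "." and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def is_chare(expression):
    """True if every top-level factor is well formed and no name repeats."""
    if not all(FACTOR.fullmatch(f) for f in split_top_level(expression)):
        return False
    names = re.findall(NAME, expression)
    return len(names) == len(set(names))


chain = "authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?"
nested = "title . (author . affiliation?)+ . abstract"
print(is_chare(chain), is_chare(nested))  # True False
```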
CRX run: pre-order relation
Sample W:
a b c c d e
c c c a d
b f e g
b f h i
Pre-order relation $\to_W$:
a → b, b → c, c → d, d → e
c → a, a → d
b → f, f → e, e → g
f → h, h → i
[Figure: graph of $\to_W$ over the nodes a, b, c, d, e, f, g, h, i]
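The pairs listed on this slide are exactly the pairs of symbols that occur next to each other in some string of W, so they can be collected in a single pass. A small sketch, with `direct_successors` as a hypothetical helper name; note that it also reports the loop (c, c), which the slide does not list.

```python
W = ["a b c c d e", "c c c a d", "b f e g", "b f h i"]


def direct_successors(sample):
    """Collect all pairs (x, y) such that y directly follows x in some string."""
    pairs = set()
    for line in sample:
        symbols = line.split()
        pairs.update(zip(symbols, symbols[1:]))
    return pairs


pairs = direct_successors(W)
print(sorted(pairs))
```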
CRX run: transitive closure
Sample W = { a b c c d e, c c c a d, b f e g, b f h i }
• If $a \to_W b$ and $b \to_W c$, then $a \to_W c$
[Figure: graph over the nodes a, b, c, d, e, f, g, h, i]
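CRX run: transitive closure
Sample W = { a b c c d e, c c c a d, b f e g, b f h i }
[Figure: graph in which a, b, c are grouped into one node; remaining nodes d, e, f, g, h, i]
• Equivalence class: {a, b, c}
• If $a \to_W b$ and $b \to_W a$, then $a \approx_W b$
• Each symbol occurs in exactly one equivalence class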
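Both steps, closing the relation under transitivity and grouping mutually reachable symbols, can be sketched in a few lines. The version below uses a plain Warshall-style closure, which is cubic in the number of symbols; the set `PAIRS` restates the direct-successor pairs of the sample (including the loop on c), and the function names are hypothetical.

```python
from itertools import product

# Direct-successor pairs of the sample W.
PAIRS = {
    ("a", "b"), ("b", "c"), ("c", "c"), ("c", "d"), ("d", "e"),
    ("c", "a"), ("a", "d"),
    ("b", "f"), ("f", "e"), ("e", "g"),
    ("f", "h"), ("h", "i"),
}


def transitive_closure(pairs):
    """Warshall-style closure: add (i, j) whenever (i, k) and (k, j) are present."""
    closure = set(pairs)
    symbols = {s for pair in pairs for s in pair}
    # product varies its first component slowest, so k is the outer loop.
    for k, i, j in product(symbols, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure


def equivalence_classes(pairs):
    """Group symbols that can reach each other in the closed relation."""
    closure = transitive_closure(pairs)
    classes = []
    for s in sorted({x for pair in pairs for x in pair}):
        for cls in classes:
            rep = cls[0]
            if (s, rep) in closure and (rep, s) in closure:
                cls.append(s)
                break
        else:
            classes.append([s])
    return classes


classes = equivalence_classes(PAIRS)
print(classes)
```

On the sample this groups a, b and c into one class and leaves every other symbol in a class of its own, matching the slide.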
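CRX run: folding
Sample W = { a b c c d e, c c c a d, b f e g, b f h i }
[Figure: graph over the nodes {a,b,c}, d, e, f, g, h, i, annotated with predecessor and successor sets]
• Partial order $\to_W$
• $\mathrm{pred}(\gamma) = \{\gamma' \mid \gamma' \to_W \gamma\}$
• $\mathrm{succ}(\gamma) = \{\gamma' \mid \gamma \to_W \gamma'\}$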
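CRX run: folding
Sample W = { a b c c d e, c c c a d, b f e g, b f h i }
[Figure: after folding, the nodes are {a,b,c}, {d,f}, e, g, h, i]
• Partial order $\to_W$
• $\mathrm{pred}(\gamma) = \{\gamma' \mid \gamma' \to_W \gamma\}$
• $\mathrm{succ}(\gamma) = \{\gamma' \mid \gamma \to_W \gamma'\}$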
CRX run: multiplicity & RE
Sample W = { a b c c d e, c c c a d, b f e g, b f h i }
[Figure: nodes annotated with multiplicities: {a,b,c} +, {d,f} none, e ?, g ?, h ?, i ?]
• Topological sort
• Chain Regular Expression: (a + b + c)+ . (d + f) . e? . g? . h? . i?
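The last step can be sketched as follows: for each node, count how often its symbols occur in every string of the sample and pick the multiplicity from the minimum and maximum of those counts. The node list `ORDERED_CLASSES` is taken from the slide and is assumed to be already folded and topologically sorted (the order of e, g, h, i is one valid choice); the function names are hypothetical.

```python
W = [s.split() for s in ["a b c c d e", "c c c a d", "b f e g", "b f h i"]]

# Folded nodes in one valid topological order (assumed, taken from the slide).
ORDERED_CLASSES = [["a", "b", "c"], ["d", "f"], ["e"], ["g"], ["h"], ["i"]]


def multiplicity(symbols, sample):
    """Choose '', '?', '+' or '*' from the per-string occurrence counts."""
    counts = [sum(word.count(s) for s in symbols) for word in sample]
    at_least_once = min(counts) >= 1
    at_most_once = max(counts) <= 1
    if at_least_once and at_most_once:
        return ""    # exactly once in every string
    if at_most_once:
        return "?"   # sometimes absent, never repeated
    if at_least_once:
        return "+"   # always present, sometimes repeated
    return "*"       # sometimes absent, sometimes repeated


def factor(symbols, sample):
    """Render one node as a factor of the chain regular expression."""
    body = symbols[0] if len(symbols) == 1 else "(" + " + ".join(symbols) + ")"
    return body + multiplicity(symbols, sample)


chare = " . ".join(factor(cls, W) for cls in ORDERED_CLASSES)
print(chare)  # (a + b + c)+ . (d + f) . e? . g? . h? . i?
```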
CRX algorithm: properties
• Optimality: for every linearly ordered $W$ and every CHARE $r$: if $W \subseteq L(r)$ and $L(r) \subseteq L(r_W)$, then $r_W = r$
• Performance: $O(\|W\| + |\Sigma|^3)$
• Training set size: any CHARE $r$ can be learned from $\{w \mid w \in L(r) \wedge \forall w' \in L(r): |w| \le |w'| + 2\}$
Related work
• XTRACT [Garofalakis et al. 2000]
– Pioneer
– More general than iDTD
– Focuses on regular expressions that don’t occur in real DTDs ⇒ no concise schemas
• Trang: roughly equivalent to CRX
– Inconsistent results
Data
• Real-world regular expressions
– SOREs
– Non-SOREs
• Real world data when available
• Synthetic data otherwise
[Figure: real-world data]
[Figure: real-world regexes]
Experiments: generalization
[Plot; legend: CRX, iDTD, no repairs]
Experiments: generalization
[Plot; legend: CRX, iDTD]
Extensions
• Incremental computation
– New data ⇒ update internal representation (SOA or partial order)
• Noise
– Support for element name too small ⇒ ignore element
– SOA: support for edges too small ⇒ delete edges before repair
• Numerical predicates
– Bookkeeping: minOccurs, maxOccurs
• Generating XSDs
– Infer data types (integer, double, date, …)
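Two of these extensions lend themselves to a short sketch: dropping weakly supported SOA edges and keeping occurrence bounds per element. The slides do not define "support", so the sketch assumes it means the number of sample strings in which an edge occurs; the threshold value and all function names are illustrative.

```python
from collections import Counter


def edge_support(sample):
    """Count, for each adjacent pair, the number of strings it occurs in."""
    support = Counter()
    for word in sample:
        support.update(set(zip(word, word[1:])))  # count each string once
    return support


def prune_edges(sample, min_support):
    """Keep only edges whose support reaches the (assumed) threshold."""
    return {edge for edge, n in edge_support(sample).items() if n >= min_support}


def occurrence_bounds(sample):
    """Per symbol, the minimum and maximum number of occurrences in a string."""
    symbols = {s for word in sample for s in word}
    return {
        s: (min(w.count(s) for w in sample), max(w.count(s) for w in sample))
        for s in symbols
    }


W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
kept = prune_edges(W, min_support=2)
bounds = occurrence_bounds(W)
print(sorted(kept), bounds["b"], bounds["e"])
```

With a threshold of 2, only five of the fourteen edges of this sample survive, and the bounds for `b` and `e` come out as (1, 2) and (1, 1), the kind of values that could feed minOccurs and maxOccurs.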
Conclusions
• iDTD + CRX
– learn a robust class of regexes from positive examples
– are complete in their target class for sufficient data
– deal with insufficient data
– perform well on real-world data
– run efficiently
• Future work: inferring XML Schemas