![Page 1: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/1.jpg)
Validating Streaming XML Documents
Luc Segoufin & Victor Vianu
Presented by Harel Paz
![Page 2: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/2.jpg)
The Challenge
XML is becoming a standard for data exchange on the Web.
Need: on-line processing of large amounts of data in XML format, using limited memory.
Our focus: validating XML documents against given DTDs.
![Page 3: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/3.jpg)
Validating Streaming XML Documents
Restrictions over the validation:
In a single pass.
Using a fixed amount of memory, depending on the DTD.
[Figure: an input stream such as ...<u><v>...</v><v><w>...</w></v> is read by an FSA that runs from its start state to an accept state and answers Yes/No.]
![Page 4: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/4.jpg)
The Problem in 2 Flavors
There are 2 flavors to the problem:
Strong validation: validation that includes checking well-formedness.
Validation: checking satisfaction of the DTD, under the assumption that the input is a well-formed XML document.
![Page 5: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/5.jpg)
Tree Document
XML documents are abstracted by “tree documents”.
A tree document over a finite alphabet $\Sigma$ is a finite unranked tree with labels in $\Sigma$ and an order on the children of each node.
[Figure: an example tree document $t$ with root $r$ and nodes labeled $a$, $b$, and $c$.]
![Page 6: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/6.jpg)
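String Representation
XML documents are a string representation of trees using opening and closing tags for each element.
For each $a \in \Sigma$, $a$ represents the opening tag.
$\bar{a}$ represents the closing tag for $a$.
Notation: $\bar{\Sigma} = \{\bar{a} \mid a \in \Sigma\}$.
[Figure: the example tree with root $r$ and nodes labeled $a$, $b$, and $c$, shown next to its string representation over $\Sigma \cup \bar{\Sigma}$.]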
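The translation from a tree to its string representation can be sketched in a few lines of Python. This is only an illustration: the `(label, [children])` tuple encoding, the function name `string_rep`, and the use of `/a` to write the closing tag $\bar{a}$ are assumptions made here, not notation from the slides.

```python
def string_rep(tree):
    """Return the tag sequence of a tree given as (label, [children])."""
    label, children = tree
    tokens = [label]                      # opening tag
    for child in children:                # children in document order
        tokens.extend(string_rep(child))
    tokens.append("/" + label)            # closing tag, written "/a" for a-bar
    return tokens


# A small hypothetical tree: r has one child a, which has children b and c.
example = ("r", [("a", [("b", []), ("c", [])])])
print(" ".join(string_rep(example)))      # r a b /b c /c /a /r
```

Every node contributes exactly one opening and one closing tag, so the output is always well-balanced.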
![Page 7: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/7.jpg)
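DTDs
A DTD consists of an extended context-free grammar over alphabet $\Sigma$.
DTD $d$:
$r \to a^*$
$a \to bc$
$b \to c?$
$c \to \varepsilon$
A tree document over $\Sigma$ satisfies a DTD $d$ if it is a derivation tree of the grammar.
[Figure: the example tree $T$ with root $r$ and nodes labeled $a$, $b$, and $c$.] $T$ satisfies $d$.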
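Checking that a tree satisfies a DTD amounts to checking, at every node, that the word formed by the children's labels matches the node's rule. The sketch below assumes single-character labels, encodes each rule as a Python regular expression (with the empty string standing for $\varepsilon$), and uses hypothetical names `DTD` and `satisfies`. It works on a tree held in memory, so it is not the streaming validation discussed later.

```python
import re

# Hypothetical encoding of the DTD on this slide; "" stands for the empty word.
DTD = {"r": "a*", "a": "bc", "b": "c?", "c": ""}


def _check(tree, dtd):
    label, children = tree
    word = "".join(child[0] for child in children)   # children labels, in order
    if label not in dtd or re.fullmatch(dtd[label], word) is None:
        return False
    return all(_check(child, dtd) for child in children)


def satisfies(tree, dtd, root="r"):
    """True if tree (label, [children]) is a derivation tree of dtd from root."""
    return tree[0] == root and _check(tree, dtd)


good = ("r", [("a", [("b", [("c", [])]), ("c", [])]),
              ("a", [("b", []), ("c", [])])])
bad = ("r", [("a", [("b", [])])])                     # a needs children b c
print(satisfies(good, DTD), satisfies(bad, DTD))
```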
![Page 8: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/8.jpg)
DTDs – cont’
Each DTD has a unique rule $a \to R_a$ for each symbol $a \in \Sigma$. $R_a$ denotes the regular expression of that rule.
$L(d)$ is the language over $\Sigma \cup \bar{\Sigma}$ consisting of the string representations of all tree documents satisfying $d$.
![Page 9: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/9.jpg)
Strong Validation of Streaming XML Documents
The problem: validating an XML document with respect to a given DTD.
Need to characterize the DTDs $d$ for which $L(d)$ can be recognized by an FSA.
Such DTDs are called strongly recognizable.
![Page 10: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/10.jpg)
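Strong Validation – Example 1
DTD $d$:
$r \to a$
$a \to a?$
$L(d) = \{ r\, a^n\, \bar{a}^n\, \bar{r} \mid n \ge 1 \}$.
$L(d)$ is not regular, so $d$ cannot be strongly validated by an FSA.
$d$ is not strongly recognizable.
[Figure: a tree with root $r$ above a chain of $a$ nodes.]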
![Page 11: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/11.jpg)
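Strong Validation – Example 2
DTD $d$:
$r \to a^*$
$a \to b \mid c$
$L(d) = \{ r\, (a\, (b\,\bar{b} \mid c\,\bar{c})\, \bar{a})^*\, \bar{r} \}$.
$L(d)$ is regular, so $d$ is strongly recognizable.
[Figure: a tree with root $r$ whose $a$ children each have a single child, $b$ or $c$.]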
![Page 12: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/12.jpg)
More Definitions
Let $d$ be a DTD over $\Sigma$.
The dependency graph of $d$, $G_d$, is the graph constructed as follows:
Its set of vertices is $\Sigma$.
For each rule $a \to R_a$ in $d$, there is an edge from $a$ to $b$, for each $b$ occurring in $R_a$.
![Page 13: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/13.jpg)
More Definitions (cont’)
Two labels, $a$ and $b$, are mutually recursive if they belong to some cycle of $G_d$.
$a$ is recursive if it is mutually recursive with itself.
DTD $d$ is non-recursive iff $G_d$ is acyclic.
A DTD $d$ is fully recursive if all labels from which recursive labels are reachable in $G_d$ are mutually recursive.
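These definitions translate directly into reachability checks on $G_d$. The following is a minimal sketch, assuming single lowercase-letter labels, rules written as Python regular expressions, and a rule for every label. All function names are hypothetical, and "fully recursive" is read here as pairwise mutual recursion among the labels that can reach a recursive label.

```python
import re


def dependency_graph(dtd):
    """Edge a -> b for every label b occurring in the rule of a."""
    return {a: set(re.findall(r"[a-z]", rule)) for a, rule in dtd.items()}


def reachable(graph, start):
    """Labels reachable from start along a path of length >= 1."""
    seen, todo = set(), list(graph[start])
    while todo:
        x = todo.pop()
        if x not in seen:
            seen.add(x)
            todo.extend(graph[x])
    return seen


def mutually_recursive(graph, a, b):
    return b in reachable(graph, a) and a in reachable(graph, b)


def is_recursive(graph, a):
    return a in reachable(graph, a)


def is_non_recursive(graph):
    return not any(is_recursive(graph, a) for a in graph)


def is_fully_recursive(graph):
    rec = {a for a in graph if is_recursive(graph, a)}
    # Labels from which some recursive label is reachable (recursive ones included).
    sources = {a for a in graph if a in rec or reachable(graph, a) & rec}
    return all(mutually_recursive(graph, a, b) for a in sources for b in sources)


g1 = dependency_graph({"r": "a", "a": "a?"})
g2 = dependency_graph({"r": "a*", "a": "b|c", "b": "", "c": ""})
print(is_recursive(g1, "a"), is_fully_recursive(g1), is_non_recursive(g2))
```

For the first DTD, $r$ reaches the recursive label $a$ but lies on no cycle, so the DTD is recursive without being fully recursive.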
![Page 14: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/14.jpg)
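Dependency Graph – Examples
DTD $d$:
$r \to a$
$a \to a?$
[Figure: $G_d$ with an edge from $r$ to $a$ and a self-loop on $a$.]
$G_d$ is not acyclic. $d$ is not fully recursive. $a$ is recursive.
DTD $d$:
$r \to a^*$
$a \to b \mid c$
[Figure: $G_d$ with an edge from $r$ to $a$ and edges from $a$ to $b$ and to $c$.]
$d$ is non-recursive.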
![Page 15: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/15.jpg)
Characterization of Strongly Recognizable DTDs
Theorem 3.1 (partial): A DTD is strongly recognizable iff it is non-recursive.
Proof sketch:
If $d$ is a strongly recognizable DTD, there is an FSA recognizing exactly $L(d)$. Suppose towards a contradiction that $d$ is recursive, and show using the pumping lemma that the above FSA also accepts strings that are not well-balanced.
If $d$ is non-recursive, an algorithm to build an FSA recognizing $L(d)$ is given.
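One way to see why non-recursive DTDs are strongly recognizable is to inline the rules into a single regular expression for $L(d)$. The sketch below is an illustration of that idea, not the algorithm from the paper. It assumes single lowercase-letter labels, writes the closing tag $\bar{a}$ as the uppercase letter `A`, and uses a hypothetical function name `strong_regex`. The expression can grow exponentially with nesting depth.

```python
import re


def strong_regex(dtd, root):
    """Inline the rules of a non-recursive DTD into one regex for L(d)."""
    def expand(label, ancestors=()):
        if label in ancestors:
            raise ValueError("recursive DTD: no regular expression for L(d)")
        # Replace every label b in the rule by the full expression for b-subtrees.
        body = re.sub(
            r"[a-z]",
            lambda m: "(?:" + expand(m.group(), ancestors + (label,)) + ")",
            dtd[label],
        )
        return label + "(?:" + body + ")" + label.upper()

    return re.compile(expand(root))


# The DTD of "Strong Validation – Example 2"; "" stands for the empty word.
L = strong_regex({"r": "a*", "a": "b|c", "b": "", "c": ""}, "r")
print(bool(L.fullmatch("rabBAacCAR")), bool(L.fullmatch("raAR")))
```

Because the resulting expression spells out every opening and closing tag, matching against it checks well-formedness and the DTD at the same time, which is what strong validation requires.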
![Page 16: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/16.jpg)
Validating Well-Formed XML Documents
The problem: validating an XML document with respect to a given DTD $d$, assuming the XML document is well-formed.
Validation using an FSA.
DTDs that can be validated this way are called recognizable.
The requirement that $L(d)$ should be regular is now too strong.
The FSA should only work correctly on well-balanced strings representing trees.
![Page 17: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/17.jpg)
Validation - Example 1
DTD $d$:
$r \to a$
$a \to a?$
$L(d) = \{ r\, a^n\, \bar{a}^n\, \bar{r} \mid n \ge 1 \}$.
$d$ is not strongly recognizable. But, it is recognizable:
If the input is known to be well balanced, the FSA should just check that the string is of the form $r\, a^*\, \bar{a}^*\, \bar{r}$ (more precisely $r\, a\, a^*\, \bar{a}\, \bar{a}^*\, \bar{r}$).
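The claim can be checked by brute force on short words. In this sketch the closing tags $\bar{a}$ and $\bar{r}$ are written `A` and `R` (an encoding assumed here), the FSA is given as the regular expression `ra+A+R`, and the helper names are hypothetical. The check compares the FSA with membership in $L(d)$ on every well-balanced word up to a small length.

```python
import re
from itertools import product

FSA = re.compile(r"ra+A+R")           # r a a* a-bar a-bar* r-bar


def well_balanced(word):
    stack = []
    for ch in word:
        if ch.islower():
            stack.append(ch)          # opening tag
        elif not stack or stack.pop() != ch.lower():
            return False              # closing tag without matching opening tag
    return not stack


def in_language(word):
    n = (len(word) - 2) // 2
    return n >= 1 and word == "r" + "a" * n + "A" * n + "R"


def agree_on_balanced(max_len=8):
    """True if the FSA and L(d) agree on all well-balanced words up to max_len."""
    for length in range(max_len + 1):
        for letters in product("raAR", repeat=length):
            word = "".join(letters)
            if well_balanced(word) and bool(FSA.fullmatch(word)) != in_language(word):
                return False
    return True


print(agree_on_balanced())
print(bool(FSA.fullmatch("raaAR")), well_balanced("raaAR"))
```

The second line of output illustrates the difference from strong validation: the FSA accepts `raaAR`, which is not well-balanced, and that is allowed because such inputs are excluded by assumption.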
![Page 18: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/18.jpg)
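Validation - Example 2
DTD $d$:
$a \to (ab \mid ca \mid \varepsilon)$
$b \to \varepsilon$
$c \to \varepsilon$
$d$ is not recognizable. An FSA cannot store enough information to recall, when it reads $\bar{a}$, whether the corresponding node has a left sibling $c$ (in which case $b$ is not allowed to its right).
[Figure: example trees for $d$, built from nested $a\,b$ and $c\,a$ pairs of children.]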
![Page 19: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/19.jpg)
Characterizing Recognizable DTDs
Which DTDs are recognizable?
Non-recursive DTDs.
What about recursive DTDs? Not a trivial question.
Are there any necessary conditions for a DTD to be recognizable?
Are there any subclasses of DTDs for which the necessary conditions are also sufficient?
![Page 20: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/20.jpg)
Necessary Condition for a Recognizable DTD
Lemma 4.2: Let $d$ be a recognizable DTD. Then the following hold, where $u, v, w, \ldots$ are words over $\Sigma$ while $x, z$ (possibly subscripted) are individual symbols:
Let $k$ be a positive integer and $x_i, z_i$, $1 \le i \le k$, be mutually recursive symbols of $d$ (not necessarily distinct). If […], and […] for […], then […] must be in […].
![Page 21: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/21.jpg)
Fully Recursive DTDs
The necessary condition for recognizability stated in Lemma 4.2 is also sufficient when the DTD is fully recursive.
Next, we’ll see how to construct an FSA $A_d$ for a DTD $d$, which accepts all words in $L(d)$ (and possibly more).
For fully recursive DTDs satisfying the conditions of Lemma 4.2, the well-balanced words accepted by $A_d$ are precisely the words in $L(d)$ ($A_d$ may also accept words that are not well-balanced).
![Page 22: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/22.jpg)
The Standard FSA
Let $d$ be a DTD over alphabet $\Sigma$.
Consider the equivalence relation on $\Sigma$ whose equivalence classes are the strongly connected components of $G_d$.
Let $\le$ be a partial order on the classes, where $A \le B$ iff for some $a \in A$ and $b \in B$ there is an edge from $a$ to $b$ in $G_d$.
$\le$ may have several maximal classes, but only one minimum class.
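The classes and the edges between them can be computed from the rules with plain reachability. This is a sketch under the same assumptions as before (single lowercase-letter labels, rules as Python regular expressions, a rule for every label); the function name `classes_and_order` is hypothetical, and only the direct class-to-class edges are returned, not their transitive closure.

```python
import re


def classes_and_order(dtd):
    """Return the strongly connected components of G_d and the edges between them."""
    graph = {a: set(re.findall(r"[a-z]", rule)) for a, rule in dtd.items()}

    def reach(a):
        # Labels reachable from a, including a itself.
        seen, todo = {a}, [a]
        while todo:
            for y in graph[todo.pop()]:
                if y not in seen:
                    seen.add(y)
                    todo.append(y)
        return seen

    r = {a: reach(a) for a in graph}
    classes = {frozenset(b for b in graph if b in r[a] and a in r[b]) for a in graph}
    order = {(A, B) for A in classes for B in classes
             if A != B and any(b in graph[a] for a in A for b in B)}
    return classes, order


classes, order = classes_and_order({"r": "aa", "a": "a?"})
print(classes, order)
```

For the DTD $r \to aa$, $a \to a?$ this yields the two classes $\{r\}$ and $\{a\}$, with $\{r\}$ below $\{a\}$.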
![Page 23: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/23.jpg)
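Example
DTD $d$:
$r \to aa$
$a \to a?$
The classes are $\{r\}$ and $\{a\}$, with $\{r\} \le \{a\}$.
[Figure: $G_d$ with an edge from $r$ to $a$ and a self-loop on $a$, together with outlines of the automata $A_{\{r\}}$ and $A_{\{a\}}$.]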
![Page 24: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/24.jpg)
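Example – cont’
DTD $d$:
$r \to aa$
$a \to a?$
Constructing the FSA $A_a$ for $R_a$:
[Figure: $A_a$ with a state labeled $q_0^a, f_1^a$ and a state labeled $f_2^a$, connected by an $a$-transition.]
Constructing the FSA $A_A$ for the string representation of class $A = \{a\}$:
For each edge $(q, b, q')$ in $A_a$, add to $A_A$: $(q, b, q_0)$ and $(f, \bar{b}, q')$, where $q_0$ is the initial state and $f$ ranges over the final states of $A_b$.
[Figure: $A_A$ on the same states, with transitions labeled $a$ and $\bar{a}$.]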
![Page 25: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/25.jpg)
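Example – cont’
DTD $d$:
$r \to aa$
$a \to a?$
[Figure: the FSA $A_r$ for $R_r$, from state $q_0^r$ to state $f^r$, and the FSA $A_R$ for the class $R = \{r\}$, in which the $a$-transitions of $A_r$ lead through copies of $A_A$.]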
![Page 26: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/26.jpg)
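Example – cont’
DTD $d$:
$r \to aa$
$a \to a?$
[Figure: the FSA $A_d$, obtained from $A_R$ by adding a start state $s$ with an $r$-transition and a final state $g$ with an $\bar{r}$-transition.]
The above FSA recognizes all well-balanced words produced by the above DTD.
But it also accepts other well-balanced words (such as $r\,a\,a\,\bar{a}\,a\,\bar{a}\,\bar{a}\,\bar{r}$).
There is no automaton recognizing this DTD.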
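The behaviour of $A_d$ on this example can be reproduced with a small NFA simulation. The transition table below was transcribed by hand from one reading of the standard construction for $r \to aa$, $a \to a?$ (two disjoint copies of the class-$\{a\}$ automaton, one per $a$-transition of $A_r$). The state names and the encoding of $\bar{a}$, $\bar{r}$ as `A`, `R` are assumptions of this sketch, so treat it as an illustration and not as the slide's exact automaton.

```python
# Hand-transcribed transitions; "c1"/"c2" are the initial (and final) states of
# the two copies of the class-{a} automaton, "c1f2"/"c2f2" their other final states.
DELTA = {
    ("s", "r"): {"q0r"},
    ("q0r", "a"): {"c1"},
    ("c1", "a"): {"c1"},                # nested a re-enters the copy's initial state
    ("c1", "A"): {"c1f2", "q1"},        # a-bar: return inside the copy, or exit to A_r
    ("c1f2", "A"): {"c1f2", "q1"},
    ("q1", "a"): {"c2"},
    ("c2", "a"): {"c2"},
    ("c2", "A"): {"c2f2", "fr"},
    ("c2f2", "A"): {"c2f2", "fr"},
    ("fr", "R"): {"g"},
}


def accepts(word):
    """Standard subset simulation of the NFA, starting in s and accepting in g."""
    states = {"s"}
    for symbol in word:
        states = set().union(*(DELTA.get((q, symbol), set()) for q in states))
    return "g" in states


print(accepts("raaAAaAR"))   # in L(d): first a has one child, second has none
print(accepts("raaAaAAR"))   # well-balanced, r has a single child a with two children
```

The second word is well-balanced but violates both rules of the DTD. The automaton still accepts it because, without a stack, it cannot tell which pending $a$ a given $\bar{a}$ closes.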
![Page 27: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/27.jpg)
Recognizable Fully Recursive DTDs
Theorem 4.1: The following are equivalent for each fully recursive DTD $d$:
(i) $d$ is recognizable.
(ii) $d$ satisfies the conditions of Lemma 4.2.
(iii) The set of well-balanced strings accepted by the FSA $A_d$ is precisely $L(d)$.
![Page 28: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/28.jpg)
Recognizable DTDs
Which DTDs are recognizable?
Non-recursive DTDs.
Fully recursive DTDs satisfying the conditions of Lemma 4.2.
And others…
But a characterization in the general case remains an open question.
Partial progress: necessary conditions for recognizability.
![Page 29: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/29.jpg)
Alternative Validation Approaches
2 alternative approaches for validating against DTDs that are not recognizable:
Relaxing the constant memory requirement.
Refining the original DTD.
![Page 30: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/30.jpg)
Validation with Bounded Stack
Relaxing the constant memory requirement.
Use a stack whose depth is bounded by the depth of the XML document.
Validation is done in a single deterministic pass.
An appealing approach in practice.
For each DTD, there exists a deterministic PDA that accepts precisely its language.
Example: the DTD
$r \to aa$
$a \to a?$
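A stack-based validator for this example can be sketched as follows. Each stack frame holds an open element together with the current state of a small DFA for that element's rule, so the stack never grows beyond the document depth. The DFA tables are written by hand for this one DTD, and the names `DFAS`, `validate`, and `tokens_of`, as well as the uppercase encoding of closing tags, are assumptions of this sketch.

```python
# Hand-written DFAs for the two rules: R_r = aa and R_a = a?.
DFAS = {
    "r": {"start": 0, "accept": {2}, "delta": {(0, "a"): 1, (1, "a"): 2}},
    "a": {"start": 0, "accept": {0, 1}, "delta": {(0, "a"): 1}},
}


def validate(tokens, root="r"):
    """Single pass over (kind, label) tokens; stack depth <= document depth."""
    stack, seen_root = [], False
    for kind, label in tokens:
        if label not in DFAS:
            return False
        if kind == "open":
            if stack:
                parent, state = stack[-1]
                nxt = DFAS[parent]["delta"].get((state, label))
                if nxt is None:
                    return False            # child not allowed at this position
                stack[-1] = (parent, nxt)
            elif seen_root or label != root:
                return False                # wrong root, or a second root
            seen_root = True
            stack.append((label, DFAS[label]["start"]))
        else:
            if not stack:
                return False
            top, state = stack.pop()
            if top != label or state not in DFAS[top]["accept"]:
                return False                # mismatched tag or incomplete children
    return seen_root and not stack


def tokens_of(word):
    """Lowercase letter = opening tag, uppercase letter = closing tag."""
    return [("open", c) if c.islower() else ("close", c.lower()) for c in word]


print(validate(tokens_of("raaAAaAR")), validate(tokens_of("raaAaAAR")))
```

Unlike an FSA, this validator also rejects input that is not well-formed, since a closing tag is always compared with the element on top of the stack.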
![Page 31: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/31.jpg)
Refining the DTD
Refining a DTD means providing in the tags additional information that can be used for validation.
Example:
DTD:
$r \to aa$
$a \to a?$
Refined DTD:
$r \to a_1 a_2$
$a_1 \to a_1?$
$a_2 \to a_2?$
The refined DTD can be validated by an FSA.
For every DTD, there exists such an equivalent refined DTD of quadratic size, which is recognizable.
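For the refined DTD of this example, a regular expression over the tag sequence is enough once the input is known to be well-formed. In the sketch below the tags are written `a1`, `/a1`, and so on, and the names `REFINED` and `validate_refined` are hypothetical. As with any FSA-based validator, well-formedness is assumed and not checked.

```python
import re

# r, then a chain of a1 openings and closings, then the same for a2, then /r.
REFINED = re.compile(r"r( a1)+( /a1)+( a2)+( /a2)+ /r")


def validate_refined(tokens):
    """Check a well-formed tag sequence against the refined DTD."""
    return REFINED.fullmatch(" ".join(tokens)) is not None


print(validate_refined("r a1 a1 /a1 /a1 a2 /a2 /r".split()))
print(validate_refined("r a1 /a1 /r".split()))
```

The subscripts let the automaton tell the first child of $r$ from the second, which is exactly the information the unrefined tags do not carry.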
![Page 32: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/32.jpg)
Summary
First step towards the formal investigation of processing streaming XML.
Provided conditions under which validation can be done in a single pass and constant memory, using an FSA.
Considered alternative approaches, when validation using an FSA is not possible.
![Page 33: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/33.jpg)
Appendix
The Standard FSA Construction
![Page 34: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/34.jpg)
The Standard FSA
$A_d$ is inductively constructed starting from the maximal elements of $\le$.
Let $C$ be a maximal element of $\le$.
For each regular expression $R_c$ ($c \in C$), a non-deterministic FSA $A_c$ is built.
Disjoint states for different $c$’s.
The initial state of $A_c$ is $q_0^c$, while its final states are $f_1^c, f_2^c, \ldots$
![Page 35: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/35.jpg)
The Standard FSA – cont’
Build $A_C$ ($C$ is a maximal element of $\le$):
Its states are the union of the states of the FSAs $A_c$ for $c \in C$.
Transitions: for each transition $(q, b, q')$ of $A_c$ ($b$ must belong to $C$), add to $A_C$ the transitions:
$(q, b, q_0)$ for $q_0$ the initial state of $A_b$.
$(f, \bar{b}, q')$ for each final state $f$ of $A_b$.
![Page 36: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/36.jpg)
The Standard FSA – cont’
Build $A_C$ for non-maximal elements $C$ of $\le$, when all FSAs $A_E$ of elements $E$ such that $C \le E$ are already constructed:
Unlike the maximal elements case, $A_c$ has transitions $(q, e, q')$, where $e \in E$ (i.e., $e \notin C$).
For such transitions, we add to $A_C$:
A new disjoint copy of $A_E$.
$(q, e, q_0)$ for $q_0$ the initial state of $A_e$.
$(f, \bar{e}, q')$ for each final state $f$ of $A_e$.
![Page 37: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/37.jpg)
The Standard FSA – cont’
The final FSA $A_d$ is obtained by adding to the FSA $A_C$ of the minimum class $C$ (containing the root label $r$):
A new start state $s$ with transition $(s, r, q_0)$ for $q_0$ the start state of $A_r$.
A final state $g$ with transition $(f, \bar{r}, g)$ for each final state $f$ of $A_r$.
![Page 38: Validating Streaming XML Documents Luc Segoufin & Victor Vianu](https://reader036.vdocuments.us/reader036/viewer/2022081520/56814de7550346895dbb587c/html5/thumbnails/38.jpg)
The Standard FSA - Lemma
Lemma 4.3: For each DTD $d$, let $A_d$ be the automaton described. We have:
(i) Every word in $L(d)$ is accepted by $A_d$.
(ii) $A_d$ can be constructed from $d$ in exponential time.
Complexity of $A_d$’s construction: $O(|d|^{\|d\|})$.
$|d|$ is the maximum size of an FSA for a regular expression of $d$.
$\|d\|$ is the depth of the partial order $\le$.