A syntax for Data
by Jose Carlos Cabrera Zuniga
Preface
This presentation introduces the relation between semistructured data and XML. It first presents the concept of semistructured data and then shows how XML can be used to represent this kind of data.
Semistructured Data
Semistructured data is often described as schemaless or self-describing, terms that indicate that there is no separate description of the type or structure of the data.
{ name: {first: "Alan", last: "Black"}, tel: 2157786, email: "[email protected]" }
[Figure: tree for the expression above. The root has edges name, tel, and email; the name node has edges first and last leading to the leaves "Alan" and "Black", and tel leads to the leaf 2157786.]
{ person:
    {name: "Alan", tel: 2157786, email: "[email protected]"},
  person:
    {name: "Sara", tel: 2136877, email: "[email protected]"},
  person:
    {name: "Fred", tel: 2157786, email: "[email protected]"}
}
One of the main strengths of semistructured data is its ability to accommodate variations in structure:
{ person:
    {name: "Alan", tel: 2157786, email: "[email protected]"},
  person:
    {name: {first: "Sara", last: "Green"}, tel: 2136877, email: "[email protected]"},
  person:
    {name: "Fred", tel: 2157786, Height: 183}
}
In semistructured data, we make the conscious choice of forgetting any type the data might have had, and we serialize it by annotating each data item explicitly with its description (such as name, tel, etc.). Such data is called self-describing.
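The self-describing serialization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list-of-pairs representation and the function name to_ssd are hypothetical choices, not part of the slides, and the email fields are left out.

```python
from typing import Union

# Assumed representation: an atomic value is a number or a string, and a
# complex value is a list of (label, value) pairs. A list is used instead of
# a dict because the same label (for example "person") may occur many times.
Value = Union[int, float, str, list]


def to_ssd(value: Value) -> str:
    """Serialize a value in the brace syntax, writing each label explicitly."""
    if isinstance(value, list):
        parts = [f"{label}: {to_ssd(child)}" for label, child in value]
        return "{" + ", ".join(parts) + "}"
    if isinstance(value, str):
        return '"' + value + '"'
    return str(value)


# Three person entries with different structure, as on the slide above.
people = [
    ("person", [("name", "Alan"), ("tel", 2157786)]),
    ("person", [("name", [("first", "Sara"), ("last", "Green")]), ("tel", 2136877)]),
    ("person", [("name", "Fred"), ("tel", 2157786), ("Height", 183)]),
]

print(to_ssd(people))
```

No schema is consulted anywhere: every item carries its own label, which is why entries with different shapes can sit side by side.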
Base Types:
• Numbers start with a digit.
• Strings start with a quotation mark (").
• There are many other types with defined textual encodings, such as date, time, or wav, that we would like to include. Each one needs a notation, although in many cases an existing notation can be reused rather than re-invented.
REPRESENTING RELATIONAL DATABASES
A relational database is normally described by a schema such as
r1(a,b,c) r2(c,d)
where r1 and r2 are the names of the relations, and a, b, c and c, d are the column names of the two relations.
{ r1: { row: {a: a1, b: b1, c: c1},
        row: {a: a2, b: b2, c: c2} },
  r2: { row: {c: c2, d: d2},
        row: {c: c3, d: d3},
        row: {c: c4, d: d4} } }

r1(a,b,c)
a   b   c
a1  b1  c1
a2  b2  c2

r2(c,d)
c   d
c2  d2
c3  d3
c4  d4
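As a rough sketch, the encoding of relations as rows of labeled values could be produced as follows. The function names and the (columns, rows) input shape are assumptions made for this example only.

```python
def relation_to_ssd(columns, rows):
    """Encode one relation as a list of ("row", [(column, value), ...]) entries."""
    return [("row", list(zip(columns, row))) for row in rows]


def database_to_ssd(relations):
    """relations maps each relation name to a (columns, rows) pair."""
    return [
        (name, relation_to_ssd(columns, rows))
        for name, (columns, rows) in relations.items()
    ]


# The two relations r1(a,b,c) and r2(c,d) from the slide.
db = {
    "r1": (("a", "b", "c"), [("a1", "b1", "c1"), ("a2", "b2", "c2")]),
    "r2": (("c", "d"), [("c2", "d2"), ("c3", "d3"), ("c4", "d4")]),
}

tree = database_to_ssd(db)
```

Every row repeats its column names, so the schema is no longer needed to interpret the data.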
[Figure: One representation of a relational database. The expression above drawn as a tree: the root has edges r1 and r2, each relation node has one row edge per tuple, and each row node has edges labeled with the column names (a, b, c or c, d) leading to the values.]
[Figure: Another representation of the same relational database, drawn as a tree that uses the labels r1, r2, row, and the column names.]
Representing Object Databases
Modern database applications handle objects, either through an object-relational or an object database. Such data can be represented as semistructured data, too.
Example. Three persons: Mary, who has two children, John and Jane.
{ person: &o1 { name: "Mary", age: 45, child: &o2, child: &o3 },
  person: &o2 { name: "John", age: 17, relatives: { mother: &o1, sister: &o3 } },
  person: &o3 { name: "Jane", country: "Canada", mother: &o1 } }
[Figure: graph for the example above. Three person edges lead from the root to the nodes &o1, &o2, and &o3; child edges lead from &o1 to &o2 and &o3, and mother and sister edges lead back to &o1 and &o3. The leaves hold the values "Mary", 45, "John", 17, "Jane", and "Canada".]
The presence of a node label such as &o1 before a structure binds &o1 to the identity of that structure.
The names &o1, &o2, &o3 are called object identities, or oids.
At this point, the data is no longer a tree but a graph, in which each node has a unique oid.
An oid can be used to access a collection of data both logically and physically.
In our simple syntax for semistructured data, we allow both nodes with explicit oids and nodes without oids: the system assigns a unique oid automatically to every node written without one when the data is parsed. Thus {a:&o1{b:&o2 5}} and {a:{b:5}} denote isomorphic graphs, as does {a:&o1 {b:5}}.
What happens with the following expression?
{a: {b:3}, a: {b:3}}
SPECIFICATION OF SYNTAX
Let us call any semistructured data expression an ssd-expression.
<ssd-expr> ::= <value> | oid <value> | oid
<value> ::= atomicvalue | <complexvalue>
<complexvalue> ::= {label: <ssd-expr>, … , label: <ssd-expr>}
atomicvalue: any number or string of characters
oid: an identifier such as &123
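A small recursive-descent parser makes the grammar concrete. The sketch below is one possible reading of it, not a reference implementation: the Node class, the token patterns (straight double quotes for strings, labels made of letters, digits, underscores, and hyphens), and the choice to store a bare oid as a Node whose value is None are all assumptions of this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    oid: Optional[str]  # "&o1", or None when no oid was written
    value: object       # atomic value, list of (label, Node), or None for a bare oid


TOKEN = re.compile(
    r'\s*(?:(?P<punct>[{}:,])|(?P<oid>&\w+)|(?P<num>\d+(?:\.\d+)?)'
    r'|"(?P<str>[^"]*)"|(?P<label>[A-Za-z_][\w-]*))'
)


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse_expr(tokens, i):
    # <ssd-expr> ::= <value> | oid <value> | oid
    oid = None
    if i < len(tokens) and tokens[i][0] == "oid":
        oid, i = tokens[i][1], i + 1
        # A bare oid is followed by ',' or '}' or the end of the input.
        if i == len(tokens) or tokens[i] in (("punct", ","), ("punct", "}")):
            return Node(oid, None), i
    value, i = parse_value(tokens, i)
    return Node(oid, value), i


def parse_value(tokens, i):
    # <value> ::= atomicvalue | {label: <ssd-expr>, ..., label: <ssd-expr>}
    kind, text = tokens[i] if i < len(tokens) else ("end", "")
    if kind == "num":
        return (float(text) if "." in text else int(text)), i + 1
    if kind == "str":
        return text, i + 1
    if (kind, text) != ("punct", "{"):
        raise ValueError(f"unexpected token {text!r}")
    i += 1
    members = []
    while True:
        if i + 1 >= len(tokens) or tokens[i][0] != "label" or tokens[i + 1] != ("punct", ":"):
            raise ValueError("expected 'label:'")
        label = tokens[i][1]
        child, i = parse_expr(tokens, i + 2)
        members.append((label, child))
        if i < len(tokens) and tokens[i] == ("punct", ","):
            i += 1
        elif i < len(tokens) and tokens[i] == ("punct", "}"):
            return members, i + 1
        else:
            raise ValueError("expected ',' or '}'")


def parse(text):
    tokens = tokenize(text)
    node, end = parse_expr(tokens, 0)
    if end != len(tokens):
        raise ValueError("unexpected input after the expression")
    return node
```

Following the grammar literally, this sketch rejects an empty pair of braces and a trailing comma before a closing brace.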
Definition. We say that an object identifier o is defined in an ssd-expression s if either s is of the form o v for some value v or s is of the form {l1:e1, … , ln:en} and o is defined in one of the e1, … , en. If it occurs in any other way in s, we say it is used in s.
Definition. (Consistency) For an ssd-expression s to be consistent it must satisfy the following properties:
• Any object identifier is defined at most once in s.
• If an object identifier o is used in s, it must be defined in s.
Note. This definition must be extended if it is necessary to consider external resources and external oids.
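The two consistency conditions can be checked mechanically. The following sketch assumes a hypothetical tuple encoding chosen for this example: an ssd-expression is a pair (oid, value), where oid may be None and value is an atomic value, a list of (label, expression) pairs, or None for a bare oid. External oids are not handled.

```python
from collections import Counter


def oid_occurrences(expr, defined=None, used=None):
    """Count how often each oid is defined and collect the oids that are used."""
    if defined is None:
        defined, used = Counter(), set()
    oid, value = expr
    if oid is not None:
        if value is None:
            used.add(oid)        # bare oid: the identifier is used
        else:
            defined[oid] += 1    # oid followed by a value: the identifier is defined
    if isinstance(value, list):
        for _label, child in value:
            oid_occurrences(child, defined, used)
    return defined, used


def is_consistent(expr):
    defined, used = oid_occurrences(expr)
    defined_at_most_once = all(count == 1 for count in defined.values())
    every_use_is_defined = used <= set(defined)
    return defined_at_most_once and every_use_is_defined


# The Mary, John, and Jane example, reduced to names and references.
family = (None, [
    ("person", ("&o1", [("name", (None, "Mary")),
                        ("child", ("&o2", None)), ("child", ("&o3", None))])),
    ("person", ("&o2", [("name", (None, "John")), ("mother", ("&o1", None))])),
    ("person", ("&o3", [("name", (None, "Jane")), ("mother", ("&o1", None))])),
])

dangling = (None, [("a", ("&o9", None))])                      # &o9 is used but never defined
duplicate = (None, [("a", ("&o1", 5)), ("b", ("&o1", 6))])     # &o1 is defined twice
```

Here is_consistent(family) holds, while the other two expressions each violate one of the conditions.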
THE OBJECT EXCHANGE MODEL (OEM)
An OEM object is a quadruple
(label, oid, type, value)
where label is a character string, oid is the object's identifier, and type is either complex or some identifier denoting an atomic type (like integer, string, gif-image, etc.). When type is complex, the object is called a complex object, and value is a set (or list) of oids. Otherwise the object is an atomic object, and value is an atomic value of that type.
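One way to picture the quadruple is as a small record type. The class name OEMObject, the helper children, and the oids &o4 and &o5 below are made up for this illustration; only the four fields come from the definition.

```python
from typing import List, NamedTuple, Union


class OEMObject(NamedTuple):
    label: str
    oid: str
    type: str                          # "complex" or an atomic type name such as "integer"
    value: Union[int, str, List[str]]  # a list of oids when type is "complex"

    def is_complex(self) -> bool:
        return self.type == "complex"


def children(obj, by_oid):
    """Look up the objects that a complex object refers to by oid."""
    return [by_oid[oid] for oid in obj.value] if obj.is_complex() else []


objects = [
    OEMObject("person", "&o1", "complex", ["&o4", "&o5"]),
    OEMObject("name", "&o4", "string", "Mary"),
    OEMObject("age", "&o5", "integer", 45),
]
by_oid = {obj.oid: obj for obj in objects}

labels = [child.label for child in children(by_oid["&o1"], by_oid)]
```

Note that each label is stored on the object itself, not on the link that leads to it.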
Thus OEM data is essentially a graph, like the semistructured data described in this section, but in which labels are attached to nodes rather than edges.
Definition. A graph (N, E) consists of a set N of nodes and a set E of edges. Associated with each edge e in E there is an (ordered) pair of nodes, the source node s(e) and the target node t(e).
[Figure: an edge e drawn as an arrow from its source s(e) to its target t(e).]
Definition. A path is a sequence e_1, … , e_k of edges such that t(e_i) = s(e_{i+1}) for 1 <= i <= k - 1. Such a path is called a path from the source s(e_1) of e_1 to the target t(e_k) of e_k. The number of edges in this path, k, is its length.
[Figure: a path drawn as a chain of edges from s(e_1) through t(e_1), t(e_2), and so on to t(e_k).]
Definition. A node r is a root for a graph (N, E) if there is a path from r to n for every n in N, n <> r.
Definition. A cycle in a graph is a path between a node and itself. A graph with no cycles is called acyclic.
Definition. A rooted graph with root r is a tree if there is a unique path from r to n for every n in N, n <> r.
Definition. A node is a terminal node, or leaf, if it is not the source of any edge in E.
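The definitions of root, cycle, tree, and leaf translate into short checks. In this sketch a graph is a set of node names plus a list of (source, target) pairs, edge labels are ignored, and all function names are hypothetical. The tree test uses in-degrees as a shortcut for counting paths; this is an assumption of the sketch, and it should agree with the definition for any graph with more than one node.

```python
from collections import Counter


def reachable(edges, start):
    """Nodes that can be reached from start by a path of length at least 1."""
    out = {}
    for source, target in edges:
        out.setdefault(source, []).append(target)
    seen, stack = set(), list(out.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(out.get(node, []))
    return seen


def is_root(nodes, edges, r):
    reach = reachable(edges, r)
    return all(n in reach for n in nodes if n != r)


def has_cycle(nodes, edges):
    # A cycle is a path between a node and itself.
    return any(n in reachable(edges, n) for n in nodes)


def leaves(nodes, edges):
    sources = {source for source, _target in edges}
    return {n for n in nodes if n not in sources}


def is_tree(nodes, edges, r):
    # Shortcut: r is a root, no edge enters r, and exactly one edge enters
    # every other node, so there is only one way to reach each node from r.
    indegree = Counter(target for _source, target in edges)
    return (is_root(nodes, edges, r) and indegree[r] == 0
            and all(indegree[n] == 1 for n in nodes if n != r))


# A small tree, and the person graph from the object database example
# (object nodes only, atomic values left out).
tree_nodes = {"root", "name", "tel"}
tree_edges = [("root", "name"), ("root", "tel")]

family_nodes = {"root", "o1", "o2", "o3"}
family_edges = [("root", "o1"), ("root", "o2"), ("root", "o3"),
                ("o1", "o2"), ("o1", "o3"), ("o2", "o1"), ("o2", "o3"), ("o3", "o1")]
```

The family graph is rooted but contains cycles, so it is a graph and not a tree.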
The model of semistructured data followed here is that of an edge-labeled graph.
XML and Semistructured Data
{ person: { name: {first: "Alan", last: "Black"}, tel: 2157786, email: "[email protected]" } }
<person>
  <name> <first> Alan </first> <last> Black </last> </name>
  <tel> 2157786 </tel>
  <email> [email protected] </email>
</person>
For trees, let T be a translation function such that
T(atomicvalue) = atomicvalue
T({l1: v1, … , ln: vn}) = <l1> T(v1) </l1> … <ln> T(vn) </ln>
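The translation T can be written as a short recursive function. This is a sketch under assumptions: trees are encoded as lists of (label, value) pairs, the function name to_xml is hypothetical, and atomic values are passed through an XML escaping step that the slide does not mention.

```python
from xml.sax.saxutils import escape


def to_xml(value):
    """T(atomic) = atomic; T({l1: v1, ..., ln: vn}) = <l1>T(v1)</l1> ... <ln>T(vn)</ln>."""
    if isinstance(value, list):
        return "".join(
            f"<{label}>{to_xml(child)}</{label}>" for label, child in value
        )
    return escape(str(value))


person = [("person", [("name", [("first", "Alan"), ("last", "Black")]),
                      ("tel", 2157786)])]

print(to_xml(person))
```

Each label becomes a matching pair of tags, so the nesting of the tree is mirrored by the nesting of the elements.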
[Figure: the semistructured-data tree for person, with edge labels name, tel, and email, next to the tree of the corresponding XML document.]
For graphs:
<state id="s2"> <scode> NE </scode> <sname> Nevada </sname> </state>
<city id="c2"> <ccode> CCN </ccode> <cname> Carson City </cname> <state-of idref="s2"/> </city>
Observe that <state-of> is an empty element; its only purpose is for reference.
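Following such a reference can be sketched with Python's standard xml.etree.ElementTree module. The wrapper element geo and the helper follow are assumptions added for this example, since a document needs a single root. Without a DTD the parser treats id and idref as ordinary attributes, so the lookup table is built by hand.

```python
import xml.etree.ElementTree as ET

DOC = """
<geo>
  <state id="s2"><scode>NE</scode><sname>Nevada</sname></state>
  <city id="c2"><ccode>CCN</ccode><cname>Carson City</cname><state-of idref="s2"/></city>
</geo>
"""

root = ET.fromstring(DOC)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def follow(element, by_id):
    """Return the element that the idref attribute of element points to."""
    return by_id[element.get("idref")]


city = by_id["c2"]
state = follow(city.find("state-of"), by_id)
print(state.findtext("sname"))
```

The reference turns the element tree into a graph: the city element reaches the state element even though it does not contain it.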
[Figure: a graph whose root has two edges labeled a. The first a node has an edge b and the second an edge c, and both edges lead to the same leaf, some string.]
The ssd-expressions for this graph are:
a: { b: some string }
a: { c: some string }
The same graph can be written in XML in two ways. If the attribute c is a reference attribute:
<a> <b id="&o123"> some string </b> </a>
<a c="&o123"/>
Alternatively, assuming that b is the reference attribute:
<a b="&o123"/>
<a> <c id="&o123"> some string </c> </a>
In each case, the a element that carries the reference attribute is an empty element.
ORDER
The semistructured data model described is based on unordered collections, while XML is ordered. For example the following two pieces of semistructured data are equivalent:
person: {firstname: "John", lastname: "Smith"}
person: {lastname: "Smith", firstname: "John"}
The following two XML documents, however, are not equivalent:
<person> <firstname> John </firstname> <lastname> Smith </lastname></person>
<person> <lastname> Smith </lastname> <firstname> John </firstname> </person>
To make things worse, attributes are NOT ORDERED in XML. For example, the following two elements are equivalent:
<person firstname="john" lastname="Smith"/>
<person lastname="Smith" firstname="john"/>
Applications that use XML for data exchange are likely to ignore order.
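The difference between ordered subelements and unordered attributes can be observed with Python's standard xml.etree.ElementTree module. The helper names ordered and unordered are made up for this sketch; sorting the children is just one possible way for an application to ignore order.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<person><firstname>John</firstname><lastname>Smith</lastname></person>")
b = ET.fromstring("<person><lastname>Smith</lastname><firstname>John</firstname></person>")


def ordered(element):
    """Children as a sequence: the document order matters."""
    return [(child.tag, child.text) for child in element]


def unordered(element):
    """Children as a sorted list: the view of an application that ignores order."""
    return sorted(ordered(element))


same_as_documents = ordered(a) == ordered(b)
same_as_data = unordered(a) == unordered(b)

# Attributes are exposed as a dictionary, so their order plays no role.
x = ET.fromstring('<person firstname="john" lastname="Smith"/>')
y = ET.fromstring('<person lastname="Smith" firstname="john"/>')
same_attributes = x.attrib == y.attrib
```

Here same_as_documents is False, while same_as_data and same_attributes are both True.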
MIXING ELEMENTS AND TEXT
XML allows us to mix PCDATA and subelements within an element:
<Person>
  This is my best friend
  <Name> Alessandreia </Name>
  <Age> 25 </Age>
  I am not too sure of the following email
  <Email> [email protected] </Email>
</Person>
In order to translate XML back into the syntax of ssd-expressions, it is necessary to add a standard surrounding tag for the PCDATA.
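A minimal sketch of this translation, using Python's standard xml.etree.ElementTree module, might look as follows. The label text for the surrounding tag and the function name to_ssd_tree are assumptions of this example, and the Email element is left out of the sample document because its content is elided in the source.

```python
import xml.etree.ElementTree as ET

TEXT_LABEL = "text"  # hypothetical name for the standard surrounding tag


def to_ssd_tree(element):
    """Translate an element into an atomic value or a list of (label, value) pairs."""
    children = list(element)
    text = (element.text or "").strip()
    if not children:
        return text  # an element holding only PCDATA becomes an atomic value
    members = []
    if text:
        members.append((TEXT_LABEL, text))
    for child in children:
        members.append((child.tag, to_ssd_tree(child)))
        tail = (child.tail or "").strip()
        if tail:
            # PCDATA that follows a subelement is wrapped in the same label.
            members.append((TEXT_LABEL, tail))
    return members


doc = ET.fromstring(
    "<Person>This is my best friend<Name>Alessandreia</Name>"
    "<Age>25</Age>I am not too sure of the following email</Person>"
)
tree = to_ssd_tree(doc)
```

Every run of text between subelements receives its own text entry, so the mixed content becomes an ordinary list of labeled values.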
XML END