2005 dbi 1 xml extensible markup language part 2
TRANSCRIPT
2005 http://www.cs.huji.ac.il/~dbi 3
XML Entities should not be Confused with Entities in the Sense of the ER Model
• An entity is a short string that denotes more complex information, which may reside inside or outside the XML document or its DTD
• Entities save typing
• Entities facilitate easy changes (when the same change is likely to be made in many places)
• Sometimes entities must be used to circumvent XML syntax violations
• Applications should decode and encode entities, using their definitions
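A minimal sketch of the encode/decode step, assuming Python's standard library (the sample string and variable names are made up for illustration). The characters "<" and "&" would violate XML syntax inside text, so they are encoded with the predefined entities before the text is embedded, and the parser decodes them again.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Hypothetical text containing characters that are illegal inside XML content.
raw = "budget < $2M & rating > 3"

# Encode: replace &, < and > by the predefined entities &amp; &lt; &gt;
encoded = escape(raw)

# Embed the encoded text in a small document and let the parser decode it.
document = f"<note>{encoded}</note>"
decoded_by_parser = ET.fromstring(document).text

print(encoded)
print(decoded_by_parser)
print(unescape(encoded))
```

Embedding the raw string directly would make the document not well formed, which is the "syntax violation" case mentioned above.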
General entities
• A general entity is defined in the DTD
<!ENTITY Name "EntityDefinition">
• And it is used in the document by writing
&Name;
Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title,director,cast?,budget)>
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title> Oh God!</title>
    <director> Woody Allen </director>
    <budget> $2M </budget>
  </movie>
</mdb>
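The following sketch shows how an application sees the general entity after parsing. It assumes Python's standard xml.etree.ElementTree module, which does not validate but does expand internal general entities declared in the internal DTD subset. The document is a shortened version of the example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the movie document; only the entity declaration
# is kept in the internal DTD subset.
DOC = """<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title>Oh God!</title>
  </movie>
</mdb>"""

root = ET.fromstring(DOC)
movie = root.find("movie")

# The parser has already replaced &bm; by its definition.
opinion = movie.get("opinion")
print(opinion)
```

The application never sees the reference &bm; itself, only the replacement text.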
Parameter Entities
• Parameter entities are used only within DTDs
• Internal entities are references within the DTD
• External entities are references that draw information from outside files
• Parameter Entity declaration:
<!ENTITY % Name "EntityDefinition">
An Example of a Parameter Entity
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % essential "name, tel*">
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT person (%essential;, email, advisor?)>
<!ATTLIST person friend (yes | no) #IMPLIED
                 id ID #REQUIRED
                 knows IDREFS #IMPLIED>
<!ELEMENT advisor (person)>
<!ELEMENT addresses (person)*>
Unparsed Entities
<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "c:\Program Files\Netscape\Communicator\Program\Netscape.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title,director, budget)>
  <!ATTLIST movie id ID #REQUIRED
                  opinion CDATA #IMPLIED
                  starimage ENTITY #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
]>
Entities are defined
Types are defined
Data
<mdb>
<movie id="ohgod" opinion="&bm;" starimage="starpicture">
<title> Oh God!</title>
<director> Woody Allen </director>
<budget> $2M </budget>
</movie>
</mdb>
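An unparsed entity is not expanded by the parser; the application has to look up its declaration. The sketch below assumes Python's standard xml.parsers.expat module and a shortened DTD (the notation's system identifier "viewer.exe" and the handler names are placeholders). It records each unparsed entity and resolves the starimage attribute to the declared URL and notation.

```python
import xml.parsers.expat

# Shortened document: one notation, one unparsed entity, one ENTITY attribute.
DOC = """<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "viewer.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie EMPTY>
  <!ATTLIST movie starimage ENTITY #IMPLIED>
]>
<mdb><movie starimage="starpicture"/></mdb>"""

unparsed = {}   # entity name -> (system identifier, notation name)
images = []     # resolved values of starimage attributes

def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    # Only unparsed entities carry a notation name (NDATA ...).
    if notation is not None:
        unparsed[name] = (system_id, notation)

def on_start(tag, attrs):
    # The attribute value is an entity name, not a reference like &name;
    if "starimage" in attrs:
        images.append(unparsed[attrs["starimage"]])

parser = xml.parsers.expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.StartElementHandler = on_start
parser.Parse(DOC, True)

print(images)
```

What to do with the resolved URL and notation (for example, launching a viewer) is left entirely to the application.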
Defining Entities
• Entities can be defined
– in the local document as part of the DOCTYPE definition
– with a link to external files that contain the entity data (this, too, is done through the DOCTYPE definition)
– in an external DTD
• Define locally when the entity is being used only in one particular document
• Define by a link to an external file when the entity is being used in many documents
Defining Entities – An Example
• Local Definition:
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact [email protected].">
]>
• Global Definition:
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright SYSTEM "http://www.worldspins.com/legal/copyright.xml">
]>
Another Example
<?xml version="1.0"?>
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact [email protected].">
  <!ENTITY trademark SYSTEM "http://www.worldspins.com/legal/trademark.xml">
]>
Example (cont’d)
<PRESSRELEASE>
  <HEAD>Mini-globe revolutionizes keychain industry</HEAD>
  <LEAD>Today As The World Spins introduces a new approach to keychains. With the new MINI-GLOBE keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one.</LEAD>
  <LEGAL>&trademark;&copyright;</LEGAL>
</PRESSRELEASE>
XML Namespaces
• When an element name appears in two different XML documents, we would like to know that it has the same meaning in both documents
– Is the tag <title> used as the <title> XHTML tag in both documents?
– If two documents about books have the tag <number>, does it mean that they use the same system for cataloging books?
What XML Namespaces are and What They are not
• Namespaces merely provide a mechanism for creating unique names (for elements and attributes) that can be used in XML documents all over the Web
– A namespace is just a collection of names that were created for a specific domain of applications
• Namespaces are not DTDs and they do not provide a mechanism for validation of XML documents using multiple DTDs
Identifying an XML Namespace
• A namespace is identified by a URI
• The URI does not have to point to anything
– It is merely used as a mechanism for creating unique names
• An element or attribute name from a namespace has two parts
prefix:name
– prefix identifies the namespace
– name is just a name from the namespace
Namespaces are not Part of the XML 1.0 Recommendation
• When an XML 1.0 parser sees a qualified name prefix:name, the parser treats this name just as it would treat any other attribute or element name (it is legal to use the character “:” in element and attribute names)
• Namespaces must be hardwired into DTDs
But
• When an application sees a qualified name, it may recognize it and act accordingly
– A browser identifies tags that belong to the XHTML namespace and processes them
– An XSLT processor identifies tags and attributes that belong to the XSLT namespace and executes them
The W3C Recommendation for Namespaces in XML
• The two-part naming system is the only thing defined in the W3C Namespace recommendation
– and even that is not so short!
• This recommendation is just a collection of syntactic rules
– Some rules are rather subtle
Declaring a Namespace
• An XML namespace is declared in the xmlns attribute
<foo:book xmlns:foo="http://www.foo.org/">
  <foo:title> XML Namespaces </foo:title>
  <foo:author> John Doe </foo:author>
</foo:book>
Using foo as the prefix, instead of using the URI, is more convenient
The Default Namespace
• The default namespace is declared without a prefix
<book xmlns="http://www.foo.org/">
  <title> XML Namespaces </title>
  <author> John Doe </author>
</book>
All the elements belong to the default namespace
Technically
• The namespace mechanism is just a mapping from prefixes to URIs, e.g.,
– <foo:title> is replaced with <{http://www.foo.org/}title>
• It is done in a processing layer that operates on the element tree resulting from XML 1.0 parsing
• It creates unique names
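The prefix-to-URI mapping can be observed with a namespace-aware parser. The sketch below assumes Python's standard xml.etree.ElementTree, which happens to report names in the same {URI}name form used above. It parses the prefixed and the default-namespace variants of the book example (shortened to the title).

```python
import xml.etree.ElementTree as ET

# The same book, once with an explicit prefix and once with a default namespace.
PREFIXED = ('<foo:book xmlns:foo="http://www.foo.org/">'
            '<foo:title>XML Namespaces</foo:title></foo:book>')
DEFAULT = ('<book xmlns="http://www.foo.org/">'
           '<title>XML Namespaces</title></book>')

# After parsing, the prefix is gone; each tag is reported as {URI}name.
prefixed_tags = [el.tag for el in ET.fromstring(PREFIXED).iter()]
default_tags = [el.tag for el in ET.fromstring(DEFAULT).iter()]

print(prefixed_tags)
print(default_tags)
```

Both documents yield the same expanded names, so the choice of prefix (or of a default namespace) does not matter to the application.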
DTDs as Namespaces
• The URI of a namespace may point to a DTD
• A DTD defines a namespace comprising all its element names and attribute names
– But it is just a namespace – not a DTD!
Example
<bib:book xmlns:bib="http://www.acm.org/bibliography.dtd"
          xmlns:isbn="http://www.isbn-org.org/def.dtd">
  <bib:title> Proceedings of SIGMOD </bib:title>
  <bib:number> 472010 </bib:number>
  <isbn:number> 1-58113-332-4 </isbn:number>
</bib:book>
This document is invalid according to either DTD!
But the document is well formed! (e.g., in the book element, attribute names are unique)
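A short sketch of how an application can tell the two number elements apart, assuming Python's standard xml.etree.ElementTree (which checks well-formedness only, not validity). The prefixes in the NS dictionary are local to the code and need not match those in the document.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used only by the queries below.
NS = {
    "bib": "http://www.acm.org/bibliography.dtd",
    "isbn": "http://www.isbn-org.org/def.dtd",
}

DOC = """<bib:book xmlns:bib="http://www.acm.org/bibliography.dtd"
          xmlns:isbn="http://www.isbn-org.org/def.dtd">
  <bib:title>Proceedings of SIGMOD</bib:title>
  <bib:number>472010</bib:number>
  <isbn:number>1-58113-332-4</isbn:number>
</bib:book>"""

book = ET.fromstring(DOC)  # succeeds: the document is well formed

# Same local name "number", different namespaces, different meaning.
catalog_number = book.findtext("bib:number", namespaces=NS)
isbn_number = book.findtext("isbn:number", namespaces=NS)

print(catalog_number, isbn_number)
```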
Alternatively, One Namespace can be Declared as the Default
<book xmlns="http://www.acm.org/bibliography.dtd"
      xmlns:isbn="http://www.isbn-org.org/def.dtd">
  <title> Proceedings of SIGMOD </title>
  <number> 472010 </number>
  <isbn:number> 1-58113-332-4 </isbn:number>
</book>
This document is well formed but invalid according to either DTD!
Scope of Namespaces
• The scope of a namespace declaration is the element containing the declaration and all descendant elements
– Must use the prefix anywhere in the scope
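• In Namespaces in XML 1.0, only the default namespace can be undeclared (with xmlns=""); a prefix can be redeclared in an inner scope to map to a different URI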
• More than one namespace can be declared in the same scope
– At most one can be the default namespace
– All others must have unique prefixes
What about Attributes?
• Recall that element names and attribute names must be qualified if they belong to a nondefault namespace
• Unqualified element names belong to the default namespace (if they are inside the scope)
• However, an unqualified attribute does not belong to the default namespace
• An unqualified attribute is processed according to the rules that apply to its element name
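The difference between unqualified elements and unqualified attributes can be checked with a namespace-aware parser. The sketch assumes Python's standard xml.etree.ElementTree; the attributes lang and isbn:number are invented for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical attributes: "lang" is unqualified, "isbn:number" is qualified.
DOC = ('<book xmlns="http://www.foo.org/" '
       'xmlns:isbn="http://www.isbn-org.org/def.dtd" '
       'lang="en" isbn:number="1-58113-332-4">'
       '<title>XML Namespaces</title></book>')

book = ET.fromstring(DOC)

# The unqualified element name picks up the default namespace ...
print(book.tag)

# ... but the unqualified attribute does not; only the prefixed one is expanded.
print(list(book.attrib))
```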
Namespaces and DTDs: The Problem
• DTD syntax does not support namespaces
• The previous example showed an XML document with two DTDs that were used as namespaces
– It is impossible to declare constraints that specify where fragments from each namespace can occur
Namespaces and DTDs: The Solutions
• Use a namespace-aware schema language, or
• Modify one of the two DTDs so that it will be a DTD for the new document
– Two alternatives, as illustrated on the next two slides, using the previous example
One Alternative
• Add the required new elements to the DTD
• Give the appropriate unique names to these elements using parameter entities
<!ENTITY % isbn "{http://www.isbn-org.org/def.dtd}">
<!ENTITY % number "%isbn;number">
<!ELEMENT book (title,author,number, %number;)>
The Second Alternative
• Add the required new elements to the DTD, using qualified names
• Use the attribute-list declaration for the new elements to declare the namespace as a fixed value
<!ATTLIST isbn:number xmlns:isbn CDATA #FIXED "http://www.isbn-org.org/def.dtd">
<!ELEMENT book (title,author,number, isbn:number)>
Exchanging Relational Data
• Each tuple can be wrapped inside an element
• See example on the following slides
Two Ways of Wrapping Relations in XML Documents
projects:
title budget managedBy
employees:
name ssn age
The Project and Employee Relations in XML
<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
  :
</db>
Projects and employees are intermixed
<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
    :
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
    :
  </employees>
</db>
Employees follow projects
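The grouped layout (employees follow projects) can be produced mechanically from the two relations. The sketch below assumes Python's standard xml.etree.ElementTree; the helper name wrap and the in-memory tuple lists are illustrative, not part of the course material.

```python
import xml.etree.ElementTree as ET

# The two relations, with the tuples that appear in the examples.
projects = [("Pattern recognition", 10000, "Joe"),
            ("Auto guided vehicles", 70000, "Sandra")]
employees = [("Joe", 344556, 34), ("Sandra", 2234, 35)]

def wrap(parent, tag, columns, rows):
    """Wrap each tuple in a <tag> element with one subelement per column."""
    for row in rows:
        element = ET.SubElement(parent, tag)
        for column, value in zip(columns, row):
            ET.SubElement(element, column).text = str(value)

db = ET.Element("db")
wrap(ET.SubElement(db, "projects"), "project",
     ("title", "budget", "managedBy"), projects)
wrap(ET.SubElement(db, "employees"), "employee",
     ("name", "ssn", "age"), employees)

xml_text = ET.tostring(db, encoding="unicode")
print(xml_text)
```

Calling wrap directly on db for both relations, in whatever order the tuples arrive, would give the intermixed layout instead.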
Or without “separator” tags …
<db>
  <projects>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
    :
  </projects>
  <employees>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
    :
  </employees>
</db>
Can be done if it is clear where each employee and each project starts
DTDs for the First Two Documents
<!DOCTYPE db [
  <!ELEMENT db (projects,employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>
<!DOCTYPE db [
  <!ELEMENT db (project | employee)*>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>
Wrapping Relations is not a Good Design Strategy
• When designing XML documents from ER diagrams,
– ER entities are described by XML elements
– ER attributes can be described either by XML attributes or by subelements
– How to represent ER relationships?
• By using the built-in relationship in XML between elements and subelements
• But it is not always possible, so ID references might have to be used
How to use XML Attributes
• XML attributes describe properties of the contents, rather than the contents
<entry>
  <word language="en"> cheese </word>
  <word language="fr"> fromage </word>
  <word language="ro"> branza </word>
  <meaning> A food made … </meaning>
</entry>
Attributes (cont’d)
Another common use for attributes is to express dimensions or types
<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE&lt;FEC ... </data>
</picture>
Using Attributes
<addresses>
  <person friend="yes">
    <name> Jeff Cohen</name>
    <tel> 04-828-1345 </tel>
    <tel> 054-470-778 </tel>
    <email> [email protected] </email>
  </person>
  <person friend="no">
    <name> Irma Levy</name>
    <tel> 03-426-1142 </tel>
    <email>[email protected]</email>
  </person>
</addresses>
It is not Always Clear When to Use Attributes
<person ssno="123 4589">
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>
<person>
  <ssno> 123 4589 </ssno>
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>
Using IDs
<person id="jeff" friend="yes" knows="irma">
  <name> Jeff Cohen</name>
  <tel> 04-828-1345 </tel>
  <tel> 054-470-778 </tel>
  <email> [email protected] </email>
</person>
<person id="irma" friend="no" knows="jeff">
  <name> Irma Levy</name>
  <tel> 03-426-1142 </tel>
  <email>[email protected]</email>
</person>
ID attributes
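A sketch of how an application might follow the knows references, assuming Python's standard xml.etree.ElementTree. That module does not read the DTD, so the code treats id as the ID attribute and knows as a whitespace-separated IDREFS list by convention; the wrapper element and the helper known_names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened persons, wrapped in an <addresses> element.
DOC = """<addresses>
  <person id="jeff" friend="yes" knows="irma"><name>Jeff Cohen</name></person>
  <person id="irma" friend="no" knows="jeff"><name>Irma Levy</name></person>
</addresses>"""

root = ET.fromstring(DOC)

# Index every person by the value of its id attribute.
by_id = {person.get("id"): person for person in root.iter("person")}

def known_names(person):
    """Resolve the IDREFS in 'knows' to the names of the referenced persons."""
    refs = person.get("knows", "").split()
    return [by_id[ref].findtext("name") for ref in refs]

print(known_names(by_id["jeff"]))
print(known_names(by_id["irma"]))
```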
How to Represent Relationships
• Two related ER entities, e.g., employees and departments, can be represented as follows
• A department is an element, and the employees are subelements of the department
• The relationship must be many-to-one or one-to-one
– Subelements are the “many”
No Multiple Copies of the Same Element (to Avoid Redundancies)
• Cannot represent in this way
– A many-to-many relationship
– A relationship with more than two entities
– A binary relationship between an entity and itself or between two entities that are related by an ISA relationship
• ID references must be used in the above cases
More Problematic Cases
• If there are several many-to-one relationships between two ER entities, then only one can be represented as an element-subelement relationship
• For example, employees can be subelements of their department
• But the relationship between a department and its manager (who is one of the employees) must be represented by an IDREF
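A possible layout for this case, sketched with Python's standard xml.etree.ElementTree. The element and attribute names (company, department, employee, id, manager) and the department name are assumptions made for the example. Membership is expressed by nesting, while the manager relationship is expressed by a reference to an employee's ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: employees nested in their department,
# the manager given by an IDREF-style attribute.
DOC = """<company>
  <department id="d1" manager="e2">
    <name>Research</name>
    <employee id="e1"><name>Joe</name></employee>
    <employee id="e2"><name>Sandra</name></employee>
  </department>
</company>"""

root = ET.fromstring(DOC)
employees_by_id = {e.get("id"): e for e in root.iter("employee")}
dept = root.find("department")

# Relationship 1 (works in): read directly from the nesting.
members = [e.findtext("name") for e in dept.findall("employee")]

# Relationship 2 (is managed by): follow the reference.
manager = employees_by_id[dept.get("manager")].findtext("name")

print(members, manager)
```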
Missing Information is another Problem
• If there could be an employee without a department, then employees cannot be represented as subelements of departments
– IDREFS have to be used
Inverse Relationships
• XML does not have built-in inverse relationships
• Must use IDREF to represent inverse relationships
• For example, add an IDREF attribute to each employee element for denoting the department of the employee
XML Schemas
W3Schools on XML Schemas
XML Schemas
• W3C XML Schema Language, also known as the language for XML Schema Definition (XSD)
• There are other proposals for XML Schemas
XSDs have Types
• XSDs use complex types that generalize the content model of DTDs (i.e., the regular expressions for describing elements)
• Many simple types, e.g., String, Integer
– Generalize PCDATA and CDATA
• Many facets of simple types, e.g., length, maxInclusive, maxExclusive
xs:sequence and xs:all
• Can specify that subelements should appear in a specific order (i.e., sequence) or in any order (i.e., all)
– But xs:all is not as general as xs:sequence
• Can restrict the number of occurrences of subelements, e.g., a department can have between 10 and 100 employees
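The occurrence restriction might be written as the XSD fragment embedded below. The element names and the use of xs:string for the employee content are assumptions. Python's standard library has no XSD validator, so the sketch only parses the schema as an ordinary XML document and reads the bounds back; actual validation would need a separate tool.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: a department has a name followed by 10 to 100 employees.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="employee" type="xs:string"
                    minOccurs="10" maxOccurs="100"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(SCHEMA)  # an XSD is itself a well-formed XML document

# Locate the declaration of "employee" and read its occurrence bounds.
employee_decl = schema.find(f".//{{{XS}}}element[@name='employee']")
bounds = (int(employee_decl.get("minOccurs")), int(employee_decl.get("maxOccurs")))

print(bounds)
```

Unlike a DTD, the schema is written in XML syntax and uses a namespace (the xs prefix) for its own vocabulary.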
References
• References are to specific elements or attributes, e.g., a reference to “person”, where “person” is the name of an element
More Features
• Mixed content can be defined more generally, compared to DTDs
• Local and global definitions of elements and types
• Derived types by restriction or extension
XSDs and Namespaces
• XSDs recognize namespaces
• Easier (than with DTDs) to check validity of a document with respect to multiple schemas
– A very important feature when collecting information from multiple heterogeneous sources
– XSDs are more extensible than DTDs
Summary of XML
• XML is a new data format, and its main virtues are:
– widespread acceptance
– the (important) ability to handle semistructured data (data without schema)
• DTDs provide some useful syntactic constraints on documents, but as schemas they are weak
• How to store large XML documents?
• How to query them?
• How to map between XML and other representations?