

2005 http://www.cs.huji.ac.il/~dbi 1

XML: eXtensible Markup Language, Part 2


XML Entities

XML Entities should not be Confused with Entities in the Sense of the ER Model

• An entity is a short string that denotes more complex information, which may reside inside or outside the XML document or its DTD
• Entities save typing
• Entities facilitate easy changes (when the same change is likely to be made in many places)
• Sometimes entities must be used to circumvent XML syntax violations (e.g., writing &lt; for a literal "<" in text)
• Applications should decode and encode entities, using their definitions

General entities

• A general entity is defined in the DTD

    <!ENTITY Name "EntityDefinition">

• And it is used in the document by writing

    &Name;

Example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title,director,cast?,budget)>
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title> Oh God!</title>
    <director> Woody Allen </director>
    <budget> $2M </budget>
  </movie>
</mdb>
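A minimal sketch of how an application sees a general entity after parsing, using Python's standard xml.etree.ElementTree module (the choice of Python and of this parser is an assumption, not part of the slides). The document is a trimmed copy of the example; the XML declaration and the ELEMENT declarations are left out because this parser does not validate.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's example: the internal DTD subset
# declares the general entity "bm".
DOC = """<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title>Oh God!</title>
  </movie>
</mdb>"""

root = ET.fromstring(DOC)
movie = root.find("movie")

# The parser has already replaced &bm; with its definition.
print(movie.get("opinion"))
```

The application never sees the reference &bm; itself, only the replacement text, which is why changing the entity definition in one place changes every use.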

Browser View

[Figure: the example document as displayed in a browser]

Parameter Entities

• Parameter entities are used only within DTDs
• Internal entities are references within the DTD
• External entities are references that draw information from outside files
• Parameter entity declaration:

    <!ENTITY % Name "EntityDefinition">

• A parameter entity is used in the DTD by writing %Name;

An Example of a Parameter Entity

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % essential "name, tel*">
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT person (%essential;, email, advisor?)>
<!ATTLIST person
    friend (yes | no) #IMPLIED
    id ID #REQUIRED
    knows IDREFS #IMPLIED>
<!ELEMENT advisor (person)>
<!ELEMENT addresses (person)*>

Unparsed Entities

<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "c:\Program Files\Netscape\Communicator\Program\Netscape.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title,director, budget)>
  <!ATTLIST movie
      id ID #REQUIRED
      opinion CDATA #IMPLIED
      starimage ENTITY #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
]>

Entities are defined
Types are defined

Data

<mdb>
  <movie id="ohgod" opinion="&bm;" starimage="starpicture">
    <title> Oh God!</title>
    <director> Woody Allen </director>
    <budget> $2M </budget>
  </movie>
</mdb>
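An unparsed entity is not expanded by the parser; the application receives only its name, system identifier, and notation, and must decide what to do with them. The sketch below illustrates this with Python's xml.parsers.expat module (an assumed choice of tool). The handler names on_unparsed and on_start and the trimmed DTD are illustrative, not from the slides.

```python
import xml.parsers.expat

# Trimmed version of the slide's DTD and data (raw string because of
# the backslashes in the notation's system identifier).
DOC = r"""<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "c:\Program Files\Netscape\Communicator\Program\Netscape.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ATTLIST movie starimage ENTITY #IMPLIED>
]>
<mdb><movie id="ohgod" starimage="starpicture"/></mdb>"""

unparsed = {}   # entity name -> (system identifier, notation name)
images = []     # entities referenced through the starimage attribute

def on_unparsed(name, base, system_id, public_id, notation):
    # Called once for each <!ENTITY ... NDATA ...> declaration.
    unparsed[name] = (system_id, notation)

def on_start(tag, attrs):
    # The attribute value is just the entity's name; resolve it ourselves.
    if "starimage" in attrs:
        images.append(unparsed[attrs["starimage"]])

parser = xml.parsers.expat.ParserCreate()
parser.UnparsedEntityDeclHandler = on_unparsed
parser.StartElementHandler = on_start
parser.Parse(DOC, True)

print(images)
```

Nothing is fetched from the URL: the parser only reports the declaration, and retrieving or displaying the GIF is left to the application.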

Defining Entities

• Entities can be defined
  – in the local document as part of the DOCTYPE definition
  – with a link to external files that contain the entity data (this, too, is done through the DOCTYPE definition)
  – in an external DTD
• Define locally when the entity is being used only in one particular document
• Define by a link to an external file when the entity is being used in many documents

Defining Entities – An Example

• Local Definition:

    <!DOCTYPE PRESSRELEASE [
      <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact [email protected].">
    ]>

• Global Definition:

    <!DOCTYPE PRESSRELEASE [
      <!ENTITY copyright SYSTEM "http://www.worldspins.com/legal/copyright.xml">
    ]>

Another Example

<?xml version="1.0"?>
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All rights reserved. Please do not copy or use without authorization. For authorization contact [email protected].">
  <!ENTITY trademark SYSTEM "http://www.worldspins.com/legal/trademark.xml">
]>

Example (cont’d)

<PRESSRELEASE>
  <HEAD>Mini-globe revolutionizes keychain industry</HEAD>
  <LEAD>Today As The World Spins introduces a new approach to keychains. With the new MINI-GLOBE keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one.</LEAD>
  <LEGAL>&trademark;&copyright;</LEGAL>
</PRESSRELEASE>


XML Namespaces

XML Namespaces

• When an element name appears in two different XML documents, we would like to know that it has the same meaning in both documents
  – Is the tag <title> used as the <title> XHTML tag in both documents?
  – If two documents about books have the tag <number>, does it mean that they use the same system for cataloging books?

What XML Namespaces are and What They are not

• Namespaces merely provide a mechanism for creating unique names (for elements and attributes) that can be used in XML documents all over the Web
  – A namespace is just a collection of names that were created for a specific domain of applications
• Namespaces are not DTDs and they do not provide a mechanism for validation of XML documents using multiple DTDs

Identifying an XML Namespace

• A namespace is identified by a URI
• The URI does not have to point to anything
  – It is merely used as a mechanism for creating unique names
• An element or attribute name from a namespace has two parts

    prefix:name

  – prefix identifies the namespace
  – name is just a name from the namespace

Namespaces are not Part of the XML 1.0 Recommendation

• When an XML 1.0 parser sees a qualified name

    prefix:name

  the parser treats this name just as it would treat any other attribute or element name (it is legal to use the character ":" in element and attribute names)
• Namespaces must be hardwired into DTDs

But

• When an application sees a qualified name, it may recognize it and act accordingly
  – A browser identifies tags that belong to the XHTML namespace and processes them
  – An XSLT processor identifies tags and attributes that belong to the XSLT namespace and executes them

The W3C Recommendation for Namespaces in XML

• The two-part naming system is the only thing defined in the W3C Namespace recommendation
  – and even that is not so short!
• This recommendation is just a collection of syntactic rules
  – Some rules are rather subtle

Declaring a Namespace

• An XML namespace is declared in an xmlns attribute (here xmlns:foo, which binds the prefix foo to the URI)

    <foo:book xmlns:foo="http://www.foo.org/">
      <foo:title> XML Namespaces </foo:title>
      <foo:author> John Doe </foo:author>
    </foo:book>

• Using foo as the prefix, instead of using the URI, is more convenient

The Default Namespace

• The default namespace is declared without a prefix

    <book xmlns="http://www.foo.org/">
      <title> XML Namespaces </title>
      <author> John Doe </author>
    </book>

• All the elements belong to the default namespace

Technically

• The namespace mechanism is just a mapping from prefixes to URIs, e.g.,
  – <foo:title> is replaced with <{http://www.foo.org/}title>
• It is done in a processing layer that operates on the element tree resulting from XML 1.0 parsing
• It creates unique names
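The prefix-to-URI mapping can be observed directly in Python's standard xml.etree.ElementTree module (an assumed choice of tool), which happens to report element names in the same {URI}name form. This sketch parses the prefixed and the default-namespace versions of the book example and compares the resulting names.

```python
import xml.etree.ElementTree as ET

# The book example, once with an explicit prefix ...
prefixed = ET.fromstring(
    '<foo:book xmlns:foo="http://www.foo.org/">'
    '<foo:title>XML Namespaces</foo:title>'
    '</foo:book>'
)

# ... and once with the same URI as the default namespace.
defaulted = ET.fromstring(
    '<book xmlns="http://www.foo.org/">'
    '<title>XML Namespaces</title>'
    '</book>'
)

# The prefix is gone; only the URI and the local name remain.
print(prefixed.tag)

# Both spellings yield the same unique name for the title element.
print(prefixed[0].tag == defaulted[0].tag)
```

The prefix itself carries no meaning: two documents that use different prefixes, or no prefix at all, for the same URI produce identical names.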

DTDs as Namespaces

• The URI of a namespace may point to a DTD
• A DTD defines a namespace comprising all its element names and attribute names
  – But it is just a namespace – not a DTD!

Example

<bib:book xmlns:bib="http://www.acm.org/bibliography.dtd"
          xmlns:isbn="http://www.isbn-org.org/def.dtd">
  <bib:title> Proceedings of SIGMOD </bib:title>
  <bib:number> 472010 </bib:number>
  <isbn:number> 1-58113-332-4 </isbn:number>
</bib:book>

This document is invalid according to either DTD!

But the document is well formed! (e.g., in the book element, attribute names are unique)

Alternatively, One Namespace can be Declared as the Default

<book xmlns="http://www.acm.org/bibliography.dtd"
      xmlns:isbn="http://www.isbn-org.org/def.dtd">
  <title> Proceedings of SIGMOD </title>
  <number> 472010 </number>
  <isbn:number> 1-58113-332-4 </isbn:number>
</book>

This document is well formed but invalid according to either DTD!

Scope of Namespaces

• The scope of a namespace declaration is the element containing the declaration and all descendant elements
  – Names from the namespace must use the prefix anywhere in the scope
• A descendant element can override a declaration, but only the default namespace can be undeclared (with xmlns="") in Namespaces in XML 1.0
• More than one namespace can be declared in the same scope
  – At most one can be the default namespace
  – All others must have unique prefixes


What about Attributes?

• Recall that element names and attribute names must be qualified if they belong to a nondefault namespace

• Unqualified element names belong to the default namespace (if they are inside the scope)

• However, an unqualified attribute does not belong to the default namespace

• An unqualified attribute is processed according to the rules that apply to its element name
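The different treatment of unqualified element names and unqualified attribute names can be checked with a small sketch in Python's xml.etree.ElementTree (an assumed choice of tool). The attributes lang and isbn:number are hypothetical and serve only to contrast an unqualified attribute with a qualified one.

```python
import xml.etree.ElementTree as ET

# "lang" is unqualified, "isbn:number" is qualified (both hypothetical).
book = ET.fromstring(
    '<book xmlns="http://www.foo.org/" '
    'xmlns:isbn="http://www.isbn-org.org/def.dtd" '
    'lang="en" isbn:number="1-58113-332-4"/>'
)

# The unqualified element name picks up the default namespace.
print(book.tag)

# The unqualified attribute does not; the qualified one carries its URI.
for name, value in book.attrib.items():
    print(name, "=", value)
```

The attribute lang stays a plain name with no namespace, so its meaning is determined by the element it appears on, here {http://www.foo.org/}book.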

Namespaces and DTDs: The Problem

• DTD syntax does not support namespaces
• The previous example showed an XML document with two DTDs that were used as namespaces
  – It is impossible to declare constraints that specify where fragments from each namespace can occur

Namespaces and DTDs: The Solutions

• Use a namespace-aware schema language, or
• Modify one of the two DTDs so that it will be a DTD for the new document
  – Two alternatives, as illustrated on the next two slides, using the previous example

One Alternative

• Add the required new elements to the DTD
• Give the appropriate unique names to these elements using parameter entities

    <!ENTITY % isbn "{http://www.isbn-org.org/def.dtd}">
    <!ENTITY % number "%isbn;number">
    <!ELEMENT book (title,author,number, %number;)>

The Second Alternative

• Add the required new elements to the DTD, using qualified names
• Use the attribute-list declaration for the new elements to declare the namespace as a fixed value

    <!ATTLIST isbn:number xmlns:isbn CDATA #FIXED "http://www.isbn-org.org/def.dtd">
    <!ELEMENT book (title,author,number, isbn:number)>


Data Exchange and Data Representation in XML


Exchanging Relational Data

• Each tuple can be wrapped inside an element

• See example on the following slides

Two Ways of Wrapping Relations in XML Documents

projects: title, budget, managedBy
employees: name, ssn, age

The Project and Employee Relations in XML

<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
  :
</db>

Projects and employees are intermixed

<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
    :
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
    :
  </employees>
</db>

Employees follow projects: all projects are grouped under <projects>, and all employees under <employees>
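As a rough sketch of how tuples could be wrapped mechanically, the following Python code (standard library only; the helper name wrap and the in-memory tuple lists are assumptions made for illustration) builds the grouped form of the document from the two relations.

```python
import xml.etree.ElementTree as ET

# The two relations from the slides, as plain tuples.
projects = [("Pattern recognition", 10000, "Joe"),
            ("Auto guided vehicles", 70000, "Sandra")]
employees = [("Joe", 344556, 34),
             ("Sandra", 2234, 35)]

def wrap(db, group_tag, row_tag, columns, rows):
    """Wrap each tuple in a row_tag element, with one subelement per column."""
    group = ET.SubElement(db, group_tag)
    for row in rows:
        elem = ET.SubElement(group, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(elem, column).text = str(value)
    return group

db = ET.Element("db")
wrap(db, "projects", "project", ("title", "budget", "managedBy"), projects)
wrap(db, "employees", "employee", ("name", "ssn", "age"), employees)

print(ET.tostring(db, encoding="unicode"))
```

Appending the project and employee elements directly to db, without the two grouping elements, would give the intermixed form instead.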

Or without “separator” tags …

<db>
  <projects>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
    :
  </projects>
  <employees>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
    :
  </employees>
</db>

Can be done if it is clear where each employee and each project starts

DTDs for the First Two Documents

When employees follow projects:

<!DOCTYPE db [
  <!ELEMENT db (projects,employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

When projects and employees are intermixed:

<!DOCTYPE db [
  <!ELEMENT db (project | employee)*>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

Wrapping Relations is not a Good Design Strategy

• When designing XML documents from ER diagrams,
  – ER entities are described by XML elements
  – ER attributes can be described either by XML attributes or by subelements
  – How to represent ER relationships?
• By using the built-in relationship in XML between elements and subelements
• But it is not always possible, so ID references might have to be used

How to use XML Attributes

• XML attributes describe properties of the contents, rather than the contents

    <entry>
      <word language="en"> cheese </word>
      <word language="fr"> fromage </word>
      <word language="ro"> branza </word>
      <meaning> A food made … </meaning>
    </entry>

Attributes (cont’d)

Another common use for attributes is to express dimensions or types

<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE<FEC ... </data>
</picture>

Using Attributes

<addresses>
  <person friend="yes">
    <name> Jeff Cohen</name>
    <tel> 04-828-1345 </tel>
    <tel> 054-470-778 </tel>
    <email> [email protected] </email>
  </person>
  <person friend="no">
    <name> Irma Levy</name>
    <tel> 03-426-1142 </tel>
    <email>[email protected]</email>
  </person>
</addresses>

It is not Always Clear When to Use Attributes

<person ssno="123 4589">
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>

<person>
  <ssno> 123 4589 </ssno>
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>

Using IDs

<person id="jeff" friend="yes" knows="irma">
  <name> Jeff Cohen</name>
  <tel> 04-828-1345 </tel>
  <tel> 054-470-778 </tel>
  <email> [email protected] </email>
</person>
<person id="irma" friend="no" knows="jeff">
  <name> Irma Levy</name>
  <tel> 03-426-1142 </tel>
  <email>[email protected]</email>
</person>

ID attributes
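ID references are not followed automatically; an application has to look them up. Below is a minimal sketch in Python's xml.etree.ElementTree (an assumed choice of tool). Because this parser does not read attribute types from a DTD, the code treats id as an ID and knows as IDREFS purely by convention; the helper known_names and the addresses wrapper element are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<addresses>
  <person id="jeff" friend="yes" knows="irma"><name>Jeff Cohen</name></person>
  <person id="irma" friend="no" knows="jeff"><name>Irma Levy</name></person>
</addresses>"""

root = ET.fromstring(DOC)

# Index every person by the value of its ID attribute.
by_id = {person.get("id"): person for person in root.iter("person")}

def known_names(person):
    """Follow the IDREFS in 'knows' (a space-separated list of IDs)."""
    refs = person.get("knows", "").split()
    return [by_id[ref].findtext("name") for ref in refs]

print(known_names(by_id["jeff"]))
```

A validating parser would additionally check that every ID is unique and that every reference points to an existing ID; here a dangling reference would simply raise a KeyError.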

How to Represent Relationships

• Two related ER entities, e.g., employees and departments, can be represented as follows
• A department is an element, and the employees are subelements of the department
• The relationship must be many-to-one or one-to-one
  – Subelements are the “many”

No Multiple Copies of the Same Element (to Avoid Redundancies)

• Cannot represent in this way
  – A many-to-many relationship
  – A relationship with more than two entities
  – A binary relationship between an entity and itself or between two entities that are related by an ISA relationship
• ID references must be used in the above cases


More Problematic Cases

• If there are several many-to-one relationships between two ER entities, then only one can be represented as an element-subelement relationship

• For example, employees can be subelements of their department

• But the relationship between a department and its manager (who is one of the employees) must be represented by an IDREF

Missing Information is another Problem

• If there could be an employee without a department, then employees cannot be represented as subelements of departments
  – ID references have to be used


Inverse Relationships

• XML does not have built-in inverse relationships

• Must use IDREF to represent inverse relationships

• For example, add an IDREF attribute to each employee element for denoting the department of the employee
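The department example can be sketched as follows in Python's xml.etree.ElementTree (an assumed choice of tool). The element and attribute names (company, department, manager, dept) and the ID values are hypothetical. Employees are nested in their department, the manager is given by an ID reference, and each employee carries an ID reference back to the department as the inverse relationship.

```python
import xml.etree.ElementTree as ET

DOC = """<company>
  <department id="d1" manager="e2">
    <name>Research</name>
    <employee id="e1" dept="d1"><name>Joe</name></employee>
    <employee id="e2" dept="d1"><name>Sandra</name></employee>
  </department>
</company>"""

root = ET.fromstring(DOC)

# One index for all elements that carry an id attribute.
by_id = {elem.get("id"): elem for elem in root.iter() if elem.get("id")}

department = by_id["d1"]

# Works-in: the built-in element/subelement relationship.
members = [e.findtext("name") for e in department.findall("employee")]

# Manages: a second relationship between the same entities, via an ID reference.
manager = by_id[department.get("manager")]

# Inverse direction: from an employee back to the department.
joes_department = by_id[by_id["e1"].get("dept")]

print(members, manager.findtext("name"), joes_department.findtext("name"))
```

The element tree itself gives only the downward direction (department to employees); the dept attribute is what makes the upward lookup possible without scanning the whole document.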


XML Schemas

W3Schools on XML Schemas


XML Schemas

• W3C XML Schema Language, also known as the language for XML Schema Definition (XSD)

• There are other proposals for XML Schemas

XSDs have Types

• XSDs use complex types that generalize the content model of DTDs (i.e., the regular expressions for describing elements)
• Many simple types, e.g., xs:string, xs:integer
  – Generalize PCDATA and CDATA
• Many facets of simple types, e.g., length, maxInclusive, maxExclusive

xs:sequence and xs:all

• Can specify that subelements should appear in a specific order (i.e., sequence) or in any order (i.e., all)
  – But xs:all is not as general as xs:sequence (in XSD 1.0, each subelement of xs:all may occur at most once)
• Can restrict the number of occurrences of subelements (with minOccurs and maxOccurs), e.g., a department can have between 10 and 100 employees
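The occurrence restriction on departments might be written as in the schema fragment below. This is a sketch: the element names department, name, and employee and their types are assumptions. Since an XSD is itself an XML document, the Python code (standard library only) simply parses it and reads the bounds back; it does not validate anything, because the standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema: a department has a name followed by 10 to 100 employees.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="employee" type="xs:string"
                    minOccurs="10" maxOccurs="100"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(XSD)
sequence = schema.find(f".//{XS}sequence")

# minOccurs and maxOccurs both default to 1 when they are omitted.
bounds = {
    child.get("name"): (int(child.get("minOccurs", "1")),
                        int(child.get("maxOccurs", "1")))
    for child in sequence
}
print(bounds)
```

A DTD content model can only express such bounds by repeating the element name, whereas the schema states them as two attribute values.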


References

• References are to specific elements or attributes, e.g., a reference to “person”, where “person” is the name of an element


More Features

• Mixed content can be defined more generally, compared to DTDs

• Local and global definitions of elements and types

• Derived types by restriction or extension

XSDs and Namespaces

• XSDs recognize namespaces
• Easier (than with DTDs) to check validity of a document with respect to multiple schemas
  – A very important feature when collecting information from multiple heterogeneous sources
  – XSDs are more extensible than DTDs

Summary of XML

• XML is a new data format, and its main virtues are
  – widespread acceptance
  – the (important) ability to handle semistructured data (data without schema)
• DTDs provide some useful syntactic constraints on documents, but as schemas they are weak
• How to store large XML documents?
• How to query them?
• How to map between XML and other representations?