XML for Information Management – Day 1, Airi Salminen
XML for Information Management
University of Erlangen-Nuremberg, Computational Linguistics
Instructor: Professor Airi Salminen, http://users.jyu.fi/~airi/
12.1.-16.1. 2009
Outline
Day 1: Course introduction, XML examples and concepts
1. Course introduction
2. XML examples
3. XML concepts
1. Course introduction: Instructor
‣ Home university: University of Jyväskylä in Finland, Faculty of Information Technology
‣ Home page: http://users.jyu.fi/~airi/
‣ Experience Jyväskylä: http://www3.jkl.fi/international/experience/index.html
‣ My research areas: structured documents, content management in organizations, document standardization, semantic web, information retrieval
‣ My XML-related research has concerned:
• modelling structured text
• querying structured text
• SGML/XML standardization
1. Course introduction: Instructor
Tague, J., Salminen, A., & McClellan, C. (1991). Complete formal model for information retrieval systems. In Proc. of the 14th ACM SIGIR Conference, 14-20. New York: ACM Press.
Salminen, A., & Watters, C. (1992). A two-level structure for textual databases to support hypertext access. Journal of the American Society for Information Science 43 (6), 432-447.
Salminen, A., & Tompa, F. (1993). PAT expressions: an algebra for text search. Acta Linguistica Hungarica 41 (1-4), 277-306. http://www.cs.jyu.fi/~airi/papers/COMPLEX-1992.pdf
Salminen, A., Tague-Sutcliffe, J., & McClellan, C. (1995). From text to hypertext by indexing. ACM Transactions on Information Systems 13 (1), 69-99.
Salminen, A., Lehtovaara, M., & Kauppinen, K. (1996). Standardization of digital legislative documents - a case study. In Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences (pp. 72-81). Los Alamitos, CA: IEEE Computer Society Press.
Kuikka, E., & Salminen, A. (1997). Two-dimensional filters for structured text. Information Processing and Management 33 (1), 37-54.
1. Course introduction: Instructor
Salminen, A., Kauppinen, K., & Lehtovaara, M. (1997). Towards a methodology for document analysis. Journal of the American Society for Information Science 48 (7), Special Issue on Structured Information/Standards for Document Architectures, 644-655.
Salminen, A., & Tompa, F. (1999). Grammars++ for modelling information in text. Information Systems 24 (1), 1-24.
Salminen, A., Tiitinen, P., & Lyytikäinen, V. (1999). Usability evaluation of a structured document archive. In Proc. of the Thirty-Second Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.
Lyytikäinen, V., Tiitinen, P., & Salminen, A. (2001). XML metadata for accessing heterogeneous legal databases. In Proc. of the XML Europe 2001 Conference. http://www.gca.org/papers/xmleurope2001/papers/html/s27-4.html
Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. In Proc. of the ACM Symposium on Document Engineering (DocEng '01), 85-94. New York: ACM Press.
Salminen, A., Lyytikäinen, V., Tiitinen, P., & Mustajärvi, O. (2001). Experiences of SGML standardization: The case of the Finnish legislative documents. In Proc. of the Thirty-Fourth Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.
1. Course introduction: Instructor
Salminen, A. (2003). Document analysis methods. Encyclopedia of Library and Information Science, Second Edition, Revised and Expanded (pp. 916-927). New York: Marcel Dekker.
Korhonen, R. & Salminen, A. (2003). Visualization of EDI messages: Facing the problems in the use of XML. In Proc. of the Fifth International Conference on Electronic Commerce, 466-473. New York: ACM Press.
Salminen, A., Lyytikäinen, V., Tiitinen, P., & Mustajärvi, O. (2004). Implementing digital government in the Finnish Parliament. In Digital Government: Strategies and Implementation (pp. 242-259). Hershey, PA: IDEA Group Publishing.
Salminen, A. (2005). Building digital government by XML. In Proc. of the Thirty-Eighth Hawaii International Conference on System Sciences. Los Alamitos, CA: IEEE Computer Society Press.
Salminen, A., Nurmeksela, R., Lehtinen, A., Lyytikäinen, V., & Mustajärvi, O. (2006). Content production strategies for e-Government. In Encyclopedia of Digital Government, Vol. I (pp. 224-230). Hershey, PA: IDEA Group Publishing.
Nurmeksela, R., Jauhiainen, E., Salminen, A., & Honkaranta, A. (2007). XML document implementation: Experiences from three cases. In Proceedings of the Second International Conference on Digital Information Management (pp. 224-229). Los Alamitos, CA: IEEE.
1. Course introduction: Instructor
XML-related projects
‣ RASKE (1994-1998): Developing Standards for Structured Documents
‣ inSGML (1998-2001): Methods for SGML standardization in industry
‣ EULEGIS (1998-2000): European User Views to Legislative Information in Structured Form
‣ AirXML (2002-2004): XML and Data Warehousing in Air Defence
‣ RASKE2 (2003-2006): Methods for the Integration of Systems and Services in e-Government
1. Course introduction
‣ Syllabus: http://users.jyu.fi/~airi/opetus/xml/erlangen/
‣ Course Readings: available on the course web site
‣ Project Assignment: http://users.jyu.fi/~airi/opetus/xml/erlangen/project.html
‣ Contact by email: [email protected]
1. Course introduction: project
‣ Purpose
• The projects are intended to explore the application of XML in various contexts. Students interested in practical XML exercises are free to suggest a practical project where they can test some XML software and/or build an application of their own.
• The project can also be an investigation of an existing or planned XML solution in an organizational context together with an analysis of the impacts of the solution.
‣ Topics: Proposed by students
‣ Teams of two, or individual projects
‣ The phases
• 2 page topic proposal: due on Feb. 20
• Project report: due on March 31
2. XML examples
• separation of the primary content and markup
• markup is metadata adding some information to the primary content
<?xml version="1.0"?>
<poem author="Murasaki Shikibu" author_born="974">
  <stanza>
    <line>This life of ours would not cause you sorrow</line>
    <line>if you thought of it as like</line>
    <line>the mountain cherry blossoms</line>
    <line>which bloom and fade in a day.</line>
  </stanza>
</poem>
Note: The text of the line elements is taken from http://www.bopsecrets.org/rexroth/translations/japanese.htm, containing Kenneth Rexroth's translations of Japanese poetry.
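The separation of primary content and markup can be made concrete with a short sketch. It assumes Python's standard xml.etree.ElementTree module as the tool (not something the course prescribes); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The poem document from this slide, with straight quotes around attribute values.
POEM_XML = """<?xml version="1.0"?>
<poem author="Murasaki Shikibu" author_born="974">
  <stanza>
    <line>This life of ours would not cause you sorrow</line>
    <line>if you thought of it as like</line>
    <line>the mountain cherry blossoms</line>
    <line>which bloom and fade in a day.</line>
  </stanza>
</poem>"""

poem = ET.fromstring(POEM_XML)

# Primary content: the text carried inside the line elements.
primary_content = [line.text.strip() for line in poem.iter("line")]

# Markup as metadata: the attributes attached to the poem element.
metadata = dict(poem.attrib)

print(primary_content)
print(metadata)
```

The list holds only the words of the poem, while the dictionary holds the information that the markup adds about it.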
2. XML examples
External presentation for human perception can be defined in a separate stylesheet. With a suitable stylesheet, the previous XML document might look like:

This life of ours would not cause you sorrow
if you thought of it as like
the mountain cherry blossoms
which bloom and fade in a day.

For examples of attaching stylesheets, try searching Google for "xml examples".
2. XML examples
A piece of prose in the TEI Guidelines:
http://www.tei-c.org/Guidelines/Customization/Lite/U5-eg.html
3. XML concepts
XML = Extensible Markup Language
A set of rules for defining and representing information as structured documents for applications on the Internet. XML is a restricted form of the older markup language called SGML.
T. Bray, J. Paoli, & C. M. Sperberg-McQueen (Eds.), Extensible Markup Language (XML) 1.0, W3C Recommendation 10 February 1998, http://www.w3.org/TR/1998/REC-xml-19980210/
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, & F. Yergeau (Eds.), Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation 26 November 2008, http://www.w3.org/TR/2008/REC-xml-20081126/
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, F. Yergeau, & J. Cowan (Eds.), Extensible Markup Language (XML) 1.1 (Second Edition), W3C Recommendation 16 August 2006, http://www.w3.org/TR/2006/REC-xml11-20060816/
XML Development History: http://www.w3.org/XML/hist2002
3. XML concepts: XML processor
Processing XML documents:
XML Document → XML Processor → Application
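The division of labour between an XML processor and an application can be sketched as follows. This assumes the expat parser from Python's standard library in the role of the XML processor; the function run_application and the event strings are hypothetical names chosen for the illustration.

```python
import xml.parsers.expat


def run_application(document: str) -> list[str]:
    """Feed a document to an XML processor and collect what it reports."""
    events: list[str] = []

    # The XML processor: reads the document and recognizes its structure.
    processor = xml.parsers.expat.ParserCreate()
    processor.buffer_text = True  # deliver each run of character data in one piece

    # The application: receives the recognized structure through callbacks.
    processor.StartElementHandler = lambda name, attrs: events.append(f"start {name}")
    processor.EndElementHandler = lambda name: events.append(f"end {name}")
    processor.CharacterDataHandler = lambda data: events.append(f"text {data!r}")

    processor.Parse(document, True)
    return events


events = run_application("<year>1654</year>")
print(events)
```

The application never inspects the angle brackets itself; it only sees the events that the processor reports.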
3. XML concepts: physical and logical structure
An XML processor recognizes two structures in a document:
• physical structure, consisting of entities
• logical structure, where elements are the core components
3. XML concepts: entity
Entity:
• file (text or some other kind of data)
• named piece of text
3. XML concepts: entity
Example of an entity structure
[Figure: a root entity, the entities part 1 and part 2, and the files figure1.jpg, figure2.jpg and figure3.jpg, connected by entity references]
3. XML concepts: entity
Entity as a named piece of text, like in HTML:
name    value    reference
auml    ä        &auml;
ouml    ö        &ouml;
Y&ouml; Jyv&auml;skyl&auml;ss&auml; → Yö Jyväskylässä
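Unlike in HTML, the names auml and ouml are not predefined in XML, so a document has to declare them before referring to them. The sketch below assumes declarations in an internal DTD subset and Python's standard ElementTree parser; the element name greeting is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two entities declared as named pieces of text, then referenced in the content.
DOC = """<!DOCTYPE greeting [
  <!ENTITY auml "&#228;">
  <!ENTITY ouml "&#246;">
]>
<greeting>Y&ouml; Jyv&auml;skyl&auml;ss&auml;</greeting>"""

greeting = ET.fromstring(DOC)

# The processor replaces each entity reference with the entity's value.
print(greeting.text)
```

The application receives the expanded text and does not see the references.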
3. XML concepts: element
An element is marked up by a begin-tag and an end-tag.
<year>1654</year>
Here <year> is the begin-tag, 1654 is the content, and </year> is the end-tag. Together they form the element.
3. XML concepts: element
Example 1: a document of seven elements
<?xml version="1.0"?>
<rhymecollection>
  <rhyme>
    <line>Ole aina iloinen</line>
    <line>niin kuin pikku varpunen</line>
  </rhyme>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
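The count of seven elements in Example 1 can be checked mechanically. A minimal sketch, assuming Python's standard ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

EXAMPLE_1 = """<?xml version="1.0"?>
<rhymecollection>
  <rhyme>
    <line>Ole aina iloinen</line>
    <line>niin kuin pikku varpunen</line>
  </rhyme>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>"""

root = ET.fromstring(EXAMPLE_1)

# iter() visits the root element and every element nested inside it.
element_names = [element.tag for element in root.iter()]
counts = Counter(element_names)

print(len(element_names), dict(counts))
```

One rhymecollection, two rhyme and four line elements add up to the seven elements named in the caption.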
3. XML concepts: tree structure
Example 1 as an element tree:
rhymecollection (root element)
  rhyme
    line
    line
  rhyme
    line
    line
There is always one root element.
Every non-root element is a child element of a parent element.
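Both statements about the tree can be verified on Example 1. The sketch assumes Python's standard ElementTree module; outline and parent_of are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

EXAMPLE_1 = """<rhymecollection>
  <rhyme><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>
  <rhyme><line>See, see! What shall I see?</line><line>A horse's head where his tail should be</line></rhyme>
</rhymecollection>"""

root = ET.fromstring(EXAMPLE_1)


def outline(element, depth=0):
    """Return the element tree as indented rows, one row per element."""
    rows = ["  " * depth + element.tag]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


# Map every child element to its parent element.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Elements without a parent: exactly one is expected, the root element.
parentless = [element for element in root.iter() if element not in parent_of]

print("\n".join(outline(root)))
```

Printing the outline reproduces the indented tree shown on the slide.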
3. XML concepts: attribute
Extra information can be attached to elements by attributes.
An attribute has:
• name
• value (character string)
<lastname earlier="Rantanen">Korhonen</lastname>
Here earlier is the attribute name and "Rantanen" is its value.
Two predefined attributes: xml:lang and xml:space.
• xml:lang for identifying the language of the content of an element
• xml:space for signaling that white space should be preserved by the application
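A short sketch of reading attributes, assuming Python's standard ElementTree module. The xml:lang attribute on lastname is added here only for illustration and is not part of the slide's example. ElementTree reports xml:lang under the reserved namespace identifier http://www.w3.org/XML/1998/namespace.

```python
import xml.etree.ElementTree as ET

XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"

# xml:lang="fi" is an assumption added to the slide's example for illustration.
element = ET.fromstring('<lastname earlier="Rantanen" xml:lang="fi">Korhonen</lastname>')

content = element.text                                # element content
earlier = element.get("earlier")                      # ordinary attribute
language = element.get(f"{{{XML_NAMESPACE}}}lang")    # predefined xml:lang attribute

print(content, earlier, language)
```

No declaration of the xml prefix is needed in the document for this to work.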
3. XML concepts: elements and attributes
Data in XML elements:
• as element content
• as attribute value
3. XML concepts: elements and attributes
Three alternative ways for giving two last names for a person:
1. <lastname earlier="Rantanen">Korhonen</lastname>
2. <lastname><earlier>Rantanen</earlier><now>Korhonen</now></lastname>
3. <lastname earlier="Rantanen" now="Korhonen"></lastname>
What is the difference?
3. XML concepts: elements and attributes
In the logical structure:
• Child elements of a parent element are ordered.
• The writing order of attributes in an element is insignificant.
3. XML concepts: elements and attributes
Different structures:
<lastname><earlier>Rantanen</earlier><now>Korhonen</now></lastname>
(1st child element: earlier, 2nd child element: now)
<lastname><now>Korhonen</now><earlier>Rantanen</earlier></lastname>
(1st child element: now, 2nd child element: earlier)
3. XML concepts: elements and attributes
Equivalent solutions:
<lastname earlier="Rantanen" now="Korhonen"></lastname>
<lastname now="Korhonen" earlier="Rantanen"></lastname>
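The difference between ordered child elements and unordered attributes shows up directly when the documents are parsed. A minimal sketch, assuming Python's standard ElementTree module, which exposes children as a sequence and attributes as a dictionary.

```python
import xml.etree.ElementTree as ET

# Same child elements in a different order: different structures.
children_a = ET.fromstring("<lastname><earlier>Rantanen</earlier><now>Korhonen</now></lastname>")
children_b = ET.fromstring("<lastname><now>Korhonen</now><earlier>Rantanen</earlier></lastname>")
order_a = [child.tag for child in children_a]
order_b = [child.tag for child in children_b]

# Same attributes written in a different order: equivalent solutions.
attributes_a = ET.fromstring('<lastname earlier="Rantanen" now="Korhonen"></lastname>')
attributes_b = ET.fromstring('<lastname now="Korhonen" earlier="Rantanen"></lastname>')

print(order_a == order_b)                          # child order matters
print(attributes_a.attrib == attributes_b.attrib)  # attribute order does not
```

Comparing dictionaries ignores the writing order, which matches the treatment of attributes in the logical structure.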
3. XML concepts: Unicode
XML documents are encoded in Unicode:
• intended for content written in any natural language of the world
• the development work is done by the Unicode Consortium
• the latest version at the time of the course (January 2009): Unicode 5.1.0
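One practical consequence is that the characters of a document are Unicode characters regardless of how the file is stored. The sketch below assumes Python's standard ElementTree module and two storage encodings chosen for illustration (UTF-8 and ISO-8859-1); the element name title is hypothetical.

```python
import xml.etree.ElementTree as ET

text = "Y\u00f6 Jyv\u00e4skyl\u00e4ss\u00e4"  # "Yö Jyväskylässä"

# The same document stored in two different encodings, each declared in the XML declaration.
as_utf8 = f'<?xml version="1.0" encoding="UTF-8"?><title>{text}</title>'.encode("utf-8")
as_latin1 = f'<?xml version="1.0" encoding="ISO-8859-1"?><title>{text}</title>'.encode("iso-8859-1")

from_utf8 = ET.fromstring(as_utf8).text
from_latin1 = ET.fromstring(as_latin1).text

# The stored bytes differ, but the processor delivers the same Unicode characters.
print(as_utf8 != as_latin1, from_utf8 == from_latin1)
print([hex(ord(character)) for character in text[:2]])
```

The last line prints the Unicode code points of the first two characters of the text.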
3. XML concepts: DTD
XML is a metalanguage intended to define languages for special application areas.
Document Type Definition (DTD) is the mechanism to define languages.
3. XML concepts: DTD
DTD:
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ELEMENT line (#PCDATA)>
]>
Example 1 meets the constraints defined in the DTD.
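How content models such as (title?, rhyme+) constrain a document can be sketched by hand. Python's standard library has no DTD validator, so the sketch below is not a validator: it assumes the four content models have been translated manually into regular expressions over the sequence of child element names. CONTENT_MODELS and meets_content_models are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models (an assumption of this sketch, not generated from the DTD).
# Each child name is followed by a comma, so "rhyme,rhyme," means two rhyme children.
CONTENT_MODELS = {
    "rhymecollection": r"(title,)?(rhyme,)+",  # (title?, rhyme+)
    "title": r"",                               # (#PCDATA): no child elements
    "rhyme": r"(line,)+",                       # (line+)
    "line": r"",                                # (#PCDATA): no child elements
}


def meets_content_models(root) -> bool:
    """Check every element's children against the content model of its type."""
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            return False  # element type not declared
        children = "".join(child.tag + "," for child in element)
        if re.fullmatch(model, children) is None:
            return False
    return True


example_1 = ET.fromstring(
    "<rhymecollection>"
    "<rhyme><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>"
    "<rhyme><line>See, see! What shall I see?</line><line>A horse's head where his tail should be</line></rhyme>"
    "</rhymecollection>"
)
title_last = ET.fromstring(
    "<rhymecollection><rhyme><line>x</line></rhyme><title>t</title></rhymecollection>"
)

print(meets_content_models(example_1), meets_content_models(title_last))
```

Example 1 passes, while a collection with the title after the rhymes fails, because the content model fixes the order of the children.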
3. XML concepts: DTD
Attributes added:
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author CDATA #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
3. XML concepts: DTD
A DTD can be attached to a document
• as an internal subset
• as an external subset
• by combining internal and external markup declarations
The DTD consists of all markup declarations together.
3. XML concepts: DTD
Internal DTD:
<?xml version="1.0" ?>
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author CDATA #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
<rhymecollection>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
3. XML concepts: DTD
The system identifier "myrhyme.dtd" gives the address of the external DTD:
<?xml version="1.0"?>
<!DOCTYPE rhymecollection SYSTEM "myrhyme.dtd">
<rhymecollection>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
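3. XML concepts: DTD
Markup declarations in "myrhyme.dtd":
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT rhymecollection (title?, rhyme+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT rhyme (line+)>
<!ATTLIST rhyme
  xml:lang NMTOKEN #REQUIRED
  author CDATA #IMPLIED >
<!ELEMENT line (#PCDATA)>
The first line is the text declaration. It is optional, and when present it must include an encoding declaration (UTF-8 is assumed here). The external subset contains only the markup declarations, without the <!DOCTYPE ... [ ]> wrapper.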
3. XML concepts: DTD
DTD is just one definition mechanism available for constraining XML data. The most important others:
• XML Schema
• RELAX NG
The term schema (or XML schema) can refer to a definition written with any definition mechanism developed for XML data. The languages for defining schemas are called schema languages.
3. XML concepts: XML application
An XML application is an XML-based language, (usually) defined by some schema language.
Examples of XML applications:
• XHTML: http://www.w3.org/TR/xhtml1/
• RSS (Really Simple Syndication): http://blogs.law.harvard.edu/tech/rss
• TEI (Text Encoding Initiative): http://www.tei-c.org/index.xml
• ebXML (Electronic Business using XML): http://www.ebxml.org/
3. XML concepts: XML application
XML -- SGML -- HTML -- XHTML
• XML is a subset of SGML
• HTML is an SGML application
• XHTML is an XML application
3. XML concepts: well-formed and valid
Two kinds of constraints in the XML specification:
• well-formedness constraints: all XML documents have to meet them, and documents that do are called well-formed
• validity constraints: documents that are associated with a DTD and meet the validity constraints (including the constraints expressed in the DTD) are called valid
3. XML concepts: well-formed and valid
A requirement for well-formed documents: each child element has to be contained in the parent element.
<date><day>24<month>1</day></month><year>2005</year></date>
NOT well-formed
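A conforming XML processor must report a violation of well-formedness, so the example can be tested by simply trying to parse it. The sketch assumes Python's standard ElementTree module; is_well_formed is a hypothetical helper, and the properly nested variant is added for comparison.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the processor can parse the document without a fatal error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# The slide's example: the month element is not contained in the day element.
overlapping = "<date><day>24<month>1</day></month><year>2005</year></date>"

# A properly nested variant, added here for comparison.
nested = "<date><day>24</day><month>1</month><year>2005</year></date>"

print(is_well_formed(overlapping), is_well_formed(nested))
```

The parser rejects the first document at the mismatched end-tag and accepts the second.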
3. XML concepts: well-formed and valid
<?xml version="1.0" ?>
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author CDATA #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
<rhymecollection>
  <rhyme xml:lang="fi">
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
VALID, even though the attribute value is not correct: the rhyme is in English, but the DTD only requires the value of xml:lang to be a name token.
3. XML concepts: well-formed and valid
<?xml version="1.0" ?>
<!DOCTYPE rhymecollection [
  <!ELEMENT rhymecollection (title?, rhyme+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT rhyme (line+)>
  <!ATTLIST rhyme
    xml:lang NMTOKEN #REQUIRED
    author CDATA #IMPLIED >
  <!ELEMENT line (#PCDATA)>
]>
<rhymecollection>
  <rhyme>
    <line>See, see! What shall I see?</line>
    <line>A horse's head where his tail should be</line>
  </rhyme>
</rhymecollection>
NOT valid: the required attribute xml:lang is missing from the rhyme element.
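The #REQUIRED constraint behind the two verdicts can be sketched as a hand-written check. Python's standard ElementTree parser does not validate against a DTD, so the table of required attributes below is copied manually from the ATTLIST declaration; REQUIRED_ATTRIBUTES and missing_required are hypothetical names.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Copied by hand from the ATTLIST declaration: xml:lang is #REQUIRED on rhyme.
REQUIRED_ATTRIBUTES = {"rhyme": [XML_LANG]}


def missing_required(root):
    """List (element name, attribute name) pairs for required attributes that are absent."""
    return [
        (element.tag, attribute)
        for element in root.iter()
        for attribute in REQUIRED_ATTRIBUTES.get(element.tag, [])
        if attribute not in element.attrib
    ]


with_lang = ET.fromstring(
    '<rhymecollection><rhyme xml:lang="fi"><line>See, see! What shall I see?</line></rhyme></rhymecollection>'
)
without_lang = ET.fromstring(
    "<rhymecollection><rhyme><line>See, see! What shall I see?</line></rhyme></rhymecollection>"
)

print(missing_required(with_lang))
print(missing_required(without_lang))
```

The check only looks at the presence of the attribute. It accepts the value "fi" on an English rhyme, just as DTD validation does.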
3. XML concepts: Namespaces
• There is often a need to use elements and attributes originating from different environments (or applications).
• Vocabularies in two environments may include common names intended for different purposes.
• If multiple declarations are used in a single DTD, name collisions must be avoided.
3. XML concepts: Namespaces
XML namespaces: http://www.w3c.org/TR/REC-xml-names
• Provide a method for qualifying element and attribute names so that name collisions can be avoided
• Motivation: modularity and documentation
• If a well-understood markup vocabulary for element and attribute names exists, it should be re-used rather than re-invented, especially if there is also software available.
3. XML concepts: Namespaces
XML namespace:
• Collection of names, identified by a URI
• No formal rules for defining names in a namespace
URI (Uniform Resource Identifier):
• URL (Uniform Resource Locator) or
• URN (Uniform Resource Name)
• Generic Syntax, RFC 3986: http://www.ietf.org/rfc/rfc3986.txt
In XML Names 1.1, URI has been replaced by IRI (Internationalized Resource Identifier, RFC 3987: http://www.rfc-editor.org/rfc/rfc3987.txt).
3. XML concepts: Namespaces
Example
Namespace: http://uwaterloo.ca
Element names: department, name, professor, student, last_name, first_name, ...
Global attribute names: id, ...
Per-element-type attribute names: student: supervisor, ...
3. XML concepts: Namespaces
Namespace declaration: defines a label (prefix) for the namespace and associates it with the namespace identifier (URI)
Qualified name: a namespace prefix and a local part, separated by a colon
<?xml version="1.0"?>
<report xmlns:uw="http://uwaterloo.ca">
  <uw:department>
    <uw:name>Department of Computer Science</uw:name>
    ...
</report>
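How a processor resolves qualified names can be sketched with Python's standard ElementTree module. The document below is a completed version of the slide's fragment: the elided part is left out and the closing uw:department tag is added, both assumptions made so that the example parses.

```python
import xml.etree.ElementTree as ET

# Completed version of the fragment (an assumption: nothing else inside uw:department).
DOC = """<?xml version="1.0"?>
<report xmlns:uw="http://uwaterloo.ca">
  <uw:department>
    <uw:name>Department of Computer Science</uw:name>
  </uw:department>
</report>"""

root = ET.fromstring(DOC)

# The prefix is only a label; what identifies the namespace is the URI.
prefixes = {"uw": "http://uwaterloo.ca"}
name = root.find("uw:department/uw:name", prefixes)

print(root.tag)   # report has no prefix and there is no default namespace
print(name.tag)   # ElementTree writes qualified names as {URI}local-part
print(name.text)
```

Any other prefix bound to the same URI would select the same elements, because the comparison is made on the URI and the local part.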
3. XML concepts: Namespaces
Prefix xml is reserved for W3C development work and its identifier is http://www.w3.org/XML/1998/namespace.
The namespace can be declared in a document but it can be used without declaration.
Prefix xmlns is used only for declaring namespaces. It must not itself be declared as a namespace prefix.
Open source software for experimentation: http://www.w3.org/Status