From characters to text: XML in a nutshell
Tamás Váradi
[email protected]
BTANT129 w3 2
Introduction
• The need for text markup
• A simple example
• XML annotation
• HTML vs. XML
• Benefits
• Applications
• Tools
From characters to texts
• If the computer sees only character streams, how do we build up the notion of a text?
• Conventional means to indicate text structure:
  – formatting
  – layout
  – still, we often need to understand the text element to recognize its role (semantics!)
A simple example
Tamás Váradi ← name
Chair ← position
Department of English Linguistics ← dept.
Miskolc University ← univ.
Miskolc-Egyetemváros ← address
(+36) 46 1234589/10-75 ← tel.
[email protected]
A simple example
<name>Tamás Váradi</name>
Chair ← position
Department of English Linguistics ← dept.
Miskolc University ← univ.
Miskolc-Egyetemváros ← address
(+36) 46 1234589/10-75 ← tel.
[email protected]
A simple example
<name>Tamás Váradi</name>
<position>Chair</position>
Department of English Linguistics ← dept.
Miskolc University ← univ.
Miskolc-Egyetemváros ← address
(+36) 46 1234589/10-75 ← tel.
[email protected]
A simple example
<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
Miskolc University ← univ.
Miskolc-Egyetemváros ← address
(+36) 46 1234589/10-75 ← tel.
[email protected]
A simple example
<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
Miskolc-Egyetemváros ← address
(+36) 46 1234589/10-75 ← tel.
[email protected]
A simple example
<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
<address>Miskolc-Egyetemváros</address>
(+36) 46 1234589/10-75 ← tel.
[email protected]
A simple example
<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
<address>Miskolc-Egyetemváros</address>
<tel>(+36) 46 1234589/10-75</tel>
[email protected]
A simple example
<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
<address>Miskolc-Egyetemváros</address>
<tel>(+36) 46 1234589/10-75</tel>
<email>[email protected]</email>
No need for formatting anymore
<name>Tamás Váradi</name><position>Chair</position><dept>Department of English Linguistics</dept><univ>Miskolc University</univ><address>Miskolc-Egyetemváros</address><tel>(+36) 46 1234589/10-75</tel><email>[email protected]</email>
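The point of this slide can be tried out in a few lines of Python. The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree module, wraps the fields in a <card> element (an assumption borrowed from the address book example), and leaves the e-mail field out. It writes the tagged fields as one unbroken line and then reads them back, showing that the tags alone carry the structure.

```python
import xml.etree.ElementTree as ET

# The fields of the business card, in the order shown on the slides.
fields = [
    ("name", "Tamás Váradi"),
    ("position", "Chair"),
    ("dept", "Department of English Linguistics"),
    ("univ", "Miskolc University"),
    ("address", "Miskolc-Egyetemváros"),
    ("tel", "(+36) 46 1234589/10-75"),
]

# Wrap every field in an element named after its role.
# The enclosing <card> element is an assumption made for this sketch.
card = ET.Element("card")
for tag, text in fields:
    ET.SubElement(card, tag).text = text

# Serialize without any line breaks or indentation.
one_line = ET.tostring(card, encoding="unicode")
print(one_line)

# Reading the single line back recovers every field and its role.
recovered = [(element.tag, element.text) for element in ET.fromstring(one_line)]
print(recovered == fields)
```

Layout is no longer needed to tell the fields apart; it can be added back later purely for human readers.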
An XML file is born
<addressBook>
  <card>
    <name>Tamás Váradi</name>
    <position>Chair</position>
    <dept>Department of English Linguistics</dept>
    <univ>Miskolc University</univ>
    <address>Miskolc-Egyetemváros</address>
    <tel>(+36) 46 1234589/10-75</tel>
    <email>[email protected]</email>
  </card>
  <card>
    <name>Zuzsanna Fülöp</name>
    <!-- etc. etc. -->
  </card>
</addressBook>
http://www.oasis-open.org/committees/relax-ng/tutorial.html
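Once the cards are in XML, a program can pick out fields by name. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper card_to_dict is a hypothetical name introduced here, the card carries a <tel> element as in the step-by-step example, and the e-mail field is omitted.

```python
import xml.etree.ElementTree as ET

# An address book in the shape shown on the slide; the second card
# is left as a stub, as on the slide.
ADDRESS_BOOK = """\
<addressBook>
  <card>
    <name>Tamás Váradi</name>
    <position>Chair</position>
    <dept>Department of English Linguistics</dept>
    <univ>Miskolc University</univ>
    <address>Miskolc-Egyetemváros</address>
    <tel>(+36) 46 1234589/10-75</tel>
  </card>
  <card>
    <name>Zuzsanna Fülöp</name>
  </card>
</addressBook>
"""

def card_to_dict(card):
    """Map each child element's tag name to its text content."""
    return {child.tag: child.text for child in card}

root = ET.fromstring(ADDRESS_BOOK)
cards = [card_to_dict(card) for card in root.findall("card")]

for card in cards:
    # .get() returns None for a field the card does not contain.
    print(card["name"], "|", card.get("position"))
```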
A close look at XML tags
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml-p2.html
An empty tag
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml-p2.html
Well-formed XML: some basic rules
• Everything is embedded in one element
  – The whole file has one root element
• No overlapping tags – proper embedding
<Tomato>Let's call <Potato>the whole thing off</Tomato></Potato>
WRONG!
<Tomato>Let's call <Potato>the whole thing off</Potato></Tomato>
RIGHT!
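Any XML parser will report these errors. As a rough illustration, the sketch below feeds the two Tomato/Potato examples, plus a fragment with two root elements, to the parser in Python's standard xml.etree.ElementTree module. The function name is_well_formed is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as one well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Overlapping tags: <Tomato> is closed while <Potato> is still open.
overlapping = "<Tomato>Let's call <Potato>the whole thing off</Tomato></Potato>"
# Proper embedding: the inner element closes before the outer one.
nested = "<Tomato>Let's call <Potato>the whole thing off</Potato></Tomato>"
# Two top-level elements: there is no single root.
two_roots = "<Tomato/><Potato/>"

for sample in (overlapping, nested, two_roots):
    print(is_well_formed(sample), sample)
```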
Well-formed XML: some basic rules
• No unclosed tags
• Attribute values must be in quotes
• The text characters (<), (>) and (") must be written as character entities
  – &lt;
  – &gt;
  – &quot;
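Escaping is rarely done by hand. A small sketch, assuming Python's standard xml.sax.saxutils module: escape() replaces <, > and also & (which has its own entity, &amp;), an optional mapping adds the double quote, and quoteattr() returns an attribute value already wrapped in quotes. The <note> element and its size attribute are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = 'if a < b && b > c then say "hi"'

# escape() always handles &, < and >; the extra mapping adds the double quote.
escaped = escape(raw, {'"': "&quot;"})
print(escaped)

# quoteattr() picks a quoting style that is safe for the value.
attribute = quoteattr('5" disk')
document = "<note size=" + attribute + ">" + escaped + "</note>"
print(document)

# A parser turns the entities back into the original characters.
note = ET.fromstring(document)
print(note.text == raw, note.get("size"))
```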
DTD: Document Type Definition
<!DOCTYPE addressBook [
  <!ELEMENT addressBook (card*)>
  <!ELEMENT card (name,position,dept,univ,address,tel,email)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT position (#PCDATA)>
  <!ELEMENT dept (#PCDATA)>
  <!ELEMENT univ (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
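To see what the DTD demands, the content model of card can be checked by hand. The sketch below is not a DTD validator (Python's standard library does not include one); it hard-codes two rules from the DTD above, namely that the root is addressBook containing only card elements, and that every card has exactly the seven children in the stated order. The names check_address_book and make_book are hypothetical helpers, and the field text is a placeholder.

```python
import xml.etree.ElementTree as ET

# Child order required by:
# <!ELEMENT card (name,position,dept,univ,address,tel,email)>
CARD_MODEL = ["name", "position", "dept", "univ", "address", "tel", "email"]

def check_address_book(text):
    """Return a list of problems; an empty list means both checks passed."""
    root = ET.fromstring(text)
    problems = []
    if root.tag != "addressBook":
        problems.append("root element is <%s>, expected <addressBook>" % root.tag)
    for number, child in enumerate(root, start=1):
        if child.tag != "card":
            problems.append("child %d is <%s>, expected <card>" % (number, child.tag))
            continue
        found = [field.tag for field in child]
        if found != CARD_MODEL:
            problems.append("card %d has children %s" % (number, found))
    return problems

def make_book(tags):
    """Build a one-card address book with placeholder text in every field."""
    fields = "".join("<%s>x</%s>" % (tag, tag) for tag in tags)
    return "<addressBook><card>%s</card></addressBook>" % fields

complete = make_book(CARD_MODEL)
# <address> appears twice and <tel> is missing, so the model is violated.
no_tel = make_book(["name", "position", "dept", "univ", "address", "address", "email"])

print(check_address_book(complete))
print(check_address_book(no_tel))
```

A validating parser performs checks of this kind for every declaration in the DTD, so the rules never have to be written out by hand.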
HTML
• HyperText Markup Language
• The World Wide Web is based on the notion of hypertext
• HTML goes back to SGML (Standard Generalized Markup Language)
• HTML: Display oriented markup language
• XML: Designed to capture Content
The pretty-printed surface
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml-p2.html
The code in HTML
<HTML>
<HEAD>
<TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
</HEAD>
<BODY>
<H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
My grandma's favorite (may she rest in peace).
<H4>Ingredients</H4>
<TABLE BORDER="1">
<TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
<TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
<TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
<TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
</TABLE>
<P>
<H4>Instructions</H4>
<OL>
<LI>Prepare lime gelatin according to package instructions...</LI>
<!-- and so on -->
</BODY>
</HTML>
The same info coded in XML
<?xml version="1.0"?>
<Recipe>
  <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
  <Description>
    My grandma's favorite (may she rest in peace).
  </Description>
  <Ingredients>
    <Ingredient>
      <Qty unit="box">1</Qty>
      <Item>lime gelatin</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="g">500</Qty>
      <Item>multicolored tiny marshmallows</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="ml">500</Qty>
      <Item>Cottage cheese</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="dash"/>
      <Item optional="1">Tabasco sauce</Item>
    </Ingredient>
  </Ingredients>
  <Instructions>
    <Step>Prepare lime gelatin according to package instructions</Step>
    <!-- And so on... -->
  </Instructions>
</Recipe>
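Because the XML version names its content, a program can ask for it directly. The sketch below is one possible illustration using Python's standard xml.etree.ElementTree module on a shortened copy of the recipe; the function name ingredients is a hypothetical helper, and treating a missing optional attribute as "0" is an assumption made for the sketch.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the recipe, kept small for the example.
RECIPE = """\
<Recipe>
  <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
  <Ingredients>
    <Ingredient>
      <Qty unit="box">1</Qty>
      <Item>lime gelatin</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="g">500</Qty>
      <Item>multicolored tiny marshmallows</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="dash"/>
      <Item optional="1">Tabasco sauce</Item>
    </Ingredient>
  </Ingredients>
</Recipe>
"""

def ingredients(recipe_xml):
    """Return one dictionary per <Ingredient> element."""
    root = ET.fromstring(recipe_xml)
    rows = []
    for ingredient in root.iter("Ingredient"):
        qty = ingredient.find("Qty")
        item = ingredient.find("Item")
        rows.append({
            "qty": qty.text,            # None for the empty <Qty unit="dash"/>
            "unit": qty.get("unit"),
            "item": item.text,
            # Assumption: a missing attribute counts as "0" (not optional).
            "optional": item.get("optional", "0") == "1",
        })
    return rows

rows = ingredients(RECIPE)
required = [row["item"] for row in rows if not row["optional"]]
print(required)
```

Answering the same question from the HTML version would mean guessing which table column holds what, and spotting "(optional)" inside the item text.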
The "grammar" of text: DTD
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
               isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>
HTML vs. XML
• HTML: Display oriented markup language
• Tag-set is fixed and serves display
• Here to stay as a page description language
• XML: Designed to capture Content
• Tag-set is open and can be suited for content
• XML is a general purpose annotation scheme to encode data
Conclusions
• Computers do not recognize text structure and text elements
• To do so, they would often need to "understand" and "interpret" text
• Until they are smart enough (if ever) to do so, we explicitly mark up text in a standard way, using annotation that machines can parse