From characters to text: XML in a nutshell

Tamás Váradi
varadi@nytud.hu

BTANT129 w3 2

Introduction

• The need for text markup
• A simple example
• XML annotation
• HTML vs. XML
• Benefits
• Applications
• Tools

From characters to texts

• If the computer sees only character streams, how do we build up the notion of a text?

• Conventional means of indicating text structure:
  – formatting
  – layout
  – still, we often need to understand the text element to recognize its role (semantics!)

A simple example

Tamás Váradi                        name
Chair                               position
Department of English Linguistics   dept.
Miskolc University                  univ.
Miskolc-Egyetemváros                address
(+36) 46 1234589/10-75              tel.
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
Chair                               position
Department of English Linguistics   dept.
Miskolc University                  univ.
Miskolc-Egyetemváros                address
(+36) 46 1234589/10-75              tel.
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
<position>Chair</position>
Department of English Linguistics   dept.
Miskolc University                  univ.
Miskolc-Egyetemváros                address
(+36) 46 1234589/10-75              tel.
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
Miskolc University                  univ.
Miskolc-Egyetemváros                address
(+36) 46 1234589/10-75              tel.
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
Miskolc-Egyetemváros                address
(+36) 46 1234589/10-75              tel.
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
<address>Miskolc-Egyetemváros</address>
(+36) 46 1234589/10-75              tel.
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
<address>Miskolc-Egyetemváros</address>
<tel>(+36) 46 1234589/10-75</tel>
[email protected]                   email

A simple example

<name>Tamás Váradi</name>
<position>Chair</position>
<dept>Department of English Linguistics</dept>
<univ>Miskolc University</univ>
<address>Miskolc-Egyetemváros</address>
<tel>(+36) 46 1234589/10-75</tel>
<email>[email protected]</email>

No need for formatting anymore

<name>Tamás Váradi</name><position>Chair</position><dept>Department of English Linguistics</dept><univ>Miskolc University</univ><address>Miskolc-Egyetemváros</address><tel>(+36) 46 1234589/10-75</tel><email>[email protected]</email>

An XML file is born

<addressBook>
  <card>
    <name>Tamás Váradi</name>
    <position>Chair</position>
    <dept>Department of English Linguistics</dept>
    <univ>Miskolc University</univ>
    <address>Miskolc-Egyetemváros</address>
    <tel>(+36) 46 1234589/10-75</tel>
    <email>[email protected]</email>
  </card>
  <card>
    <name>Zsuzsanna Fülöp</name>
    <!-- etc. etc. -->
  </card>
</addressBook>

http://www.oasis-open.org/committees/relax-ng/tutorial.html
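
Once the card is marked up, a program can pick out each field by name instead of guessing from layout. The following is only a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree parser; the function name cards_as_dicts and the e-mail value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The address book from the slide; the e-mail address is a placeholder.
ADDRESS_BOOK = """\
<addressBook>
  <card>
    <name>Tamás Váradi</name>
    <position>Chair</position>
    <dept>Department of English Linguistics</dept>
    <univ>Miskolc University</univ>
    <address>Miskolc-Egyetemváros</address>
    <tel>(+36) 46 1234589/10-75</tel>
    <email>someone@example.org</email>
  </card>
</addressBook>
"""

def cards_as_dicts(xml_text):
    """Return one {tag: text} dictionary per <card> element."""
    root = ET.fromstring(xml_text)
    return [{field.tag: field.text for field in card}
            for card in root.iter("card")]

cards = cards_as_dicts(ADDRESS_BOOK)
print(cards[0]["name"], "/", cards[0]["position"])
```

The program never needs to know that the name comes first or that the telephone number contains digits; the tags carry that information.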

A close look at XML tags

http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml-p2.html

An empty tag

http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml-p2.html
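
The parts of a tag (element name, attributes, content) and the empty-tag shorthand can also be inspected programmatically. This is a rough sketch, assuming Python's standard xml.etree.ElementTree module and borrowing the Qty element from the recipe example later in these slides; the helper describe is hypothetical.

```python
import xml.etree.ElementTree as ET

# An element with an attribute and content, and an empty-element tag.
full = ET.fromstring('<Qty unit="g">500</Qty>')
empty = ET.fromstring('<Qty unit="dash"/>')

def describe(element):
    """Summarise an element: its name, its attributes, and its text content."""
    return {
        "name": element.tag,
        "attributes": dict(element.attrib),
        "content": element.text,  # None when the element has no content
    }

print(describe(full))
print(describe(empty))

# <Qty unit="dash"/> is shorthand for <Qty unit="dash"></Qty>:
# both parse to the same element.
long_form = ET.fromstring('<Qty unit="dash"></Qty>')
print(ET.tostring(empty) == ET.tostring(long_form))
```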

Well-formed XML: some basic rules

• Everything is embedded in one element
  – The whole file has one root element

• No overlapping tags: elements must be properly nested

<Tomato>Let's call <Potato>the whole thing off</Tomato></Potato>

WRONG!

<Tomato>Let's call <Potato>the whole thing off</Potato></Tomato>

RIGHT!
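
Any XML parser enforces these rules and rejects input that breaks them. A small sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the two_roots sample are illustrative additions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

wrong = "<Tomato>Let's call <Potato>the whole thing off</Tomato></Potato>"
right = "<Tomato>Let's call <Potato>the whole thing off</Potato></Tomato>"
two_roots = "<Tomato/><Potato/>"  # violates the single-root rule

print(is_well_formed(wrong))      # False: overlapping tags
print(is_well_formed(right))      # True
print(is_well_formed(two_roots))  # False: more than one root element
```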

Well-formed XML: some basic rules

• No unclosed tags
• Attribute values must be in quotes
• The characters (<), (>), (") and (&) occurring as text are written as entity references
  – &lt;
  – &gt;
  – &quot;
  – &amp;
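
In practice the escaping is usually left to a library. A sketch, assuming Python's standard xml.sax.saxutils helpers escape and quoteattr; the sample string and the element name code are invented for the example.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw = 'if a < b && b > c: print("ok")'

# escape() always replaces &, < and >; other mappings are passed explicitly.
as_text = escape(raw, {'"': "&quot;"})
print(as_text)

# quoteattr() escapes the value and wraps it in suitable quotes.
element = "<code sample=%s>%s</code>" % (quoteattr(raw), as_text)

# Parsing the result gives back the original characters.
parsed = ET.fromstring(element)
print(parsed.text == raw and parsed.get("sample") == raw)
```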

DTD: Document Type Definition

<!DOCTYPE addressBook [
  <!ELEMENT addressBook (card*)>
  <!ELEMENT card (name,position,dept,univ,address,tel,email)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT position (#PCDATA)>
  <!ELEMENT dept (#PCDATA)>
  <!ELEMENT univ (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
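
To illustrate what this DTD demands, the sketch below hand-codes two of its rules in Python: the root contains only card elements, and each card has exactly the seven children in the declared order. It is not a DTD validator (Python's standard xml.etree.ElementTree does not validate against DTDs), and the names CARD_MODEL and check_address_book are hypothetical.

```python
import xml.etree.ElementTree as ET

# Child order required by:
# <!ELEMENT card (name,position,dept,univ,address,tel,email)>
CARD_MODEL = ["name", "position", "dept", "univ", "address", "tel", "email"]

def check_address_book(xml_text):
    """Return a list of problems; an empty list means this sketch found none."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "addressBook":
        problems.append("root element is <%s>, expected <addressBook>" % root.tag)
    for number, child in enumerate(root, start=1):
        if child.tag != "card":
            problems.append("child %d is <%s>, expected <card>" % (number, child.tag))
            continue
        found = [field.tag for field in child]
        if found != CARD_MODEL:
            problems.append("card %d has children %s" % (number, found))
    return problems

good = ("<addressBook><card><name>n</name><position>p</position>"
        "<dept>d</dept><univ>u</univ><address>a</address>"
        "<tel>t</tel><email>e</email></card></addressBook>")
bad = "<addressBook><card><name>n</name><email>e</email></card></addressBook>"

print(check_address_book(good))
print(check_address_book(bad))
```

A validating parser performs the same kind of check generically, reading the rules from the DTD rather than from hand-written code.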

HTML

• HyperText Markup Language
• The World Wide Web is based on the notion of hypertext
• HTML goes back to SGML (Standard Generalized Markup Language)

• HTML: display-oriented markup language

• XML: designed to capture content

The pretty-printed surface

http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xml-p2.html

The code in HTML

<HTML>
<HEAD>
<TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
</HEAD>
<BODY>
<H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
My grandma's favorite (may she rest in peace).
<H4>Ingredients</H4>
<TABLE BORDER="1">
<TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
<TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
<TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
<TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
</TABLE>
<P>
<H4>Instructions</H4>
<OL>
<LI>Prepare lime gelatin according to package instructions...</LI>
<!-- and so on -->
</BODY>
</HTML>

The same info coded in XML

<?xml version="1.0"?>
<Recipe>
  <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
  <Description>
    My grandma's favorite (may she rest in peace).
  </Description>
  <Ingredients>
    <Ingredient>
      <Qty unit="box">1</Qty>
      <Item>lime gelatin</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="g">500</Qty>
      <Item>multicolored tiny marshmallows</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="ml">500</Qty>
      <Item>Cottage cheese</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="dash"/>
      <Item optional="1">Tabasco sauce</Item>
    </Ingredient>
  </Ingredients>
  <Instructions>
    <Step>Prepare lime gelatin according to package instructions </Step>
    <!-- And so on... -->
  </Instructions>
</Recipe>
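
Because the XML version labels quantities, units and items explicitly, a program can query them directly, which the HTML table does not support without guessing at column meanings. A sketch under the assumption of Python's standard xml.etree.ElementTree; the function shopping_list is a made-up name, and the recipe is abridged to its ingredients.

```python
import xml.etree.ElementTree as ET

RECIPE = """\
<Recipe>
  <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
  <Ingredients>
    <Ingredient>
      <Qty unit="box">1</Qty>
      <Item>lime gelatin</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="g">500</Qty>
      <Item>multicolored tiny marshmallows</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="ml">500</Qty>
      <Item>Cottage cheese</Item>
    </Ingredient>
    <Ingredient>
      <Qty unit="dash"/>
      <Item optional="1">Tabasco sauce</Item>
    </Ingredient>
  </Ingredients>
</Recipe>
"""

def shopping_list(xml_text, include_optional=False):
    """Return (quantity, unit, item) tuples read from the markup."""
    root = ET.fromstring(xml_text)
    result = []
    for ingredient in root.iter("Ingredient"):
        qty = ingredient.find("Qty")
        item = ingredient.find("Item")
        # optional="1" marks an ingredient that may be left out.
        if item.get("optional", "0") == "1" and not include_optional:
            continue
        result.append((qty.text, qty.get("unit"), item.text))
    return result

for row in shopping_list(RECIPE):
    print(row)
```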

The "grammar" of text: DTD

<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
               isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>
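
The occurrence indicators in this grammar read like regular-expression operators: ? means optional, * means zero or more, + means one or more, and no indicator means exactly one. The sketch below makes that analogy concrete in Python. It handles only flat, comma-separated sequences written without parentheses (so (Step)+ is passed as "Step+"), not choice groups or nesting, and the helpers model_to_regex and matches are hypothetical, not part of any XML library.

```python
import re

def model_to_regex(model):
    """Turn a flat DTD sequence such as 'Name, Description?' into a regex
    over a string of child names, each followed by one space."""
    parts = []
    for token in (t.strip() for t in model.split(",")):
        if token[-1] in "?*+":
            name, indicator = token[:-1], token[-1]
        else:
            name, indicator = token, ""
        parts.append("(?:%s )%s" % (re.escape(name), indicator))
    return re.compile("".join(parts))

def matches(model, children):
    """True if the sequence of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None

RECIPE_MODEL = "Name, Description?, Ingredients?, Instructions?"

print(matches(RECIPE_MODEL, ["Name", "Instructions"]))  # Description is optional
print(matches(RECIPE_MODEL, ["Description"]))           # Name is required
print(matches("Step+", []))                             # at least one Step
print(matches("Ingredient*", []))                       # zero is allowed
```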

HTML vs. XML

• HTML: display-oriented markup language

• Tag set is fixed and serves display

• Here to stay as a page description language

• XML: designed to capture content

• Tag set is open and can be tailored to the content

• XML is a general-purpose annotation scheme for encoding data

Conclusions

• Computers do not recognize text structure and text elements

• To do so, they would often need to "understand" and "interpret" the text

• Until they are smart enough (if ever) to do so, we explicitly mark up text in a standard way, using annotation that machines can parse