markup languages


Upload: senthil-kanth

Post on 13-Dec-2014



Page 1: Markup Languages

Stein Markup 1.1

Markup Languages (ML)

Yaakov J. Stein

Chief Scientist, RAD Data Communications

[Title graphic: the letters "ML" surrounded by the prefixes SG, W, X, HT, VOX, DS, DHT, G, math, legal-X, C, SSS A, CP]

Page 2: Markup Languages

What do I do?

business letters, email, meeting summaries, proposals, reports, requirement specifications, project plans

web pages, research articles, review articles, books

I digest, edit and produce documents

Page 3: Markup Languages

What do others do?

Pretty much the same

US corporations produce >100 billion documents per year

90% of a modern institution’s information is in documents

>50% of a typical corporation’s effort involves documents

That’s why word-processing software was expected to bring efficiency increases

But it didn’t!

Page 4: Markup Languages

Word processing?

PROs
– makes nicer-looking documents
– expedites document sharing during creation

CONs
– typically 30% of effort goes to formatting and reformatting
– doesn’t increase information accessibility
– doesn’t facilitate information mining

Page 5: Markup Languages

Databases?

The natural alternative to documents is a database

PROs
– increase information accessibility
– facilitate information mining

CONs
– not human readable
– format inflexible

Page 6: Markup Languages

The solution

What we really want is to write unconstrained text but to have information retrieval as well!

Method 1: Automatic text analysis
– AI program analyzes text
– Recognizes document structure, sentence syntax
– Performs gisting, facilitates information mining
– Complete solution equivalent to solving Turing test

Method 2: Manual markup
– Document author responsible for marking
– Clarifies document structure
– Enables automated retrieval of selected information
– Suggests presentation format

Page 7: Markup Languages

Why is text analysis hard?

The man cried FIRE !

The man cried FIRE the gun !

The man cried FIRE the gun maker !

Page 8: Markup Languages

Are MLs computer languages?

There are many different types of computer languages:

procedural languages

for (n=0; n<10; n++) if (n>5) printf("markup languages are fun!\n");

graphic languages

newpath 0 0 moveto 0 1 lineto 1 1 lineto 1 0 lineto closepath fill

database languages

SELECT book FROM biblio WHERE subject='DSP' AND author='STEIN';

logical languages

useful(dsp). useful(hardware). fun(dsp). fun(web).
interesting(X) :- useful(X), fun(X).
?- interesting(X).

Page 9: Markup Languages

They are!

Markup languages do not instruct computers directly, as procedural languages do; rather, they instruct the computer indirectly, as logical languages do.

They do this by using:
– elements and attributes (tags)
– entities
– text

<BOOK SUBJECT="dsp">
<TITLE FORMAT="short">DSP-CSP</TITLE>
<AUTHOR>J. Stein</AUTHOR>
This is a great book! &standard-disclaimer;
</BOOK>
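
As a rough illustration of how a program sees these ingredients, the sketch below parses the book example with Python's standard xml.etree.ElementTree. It assumes an XML-style rendering of the example: the entity reference is terminated with a semicolon, and an internal DTD subset (not in the original) declares standard-disclaimer with made-up replacement text.

```python
import xml.etree.ElementTree as ET

# Assumption: the internal DTD subset and the disclaimer text are invented
# here so that the parser can resolve the entity reference.
DOCUMENT = """<!DOCTYPE BOOK [
  <!ENTITY standard-disclaimer "Opinions expressed are personal.">
]>
<BOOK SUBJECT="dsp">
  <TITLE FORMAT="short">DSP-CSP</TITLE>
  <AUTHOR>J. Stein</AUTHOR>
  This is a great book! &standard-disclaimer;
</BOOK>"""

root = ET.fromstring(DOCUMENT)

# Elements: the tag names that delimit parts of the document.
element_names = [el.tag for el in root.iter()]

# Attributes: name="value" pairs carried inside a start tag.
book_attributes = dict(root.attrib)
title_format = root.find("TITLE").get("FORMAT")

# Text: character data between tags. The text that follows </AUTHOR>
# is stored as the "tail" of the AUTHOR element; the entity has already
# been replaced by its declared text at this point.
title_text = root.find("TITLE").text
trailing_text = root.find("AUTHOR").tail.strip()

print(element_names)
print(book_attributes, title_format)
print(title_text)
print(trailing_text)
```

The program never sees the entity reference itself, only its expansion, which is why entities are handy for boilerplate that may change later.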

Page 10: Markup Languages

Some markup element functions

Structural
– Clarifies document structure
– Delineates document parts

Descriptive (informative)
– Indicates
– Facilitates information retrieval

Presentational (display)
– Presents information in nice format
– Helps human readability

Referential (links, applications)
– Provides hypertext links
– Launches applications

Page 11: Markup Languages

Structural Markup

<HEADING>September 1, 2000</HEADING>

<GREETING>Dear Prof. Stein, </GREETING>

<BODY>

I would like to tell you how much I enjoyed reading your new text

“Digital Signal Processing, A Computer Science Perspective”.

I hope we will be able to meet at the next conference.

</BODY>

<SIGNATURE>

Sincerely,

Dee Espy

</SIGNATURE>

Page 12: Markup Languages

Descriptive Markup

<DATE>September 1, 2000</DATE>

Dear <PERSON>Prof. Stein,</PERSON>

I would like to tell you how much I enjoyed reading your new text

<BOOK>

“Digital Signal Processing, A Computer Science Perspective”.

</BOOK>

I hope we will be able to meet at the next <EVENT>conference.</EVENT>

Sincerely,

<PERSON>Dee Espy</PERSON>
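
Descriptive tags are what make automated retrieval possible. As a minimal sketch, assuming the letter is wrapped in a hypothetical <LETTER> root element (not in the original) so that it parses as XML, Python's xml.etree.ElementTree can pull out every person, date, or event without understanding the sentences. The helper name find_all is likewise only illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: <LETTER> is a made-up wrapper element; the slide shows only
# the marked-up body of the letter.
LETTER = """<LETTER>
<DATE>September 1, 2000</DATE>
Dear <PERSON>Prof. Stein,</PERSON>
I would like to tell you how much I enjoyed reading your new text
<BOOK>"Digital Signal Processing, A Computer Science Perspective".</BOOK>
I hope we will be able to meet at the next <EVENT>conference.</EVENT>
Sincerely,
<PERSON>Dee Espy</PERSON>
</LETTER>"""

def find_all(document, tag):
    """Return the text of every element with the given tag, trimmed of
    surrounding whitespace, commas and periods."""
    root = ET.fromstring(document)
    return [el.text.strip(" \n,.") for el in root.iter(tag)]

people = find_all(LETTER, "PERSON")
dates = find_all(LETTER, "DATE")
events = find_all(LETTER, "EVENT")

print(people)
print(dates)
print(events)
```

The same query against the unmarked letter would require the kind of text analysis described earlier as hard.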

Page 13: Markup Languages

Presentational Markup

<RIGHT-JUSTIFY>September 1, 2000</RIGHT-JUSTIFY>

<BOLD>Dear Prof. Stein,</BOLD>

I would like to tell you how much I enjoyed reading your new text

<UNDERLINE>

“Digital Signal Processing, A Computer Science Perspective”.

</UNDERLINE>

I hope we will be able to meet at the next

<BLINK>conference.</BLINK>

Sincerely,

<IMAGE SRC="deesignature.jpg" ALIGN="left">

<FONT FACE="Times-Roman">Dee Espy</FONT>

Page 14: Markup Languages

Relational Markup

<today xlink:form="simple" href="date" actuate="auto">

Dear Prof. Stein,

I would like to tell you how much I enjoyed reading your new text

<A HREF="www.amazon.com/exec/obidos/ASIN/04712954">

“Digital Signal Processing, A Computer Science Perspective”.

</A>

I hope we will be able to meet at the next

<A HREF="conference">conference.</A>

Sincerely,

<IMAGE SRC="dee-signature.jpg" ALIGN="left">

<A HREF="mailto:[email protected]">Dee Espy</A>

Page 15: Markup Languages

Generalized Markup Language

William Tunnicliffe, Stanley Rice [1960s] (independently) invent the idea of a structural markup language

Problem: need a different ML for each type of document (letter, report, article, book, etc.)

Charles Goldfarb, Edward Mosher, Raymond Lorie (IBM) [1973] invent Generalized Markup Language (GML)

Solution: use a metalanguage, in which a Document Type Definition (DTD) defines the tags

IBM marked up 90% of its documents with GML
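
To make the role of a document type definition concrete, here is a loose Python sketch of the idea; it is not real DTD syntax and not a GML parser. The ALLOWED_CHILDREN table, the element names (borrowed from the structural letter example earlier in the deck), and the conforms function are all illustrative assumptions. The table plays the part of the type definition, and the checker confirms that a document uses only the tags that definition permits.

```python
import xml.etree.ElementTree as ET

# Hypothetical "document type" for a letter: which child elements each
# element may contain. A real DTD also constrains order and repetition.
ALLOWED_CHILDREN = {
    "LETTER": {"HEADING", "GREETING", "BODY", "SIGNATURE"},
    "HEADING": set(),
    "GREETING": set(),
    "BODY": set(),
    "SIGNATURE": set(),
}

def conforms(element, allowed=ALLOWED_CHILDREN):
    """Return True if element and all its descendants use only permitted tags."""
    if element.tag not in allowed:
        return False
    for child in element:
        if child.tag not in allowed[element.tag]:
            return False
        if not conforms(child, allowed):
            return False
    return True

good = ET.fromstring(
    "<LETTER><HEADING>September 1, 2000</HEADING>"
    "<BODY>I enjoyed your book.</BODY></LETTER>"
)
bad = ET.fromstring("<LETTER><BLINK>conference</BLINK></LETTER>")

print(conforms(good))
print(conforms(bad))
```

Swapping in a different table gives a different document type (report, article, book) without touching the checker, which is the point of using a metalanguage.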

Page 16: Markup Languages

With GML structure is evident

Library
  Novels
  Journals
  Textbooks
    Algebraic zoology
    Botanical history
    Computer poetry
    DSP
      DSP-CSP
        Title
          Full: Digital Signal Processing a Computer Science Perspective
          Short: DSPCSP
        Author
          Name: Jonathan (Y) Stein
          Association: RAD Data Comm.
        Publication
          Publisher: John Wiley
          Year: 2000
          Location: New York
          ISBN: 04712954
      DSP just for fun
    Elementary QED

Page 17: Markup Languages

Standard Generalized Markup Language

Problems with GML:
– No validating parser
– Not portable (between computer systems)

Solution: SGML
– ANSI [1978]
– ISO/IEC 8879 [1986] (Intl Org for Standardization / Intl Electrotechnical Commission)
– JTC1/SC34/WG1 (WG 1 of SubCommittee 34 of Joint Technical Committee 1)

For presentation: Document Style Semantics and Specification Language (DSSSL)

Page 18: Markup Languages

SGML - cont.

If SGML is so good why doesn’t anyone use it?

Complexity
– base standard >500 pages
– SGML is a metalanguage
– writing DTD is complex programming
– marked up text is hard to read
– DSSSL adds to complexity

Inflexibility - requires absolute conformity
– assumes only one correct way to markup
– constrains author to dictated structure
– not good at capturing author’s structure

Page 19: Markup Languages

HyperText Markup Language

CERN (particle physics institute in Switzerland) was an early Internet adopter
– used extensively for collaboration (articles have long author lists)

Major problems with format incompatibility
– only straight ASCII worked reliably

Tim Berners-Lee (computer specialist) defined requirements
– simplicity (couldn’t expect physicists to use SGML)
– freedom (didn’t need validation, let browser ignore bad markup)
– hypertext links (including to documents over the Internet)
– presentational markup (papers must look nice - authors were used to TeX)

Solution: HTML - a specific application of SGML (not a metalanguage)

Page 20: Markup Languages

HTML versions

HTML 1.0 (1989) Berners-Lee original CERN version
– hypertext, images, head+body structure, presentational markup

HTML 2.0 (1995) IETF standard - RFC 1866
– added lists, forms, etc.

HTML 3.2 (1997) W3C recommendation (incorporates Netscape extensions)
– added tables, applets, super/sub-scripts

HTML 4.0 (1997) W3C recommendation (and similar ISO/IEC 15445)
– minimizes presentational markup

XHTML 1.0 (2000) present W3C recommendation
– reformulates HTML in XML

Page 21: Markup Languages

HTML document structure

<HTML>

<HEAD>

global definitions such as

<TITLE>Web page title</TITLE>

</HEAD>

<BODY>

marked-up text

</BODY>

</HTML>
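
The head/body split above is easy to see from a program's point of view. The following is a small sketch using Python's standard html.parser; the class name StructureReader and its attributes are illustrative choices, not part of any HTML specification.

```python
from html.parser import HTMLParser

PAGE = """<HTML>
<HEAD>
<TITLE>Web page title</TITLE>
</HEAD>
<BODY>
marked-up text
</BODY>
</HTML>"""

class StructureReader(HTMLParser):
    """Collect the title and the body text of a simple HTML page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.title = ""
        self.body_text = ""

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if "title" in self.open_tags:
            self.title += data
        elif "body" in self.open_tags:
            self.body_text += data

reader = StructureReader()
reader.feed(PAGE)
reader.close()

print(reader.title)
print(reader.body_text.strip())
```

The simple stack used here only suits tidy pages in which every element is explicitly closed.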

Page 22: Markup Languages

Some HTML (body) elements

<H1>Level 1 Heading</H1>    Level 1 Heading
<H2>Level 2 Heading</H2>    Level 2 Heading
<H3>Level 3 Heading</H3>    Level 3 Heading
<EM> emphasized </EM>    emphasized
<P> Paragraph </P>    Paragraph
<A HREF=url>link</A>    link
<UL>
<LI> item 1 </LI>    . item 1
<LI> item 2 </LI>    . item 2
</UL>
<OL>
<LI> item 1 </LI>    1 item 1
<LI> item 2 </LI>    2 item 2
</OL>
<IMG SRC=url>

Page 23: Markup Languages

Problems with HTML

Presentational aspects have predominated
<B> bold text </B>
<BLINK> blinking text </BLINK>
<FONT COLOR="red"> red text </FONT>

Practically no descriptive markup
– Search engines are reduced to flat text search
– Search by topic only through keywords or portals

Not extensible
– Can’t add new tags
– Unknown tags ignored

Links are relatively simple
– Usually user action is required (except IMG)
– Only full document (with offset) linkable
– Link management is logistic nightmare

Page 24: Markup Languages

Not everything is HTML

Due to HTML limitations other tools are also used:

Multimedia extensions
– (dynamic) gif, jpg, …
– streaming audio

Common Gateway Interface
– generate HTML on-the-fly
– Perl, C, …

Server Push - Server Pull

JavaScript

Java

Page 25: Markup Languages

eXtensible Markup Language

Simplified (best parts of) SGML (subset of features)

Flexible content management tool

W3C recommendation(s)

Extensible - can add new elements (even without DTD)

Easy to create special purpose languages (with DTD/SCHEMA)

Includes HTML-like hypertext links

– and extensions (XLINK, XPOINTER)

The future of the web !

Page 26: Markup Languages

XML - an Example

<?xml version="1.0" standalone="yes"?>

<bibliography>

<book isbn="04712954">

<title>Digital Signal Processing: a Computer Science Perspective</title>

<author>Jonathan (Y) Stein</author>

<publisher>John Wiley and Sons</publisher>

</book>

<article>

<title>False Alarm Reduction for ASR and OCR</title>

<author>Yaakov Stein</author>

<proceedings>Tenth AICVNN Symposium</proceedings>

<pages>195-200</pages>

</article>

...

</bibliography>
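
Once the bibliography is in XML, a few lines of code can query it. The sketch below uses Python's standard xml.etree.ElementTree; it assumes the isbn attribute value is quoted (as XML requires) and leaves out the elided "..." entries. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: isbn is quoted, and the entries elided by "..." on the slide
# are simply omitted.
BIBLIOGRAPHY = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book isbn="04712954">
    <title>Digital Signal Processing: a Computer Science Perspective</title>
    <author>Jonathan (Y) Stein</author>
    <publisher>John Wiley and Sons</publisher>
  </book>
  <article>
    <title>False Alarm Reduction for ASR and OCR</title>
    <author>Yaakov Stein</author>
    <proceedings>Tenth AICVNN Symposium</proceedings>
    <pages>195-200</pages>
  </article>
</bibliography>"""

root = ET.fromstring(BIBLIOGRAPHY)

# One tuple per child of <bibliography>: (kind, title, author).
entries = [
    (entry.tag, entry.findtext("title"), entry.findtext("author"))
    for entry in root
]

# Attributes are read from the start tag of the element that carries them.
isbn = root.find("book").get("isbn")

for kind, title, author in entries:
    print(f"{kind}: {title} ({author})")
print("ISBN:", isbn)
```

Because the element names are descriptive, the same loop keeps working if new kinds of entries are added later.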

Page 27: Markup Languages

What can we do with an XML file?

– Check if well-formed
– Check if valid (against DTD or schema)
– Display “as-is” in browser
– Parse in special-purpose program (SAX, DOM)
– Process (XSL) to XML, HTML, etc.
– Display after processing
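
Three of the items above (the well-formedness check and the SAX and DOM parsing styles) can be sketched with Python's standard library alone. SAMPLE is a shortened stand-in for a real file, and is_well_formed and TitleCollector are illustrative names; validation against a DTD or schema is not covered because the standard library does not provide it.

```python
import xml.sax
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if text parses as XML (well-formedness only, no validation)."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False

class TitleCollector(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the file."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def endElement(self, name):
        if name == "title":
            self.in_title = False

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content

SAMPLE = "<bibliography><book><title>DSP-CSP</title></book></bibliography>"

# DOM style: load the whole tree into memory, then navigate it.
dom = minidom.parseString(SAMPLE)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]

# SAX style: stream the same document through the handler.
handler = TitleCollector()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)

print(is_well_formed(SAMPLE))
print(is_well_formed("<book><title>unclosed</book>"))
print(dom_titles, handler.titles)
```

DOM is convenient for small documents that need random access; SAX keeps memory use flat for large ones.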

Page 28: Markup Languages

Wireless Markup Language

The markup-language component of the Wireless Application Protocol (WAP)

WAP Forum (1997)
– Ericsson, Motorola, Nokia, Unwired Planet (phone.com)
– bring Internet to cellular phone users
– re-use fundamental Internet concepts (TCP/IP, http, html, javascript), but adapted to
  lower bandwidth
  smaller screen
  limited input facilities
  limited computational resources
– applications scale across transport options (GSM, TDMA, CDMA, 3G) and device types (mobile phones, personal assistants)

Page 29: Markup Languages

WML Philosophy

Defined using XML

Transported in compressed binary (for BW reduction)

Applications are modeled as decks of cards

Features:

Actions (OK, navigation, help) can be performed

Hyperlinks (like in HTML)

String variables

Timers

wbmp images (B&W)

Select boxes, forms (for input)

wmlscript (like javascript)

Page 30: Markup Languages

WML structure

<?xml version="1.0"?>
<!DOCTYPE wml …>
<wml>
<card>
<p>
text
</p>
<p>
text
</p>
</card>
<card>
...
</card>
</wml>
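
Because WML is defined using XML, a deck of cards can be produced with any XML toolkit. The sketch below builds one with Python's standard xml.etree.ElementTree. The build_deck helper, the card ids and the paragraph texts are all made up for illustration, and the XML declaration and DOCTYPE line are omitted.

```python
import xml.etree.ElementTree as ET

def build_deck(cards):
    """Build a <wml> deck from (card_id, paragraphs) pairs."""
    deck = ET.Element("wml")
    for card_id, paragraphs in cards:
        card = ET.SubElement(deck, "card", id=card_id)
        for text in paragraphs:
            ET.SubElement(card, "p").text = text
    return deck

# Hypothetical content: two cards in one deck.
deck = build_deck([
    ("welcome", ["Markup languages", "are fun!"]),
    ("about", ["Delivered as one deck"]),
])

markup = ET.tostring(deck, encoding="unicode")
card_ids = [card.get("id") for card in deck.findall("card")]

print(markup)
print(card_ids)
```

Sending several cards in one deck lets the phone move between them without another round trip over the low-bandwidth link.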

Page 31: Markup Languages

Some WML elements

<p> </p>    text
<a href=...> </a>    hyperlink (anchor)
<do> </do>    action
<go href=.../>    goto wml page
<timer>    trigger event (units = tenths of a second)
<input/>    input user text
<prev/>    return to previous page
$(…)    value of variable
<img src=… />    display image
<postfield name=… value=…/>    set variable
<select> <option> <option> </select>    select box

Page 32: Markup Languages

Some more markup languages

VML = Vector (graphics) Markup Language
VoiceXML
SSML = Speech Synthesis Markup Language
CPML = Call Policy Markup Language
DSML = Directory Services Markup Language
MathML = Mathematical Markup Language
CML = Chemical Markup Language
AML = Astronomical Markup Language
LegalXML
BSML = Bioinformatic Sequence Markup Language
GedML = Genealogical Data Markup Language
FinXML = Financial market Markup Language
ChessML
SDML = Signed Document Markup Language
RELML = Real Estate Listing Markup Language
etc. etc. etc. ...

Page 33: Markup Languages

Examples

HTML – html examples

XML – xml-file, xsl-file, xml

VML – vml-file

WML (get M3gate emulator) – wml examples