Text Encoding for Interchange: Myths and Realities
Yesterday's Information Tomorrow?
Lou Burnard, Oxford University Computing Services

Page 1

Text Encoding for Interchange: Myths and Realities

Yesterday's Information Tomorrow?

Lou Burnard

Oxford University Computing Services

Page 2

We live in interesting times

- Traditional academic goals
  - sharing and exchange of information
  - creation of re-usable resources
  - dual focus on teaching and research
- Digital technologies can contribute to these traditional goals, not subvert them

Page 3

Digital technologies offer opportunities…

- integration of disparate sources
  - texts, commentaries, sources, variations…
  - multimedia, manuscripts, transcriptions, metadata…
- a new way of preservation
  - media disappear, data remain
  - "multiplication beyond the reach of accident"
- a huge expansion of accessibility
  - quantitative
  - qualitative

Page 4

…and challenges

- integration of disparate sources
  - Different user communities have different -- and sometimes contradictory -- agendas and priorities
- a new way of preservation
  - The business model is unclear
  - The technical problems may be insuperable
- a huge expansion of accessibility
  - Depends on huge expansion of metadata provision
  - Both quantitative and qualitative expansion

Page 5

Academia offers the technical world:

- a range of interesting technical problems
- a new raison d'être: conservation of cultural heritage … and also of contemporary culture
- some tried and tested techniques
  - hermeneutics/semiotics
  - linguistic insights
  - robust and modular encoding schemes

Page 6

Resources

(Diagram with the labels: digital resources, encoding, analysis, abstract model)

Page 7

Making digital resources

- Texts are more than simply sequences of glyphs
  - They have structure and context
  - They also have multiple readings
- Encoding or markup provides a means of making such readings explicit
  - only that which is explicit can be digitally processed
- Not all resources are textual – but they all require reading.

Page 8

Quick recap: what’s markup for?

- Markup is a way of making explicit the distinctions we want a computer to make when it processes a string of bytes (aka a text)
- It’s a way of naming and identifying the parts of a document in a controlled way
- It’s (usually) more useful to mark up what things are than what they look like (or should look like)

Page 9

What’s the point of markup?

- To make explicit (for a machine) what is implicit (to a person)
- To add value by multiple annotations
- To facilitate re-use of digital resources
  - In different contexts
  - In different formats
  - For different audiences

Page 10

XML: what it is and why you should care

- XML is a generic markup language
- It simplifies the representation of structured data as linear character strings
- XML looks like HTML, except that:
  - XML is extensible
  - XML must be well-formed
  - XML can be validated
  - XML is application-, platform-, and vendor-independent
- XML empowers the content provider and facilitates data integration
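The "must be well-formed" point can be checked mechanically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the two sample strings are illustrative assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup with an unclosed <br> is tolerated by browsers,
# but an XML parser rejects it because the tags do not nest properly.
html_like = "<p>first line<br>second line</p>"
xml_like = "<p>first line<br/>second line</p>"

# Extensibility: an element name nobody has declared is still well-formed XML.
own_vocabulary = "<story><headline>Any name will do</headline></story>"

for sample in (html_like, xml_like, own_vocabulary):
    print(is_well_formed(sample), sample)
```

Well-formedness is only the syntactic floor: checking a document against a DTD or schema (validation) is a separate step that this sketch does not perform.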

Page 11

XML concepts: a review

- an XML object is composed of identifiable objects or elements
- elements have a type (name, or GI)
- a textual grammar (a schema) may be defined which specifies
  - what elements exist
  - how they may be combined
- elements also bear descriptive named attributes
- an XML object contains a single hierarchy of elements
- But elements may reference other elements in arbitrary ways
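The last two points (one hierarchy, but arbitrary cross-references) can be illustrated with a small sketch. The element names `text`, `ptr` and `note` and the attribute names `id` and `ref` are assumptions chosen for illustration; real vocabularies name these differently.

```python
import xml.etree.ElementTree as ET

# One hierarchy: <text> contains <p> and <note> as siblings.
# The <ptr> inside the paragraph points sideways at the note.
doc = ET.fromstring(
    '<text>'
    '<p id="p1">See the <ptr ref="n1"/> note.</p>'
    '<note id="n1">An editorial comment.</note>'
    '</text>'
)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in doc.iter() if el.get("id") is not None}

# Resolve each pointer to the element it names, wherever that sits in the tree.
targets = [by_id[ptr.get("ref")] for ptr in doc.iter("ptr")]
print([(t.tag, t.text) for t in targets])
```

The tree gives each element exactly one parent; the identifier index is what lets a processor follow links that cut across that tree.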

Page 12

For example:

- a newspaper story consists of metadata fields, followed by a headline, and a series of paragraphs, which may contain proper names or just text
- it also has an identifier and a language
- the metadata fields include a date, a source, and one or more keywords

Page 13

… like this

The Guardian, July 1, 1997, Empire, Hong Kong

A last hurrah and an empire closes down

With a clenched-jaw nod from the Prince of Wales, a last rendition of God Save the Queen, and a wind machine to keep the Union flag flying for a final 16 minutes of indoor pomp...

(Callout labels on the slide: story, metadata fields, headline, paragraph)

<story>
  <metadata>
    <source>The Guardian</source>
    <date>July 1, 1997</date>
    <keywords>
      <term>Empire</term>
      <term>Hong Kong</term>
    </keywords>
  </metadata>
  <body>
    <div>
      <head>A last hurrah and an empire closes down</head>
      <p>With a clenched-jaw nod from the <name>Prince of Wales</name>, a last rendition of <title>God Save the Queen</title>, and a wind machine to keep the Union flag flying for a final 16 minutes of indoor pomp</p>
      ...
    </div>
  </body>
</story>
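Once the story is marked up this way, its parts can be addressed by name rather than by position. A minimal sketch with Python's standard `xml.etree.ElementTree`, assuming a shortened copy of the story above (the paragraph text is abridged here):

```python
import xml.etree.ElementTree as ET

STORY = """<story>
  <metadata>
    <source>The Guardian</source>
    <date>July 1, 1997</date>
    <keywords><term>Empire</term><term>Hong Kong</term></keywords>
  </metadata>
  <body>
    <div>
      <head>A last hurrah and an empire closes down</head>
      <p>With a clenched-jaw nod from the <name>Prince of Wales</name>,
      a last rendition of <title>God Save the Queen</title> ...</p>
    </div>
  </body>
</story>"""

story = ET.fromstring(STORY)

# Metadata fields are found by their element names.
source = story.findtext("metadata/source")
keywords = [term.text for term in story.findall("metadata/keywords/term")]

# The headline and the proper names are equally explicit.
headline = story.findtext("body/div/head")
names = [name.text for name in story.iter("name")]

print(source, keywords, headline, names, sep="\n")
```

None of these queries would be possible against the plain-text version of the story without guessing from layout and capitalization.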

Page 14

… or like this

<documentLikeObject>
  <metadata> …</metadata>
  <sound URI="…"/>
  <image URI="…"/>
  <transcription URI="…"/>
</documentLikeObject>

Page 15

Encoding implies making decisions

- We may wish to allow for many views of what a resource “is”
  - but avoid “markup voodoo”
- Necessarily, there must be compromise
  - what is needed now
  - what might be needed some time

Page 16

The Beowulf Manuscript

MS Cotton Vitellius A xv

Page 17

Printed version (Wrenn, 1953)

Hwæt we Gar-Dena in gear-dagum
þeod-cyninga þrym gefrunon,
hu ða æþelingas ellen fremedon.

Oft Scyld Scefing sceaþena þreatum,
monegum mægþum meodo-setla ofteah;
egsode Eorle, syððan ærest wearð
feasceaft funden. He þæs frofre gebad…

Page 18

One encoding…

<lg>
  <l>Hwæt we Gar-Dena in gear-dagum</l>
  <l>þeod-cyninga þrym gefrunon,</l>
  <l>hu ða æþelingas ellen fremedon.</l>
</lg>
<lg>
  <l>Oft Scyld Scefing sceaþena þreatum,</l>
  <l>monegum mægþum meodo-setla ofteah; </l>
  <l>egsode Eorle, syððan ærest wearþ</l>
  <l>feasceaft funden. He þæs frofre gebad </l>
  ...

Page 19

… another encoding

<hi rend='caps'>&H;&Wyn;ÆT &Wyn;E GARDE</hi><lb/>
na in gear-dagum þeod cyninga<lb/>
þrym gefrunon hu ða æþelinga&s; ellen<lb/>
fremedon. oft Scyld Scefing sceaþe<add>na</add><lb/>
þreatum, moneg<expan>um</expan> mægþum meodo-setla <lb/>
of<damage desc='blot'/>teah egsode <sic corr='Eorle'>eorl</sic> syððan ærest wearþ<lb/>
feasceaft funden...
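A transcription like this can feed more than one presentation. The sketch below derives a manuscript-faithful view and a regularized reading view from the same markup. It uses a shortened fragment wrapped in an `<ab>` element, leaves out the character entities (which would need declarations to parse), and the function name `view` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    '<ab>moneg<expan>um</expan> mægþum meodo-setla <lb/>'
    'of<damage desc="blot"/>teah egsode '
    '<sic corr="Eorle">eorl</sic> syððan ærest</ab>'
)

def view(el, regularize):
    """Flatten an element to plain text.

    regularize=True  : use the editor's correction, ignore manuscript line breaks
    regularize=False : keep the scribe's reading and the manuscript line breaks
    """
    parts = [el.text or ""]
    for child in el:
        if child.tag == "sic" and regularize:
            parts.append(child.get("corr", ""))
        elif child.tag == "lb":
            parts.append("" if regularize else "\n")
        else:
            # <expan>, <damage>, uncorrected <sic>: keep whatever text they hold.
            parts.append(view(child, regularize))
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(FRAGMENT)
reading = view(root, regularize=True)
diplomatic = view(root, regularize=False)
print(reading)
print(diplomatic)
```

Both views silently expand the abbreviation; a stricter diplomatic view could drop the content of `<expan>` instead. The point is only that the choice is made at output time, not at encoding time.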

Page 20

…yet another encoding

<figure><!-- detailed description of digital image --></figure>
<sourceDesc><!-- detailed description of original source --></sourceDesc>
<publicationStmt><!-- access control metadata --></publicationStmt>
<classCode><!-- descriptive metadata --></classCode>
<!-- etc -->

Page 21

Where is XML used?

- in well-defined application areas
  - b2b
  - news stories
  - chemical modelling
- by well-defined user communities
  - EAD
  - electronic editors

Page 22

XML: the very next thing

- XML defines a simple syntax for encoding linearized hierarchic structures which is extensible and verifiable
- XML is being taken up enthusiastically as a way of
  - adding semantics to the web (RDF, Topic Maps)
  - standardizing application interfaces (SMIL, SOAP)
- … even though XML is semantics-free

Page 23

Reality check: what (exactly) is markup?

markup makes explicit a theory about some aspect of a document

some theories are more useful or generalizable than others

… so no markup language can reasonably claim to be exhaustive

… so are we doomed to a further confusion of tongues?

Page 24

The risks of fragmentation

- If we have…
  - historical records using a “historical markup language”
  - linguistic data using a “linguistic markup language”
  - illustrations using a “visual markup language”
- How will we integrate these resources?
- Why did we get into this business?

Page 25

Once upon a time long ago in a far away galaxy…

Page 26

The Text Encoding Initiative

1987: Vassar College Conference

Page 27

We’ve been here before…

Loomings
“CALL me Ishmael. Some years ago --- never mind how long precisely --- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world”

|chap1
<C 1> Loomings
\chapter
\chapter[1]{Loomings}
:h1.1. Loomings
MOBY001001LOOMINGS
|C1
.chapter Loomings
.cp;.sp 6 a;.ce .bd 1. Loomings
~x

Good news: there is software capable of translating amongst 400 different encoding formats

Bad news: there ARE 400 different encoding formats…

Page 28

We’ve been here before…

Loomings
“CALL me Ishmael. Some years ago --- never mind how long precisely --- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world”

|chap1
<C 1> Loomings
\chapter
\chapter[1]{Loomings}
:h1.1. Loomings
MOBY001001LOOMINGS
|C1
.chapter Loomings
.cp;.sp 6 a;.ce .bd 1. Loomings
~x

Good news: you can get a program that converts among 300 file formats

Bad news: there ARE 400 different encoding formats…

Page 29

Information Interchange (1)

(Diagram: formats A, B, C, D and E, linked pairwise)

20 translations required (n² - n)

Page 30

Information Interchange (2)

(Diagram: formats A, B, C, D and E, each linked to a Common Interchange Standard)

10 translations required (2n)
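The arithmetic behind the two diagrams can be sketched as follows. The function names are illustrative assumptions; the formulas are the ones on the slides, with one translator counted per direction.

```python
def pairwise_translators(n: int) -> int:
    """Every format translated directly into every other: n * (n - 1) = n^2 - n."""
    return n * n - n

def hub_translators(n: int) -> int:
    """Every format translated to and from one common standard: 2 * n."""
    return 2 * n

# n = 5 matches the slides; 400 is the number of formats quoted earlier in the talk.
for n in (5, 10, 400):
    print(n, pairwise_translators(n), hub_translators(n))
```

The gap widens quadratically: the two approaches cost the same only at three formats, and every format added after that makes the case for a common interchange standard stronger.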

Page 31

The T E what?

- Originally, a research project within the humanities
  - Sponsored by ALLC, ACH, ACL
  - Funded 1990-1994 by US NEH, EU LE Programme et al
- Major influences
  - digital libraries and text collections
  - language corpora
  - scholarly datasets
- Now an international membership consortium
  - incorporated Jan 2001

http://www.tei-c.org

Page 32

Goals of the TEI

- interchange and integration of scholarly data
- support for all texts, in all languages, from all periods
- guidance for the perplexed: what to encode
  - hence, a user-driven codification of existing best practice
- assistance for the specialist: how to encode
  - hence, a loose framework into which unpredictable extensions can be fitted

Page 33

Legacy of the TEI

- The TEI Guidelines: a comprehensive way of looking at what texts are and how to organize them
- Expressed as a very large set of c. 600 element definitions, tied into a rather loose DTD
- A mechanism for customization and specialization of the above
- Tutorials, Guides, codification of shared practice etc.

Page 34

Who uses TEI?

- digital libraries and text collections
  - HTI, UVA, OTA, BiMiCesa, CRILet ...
- linguistic corpora
  - EAGLES, BNC, MULTEX, Silfide …
- research projects
  - Women Writers Project, Model Editions Partnership, Lorelei Projekt, …
- publishers – both web and otherwise
  - NLR, OUCS, …

http://www.tei-c.org/Applications/

Page 35

Current TEI activity (1)

- Annual Members Meetings (since Nov 2001)
- Annually elected TEI Technical Council (since January 2002)
- XML revision (P4X) published in print, June 2002
- Project on SGML-XML conversion (completed 2003)
- Next major revision (TEI P5) due mid 2004
- Special Interest Groups set up end 2003

http://www.tei-c.org/Services/order/

Page 36

TEI P5

- New work groups on
  - character set issues: convergence with Unicode
  - manuscript description
  - hyperlinking/W3C standards
- Work in progress
  - SGML/XML conversion
  - Software usability and tools
  - Training
- Funding problems and opportunities

Page 37

The scope of “intelligent” markup

- orthographic transcription
- links to digital recordings, images…
- proper nouns, dates, times, etc.
- part-of-speech and morphological tagging
- syntactic analysis
- discourse analysis
- cross references to other material on the topic
- meta-textual status (correction etc)
- editorial commentary and annotation
- etc., etc., etc.

How can all these things co-exist?

Page 38

Frequently Answered Questions

- re-use of common text for multiple purposes
  - scholarly edition, school edition, speaking edition
- alignment of transcription with
  - sound
  - image
- multiple annotations of a common text
  - additive
  - alternative
- authoring!

Page 39

Fortunately, the TEI was designed for scholarly use

- all texts are alike -- but every text is different
- multiple perspectives are the norm
  - not one size fits all but who would you like to be today?
  - one construct, many views
  - each view a selection from the whole

Page 40

The TEI solution: modularization

- a (very) large number of element and attribute definitions
  - organized as tagsets aka modules (core, base, additional, or auxiliary)
  - grouped into classes
  - combined according to a defined procedure (the pizza model)
  - which permits controlled extension and modification

http://www.tei-c.org/pizza.html

Page 41

What use is a DTD?

- A DTD is very useful at data preparation time (e.g. to enforce consistency), but redundant at other times
  - If a document is well-formed, its DTD can be (almost) entirely recreated from it.
- DTDs don't allow you to specify much by way of content validation
- Unlike other parts of the XML family, DTDs are not expressed in XML
- The XML Schema Language addresses these issues, and may eventually replace the DTD entirely... maybe.
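The claim that a DTD can be "(almost) entirely recreated" from a well-formed document can be sketched as follows. This is a rough illustration, not a DTD generator: the function name `infer_declarations` and the sample document are assumptions, and the result records only which children and attributes were observed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def infer_declarations(root):
    """Collect, per element type, the child element types and attribute names seen."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    for el in root.iter():
        # Touching children[el.tag] registers the element even if it has no children.
        children[el.tag].update(child.tag for child in el)
        attributes[el.tag].update(el.attrib)
    return children, attributes

doc = ET.fromstring(
    '<story id="s1"><metadata><source>The Guardian</source></metadata>'
    '<body><p>Text with a <name>proper name</name>.</p><p>More text.</p></body></story>'
)

children, attributes = infer_declarations(doc)
for tag in sorted(children):
    print(tag, sorted(children[tag]), sorted(attributes[tag]))
```

The "almost" matters: an instance shows what did occur, not what may occur, so ordering rules, optionality, and elements that happen to be absent from the sample cannot be recovered this way.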

Page 42

DTD: what does it really mean?

- To get the best out of XML, you need two kinds of DTD:
  - document type declaration: elements, attributes, entities, notations (syntactic constraints)
  - document type definition: usage and meaning constraints on the foregoing
- Published specifications (if you can find them) for XML DTDs usually combine the two, hence they lack modularity
- The TEI model is to provide definitions which can be fitted to multiple declarations

Page 43

TEI as an interlingua

- TEI defines generic classes of textual object
  - <div>, <ab>, <seg> rather than chapter, paragraph, metaphor
- Modification allows these to be more tightly constrained without loss of generality
  - <metaphor TEIform="seg">fresh ideas</metaphor>
  - Cf architectural forms
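How a processor might exploit the `TEIform` attribute can be sketched as below. Only the attribute name and the `<metaphor>` example come from the slide; the helper `generic_name` and the sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def generic_name(el):
    """Return the generic TEI element name that an element stands for."""
    return el.get("TEIform", el.tag)

doc = ET.fromstring(
    '<div><seg>plain segment</seg> and '
    '<metaphor TEIform="seg">fresh ideas</metaphor></div>'
)

# Software that knows nothing about <metaphor> can still treat it as a <seg>.
segments = [el.text for el in doc.iter() if generic_name(el) == "seg"]
print(segments)
```

The local project keeps its more specific element name, while generic tools fall back on the declared TEI form, which is what makes the customization safe for interchange.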

Page 44

SGML, XML, and …

- The TEI originally used SGML
  - for pragmatic reasons
    - existing standard, widely used
  - for theoretical reasons
    - declarative, verifiable
    - expressive power adequate to needs of research
- It is now re-expressed in XML…

Page 45

… after XML?

- In fact, the TEI expresses an abstract model, which can be represented in SGML or XML
  - A TEI DTD can be constructed in either.
- Work on generating Relax or W3C Schemas from the same source is ongoing
  - This will enable us to implement better TEI validation

Page 46

Why bother?

- The TEI is a well-known reference point
- Using the TEI enables
  - sharing of data and resources
  - shared modular software development
  - lower learning curve and reduced training costs
- The TEI is stable, rigorous, and well-documented
- The TEI is also flexible, customizable, and extensible in documented ways
- Its architectural approach offers a good practical compromise between generality and implementability

Page 47

Transmitting the hermeneutic

- scholarship depends on continuity
- it is not enough to preserve the bytes of an encoding
- there must also be a continuity of comprehension: the encoding must be self-descriptive

Page 48

The wider picture

- TEI is not just about exchanging data between machines
  - It's also about communication between humans
- TEI/XML is not just about the web
  - It's about information in general
- TEI is not just about technology
  - It's about the relationship between content creators and software developers
  - It’s also about scholarship