UVA MDST 3703: Texts and Models, 2012-09-11
Lecture 4: Texts and Models
Prof. Alvarado
MDST 3703/7703
11 September 2012
Review
• Posting “Hello, World!”
– Put file in the public_html directory of your UVA Home Directory
– Create a post and insert a link to this file
– Categorize as: 09.06: (S) HTML
• If you cannot get to your home directory, try uploading to http://homedir.virginia.edu
Some Quick Corrections
• Digital text is not necessary
– It’s an open question (i.e. do we have to have it?)
• Nelson did not conceive of “trails”; Bush did
• HTML is not the “first big idea” in the liberal arts; hypertext is (according to me)
• The idea that “text shapes knowledge” is not ancient, but relatively new
– Media determinism is a 20th century perspective
– Although Plato notes the effects of literacy in the Phaedrus
• Not everything can be translated into HTML
– i.e. HTML is not the richest framework for digital representation
Your Questions and Observations
• Is commercialization killing creativity?
– What is the relationship between how the web is organized economically and how it shapes expression? EFFECT OF SOCIAL ORGANIZATION
• What happens if the associations that someone makes are ‘off’ and illogical to others?
– Does it loosen the way logical connections can be made and argued? EFFECT ON LOGIC
Your Questions and Observations
• Computers in general still heavily rely on a hierarchical structure
– To what extent has rationalization occurred with the invention of hypertext?
• Do things lose value and meaning in exchange for digital coding?
– What is the effect of digitization on value?
• Hypertexts and links online can be distracting
– Non-linear thinking or mindless surfing?
Your Questions and Observations
• People are trying to create the same exact classroom experience online that exists in the physical classroom, which is impossible
– We need to rethink and restructure the online learning experience as a new and unique learning experience
• How can we keep hypertext from altering us too much?
• The beauty and the risk of an open source web
Practical Questions
• How can an HTML webpage on your own computer be found by the search bar but not be on the web?
– Your browser lives on your machine
– The protocol name tells it where to look
• I wondered if the picture from my computer would still show up if I opened the page from another computer?
• It is interesting to see how one little thing out of place can ruin the entire code
– Computers are stupid in that way
• Why should coders learn HTML?
– HTML is an interface language that can be easily generated from print statements in your code
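A minimal sketch of the point that the protocol name tells the browser where to look. The function name and both example URLs are made up for illustration; the only library call is Python's standard urllib.parse.urlparse.

```python
from urllib.parse import urlparse

def where_browser_looks(url):
    """Describe where a browser would look, based only on the URL's protocol (scheme)."""
    parts = urlparse(url)
    if parts.scheme == "file":
        # file: means the local disk of the machine the browser runs on
        return "local disk: " + parts.path
    if parts.scheme in ("http", "https"):
        # http/https mean a request to a web server named in the URL
        return "web server " + parts.netloc + ": " + parts.path
    return "unknown scheme: " + parts.scheme

# Hypothetical locations, for illustration only
print(where_browser_looks("file:///Users/student/hello.html"))
print(where_browser_looks("http://www.example.edu/~student/hello.html"))
```

A page opened through file: exists only on that one machine, which is why it shows up in your own browser without being on the web.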
What is HTML?
• HTML is not a programming language
– Programming languages express IF … THEN logic
– But it is code that obeys a syntax & gets interpreted
– And it is produced and consumed by programs
• HTML is a very general interface language
• HTML is written in XML, which we discuss today
– Technically called “XHTML”
– The original version was written in SGML
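One way to picture HTML as something "produced and consumed by programs" is the sketch below. It assumes nothing beyond the Python standard library; print_page and TagCounter are hypothetical names invented for this example, not part of any course material.

```python
import io
from contextlib import redirect_stdout
from html.parser import HTMLParser

def print_page(title, items):
    """Produce HTML with plain print statements."""
    print("<html><head><title>" + title + "</title></head><body>")
    print("<h1>" + title + "</h1>")
    print("<ul>")
    for item in items:
        print("<li>" + item + "</li>")
    print("</ul>")
    print("</body></html>")

class TagCounter(HTMLParser):
    """Consume HTML: count how many times each start tag occurs."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

# Capture what the producer prints, then hand it to the consumer.
buffer = io.StringIO()
with redirect_stdout(buffer):
    print_page("Hello, World!", ["texts", "models"])
html_text = buffer.getvalue()

counter = TagCounter()
counter.feed(html_text)
print(counter.counts)
```

Neither side contains any IF … THEN logic inside the HTML itself; the logic lives in the programs that write and read it.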
In general, don’t conflate HTML with hypertext or with digital representation in general
HTML is a language that generates a species of hypertext
which is, in turn, a species of digital representation
A provisional taxonomy
Is hypertext new?
[Study Bible]
1 = Mishna, the first major transcription of the oral law
2 = Gemara, analytical discussions
3 = Rashi, glossary
4 = Tosefos, additions
5 = Hananel, comments
6 = Eye of Justice, legal decisions
8 = Light of the Bible, references to Biblical quotations
9 = Bach's Annotations
10 = Gra's Annotations
[Talmud]
[Charrette]
[The Wasteland]
[Critical Edition]
[OED]
These are all examples of traditional texts
They exhibit “latent hypertext”
Landow
• The concept of hypertext parallels poststructuralist views of text
– Barthes, Foucault, Derrida, Kristeva, et al.
• In this view, a text is not, and has never been, a bounded, closed thing
– it is a network of signifiers that connect meanings across time and space …
Digital humanists have been concerned with encoding historical texts since at least 1949
Father Busa
• Creator of the Index Thomisticus
• Saw the computer as a solution to indexing the works of Aquinas in 1949
– 13,000,000 words
– “in” took 4 years
• Solution:
– Lemmatization
– Variations tagged as instances of a type
The complete works of Aquinas will be typed onto punch cards; the machines will then work through the words and produce a systematic index of every word St. Thomas used, together with the number of times it appears, where it appears, and the six words immediately preceding and following each appearance (to give the context). This will take the machines 8,125 hours; the same job would be likely to take one man a lifetime.
Time Magazine, 1956, “Religion: Sacred: Electronics”
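The index described in the quotation (every word, where it appears, and the six words before and after) together with lemmatization can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny LEMMAS table and the Latin-looking sample sentence are made up, and this is not a reconstruction of Busa's actual procedure.

```python
import re
from collections import defaultdict

# Hypothetical lemma table: each variant form is tagged as an instance of a type.
LEMMAS = {"est": "sum", "sunt": "sum", "esse": "sum"}

def concordance(text, window=6):
    """Index every word under its lemma, with position and surrounding context."""
    words = re.findall(r"[a-z]+", text.lower())
    index = defaultdict(list)
    for i, word in enumerate(words):
        lemma = LEMMAS.get(word, word)           # unknown words act as their own lemma
        before = words[max(0, i - window):i]     # up to six words preceding
        after = words[i + 1:i + 1 + window]      # up to six words following
        index[lemma].append((i, word, before, after))
    return index

# Made-up sample text, not a quotation from Aquinas
sample = "Deus est. Angeli sunt. Esse est bonum."
index = concordance(sample)

for position, form, before, after in index["sum"]:
    print(position, form, "|", " ".join(before), "|", " ".join(after))
print("occurrences of lemma 'sum':", len(index["sum"]))
```

The lemma table is what lets "est", "sunt", and "esse" be counted as one type rather than three unrelated strings.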
So, what is text?
Let’s look at some material examples
page o’ text
Real world text comes packaged in documents
How is text conveyed in a document?
A document is a material artifact
What is text?
Visual Signifiers
• Small caps
• Indentation
• Alignment
• Italics
• Space
All used to signify elements of text
Documents have three Levels: Content, Structure, Style
• Content
– TEXT, images, video clips, etc.
• Structure
– The organization of content into units (elements) and logical relationships (e.g. reading order)
• Style
– Screen and print layout
– Fonts, colors, etc.
Descriptive markup languages allow us to define the structure of documents for computational purposes
Theoretically, they do not specify layout or content
[PDF, Procedural Markup]
In contrast to procedural markup like PDF
So, how are docs structured?
Hierarchically …
(theoretically)
Document Elements and Structures
Play
– Act +
  • Scene +
    – Line +
Book
– Chapter +
  • Verse +
Letter
– Heading
  • Return Address
  • Date
  • Recipient Info
    – Name
    – Title
    – Address
– Content
  • Salutation
  • Paragraph +
  • Closing
These are all “trees”
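To make "tree" concrete, here is a small sketch that stores the Letter outline as nested data and walks it. The (name, children) tuple layout and the function names are choices made for this example only.

```python
# A letter as a tree: each node is (name, [children]).
letter = ("Letter", [
    ("Heading", [
        ("Return Address", []),
        ("Date", []),
        ("Recipient Info", [("Name", []), ("Title", []), ("Address", [])]),
    ]),
    ("Content", [
        ("Salutation", []),
        ("Paragraph", []),   # the "+" on the slide means one or more of these
        ("Closing", []),
    ]),
])

def outline(node, depth=0):
    """Return the tree as indented lines, visiting each parent before its children."""
    name, children = node
    lines = ["  " * depth + name]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines

def depth_of(node):
    """Number of levels from this node down to its deepest descendant."""
    _, children = node
    return 1 + max((depth_of(child) for child in children), default=0)

print("\n".join(outline(letter)))
print("levels:", depth_of(letter))
```

Every node has exactly one parent, which is the property that makes these structures trees rather than general networks.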
XML is a markup language
What is XML?
• Stands for eXtensible Markup Language
– Actually invented after the web
– A simplification of SGML, the language used to create HTML
– It specifies a set of rules for creating specialized markup languages such as HTML and TEI
• It is a simplified version of SGML
– Standard Generalized Markup Language
• SGML was invented in the early 1970s to wrest the control of documents from computer people who were taking over industries like law and accounting
XML looks like this
Notice how the element names reference units, not layout or style
Also markup for “in-line” elements
XML Premises
1. All documents are composed of elements.
2. Elements contain content.
3. Elements have no layout.
4. Elements are hierarchically ordered.
5. Elements are to be indicated by “markup” – tags that define the beginning and end of an element
XML Markup Rules
• Tags signify structural elements
• Three kinds of tag
– Start and End, e.g. <p> and </p>
– Singleton, e.g. <br />
• Start and singleton tags can have attributes
– Simple key/value pairs
– <div class="stanza" style="color:red;">
• Basic rules
– All attributes must be quoted
– All tags must nest (no overlaps!)
Documents in XML that meet these rules are “well formed”
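A quick way to see the two basic rules in action is to hand some snippets to an XML parser and watch which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree; the three sample strings and the helper name is_well_formed are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Quoted attribute, properly nested tags, one singleton tag
good = '<div class="stanza"><p>One line<br /></p></div>'
# Breaks "all attributes must be quoted"
unquoted = '<div class=stanza><p>One line</p></div>'
# Breaks "all tags must nest": <b> closes before the <i> opened inside it
overlap = '<p>Some <b>bold <i>text</b></i></p>'

for label, snippet in [("good", good), ("unquoted", unquoted), ("overlap", overlap)]:
    print(label, is_well_formed(snippet))
```

Well-formedness says nothing about which tags are allowed; it only checks the syntax rules listed above.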
XML also provides Document Types
• A Document Type Definition (DTD) defines a set of tags and rules for using them
– Specifies elements, attributes, and possible combinations
– E.g. in HTML, the ol and ul elements must contain li elements
• A DTD is just one kind of schema system used by XML
• Schemas express data models of/for texts
– TEI is a powerful way of describing primary source materials for scholars
• Documents that use a schema properly are called “valid”
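The difference between "well formed" and "valid" can be illustrated with a hand-rolled check of the ol/ul rule mentioned above. This is deliberately not a real DTD validator (Python's standard library does not validate against DTDs); ALLOWED_CHILDREN and rule_violations are hypothetical names, and the rule table is a simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified content rule in the spirit of a DTD:
# parent element name -> the child element names it may contain.
ALLOWED_CHILDREN = {"ol": {"li"}, "ul": {"li"}}

def rule_violations(xml_text):
    """List (parent, child) pairs that break the content rules."""
    root = ET.fromstring(xml_text)  # raises ParseError if not even well formed
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no rule recorded for this element
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems

valid_doc = "<body><ul><li>one</li><li>two</li></ul></body>"
invalid_doc = "<body><ul><p>one</p><li>two</li></ul></body>"

print(rule_violations(valid_doc))
print(rule_violations(invalid_doc))
```

Both sample documents are well formed; only the first also obeys the rule, which is the sense in which a document is "valid" against a schema.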
Originally, DTDs defined “genres” like business letter or mortgage form
They were later used to define more abstract models of textual content
XML is used everywhere
• HTML
– E.g. Embed codes
• TEI (Text Encoding Initiative)
• RSS
• Civilization IV
• Playlists (e.g. XSPF or “spiff”)
• Google Maps (KML)
A Look Again at HTML
• aka XHTML
– And now becoming HTML5
• An instance of XML (formerly SGML)
• An interface language
• Language of the World Wide Web
• Defined by a DTD that prescribes a specific set of elements and relations
HTML Document Structure
• Head
– Title
– [Directives]
• Body
– H1+
– H2+
  • P+
  • UL
    – LI
Basic Elements with Associated Tags
Element         Tags                                          Attributes
Paragraph       <p> ... </p>
Numbered List   <ol> <li> ... </li> </ol>
Bulleted List   <ul> <li> ... </li> </ul>
Table           <table> <tr> <td> ... </td> </tr> </table>
Anchor          <a> ... </a>                                  href, target
Image           <img/>                                        src, border
Object          <object> ... </object>
The Text Encoding Initiative created TEI to mark up scholarly documents
Mainly primary sources such as books and manuscripts
TEI
• The dominant language used to encode scholarly text
• The current room was the location of UVa’s EText Center
– World famous for text encoding
– Now part of the library and catalog
• Scholars create their own schema to match what they are interested in
Examples
• The TEI Header
– http://tbe.kantl.be/TBE/examples/TBED02v00.htm
• TEI Prose
– http://tbe.kantl.be/TBE/examples/TBED03v00.htm
• Find others at the TEI By Example Project
– http://tbe.kantl.be/TBE/
XML contains an implicit theory of text
What is it?
OHCO
• XML (and therefore HTML and TEI) imply a certain theory of text
– A text is an OHCO
• OHCO
– Ordered Hierarchy of Content Objects
• An OHCO is a kind of tree
– Elements follow each other in sequences
– Elements can contain other elements
What are the advantages of this view?
OHCO allows for easy processing
• Every element has a precise address in the text
– E.g. HTML/body/p[1]
• Texts can be described in the language of kinship
– Ancestors, parents, siblings, children, etc.
• Texts can be restructured and manipulated by known patterns and algorithms
– Traversing
– Pruning
– Cross-referencing
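The three advantages above (addresses, kinship, known algorithms) can each be shown on a toy document. The sketch assumes Python's standard xml.etree.ElementTree and an invented sample page; note that ElementTree paths are written relative to the root element, so the slide's HTML/body/p[1] becomes body/p[1] here.

```python
import xml.etree.ElementTree as ET

# Invented sample document
doc = ET.fromstring(
    "<html><head><title>Demo</title></head>"
    "<body><h1>Heading</h1><p>First</p><p>Second</p></body></html>"
)

# Address: the first <p> child of <body> (positions count from 1)
first_p = doc.find("body/p[1]")
print(first_p.text)

# Kinship: ElementTree elements do not remember their parent,
# so build a child -> parent map once.
parent_of = {child: parent for parent in doc.iter() for child in parent}

def ancestors(element):
    """Tag names from the parent up to the root."""
    names = []
    while element in parent_of:
        element = parent_of[element]
        names.append(element.tag)
    return names

siblings = [e.tag for e in parent_of[first_p] if e is not first_p]
print(ancestors(first_p), siblings)

# Traversing: visit every element in document order
traversal = [e.tag for e in doc.iter()]
print(traversal)

# Pruning: remove a subtree; the addresses of later elements shift
parent_of[first_p].remove(first_p)
print(doc.find("body/p[1]").text)
```

After pruning, the same address points at a different paragraph, a reminder that positional addresses describe the tree as it currently stands.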
What are the disadvantages of OHCO?
Logical vs. Physical Structure
Two common structures that overlap
Pages and Paragraphs
Solution 1: Split Elements
<page n="2">
. . .
<p id="foo">His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife</p>
</page>
<page n="3">
<p id="bar" prev_id="foo"> a very superior character to anything deserved by his own.</p>
. . .
</page>
Solution 2: Use “Milestones”
<p>His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife <pb n="3" /> a very superior character to anything deserved by his own.</p>
One structure gets backgrounded
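The backgrounded structure is not lost, though: it can be recovered by reading the milestones. A minimal sketch, assuming Python's standard xml.etree.ElementTree, straight-quoted attributes, and that the paragraph starts on page 2 (the starting page is an assumption taken from the split-element example); text_by_page is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

paragraph = ET.fromstring(
    '<p>His good looks and his rank had one fair claim on his attachment, '
    'since to them he must have owed a wife <pb n="3" /> a very superior '
    'character to anything deserved by his own.</p>'
)

def text_by_page(p, starting_page):
    """Recover the page structure from <pb/> milestones inside one paragraph."""
    pages = {starting_page: (p.text or "")}
    current = starting_page
    for child in p:
        if child.tag == "pb":
            # A page-break milestone: everything after it belongs to page n
            current = child.get("n")
            pages[current] = ""
        else:
            # Any other in-line element contributes its own text
            pages[current] += "".join(child.itertext())
        # Text that follows the child, up to the next child or the end of <p>
        pages[current] += child.tail or ""
    return {page: text.strip() for page, text in pages.items()}

pages = text_by_page(paragraph, "2")
for number, text in pages.items():
    print(number, "|", text)
```

The paragraph stays a single element in the tree, while the pages have to be rebuilt by a program; with split elements it is the other way around.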
Wittgenstein’s Manuscripts
What about this?
[Charrette]
The problem of overlap suggests the need for a richer set of tools
What tools do McCarty and Unsworth reference?
Tables
A database for Ovid
McCarty
• A different use of markup
– From document description to interpretation
– Creative “misuse”
• Reverse engineering a “grammar” of personification from a markup strategy
– Thickness = description (of text)
– Depth = explanation (of text by reference to grammar)
• Is forced to use tables in collaboration with markup
Thick description = Markup
Deep description = Tables
How to reconcile these tools?
A Proposed Model
• Texts are not documents
– Documents are media, Texts are messages
• Texts and documents are part of a system composed of “levels”
– They are effectively archaeological sites with stratigraphic layers
– Erasures are like cities building on top of each other
• Each level of the system is described by an appropriate set of tools
– Document structures → XML
– Textual structures, embedded ontologies → Tables
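One possible way to let markup and tables work together is to keep the document in XML and pull the interpretive tags out into a database table that can be queried. Everything specific in this sketch is an assumption: the <pers> tag, its entity and kind attributes, the table layout, and the sample lines are all invented and are not taken from McCarty's work or from Ovid. It uses only Python's standard sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical markup: <pers> marks a personification (made-up tag and attributes)
marked_up = ET.fromstring(
    '<text>'
    '<l n="1">Then <pers entity="Envy" kind="emotion">Envy</pers> gnawed.</l>'
    '<l n="2"><pers entity="Sleep" kind="state">Sleep</pers> rose slowly.</l>'
    '<l n="3">Again <pers entity="Envy" kind="emotion">she</pers> spoke.</l>'
    '</text>'
)

# Table level: one row per tagged instance
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE personification (line INTEGER, entity TEXT, kind TEXT, wording TEXT)"
)
for line in marked_up.iter("l"):
    for tag in line.iter("pers"):
        db.execute(
            "INSERT INTO personification VALUES (?, ?, ?, ?)",
            (int(line.get("n")), tag.get("entity"), tag.get("kind"), tag.text),
        )

# A question that is awkward to ask of the tree but easy to ask of the table
counts = db.execute(
    "SELECT entity, COUNT(*) FROM personification GROUP BY entity ORDER BY entity"
).fetchall()
print(counts)
```

The markup records where each instance sits in the document; the table is where instances are grouped and compared across the whole text.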
Basic Levels
• Document
– Physical objects (paper)
– Logical objects (defined by space, style, punctuation, etc.)
– Style and layout (also defined by space, color, etc.)
– Can have superimposed versions
• Text
– Sequences of characters
– Grammatical features
– Figures and poetic features
– Etc.