Texts 2: Markup languages, software for manipulating text
László Kálmán¹, Csaba Oravecz¹, Péter Szigetvári²
¹ Research Institute for Linguistics, Hungarian Academy of Sciences
² Department of English Linguistics, Eötvös Loránd University
Lecture 4 / 3 Oct 2007
Kálmán, Oravecz, Szigetvári Texts 2: Markup languages, software
abstract
this lecture tells you about
• ways of formatting electronic text
• important software for creating and manipulating electronic text
• the features and functions of such software
markup languages 1: ∗MLs
SGML (Standard Generalized Markup Language; ISO 8879)
a metalanguage used to define specific markup schemes (systems of tags)
HTML (Hypertext Markup Language)
an application of SGML, used for web documents
XML (Extensible Markup Language)
a simplified subset of SGML
XHTML (Extensible Hypertext Markup Language)
an application of XML, used for web documents (HTML : SGML = XHTML : XML)
a chunk of SGML code
Figure: part of the source for the entry for quiz in Ország–Magay’s English–Hungarian dictionary
the entry for quiz printed
Figure: the printed entry for quiz in Ország–Magay’s English–Hungarian dictionary
example HTML code. . .
Figure: a sample HTML source file. . .
. . . shown in Firefox and Opera
. . . shown in w3m
Figure: . . . and its output in w3m (a CLI browser)
markup languages 2: lightweight
• less elaborate systems used for specific purposes, e.g.,
BBCode (Bulletin Board Code)
used on bulletin boards, like the SEAS Forum (btw. have you joined yet?); contains only some formatting (italics, boldface, colour, size), hyperlink tags, and emoticons (smilies)
Wikitext
used on Wiki sites; some formatting, links to other Wiki pages, external links, pictures, maps
• Wikitext is copiously documented in the relevant Wiki pages; BBCode is also usually explained in forum FAQs
editing BBCode
Figure: editing BBCode
the code itself. . .
The [b]sole[/b] aim of [i]this[/i] [u]message[/u]
is to [color=orange]exemplify[/color]
[code]BBCode[/code] for students of
[url=http://budling.nytud.hu/itcourse]this
course[/url].
[list=a][*]first item[*]second item[/list]
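To see how little machinery such a lightweight language needs, here is a minimal sketch of a BBCode-to-HTML converter. It is an illustration only, not how any particular forum software works: the function name `bbcode_to_html` and the choice of HTML output for each tag are assumptions, and only the bold, italic, underline, code, colour and URL tags from the message above are handled (lists and emoticons are left out).

```python
import html
import re

# Simple paired tags and their HTML counterparts (an illustrative subset).
SIMPLE_TAGS = {"b": "b", "i": "i", "u": "u", "code": "code"}


def bbcode_to_html(text: str) -> str:
    """Convert a small subset of BBCode to HTML (hypothetical helper)."""
    out = html.escape(text, quote=False)  # neutralise literal <, > and &
    for bb, tag in SIMPLE_TAGS.items():
        out = re.sub(rf"\[{bb}\](.*?)\[/{bb}\]", rf"<{tag}>\1</{tag}>",
                     out, flags=re.S)
    out = re.sub(r"\[color=(\w+)\](.*?)\[/color\]",
                 r'<span style="color:\1">\2</span>', out, flags=re.S)
    out = re.sub(r"\[url=([^\]]+)\](.*?)\[/url\]",
                 r'<a href="\1">\2</a>', out, flags=re.S)
    return out


sample = ("The [b]sole[/b] aim of [i]this[/i] [u]message[/u] is to "
          "[color=orange]exemplify[/color] [code]BBCode[/code].")
print(bbcode_to_html(sample))
```

Escaping the text first means that any angle brackets typed by the user are displayed literally rather than interpreted as HTML, which is one reason forums offer BBCode instead of raw HTML.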
forum post
Figure: . . . and the result
Wikipedia: Polcz Alaine
Figure: Wikipedia’s entry for Polcz Alaine
Wikipedia: editing Polcz Alaine
Figure: editing Wikipedia’s entry for Polcz Alaine
markup languages 3: TeX & co.
used for professional typesetting
TeX
a typesetting system created by Donald Knuth to typeset the second edition of the second volume of his book The Art of Computer Programming
LaTeX
a set of macros built upon the above to ease the user’s life
other TeXes
there are many other types of TeXes, e.g., AMS-TeX
TeX source . . .
Figure: plain TeX source file
. . . and the result
left margin
centre of line
right margin
one-third of line width
1) item 1
2) item 2
2/a) subitem within item 2
3) item 3
Figure: output of the above TeX source file
LaTeX source . . .
Figure: LaTeX source file (of the above output)
machine-generated text formats
RTF (Rich Text Format)
Microsoft’s proprietary platform-independent document format; human readable, but rarely edited directly
PostScript
a page description and programming language, the de facto standard for printing; human readable, editable
PDF (Portable Document Format)
Adobe’s proprietary document format, based on PostScript, encoding the exact look of the document; the most widespread format for publishing heavily formatted documents on the web; usually non-human-readable, compressed
RTF source and output
the RTF source. . .
{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0
Hello!\par
This is some {\b bold} text.\par
}
. . . outputs the following
Hello!
This is some bold text.
PostScript fragment
Figure: the first few of about 2M lines from the PostScript version of the present slide show
Portable Document Format fragment
Figure: the first few lines of the PDF version of the present slide show
types of markup
procedural markup
uses explicit instructions, like
• set this in italics: \textit{this}
• skip a line, set the text in larger boldface font, skip a line again: <br><p><font size=+1><b>Section title</b></font></p><br>
logical/descriptive/semantic/generic markup
• emphasize: \emph{this}
• typeset a section heading: <h1>Title of section</h1>
comparison of markup types
logical
• depends heavily on later interpretation (esp. in web documents)
• interpretation of markup has to be customized
• flexible on format: e.g., \emph{} produces italics in a roman context, and roman in an italic context
• style easily modifiable later
procedural
• firmer control over output
• less customization necessary
• (often) premature stance on format
• style modifiable only by extensive replacement of markup
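The point about later modifiability can be sketched in a few lines of Python. The sketch assumes a hypothetical document model (a list of `(kind, text)` pairs) and a hypothetical `STYLE_SHEET` dictionary; neither comes from any real formatter. The logical names are interpreted only at rendering time, so one changed entry restyles every element of that kind.

```python
# A style sheet maps logical element names to concrete (procedural) formatting.
# Both the names and the HTML chosen here are illustrative assumptions.
STYLE_SHEET = {
    "emph": ("<i>", "</i>"),
    "heading": ("<p><font size=+1><b>", "</b></font></p>"),
}


def render(elements, style_sheet):
    """Render (kind, text) pairs; kinds without a style pass through unchanged."""
    parts = []
    for kind, text in elements:
        opening, closing = style_sheet.get(kind, ("", ""))
        parts.append(f"{opening}{text}{closing}")
    return "".join(parts)


document = [
    ("heading", "Section title"),
    ("text", "Please "),
    ("emph", "read"),
    ("text", " this."),
]

print(render(document, STYLE_SHEET))

# Restyle emphasis everywhere without touching the document itself.
bold_emphasis = {**STYLE_SHEET, "emph": ("<b>", "</b>")}
print(render(document, bold_emphasis))
```

With procedural markup the `<i>` tags would be stored in the document, and the same change would require replacing every occurrence by hand.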
types of editors
editors
• hex editor (for experts): shows “character” codes
• line editor (spartan, obsolete): can edit only one line of the text at a time, e.g., ed (Unix/Linux), Edlin (MS-DOS, Windows)
• text editor (formatting by markup), e.g., vi, Emacs, Notepad, SimpleText
• word processor (usually WYSIWYG), e.g., Microsoft Word, AbiWord, Open Office Writer
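What a hex editor displays can be approximated with a short Python sketch. The function name `hex_dump` and the three-column layout (offset, codes, printable text) are assumptions chosen to resemble common hex editors, not the behaviour of any specific program.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a hex-editor style listing: offset, hex codes, printable text."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        codes = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {codes:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)


# In UTF-8 the letter à occupies two bytes (c3 a0), which the dump makes visible.
print(hex_dump("vis-à-vis\n".encode("utf-8")))
```

This is why “character” is in scare quotes above: the editor shows the codes of bytes, and one character may correspond to several of them.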
types of editors: a hex editor
Figure: Screenshot of Hex Editor (HHD Software) (from Wikipedia)
types of editors: a line editor
Figure: Screenshot of a GNU ed session
types of editors: a text editor
Figure: Screenshot of Emacs
types of editors: a word processor
Figure: Screenshot of Open Office Writer
borderline case 1
Figure: “WYSIWYG” in Emacs, a text editor
borderline case 2
Figure: markup in Open Office Writer, a WYSIWYG word processor
syntax highlighting 1
Figure: syntax highlighting for HTML in Emacs
syntax highlighting 2
Figure: syntax highlighting for shell scripts in Emacs
syntax highlighting 3
Figure: syntax highlighting for Perl in Emacs
text formatters
what’s that?
text formatters are programmes that feed on marked-up text, and produce formatted output from it, e.g.,
HTML INPUT (EATEN BY BROWSERS)       TeX INPUT            OUTPUT
<em>vis-à-vis</em>                   {\it vis-\`a-vis}    vis-à-vis (in italics)
2<sup>2</sup>=<strong>4</strong>     $2^2={\bf 4}$        2² = 4 (with a boldface 4)
some common text formatters
• RUNOFF (1964), nroff, troff, groff
• TeX, LaTeX
• web browsers contain an HTML formatter to be able to display HTML source files
ambiguous terminology
N.B. words like TeX are used ambiguously: both for the markup language and for the text formatting programme; this ambiguity does not normally cause any misunderstanding
a “definition” of word processor
a word processor
is a text editor and formatter in one
using a word processor as a text editor only
editing a file and saving it as plain text (.txt)
using a word processor as a text formatter only
opening a file and saving it as PDF (or PostScript)
concepts 1a
• opening/reading/retrieving a file: copying (part of) a file into the memory (this part of the memory is called the buffer), and usually displaying (part of) it on the screen, so that it can be read or modified by the user
• opening a new file: presenting an empty buffer so that a file can be created from scratch
• saving/writing a file: writing the contents of the buffer to the disk (this usually destroys the original file, but see version control in week 13)
• saving a file as: writing the contents of the buffer to the disk with a file name different from the original, or in a format different from the original (or from what the editor defaults to)
concepts 1b
• auto(matic) saving: regular automatic saving of the contents of the buffer to minimize data loss in case of power failure
• recovering a file: restoring the contents of an unsaved file from the automatically saved version
• quitting: in some editors quitting leaves the original files intact, i.e., all the changes you made in the session are lost (except if there was autosaving during the session)
• exiting: modern editors usually ask if unsaved buffers ought to be written to the disk; sometimes this does not happen if you shut down the computer: to be on the safe side you had better always save buffers manually and perhaps close the editor before shutting down the computer
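The relationship between the file on disk, the buffer and the autosaved copy can be sketched as follows. The `Buffer` class and its method names are hypothetical, and the `#name#` naming of the autosave file is an assumption loosely modelled on the convention Emacs uses; real editors differ in the details.

```python
import os
import tempfile


class Buffer:
    """An in-memory copy of a file (a hypothetical, minimal model)."""

    def __init__(self, path: str):
        self.path = path
        self.text = ""                     # a new file starts as an empty buffer
        if os.path.exists(path):           # opening copies the file into memory
            with open(path, encoding="utf-8") as f:
                self.text = f.read()

    @property
    def autosave_path(self) -> str:
        # Assumed naming scheme for the safety copy: #name# next to the file.
        directory, name = os.path.split(self.path)
        return os.path.join(directory, f"#{name}#")

    def save(self) -> None:
        """Write the buffer to disk, replacing the original file."""
        with open(self.path, "w", encoding="utf-8") as f:
            f.write(self.text)
        if os.path.exists(self.autosave_path):
            os.remove(self.autosave_path)  # the safety copy is no longer needed

    def save_as(self, new_path: str) -> None:
        """Write the buffer under a different name; the original stays intact."""
        self.path = new_path
        self.save()

    def autosave(self) -> None:
        """Write a safety copy without touching the original file."""
        with open(self.autosave_path, "w", encoding="utf-8") as f:
            f.write(self.text)

    def recover(self) -> None:
        """Restore the buffer from the automatically saved copy."""
        with open(self.autosave_path, encoding="utf-8") as f:
            self.text = f.read()


workdir = tempfile.mkdtemp()
buf = Buffer(os.path.join(workdir, "notes.txt"))  # opening a new file
buf.text = "first draft"
buf.save()
buf.text = "second draft"
buf.autosave()   # the file on disk still contains "first draft"
```

If the editor were killed at this point, reopening `notes.txt` would show the first draft, and calling `recover()` on the new buffer would bring back the second.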
concepts 2
• cursor: an underscore (_), vertical line (|), or rectangular box (█), which indicates the point where text will be entered if you begin to type; it may blink; often its shape is different depending on input mode (insert or overwrite)
• insert mode: typed text will be inserted, pushing out characters to the right (left in right-to-left scripts)
• overwrite mode: typed text will overwrite characters to the right (left in right-to-left scripts)
• mark: another point in the text; the region between the cursor and the mark is selected for some operation
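The difference between the two input modes, and the role of the mark, can be sketched on a plain Python string. The function names `type_text` and `region` are hypothetical, and positions are simple character offsets into the buffer.

```python
def type_text(buffer: str, cursor: int, typed: str, overwrite: bool = False):
    """Return (new_buffer, new_cursor) after typing at the cursor position."""
    if overwrite:
        # Typed characters replace those that follow the cursor.
        new_buffer = buffer[:cursor] + typed + buffer[cursor + len(typed):]
    else:
        # Typed characters push the rest of the text along.
        new_buffer = buffer[:cursor] + typed + buffer[cursor:]
    return new_buffer, cursor + len(typed)


def region(buffer: str, cursor: int, mark: int) -> str:
    """The selected text lies between the mark and the cursor, in either order."""
    start, end = sorted((cursor, mark))
    return buffer[start:end]


print(type_text("markup", 4, "ED"))                  # ('markEDup', 6)
print(type_text("markup", 4, "ED", overwrite=True))  # ('markED', 6)
print(region("markup languages", cursor=6, mark=0))  # markup
```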
concepts 2: cursor in insert mode
Figure: cursor in Open Office Writer in insert mode
concepts 2: cursor in overwrite mode
Figure: cursor in Open Office Writer in overwrite mode
concepts 2: selected text in Open Office Writer
Figure: selected text in Open Office Writer
concepts 2: selected text in Emacs
Figure: selected text in Emacs: mark is in line 895, column 0, cursor in line 896, column 11
concepts 3
• cutting/killing:¹ removing the selected region from the text and putting it in the clipboard/kill ring
• copying: copying the selected region to the clipboard/kill ring
• pasting/yanking: copying the contents of the clipboard/kill ring into the buffer
¹ Terms on this page are given as MS dialect/Emacs dialect.
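The three operations can be sketched with a small Python class. `KillRing` and its method names are hypothetical, and the capacity of three entries is an arbitrary assumption; the sketch only illustrates that a kill ring, unlike a simple clipboard, keeps several earlier entries available.

```python
from collections import deque


class KillRing:
    """A clipboard that remembers several entries, newest first (a sketch)."""

    def __init__(self, size: int = 3):
        self.entries = deque(maxlen=size)  # the oldest entry falls off the end

    def kill(self, buffer: str, start: int, end: int) -> str:
        """Cut: store the region and return the buffer without it."""
        self.entries.appendleft(buffer[start:end])
        return buffer[:start] + buffer[end:]

    def copy(self, buffer: str, start: int, end: int) -> None:
        """Copy: store the region and leave the buffer alone."""
        self.entries.appendleft(buffer[start:end])

    def yank(self, buffer: str, cursor: int, index: int = 0) -> str:
        """Paste: insert a stored entry (the newest by default) at the cursor."""
        return buffer[:cursor] + self.entries[index] + buffer[cursor:]


ring = KillRing()
text = "cut this out"
text = ring.kill(text, 3, 8)   # removes " this" and stores it
print(text)                    # cut out
print(ring.yank(text, 7))      # cut out this
```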
concepts 4
• find/search: looking for a given pattern in the buffer
• overwrapped search: looking for occurrences of the pattern from the beginning of the file after the end of the file has been reached (or from the end in the case of reverse/backward searching)
• incremental search: looking for a given pattern on the fly, i.e., updating the match as each character of the pattern is typed
• replace: removing the first given pattern from the buffer and inserting the second given pattern in its place
regular expressions
offer a very powerful tool for replacing patterns (more on them in week 10)
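These search concepts can be sketched with Python's string methods and its standard `re` module. The helper names `find` and `incremental_search` are hypothetical and only forward searching is modelled; `str.find`, `str.replace` and `re.sub` are the standard library calls.

```python
import re


def find(buffer: str, pattern: str, start: int = 0, wrap: bool = True) -> int:
    """Return the index of the next match at or after start, or -1 if none."""
    position = buffer.find(pattern, start)
    if position == -1 and wrap:
        # End of buffer reached: carry on from the beginning.
        position = buffer.find(pattern)
    return position


def incremental_search(buffer: str, keystrokes: str):
    """Yield (pattern so far, match position) after each keystroke."""
    for length in range(1, len(keystrokes) + 1):
        yield keystrokes[:length], buffer.find(keystrokes[:length])


text = "a pattern, a patter, and more patterns"
print(find(text, "patter", start=30))             # 30
print(find(text, "and", start=30))                # 21 (found after wrapping)
print(list(incremental_search(text, "and")))      # [('a', 0), ('an', 21), ('and', 21)]

# Plain replace: the first occurrence of one string is swapped for another.
print(text.replace("patter,", "pattern,", 1))

# A regular expression can say "patter as a whole word", which a plain
# search cannot: pattern and patterns are left alone.
print(re.sub(r"\bpatter\b", "pattern", text))
```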
text justification
types of justification
centred
flush left, ragged right
flush right, ragged left
justified
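For a monospaced display the four types above can be sketched in Python. `str.center`, `str.ljust` and `str.rjust` are standard string methods; `justify` is a hypothetical helper which assumes that the line fits within the given width and simply distributes the surplus spaces over the gaps between words.

```python
def justify(line: str, width: int) -> str:
    """Spread extra spaces between words so that the line fills the width."""
    words = line.split()
    if len(words) < 2:
        return line.ljust(width)
    gaps = len(words) - 1
    spaces = width - sum(len(word) for word in words)
    base, extra = divmod(spaces, gaps)
    # The leftmost 'extra' gaps get one additional space each.
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    return "".join(parts) + words[-1]


line, width = "types of justification", 30
print(f"[{line.center(width)}]")     # centred
print(f"[{line.ljust(width)}]")      # flush left, ragged right
print(f"[{line.rjust(width)}]")      # flush right, ragged left
print(f"[{justify(line, width)}]")   # justified
```

Proportional fonts are justified by stretching the spaces by fractions of their width rather than by inserting whole extra spaces.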
justification of monospace text
When a monospaced font is used, there is a way to justify text without inserting
extra spaces. Careful word choice allows the author to write with exactly eighty
characters per line, creating a visual effect of justification. Since many words
in English mean the same thing but are different lengths, it is just a matter of
trial and error to find the proper line length. For extra points, you should end
the last line after eighty characters as well, creating an invincible paragraph.
comparison of markup and WYSIWYG
markup
• daunting at first sight
• powerful (e.g., “put this one-third of the way between the two margins”)
• persuades user more effectively to use logical markup
• both on CLIs and GUIs
• uses fewer computer resources
• user sees everything in the file
WYSIWYG
• intuitive, easy at first sight
• “what you see is all you get”
• allows user to use primitive formatting techniques
• possible only on GUIs
• uses huge computer resources
• data in the file are hidden from user
a horrendous example
Figure: hanging and normal indentation: never do it this way!
hyphenation
hyphenation: the WYSIWYG way
points of hyphenation are calculated at the end of each line, they do not change later on, occasionally yielding very loose lines; paragraph-based hyphenation would burden the system too much, and would result in constantly flickering characters while text is entered
hyphenation: the TeX way
points of hyphenation are calculated for a whole paragraph, and recalculated several times, until the optimal solution is achieved (this is possible, because the calculation does not take place during the editing of the text)
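The contrast between deciding line by line and deciding for the whole paragraph can be sketched in Python. This is a deliberate simplification, not TeX's actual algorithm: it breaks only between words (no hyphenation at all), it measures looseness as the squared number of unused columns on every line but the last, and it assumes that no word is longer than the line. The function names are hypothetical.

```python
def greedy_breaks(words, width):
    """Line by line: fill each line as far as it goes and never look back."""
    lines, current = [], []
    for word in words:
        if current and len(" ".join(current + [word])) > width:
            lines.append(current)
            current = []
        current.append(word)
    lines.append(current)
    return lines


def optimal_breaks(words, width):
    """Whole paragraph: minimise total squared slack of all lines but the last."""
    n = len(words)
    best = [0.0] * (n + 1)     # best[i]: lowest cost of setting words[i:]
    split = [n] * (n + 1)      # split[i]: where the line starting at i ends
    for i in range(n - 1, -1, -1):
        best[i] = float("inf")
        length = -1            # cancels the space counted before the first word
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            if length > width and j > i + 1:
                break          # words[i:j] no longer fit on one line
            slack = max(width - length, 0)
            cost = (0 if j == n else slack ** 2) + best[j]
            if cost < best[i]:
                best[i], split[i] = cost, j
    lines, i = [], 0
    while i < n:
        lines.append(words[i:split[i]])
        i = split[i]
    return lines


def badness(lines, width):
    """Total squared slack of all lines except the last one."""
    return sum((width - len(" ".join(line))) ** 2 for line in lines[:-1])


words, width = "aaa bb cc ddddd".split(), 6
for breaker in (greedy_breaks, optimal_breaks):
    lines = breaker(words, width)
    print(breaker.__name__, [" ".join(line) for line in lines], badness(lines, width))
```

The greedy version fills the first line completely and is then forced to leave "cc" alone on a very loose second line; the paragraph-based version accepts a slightly looser first line to get a better result overall, at the price of having to look at the whole paragraph before setting any of it.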
data potentially hidden in a WYSIWYG document
• data about the owner of the word processor
• data about previous edits in the document
you can lose your job if you are unwary, e.g.,
Wrong version
Dobos Gabriella, head of the press department of the Budapest Chief Prosecutor’s Office, has been dismissed and has also been relieved of her post as spokesperson. [. . . ]
At the same time, a mistake was made for which someone is responsible, since shortening the indictment and preparing the shortened version that does not violate personal data was the task of department head Dobos Gabriella. As the prosecution service’s inquiry found, Dobos did indeed shorten the indictment, but she only marked the parts to be deleted, and sent the shortened version to the Office of the Prosecutor General without the data having been permanently deleted from it.
Thus, for an hour, anyone could (by pressing certain keys) complete the version posted on the internet with the parts struck out of the full version. In this way anyone could also access the personal data, which violates the data protection act.
— http://index.hu/politika/belfold/kendh050830/