digitization
Digitization
What is it?
• Digitization is the process of converting analog materials into a digital format that computer systems can understand and read
• In other words, materials become machine-readable
Are digitized materials the “real thing?”
• Digitized materials are representations of physical items, not the items themselves.
What and how?
• Written records, photographs, oral history tapes, films, material culture: most analog documents and artifacts can be digitized
• Digital data is only a sampling of the original data, which is then encoded into the 1s and 0s that a computer understands. Information is translated into numerical values.
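To make "sampling" and "1s and 0s" concrete, here is a minimal sketch in Python. It assumes a made-up analog signal (a 1 Hz sine wave), a sample rate of 8 samples per second, and 3 bits per sample. The function names and all of those numbers are illustrative choices, not part of any real digitization standard.

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous signal and quantize each sample to an integer code."""
    levels = 2 ** bits
    codes = []
    for i in range(int(duration_s * sample_rate_hz)):
        t = i / sample_rate_hz
        value = signal(t)  # assumed to lie in [-1.0, 1.0]
        # Map [-1, 1] onto the integer bins 0 .. levels-1 (top edge is clipped).
        code = min(levels - 1, int((value + 1.0) / 2.0 * levels))
        codes.append(code)
    return codes

def to_bit_strings(codes, bits):
    """Write each integer code as a fixed-width string of 1s and 0s."""
    return [format(code, f"0{bits}b") for code in codes]

def reconstruct(codes, bits):
    """Turn codes back into values, using the midpoint of each bin."""
    levels = 2 ** bits
    return [(code + 0.5) / levels * 2.0 - 1.0 for code in codes]

def tone(t):
    """Hypothetical analog source: a 1 Hz sine wave."""
    return math.sin(2 * math.pi * t)

BITS = 3
codes = digitize(tone, duration_s=1.0, sample_rate_hz=8, bits=BITS)
bit_strings = to_bit_strings(codes, BITS)
approximation = reconstruct(codes, BITS)

# Difference between the original value and what the digital copy can recover.
errors = [abs(tone(i / 8) - approx) for i, approx in enumerate(approximation)]
print(bit_strings)
print(max(errors))
```

The reconstructed values never match the original exactly. Everything between the sample points, and everything finer than one bin, is not stored at all.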
Cons?
• With digitization, data (information) is lost.
  – CDs versus vinyl
  – Text formatting, e.g. page layout, spacing, handwritten information
Pros?
• Greater accessibility
  – More materials
  – More efficient search mechanisms
• Larger sets of information
• Higher-end technology gives us a clearer view of some content
What difference does it make?
• Digitization transforms the way we research, present, and even preserve the past
• It transforms access to materials
Page Image
• Scanned or photographed printed page or microfilm
• Disadvantages:
  – Not machine-readable, therefore not searchable. You have to go through the pages one by one
  – They can be huge: slow to load, cumbersome to navigate
• Advantages: may more closely represent the original material
https://www.flickr.com/photos/leeanncafferata/sets/72157626269585342/
Markup
• The digital version of the traditional copy editor
• In historical documents, TEI (Text Encoding Initiative) and XML (eXtensible Markup Language) markup is often used.
• Very simply, that’s a set of tags that describe the parts of a document. It is machine-readable.
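As a rough illustration of what "machine-readable" means for markup, the sketch below uses Python's standard xml.etree.ElementTree module to pull the title and author out of a small TEI-style header. The sample document is a simplified, assumed fragment written for this example (a real TEI file contains far more), and the function name read_header is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace, so lookups must name it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Assumed, simplified TEI-style fragment for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>As You Like It</title>
        <author>William Shakespeare</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

def read_header(xml_text):
    """Return the title and author recorded in the titleStmt element."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS),
        "author": root.findtext(".//tei:titleStmt/tei:author", namespaces=TEI_NS),
    }

header = read_header(SAMPLE)
print(header)
```

Because the tags say which part of the text is the title and which is the author, a program can find them without guessing from layout or font.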
This TEI/XML from the Folger
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="fdt.xsl"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>As You Like It</title>
<author>William Shakespeare</author>
<editor xml:id="BAM">Barbara A. Mowat</editor>
<editor xml:id="PW">Paul Werstine</editor>
http://www.folgerdigitaltexts.org/?chapter=0
Creates this…
OCR (Optical Character Recognition)
• A system including an optical scanner that reads text and software that analyzes the scanned image
• The result: machine-readable, searchable, editable materials
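Once OCR has produced text, searching becomes a lookup instead of paging through images one by one. The sketch below builds a simple word index over OCR output. The page texts are invented for the example, and build_index and search are hypothetical helper names, not functions from any OCR product.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return index

def search(index, word):
    """Return the sorted page numbers that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Invented OCR output for three scanned pages.
ocr_pages = {
    1: "The railroad reached the town in the spring.",
    2: "Local merchants welcomed the railroad and the telegraph.",
    3: "The harvest was poor that year.",
}

index = build_index(ocr_pages)
print(search(index, "railroad"))
print(search(index, "Telegraph"))
print(search(index, "canal"))
```

A page image alone offers nothing to index. The words only become searchable after OCR (or a human transcriber) turns the picture into text.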
Challenges
• Performs poorly on handwritten materials
• Ability to read different languages and fonts depends on sophistication of technology
• Less likely to represent the original, which is particularly important in cases of historic documents, annotated texts, revised manuscripts
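One practical consequence of these challenges is that a single misread character can hide a word from search. The sketch below assumes an invented OCR result in which "modern" was misread as "modem" (the letters "rn" taken for "m"). It uses get_close_matches from Python's standard difflib module as one possible way to find near matches. This is an illustration, not how any particular archive handles OCR errors.

```python
import difflib

# Invented OCR output: the printed word "modern" was misread as "modem".
ocr_words = ["the", "modem", "history", "paper"]

def exact_hit(query, words):
    """True only if the query appears letter for letter in the OCR text."""
    return query in words

def near_matches(query, words):
    """Words similar to the query, using difflib's default similarity cutoff."""
    return difflib.get_close_matches(query, words)

print(exact_hit("modern", ocr_words))
print(near_matches("modern", ocr_words))
```

An exact search misses the word entirely. Fuzzy matching can recover some of these misses, at the cost of also returning words that were never on the page.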
Some Examples
• Making of America: http://ebooks.library.cornell.edu/m/moa/
• Library of Congress: Chronicling America
• JSTOR: shows us the scanned image, searches OCR
• Google Books