Document Formats
How to Build a Digital Library
Ian H. Witten and David Bainbridge


Upload: hollie-baker

Post on 01-Jan-2016


Page 1

Document Formats

How to Build a Digital Library
Ian H. Witten and David Bainbridge

Page 2

Documents

- Building blocks of digital libraries
- Many different standards for documents
- Internationalization
- Fixed versus fluid
- Permanent versus transient
- Indexing

Page 3

Representing Characters

- EBCDIC
  - Extended Binary Coded Decimal Interchange Code
  - Represented in 8 bits
- ASCII (1968)
  - American Standard Code for Information Interchange
  - Represented with 7 bits
  - Does not support many foreign languages
  - Many expansions made to the basic ASCII character set
- ISCII (1983)
  - Indian Script Code for Information Interchange
  - Hindi and related languages
- GB and Big-5 for Chinese
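The bit widths and competing Chinese encodings listed above can be checked with a short Python sketch. The helper names `fits_in_7_bits` and `encode_with` are made up for this illustration; `cp037` (one EBCDIC variant), `gb2312` and `big5` are codec names from Python's standard library, chosen here as stand-ins for the families named on the slide.

```python
def fits_in_7_bits(text: str) -> bool:
    """Return True if every character has a code point below 128 (the ASCII range)."""
    return all(ord(ch) < 128 for ch in text)


def encode_with(text: str, encoding: str) -> bytes:
    """Encode text with one of Python's built-in codecs."""
    return text.encode(encoding)


# Plain English text fits in 7 bits; an accented French word does not.
print(fits_in_7_bits("digital library"))
print(fits_in_7_bits("bibliothèque"))

# The same letter has different byte values in ASCII and in EBCDIC (cp037).
print(encode_with("A", "ascii").hex(), encode_with("A", "cp037").hex())

# The same Chinese character has different byte values in GB and Big-5.
print(encode_with("中", "gb2312").hex(), encode_with("中", "big5").hex())
```

Because the byte values differ from one standard to the next, a digital library has to know which encoding each document uses before it can index the text.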

Page 4

Unicode

- Successor of ASCII
- ISO 10646 (1993)
- Universal
  - Aims to represent ALL the world’s languages
  - Default encoding for HTML and XML
- Development began in 1988 as a joint effort between Apple and Xerox
- Unicode standard continues to evolve
- Round-trip compatibility: text in an established character set can be mapped to Unicode and back without loss
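Round-trip compatibility can be tried out in Python, whose `str` type holds Unicode text. This is only a sketch: the function name `round_trips` is hypothetical, and the three codecs are an arbitrary sample of older character sets, not a complete test.

```python
def round_trips(legacy_bytes: bytes, encoding: str) -> bool:
    """Decode legacy bytes to Unicode, re-encode, and compare with the input."""
    text = legacy_bytes.decode(encoding)          # legacy character set -> Unicode
    return text.encode(encoding) == legacy_bytes  # Unicode -> legacy character set


samples = {
    "ascii": "library".encode("ascii"),
    "latin-1": "bibliothèque".encode("latin-1"),
    "big5": "圖書館".encode("big5"),
}
for name, data in samples.items():
    print(name, round_trips(data, name))

# The guarantee runs one way only: arbitrary Unicode text need not fit
# into a smaller character set such as ASCII.
try:
    "圖書館".encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```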

Page 5

Unicode Character Set

- Unicode standard is massive
  - Two subsets of standard: ISO 10646-1/2
  - 94,000 characters defined
- Represents scripts
  - Scripts versus languages
  - Punctuation shared among scripts
- Universal character set: characters at the core of Unicode
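The point about scripts versus languages shows up in the official character names, which Python exposes through the standard `unicodedata` module. The helper `describe` below is a made-up name for this sketch.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Three letters that look alike or sound alike belong to three scripts;
# the comma is encoded once and shared by all of them.
for ch in ["a", "\u0430", "\u03b1", ","]:
    print(describe(ch))
```

The names mention a script (Latin, Cyrillic, Greek), never a language: the Latin letter serves English, French, and many other languages alike.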

Page 6

Representing Documents

- Plain text
- Full-text indexing
  - Bag of words
  - Inversion of the text
  - Inverted files
  - Granularity of document
  - Granularity of index
- Word segmentation
  - Chinese and Japanese are written without spaces
  - Spacing in Chinese sentences can completely change the meaning
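A minimal sketch of an inverted file, with a pluggable tokenizer so the word segmentation problem becomes visible. All names here (`split_on_spaces`, `longest_match`, `build_inverted_file`) and the tiny lexicon are invented for illustration; greedy longest-match is just one simple segmentation heuristic, not necessarily the one the authors have in mind.

```python
from collections import defaultdict


def split_on_spaces(text):
    """Tokenizer for languages that separate words with spaces."""
    return text.lower().split()


def longest_match(text, lexicon):
    """Greedy left-to-right segmentation: take the longest lexicon word
    starting at each position, falling back to a single character."""
    max_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def build_inverted_file(documents, tokenize):
    """Map each word to {document id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: "how to build a digital library",
    2: "a digital library is built from documents",
}
index = build_inverted_file(docs, split_on_spaces)
print(sorted(index["digital"]))   # document-level granularity: which documents
print(index["library"][2])        # word-level granularity: positions in document 2

lexicon = {"数字", "图书", "图书馆"}  # hypothetical three-word lexicon
print(longest_match("数字图书馆", lexicon))
```

Keeping only the document ids gives a coarse index; keeping positions as well gives a finer one that supports phrase queries at the cost of more space.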

Page 7

Page Description Languages

- Device independence
- PostScript (also a programming language)
  - First commercially developed page description language (1985)
  - Fonts: Type 1, TrueType, OpenType
  - Text extraction
  - Using PostScript in a digital library

Page 8

Page Description Languages

- Portable Document Format (PDF)
- PDF versus PostScript
  - Not a full-scale programming language
  - New features for interactive display
    - Random access to pages
    - Hierarchically structured content
    - Navigation within a document
    - Hyperlinks
- File format: header, objects, cross-references, trailer
- Searchable image option
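The four-part file layout can be sketched by assembling a tiny PDF by hand. This is a simplified illustration: the function names `build_minimal_pdf` and `object_offset` are invented, the file contains a single blank page, and real PDFs may use compressed streams and several cross-reference sections that this lookup does not handle.

```python
def build_minimal_pdf() -> bytes:
    """Assemble header, objects, cross-reference table and trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                  # header
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                    # byte offset of this object
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)     # cross-reference table
    out += b"0000000000 65535 f \n"                 # entry for the unused object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


def object_offset(pdf: bytes, number: int) -> int:
    """Find an object's byte offset via the trailer and cross-reference table."""
    tail = pdf.rsplit(b"startxref\n", 1)[1]
    xref_start = int(tail.split(b"\n", 1)[0])
    lines = pdf[xref_start:].split(b"\n")
    # lines[0] is "xref", lines[1] is the subsection header, then one entry per object.
    return int(lines[2 + number].split()[0])


pdf = build_minimal_pdf()
start = object_offset(pdf, 3)
print(pdf[start:start + 7])
```

Reading the trailer first and jumping straight to an object is what makes random access to pages possible without scanning the whole file.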

Page 9

Word-Processor Documents

- Rich Text Format
  - 240-page specification
  - Document-level metadata
  - Software for conversion to HTML is available
- Native Word formats
  - Binary
  - Proprietary
- LaTeX format
  - Typed formatting commands
  - Non-proprietary

Page 10

Representing Images

- Lossless image compression
  - GIF
  - PNG
  - JPEG-Lossless
  - JPEG-2000
- Lossy image compression
  - JPEG
- Progressive refinement
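The difference between lossless and lossy compression, and the idea of progressive refinement, can be shown on a synthetic row of grayscale pixels. This sketch uses `zlib` (the DEFLATE method that PNG also relies on) for the lossless case; the `quantize` helper is a crude stand-in for a lossy step and is not how JPEG actually works.

```python
import zlib


def quantize(pixels: bytes, bits: int) -> bytes:
    """Keep only the top `bits` bits of each 8-bit sample (a crude lossy step)."""
    shift = 8 - bits
    return bytes((p >> shift) << shift for p in pixels)


def max_error(original: bytes, approximation: bytes) -> int:
    """Largest per-pixel difference between two equally long pixel rows."""
    return max(abs(a - b) for a, b in zip(original, approximation))


row = bytes(range(0, 256, 3)) * 8   # synthetic grayscale gradient, repeated

# Lossless: decompression restores every byte exactly.
packed = zlib.compress(row)
print(len(row), len(packed), zlib.decompress(packed) == row)

# Lossy: the discarded low-order bits cannot be recovered.
print(quantize(row, 4) == row)

# Progressive refinement: each pass adds bits, and the error shrinks.
errors = [max_error(row, quantize(row, bits)) for bits in (2, 4, 6, 8)]
print(errors)
```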

Page 11

Representing Audio and Video

- Evolution of signals over time
- Sample rate
  - Samples per second
- Multimedia compression
  - Codec
  - Asymmetry
  - Redundancy
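Sample rate translates directly into data volume, which is why compression matters. The arithmetic below is a sketch: the function names are invented, the CD audio parameters (44.1 kHz, 16 bits, stereo) are not from the slides, and the 128 kbit/s target is simply an assumed example rate for a compressed stream.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed (PCM) audio."""
    return sample_rate_hz * bits_per_sample * channels


def storage_bytes(bit_rate: int, seconds: int) -> int:
    """Bytes needed to store a stream of the given bit rate and duration."""
    return bit_rate * seconds // 8


def compression_ratio(raw_bit_rate: int, compressed_bit_rate: int) -> float:
    """How many times smaller the compressed stream is."""
    return raw_bit_rate / compressed_bit_rate


cd_rate = pcm_bit_rate(44_100, 16, 2)           # assumed CD audio parameters
print(cd_rate)                                  # bits per second
print(storage_bytes(cd_rate, 60))               # bytes for one minute
print(round(compression_ratio(cd_rate, 128_000), 1))
```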

Page 12

MPEG

- ISO Moving Picture Experts Group (1988)
- Audio and Video at 1.5 Mbit/second
- Family of standards
- MPEG-1
  - Low resolution video, 30 fps, near CD quality
  - Layer 3: MP3
- MPEG-2
  - Higher quality video (DVD)
  - Supports interlaced images (Broadcast TV)
  - Multichannel audio

Page 13

MPEG

- MPEG-3 abandoned
- MPEG-4
  - Low bandwidth networks: mobile and WWW
  - Object based (vs. signal based)
  - Interactive
  - Strategies for identifying and managing intellectual property
- MPEG-7
  - Metadata description for content delivered via MPEG-1, 2, 4
- MPEG-21
  - Multimedia lifecycle
  - Interoperability

Page 14

Other Multimedia Formats

- Audio and Video
  - AVI (Microsoft)
  - QuickTime (Apple)
  - Streaming
    - RealAudio, RealVideo, RealOne (RealNetworks)
    - ASF (Microsoft)
- Audio only
  - WAV (Microsoft, IBM)
  - AIFF (Apple)
  - AU (Sun)

Page 15

Multimedia in a Digital Library

- Indexing and browsing structures
  - Text-based
  - Content-based
- Summarizing audio and video
- Digitizing media
  - Linear resolution, color depth, frame rate, sample rate
- Preservation issues
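The digitizing parameters above determine how much raw data a collection produces. The following back-of-the-envelope sketch uses invented function names and assumed example values (a letter-size page scanned at 300 dpi, and a 352 x 240 video frame at 30 frames per second); none of these figures come from the slides.

```python
def image_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scanned page, from linear resolution and color depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def video_bit_rate(width_px: int, height_px: int, bits_per_pixel: int, fps: int) -> int:
    """Uncompressed video bit rate, from frame size, color depth and frame rate."""
    return width_px * height_px * bits_per_pixel * fps


# Assumed example: 8.5 x 11 inch page at 300 dpi.
print(image_bytes(8.5, 11, 300, 24))   # 24-bit color
print(image_bytes(8.5, 11, 300, 1))    # 1-bit black and white

# Assumed example: 352 x 240 frames, 24-bit color, 30 frames per second.
raw = video_bit_rate(352, 240, 24, 30)
print(raw, round(raw / 1_500_000, 1))  # raw rate, and ratio to 1.5 Mbit/second
```

Choosing a lower color depth or resolution shrinks storage sharply, but for preservation copies that information cannot be recovered later.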