Document Formats
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Documents
- Building blocks of digital libraries
- Many different standards for documents
- Internationalization
- Fixed versus fluid
- Permanent versus transient
- Indexing

Representing Characters
- EBCDIC
  - Extended Binary Coded Decimal Interchange Code
  - Represented in 8 bits
- ASCII (1968)
  - American Standard Code for Information Interchange
  - Represented with 7 bits
  - Does not support many foreign languages
  - Many expansions made to the basic ASCII character set
- ISCII (1983)
  - Indian Script Code for Information Interchange
  - Hindi and related languages
- GB and Big-5 for Chinese

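The difference between these character sets can be seen by encoding the same text with each of them. The sketch below uses codecs that ship with Python; `cp037` is only one of several EBCDIC code pages and is picked here as an assumption, and `encode_report` is a hypothetical helper name.

```python
def encode_report(text: str, codecs=("ascii", "cp037", "gb2312", "big5")) -> dict:
    """Return, per codec, the encoded bytes, or None if the text cannot be encoded."""
    report = {}
    for name in codecs:
        try:
            report[name] = text.encode(name)
        except UnicodeEncodeError:
            # The character has no code in this character set.
            report[name] = None
    return report


latin = encode_report("A")     # a letter every listed character set contains
chinese = encode_report("中")   # a Chinese character
print(latin)
print(chinese)
```

ASCII and EBCDIC assign different byte values to the same letter, and neither can represent the Chinese character, while GB and Big-5 each use two bytes for it.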
Unicode
- Successor of ASCII
- ISO-10646 (1993)
- Universal
  - Aims to represent ALL the world's languages
  - Default encoding for HTML and XML
- Development began in 1988 as a joint effort between Apple and Xerox
- Unicode standard continues to evolve
- Round-trip compatibility: Unicode can be mapped to/from any character set without loss

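Round-trip compatibility can be checked directly for a given legacy encoding: decode the bytes to Unicode, encode them back, and compare. This is a minimal sketch using two codecs from the Python standard library; `round_trips` is a hypothetical helper, and the check only covers the byte sequences it is given.

```python
def round_trips(raw: bytes, codec: str) -> bool:
    """True if decoding to Unicode and re-encoding reproduces the original bytes."""
    text = raw.decode(codec)           # legacy bytes -> Unicode string
    return text.encode(codec) == raw   # Unicode string -> legacy bytes


all_latin1 = bytes(range(256))         # every possible ISO 8859-1 byte
big5_sample = "中文".encode("big5")     # a short Big-5 encoded string

print(round_trips(all_latin1, "latin-1"))
print(round_trips(big5_sample, "big5"))
```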
Unicode Character Set
- Unicode standard is massive
  - Two subsets of standard: ISO 10646-1/2
  - 94,000 characters defined
- Represents scripts
  - Scripts versus languages
  - Punctuation shared among scripts
- Universal character set: characters at the core of Unicode

Representing Documents
- Plain text
- Full-text indexing
  - Bag of words
  - Inversion of the text
  - Inverted files
  - Granularity of document
  - Granularity of index
- Word segmentation
  - Chinese and Japanese are written without spaces
  - Spacing in Chinese sentences can completely change the meaning

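Inverting the text means turning "document to words" into "word to documents". The sketch below builds a document-level inverted index over a tiny made-up collection; the function names and sample documents are illustrative assumptions, and word order is discarded, as in the bag-of-words view.

```python
import re
from collections import defaultdict


def build_inverted_index(documents: dict) -> dict:
    """Map each lowercased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index: dict, query: str) -> list:
    """Return ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings))


docs = {
    1: "Digital libraries store documents",
    2: "Documents have many formats",
    3: "Image formats",
}
index = build_inverted_index(docs)
print(search(index, "documents formats"))
```

The index here records only document ids, so its granularity is the whole document; storing paragraph or word positions instead would give a finer-grained index at the cost of size. The `\w+` tokenizer relies on spaces and punctuation, so an unspaced run of Chinese or Japanese text would come out as a single token and would need a separate word segmenter.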
Page Description Languages
- Device independence
- PostScript (also a programming language)
  - First commercially developed page description language (1985)
  - Fonts: Type 1, TrueType, OpenType
  - Text extraction
  - Using PostScript in a digital library

Page Description Languages
- Portable Document Format (PDF)
- PDF versus PostScript
  - Not a full-scale programming language
  - New features for interactive display
    - Random access to pages
    - Hierarchically structured content
    - Navigation within a document
    - Hyperlinks
- File format: header, objects, cross-references, trailer
- Searchable image option

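The four parts of the file format can be illustrated with a hand-built file. The sketch below assembles a tiny PDF-like byte string (header, two objects, a classic cross-reference table, trailer) and then locates the table the way a reader would, starting from the end of the file. Both function names are hypothetical, the generated file has no pages, and newer PDFs that use cross-reference streams are not handled.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble header, objects, cross-reference table and trailer by hand."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                      # header
    offsets = []
    for number, body in enumerate(objects, start=1):    # objects
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)         # cross-references
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset    # trailer
    return bytes(out)


def pdf_layout(data: bytes) -> dict:
    """Read the header version and find the cross-reference table via the trailer."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("missing %PDF- header")
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if tail is None:
        raise ValueError("missing startxref entry")
    xref_offset = int(tail.group(1))
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": xref_offset,
        "xref_found": data[xref_offset:xref_offset + 4] == b"xref",
    }


layout = pdf_layout(build_minimal_pdf())
print(layout)
```

Because the cross-reference table stores the byte offset of every object, a viewer can jump straight to the objects for one page without reading the rest of the file, which is what makes random access to pages practical.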
Word-Processor Documents
- Rich Text Format
  - 240-page specification
  - Document-level metadata
  - Software for conversion to HTML is available
- Native Word formats
  - Binary
  - Proprietary
- LaTeX format
  - Typed formatting commands
  - Non-proprietary

Representing Images
- Lossless image compression
  - GIF
  - PNG
  - JPEG-Lossless
  - JPEG-2000
- Lossy image compression
  - JPEG
- Progressive refinement

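The lossless/lossy distinction comes down to whether the decoded pixels match the originals exactly. The sketch below is only an illustration on synthetic 8-bit grayscale data: the lossless path uses zlib's DEFLATE (the compression method PNG is built on), and the lossy path is plain quantization, which stands in for the idea of discarding detail and is not the JPEG algorithm.

```python
import zlib


def lossless_round_trip(pixels: bytes) -> bytes:
    """Compress and decompress with DEFLATE; every sample is recovered exactly."""
    return zlib.decompress(zlib.compress(pixels))


def quantize(pixels: bytes, step: int = 16) -> bytes:
    """Lossy step: snap each 8-bit sample down to a multiple of `step`."""
    return bytes((p // step) * step for p in pixels)


# Synthetic 64 x 64 grayscale gradient, one byte per pixel.
pixels = bytes((x + y) % 256 for y in range(64) for x in range(64))

restored = lossless_round_trip(pixels)
lossy = quantize(pixels)
max_error = max(abs(a - b) for a, b in zip(pixels, lossy))

print(restored == pixels)   # identical after the lossless round trip
print(lossy == pixels)      # differs after quantization
print(max_error)            # bounded by step - 1
```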
Representing Audio and Video
- Evolution of signals over time
- Sample rate
  - Samples per second
- Multimedia compression
  - Codec
  - Asymmetry
  - Redundancy

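The sample rate translates directly into storage cost for uncompressed audio, which is why compression matters. The sketch below does the arithmetic for CD-style audio (44,100 samples per second, 16 bits per sample, two channels); the function names are hypothetical.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed (PCM) audio."""
    return sample_rate_hz * bits_per_sample * channels


def pcm_bytes(sample_rate_hz: int, bits_per_sample: int, channels: int, seconds: int) -> int:
    """Total bytes needed to store `seconds` of uncompressed audio."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample, channels) * seconds // 8


cd_rate = pcm_bit_rate(44_100, 16, 2)
one_minute = pcm_bytes(44_100, 16, 2, 60)
print(cd_rate)      # bits per second
print(one_minute)   # bytes per minute
```

At these settings a minute of stereo audio takes roughly 10 MB before any codec is applied.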
MPEG
- ISO Moving Picture Experts Group (1988)
- Audio and video at 1.5 Mbit/second
- Family of standards
- MPEG-1
  - Low resolution video, 30 fps, near CD quality
  - Layer 3: MP3
- MPEG-2
  - Higher quality video (DVD)
  - Supports interlaced images (broadcast TV)
  - Multichannel audio

MPEG
- MPEG-3 abandoned
- MPEG-4
  - Low bandwidth networks: mobile and WWW
  - Object based (vs. signal based)
  - Interactive
  - Strategies for identifying and managing intellectual property
- MPEG-7
  - Metadata description for content delivered via MPEG-1, 2, and 4
- MPEG-21
  - Multimedia lifecycle
  - Interoperability

Other Multimedia Formats
- Audio and video
  - AVI (Microsoft)
  - QuickTime (Apple)
  - Streaming
    - RealAudio, RealVideo, RealOne (RealNetworks)
    - ASF (Microsoft)
- Audio only
  - WAV (Microsoft, IBM)
  - AIFF (Apple)
  - AU (Sun)

Multimedia in a Digital Library
- Indexing and browsing structures
  - Text-based
  - Content-based
- Summarizing audio and video
- Digitizing media
  - Linear resolution, color depth, frame rate, sample rate
- Preservation issues
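The digitization parameters listed above determine how much raw data a scan or a video capture produces. The sketch below works through the arithmetic; the page size, scanning resolution, and video frame size are illustrative assumptions rather than recommendations, and the function names are hypothetical.

```python
def raw_image_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of a scan: linear resolution (dpi) times color depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def raw_video_bytes_per_second(width_px: int, height_px: int,
                               bits_per_pixel: int, frames_per_second: int) -> int:
    """Uncompressed video data rate: frame size times color depth times frame rate."""
    return width_px * height_px * bits_per_pixel * frames_per_second // 8


# An 8.5 x 11 inch page scanned at 300 dpi in 24-bit color (assumed settings).
page = raw_image_bytes(8.5, 11, 300, 24)
# A 352 x 240 frame in 24-bit color at 30 frames per second (assumed settings).
video = raw_video_bytes_per_second(352, 240, 24, 30)
print(page)
print(video)
```

Doubling the linear resolution quadruples the size of a scan, so these choices weigh heavily on both storage and long-term preservation costs.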