CIS 702 Communication/Information Technologies (CIT)
Philip Robbins – March 7, 2013
Dr. Luz Quiroga, Ph.D.
Chapter 6
Documents: Language & Properties
Communication & Information Sciences Ph.D. Program
University of Hawai'i at Mānoa
Teaching Session #9
Documents: Language & Properties
Chapter Contents
• Metadata
• Document Formats
• Markup Languages
• Text Properties
• Document Preprocessing
• Organizing Documents
• Text Compression
Introduction
Document
• Denotes a single unit of information
• Structure and a syntax
• Semantics, specified by the author
• Presentation style
Document Syntax
• Expresses structure, presentation style, semantics
• Can be implicit in the content
• Can be expressed in a simple declarative language
• Can be expressed in a programming language
Text
• Can be written in natural language (hard to process)
Document Style
• How a document is visualized or printed
• Can be embedded in the document, e.g., RTF files
• Can be complemented by macros
Queries
• Short pieces of text
• Differ from normal text
• Semantics often ambiguous due to polysemy
• User intent behind a query is not easy to infer
Metadata
• Data about data
• Information on the organization of the data, various data domains, and their relationship
• Metadata is associated with most documents
Descriptive Metadata
• External to the meaning of the document; pertains more to how it was created
• Author of the text
• Date of publication
• Source of the publication
• Document length
Semantic Metadata
• Characterizes the subject matter within the document contents
• Associated with a wide number of documents
• Availability is increasing
Metadata Format
• Machine Readable Cataloging Record (MARC)
• Format used for most library records
• Includes fields for distinct attributes of a bibliographic entry, such as title, author, and publication venue
Metadata in Web Documents
• Increase in web data has led to adding metadata information to web pages
• Cataloging and content rating
• Intellectual property rights and digital signatures
• Electronic commerce
Resource Description Framework (RDF)
• New standard for Web metadata
• Allows describing Web resources to facilitate automated processing
• Does not assume any particular application or semantic domain
• Consists of a description of nodes and attached attribute/value pairs
Text
• Computers represent characters in binary, which is done through coding schemes:
• ASCII (7 bits, usually stored in 8-bit bytes)
• EBCDIC (8 bits)
• Unicode (originally 16 bits; now usually stored as UTF-8 or UTF-16)
• IR systems should be able to retrieve information from many text formats (doc, pdf, html, txt)
• IR systems have filters to handle most documents (might not be possible with proprietary formats)
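As a rough illustration of how the same characters map to different byte sequences under different coding schemes, here is a minimal Python sketch. The sample string and the choice of codecs (including `cp500`, one of several EBCDIC code pages) are assumptions made for the example, not part of the original slides.

```python
# Illustrative only: the sample text and codec choices are assumptions.
text = "Mānoa"

# The same five characters occupy a different number of bytes per scheme.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)
    print(codec, len(data), data.hex(" "))

# ASCII defines only 128 code points, so "ā" cannot be represented.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("ascii cannot encode:", err.object[err.start:err.end])

# cp500 is one EBCDIC code page; the letter "A" has a different code
# in EBCDIC than in ASCII.
ebcdic_a = "A".encode("cp500")
ascii_a = "A".encode("ascii")
print("EBCDIC A:", ebcdic_a.hex(), "ASCII A:", ascii_a.hex())
```

An IR system's input filters would typically need to detect or be told the encoding before any further text processing can take place.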
Text Formats
• For document exchange: Rich Text Format (RTF)
• For printing and displaying: Portable Document Format (PDF)
• For printing and displaying: PostScript (PS)
Interchange Formats
• For encoding email: Multipurpose Internet Mail Extensions (MIME)
• For compressing text: ZIP
Multimedia
• For applications that handle different types of data:
• Text
• Sounds
• Images
• Video
• Different formats are necessary for storing each medium
Images
Image Formats
• Simplest image formats are direct representations of a bit-mapped display: XBM, BMP, PCX
• These formats have lots of redundancy and can be compressed efficiently: GIF
Lossy Compression
• Used to improve compression ratios
• Uncompressing a compressed image does not yield exactly the original image
• Joint Photographic Experts Group (JPEG)
• Eliminates parts of the image that have less impact on the human eye
• Parametric format: loss can be tuned
Interchange Formats for Images
• Tagged Image File Format (TIFF)
• Provides for metadata, compression, and a varying number of colors
• De facto standard for images on the Web:
• Portable Network Graphics (PNG)
Audio
Audio Formats
• Audio is digitized
• MIDI is the standard format to interchange music between electronic instruments and computers
• AU, WAVE
Movies
Movie Formats
• Works by coding changes in consecutive frames
• Takes advantage of temporal image redundancy
• Includes audio signal associated with the video
• Audio: MP3, Video: MP4
• AVI, FLI, QuickTime
Graphics
Format for 3-D Graphics
• Computer Graphics Metafile (CGM)
• Virtual Reality Modeling Language (VRML)
• VRML is the universal interchange format for 3-D graphics and multimedia
Markup
Markup Languages
• Defined as extra syntax used to describe formatting actions, structure information, text semantics, attributes
• XML: eXtensible Markup Language
• HTML: HyperText Markup Language
• SGML: Standard Generalized Markup Language
Standard Generalized Markup Language (SGML)
• ISO 8879
• Meta-language for tagging text
• Provides rules for defining a markup language based on tags
• Includes a description of the document structure: “document type definition”
• An SGML document is defined by a document type definition together with the text itself, marked with tags describing the structure
SGML Document Type Definition
• Describes the pieces that a document is composed of
• Defines how those pieces relate to each other
• Part of the definition can be specified by an SGML Document Type Declaration (DTD)
• Other parts (e.g., semantics of elements & attributes) cannot be expressed formally in SGML
SGML
• Tags are denoted by angle brackets < >
• Used to identify the beginning and ending of an element
• Ending tags include a slash before the tag name
• Attributes are specified inside the beginning tag
SGML
• Document description does not specify how a document is printed
• Output specifications are added to SGML documents:
• DSSSL: Document Style Semantics and Specification Language
• FOSI: Formatting Output Specification Instance
• These standards define mechanisms for associating style information with SGML document instances
• For example, they allow specifying that data identified by a tag should be typeset in a particular font
HyperText Markup Language (HTML)
• Instance of SGML
• Created in 1992
• Latest version is 4.0 (HTML5 under development)
• Includes support for style sheets, frames, tables, forms, etc.
• Backwards compatible
• Most documents on the Web are stored and transmitted in HTML
• HTML tags follow all SGML conventions and include formatting directives
HyperText Markup Language (HTML)
• Can have media embedded within, such as images or audio
• Has fields for metadata
• Adding programs (e.g., JavaScript) inside a web page makes it dynamic (hence dynamic HTML)
Cascading Style Sheets (CSS)
• Because HTML does not fix a presentation style, CSS was introduced
• 1997
• Way for authors to improve the aesthetics of HTML pages
• Information about presentation is separate from document content
• Support for CSS in current browsers is still modest
eXtensible Markup Language (XML)
• Is a simplified subset of SGML
• Not a markup language (like HTML) but a meta-language (like SGML)
• Allows human-readable semantic markup, which is also machine-readable
• Does not have the restrictions of HTML
• Allows any user to define new tags
• Imposes a more rigid syntax:
• Ending tags cannot be omitted
• Distinguishes upper and lower case
• Attribute values must be in quotes
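The stricter syntax rules listed above can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the tag names (`recipe`, `step`) and the helper `is_well_formed` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the parser accepts the fragment as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments: one valid, three violating the rules above.
samples = {
    "user-defined tags, quoted attribute": '<recipe id="r1"><step>Mix</step></recipe>',
    "omitted ending tag": '<recipe id="r1"><step>Mix</recipe>',
    "upper/lower case mismatch": "<Step>Mix</step>",
    "unquoted attribute value": "<recipe id=r1></recipe>",
}

for label, fragment in samples.items():
    print(f"{label}: well-formed = {is_well_formed(fragment)}")
```

Only the first fragment parses; an HTML parser would typically tolerate several of the others, which is the contrast the slide draws.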
eXtensible Style Sheet Language (XSL)
• The XML counterpart of Cascading Style Sheets (CSS)
• Syntax based on XML
• Designed to transform and style highly structured, data-rich documents written in XML
• e.g., with XML it would be possible to automatically extract a table of contents from a document
Hypermedia/Time-based Structuring Language (HyTime)
• SGML architecture that specifies the generic hypermedia structure of documents
• Includes complex locating of document objects
• Includes relationships (hyperlinks) between document objects
• Includes numeric, measured associations between document objects
• Does not specify graphical interfaces, user navigation, or user interaction
Theory
Information Theory
• It is difficult to formally capture how much information there is in a given text
• However, distribution of symbols is related to it
• A text where one symbol appears almost all the time does not convey much information
• Information Theory defines a special concept, entropy, to capture information content
Entropy
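The formula from this slide did not survive in the transcript. A minimal sketch, assuming the standard Shannon definition E = −Σ pᵢ log₂ pᵢ, where pᵢ is the relative frequency of symbol i in the text, is given below; the function name `entropy` is chosen for this example only.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol, using observed symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    # p * log2(1/p) is the contribution of a symbol with probability p.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A text dominated by one symbol carries little information per symbol;
# a text with evenly distributed symbols carries more.
for sample in ("aaaaaaaa", "aaaaaaab", "abababab", "abcdabcd"):
    print(sample, round(entropy(sample), 3))
```

This matches the intuition stated earlier: the more skewed the symbol distribution, the lower the entropy.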
Modeling Natural Language
• We can divide the symbols of a text into two disjoint subsets:
• Symbols that separate words
• Symbols that belong to words
• Symbols are not uniformly distributed in a text
• e.g., in English, vowels are usually more frequent than most consonants
Modeling Natural Language
• A simple model to generate text is the binomial model
• However, the probability of a symbol depends on the previous symbols
• e.g., in English, the letter f cannot appear after the letter c
• A finite-context or Markovian model can be used to reflect this dependency
• Second issue: how the different words are distributed inside each document
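To make the finite-context idea concrete, here is a small sketch of an order-1 (one previous symbol) character model. The training corpus, the function names `train` and `generate`, and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train(text: str) -> dict:
    """Count, for each symbol, which symbols follow it and how often."""
    model = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        model[prev][nxt] += 1
    return model


def generate(model: dict, start: str, length: int, seed: int = 0) -> str:
    """Generate text where each symbol depends only on the previous one."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = model.get(out[-1])
        if not followers:  # no observed successor: stop early
            break
        symbols, weights = zip(*followers.items())
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)


corpus = "the cat sat on the mat and the rat ate the oat"  # toy corpus
model = train(corpus)
print(generate(model, "t", 30))
# A pair never seen in training (here "c" followed by "f") has count 0,
# so the model can never generate it.
print(model["c"]["f"])
```

Higher-order models condition on more previous symbols at the cost of a larger table.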
Zipf’s Law
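The slide body is missing from the transcript. As commonly stated, Zipf's law says that the frequency of the i-th most frequent word is roughly 1/i^θ times the frequency of the most frequent word, with θ close to 1 for natural language. The sketch below builds a rank/frequency table so that observed counts can be compared with that prediction; the function names and the default θ = 1 are assumptions for this example.

```python
import re
from collections import Counter


def rank_frequency(text: str) -> list:
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def zipf_prediction(top_freq: int, rank: int, theta: float = 1.0) -> float:
    """Predicted frequency of the word at `rank` under Zipf's law."""
    return top_freq / rank ** theta


sample = "the cat and the dog and the bird"  # far too small to show the law
table = rank_frequency(sample)
top = table[0][2]
for rank, word, freq in table:
    print(rank, word, freq, round(zipf_prediction(top, rank), 2))
```

On a realistic corpus (hundreds of thousands of words), plotting frequency against rank on log-log axes gives an approximately straight line.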
Modeling Natural Language
• Words arranged in decreasing order of their frequencies
• Distribution of words is very skewed
• Words that are too frequent (“stopwords”) can be disregarded
• A stopword is a word that does not carry meaning by itself in natural language
• e.g., stopwords in English: a, the, by, and
• Therefore, roughly half of the word occurrences in a text do not need to be considered
Modeling Natural Language
• Third issue: distribution of words in the documents of a collection
• Simple model: consider that each word appears the same number of times in every document (not true)
• Better model: use a binomial distribution
Heaps’ Law
• Fourth issue: number of distinct words in a document (document vocabulary)
• Used to predict the growth of vocabulary size in natural language text
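The formula itself is not preserved in the transcript. Heaps' law is usually written V = K·n^β, where n is the number of words in the text, V the vocabulary size, and K and β (with 0 < β < 1) are constants that depend on the text. The sketch below measures actual vocabulary growth and evaluates the formula; the parameter values used are illustrative assumptions, not fitted values.

```python
def vocabulary_growth(words) -> list:
    """Vocabulary size after reading each successive word."""
    seen, growth = set(), []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


def heaps(n: int, k: float, beta: float) -> float:
    """Vocabulary size predicted by Heaps' law, V = K * n**beta."""
    return k * n ** beta


print(vocabulary_growth("a b a c b d".split()))

# Assumed, purely illustrative parameters K = 10 and beta = 0.5:
# doubling the text size increases the vocabulary by less than double.
for n in (10_000, 20_000, 40_000):
    print(n, round(heaps(n, k=10, beta=0.5)))
```

In practice K and β would be estimated by fitting the measured growth curve of a real collection.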
Modeling Natural Language
• Vocabulary size grows sub-linearly with text size
Modeling Natural Language
• The set of different words of a language is bounded by a constant
• However, the limit is so high that it is common to assume the size of the vocabulary for a text of n words is O(n^β), with 0 < β < 1, rather than O(1)
• Many argue that the number keeps growing anyway because of typing and spelling errors
• As the total text size grows, the predictions of the model become more accurate
Text Similarity
• Similarity is measured by a distance function
• Hamming distance: for strings of the same length, the distance between them is the number of positions with different characters (distance is 0 if equal)
• A distance function should be symmetric and satisfy the triangle inequality: d(a, c) ≤ d(a, b) + d(b, c)
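A minimal sketch of the Hamming distance as described above; the function name and the example strings are chosen for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("karolin", "kathrin"))  # differs at positions 2, 3 and 4
print(hamming("text", "text"))        # identical strings give 0
```

Because it only compares aligned positions, the Hamming distance cannot handle insertions or deletions, which motivates the edit distance.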
Text Similarity
• Levenshtein “edit” distance: the minimal number of character insertions, deletions, and substitutions needed to make two strings equal
• Edit distance between color and colour is 1
• Edit distance between survey and surgery is 2
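The two examples above can be reproduced with the classic dynamic-programming formulation. This is a sketch of one common way to compute the distance (keeping only two rows of the table); the function name is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("color", "colour"))
print(edit_distance("survey", "surgery"))
```

The running time is proportional to the product of the two string lengths.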
Text Similarity
• Longest Common Subsequence (LCS):
• Delete all non-common characters of two (or more) strings
• The remaining sequence of characters is the LCS of both strings
• LCS of survey and surgery is surey
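A sketch of how the LCS of two strings can be computed with dynamic programming and then recovered by walking back through the table; the function name is chosen for this example.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner to collect the characters.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("survey", "surgery"))
```

When several subsequences share the maximal length, this sketch returns just one of them.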
Text Similarity
• Similarity can be extended to documents
• Compute the longest common sequence of lines between two files
• ‘diff’ command in Unix
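Python's standard `difflib` module produces output in the style of `diff -u`. Note that `difflib` uses its own matching heuristic rather than a strict LCS computation, so this is only an approximation of the idea; the two line lists and file names below are made up for the example.

```python
import difflib

# Two hypothetical versions of a file, as lists of lines.
old = ["alpha", "beta", "gamma", "delta"]
new = ["alpha", "gamma", "delta", "epsilon"]

diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt", lineterm="")
)
print("\n".join(diff_lines))
```

Lines prefixed with `-` appear only in the old version, lines prefixed with `+` only in the new one, and unprefixed lines are common to both.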
Resemblance Measure
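The definition from this slide is not preserved in the transcript. One widely used formulation, assumed here, represents each document as its set of "shingles" (sequences of w consecutive words) and defines resemblance as the size of the intersection of the two shingle sets divided by the size of their union. The function names and the default w = 3 are illustrative choices.

```python
def shingles(text: str, w: int = 3) -> set:
    """Set of all sequences of w consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Shared shingles divided by all distinct shingles (Jaccard coefficient)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(sa & sb) / len(sa | sb)


doc1 = "a rose is a rose is a rose"
doc2 = "a rose is a flower"
print(resemblance(doc1, doc2))
print(resemblance(doc1, doc1))
```

A value near 1 suggests near-duplicate documents; a value near 0 suggests little shared text.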
Model
Document Preprocessing Operations
• Lexical analysis of the text
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms or keywords
• Construction of term categorization structures (thesaurus)
Logical View of a Document
Document Preprocessing
Lexical Analysis
• Process of converting a stream of characters into a stream of words
• Major objective: identify words in the text
• Word separators:
- Space: most common separator
- Numbers: inherently vague, context required
- Hyphens: break up hyphenated words
- Punctuation marks
- Case of letters: A vs. a
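The choices listed above (how to treat spaces, digits, hyphens, punctuation, and case) can be sketched as a small tokenizer. The regular expression, the `split_hyphens` switch, and the decision to keep numbers as tokens are assumptions made for illustration; a real system would make these decisions per collection.

```python
import re

# Runs of letters/digits, optionally joined by single hyphens.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text: str, split_hyphens: bool = True) -> list:
    """Lower-case the text, drop punctuation, and return the word tokens."""
    tokens = TOKEN.findall(text.lower())
    if split_hyphens:
        # Break hyphenated words into their component words.
        tokens = [part for token in tokens for part in token.split("-")]
    return tokens


sentence = "State-of-the-art IR, since 1999!"
print(tokenize(sentence))
print(tokenize(sentence, split_hyphens=False))
```

Lower-casing everything is the simplest policy for case, although it loses distinctions such as proper nouns.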
Elimination of Stopwords
• Words that appear too frequently
• Usually not good discriminators
• Filtered out as potential index terms
• Reduces size of index by 40% or more
• At the expense of reducing recall: not able to retrieve documents that contain “to be or not to be”
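The recall problem mentioned above is easy to reproduce. The sketch below uses a deliberately tiny, made-up stopword list; real lists contain a few hundred words.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "the", "by", "and", "to", "be", "or", "not"}


def remove_stopwords(tokens) -> list:
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


print(remove_stopwords("the distribution of words by frequency".split()))
# Every word of this query is a stopword, so nothing is left to search for.
print(remove_stopwords("to be or not to be".split()))
```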
Stemming
• Stem: portion of a word left after removal of prefixes/suffixes
• Problem: the user specifies a query word, but only a variant of it is present in a relevant document
• This is partially solved by the adoption of stems
• Stemming reduces the size of the index
• Controversial
• Many search engines do not adopt any stemming
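To show the idea only, here is a deliberately naive suffix stripper. It is not the Porter algorithm or any other established stemmer; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common English suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("connect", "connected", "connecting", "connects", "during"):
    print(word, "->", naive_stem(word))
```

The variants of "connect" collapse to one stem, which helps matching, but "during" is wrongly reduced to "dur". Errors of this kind are one reason stemming is considered controversial.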
Keyword Selection
• Full text representation: all words in the text are used as index terms (or keywords)
• Alternative to full text representation:
– Not all words in the text are used as index terms
– Use just nouns as index terms
– Group nouns that appear nearby in the text into a single indexing component (a concept)
Thesaurus
• Used as reference to a treasury of words
• Precompiled list of important words in a knowledge domain
• For each word in this list, a set of related words derived from a synonymy relationship
Thesaurus
• Query formulation process (for IR):
– User forms a query
– Query terms might be erroneous or improper
– Solution: reformulate the original query
– Usually, this implies expanding the original query with related terms
– Thus, it is natural to use a thesaurus for finding related terms
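Query expansion with a thesaurus can be sketched as a dictionary lookup. The thesaurus entries below are hypothetical, and the policy of appending every related term without weighting is an assumption made to keep the example short.

```python
# Hypothetical thesaurus: each word maps to related (synonymous) words.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand_query(terms, thesaurus=THESAURUS) -> list:
    """Return the query terms followed by their related terms, no duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "rental"]))
```

Expansion tends to raise recall, since documents using a synonym can now match, at some risk of lowering precision.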
Taxonomies
Folksonomies
Folksonomy
• Collaborative flat vocabulary
• Terms are selected by a population of users
• Each term is called a tag
References
• Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition. Chapter 6, Documents: Languages & Properties, Retrieved from http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_chap06.pdf