Vocabulary Size and Term Distribution: Tokenization, Text Normalization and Stemming (Lecture 2)
TRANSCRIPT
Overview
Getting started
– Tokenization, stemming, compounds
– End of sentence
Collection vocabulary
– Terms, tokens, types
– Vocabulary size
– Term distribution
Stop words
Vector representation of text and term weighting
Tokenization
Friends, Romans, Countrymen, lend me your ears;
Friends | Romans | Countrymen | lend | me | your | ears
Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
Type: the class of all tokens containing the same character sequence
Term: a type that is included in the system dictionary (normalized)
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
How to handle special cases involving apostrophes, hyphens, etc.?
– C++, C#, URLs, emails, phone numbers, dates
– San Francisco, Los Angeles
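A minimal sketch of how a regular-expression tokenizer might keep such cases together (the pattern and the tokenize helper below are illustrative, not part of the lecture):

    import re

    # Illustrative pattern: a word with optional internal apostrophes or hyphens
    # (O'Neill, aren't, anti-discriminatory), otherwise any single non-space
    # character. Real tokenizers add further rules for URLs, emails, dates,
    # C++, C#, etc.
    TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("Mr. O'Neill thinks that the boys' stories "
                   "about Chile's capital aren't amusing."))
    # ['Mr', '.', "O'Neill", 'thinks', ..., "aren't", 'amusing', '.']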
Issues of tokenization are language specific
– Requires the language to be known
Language identification based on classifiers that use short character subsequences as features is highly effective
– Most languages have distinctive signature patterns
Very important for information retrieval
– Splitting tokens on spaces can cause bad retrieval results
– A search for York University returns pages containing New York University
German: compound nouns
– Retrieval systems for German greatly benefit from the use of a compound-splitter module
– Checks if a word can be subdivided into words that appear in the vocabulary (a sketch follows this list)
East Asian languages (Chinese, Japanese, Korean, Thai)
– Text is written without any spaces between words
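A minimal sketch of the vocabulary-lookup idea behind such a compound splitter (the function name and the toy vocabulary are illustrative):

    def split_compound(word, vocabulary):
        """Split `word` into vocabulary words if possible, else return None."""
        if word in vocabulary:
            return [word]
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in vocabulary:
                rest = split_compound(tail, vocabulary)
                if rest is not None:
                    return [head] + rest
        return None

    # Toy vocabulary for the classic "life insurance company employee" example
    vocab = {"lebens", "versicherungs", "gesellschafts", "angestellter"}
    print(split_compound("lebensversicherungsgesellschaftsangestellter", vocab))
    # ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']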
Building a stop word list
Sort terms by collection frequency and take the most frequent (see the sketch at the end of this section)
– In a collection about insurance practices, "insurance" would be a stop word
Why do we need stop lists?
– Smaller indices for information retrieval
– Better approximation of importance for summarization etc.
Their use is problematic in phrasal searches
Trend in IR systems over time
– Large stop lists (200–300 terms)
– Very small stop lists (7–12 terms)
– No stop list whatsoever
– The 30 most common words account for 30% of the tokens in written text
Good compression techniques for indices
Term weighting leads to very common words having little impact on document representation
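A minimal sketch of building a stop list from collection frequency (the toy documents and the cutoff n are illustrative):

    from collections import Counter

    def build_stop_list(documents, n=30):
        """Take the n terms with the highest collection frequency as stop words."""
        counts = Counter()
        for doc in documents:
            counts.update(doc.lower().split())
        return {term for term, _ in counts.most_common(n)}

    docs = ["the insurance policy covers the car",
            "the insurance claim was filed by the driver"]
    print(build_stop_list(docs, n=2))
    # {'the', 'insurance'}: in a collection about insurance practices,
    # "insurance" itself ends up on the stop list.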
Normalization
Token normalization
– Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A. vs. USA
– Anti-discriminatory vs. antidiscriminatory
– Car vs. automobile?
Normalization is sensitive to the query:
Query term   Terms that should match
Windows      Windows
windows      Windows, windows, window
Window       window, windows
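One way to make normalization query-sensitive is an asymmetric expansion table like the one above; a minimal sketch (the dictionary and helper below are illustrative):

    # Which index terms each query term should be allowed to match.
    EXPANSIONS = {
        "Windows": {"Windows"},
        "windows": {"Windows", "windows", "window"},
        "Window":  {"window", "windows"},
    }

    def expand_query_term(term):
        # Terms without an explicit rule only match themselves.
        return EXPANSIONS.get(term, {term})

    print(expand_query_term("windows"))   # {'Windows', 'windows', 'window'}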
Capitalization/case folding
Good for
– Allows instances of Automobile at the beginning of a sentence to match a query of automobile
– Helps a search engine, since most users type ferrari when they are interested in a Ferrari car
Bad for
– Proper names vs. common nouns: General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the beginning of the sentence (see the sketch below); true casing via machine learning
In IR, lowercasing is most practical because of the way users issue their queries
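A minimal sketch of the sentence-initial lowercasing heuristic (the naive sentence splitting and the function name are illustrative):

    import re

    def heuristic_case_fold(text):
        """Lowercase only the first word of each sentence; leave
        mid-sentence capitalization (often proper names) untouched."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        folded = []
        for s in sentences:
            words = s.split()
            if words:
                words[0] = words[0].lower()
            folded.append(" ".join(words))
        return " ".join(folded)

    print(heuristic_case_fold("Automobiles are fast. General Motors builds them."))
    # "automobiles are fast. general Motors builds them."
    # Note: sentence-initial proper names are still lowercased, which is why
    # true casing via machine learning is mentioned as an alternative.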
Other languages
60% of webpages are in English
– Less than one third of Internet users speak English
– Less than 10% of the world's population primarily speaks English
Only about one third of blog posts are in English
Stemming and lemmatization
Organize, organizes, organizing
Democracy, democratic, democratization
Am, are, is → be
Car, cars, car’s, cars’ → car
Stemming
– Crude heuristic process that chops off the ends of words
– Democratic → democrat
Lemmatization
– Use of vocabulary and morphological analysis; returns the base form of a word (lemma)
– Democratic → democracy
– Sang → sing
Porter stemmer
Most common algorithm for stemming English
– 5 phases of word reduction
– SSES → SS   (caresses → caress)
– IES → I     (ponies → poni)
– SS → SS
– S →         (cats → cat)
– EMENT →     (replacement → replac, cement → cement)
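A minimal sketch of the first-phase plural rules listed above (only these four rules, not the full five-phase algorithm; in practice a complete implementation such as nltk.stem.PorterStemmer would be used):

    def porter_step_1a(word):
        """Apply the step-1a plural rules; the first matching rule wins."""
        if word.endswith("sses"):
            return word[:-2]      # caresses -> caress
        if word.endswith("ies"):
            return word[:-2]      # ponies -> poni
        if word.endswith("ss"):
            return word           # caress -> caress
        if word.endswith("s"):
            return word[:-1]      # cats -> cat
        return word

    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))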
Vocabulary size
Dictionaries
– 600,000+ words
– But they do not include names of people, locations, products, etc.
Heaps’ law: estimating the number of terms
M = k · T^b
– M: vocabulary size (number of terms)
– T: number of tokens
– 30 < k < 100
– b ≈ 0.5
Linear relation between vocabulary size and number of tokens in log-log space: log M = log k + b · log T
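A small worked example of the estimate; the parameter values k = 44 and b = 0.49 are illustrative and fall within the range given above:

    def heaps_vocabulary_size(num_tokens, k=44, b=0.49):
        """Heaps' law estimate M = k * T^b of the vocabulary size."""
        return k * num_tokens ** b

    # Rough picture of how the vocabulary grows with collection size.
    for T in (10_000, 1_000_000, 100_000_000):
        print(f"T = {T:>11,}  ->  M ≈ {heaps_vocabulary_size(T):,.0f}")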
Zipf’s law: modeling the distribution of terms
The collection frequency of the ith most common term is proportional to 1/i
If the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
cf_i ∝ 1/i
Equivalently, cf_i = c · i^k with k = −1, so log cf_i = log c + k · log i: a straight line with slope −1 in log-log space.
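A minimal sketch of what the law predicts for the top ranks, assuming cf_i = cf_1 / i (the count of the most frequent term below is an illustrative number):

    def zipf_expected_counts(cf1, num_ranks=5):
        """Expected collection frequency of the i-th ranked term: cf_i = cf_1 / i."""
        return [cf1 / i for i in range(1, num_ranks + 1)]

    # If the most frequent term occurs 1,000,000 times:
    for rank, cf in enumerate(zipf_expected_counts(1_000_000), start=1):
        print(f"rank {rank}: about {cf:,.0f} occurrences")
    # rank 1: 1,000,000; rank 2: 500,000; rank 3: 333,333; ...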