Vocabulary Size and Term Distribution: Tokenization, Text Normalization and Stemming (Lecture 2)
TRANSCRIPT
Overview
Getting started
– Tokenization, stemming, compounds
– End of sentence
Collection vocabulary
– Terms, tokens, types
– Vocabulary size
– Term distribution
Stop words
Vector representation of text and term weighting
Tokenization
Friends, Romans, Countrymen, lend me your ears;
Friends | Romans | Countrymen | lend | me | your | ears
Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
Type: the class of all tokens containing the same character sequence
Term: a type that is included in the system dictionary (normalized)
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
How to handle special cases involving apostrophes, hyphens, etc.?
– C++, C#, URLs, emails, phone numbers, dates
– San Francisco, Los Angeles
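A minimal sketch of how a regular-expression tokenizer might keep such cases together (the pattern and the tokenize helper below are illustrative, not part of the lecture):

    import re

    # Illustrative pattern: a word with optional internal apostrophes or hyphens
    # (O'Neill, aren't, anti-discriminatory), otherwise any single non-space
    # character. Real tokenizers add further rules for URLs, emails, dates,
    # C++, C#, etc.
    TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("Mr. O'Neill thinks that the boys' stories "
                   "about Chile's capital aren't amusing."))
    # ['Mr', '.', "O'Neill", 'thinks', ..., "aren't", 'amusing', '.']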
Issues of tokenization are language specific
– Requires the language to be known
Language identification based on classifiers that use short character subsequences as features is highly effective
– Most languages have distinctive signature patterns
Very important for information retrieval
– Splitting tokens on spaces can cause bad retrieval results
– A search for York University returns pages containing New York University
German: compound nouns
– Retrieval systems for German greatly benefit from the use of a compound-splitter module
– Checks if a word can be subdivided into words that appear in the vocabulary (a sketch follows this list)
East Asian languages (Chinese, Japanese, Korean, Thai)
– Text is written without any spaces between words
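A minimal sketch of the vocabulary-lookup idea behind such a compound splitter (the function name and the toy vocabulary are illustrative):

    def split_compound(word, vocabulary):
        """Split `word` into vocabulary words if possible, else return None."""
        if word in vocabulary:
            return [word]
        for i in range(1, len(word)):
            head, tail = word[:i], word[i:]
            if head in vocabulary:
                rest = split_compound(tail, vocabulary)
                if rest is not None:
                    return [head] + rest
        return None

    # Toy vocabulary for the classic "life insurance company employee" example
    vocab = {"lebens", "versicherungs", "gesellschafts", "angestellter"}
    print(split_compound("lebensversicherungsgesellschaftsangestellter", vocab))
    # ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']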
Building a stop word list
Sort terms by collection frequency and take the most frequent (see the sketch at the end of this section)
– In a collection about insurance practices, "insurance" would be a stop word
Why do we need stop lists?
– Smaller indices for information retrieval
– Better approximation of importance for summarization etc.
Their use is problematic in phrasal searches
Trend in IR systems over time
– Large stop lists (200–300 terms)
– Very small stop lists (7–12 terms)
– No stop list whatsoever
– The 30 most common words account for 30% of the tokens in written text
Good compression techniques for indices
Term weighting leads to very common words having little impact on document representation
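A minimal sketch of building a stop list from collection frequency (the toy documents and the cutoff n are illustrative):

    from collections import Counter

    def build_stop_list(documents, n=30):
        """Take the n terms with the highest collection frequency as stop words."""
        counts = Counter()
        for doc in documents:
            counts.update(doc.lower().split())
        return {term for term, _ in counts.most_common(n)}

    docs = ["the insurance policy covers the car",
            "the insurance claim was filed by the driver"]
    print(build_stop_list(docs, n=2))
    # {'the', 'insurance'}: in a collection about insurance practices,
    # "insurance" itself ends up on the stop list.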
Normalization
Token normalization
– Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A. vs. USA
– Anti-discriminatory vs. antidiscriminatory
– Car vs. automobile?
Normalization is sensitive to the query:
Query term   Terms that should match
Windows      Windows
windows      Windows, windows, window
Window       window, windows
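One way to make normalization query-sensitive is an asymmetric expansion table like the one above; a minimal sketch (the dictionary and helper below are illustrative):

    # Which index terms each query term should be allowed to match.
    EXPANSIONS = {
        "Windows": {"Windows"},
        "windows": {"Windows", "windows", "window"},
        "Window":  {"window", "windows"},
    }

    def expand_query_term(term):
        # Terms without an explicit rule only match themselves.
        return EXPANSIONS.get(term, {term})

    print(expand_query_term("windows"))   # {'Windows', 'windows', 'window'}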
Capitalization/case folding
Good for
– Allows instances of Automobile at the beginning of a sentence to match a query of automobile
– Helps a search engine, since most users type ferrari when they are interested in a Ferrari car
Bad for
– Proper names vs. common nouns: General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the beginning of the sentence (see the sketch below); true casing via machine learning
In IR, lowercasing is most practical because of the way users issue their queries
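A minimal sketch of the sentence-initial lowercasing heuristic (the naive sentence splitting and the function name are illustrative):

    import re

    def heuristic_case_fold(text):
        """Lowercase only the first word of each sentence; leave
        mid-sentence capitalization (often proper names) untouched."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        folded = []
        for s in sentences:
            words = s.split()
            if words:
                words[0] = words[0].lower()
            folded.append(" ".join(words))
        return " ".join(folded)

    print(heuristic_case_fold("Automobiles are fast. General Motors builds them."))
    # "automobiles are fast. general Motors builds them."
    # Note: sentence-initial proper names are still lowercased, which is why
    # true casing via machine learning is mentioned as an alternative.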
Other languages
60% of webpages are in English
– Less than one third of Internet users speak English
– Less than 10% of the world's population primarily speaks English
Only about one third of blog posts are in English
Stemming and lemmatization
Organize, organizes, organizing
Democracy, democratic, democratization
Am, are, is → be
Car, cars, car’s, cars’ → car
Stemming
– Crude heuristic process that chops off the ends of words
– Democratic → democrat
Lemmatization
– Use of vocabulary and morphological analysis; returns the base form of a word (lemma)
– Democratic → democracy
– Sang → sing
Porter stemmer
Most common algorithm for stemming English
– 5 phases of word reduction
– SSES → SS   (caresses → caress)
– IES → I     (ponies → poni)
– SS → SS
– S →         (cats → cat)
– EMENT →     (replacement → replac, cement → cement)
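A minimal sketch of the first-phase plural rules listed above (only these four rules, not the full five-phase algorithm; in practice a complete implementation such as nltk.stem.PorterStemmer would be used):

    def porter_step_1a(word):
        """Apply the step-1a plural rules; the first matching rule wins."""
        if word.endswith("sses"):
            return word[:-2]      # caresses -> caress
        if word.endswith("ies"):
            return word[:-2]      # ponies -> poni
        if word.endswith("ss"):
            return word           # caress -> caress
        if word.endswith("s"):
            return word[:-1]      # cats -> cat
        return word

    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))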
Vocabulary size
Dictionaries
– 600,000+ words
– But they do not include names of people, locations, products, etc.
Heaps’ law: estimating the number of terms
M = k · T^b
– M: vocabulary size (number of terms)
– T: number of tokens
– 30 < k < 100
– b ≈ 0.5
Linear relation between vocabulary size and number of tokens in log-log space: log M = log k + b · log T
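A small worked example of the estimate; the parameter values k = 44 and b = 0.49 are illustrative and fall within the range given above:

    def heaps_vocabulary_size(num_tokens, k=44, b=0.49):
        """Heaps' law estimate M = k * T^b of the vocabulary size."""
        return k * num_tokens ** b

    # Rough picture of how the vocabulary grows with collection size.
    for T in (10_000, 1_000_000, 100_000_000):
        print(f"T = {T:>11,}  ->  M ≈ {heaps_vocabulary_size(T):,.0f}")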
Zipf’s law: modeling the distribution of terms
The collection frequency of the ith most common term is proportional to 1/i
If the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
cf_i ∝ 1/i
Equivalently, cf_i = c · i^k with k = −1, so log cf_i = log c + k · log i: a straight line with slope −1 in log-log space.
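A minimal sketch of what the law predicts for the top ranks, assuming cf_i = cf_1 / i (the count of the most frequent term below is an illustrative number):

    def zipf_expected_counts(cf1, num_ranks=5):
        """Expected collection frequency of the i-th ranked term: cf_i = cf_1 / i."""
        return [cf1 / i for i in range(1, num_ranks + 1)]

    # If the most frequent term occurs 1,000,000 times:
    for rank, cf in enumerate(zipf_expected_counts(1_000_000), start=1):
        print(f"rank {rank}: about {cf:,.0f} occurrences")
    # rank 1: 1,000,000; rank 2: 500,000; rank 3: 333,333; ...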