Lecture Notes for Algorithms for Data Science 1

Jayesh Choudhuri
January 12

1 Nearest Neighbors

One of the fundamental problems in data mining is to find “similar” items. For example, given an image, find similar images in a dataset of images; or, looking at a collection of web pages, find near-duplicate pages. The basic method would be to perform a linear search, i.e.

1. In case of images: compare the query image with each image in the dataset.

2. In case of documents: take the query document's string and find similar strings/documents by going through all the documents in the dataset.
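The linear-search baseline above can be sketched as follows. This is a minimal illustration; the `similarity` argument is a placeholder for whichever measure is used, and the toy character-overlap measure below is only for demonstration:

```python
def linear_search(query, dataset, similarity):
    """Scan the whole dataset and return the item most similar to the query."""
    best_item, best_score = None, float("-inf")
    for item in dataset:
        score = similarity(query, item)
        if score > best_score:
            best_item, best_score = item, score
    return best_item

# Toy similarity: size of the set of shared characters (illustrative only).
overlap = lambda a, b: len(set(a) & set(b))

docs = ["we are having class here", "completely different text", "we are here"]
print(linear_search("we are having class", docs, overlap))
# → "we are having class here"
```

The cost is one similarity computation per dataset item, which is exactly what nearest-neighbor techniques later in the course aim to avoid.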

2 Representation of dataset

1. Images: pixel values, SIFT features

2. Documents: string, vector, set

3 Similarity of Documents

Understanding the meaning of similarity is important. In this case we are trying to find character-level similarity, and this does not require us to examine the words in the documents or their uses or semantic meaning. Finding documents that are exact duplicates is easy and can be done by comparing two documents character by character. In many cases, however, documents are not exactly identical but share a large portion of similar text. Searching for such documents means finding near duplicates instead of exact duplicates. Some of the applications of finding near duplicates are plagiarism detection, mirror pages, articles from the same source, etc. Generally, documents are normalised or pre-processed by removing punctuation and converting all characters to lower case.

Shingling:
One of the ways of representing documents is to represent them as sets. The elements of the set are called shingles. Given a positive integer k and a sequence of terms in the document d, the k-shingles of d are defined to be the set of all consecutive sequences of k terms in d. For example, consider the following text:

“We are having class here”

Taking k = 5, the representation of the document as 5-shingles would be

    CS430 Algo for DS Spring 2015 Instructor: Anirban Dasgupta

{We ar, e are, are , are h, ..., here}

In the above case k was taken as a number of characters. One can also consider k as a number of words, which would result in a different representation. So, taking k = 2 words in the above example we have:

{We are, are having, having class, ..., class here}

Such a representation is known as a k-gram representation. If k = 1 it is known as a unigram, for k = 2 a bigram, for k = 3 a trigram, and so on. The k-gram representation works well for a language like English, where a space acts as a separator between two words; but in languages like Chinese there is no separation between words, and thus shingling can be used.
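The two representations above, character k-shingles and word k-grams, can be sketched as follows (a minimal sketch; the function names are my own):

```python
def char_shingles(text, k):
    """Set of all consecutive k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_kgrams(text, k):
    """Set of all consecutive k-word sequences of text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "We are having class here"
print(sorted(char_shingles(doc, 5)))  # 5-shingles such as 'We ar', 'e are', ...
print(sorted(word_kgrams(doc, 2)))    # bigrams: 'We are', 'are having', ...
```

Note that both return sets, so repeated shingles in a document are kept only once, matching the set representation described above.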

Representation of documents as a vector:
Documents can also be represented as a vector, where each element in the vector can be a boolean, showing the presence of a term in the document, or an integer, showing the frequency of a term in the document. In the context of representing documents as vectors the following terms are defined:

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a weight measure used in text mining and information retrieval. TF-IDF is a statistical measure showing the importance of a word in a document within a collection or corpus. The TF-IDF weight for a word increases with the frequency of the word in a document but is offset by the frequency of the word in the corpus. TF-IDF weighting is used for scoring and ranking document relevance.

Term Frequency:

Each term in the document is assigned a weight. The weight depends on the number of times the word occurs in the document. One of the simplest ways to weight a word in a document is to assign it a weight equal to the frequency of the word in the document. This weighting scheme is known as term frequency and is denoted tf_{t,d}, where t is the term and d the document.

The term-frequency weight gives quantitative information about the document. Such a representation of a document is known as the bag-of-words model. In such cases the order of terms is not considered, but the number of occurrences of each term is important (in contrast to the boolean representation).
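A bag-of-words term-frequency count can be sketched with the standard library (a minimal sketch using whitespace tokenization and lower-casing, as in the pre-processing described earlier):

```python
from collections import Counter

def term_frequencies(document):
    """tf_{t,d}: the number of occurrences of each term t in document d."""
    return Counter(document.lower().split())

d = "Auto repair shop auto parts"
tf = term_frequencies(d)
print(tf["auto"])   # 2
print(tf["shop"])   # 1
```

Note that the `Counter` discards word order entirely, which is exactly the bag-of-words assumption.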

    Inverse Document frequency:

The term-frequency weight suffers from a critical problem: all terms are given equal importance, without regard to the relevance of the term to the document and the corpus of documents. Some terms have no power in determining relevance. Consider a corpus of documents from the auto industry: almost all the documents are likely to contain the word auto. So, for relevance determination it is necessary to attenuate the effect of terms that occur too many times in the collection. In order to scale down the term frequency, a new measure is introduced, named document frequency df_t, which gives the number of documents in the collection that contain term t. Using document frequency we define the Inverse Document Frequency idf_t, which is given by

idf_t = log(N / df_t)


where N is the total number of documents in the collection.
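The formula idf_t = log(N / df_t) can be sketched directly (a minimal sketch; the helper name is my own, and it assumes the term occurs in at least one document so df_t > 0):

```python
import math

def idf(term, corpus):
    """idf_t = log(N / df_t), where df_t counts documents containing term t."""
    N = len(corpus)
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(N / df)

corpus = ["auto repair shop", "auto parts store", "best auto deals", "garden tools"]
print(idf("auto", corpus))    # log(4/3): small, since "auto" occurs in 3 of 4 documents
print(idf("garden", corpus))  # log(4/1): larger, since "garden" is rare in the corpus
```

As described above, the common term "auto" is attenuated while the rare term "garden" keeps a large weight.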

    Tf-idf weighting:

Tf-idf weighting is given by combining the measures term frequency and Inverse Document Frequency into a composite weight for each term in each document:

tf-idf_{t,d} = tf_{t,d} × idf_t

In other words, tf-idf_{t,d} assigns to term t a weight in document d that is

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

    3. lowest when the term occurs in virtually all documents.
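Putting the two measures together, the composite weight tf-idf_{t,d} = tf_{t,d} × idf_t can be sketched per term and per document (a minimal sketch under the same conventions as above; the function name is my own):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return, for each document d, a dict mapping term t -> tf_{t,d} * idf_t."""
    N = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    # Document frequency df_t: number of documents containing each term.
    df = Counter()
    for words in docs:
        df.update(set(words))
    # Composite weight per term per document.
    weights = []
    for words in docs:
        tf = Counter(words)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

corpus = ["auto repair auto", "auto parts", "garden tools"]
w = tf_idf(corpus)
# "auto" appears in 2 of 3 documents; "garden" in 1 of 3:
print(w[0]["auto"])    # 2 * log(3/2)
print(w[2]["garden"])  # 1 * log(3/1)
```

Note that a term occurring in every document gets idf_t = log(1) = 0, and hence tf-idf weight 0 regardless of its term frequency, matching point 3 above.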

References

Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman

An Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
