advanced multimedia

49
Advanced Multimedia Text Retrieval/Classification Tamara Berg

Upload: evonne

Post on 25-Jan-2016

23 views

Category:

Documents


1 download

DESCRIPTION

Advanced Multimedia. Text Retrieval/Classification Tamara Berg. Announcements. Matlab basics lab – Feb 7 Matlab string processing lab – Feb 12. If you are unfamiliar with Matlab , attendance at labs is crucial!. Slide from Dan Klein. Slide from Dan Klein. Today!. Slide from Dan Klein. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Advanced Multimedia

Advanced Multimedia

Text Retrieval/ClassificationTamara Berg

Page 2: Advanced Multimedia

Announcements

• Matlab basics lab – Feb 7• Matlab string processing lab – Feb 12

If you are unfamiliar with Matlab, attendance at labs is crucial!

Page 3: Advanced Multimedia

Slide from Dan Klein

Page 4: Advanced Multimedia

Slide from Dan Klein

Page 5: Advanced Multimedia

Slide from Dan Klein

Today!

Page 6: Advanced Multimedia

What does categorization/classification mean?

Page 7: Advanced Multimedia

Slide from Dan Klein

Page 8: Advanced Multimedia

Slide from Dan Klein

Page 9: Advanced Multimedia

Slide from Dan Klein

Page 10: Advanced Multimedia

Slide from Dan Klein

Page 11: Advanced Multimedia

Slide from Min-Yen Kan

Page 12: Advanced Multimedia

Slide from Dan Kleinhttp://yann.lecun.com/exdb/mnist/index.html

Page 13: Advanced Multimedia

Slide from Dan Klein

Page 14: Advanced Multimedia

Slide from Min-Yen Kan

Page 15: Advanced Multimedia

Slide from Min-Yen Kan

Page 16: Advanced Multimedia

Slide from Min-Yen Kan

Page 17: Advanced Multimedia

Slide from Min-Yen Kan

• Machine Learning - how to select a model on the basis of data / experience

Learning parameters (e.g. probabilities)Learning structure (e.g. dependencies)Learning hidden concepts (e.g.

clustering)

Page 18: Advanced Multimedia
Page 19: Advanced Multimedia

Representing Documents

Page 20: Advanced Multimedia

Document Vectors

Page 21: Advanced Multimedia

Document Vectors· Represent document as a “bag of words”

Page 22: Advanced Multimedia

Example

• Doc1 = “the quick brown fox jumped”• Doc2 = “brown quick jumped fox the”

Page 23: Advanced Multimedia

Example

• Doc1 = “the quick brown fox jumped”• Doc2 = “brown quick jumped fox the”

Would a bag of words model represent these two documents differently?

Page 24: Advanced Multimedia

Document Vectors· Documents are represented as “bags of words”· Represented as vectors when used computationally

• Each vector holds a place for every term in the collection• Therefore, most vectors are sparse

Slide from Mitch Marcus

Page 25: Advanced Multimedia

Document Vectors· Documents are represented as “bags of words”· Represented as vectors when used computationally

• Each vector holds a place for every term in the collection• Therefore, most vectors are sparse

Slide from Mitch Marcus

Lexicon – the vocabulary set that you consider to be valid words in your documents.

Usually stemmed (e.g. running->run)

Page 26: Advanced Multimedia

Document Vectors:One location for each word.

nova galaxy heat h’wood film rolediet fur

10 5 3

5 10 10 8 7 9 10 5

10 10 9 10

5 7 9 6 10 2 8

7 5 1 3

ABCDEFGHI

“Nova” occurs 10 times in text A“Galaxy” occurs 5 times in text A“Heat” occurs 3 times in text A(Blank means 0 occurrences.)

Slide from Mitch Marcus

Page 27: Advanced Multimedia

Document Vectors:One location for each word.

nova galaxy heat h’wood film rolediet fur

10 5 3

5 10 10 8 7 9 10 5

10 10 9 10

5 7 9 6 10 2 8

7 5 1 3

ABCDEFGHI

“Nova” occurs 10 times in text A“Galaxy” occurs 5 times in text A“Heat” occurs 3 times in text A(Blank means 0 occurrences.)

Slide from Mitch Marcus

Page 28: Advanced Multimedia

Document Vectors

nova galaxy heat h’wood film rolediet fur

10 5 3

5 10 10 8 7

9 10 5 10 10 9 10

5 7 9 6 10 2 8

7 5 1 3

ABCDEFGHI

Document ids

Slide from Mitch Marcus

Page 29: Advanced Multimedia

Vector Space Model

· Documents are represented as vectors in term space• Terms are usually stems• Documents represented by vectors of terms

· A vector distance measures similarity between documents • Document similarity is based on length and direction of their vectors• Terms in a vector can be “weighted” in many ways

Slide from Mitch Marcus

Page 30: Advanced Multimedia

Document Vectors

nova galaxy heat h’wood film rolediet fur

10 5 3

5 10 10 8 7

9 10 5 10 10 9 10

5 7 9 6 10 2 8

7 5 1 3

ABCDEFGHI

Document ids

Slide from Mitch Marcus

Page 31: Advanced Multimedia

Comparing Documents

Page 32: Advanced Multimedia

Similarity between documents

A = [10 5 3 0 0 0 0 0];G = [5 0 7 0 0 9 0 0];E = [0 0 0 0 0 10 10 0];

Page 33: Advanced Multimedia

Similarity between documents

A = [10 5 3 0 0 0 0 0];G = [ 5 0 7 0 0 9 0 0];E = [ 0 0 0 0 0 10 10 0];

Treat the vectors as binary = number of words in common.

Sb(A,G) = ?Sb(A,E) = ?Sb(G,E) = ?

Which pair of documents are the most similar?

Page 34: Advanced Multimedia

Similarity between documents

A = [10 5 3 0 0 0 0 0];G = [5 0 7 0 0 9 0 0];E = [0 0 0 0 0 10 10 0];

Sum of Squared Distances (SSD) =

SSD(A,G) = ?SSD(A,E) = ?SSD(G,E) = ?

Page 35: Advanced Multimedia

Similarity between documents

A = [10 5 3 0 0 0 0 0];G = [5 0 7 0 0 9 0 0];E = [0 0 0 0 0 10 10 0];

Angle between vectors: Cos(θ) =

Dot Product:

Length (Euclidean norm):

Page 36: Advanced Multimedia

Some words give more information than others

• Does the fact that two documents both contain the word “the” tell us anything? How about “and”? Stop words (noise words): Words that are probably not useful for processing. Filtered out before natural language is applied.

• Other words can be more or less informative.

No definitive list but might include things like: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

Page 37: Advanced Multimedia

Some words give more information than others

• Does the fact that two documents both contain the word “the” tell us anything? How about “and”? Stop words (noise words): Words that are probably not useful for processing. Filtered out before natural language is applied.

• Other words can be more or less informative.

No definitive list but might include things like: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

Page 38: Advanced Multimedia

Classifying Documents

Page 39: Advanced Multimedia

Slide from Min-Yen Kan

Here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?

Page 40: Advanced Multimedia

Slide from Min-Yen Kan

Query document – which class should you label it with?

Page 41: Advanced Multimedia

Classification by Nearest Neighbor

Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find most similar doc)

Slide from Min-Yen Kan

Page 42: Advanced Multimedia

Classification by kNN

Classify the test document as the majority class of the k documents “nearest” to the query document. Slide from Min-Yen Kan

Page 43: Advanced Multimedia

Slide from Min-Yen Kan

Page 44: Advanced Multimedia

Slide from Min-Yen Kan

Page 45: Advanced Multimedia

Slide from Min-Yen Kan

Page 46: Advanced Multimedia

Slide from Min-Yen Kan

Page 47: Advanced Multimedia

Slide from Min-Yen Kan

Page 48: Advanced Multimedia
Page 49: Advanced Multimedia

Slide from Min-Yen Kan

What are the features? What’s the training data? Testing data? Parameters?

Classification by kNN