![Page 1: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/1.jpg)
Text Classification and Images
by Carl Sable
![Page 2: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/2.jpg)
Overview
• Text Classification.
– Involves assigning text documents to one or more groups (classes).
– Techniques can be applied to image captions to classify corresponding images.
• Various methods, evaluation techniques, and related issues will be discussed.
• Some discussion of other research involving image captions.
![Page 3: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/3.jpg)
Text Classification Tasks
• Text Categorization (TC) - Assign text documents to existing, well-defined categories.
• Information Retrieval (IR) - Retrieve text documents which match user query.
• Clustering - Group text documents into clusters of similar documents.
• Text Filtering - Retrieve documents which match a user profile.
![Page 4: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/4.jpg)
Text Categorization
• Classify each test document by assigning category labels.
– M-ary categorization assumes M labels per document.
– Binary categorization requires a yes/no decision for every document/category pair.
• Most techniques require training.
– Parametric vs non-parametric.
– Batch vs on-line.
![Page 5: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/5.jpg)
Early Work
• The Federalist Papers.
– Published anonymously in 1787 and 1788.
– Authorship of 12 papers in dispute (either Hamilton or Madison).
• Mosteller and Wallace, 1963.
– Compared rate per thousand words of high-frequency words.
– Collected very strong evidence in favor of Madison.
![Page 6: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/6.jpg)
Rocchio
• All documents and categories represented by word vectors.
• TF*IDF weights for words.
– Term frequency is the number of times a word appears in a document or category.
– Inverse document frequency relates to the scarcity of a word over the entire training collection.
• Similarity computed for all document-category pairs.
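A minimal sketch of the idea on this slide, under stated assumptions: documents are pre-tokenized word lists, IDF is taken as log(N / df), each category vector is the plain average of its training documents' TF*IDF vectors, and similarity is cosine. The function names are hypothetical, and this is the simple centroid variant without a negative-example term, not necessarily the exact formulation used in the studies discussed here.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_rocchio(docs, labels):
    """Build IDF weights and one averaged TF*IDF prototype per category."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))      # document frequency
    idf = {w: math.log(n / df[w]) for w in df}         # assumed IDF form
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for d, y in zip(docs, labels):
        for w, tf in Counter(d).items():
            sums[y][w] += tf * idf[w]
    prototypes = {c: {w: x / counts[c] for w, x in s.items()}
                  for c, s in sums.items()}
    return idf, prototypes

def classify_rocchio(model, doc):
    """Assign the category whose prototype is most similar to the document."""
    idf, prototypes = model
    vec = {w: tf * idf[w] for w, tf in Counter(doc).items() if w in idf}
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))

train_docs = [["stocks", "market", "shares"], ["market", "trading", "stocks"],
              ["goal", "match", "team"], ["team", "score", "goal"]]
train_labels = ["finance", "finance", "sports", "sports"]
model = train_rocchio(train_docs, train_labels)
print(classify_rocchio(model, ["stocks", "trading"]))
```

Words that occur in every training document get an IDF of zero here, so they do not influence the similarity.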
![Page 7: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/7.jpg)
Naïve Bayes
• Estimates probabilities of categories given a document.
• Uses joint probabilities of words and categories (Bayes’ rule).
• Assumes words are independent of each other.
• Can incorporate a priori probabilities of categories.
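The bullets above can be illustrated with a small multinomial Naïve Bayes sketch. The assumptions are mine, not the slide's: tokenized documents, add-one (Laplace) smoothing, log probabilities to avoid underflow, and hypothetical function names. The prior term is where a priori category probabilities enter.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per category and word occurrences per category."""
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
        vocab.update(d)
    return prior, word_counts, vocab

def classify_nb(model, doc):
    """Return the category maximizing log P(c) + sum of log P(w | c)."""
    prior, word_counts, vocab = model
    n_docs = sum(prior.values())
    scores = {}
    for c in prior:
        total = sum(word_counts[c].values())
        score = math.log(prior[c] / n_docs)            # a priori probability
        for w in doc:
            if w in vocab:                             # skip unseen words
                # Independence assumption: per-word terms are simply summed.
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

train_docs = [["stocks", "market", "shares"], ["market", "trading", "stocks"],
              ["goal", "match", "team"], ["team", "score", "goal"]]
train_labels = ["finance", "finance", "sports", "sports"]
model = train_nb(train_docs, train_labels)
print(classify_nb(model, ["goal", "team"]))
```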
![Page 8: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/8.jpg)
Other Common Methods
• K-Nearest Neighbor (kNN) - Use k closest training documents to predict category.
• Decision Trees (DTree) - Construct classification trees based on training data.
• Neural Networks (NNet) - Learn non-linear mapping from input words to categories.
• Expert Systems - Use manually constructed, domain-specific, application-specific rules.
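As an illustration of the first method in this list, here is a hedged kNN sketch. It assumes raw term-count vectors, cosine similarity, and a similarity-weighted vote among the k nearest training documents; these are common choices, but the slide does not specify them, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse {word: count} vectors."""
    dot = sum(x * v.get(w, 0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(train_docs, labels, test_doc, k=3):
    """Vote among the k most similar training documents."""
    test_vec = Counter(test_doc)
    sims = sorted(((cosine(test_vec, Counter(d)), y)
                   for d, y in zip(train_docs, labels)),
                  key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, y in sims[:k]:
        votes[y] += sim            # each neighbor votes with its similarity
    return max(votes, key=votes.get)

train_docs = [["stocks", "market", "shares"], ["market", "trading", "stocks"],
              ["goal", "match", "team"], ["team", "score", "goal"]]
train_labels = ["finance", "finance", "sports", "sports"]
print(knn_classify(train_docs, train_labels, ["goal", "team", "win"], k=3))
```

Unlike Rocchio or Naïve Bayes, nothing is learned ahead of time; all the work happens when a test document arrives.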
![Page 9: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/9.jpg)
Advanced Techniques
• Support Vector Machines (SVMs).
– Use Structural Risk Minimization principle.
– Find hypothesis which minimizes “true error”.
• Widrow-Hoff and EG - Update weight vector based on each training example.
• Maximum Entropy - Derive constraints expressing characteristics of training data.
• Boosting - Combine weak hypotheses to produce highly accurate classification rule.
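The slide does not give the Widrow-Hoff and EG update rules, so the sketch below uses the forms in which they are usually stated for linear text classifiers: an additive update for Widrow-Hoff and a multiplicative, renormalized update for EG. The learning rate `eta` and the function names are assumptions for illustration.

```python
import math

def widrow_hoff_update(w, x, y, eta):
    """Additive update: w <- w - 2*eta*(w.x - y)*x."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - 2 * eta * err * xi for wi, xi in zip(w, x)]

def eg_update(w, x, y, eta):
    """Multiplicative update, renormalized so the weights sum to 1."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    unnorm = [wi * math.exp(-2 * eta * err * xi) for wi, xi in zip(w, x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# One training example: feature 0 is present and the target is 1.
print(widrow_hoff_update([0.0, 0.0], [1.0, 0.0], 1.0, eta=0.25))
print(eg_update([0.5, 0.5], [1.0, 0.0], 1.0, eta=0.5))
```

Both are on-line learners: the weight vector changes after every training example rather than after a pass over the whole batch.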
![Page 10: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/10.jpg)
Common Test Corpora
• Reuters - Collection of newswire stories from 1987 to 1991, labeled with categories.
• TREC-AP - Newswire stories from 1988 to 1990, labeled with categories.
• OHSUMED - Medline articles from 1987 to 1991, MeSH categories assigned.
• UseNet newsgroups.
• WebKB - Web pages gathered from university CS departments.
![Page 11: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/11.jpg)
Other Issues to Consider
• Which words to use (feature selection).
• Normalization.
• Use of lexical databases.
– Longman Dictionary of Contemporary English (LDOCE), WordNet, English Verb Classes and Alternations (EVCA).
– May cause problems due to lexical ambiguity.
• High cost of manual labels.
![Page 12: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/12.jpg)
Categorizing Images
• Some previous research on content-based image categorization, very little on text-based image categorization!
• WebSEEk.
– Categorizes images and videos based on key-terms extracted from URL, alt text, hyperlinks, and directory names.
– Semi-automated key-term dictionary maps key-terms to subject(s) from a taxonomy.
![Page 13: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/13.jpg)
Evaluation Metrics
• Per Category Measures:
– Simple accuracy or error measures can be misleading.
– Precision, recall, and fallout.
– F-measure, average precision, and break-even point (BEP) combine precision and recall.
• Macro-averaging vs Micro-averaging.
• Should choose metric ahead of time (maybe)!
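Contingency table:

|              | Yes is correct | No is correct |
|--------------|----------------|---------------|
| Assigned YES | a              | b             |
| Assigned NO  | c              | d             |

Precision: p = a / (a + b)
Recall: r = a / (a + c)
Fallout: f = b / (b + d)
Accuracy: Acc = (a + d) / n
Error: Err = (b + c) / n
where n = a + b + c + d.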
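The contingency-table formulas can be sketched in code as follows. The balanced F-measure (F1 = 2pr / (p + r)) and the two averaging functions are written out under the usual definitions, which the slide names but does not spell out; the function names and the example counts are hypothetical.

```python
def category_metrics(a, b, c, d):
    """Per-category measures from contingency counts a, b, c, d."""
    n = a + b + c + d
    p = a / (a + b) if a + b else 0.0          # precision
    r = a / (a + c) if a + c else 0.0          # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r,
            "fallout": b / (b + d) if b + d else 0.0,
            "accuracy": (a + d) / n, "error": (b + c) / n, "f1": f1}

def macro_precision(tables):
    """Average the per-category precisions (every category counts equally)."""
    return sum(category_metrics(*t)["precision"] for t in tables) / len(tables)

def micro_precision(tables):
    """Pool the counts first (every decision counts equally)."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    return a / (a + b) if a + b else 0.0

# Two categories given as (a, b, c, d); the second one is rare.
tables = [(8, 2, 4, 86), (1, 3, 0, 96)]
print(category_metrics(*tables[0]))
print(macro_precision(tables), micro_precision(tables))
```

In the first table accuracy is 0.94 even though a third of the true members are missed, which is why accuracy alone can mislead for small categories.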
![Page 14: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/14.jpg)
Some Results and Analysis
• Comparisons.
– SVM, kNN, AdaBoost, WH, and EG all showed very impressive performance.
– Naïve Bayes and Rocchio tended to show relatively poor performance.
• Rocchio possibly could have done better.
– Should be using probabilistic Rocchio.
– Works best if categories are mutually exclusive.
– May perform at its best when there are only 2 categories.
![Page 15: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/15.jpg)
Information Retrieval
• User inputs query, system should retrieve all relevant documents.
• Simple technique: keyword search.
• Other techniques rely on word vectors.
– TF*IDF commonly used for weights.
– Can compute similarity between query vector and document vectors.
• Evaluation - Similar to text categorization, treat relevant documents as single category.
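A hedged sketch of vector-based retrieval as described above: the query and each document become TF*IDF vectors and documents are ranked by cosine similarity to the query. The IDF form log(N / df), the tokenized inputs, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_documents(docs, query):
    """Return indices of matching documents, most similar first."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}

    def vectorize(tokens):
        return {w: tf * idf[w] for w, tf in Counter(tokens).items() if w in idf}

    q = vectorize(query)
    scored = [(cosine(q, vectorize(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for score, i in scored if score > 0]

docs = [["president", "speaks", "crowd"],
        ["team", "wins", "match"],
        ["president", "visits", "team"]]
print(rank_documents(docs, ["president", "crowd"]))
```

Unlike keyword search, this yields a ranking: a document sharing only the common word "president" still matches, but scores lower than one that also shares the rarer word "crowd".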
![Page 16: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/16.jpg)
Relevance Feedback
• After initial retrieval, user makes relevance judgements for retrieved documents.
• New round of retrieval based on feedback.
• Similar to text categorization with two categories: relevant vs non-relevant.
• Rocchio algorithm originally created for this task.
• Naïve Bayes very successful.
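For relevance feedback, the Rocchio method is usually stated as a query reformulation: the new query is the old query plus a weighted average of the relevant document vectors minus a weighted average of the non-relevant ones. The sketch below follows that form; the default weights `alpha`, `beta`, and `gamma` are commonly quoted textbook values and, like the function name, are assumptions rather than anything given on the slide.

```python
from collections import defaultdict

def rocchio_feedback(query, relevant, nonrelevant,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant and away from non-relevant documents."""
    new_query = defaultdict(float)
    for w, x in query.items():
        new_query[w] += alpha * x
    for vec in relevant:
        for w, x in vec.items():
            new_query[w] += beta * x / len(relevant)
    for vec in nonrelevant:
        for w, x in vec.items():
            new_query[w] -= gamma * x / len(nonrelevant)
    # Negative weights are dropped, a common convention.
    return {w: x for w, x in new_query.items() if x > 0}

query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]        # judged relevant by the user
nonrelevant = [{"jaguar": 1.0, "car": 1.0}]     # judged non-relevant
print(rocchio_feedback(query, relevant, nonrelevant))
```

The reformulated query is then used for the next round of retrieval.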
![Page 17: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/17.jpg)
Possible Improvements
• Lexical databases sometimes used for query expansion.
• Word sense disambiguation.
– Expand query with correct senses.
– Used on documents to prevent retrieval based on false matches.
• Notion of semantic similarity.
![Page 18: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/18.jpg)
Retrieval of Captioned Images
• Typical properties of image captions:
– Shorter than documents in typical IR tasks.
– Subject noun phrase usually denotes most significant object in picture.
– In news domain, first sentence generally describes image, rest is background.
• Different types of queries.
• Many techniques from general IR not applicable.
![Page 19: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/19.jpg)
Related Research
• Smeaton.
– Automatically derived Hierarchical Concept Graphs (HCGs) based on WordNet IS-A links.
– Computed semantic similarity between nouns.
– Some success improving image retrieval.
• Guglielmo and Rowe.
– Used logical form records to capture meaning of queries and captions for comparison.
– System significantly beat keyword search.
![Page 20: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/20.jpg)
Other Text Classification Tasks
• Clustering documents.
– Create groups with similar attributes.
– Various methods and algorithms exist.
– Hierarchical vs non-hierarchical.
– Each group has centroid.
– Can aid in Information Retrieval.
• Text Filtering.
– Filter articles of potential interest for a user.
– Uses many of the same methods as TC and IR.
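The slide names no specific clustering algorithm. As one example of a non-hierarchical method in which each group has a centroid, here is a minimal k-means sketch. It assumes documents are already dense numeric vectors, uses squared Euclidean distance, and seeds the centroids with the first k points so the run is deterministic; all of these are illustrative choices.

```python
def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) after alternating assign/update steps."""
    centroids = [list(p) for p in points[:k]]      # deterministic seeding
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:                   # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                            # keep old centroid if empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

For retrieval, a query can be compared against the centroids first and then only against the documents in the closest clusters.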
![Page 21: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/21.jpg)
Processing Image Captions
• The Correspondence Problem - How to correlate visual information with words.
– Visual semantics.
– Symbolic representation of visual data.
• Srihari.
– Piction - System that automatically identifies human faces in captioned newspaper photos.
– Integrates an NLP module, which parses captions, with an image understanding (IU) module, which detects objects.
![Page 22: Text Classification and Images](https://reader035.vdocuments.us/reader035/viewer/2022062217/56814b06550346895db81d57/html5/thumbnails/22.jpg)
Final Observations
• Previous Work.
– General text categorization studied extensively.
– Some research on text-based image retrieval.
– Very little research involving text-based image categorization.
• Image captions contain information unlikely to be extracted from just images.
• High potential exists for significant research involving text-based image categorization.