Document Classification Techniques using LSI
Barry Britt
University of Tennessee
PILOT Presentation
Spring 2007
Introduction
Automatic Classification of Documents

The problem: Brought to Dr. Berry by InRAD, LLC. Develop a system that can automatically score proposals (NSF, SBIR). Proposals are currently scored by their authors, so the existing system effectively grades proposals on the writing skill of the author.

The solution: An automatic system for classifying readiness levels.
LSI and GTP
LSI is a reduced-rank vector space model:
- Queries are reduced-rank approximations
- Produces contextually relevant results
- Semantically linked terms and documents are grouped together

GTP is an implementation of LSI:
- Local and global term weighting and document frequencies
- Document normalization
- Much more…
Document Sets - Composition
The document sets consist of Technology Readiness Reports and proposals, each with a subjective score from 1 (low) to 9 (high). There is no structure to the documents.
Example (from DS1):
windshield windshield windshield windshield windshield windshield windshield triphammer triphammer night night ithaca ithaca fleet fleet airline airline worthy window william warn visual visible variant snow severe retrofit reflect red prior primarily popular owner outside oem look imc gages day dangerous cue certification brook analog accumulation accumulate accretion
Document Sets - Composition
Document Set 1 (DS1): 4000 documents, 85.4 terms per document
Document Set 2 (DS2): 4000 documents, 49.2 terms per document
Document Set 3 (DS3): 455 documents, 37.1 terms per document
2 classes: “Phase1” and “Phase2”
Document Sets - Labels
        # Class 1   # Class 2   % Class 1   % Class 2
DS1     3244        756         81.10%      18.90%
DS2     2955        1045        73.88%      26.13%
DS3     166         289         36.48%      63.52%
Figure 1-b: Document Set Class information
Class labels for the individual documents are determined by the authors of the proposals…
Document Classification
Classification - Majority Rules
3 steps to classification:
1. Choose a document and make it the query document
2. Retrieve the x closest documents (those with the highest cosine similarity values)
3. Class = max[n1, n2], where ni is the number of retrieved documents in class i
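The three steps above can be sketched as a nearest-neighbor vote. This assumes the documents are already vectors in the reduced space; the function and variable names are mine:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def majority_rules(query_idx, doc_vecs, labels, x):
    """Classify document `query_idx` by a majority vote of its x nearest neighbors."""
    q = doc_vecs[query_idx]
    # Step 2: rank all other documents by cosine similarity to the query.
    sims = [(cosine(q, d), i) for i, d in enumerate(doc_vecs) if i != query_idx]
    sims.sort(reverse=True)
    neighbors = [i for _, i in sims[:x]]
    # Step 3: Class = max[n1, n2] -- the class holding more of the x neighbors.
    n1 = sum(1 for i in neighbors if labels[i] == 1)
    n2 = x - n1
    return 1 if n1 >= n2 else 2
```

Note that the query document is excluded from its own neighbor list, since it would otherwise always vote for itself.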
Majority Rules - Results

(rows = Actual, columns = Predicted)

          Phase 1   Phase 2
Phase 1      3239         5
Phase 2       752         4
Figure 2-a: "Majority Rules" confusion matrix for DS1

          Phase 1   Phase 2
Phase 1      2875        80
Phase 2       833       212
Figure 2-b: "Majority Rules" confusion matrix for DS2

          Phase 1   Phase 2
Phase 1        98        68
Phase 2        53       236
Figure 2-c: "Majority Rules" confusion matrix for DS3
Majority Rules - Results
Why are these results not good? Good representation from Class 1, but very poor representation from Class 2.
How can we improve results in the underrepresented class?
        Class 1   Class 2   Overall
DS1     99.85%     0.53%    81.08%
DS2     97.29%    20.29%    77.18%
DS3     59.04%    81.66%    73.41%
Figure 2-d: Precision calculations for DS1, DS2, and DS3
Classification - Class Weighting
Add a “weight” to our classification. Steps:
1. Choose a document and make it the query document
2. Retrieve the x closest documents (those with the highest cosine similarity values)
3. Class = max[weight1 * n1, weight2 * n2]
Each class has its own separate weight.
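Only step 3 changes from the majority-rules vote. A sketch of the weighted decision rule, assuming ties break toward Class 1 (the function name is mine):

```python
def weighted_class(neighbor_labels, w1, w2):
    """Class = max[w1*n1, w2*n2]; each class has its own separate weight."""
    n1 = sum(1 for c in neighbor_labels if c == 1)   # neighbors in class 1
    n2 = len(neighbor_labels) - n1                   # neighbors in class 2
    return 1 if w1 * n1 >= w2 * n2 else 2
```

Giving the minority class a weight greater than 1 lets a handful of minority-class neighbors outvote a larger majority-class contingent.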
Weighted Classifier - Results
(rows = Actual, columns = Predicted)

          Phase 1   Phase 2   Weights
Phase 1      2355       889         1
Phase 2       323       433         4
Figure 3-a: Weighted classification confusion matrix for DS1

          Phase 1   Phase 2   Weights
Phase 1      2381       574         1
Phase 2       498       547         2
Figure 3-b: Weighted classification confusion matrix for DS2

          Phase 1   Phase 2   Weights
Phase 1       132        34         2
Phase 2       104       184         1
Figure 3-c: Weighted classification confusion matrix for DS3
Weighted Classifier - Results
A better classifier: still a good representation from the majority class, and better representation from the minority class.
We can still improve on these results for the minority class.
        Class 1   Class 2   Overall
DS1     72.60%    57.28%    69.70%
DS2     80.58%    52.34%    73.20%
DS3     79.52%    63.89%    69.60%
Figure 3-d: Precision calculations for DS1, DS2, and DS3
“Weight - Document Size” (WS) Classifier
Problem: The minority class is still underrepresented.
Hypothesis: Documents in the same class will have similar “sizes”, or total number of relevant terms.
Solution: Account for document size in results for the Weighted Classifier.
“Weight - Document Size” (WS) Classifier
Only consider documents within n total words of the query document.
Steps:
1. Choose a document and make it the query document
2. Retrieve the x closest documents within n words of the query document
3. Class = max[weight1 * n1, weight2 * n2]
Each class, like the regular weighted classifier, has its own weight value.
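A sketch of the extra size filter, assuming “within n words” means the absolute difference in total term counts (the helper name is mine):

```python
def ws_candidates(query_idx, doc_sizes, n_words):
    """Pre-filter for step 2: keep only documents whose total word count
    is within n_words of the query document's word count."""
    qs = doc_sizes[query_idx]
    return [i for i, size in enumerate(doc_sizes)
            if i != query_idx and abs(size - qs) <= n_words]
```

The cosine ranking and weighted vote then run only over this candidate list, so a small minority-class document is never outvoted by much larger majority-class documents.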
“Weight - Document Size” (WS) Classifier
(rows = Actual, columns = Predicted)

          Phase 1   Phase 2   Weights
Phase 1      2379       865         1
Phase 2       151       605         4
Figure 4-a: Weight/Size (size=5) classification confusion matrix for DS1

          Phase 1   Phase 2   Weights
Phase 1      2494       461         1
Phase 2       364       681         2
Figure 4-b: Weight/Size (size=5) classification confusion matrix for DS2

          Phase 1   Phase 2   Weights
Phase 1       145        21         2
Phase 2       118       171         1
Figure 4-c: Weight/Size (size=3) classification confusion matrix for DS3
“Weight - Document Size” (WS) Classifier
Best classifier so far: good representation from both classes, and the best representation so far from the minority class.
Can we improve this further?
        Class 1   Class 2   Overall
DS1     73.34%    80.03%    74.60%
DS2     84.40%    65.17%    79.38%
DS3     78.31%    72.66%    74.73%
Figure 4-d: Precision calculations for DS1, DS2, and DS3
Term Classifier
Rather than classifying based on similar documents, classify based on similar terms.
Steps:
1. Analyze the terms in each document, and the class of those documents
2. Choose a document and make it the query document
3. Retrieve the x closest documents (note: we are not accounting for document size)
4. Class = max[weight1 * n1, weight2 * n2]
Again, each class has its own weight.
Term Classifier
(Diagram: the “Class 1 words” and “Class 2 words” vocabularies overlap; words outside the overlap are exclusive to one class.)
In one of our document sets, the list of exclusive words was less than 3% of the total words.
Term Classifier
Take the exclusive words list. If a document clusters near a “Phase1”-exclusive word, classify it as “Phase1”, and vice versa.
We can use this information to produce an alternate classification.
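One way to build the exclusive-words lists described above, with each document represented as a set of terms (the function name is mine):

```python
def exclusive_words(class1_docs, class2_docs):
    """Return the words that appear only in Class 1 documents,
    and the words that appear only in Class 2 documents."""
    vocab1 = set().union(*class1_docs)   # every word seen in any Class 1 doc
    vocab2 = set().union(*class2_docs)   # every word seen in any Class 2 doc
    return vocab1 - vocab2, vocab2 - vocab1
```

Words in the overlap of the two vocabularies carry no class signal and are dropped; only the set differences are kept for classification.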
Term Classifier
(rows = Actual, columns = Predicted)

          Phase 1   Phase 2   Weights
Phase 1      2432       812         1
Phase 2       343       413         8
Figure 5-a: Term classification confusion matrix for DS1

          Phase 1   Phase 2   Weights
Phase 1      2256       699         1
Phase 2       411       643         4
Figure 5-b: Term classification confusion matrix for DS2

          Phase 1   Phase 2   Weights
Phase 1       148        18         3
Phase 2        45       244         1
Figure 5-c: Term classification confusion matrix for DS3
Term Classifier
Comparable to the WS Classifier. Better for DS3, probably because it is a much smaller set; not good for Class 2 in the other sets.
The real value lies in reclassification.
        Class 1   Class 2   Overall
DS1     74.97%    54.63%    71.13%
DS2     76.35%    61.01%    72.48%
DS3     89.16%    84.43%    86.15%
Figure 5-d: Precision calculations for DS1, DS2, and DS3
Document Reclassification
The Term Classifier correctly identifies some documents missed by the WS Classifier.
Confidence value: if a classification from the WS classifier does not have a high confidence value, then check what the Term classifier says.
c = v_i / ∑_{j=0}^{n} v_j

where v_i is the (weighted) vote count for the predicted class and the denominator sums the vote counts over all classes.
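Reading v_i as the winning class's vote and the denominator as the total vote, this confidence reproduces the values in Figure 6 (5/9 ≈ 0.556 and 16/24 ≈ 0.667). A sketch, with the function name mine:

```python
def confidence(votes):
    """c = v_i / sum_j v_j: the winning class's share of all votes cast."""
    return max(votes) / sum(votes)
```

A classification decided 5-to-4 (c ≈ 0.556) is much less trustworthy than one decided 16-to-8 (c ≈ 0.667), which is exactly the case Figure 6 flags for reclassification.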
Document Reclassification
Technique is good for checking small numbers of documents.
Technique is not good for completely reclassifying an entire set.
Classification Scheme   Doc #   Weighted Class 1   Weighted Class 2   Confidence   Predicted   Actual
WS                      3997           5                  4              0.556         0          1
Term                    3997           8                 16              0.667         1          1
Figure 6: An example of reclassification of a document from DS1
Related Work
Java GUI Front End
Developed in Spring 2007. Helps by providing a stable interface through which to run GTP and classify documents.
Can “save state”: saves the LSI model and all internal data structures for later use.
All tables used in this document were generated by this program.
Windows GTP
Direct port of GTP from UNIX to Windows. Developed on Windows XP, SP2. Completely self-contained; doesn't require external programs or shared libraries.
Sorting parsed words:
- Original GTP uses the UNIX sort command…
- Windows GTP uses an external merge sort…
Acknowledgements
These people and groups assisted by providing their knowledge and experience to the project: Dr. Michael Berry, Murray Browne, Mary Ann Merrell, Nathan Fisher, and the InRAD staff.
References
“Using Linear Algebra for Intelligent Information Retrieval.” Michael W. Berry, Susan T. Dumais, and Gavin W. O’Brien, December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595.
Understanding Search Engines: Mathematical Modeling and Text Retrieval. M. Berry and M. Browne, SIAM Book Series: Software, Environments, and Tools. (2005), ISBN 0-89871-581-4.
“GTP (General Text Parser) Software for Text Mining.” J. T. Giles, L. Wo, and M. W. Berry, Statistical Data Mining and Knowledge Discovery. H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.