04/19/23 1
A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model
Department of Computer Science
Brigham Young University
Quan Wang
November 2001
Thesis Content
Introduction
Preliminaries
Probabilistic Retrieval Based on Logistic Regression Analysis
Experimental Results
Concluding Remarks
Introduction
The World Wide Web contains a tremendous amount of information.
Information retrieval is used to accurately and efficiently classify information.
We use a probabilistic retrieval model based on logistic regression analysis.
Preliminaries
Application ontology
Logistic regression
Probabilistic retrieval based on logistic regression
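Application Ontology
[Figure: car-ads application ontology. The object set Car is connected to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Each relationship carries a 1:* participation constraint on one side; the constraints on the other side are 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, and 0:0.45:1.]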
Document Representation
A set of <index term : term frequency> pairs A1:x1, ..., An:xn.
A density heuristic value y; a grouping heuristic value z.
Document d: (x1, ..., xn, y, z) = (V, y, z), where V = (x1, ..., xn).
Independence Assumption
P(R|x1, ..., xn, y, z)
Under the independence assumption, this is estimated from the individual probabilities
P(R|x1), ..., P(R|xn), P(R|y), P(R|z)
Logistic Regression
[Figure: data points and the fitted S-shaped curve of P(R|x) plotted against x; a value xi on the x axis maps to P(R|xi) on the P axis.]
P(R|x) = 1 / (1 + exp(-(C0 + C1 x))), i.e., ln(O(R|x)) = C0 + C1 x.
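The relationship between the logistic curve and the log odds on this slide can be checked numerically. The following is a minimal sketch, not the thesis implementation: the function names and the coefficient values C0 = -2.0 and C1 = 4.0 are hypothetical and chosen only for illustration.

```python
import math


def relevance_probability(x, c0, c1):
    """P(R|x) = 1 / (1 + exp(-(C0 + C1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(p):
    """ln(O) for a probability p, where O = p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, NOT taken from the thesis.
C0, C1 = -2.0, 4.0

x = 0.75
p = relevance_probability(x, C0, C1)

# The log odds of the fitted probability recover the linear form C0 + C1 * x.
print(p, log_odds(p), C0 + C1 * x)
```

With these assumed coefficients the log odds are zero at x = 0.5, which is where the curve crosses P(R|x) = 0.5.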
Probabilistic Retrieval Based on Logistic Regression Analysis
Data processing
Data analysis
Probabilistic retrieval on car-ads application ontology
Correlation relations
Data Processing
The corresponding normalized vector
V’ = (X1’, …….. Xn’) is computed as
V’ = |V| / |u|
V
where V is a document vector, u is an ontology vector.
Data Distributions
[Figure: plots of the data distributions.]
Logistic Regression-1
Logistic Regression-2
Regression coefficients P-value
Statistical Information: P-Value
A p-value is a significance indicator.
A large p-value indicates either a bad regression model or a statistically insignificant index term.
We should keep only significant index terms.
Select Important Index Terms
Index term   Features   PhoneNr   Density   Grouping
P-value      .001       .034      .052      .012
Index term   Year   Make   Model   Mileage   Price
P-value      .679   .002   .074    .002      .001
The car-ads application ontology
Double S-curve
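Selecting index terms by p-value can be sketched as a simple filter. The p-values below are the ones listed on this slide; the cutoff of 0.05 and the function name are assumptions made for illustration, since the slides do not state which threshold the thesis uses.

```python
# P-values as listed on the slide for the car-ads application ontology.
P_VALUES = {
    "Year": 0.679,
    "Make": 0.002,
    "Model": 0.074,
    "Mileage": 0.002,
    "Price": 0.001,
    "Features": 0.001,
    "PhoneNr": 0.034,
    "Density": 0.052,
    "Grouping": 0.012,
}


def significant_terms(p_values, alpha=0.05):
    """Keep the terms whose p-value is below alpha.

    alpha=0.05 is an assumed, conventional cutoff, not a value from the thesis.
    """
    return [term for term, p in p_values.items() if p < alpha]


print(significant_terms(P_VALUES))
```

Under this assumed cutoff, Year, Model, and Density would be dropped; a different threshold would give a different selection.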
Probabilistic Retrieval Model
The decision is based on the log odds ln(O(R|xi)), ln(O(R|y)), ln(O(R|z)):
> 0: relevant
< 0: irrelevant
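The sign test on this slide can be sketched as follows. The slide does not say how the individual log odds are combined, so this sketch assumes a plain sum, which is one common choice under an independence assumption; the function name, the sample values, and the handling of a result of exactly zero are likewise assumptions.

```python
def classify(log_odds_values):
    """Label a document from its per-feature log odds.

    Assumption: the log odds are combined by summation. A combined value
    of exactly 0 is treated as irrelevant here; the slide leaves that
    case unspecified.
    """
    combined = sum(log_odds_values)
    return "relevant" if combined > 0 else "irrelevant"


# Hypothetical log odds for the index terms, density, and grouping values.
print(classify([1.2, -0.4, 0.3]))   # combined value is positive
print(classify([-0.9, 0.2, -0.1]))  # combined value is negative
```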
Correlation Relations
Correlation: there are strong positive correlations among document properties (e.g., Death Date and Birth Date in obituaries).
Correlations are extra information implicitly contained in a document.
Correlation relations handle "patterns", e.g., a Birth Date-Death Date pair appearing in the obituaries application ontology.
Special Web Documents
Multiple-record Web documents
Similar content and format (e.g., items for sale)
Same lexical object values (e.g., Honda makes cars and motorcycles)
8 documents (motorcycle, boat, snowmobile, bicycle) for the car-ads application, and 5 documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for the obituary application.
Experimental Results
            Car-ads   Obituary
Recall      100%      100%
Precision   83.3%*    83.3%
Accuracy    92.9%     92.0%
*Ten of the eighteen negative documents were specially selected.
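Recall, precision, and accuracy follow from the counts of true and false positives and negatives. The sketch below shows the arithmetic. The counts are hypothetical: the slides report only percentages, and these counts were picked because they reproduce the reported figures, not because they appear in the thesis.

```python
def evaluate(tp, fp, tn, fn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, precision, accuracy


# Hypothetical counts, chosen only to be consistent with the reported
# percentages; they are not given in the slides.
car_ads = evaluate(tp=10, fp=2, tn=16, fn=0)
obituary = evaluate(tp=10, fp=2, tn=13, fn=0)

for name, (recall, precision, accuracy) in [("car-ads", car_ads), ("obituary", obituary)]:
    print(f"{name}: recall={recall:.1%} precision={precision:.1%} accuracy={accuracy:.1%}")
```

Because recall is 100% in both columns, every error in this reading is a false positive, which is why precision is the metric that falls below 100%.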
Conclusions
We propose a probabilistic model that is suitable for classifying multiple-record Web documents.
The model's performance on a randomly chosen test document set could be better than the results presented in the thesis.