7/15/20151 a binary-categorization approach for classifying multiple-record web documents using a...

04/19/23 1

A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Department of Computer Science

Brigham Young University

Quan Wang

November 2001

Thesis Content

Introduction

Preliminaries

Probabilistic Retrieval Based on Logistic Regression Analysis

Experimental Results

Concluding Remarks

Introduction

The World Wide Web contains a tremendous amount of information.

Information retrieval is used to accurately and efficiently classify information.

We use a probabilistic retrieval model based on logistic regression analysis.

Preliminaries

Application ontology

Logistic regression

Probabilistic retrieval based on logistic regression

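Application Ontology

[Figure: car-ads application ontology. The object set Car is connected to Year, Make, Model, Mileage, Price, Feature, and PhoneNr. Participation constraints shown in the diagram: 1:* (seven occurrences), 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1.]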

Document Representation

A set of <index term : term frequency> pairs $A_1 : x_1, \ldots, A_n : x_n$

A density heuristic value $y$

A grouping heuristic value $z$

Document $d = (x_1, \ldots, x_n, y, z) = (V, y, z)$, with $V = (x_1, \ldots, x_n)$
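As a rough illustration of this representation, the sketch below counts matches per index term and returns the triple (V, y, z). The regular expressions in `RECOGNIZERS` and the function name `represent` are hypothetical, introduced only for this example. The slides do not say how the density and grouping heuristics are computed, so the sketch takes them as inputs.

```python
import re

# Hypothetical recognizers: one regular expression per index term.
# These patterns are illustrative only and are not taken from the thesis.
RECOGNIZERS = {
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "Price": re.compile(r"\$\d[\d,]*"),
    "Make": re.compile(r"\b(?:Honda|Ford|Toyota)\b"),
}


def represent(text, density, grouping):
    """Return (V, y, z): term frequencies plus the two heuristic values.

    density (y) and grouping (z) are supplied by the caller because the
    slides do not describe how they are computed.
    """
    # V lists the term frequencies in a fixed index-term order.
    v = tuple(len(pattern.findall(text)) for pattern in RECOGNIZERS.values())
    return v, density, grouping


ad = "1998 Honda Civic $4,500. 2001 Ford Focus $7,200."
print(represent(ad, density=0.6, grouping=0.9))
```

Keeping the index terms in a fixed order lets V be compared position by position with an ontology vector built over the same terms.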

Independence Assumption

$P(R \mid x_1, \ldots, x_n, y, z)$

Under the independence assumption, this is estimated from the individual probabilities

$P(R \mid x_1), \ldots, P(R \mid x_n), P(R \mid y), P(R \mid z)$

Logistic Regression

[Figure: plot of $P(R \mid x_i)$ against $x_i$.]

$P(R \mid x) = \dfrac{1}{1 + e^{-(C_0 + C_1 x)}}$, $\quad \ln O(R \mid x) = C_0 + C_1 x$.
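The two forms of the model are equivalent: taking the log-odds of the logistic probability gives back the linear expression in x. A minimal sketch follows; the coefficient values are made up for illustration and are not fitted values from the thesis.

```python
import math


def relevance_probability(x, c0, c1):
    """P(R | x) = 1 / (1 + exp(-(C0 + C1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(p):
    """ln O = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Illustrative coefficients only; not values from the thesis.
C0, C1 = -2.0, 1.5

for x in (0.0, 1.0, 2.0):
    p = relevance_probability(x, C0, C1)
    # The log-odds recovered from p equals C0 + C1 * x up to rounding.
    print(x, round(p, 3), round(log_odds(p), 3), C0 + C1 * x)
```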

Probabilistic Retrieval Based on Logistic Regression Analysis

Data processing

Data analysis

Probabilistic retrieval on car-ads application ontology

Correlation relations

Data Processing

The corresponding normalized vector $V' = (x_1', \ldots, x_n')$ is computed from $V$ and the length ratio $|V| / |u|$, where $V$ is a document vector and $u$ is an ontology vector.

Data Distributions

[Figure: data distribution plots.]

Logistic Regression-1

Logistic Regression-2

Regression coefficients P-value

Statistical Information: P-Value

A p-value is a significance indicator.

A large p-value indicates either a bad regression model or a statistically insignificant index term.

We should keep only significant index terms.

Select Important Index Terms

          Year   Make   Model   Mileage   Price
P-value   .679   .002   .074    .002      .001

          Features   PhoneNr   Density   Grouping
P-value   .001       .034      .052      .012

The car-ads application ontology

Double S-curve
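Selecting the significant index terms amounts to filtering the table by p-value. The sketch below uses the p-values listed on this slide. The cutoff of 0.05 is an assumption, since the slides do not state which threshold was applied, and `significant_terms` is a name introduced only for this example.

```python
# P-values from the table on this slide.
P_VALUES = {
    "Year": 0.679, "Make": 0.002, "Model": 0.074, "Mileage": 0.002,
    "Price": 0.001, "Features": 0.001, "PhoneNr": 0.034,
    "Density": 0.052, "Grouping": 0.012,
}

# Assumed cutoff: the slides do not state the threshold that was used.
ALPHA = 0.05


def significant_terms(p_values, alpha=ALPHA):
    """Keep the terms whose p-value falls below the cutoff."""
    return [term for term, p in p_values.items() if p < alpha]


kept = significant_terms(P_VALUES)
dropped = [term for term in P_VALUES if term not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

Under this assumed cutoff, Year, Model, and Density would be dropped. A different threshold would change the selection, so the output should not be read as the thesis's final term set.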

Probabilistic Retrieval Model

$\ln O(R \mid x_i)$, $\ln O(R \mid y)$, $\ln O(R \mid z)$

$> 0$: relevant

$< 0$: irrelevant
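The sign test on the log-odds can be sketched as follows. The slide does not show how the individual log-odds are combined into one score, so summing them is an assumption of this sketch. The coefficient pairs and feature values are hypothetical, and a total of exactly zero is treated as irrelevant here only because the slide leaves that case open.

```python
def log_odds(value, c0, c1):
    """ln O(R | value) = C0 + C1 * value for one fitted logistic model."""
    return c0 + c1 * value


def classify(features, coefficients):
    """Label a document from the sign of its combined log-odds.

    Summing the per-feature log-odds is an assumption of this sketch.
    """
    total = sum(
        log_odds(features[name], *coefficients[name]) for name in features
    )
    return ("relevant" if total > 0 else "irrelevant"), total


# Hypothetical (C0, C1) pairs; not fitted values from the thesis.
COEFFS = {
    "Make": (-1.0, 2.0),
    "Price": (-1.5, 2.5),
    "y": (-0.5, 1.0),
    "z": (-0.5, 1.0),
}

doc = {"Make": 0.9, "Price": 0.8, "y": 0.7, "z": 0.6}
print(classify(doc, COEFFS))
```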

Correlation Relations

Correlation: There are strong positive correlations among document properties (e.g., Death Date and Birth Date in obituaries).

Correlations are extra information implicitly contained in a document.

Correlation relations handle “patterns”, e.g., the Birth Date-Death Date pair appearing in the obituaries application ontology.

Special Web Documents

Multiple-record Web documents

Similar content and format (e.g., items for sale)

Same lexical object values (e.g., Honda makes cars and motorcycles)

Eight documents (motorcycle, boat, snowmobile, bicycle) for the car-ads application, and five documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for the obituary application.

Experimental Results

            Car-ads   Obituary
Recall      100%      100%
Precision   83.3%*    83.3%
Accuracy    92.9%     92.0%

*Ten out of eighteen negative documents are specially selected.
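For reference, the three measures can be computed from a confusion matrix as sketched below. The counts are not reported in the slides. They are one assumed set of counts that is consistent with the car-ads figures and with eighteen negative documents (ten relevant documents all retrieved, two negatives misclassified), and other counts could give the same percentages.

```python
def metrics(tp, fp, fn, tn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy


# Assumed counts, chosen to match the reported car-ads percentages:
# 10 positives (all found) and 18 negatives (2 wrongly accepted).
r, p, a = metrics(tp=10, fp=2, fn=0, tn=16)
print(f"recall={r:.1%} precision={p:.1%} accuracy={a:.1%}")
```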

Conclusions

We propose a probabilistic model which is suitable for classifying multiple-record Web documents.

The model's performance on a randomly chosen test document set could be better than the results we present in the thesis.
