Link-based document classification using Bayesian networks

Upload: alfonso-e-romero

Post on 05-Dec-2014

Page 1: Link-based document classification using Bayesian Networks

Introduction Our solution The Bayesian network model Results Conclusions and future works

Link-based text classification using Bayesian networks

Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Andrés R. Masegosa, Alfonso E. Romero
{lci,jmfluna,jhg,andrew,aeromero}@decsai.ugr.es

Departamento de Ciencias de la Computación e Inteligencia Artificial
E.T.S.I. Informática y de Telecomunicación, CITIC-UGR, Universidad de Granada
18071 – Granada, Spain

INEX 2009 Workshop, Brisbane

Page 2: Link-based document classification using Bayesian Networks

Our participation

Universidad de Granada at INEX 2009

This is the third year we participate in XML mining (classification).

As on previous occasions, we are interested in Bayesian networks.

We have provided a new solution to this problem.

Sorry, no Ad Hoc this year :(

Page 4: Link-based document classification using Bayesian Networks

Our participation

The problem itself

A text (XML) categorization problem. Training/test corpus.
→ Same as previous years.

Multilabel (more than 1 category per doc).
→ New this year!

Links among files (training, test) given in a matrix.
→ Same as 2008.

Vectors of indexed terms (normalized tf-idf) provided.
→ The eternal question: what about XML?

Page 5: Link-based document classification using Bayesian Networks

Our solution (2008)

Encyclopedia regularity (a document of category Ci tends to link to documents of the same category). Verified graphically on the training set.

In 2008 we combined a flat-text classifier (Naïve Bayes) with a Bayesian network of fixed structure which modelled the interactions among categories, using learnt probabilities P(ci | cj).

Results were modest :( (the worst model among 3, and improvements over our baseline were not significant).

Page 6: Link-based document classification using Bayesian Networks

Our starting point (2009)

We detected the same regularity among categories (no matrix plot this year).

Possible (hidden) hierarchy (for example Portal:Religion, Portal:Christianity and Portal:Catholicism).

This year we learn the interactions among categories from data: no fixed structure, but any structure defined over the set of categories.

Page 7: Link-based document classification using Bayesian Networks

Modeling link structure

Modeling link structure I

We assume there is a global probability distribution over all these variables, and we model it with a Bayesian network.

Variables: categories Ci (39), categories of incoming links Ej (39) and terms Tk (many).

Main assumption: the probability distribution of a document and that of the categories of the files that link to it are independent given the category. Or symbolically:

p(dj, ej | ci) = p(dj | ci) p(ej | ci).

Page 10: Link-based document classification using Bayesian Networks

Modeling link structure

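We then search for the conditional probability p(ci | dj, ej):

$$
\begin{aligned}
p(c_i \mid d_j, e_j) &= \frac{p(d_j, e_j \mid c_i)\, p(c_i)}{p(d_j, e_j)} = \frac{p(d_j \mid c_i)\, p(e_j \mid c_i)\, p(c_i)}{p(d_j, e_j)} \\
&= \frac{p(c_i \mid d_j)\, p(d_j)\, p(e_j \mid c_i)\, p(c_i)}{p(c_i)\, p(d_j, e_j)} \\
&= \frac{p(c_i \mid d_j)\, p(d_j)\, p(c_i \mid e_j)\, p(e_j)}{p(c_i)\, p(d_j, e_j)} \\
&= \left( \frac{p(d_j)\, p(e_j)}{p(d_j, e_j)} \right) \left( \frac{p(c_i \mid d_j)\, p(c_i \mid e_j)}{p(c_i)} \right).
\end{aligned}
$$

$$
p(c_i \mid d_j, e_j) \propto \frac{p(c_i \mid d_j)\, p(c_i \mid e_j)}{p(c_i)}
$$

$$
p(c_i \mid d_j, e_j) = \frac{p(c_i \mid d_j)\, p(c_i \mid e_j) / p(c_i)}{p(c_i \mid d_j)\, p(c_i \mid e_j) / p(c_i) + p(\bar{c}_i \mid d_j)\, p(\bar{c}_i \mid e_j) / p(\bar{c}_i)}
$$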
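
The normalized expression for p(ci | dj, ej) can be tried out numerically. The following is a minimal sketch for a single binary category; the function name `combine`, its argument names and the fallback for a zero denominator are assumptions made for this illustration, not part of the original slides.

```python
def combine(p_content, p_links, prior):
    """Combine p(ci|dj) and p(ci|ej) with the prior p(ci) for one binary category.

    p_content: p(ci | dj), output of the base (content) classifier
    p_links:   p(ci | ej), output of the link model
    prior:     p(ci), must lie strictly between 0 and 1
    """
    for name, value in (("p_content", p_content), ("p_links", p_links)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability, got {value}")
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")

    # Unnormalized score for "belongs to Ci" and for its complement.
    pos = p_content * p_links / prior
    neg = (1.0 - p_content) * (1.0 - p_links) / (1.0 - prior)

    total = pos + neg
    if total == 0.0:
        # Assumption: if the two sources contradict each other completely,
        # fall back to the prior instead of dividing by zero.
        return prior
    return pos / total


if __name__ == "__main__":
    # When the link evidence is uninformative (p_links == prior),
    # the result equals the content classifier's output.
    print(combine(0.8, 0.1, 0.1))
    # Link evidence above the prior pushes the content estimate upwards.
    print(combine(0.6, 0.3, 0.1))
```

One property worth noting: if p(ci | ej) equals the prior p(ci), the link evidence cancels out and the combined value is exactly p(ci | dj).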

Page 11: Link-based document classification using Bayesian Networks

Modeling link structure

Modeling link structure III

p(ci | dj): output of a probabilistic classifier. Any probabilistic classifier.

p(ci | ej): probability of belonging to Ci given the set of categories of the incoming (known) links. This is modeled by the Bayesian network.

The problem reduces to the following: [see next slide]

Page 12: Link-based document classification using Bayesian Networks

Modeling link structure

Modeling link structure IV

We have a vector of 39+39 binary variables for each document: 39 for the categories (1 if the doc. is of that category, 0 if not), and 39 more for the incoming links (1 if the document is linked by documents of that category, 0 if not).

With a learning algorithm, we learn a Bayesian network from that data.

For each document to classify and for each category Ci, we compute its content probability p(ci | dj) (with the base classifier) and the probability of being of Ci given the categories of certain neighbours, p(ci | ej) (with the learnt Bayesian network).

We combine them using the blue equation on the slides, i.e. the combination of p(ci | dj), p(ci | ej) and p(ci).
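
The 39+39 binary vectors described on this slide could be assembled roughly as follows. This is only a sketch under stated assumptions: the slides say the links are given as a matrix, while this example uses a list of (source, target) pairs for brevity, and the names `build_rows`, `labels` and `links` are hypothetical. It builds rows only for documents whose categories are known (the training data for the Bayesian network).

```python
def build_rows(labels, links, n_categories):
    """Build one binary row of length 2 * n_categories per labelled document.

    labels: dict mapping document id -> set of category indices (0-based)
    links:  iterable of (source, target) pairs, meaning "source links to target"
    Positions 0 .. n_categories-1 hold the document's own categories (the Ci).
    Positions n_categories .. 2*n_categories-1 hold the categories of the
    labelled documents that link to it (the Ej).
    """
    rows = {doc: [0] * (2 * n_categories) for doc in labels}

    # First half: the document's own categories.
    for doc, categories in labels.items():
        for c in categories:
            rows[doc][c] = 1

    # Second half: categories of incoming links from labelled documents.
    for source, target in links:
        if source in labels and target in rows:
            for c in labels[source]:
                rows[target][n_categories + c] = 1
    return rows


if __name__ == "__main__":
    # Toy data with 3 categories instead of 39.
    toy_labels = {"a": {0}, "b": {1, 2}, "c": {2}}
    toy_links = [("a", "b"), ("b", "c")]
    for doc, row in sorted(build_rows(toy_labels, toy_links, 3).items()):
        print(doc, row)
```

For a document to classify, only the second half of the vector is observed; the first half is what the model has to predict.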

Page 13: Link-based document classification using Bayesian Networks

Learning link structure

Learning Bayesian Network, using WEKA package.

Hillclimbing algorithm (easy and fast).

BDeu metric.

Three parents max. per node.

Propagation, using Elvira (WEKA does not have propagation algorithms).

Compute p(ci) (once), and p(ci | ej) (for each document j).

Exact propagation was slow!

Importance Sampling algorithm (approximate).
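
The slide does not say which importance sampling variant was run in Elvira. Purely as an illustration of approximate propagation, the sketch below uses likelihood weighting, one simple member of the importance sampling family, on a made-up two-node network. The network, its probabilities and the function name `likelihood_weighting` are assumptions for this example and are not taken from the talk.

```python
import random


def likelihood_weighting(network, evidence, query, n_samples=20000, seed=0):
    """Estimate P(query = 1 | evidence) in a binary Bayesian network.

    network:  list of (name, parents, cpt) in topological order, where cpt maps
              a tuple of parent values to P(node = 1 | parents)
    evidence: dict mapping node name -> observed value (0 or 1)
    """
    rng = random.Random(seed)
    weighted_true = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        sample = {}
        weight = 1.0
        for name, parents, cpt in network:
            p_one = cpt[tuple(sample[p] for p in parents)]
            if name in evidence:
                # Evidence nodes are clamped; the sample is weighted by
                # the likelihood of the observed value.
                value = evidence[name]
                weight *= p_one if value == 1 else 1.0 - p_one
            else:
                value = 1 if rng.random() < p_one else 0
            sample[name] = value
        total_weight += weight
        if sample[query] == 1:
            weighted_true += weight
    if total_weight == 0.0:
        raise ValueError("evidence has zero probability under every sample")
    return weighted_true / total_weight


# Hypothetical toy network: category C and one incoming-link variable E.
TOY_NETWORK = [
    ("C", (), {(): 0.2}),               # p(c) = 0.2 (made-up value)
    ("E", ("C",), {(1,): 0.7, (0,): 0.1}),  # p(e | c), p(e | not c) (made-up)
]

if __name__ == "__main__":
    print("p(c)     ~", likelihood_weighting(TOY_NETWORK, {}, "C"))
    print("p(c | e) ~", likelihood_weighting(TOY_NETWORK, {"E": 1}, "C"))
```

On this toy network the exact answer is available by Bayes' rule, p(c | e) = 0.14 / 0.22, which gives a way to check that the estimate converges.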

Page 17: Link-based document classification using Bayesian Networks

Base classifiers

We have used Multinomial Naïve Bayes (binary) and the Bayesian OR Gate (a model presented by our group at INEX 2007).

They are described extensively in the paper (read it if you want to learn more about these two classifiers).

Any other probabilistic classifier can be used to first obtain p(ci | dj) (any suggestions or preferences?).

Page 19: Link-based document classification using Bayesian Networks

Results

               MACC     µACC     MROC     µROC     MPRF     µPRF     MAP
N. Bayes       0.95142  0.93284  0.80260  0.81992  0.49613  0.52670  0.64097
N. Bayes + BN  0.95235  0.93386  0.80209  0.81974  0.50015  0.53029  0.64235
OR gate        0.75420  0.67806  0.92526  0.92163  0.25310  0.26268  0.72955
OR gate + BN   0.84768  0.81891  0.92810  0.92739  0.31611  0.36036  0.72508

Initial results

Problem in the OR gate! (Evaluation assumes dj ∈ Ci ⇔ p(ci | dj) > 0.5.) This is not, in general, true for the OR gate; it needs some scaling procedure (like the SCut strategy).

               MACC     µACC     MROC     µROC     MPRF     µPRF     MAP
OR gate        0.92932  0.92612  0.92526  0.92163  0.45966  0.50407  0.72955
OR gate + BN   0.96607  0.95588  0.92810  0.92739  0.51729  0.55116  0.72508

Scaled results (see paper for details).
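
The slides do not describe the scaling procedure itself (the paper has the details). As a hedged illustration of the general idea behind an SCut-style strategy, the sketch below picks, for one category, the score threshold that maximizes F1 on held-out data, and then rescales scores so that this threshold lands on 0.5, matching the evaluation rule p(ci | dj) > 0.5. The function names, the use of F1 as the tuning criterion and the piecewise-linear rescaling are all assumptions for this example.

```python
def f1(predicted, actual):
    """F1 measure for two parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    return 2.0 * tp / (2.0 * tp + fp + fn)


def scut_threshold(scores, actual):
    """Return the threshold t that maximizes F1 when predicting score > t.

    Candidates are 0.0 and every observed score; ties keep the lowest one.
    """
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores) | {0.0}):
        value = f1([s > t for s in scores], actual)
        if value > best_f1:
            best_t, best_f1 = t, value
    return best_t


def rescale(score, threshold):
    """Monotone piecewise-linear map with rescale(threshold) == 0.5.

    After rescaling, "score > threshold" is equivalent to "rescaled > 0.5".
    """
    if score > threshold:
        return 0.5 + 0.5 * (score - threshold) / (1.0 - threshold)
    if threshold == 0.0:
        return 0.0
    return 0.5 * score / threshold


if __name__ == "__main__":
    # Made-up held-out scores for one category: every value is below 0.5,
    # so the fixed 0.5 cut-off would reject all the true members.
    held_out_scores = [0.02, 0.05, 0.10, 0.20, 0.30]
    held_out_truth = [False, False, True, True, True]
    t = scut_threshold(held_out_scores, held_out_truth)
    print("threshold:", t)
    print("rescaled:", [round(rescale(s, t), 3) for s in held_out_scores])
```

Because the rescaling is monotone, it changes only the measures that depend on the 0.5 cut-off; ranking-based measures such as ROC and MAP are unaffected, which agrees with the identical MROC, µROC and MAP columns in the two tables.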

Page 20: Link-based document classification using Bayesian Networks

Conclusions

The model is new, parametrizable (learning algorithm, parameters of the algorithm, base classifier, ...) and valuable by itself (improves on the baseline in most measures).

Using the Bayesian network on top of the OR gate yields an improvement of about 10% in some measures :)

Good results on ROC (ranked third).

Other base classifiers? SVM with probabilistic outputs, logistic regression...

More experiments for the final version of the paper!
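
As a quick check of the "about 10% in some measures" statement, the relative change of each measure can be computed from the scaled OR gate results reported in the Results slide. The numbers below are copied from that table; the helper name `relative_gain` is an assumption for this sketch.

```python
MEASURES = ["MACC", "µACC", "MROC", "µROC", "MPRF", "µPRF", "MAP"]

# Scaled results, copied from the Results slide.
OR_GATE = [0.92932, 0.92612, 0.92526, 0.92163, 0.45966, 0.50407, 0.72955]
OR_GATE_BN = [0.96607, 0.95588, 0.92810, 0.92739, 0.51729, 0.55116, 0.72508]


def relative_gain(baseline, improved):
    """Relative change in percent of `improved` with respect to `baseline`."""
    return 100.0 * (improved - baseline) / baseline


gains = {
    name: relative_gain(base, bn)
    for name, base, bn in zip(MEASURES, OR_GATE, OR_GATE_BN)
}

if __name__ == "__main__":
    for name in MEASURES:
        print(f"{name}: {gains[name]:+.2f}%")
```

On these figures the gains near 10% are in the two PRF columns, the accuracy and ROC columns improve by smaller margins, and MAP decreases slightly.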

Page 25: Link-based document classification using Bayesian Networks

Thank you for your attention! Questions, comments, criticism?

<SPAM>Expecting to defend my PhD by April 2010; looking for a postdoc (in Europe) for 2010 on ML/IR-related stuff. Any offers? :)</SPAM>