text classification with lucene/solr, apache hadoop and libsvm

Text classification with Lucene/Solr and LibSVM

By Majirus FANSI, PhD
Agile Software Developer
@majirus

Motivation: Guiding user search

● Search engines are basically keyword-oriented
– What about the meaning?
    ● Synonym search requires listing the synonyms
    ● The More-Like-This component is about more like THIS
● Category search for better user experience
– Deals with the cases where the user's keywords are not in the collection
– The user searches for « emarketing », you return documents on « webmarketing »

Outline

● Text Categorization

● Introducing Machine Learning

● Why SVM?

● How can Solr help?

Putting it all Together is our aim

Text classification or Categorization

● Aims
– Classify documents into a fixed number of predefined categories
    ● Each document can be in multiple, exactly one, or no category at all.
● Applications
– Classifying emails (Spam / Not Spam)
– Guiding user search
● Challenges
– Building text classifiers by hand is difficult and time consuming
– It is advantageous to learn classifiers from examples

Machine Learning

● Definition (by Tom Mitchell - 1998)
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E”
    ● Experience E: watching the label of a document
    ● Task T: classify a document
    ● Performance P: probability that a document is correctly classified.

Machine Learning Algorithms

● Unsupervised learning
– Let the program learn by itself
    ● Market segmentation, social network analysis...
● Supervised learning
– Teach the computer program how to do something
– We give the algorithm the “right answers” for some examples

Supervised learning problems

– Regression
    ● Predict continuous valued output
    ● Ex: price of the houses in Corbeil Essonnes
– Classification
    ● Predict a discrete valued output (+1, -1)

Supervised learning: Working

[Figure: the training set (X, Y) feeds the training algorithm, which produces the hypothesis h; h takes a feature vector x and returns the predicted value y = h(x)]

● Training set (X, Y): m training examples
● (X(i), Y(i)): ith training example
● X's: input variables or features
● Y's: output/target variable
● It's the job of the learning algorithm to produce the model h

Classifier/Decision Boundary

● Carves up the feature space into volumes
● Feature vectors in volumes assigned to the same class
● Decision regions separated by surfaces
● The decision boundary is linear if it is a straight line in the given dimensional space
– A line in 2D, a plane in 3D, a hyperplane in 4+D

Which Algorithm for text classifier

Properties of text

● High dimensional input space
– More than 10 000 features
● Very few irrelevant features
● Document vectors are sparse
– Few entries are non-zero
● Most text categorization problems are linearly separable

No need to map the input features to a higher dimension space
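
Because document vectors are sparse, only the non-zero entries need to be stored. The following is a minimal sketch, assuming a document vector is kept as a Python dict from feature index to value; the function name `sparse_dot` and the sample data are illustrative and do not come from the talk.

```python
def sparse_dot(weights, doc):
    """Dot product between a dense weight list and a sparse document vector.

    weights: list of floats, weights[j] is the weight of feature index j + 1
    doc: dict mapping 1-based feature index -> non-zero value
    """
    return sum(weights[index - 1] * value for index, value in doc.items())


# A space with 10,000 features, of which the document uses only three.
weights = [0.0] * 10000
weights[100] = 0.5    # weight of feature 101
weights[122] = -0.25  # weight of feature 123
doc = {101: 1.0, 123: 4.0, 234: 2.0}

score = sparse_dot(weights, doc)
```

The cost of the product depends on the number of non-zero entries in the document, not on the size of the feature space.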

Classification algorithm /choosing the method

● Thorsten Joachims compares SVM to Naive Bayes, Rocchio, K-nearest neighbor and the C4.5 decision tree
● SVM consistently achieves good performance on categorization tasks
– It outperforms the other methods
– Eliminates the need for feature selection
– More robust than the others

Thorsten Joachims, 1998. Text Categorization with SVM: Learning with many relevant features

SVM ? Yes But...

« The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora »

Banko & Brill, « Scaling to Very Very Large Corpora for Natural Language Disambiguation »

What is SVM - Support Vector Machine?

● « Support-Vector Networks », Cortes & Vapnik, 1995
● SVM implements the following idea
– Maps the input vectors into some high dimensional feature space Z
    ● Through some non-linear mapping chosen a priori
– In this feature space a linear decision surface is constructed
– Special properties of the decision surface ensure high generalization ability of the learning machine

SVM - Classification of an unknown pattern

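[Figure: classification of an unknown pattern. The input vector x goes through a non-linear transformation into feature space, is compared with the support vectors z_i (sv1, sv2, ..., svk), and the comparisons are combined with the weights w1, w2, ..., wN to produce the classification]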

SVM - decision boundary

● Optimal hyperplane
– Training data can be separated without errors
– It is the linear decision function with maximal margin between the vectors of the two classes
● Soft margin hyperplane
– Training data cannot be separated without errors

Optimal hyperplane

Optimal hyperplane - figure

[Figure: the optimal hyperplane and the optimal margin between the two classes, plotted in the (x1, x2) plane]

SVM - optimal hyperplane

● Given the training set X of $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$, with $y_i \in \{-1, 1\}$
● X is linearly separable if there exist a vector w and a scalar b s.t.

$$w \cdot x_i + b \ge 1 \quad \text{if } y_i = 1 \qquad (1)$$

$$w \cdot x_i + b \le -1 \quad \text{if } y_i = -1 \qquad (2)$$

$$(1), (2) \Rightarrow y_i (w \cdot x_i + b) \ge 1 \qquad (3)$$

● Vectors $x_i$ for which $y_i (w \cdot x_i + b) = 1$ are termed support vectors
– Used to construct the hyperplane
– If the training vectors are separated without errors by an optimal hyperplane:

$$E[\Pr(\text{error})] \le \frac{E[\text{number of support vectors}]}{\text{number of training vectors}} \qquad (4)$$

● The optimal hyperplane
– Unique one which separates the training data with a maximal margin

$$w_0 \cdot z + b_0 = 0 \qquad (5)$$
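
Constraint (3) and the support vector condition can be checked numerically on a toy example. This is a minimal sketch with hypothetical 2D data and a hyperplane chosen by hand rather than learned; the names `margin_values`, `points` and `labels` are illustrative.

```python
def margin_values(w, b, points, labels):
    """Return y_i * (w . x_i + b) for every training example."""
    return [y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            for x, y in zip(points, labels)]


# Toy 2D training set (hypothetical data).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Candidate hyperplane w . x + b = 0, chosen by hand for this example.
w, b = (0.5, 0.5), -1.0

values = margin_values(w, b, points, labels)

# Constraint (3): every example satisfies y_i (w . x_i + b) >= 1.
separable = all(v >= 1 for v in values)

# Support vectors reach equality in (3). The exact comparison is fine for
# these toy numbers; real code would compare with a tolerance.
support_vectors = [x for x, v in zip(points, values) if v == 1]
```

Here the two points closest to the hyperplane, one from each class, are the support vectors; the other two points could be removed without changing the separator.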

SVM - optimal hyperplane – decision function

● Let us consider the optimal hyperplane

$$w_0 \cdot z + b_0 = 0 \qquad (5)$$

● The weight $w_0$ can be written as some linear combination of SVs

$$w_0 = \sum_{\text{support vectors}} \alpha_i z_i \qquad (6)$$

● The linear decision function I(z) is of the form

$$I(z) = \operatorname{sign}\Big( \sum_{\text{support vectors}} \alpha_i \, z_i \cdot z + b_0 \Big) \qquad (7)$$

● $z_i \cdot z$ is the dot product between the support vector $z_i$ and the vector z
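
The decision function (7) only needs the support vectors, their coefficients and the constant b0. A minimal sketch follows, with hypothetical support vectors and coefficients; as in (6), each coefficient is assumed to already carry the sign of its class, and a score of exactly zero is arbitrarily assigned to the positive class.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide(support_vectors, alphas, b0, z):
    """I(z) = sign(sum_i alpha_i * (z_i . z) + b0), following equation (7)."""
    score = sum(alpha * dot(z_i, z)
                for alpha, z_i in zip(alphas, support_vectors)) + b0
    return 1 if score >= 0 else -1


# Hypothetical values: two support vectors and their signed coefficients.
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
alphas = [0.25, -0.25]
b0 = -1.0

positive = decide(support_vectors, alphas, b0, (3.0, 3.0))
negative = decide(support_vectors, alphas, b0, (0.0, 1.0))
```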

Soft margin hyperplane

Soft margin Classification

● Want to separate the training set with a minimal number of errors, i.e. minimize

$$\Phi(\xi) = \sum_{i=1}^{m} \xi_i^{\sigma}, \quad \xi_i \ge 0, \quad \text{for small } \sigma > 0 \qquad (5)$$

s.t.

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m \qquad (6)$$

● The functional (5) describes the number of training errors
● Removing the subset of training errors from the training set
● Remaining part separated without errors
● By constructing an optimal hyperplane

SVM - soft margin Idea

● Soft margin SVM can be expressed as

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (7)$$

$$\text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (8)$$

● For sufficiently large C, the vector $w_0$ and constant $b_0$ that minimize (7) under (8) determine the hyperplane that
– minimizes the sum of deviations, ξ, of training errors
– maximizes the margin for correctly classified vectors
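
For a fixed hyperplane, the smallest slack allowed by (8) is ξ_i = max(0, 1 - y_i (w · x_i + b)), so the objective (7) can be evaluated directly. A minimal sketch under that observation, with hypothetical data and a hand-picked (w, b); it evaluates the objective and does not solve the minimization.

```python
def soft_margin_objective(w, b, points, labels, C):
    """Return (objective, slacks) for (7), using the smallest slacks allowed by (8)."""
    slacks = []
    for x, y in zip(points, labels):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))  # xi_i = max(0, 1 - y_i (w . x_i + b))
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical data: two well-classified points, one inside the margin,
# and one on the wrong side of the separator.
points = [(2.0, 2.0), (0.0, 0.0), (1.5, 1.5), (4.0, 0.0)]
labels = [1, -1, 1, -1]
w, b = (0.5, 0.5), -1.0

objective, slacks = soft_margin_objective(w, b, points, labels, C=1.0)
```

A larger C makes each unit of slack more expensive, which pushes the solver towards fewer training errors at the price of a narrower margin.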

SVM - soft margin figure

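[Figure: soft margin separator in the (x1, x2) plane. Points are annotated with their slack: ξ = 0 on or outside the margin, 0 < ξ < 1 inside the margin but on the correct side of the separator, ξ > 1 on the wrong side of the separator]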

Constructing text classifier with SVM

Constructing and using the text classifier

● Which library?
– Efficient optimization packages are available
    ● SVMlight, LibSVM
● From text to feature vectors
– Lucene/Solr helps here
● Multi-class classification vs One-vs-the-rest
● Using the categories for semantic search
    ● Dedicated Solr index with the most predictive terms

SVM library

● SVMlight
– By Thorsten Joachims
● LibSVM
– By Chang & Lin from National Taiwan University
– Under heavy development and testing
– Library for Java, C, Python, ...; package for the R language
● LibLinear
– By Fan, Chang, Lin et al.
– Brother of LibSVM
– Recommended by the LibSVM authors for large-scale linear classification

LibLinear

● A Library for Large Linear Classification
– Binary and Multi-class
– Implements Logistic Regression and linear SVM
● Format of training and testing data file is:
– <label> <index1>:<value1> <index2>:<value2> ...
– Each line contains an instance and is ended by a '\n'
– <label> is an integer indicating the class label
– The pair <index>:<value> gives a feature value
    ● <index> is an integer starting from 1
    ● <value> is a real number
– Indices must be in ascending order
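
The format rules above can be sketched as a small formatting helper. This is an illustration only: `to_liblinear_line` is a hypothetical name, and the features are assumed to be held in a dict from 1-based index to value.

```python
def to_liblinear_line(label, features):
    """Format one instance as '<label> <index>:<value> ...'.

    features: dict mapping 1-based feature index -> value; zeros are skipped.
    """
    pairs = sorted((i, v) for i, v in features.items() if v != 0)
    if pairs and pairs[0][0] < 1:
        raise ValueError("feature indices start from 1")
    # Sorting the pairs guarantees ascending indices on the line.
    return " ".join([str(label)] + ["%d:%g" % (i, v) for i, v in pairs])


line = to_liblinear_line(1, {234: 2, 101: 1, 123: 5, 300: 0})
```

Each returned line would be written to the training file followed by a newline.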

LibLinear input and dictionary

● Example input file for training
    1 101:1 123:5 234:2
    -1 54:2 64:1 453:3
– Do not have to represent the zeros.
● Need a dictionary of terms in lexicographical order
    1 .net
    2 aa
    ...
    6000 jav
    ...
    7565 solr
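
A dictionary of this kind can be sketched as a mapping from each distinct term to its rank in sorted order. The helper name `build_dictionary` and the sample terms are illustrative; Python's default string ordering (by code point) is assumed to be an acceptable lexicographical order.

```python
def build_dictionary(terms):
    """Map each distinct term to a 1-based index, following lexicographical order."""
    return {term: index
            for index, term in enumerate(sorted(set(terms)), start=1)}


dictionary = build_dictionary(["solr", "java", ".net", "lucene", "java"])
```

The same dictionary has to be reused at training and at prediction time, since a feature index only has meaning relative to it.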

Building the dictionary

● Divide the overall training data into a number of portions
– Using knowledge of your domain
    ● Software development portion
    ● Marketing portion...
– Avoid a very large dictionary
    ● A Java dev position and a marketing position share few common terms
● Use expert boolean queries to load a dedicated Solr core per domain
– description:python AND title:python

Building the dictionary with Solr

● What do we need in the dictionary?
– Terms properly analyzed
    ● LowerCaseFilterFactory, StopFilterFactory,
    ● ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
– Terms that occur in a number of documents (df > min)
    ● Rare terms may cause the model to overfit
● Terms are retrieved from Solr
– Using the Solr TermVectorComponent
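
The document frequency cut-off can be sketched independently of Solr. In this minimal example the documents are assumed to be already analyzed into lists of terms, and `frequent_terms` and `min_df` are hypothetical names.

```python
from collections import Counter


def frequent_terms(documents, min_df):
    """Return, in sorted order, the terms that occur in more than min_df documents."""
    df = Counter()
    for terms in documents:
        df.update(set(terms))  # count each term once per document
    return sorted(term for term, count in df.items() if count > min_df)


documents = [
    ["java", "solr", "java"],
    ["java", "lucene"],
    ["solr", "java", "hadoop"],
]
kept = frequent_terms(documents, min_df=1)
```

With Solr, the same counts would come from the df values returned per term rather than from a pass over the raw documents.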

Solr TermVectorComponent

● SearchComponent designed to return information about terms in documents

– tv.df returns the document frequency per term in the document

– tv.tf returns document term frequency info per term in the document

● Used as feature value

– tv.fl provides the list of fields to get term vectors for
    ● Only the catch-all field we use for classification
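
A request using these parameters can be sketched by building the query string by hand. The host, the core name `jobs` and the handler path `/tvrh` are assumptions for the example and depend on how the request handler is configured; only `tv`, `tv.tf`, `tv.df` and `tv.fl` are the component parameters discussed above.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, query, field, rows=10):
    """Build a Solr request URL asking for tf and df of each term in `field`."""
    params = {
        "q": query,
        "rows": rows,
        "tv": "true",      # enable the term vector component
        "tv.tf": "true",   # term frequency per term in the document
        "tv.df": "true",   # document frequency per term
        "tv.fl": field,    # field to get term vectors for
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical core and handler path.
url = term_vector_url(
    "http://localhost:8983/solr/jobs/tvrh",
    "title:python",
    "title_and_description",
)
```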

Solr Core configuration

● Set the termVectors attribute on the fields you will use
– <field name="title_and_description" type="texte_analyse" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
– Normalize your text and use stemming during the analysis
● Enable TermVectorComponent in solrconfig
– <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
– Configure a RequestHandler to use this component
    ● <lst name="defaults"> <bool name="tv">true</bool> </lst>
    ● <arr name="last-components"> <str>tvComponent</str> </arr>

Constructing Training and Test sets per model

Feature extraction

● A domain expert query is used to extract docs for each category
– TVC returns the term info for the terms in each document
– Each term is replaced by its index from the dictionary
    ● This is the attribute
– Its tf info is used as value
    ● Some use presence/absence (or 1/0)
    ● Others tf-idf

term_index_from_dico:term_freq is an input feature
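
The mapping from terms to index:value features can be sketched as follows. The function `to_features` and its parameters are hypothetical, and the tf-idf branch uses one common variant, tf * log(N / df), which is an assumption rather than the formula used in the talk.

```python
import math


def to_features(term_freqs, dictionary, weighting="tf", doc_freqs=None, num_docs=None):
    """Turn {term: tf} into {dictionary index: value}; unknown terms are dropped."""
    features = {}
    for term, tf in term_freqs.items():
        index = dictionary.get(term)
        if index is None:
            continue  # term is not in the dictionary, so it is not a feature
        if weighting == "binary":
            value = 1  # presence/absence
        elif weighting == "tfidf":
            value = tf * math.log(num_docs / doc_freqs[term])
        else:
            value = tf  # raw term frequency
        features[index] = value
    return features


dictionary = {".net": 1, "java": 2, "lucene": 3, "solr": 4}
term_freqs = {"solr": 3, "java": 1, "python": 2}

tf_features = to_features(term_freqs, dictionary)
binary_features = to_features(term_freqs, dictionary, weighting="binary")
```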

Training and Test sets partition

● We shuffle the document set so that high score docs do not go to the same bucket
● We split the result list so that
– 60 % goes to the training set (TS)
    ● Here are positive examples (the +1s)
– 20 % goes to the validation set (VS)
    ● Positive in this model, negative in others
– 20 % is used for the other classes' training sets (OTS)
    ● These are negative examples to others
● Balanced training set (≈50 % of +1s and ≈50 % of -1s)
– The negatives come from the others' 20 % OTS
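
The shuffle and 60/20/20 split can be sketched as below. The function name `split_documents` and the fixed seed are illustrative choices; documents are represented by their ids.

```python
import random


def split_documents(doc_ids, seed=42):
    """Shuffle, then split 60/20/20 into training, validation and OTS buckets."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * 60 // 100
    n_valid = len(shuffled) * 20 // 100
    return (
        shuffled[:n_train],                   # TS: positives for this category
        shuffled[n_train:n_train + n_valid],  # VS: held out for validation
        shuffled[n_train + n_valid:],         # OTS: negatives for other categories
    )


training, validation, other_training = split_documents(range(100))
```

The balanced training set for a category would then combine its own TS bucket with a comparable number of documents drawn from the OTS buckets of the other categories.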

Model file

● Model file is saved after training
– One model per category
– It outlines the following
    solver_type L2R_L2LOSS_SVC
    nr_class 2
    label 1 -1
    nr_feature 8920
    bias 1.000000000000000
    w
    -0.1626437446641374
    0
    7.152404908494515e-05

$w \cdot x_i + b \ge 1$ if $y_i = 1$
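
A model file with this layout can be read back and applied with a few lines of code. The sketch below is not the LibLinear API; it assumes a two-class model, one weight per line after the `w` marker, and, when the bias is non-negative, one extra trailing weight for the bias feature. The sample weights are made up.

```python
def parse_model(text):
    """Parse the header fields and the weight vector of a two-class linear model file."""
    header, weights, in_weights = {}, [], False
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if in_weights:
            weights.append(float(line))
        elif line == "w":
            in_weights = True  # every following line is one weight
        else:
            key, _, value = line.partition(" ")
            header[key] = value
    return header, weights


def decision_value(header, weights, features):
    """Compute w . x plus the bias term, for features given as {1-based index: value}."""
    n = int(header["nr_feature"])
    score = sum(weights[i - 1] * v for i, v in features.items() if i <= n)
    bias = float(header["bias"])
    if bias >= 0 and len(weights) > n:
        score += weights[n] * bias  # assumed: the last weight belongs to the bias feature
    return score


# Made-up model with three features.
model_text = """solver_type L2R_L2LOSS_SVC
nr_class 2
label 1 -1
nr_feature 3
bias 1.000000000000000
w
-0.5
0
0.25
0.125
"""
header, weights = parse_model(model_text)
score = decision_value(header, weights, {1: 2.0, 3: 4.0})
```

Under the assumption that a positive value corresponds to the first label listed on the `label` line, a positive score here would put the document in the category.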

Most predictive terms

● Model file contains the weight vector w
● Use w to compute the most predictive terms of the model
– Gives an indication as to whether the model is good or not
    ● You are the domain expert
– Useful to extend basic keyword search to semantic search
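
Ranking terms by weight can be sketched as follows, assuming that a larger positive weight means a stronger indication of the category and that the dictionary maps terms to 1-based indices. The name `most_predictive_terms` and the sample weights are illustrative.

```python
def most_predictive_terms(weights, dictionary, k=3):
    """Return up to k terms with the largest positive weights, best first."""
    index_to_term = {index: term for term, index in dictionary.items()}
    ranked = sorted(
        ((w, index_to_term[i + 1])
         for i, w in enumerate(weights) if i + 1 in index_to_term),
        reverse=True,
    )
    return [term for w, term in ranked[:k] if w > 0]


dictionary = {".net": 1, "java": 2, "lucene": 3, "solr": 4}
weights = [-0.4, 0.1, 0.7, 0.9]

top_terms = most_predictive_terms(weights, dictionary, k=2)
```

Reading the resulting list is a quick sanity check: if the top terms make no sense to a domain expert, the training data for that category deserves a second look.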

Toward semantic search - Indexing

● Create a category core in Solr
– Each document represents a category
    ● One field for the category ID
    ● One multi-valued field holds its top predictive terms
● At indexing time
– Each document is sent to the classification service
– The service returns the categories of the document
– Categories are saved in a multi-valued field along with other domain-pertinent document fields

Toward semantic search - searching

● At search time
– User query is run on the category core
    ● What about libShortText
– The returned categories are used to extend the initial query
    ● A boost < 1 is assigned to the category
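
Extending the query with boosted categories can be sketched as simple string building in Lucene query syntax. The field name `categories`, the boost of 0.5 and the function `expand_query` are assumptions for the example.

```python
def expand_query(user_query, categories, boost=0.5, field="categories"):
    """Append the predicted categories to the user query as boosted optional clauses."""
    if not categories:
        return user_query
    clauses = ['%s:"%s"^%g' % (field, category, boost) for category in categories]
    return "(%s) OR %s" % (user_query, " OR ".join(clauses))


expanded = expand_query("emarketing", ["webmarketing"])
```

Keeping the boost below 1 lets documents that match the user's own keywords rank above documents that only match through the category.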

References

● Cortes and Vapnik, 1995. Support-Vector Networks
● Chang and Lin, 2012. LibSVM: A Library for Support Vector Machines
● Fan, Lin, et al., 2012. LibLinear: A Library for Large Linear Classification
● Thorsten Joachims, 1998. Text Categorization with SVM: Learning with many relevant features
● Rifkin and Klautau, 2004. In Defense of One-Vs-All Classification

A big thank you

● Lucene/Solr Revolution EU 2013 organizers

● To Valtech Management

● To Michels, Maj-Daniels, and Marie-Audrey Fansi

● To all of you for your presence and attention

Questions ?

To my wife, Marie-Audrey, for all the attention she pays to our family
