Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Upload: lucenerevolution

Post on 11-May-2015

DESCRIPTION

In this session we show how to build a text classifier using Apache Lucene/Solr together with the libSVM libraries. We classify our corpus of job offers into a number of predefined categories; each indexed document (a job offer) then belongs to zero, one or more categories. Well-known machine learning techniques for text classification include naïve Bayes, logistic regression, neural networks and support vector machines (SVM). We use Lucene/Solr to construct the feature vectors, then use the libsvm library, known as the reference implementation of the SVM model, to classify the documents. We construct as many one-vs-all SVM classifiers as there are classes in our setting, then reconcile the results of these classifiers using the Hadoop MapReduce framework. The end result is a scalable multi-class classifier. Finally, we outline how the classifier is used to enrich basic Solr keyword search.

TRANSCRIPT

Page 1: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Page 2: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Text classification with Lucene/Solr and LibSVM

By Majirus FANSI, PhD
Agile Software Developer
@majirus

Page 3: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Motivation: Guiding user search

● Search engines are basically keyword-oriented
  – What about the meaning?
● Synonym search needs listing the synonyms
● More-Like-This component is about more like THIS
● Category search for better user experience
  – Deals with the cases where user keywords are not in the collection
  – User searches for « emarketing », you return documents on « webmarketing »

Page 4: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Outline

● Text Categorization

● Introducing Machine Learning

● Why SVM?

● How Solr can help ?

Putting it all Together is our aim

Page 5: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Text classification or Categorization

● Aims
  – Classify documents into a fixed number of predefined categories
    ● Each document can be in multiple, exactly one, or no category at all.
● Applications
  – Classifying emails (Spam / Not Spam)
  – Guiding user search
● Challenges
  – Building text classifiers by hand is difficult and time consuming
  – It is advantageous to learn classifiers from examples

Page 6: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Machine Learning

● Definition (by Tom Mitchell - 1998)
  “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E”
● Experience E: watching the label of a document
● Task T: classify a document
● Performance P: probability that a document is correctly classified.

Page 7: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Machine Learning Algorithms

● Unsupervised learning
  – Let the program learn by itself
    ● Market segmentation, social network analysis...
● Supervised learning
  – Teach the computer program how to do something
  – We give the algorithm the “right answers” for some examples

Page 8: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Supervised learning problems

– Regression
  ● Predict continuous valued output
  ● Ex: price of the houses in Corbeil Essonnes
– Classification
  ● Predict a discrete valued output (+1, -1)

Page 9: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Supervised learning: Working

[Figure: the training set (X, Y) feeds the training algorithm, which produces the hypothesis h; h takes a feature vector x and returns the predicted value y = h(x).]

● Training set (X, Y): m training examples
  – (X(i), Y(i)): ith training example
  – X's: input variables or features
  – Y's: output/target variable
● It's the job of the learning algorithm to produce the model h

Page 10: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Classifier/Decision Boundary

● Carves up the feature space into volumes
● Feature vectors in volumes assigned to the same class
● Decision regions separated by surfaces
● Decision boundary is linear if it is a straight line in the N-dimensional space
  – A line in 2D, a plane in 3D, a hyperplane in 4+D

Page 11: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Which Algorithm for text classifier

Page 12: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Properties of text

● High dimensional input space
  – More than 10 000 features
● Very few irrelevant features
● Document vectors are sparse
  – Few entries which are non zero
● Most text categorization problems are linearly separable

No need to map the input features to a higher dimension space

Page 13: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Classification algorithm /choosing the method

● Thorsten Joachims compares SVM to Naive Bayes, Rocchio, K-nearest neighbor and C4.5 decision tree

● SVM consistently achieves good performance on categorization tasks
  – It outperforms the other methods
  – Eliminates the need for feature selection
  – More robust than the others

Thorsten Joachims, 1998. Text Categorization with SVM : Learning with many relevant features

Page 14: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM ? Yes But...

« The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora »

Banko & Brill in « Scaling to Very Very Large Corpora for Natural Language Disambiguation »

Page 15: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

What is SVM - Support Vector Machine?

● « Support Vector Networks » Cortes & Vapnik, 1995
● SVM implements the following idea
  – Maps the input vectors into some high dimensional feature space Z
    ● Through some non-linear mapping chosen a priori
  – In this feature space a linear decision surface is constructed
  – Special properties of the decision surface ensure high generalization ability of the learning machine

Page 16: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM - Classification of an unknown pattern

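[Figure: classification of an unknown pattern. Labels, from input to output: input vector x; non-linear transformation; input vector in feature space; support vectors z_i in feature space (sv1, sv2, …, svk); weights w1, w2, …, wN; classification.]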

Page 17: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM - decision boundary

● Optimal hyperplane
  – Training data can be separated without errors
  – It is the linear decision function with maximal margin between the vectors of the two classes
● Soft margin hyperplane
  – Training data cannot be separated without errors

Page 18: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Optimal hyperplane

Page 19: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Optimal hyperplane - figure

[Figure: the optimal hyperplane and the optimal margin, drawn in the (x1, x2) plane.]

Page 20: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM - optimal hyperplane

● Given the training set X of $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$; $y_i \in \{-1, 1\}$
● X is linearly separable if there exists a vector w and a scalar b s.t.
    $w \cdot x_i + b \ge 1$ if $y_i = 1$    (1)
    $w \cdot x_i + b \le -1$ if $y_i = -1$    (2)
    (1), (2) $\Rightarrow y_i (w \cdot x_i + b) \ge 1$    (3)
● Vectors $x_i$ for which $y_i (w \cdot x_i + b) = 1$ are termed support vectors
  – Used to construct the hyperplane
  – If the training vectors are separated without errors by an optimal hyperplane
    $E[\Pr(\text{error})] \le \dfrac{E[\text{number of support vectors}]}{\text{number of training vectors}}$    (4)
● The optimal hyperplane
  – Unique one which separates the training data with a maximal margin
    $w_0 \cdot z + b_0 = 0$    (5)
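Condition (3) and the definition of support vectors can be checked numerically for any candidate (w, b). The following is a minimal sketch in plain Python; the function name and the toy points are illustrative assumptions, and the code only verifies a given hyperplane, it does not search for the optimal one.

```python
def margins(w, b, xs, ys):
    """Return y_i * (w . x_i + b) for every training example."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for x, y in zip(xs, ys)]

# Toy 2D training set (made-up values) and a candidate hyperplane.
xs = [(2.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (-3.0, 2.0)]
ys = [1, 1, -1, -1]
m = margins((1.0, 0.0), 0.0, xs, ys)

# Condition (3): every example satisfies y_i (w . x_i + b) >= 1.
separates = all(v >= 1 for v in m)
# Examples meeting (3) with equality for this particular (w, b).
on_margin = [x for x, v in zip(xs, m) if abs(v - 1.0) < 1e-9]
print(separates, on_margin)
```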

Page 21: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM - optimal hyperplane – decision function

● Let us consider the optimal hyperplane
    $w_0 \cdot z + b_0 = 0$    (5)
● The weight $w_0$ can be written as some linear combination of SVs
    $w_0 = \sum_{\text{support vectors}} \alpha_i z_i$    (6)
● The linear decision function I(z) is of the form
    $I(z) = \operatorname{sign}\left( \sum_{\text{support vectors}} \alpha_i \, z_i \cdot z + b_0 \right)$    (7)
● $z_i \cdot z$ is the dot product between support vector $z_i$ and vector z
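The decision function (7) can be evaluated directly once the support vectors, their coefficients and the offset are known. Below is a minimal sketch in plain Python, assuming each coefficient α_i already carries the sign of its support vector's label, as the slide's notation suggests. The function name and the toy values are illustrative and do not come from the talk.

```python
def decision(support_vectors, alphas, b0, z):
    """Return sign(sum_i alpha_i * (z_i . z) + b0) as +1 or -1."""
    score = b0
    for z_i, alpha_i in zip(support_vectors, alphas):
        score += alpha_i * sum(a * c for a, c in zip(z_i, z))
    return 1 if score >= 0 else -1

# Two made-up support vectors in 2D with signed coefficients.
svs = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, -0.5]
print(decision(svs, alphas, 0.0, (2.0, 0.5)))
print(decision(svs, alphas, 0.0, (-3.0, 0.0)))
```

Only the support vectors enter the sum, which is why the trained model stays compact even when the training set is large.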

Page 22: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Soft margin hyperplane

Page 23: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Soft margin Classification

● Want to separate the training set with a minimal number of errors
    $\Phi(\xi) = \sum_{i=1}^{m} \xi_i^{\sigma}$; $\xi_i \ge 0$; for small $\sigma > 0$    (5)
    s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$; $i = 1, \dots, m$    (6)
● The functional (5) describes the number of training errors
● Removing the subset of training errors from training set
● Remaining part separated without errors
● By constructing an optimal hyperplane

Page 24: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM - soft margin Idea

● Soft margin SVM can be expressed as
    $\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$    (7)
    s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$    (8)
● For sufficiently large C, the vector $w_0$ and constant $b_0$ that minimize (7) under (8) determine the hyperplane that
  – Minimizes the sum of deviations, ξ, of training errors
  – Maximizes the margin for correctly classified vectors
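For a fixed (w, b), the smallest slack that satisfies (8) is ξ_i = max(0, 1 − y_i (w·x_i + b)), so the value of the objective (7) can be computed directly. The sketch below does this in plain Python; the function name, the toy data and the value of C are illustrative assumptions.

```python
def soft_margin_objective(w, b, xs, ys, C):
    """Value of (7), using the smallest slacks xi_i that satisfy (8)."""
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

# Made-up 2D examples: the second one falls inside the margin.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
ys = [1, 1, -1]
value, slacks = soft_margin_objective((1.0, 0.0), 0.0, xs, ys, C=10.0)
print(value, slacks)
```

A larger C makes each unit of slack more expensive, which pushes the solution toward fewer margin violations at the cost of a narrower margin.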

Page 25: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM - soft margin figure

[Figure: soft margin in the (x1, x2) plane, showing the separator and the soft margin; points are labelled ξ = 0, 0 < ξ < 1 and ξ > 1.]

Page 26: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Constructing text classifier with SVM

Page 27: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Constructing and using the text classifier

● Which library?
  – Efficient optimization packages are available
    ● SVMlight, LibSVM
● From text to features vectors
  – Lucene/Solr helps here
● Multi-class classification vs One-vs-the-rest
● Using the categories for semantic search
● Dedicated Solr index with the most predictive terms
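With one binary classifier per category, a document can end up in zero, one or several categories. The sketch below shows one plausible way to combine the per-category scores; it is not necessarily the reconciliation rule used in the talk. The category names, the scores and the zero threshold are illustrative assumptions.

```python
def assign_categories(scores, threshold=0.0):
    """Keep every category whose binary classifier score exceeds threshold."""
    return sorted(cat for cat, s in scores.items() if s > threshold)

# Hypothetical decision values returned by three one-vs-rest models.
scores = {"java-dev": 1.3, "marketing": -0.7, "devops": 0.2}
print(assign_categories(scores))
print(assign_categories({"java-dev": -0.1, "marketing": -2.0}))
```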

Page 28: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

SVM library

● SVMlight
  – By Thorsten Joachims
● LibSVM
  – By Chang & Lin from National Taiwan University
  – Under heavy development and testing
  – Library for Java, C, Python, ...; package for the R language
● LibLinear
  – By Chang, Lin et al.
  – Brother of LibSVM
  – Recommended by LibSVM authors for large-scale linear classification

Page 29: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

LibLinear

● A Library for Large Linear Classification
  – Binary and Multi-class
  – Implements Logistic Regression and linear SVM
● Format of training and testing data file is:
  – <label> <index1>:<value1> <index2>:<value2> ...
  – Each line contains an instance and is ended by a '\n'
  – <label> is an integer indicating the class label
  – The pair <index>:<value> gives a feature value
    ● <index> is an integer starting from 1
    ● <value> is a real number
  – Indices must be in ascending order
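The format rules above can be applied mechanically. Here is a minimal sketch in plain Python that renders one instance line from a label and a sparse feature map; the function name is hypothetical, and the sample values are chosen only to show the sorting and zero-skipping.

```python
def format_instance(label, features):
    """Render one line: '<label> <index>:<value> ...'.

    `features` maps 1-based integer indices to values. Zero values are
    omitted and indices are written in ascending order.
    """
    if any(i < 1 for i in features):
        raise ValueError("feature indices start from 1")
    pairs = [f"{i}:{v:g}" for i, v in sorted(features.items()) if v != 0]
    return " ".join([str(label)] + pairs)

print(format_instance(1, {234: 2, 101: 1, 123: 5}))
print(format_instance(-1, {54: 2, 453: 3, 64: 1, 7: 0}))
```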

Page 30: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

LibLinear input and dictionary

● Example input file for training
    1 101:1 123:5 234:2
    -1 54:2 64:1 453:3
  – Do not have to represent the zeros.
● Need a dictionary of terms in lexicographical order
    1 .net
    2 aa
    ...
    6000 jav
    ...
    7565 solr

Page 31: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Building the dictionary

● Divide the overall training data into a number of portions
  – Using knowledge of your domain
    ● Software development portion
    ● Marketing portion...
  – Avoid a very large dictionary
    ● A Java dev position and a marketing position share few common terms
● Use expert boolean queries to load a dedicated Solr core per domain
  – description:python AND title:python

Page 32: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Building the dictionary with Solr

● What do we need in the dictionary
  – Terms properly analyzed
    ● LowerCaseFilterFactory, StopFilterFactory,
    ● ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
  – Terms that occur in a number of documents (df > min)
    ● Rare terms may cause the model to overfit
● Terms are retrieved from Solr
  – Using Solr TermVectorComponent
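The dictionary requirements can be summarised as: drop rare terms, sort what remains lexicographically, and number the terms from 1. A minimal sketch in plain Python, assuming the document frequencies have already been collected from Solr into a dict. The function name, the `min_df` value and the sample counts are illustrative.

```python
def build_dictionary(doc_freqs, min_df):
    """Map each term with df > min_df to a 1-based index, in lexicographical order."""
    kept = sorted(term for term, df in doc_freqs.items() if df > min_df)
    return {term: index for index, term in enumerate(kept, start=1)}

# Made-up document frequencies of analyzed (stemmed, lowercased) terms.
doc_freqs = {"solr": 120, "jav": 340, ".net": 95, "aa": 12, "xyzzy": 1}
dictionary = build_dictionary(doc_freqs, min_df=5)
print(dictionary)
```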

Page 33: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Solr TermVectorComponent

● SearchComponent designed to return information about terms in documents
  – tv.df returns the document frequency per term in the document
  – tv.tf returns document term frequency info per term in the document
    ● Used as feature value
  – tv.fl provides the list of fields to get term vectors for
    ● Only the catch-all field we use for classification
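A request using these parameters might be built as follows. This is a sketch under stated assumptions: the host, the core name `jobs` and the handler path `/tvrh` are placeholders for whatever your installation exposes, and the field name is the catch-all field from the core configuration slide. Only the `tv`, `tv.tf`, `tv.df` and `tv.fl` parameters come from the component itself.

```python
from urllib.parse import urlencode

def term_vector_url(base_url, field, query="*:*", rows=10):
    """Build a Solr request URL asking for tf and df of each term in `field`."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id",
        "tv": "true",      # switch the term vector component on
        "tv.tf": "true",   # term frequency per term in the document
        "tv.df": "true",   # document frequency per term
        "tv.fl": field,    # restrict term vectors to this field
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)

# Placeholder host, core and handler path.
url = term_vector_url("http://localhost:8983/solr/jobs/tvrh", "title_and_description")
print(url)
```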

Page 34: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Solr Core configuration

● Set termVectors attribute on fields you will use
  – <field name="title_and_description" type="texte_analyse" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
  – Normalize your text and use stemming during the analysis
● Enable TermVectorComponent in solrconfig
  – <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
  – Configure a RequestHandler to use this component
    ● <lst name="defaults"> <bool name="tv">true</bool> </lst>
    ● <arr name="last-components"> <str>tvComponent</str> </arr>

Page 35: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Constructing Training and Test sets per model

Page 36: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Feature extraction

● Domain expert query is used to extract docs for each category
  – TVC returns the terms info of the terms in each document
  – Each term is replaced by its index from the dictionary
    ● This is the attribute
  – Its tf info is used as value
    ● Some use presence/absence (or 1/0)
    ● Others tf-idf

term_index_from_dico:term_freq is an input feature
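The mapping from a document's terms to input features can be sketched as below, assuming the term frequencies returned by the TermVectorComponent have already been parsed into a dict. The function name and the sample frequencies are illustrative; the dictionary indices reuse the example dictionary from the earlier slide.

```python
def to_features(term_freqs, dictionary, binary=False):
    """Map {term: tf} to {dictionary index: value}, skipping unknown terms.

    With binary=True the value is 1 for presence instead of the raw tf.
    """
    return {dictionary[term]: (1 if binary else tf)
            for term, tf in term_freqs.items() if term in dictionary}

dictionary = {".net": 1, "aa": 2, "jav": 6000, "solr": 7565}
doc_tf = {"solr": 3, "jav": 1, "hadoop": 2}   # "hadoop" is not in the dictionary
print(to_features(doc_tf, dictionary))
print(to_features(doc_tf, dictionary, binary=True))
```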

Page 37: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Training and Test sets partition

● We shuffle the document set so that high score docs do not go to the same bucket
● We split the result list so that
  – 60 % goes to the training set (TS)
    ● These are the positive examples (the +1s)
  – 20 % goes to the validation set (VS)
    ● Positive in this model, negative in others
  – 20 % is used for other classes' training sets (OTS)
    ● These are negative examples for the others
● Balanced training set (≈50 % of +1s and ≈50 % of -1s)
  – The negatives come from the others' 20 % OTS
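The partition can be sketched as follows in plain Python. The function names, the fixed seed and the sample document ids are illustrative assumptions, and drawing as many negatives as there are positives is one way of reaching the roughly 50/50 balance mentioned above.

```python
import random

def split_documents(doc_ids, seed=0):
    """Shuffle, then split 60/20/20 into TS, VS and OTS."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)
    n_train = len(docs) * 60 // 100
    n_valid = len(docs) * 20 // 100
    return (docs[:n_train],
            docs[n_train:n_train + n_valid],
            docs[n_train + n_valid:])

def balanced_training_set(category, ts_by_cat, ots_by_cat, seed=0):
    """Positives from the category's TS, negatives from the others' OTS."""
    positives = [(doc, 1) for doc in ts_by_cat[category]]
    pool = [doc for cat, docs in ots_by_cat.items() if cat != category
            for doc in docs]
    chosen = random.Random(seed).sample(pool, min(len(pool), len(positives)))
    return positives + [(doc, -1) for doc in chosen]

ts, vs, ots = split_documents(range(50))
print(len(ts), len(vs), len(ots))

train = balanced_training_set(
    "java",
    {"java": ["j1", "j2"], "marketing": ["m1", "m2"]},
    {"java": ["j9"], "marketing": ["m7", "m8", "m9"]},
)
print(train)
```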

Page 38: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Model file

● Model file is saved after training
  – One model per category
  – It outlines the following
    ● solver_type L2R_L2LOSS_SVC
    ● nr_class 2
    ● label 1 -1
    ● nr_feature 8920
    ● bias 1.000000000000000
    ● w
      ● -0.1626437446641374
      ● 0
      ● 7.152404908494515e-05

$w \cdot x_i + b \ge 1$ if $y_i = 1$
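A model file with this layout can be read with a few lines of Python. The sketch below assumes the layout shown on the slide: header lines, a line containing only `w`, then one weight per line. The sample text is made up (three features only). When `bias` is non-negative, LibLinear stores one extra trailing weight for the bias feature, so the sketch separates it from the feature weights.

```python
def parse_model(text):
    """Split a model file into a header dict and the list of weights."""
    header, weights, in_weights = {}, [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if in_weights:
            weights.append(float(line.split()[0]))
        elif line == "w":
            in_weights = True
        else:
            key, _, value = line.partition(" ")
            header[key] = value
    return header, weights

# Made-up miniature model with 3 features plus the bias weight.
sample = """solver_type L2R_L2LOSS_SVC
nr_class 2
label 1 -1
nr_feature 3
bias 1.000000000000000
w
-0.1626437446641374
0
7.152404908494515e-05
0.25
"""
header, weights = parse_model(sample)
n = int(header["nr_feature"])
feature_weights, bias_weight = weights[:n], weights[n:]
print(header["solver_type"], feature_weights, bias_weight)
```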

Page 39: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Most predictive terms

● Model file contains the weight vector w
● Use w to compute the most predictive terms of the model
  – Gives an indication as to whether the model is good or not
    ● You are the domain expert
  – Useful to extend basic keyword search to semantic search
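One way to rank terms is to sort the dictionary entries by their weight and keep the largest positive ones. The sketch below assumes positive weights favour the +1 class, which is consistent with a model file whose label line reads `label 1 -1`. The function name, the dictionary and the weights are illustrative.

```python
def top_terms(weights, dictionary, k=3):
    """Return the k terms with the largest positive weights.

    weights[i] is the weight of the feature whose 1-based index is i + 1.
    """
    index_to_term = {index: term for term, index in dictionary.items()}
    ranked = sorted(
        ((w, index_to_term[i]) for i, w in enumerate(weights, start=1)
         if i in index_to_term and w > 0),
        reverse=True,
    )
    return [term for _, term in ranked[:k]]

# Made-up dictionary and weights for a "Java developer" model.
dictionary = {".net": 1, "aa": 2, "java": 3, "solr": 4}
weights = [-0.16, 0.0, 0.42, 0.91]
print(top_terms(weights, dictionary, k=2))
```

If the top terms look unrelated to the category, that is a hint that the training set or the expert query behind it needs another look.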

Page 40: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Toward semantic search - Indexing

● Create a category core in Solr
  – Each document represents a category
    ● One field for the category ID
    ● One multi-valued field holds its top predictive terms
● At indexing time
  – Each document is sent to the classification service
  – The service returns the categories of the document
  – Categories are saved in a multi-valued field along with other domain-pertinent document fields

Page 41: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Toward semantic search - searching

● At search time
  – User query is run on the category core
    ● What about libShortText
  – The returned categories are used to extend the initial query
    ● A boost < 1 is assigned to the category
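The query extension could look like the sketch below, which ORs the user's keywords with one boosted clause per returned category, using the standard Lucene `^` boost syntax. The field name `categories` and the boost value 0.5 are illustrative assumptions; the slides only state that the boost is below 1.

```python
def expand_query(user_query, categories, field="categories", boost=0.5):
    """OR the user's keywords with boosted category clauses."""
    if not 0 < boost < 1:
        raise ValueError("expected a boost strictly between 0 and 1")
    clauses = [f'{field}:"{category}"^{boost}' for category in categories]
    return " OR ".join([f"({user_query})"] + clauses)

# The category core matched "emarketing" to a hypothetical category id.
print(expand_query("emarketing", ["webmarketing"]))
```

Keeping the boost below 1 means documents that match the user's own keywords still rank above documents that only share the category.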

Page 42: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

References

● Cortes and Vapnik, 1995. Support-Vector Networks

● Chang and Lin, 2012. LibSVM : A Library for Support Vector Machines

● Fan, Lin, et al., 2012. LibLinear : A Library for Large Linear Classification

● Thorsten Joachims, 1998. Text Categorization with SVM : Learning with many relevant features

● Rifkin and Klautau, 2004. In Defense of One-Vs-All classification

Page 43: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

A big thank you

● Lucene/Solr Revolution EU 2013 organizers

● To Valtech Management

● To Michels, Maj-Daniels, and Marie-Audrey Fansi

● To all of you for your presence and attention

Page 44: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

Questions ?

Page 45: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM

To my wife, Marie-Audrey, for all the attention she pays to our family