Data Mining using Mahout - IIIT Hyderabad (search.iiit.ac.in/cloud/presentations/8.pdf)
TRANSCRIPT
Data Mining using Mahout
Team No. 8
Pratibha Rani
Prashant Sethia
Manisha Verma
What is Mahout?
Subproject of Apache Lucene
◦ Goal: delivering scalable machine learning algorithm implementations
◦ http://lucene.apache.org/mahout/
Version 0.1, released on 07 April 2009, includes 10 algorithm libraries
◦ Details in published paper: http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
Objective
Implement two Data Mining/Machine Learning algorithms
◦ Convert the algorithms into the MapReduce paradigm
◦ Implement them using Hadoop
◦ Optimize the computation to take advantage of the MapReduce paradigm
Integrate them into the Mahout library
◦ Make them available online
Implemented Algorithms
Classification of multi-class data using a Linear Discriminant Function (LDF)
◦ Machine Learning method for classification
◦ Computational cost increases as the number of classes increases
SPRINT
◦ Decision-tree-based parallel classifier for Data Mining
◦ Requires parallelization of computations
Decision Tree Example
Attribute Lists
Algorithm
Algorithm (contd.)
SPRINT: Introduction
Carry out the decision tree building process in parallel
◦ Frequent lookup of the central class list produces a lot of network communication in the parallel case
◦ Solution: eliminate the class list
Class labels are distributed to each attribute list
=> Redundant data, but the memory-residency and network-communication bottlenecks are removed
Each node keeps its own set of attribute lists
=> No need to look up the node information
Each node is assigned a partition of each attribute list. The nodes are ordered so that the combined lists of non-categorical attributes remain sorted.
Each node produces its local histograms in parallel; the combined histograms are used to find the best splits.
Conversion of SPRINT into the MapReduce paradigm
For each attribute, create an Attribute-list from the given dataset using MapReduce
Use MapReduce to sort all Attribute-lists
Convert the tree construction algorithm into MapReduce format
Write a MapReduce job to read test samples from a user-given input file and traverse the constructed tree to find class labels
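The first two steps above can be sketched in plain Java (no Hadoop): the "map" phase emits one (value, class label, record id) entry per attribute, and each attribute list is then sorted by value. The `Entry` class and `build` method are illustrative names, not actual Mahout code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of SPRINT attribute-list creation: one sorted list per attribute,
// each entry carrying (attribute value, class label, record id).
public class AttributeLists {
    public static class Entry {
        public final double value;   // attribute value
        public final String label;   // class label of the record
        public final int rid;        // record id, links entries across lists
        Entry(double value, String label, int rid) {
            this.value = value; this.label = label; this.rid = rid;
        }
    }

    // data[rid][j] is the value of attribute j for record rid.
    public static List<List<Entry>> build(double[][] data, String[] labels) {
        int numAttrs = data[0].length;
        List<List<Entry>> lists = new ArrayList<>();
        for (int j = 0; j < numAttrs; j++) lists.add(new ArrayList<>());
        // "map" phase: emit one entry per (record, attribute) pair
        for (int rid = 0; rid < data.length; rid++)
            for (int j = 0; j < numAttrs; j++)
                lists.get(j).add(new Entry(data[rid][j], labels[rid], rid));
        // "sort" phase: order each attribute list by value
        for (List<Entry> list : lists)
            list.sort(Comparator.comparingDouble(e -> e.value));
        return lists;
    }
}
```

In the real job the sort would be done by a second MapReduce pass, as the slide says; here a local sort stands in for it.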
SPRINT contd…
Tree construction:
1. At each node, use MapReduce to find the "Gini index" of each attribute in parallel
2. Find the attribute with the lowest value of Gini index
   1. Split the Attribute-list using MapReduce
   2. Make new nodes for each split
3. Repeat the above steps until no unused attributes are left or the Attribute-list contains records from only one class
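The Gini computation in step 1 can be sketched in plain Java; the class and method names are illustrative. Each candidate split is scored by the size-weighted Gini impurity of its two sides, computed from the class histograms the nodes build in parallel.

```java
// Sketch of the Gini scoring SPRINT uses to pick the best split point.
public class GiniIndex {
    // Gini impurity of one partition, given its class histogram:
    // gini = 1 - sum over classes of p_i^2
    public static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSq = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    // Weighted Gini of a binary split described by the "below" and "above"
    // class histograms; lower is better.
    public static double giniSplit(int[] left, int[] right) {
        int nLeft = 0, nRight = 0;
        for (int c : left) nLeft += c;
        for (int c : right) nRight += c;
        int n = nLeft + nRight;
        return (double) nLeft / n * gini(left)
             + (double) nRight / n * gini(right);
    }
}
```

A split that puts all of class 1 on one side and all of class 2 on the other scores 0, the best possible value; a 50/50 mix scores 0.5.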
Advantages of using MapReduce
Automatic parallelization of the decision tree building process
Does not require a parallel sorting algorithm
◦ Computationally expensive, and requires shared-memory parallel processors
LDF: Introduction
Represent pattern classifiers in terms of a set of discriminant functions gᵢ(x), i = 1, …, n
The classifier assigns a feature vector x to class wᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i
Transform gᵢ(x) into the form g(x) = aᵗy
A sample yᵢ labeled c₁ is classified correctly if aᵗyᵢ > 0; a sample labeled c₂ is classified correctly if aᵗyᵢ < 0
LDF: Introduction contd…
Replace all the samples labeled c₂ by their negatives
◦ find a solution weight vector 'a' such that aᵗyᵢ > 0 for all samples
◦ the weight vector 'a' is called a separating vector or solution vector
◦ use the Fixed-Increment Single-Sample Perceptron algorithm to find this vector
For the multiclass case with n classes
◦ n(n-1)/2 such separating vectors, one for each pair of classes
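The Fixed-Increment Single-Sample Perceptron on the negated samples can be sketched in plain Java. Starting the weight vector at zero and capping the number of epochs are illustrative choices here; the update rule itself (add the misclassified sample to 'a') is the algorithm named above.

```java
// Sketch of the Fixed-Increment Single-Sample Perceptron for one class pair.
public class FixedIncrementPerceptron {
    // samples: augmented feature vectors y_i, with the class-c2 samples
    // already replaced by their negatives, so the goal is a . y_i > 0 for all i.
    public static double[] train(double[][] samples, int maxEpochs) {
        int d = samples[0].length;
        double[] a = new double[d];              // start from the zero vector
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            boolean misclassified = false;
            for (double[] y : samples) {
                double dot = 0.0;
                for (int j = 0; j < d; j++) dot += a[j] * y[j];
                if (dot <= 0.0) {                // fixed-increment update: a <- a + y
                    for (int j = 0; j < d; j++) a[j] += y[j];
                    misclassified = true;
                }
            }
            if (!misclassified) break;           // 'a' is now a separating vector
        }
        return a;
    }
}
```

For linearly separable data the loop terminates with a separating vector; the epoch cap only guards against non-separable input.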
Conversion of LDF into the MapReduce paradigm
Requires the following tasks:
1. Analyze the given dataset so that it can be divided into pairs of two classes, for all possible pairs of classes.
2. Transform the computation algorithm into MapReduce format.
3. Store the final separating vectors obtained from the computation algorithm for classifying new samples.
4. Write a MapReduce job which uses the obtained separating vectors to classify samples given by the user.
LDF contd…
Task 1: divide the data into n(n-1)/2 class pairs
◦ The first Mapper sorts the input file according to the class labels and outputs class labels and records
  The Reducer collects the sorted records in an output file
◦ The second Mapper takes the sorted file as input and outputs the class label and byte offset of each record
  The Reducer takes a class label and a vector of byte offsets as input and collects the class label and the minimum and maximum byte offset for that class
◦ n(n-1)/2 files are created, one for each pair of classes, to store the start and end byte offsets of the records of those classes
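The reduce step of that second pass can be sketched in plain Java: for each class label, the list of record byte offsets emitted by the mapper is collapsed to its minimum and maximum, giving the start and end of that class's block in the sorted file. The method and parameter names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Task 1 reducer: class label -> {min offset, max offset}.
public class OffsetRanges {
    public static Map<String, long[]> reduce(Map<String, List<Long>> offsetsByClass) {
        Map<String, long[]> ranges = new HashMap<>();
        for (Map.Entry<String, List<Long>> e : offsetsByClass.entrySet()) {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (long off : e.getValue()) {
                min = Math.min(min, off);
                max = Math.max(max, off);
            }
            // start and end byte offset of this class's records
            ranges.put(e.getKey(), new long[]{min, max});
        }
        return ranges;
    }
}
```

Pairing up these per-class ranges then yields the n(n-1)/2 class-pair files the slide describes.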
LDF contd…
Task 2: transformation of the Fixed-Increment Single-Sample Perceptron algorithm into MapReduce format
◦ n(n-1)/2 input files create the same number of Map tasks in the new job
  all the n(n-1)/2 LDFs are created in parallel
◦ The Mapper takes class label pairs along with their offsets as input and produces two class labels and the corresponding separating vector as output
  The Reducer simply collects these in an output file
LDF contd…
Task 3: store the output file created by Task 2
◦ contains the parameters of the trained classifier, in the form of class label pairs and the corresponding separating vectors
Testing: classify test samples
◦ The Mapper reads a sample from the user-given input file and finds the class label of the sample
  The Reducer outputs the samples with their class label names
Flow chart
1st phase of MapReduce
2nd phase of MapReduce
Final phase of MapReduce
Testing of Data
Read Classifier/part00000
Store the weight vectors for each plane
The input test file is divided into chunks; each chunk is given to the Mapper class in Classifier.java
Depending upon the output (correct or incorrect classification), the output is collected
In the Reducer class, the correct and incorrect classifications for each class are summed up, and finally the result is written to Result/part00000
ResultAnalyzer.java reads this file and shows the output to the user
LDFDriver.java runs the Mapper, Reducer, Classifier, and ResultAnalyzer classes
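The summing done in the Reducer can be sketched in plain Java. The encoding here — mapper output as (class label, correct?) pairs, totals as an int[]{correct, incorrect} per class — is an illustrative assumption, not the actual Classifier.java code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the test-phase reducer: sum correct/incorrect counts per class.
public class ResultReducer {
    // mapperOutput: one (classLabel, wasCorrect) pair per classified sample.
    public static Map<String, int[]> reduce(List<Map.Entry<String, Boolean>> mapperOutput) {
        Map<String, int[]> totals = new HashMap<>();
        for (Map.Entry<String, Boolean> e : mapperOutput) {
            // totals value: index 0 = correct count, index 1 = incorrect count
            int[] t = totals.computeIfAbsent(e.getKey(), k -> new int[2]);
            if (e.getValue()) t[0]++; else t[1]++;
        }
        return totals;
    }
}
```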
Classes Implemented
LDFMapper.java
LDFReducer.java
LDFDriver.java
LDFClassifier.java
LDFResultAnalyzer.java
Advantages from MapReduce
Computation of the n(n-1)/2 LDFs is automatically parallelized, which reduces the training time of the classifier
Replication of the input data is avoided by sorting it using MapReduce and storing only the line-offset information
◦ helps in reducing the storage requirement during training
Installation of Mahout
Download the tar files of both the apache-mahout and apache-maven projects
Unzip the tar files in a directory
Set the path variables for maven
Configure settings.xml in the conf directory of the maven folder to add the HTTP proxy and port
Set the present working directory to mahout's core folder
Compile the project with 'mvn compile'
Build the project with 'mvn install'
Troubleshooting during installation
By default the proxy address and port are not set
◦ the build fails due to unavailability of packages
Solution: set the proxy address and port before building the packages
Once the build is complete, running the examples is the same as running any Java program
Integration with the Mahout library
Make wrapper classes which encapsulate the working of the Map and Reduce functions
Make classes to set variables for the dataset
Make classes to model exceptions that can possibly occur
Put all files related to an algorithm in one folder and add it as a package in the library
Thanks