Data Mining using Mahout - IIIT Hyderabad (search.iiit.ac.in/cloud/presentations/8.pdf)
TRANSCRIPT
Data Mining using Mahout
Team No. 8
Pratibha Rani
Prashant Sethia
Manisha Verma
What is Mahout?
Subproject of Apache Lucene
◦ Goal: delivering scalable machine learning algorithm implementations
◦ http://lucene.apache.org/mahout/
Version 0.1, released on 07 April 2009, includes 10 algorithm libraries
◦ Details in published paper: http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
Objective
Implement two Data Mining/Machine Learning algorithms
◦ Convert the algorithms into the MapReduce paradigm
◦ Implement them using Hadoop
◦ Optimize the computation to take advantage of the MapReduce paradigm
Integrate them into the Mahout library
◦ Make them available online
Implemented Algorithms
Classification of multi-class data using a Linear Discriminant Function (LDF)
◦ Machine Learning method for classification
◦ Computational cost increases as the number of classes increases
SPRINT
◦ Decision-tree-based parallel classifier for Data Mining
◦ Requires parallelization of computations
Decision Tree Example
Attribute Lists
Algorithm
Algorithm (contd.)
SPRINT: Introduction
Carry out the decision tree building process in parallel
◦ Frequent lookup of the central class list produces a lot of network communication in the parallel case
◦ Solution: eliminate the class list
Class labels are distributed to each attribute list
=> Redundant data, but the memory-residency and network-communication bottlenecks are removed
Each node keeps its own set of attribute lists
=> No need to look up the node information
Each node is assigned a partition of each attribute list. The nodes are ordered so that the combined lists of non-categorical attributes remain sorted.
Each node produces its local histograms in parallel; the combined histograms are used to find the best splits.
Conversion of SPRINT into the MapReduce paradigm
For each attribute, create an Attribute-list from the given dataset using MapReduce
Use MapReduce to sort all Attribute-lists
Convert the tree construction algorithm into MapReduce format
Write a MapReduce job to read test samples from a user-given input file and traverse the constructed tree to find class labels
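The first two steps above can be sketched in plain Java (no Hadoop): the "map" phase emits one (value, class label, record id) entry per attribute, and each attribute list is then sorted by value. The `Entry` class and `build` method are illustrative names, not actual Mahout code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of SPRINT attribute-list creation: one sorted list per attribute,
// each entry carrying (attribute value, class label, record id).
public class AttributeLists {
    public static class Entry {
        public final double value;   // attribute value
        public final String label;   // class label of the record
        public final int rid;        // record id, links entries across lists
        Entry(double value, String label, int rid) {
            this.value = value; this.label = label; this.rid = rid;
        }
    }

    // data[rid][j] is the value of attribute j for record rid.
    public static List<List<Entry>> build(double[][] data, String[] labels) {
        int numAttrs = data[0].length;
        List<List<Entry>> lists = new ArrayList<>();
        for (int j = 0; j < numAttrs; j++) lists.add(new ArrayList<>());
        // "map" phase: emit one entry per (record, attribute) pair
        for (int rid = 0; rid < data.length; rid++)
            for (int j = 0; j < numAttrs; j++)
                lists.get(j).add(new Entry(data[rid][j], labels[rid], rid));
        // "sort" phase: order each attribute list by value
        for (List<Entry> list : lists)
            list.sort(Comparator.comparingDouble(e -> e.value));
        return lists;
    }
}
```

In the real job the sort would be done by a second MapReduce pass, as the slide says; here a local sort stands in for it.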
SPRINT contd…
Tree construction:
1. At each node, use MapReduce to find the "Gini index" of each attribute in parallel
2. Find the attribute with the lowest value of Gini index
   1. Split the Attribute-list using MapReduce
   2. Make new nodes for each split
3. Repeat the above steps until no unused attributes are left or the Attribute-list contains records from only one class
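The Gini computation in step 1 can be sketched in plain Java; the class and method names are illustrative. Each candidate split is scored by the size-weighted Gini impurity of its two sides, computed from the class histograms the nodes build in parallel.

```java
// Sketch of the Gini scoring SPRINT uses to pick the best split point.
public class GiniIndex {
    // Gini impurity of one partition, given its class histogram:
    // gini = 1 - sum over classes of p_i^2
    public static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSq = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSq += p * p;
        }
        return 1.0 - sumSq;
    }

    // Weighted Gini of a binary split described by the "below" and "above"
    // class histograms; lower is better.
    public static double giniSplit(int[] left, int[] right) {
        int nLeft = 0, nRight = 0;
        for (int c : left) nLeft += c;
        for (int c : right) nRight += c;
        int n = nLeft + nRight;
        return (double) nLeft / n * gini(left)
             + (double) nRight / n * gini(right);
    }
}
```

A split that puts all of class 1 on one side and all of class 2 on the other scores 0, the best possible value; a 50/50 mix scores 0.5.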
Advantages of using MapReduce
Automatic parallelization of the decision tree building process
Does not require a parallel sorting algorithm
◦ Computationally expensive, and requires shared-memory parallel processors
LDF: Introduction
Represent pattern classifiers in terms of a set of discriminant functions gᵢ(x), i = 1, …, n
The classifier assigns a feature vector x to class wᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i
Transform gᵢ(x) into the form g(x) = aᵗy
A sample yᵢ labeled c₁ is classified correctly if aᵗyᵢ > 0; a sample labeled c₂ is classified correctly if aᵗyᵢ < 0
LDF: Introduction contd…
Replace all the samples labeled c₂ by their negatives
◦ find a solution weight vector 'a' such that aᵗyᵢ > 0 for all samples
◦ the weight vector 'a' is called a separating vector or solution vector
◦ use the Fixed-Increment Single-Sample Perceptron algorithm to find this vector
For the multiclass case with n classes
◦ n(n-1)/2 such separating vectors, one for each pair of classes
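The Fixed-Increment Single-Sample Perceptron on the negated samples can be sketched in plain Java. Starting the weight vector at zero and capping the number of epochs are illustrative choices here; the update rule itself (add the misclassified sample to 'a') is the algorithm named above.

```java
// Sketch of the Fixed-Increment Single-Sample Perceptron for one class pair.
public class FixedIncrementPerceptron {
    // samples: augmented feature vectors y_i, with the class-c2 samples
    // already replaced by their negatives, so the goal is a . y_i > 0 for all i.
    public static double[] train(double[][] samples, int maxEpochs) {
        int d = samples[0].length;
        double[] a = new double[d];              // start from the zero vector
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            boolean misclassified = false;
            for (double[] y : samples) {
                double dot = 0.0;
                for (int j = 0; j < d; j++) dot += a[j] * y[j];
                if (dot <= 0.0) {                // fixed-increment update: a <- a + y
                    for (int j = 0; j < d; j++) a[j] += y[j];
                    misclassified = true;
                }
            }
            if (!misclassified) break;           // 'a' is now a separating vector
        }
        return a;
    }
}
```

For linearly separable data the loop terminates with a separating vector; the epoch cap only guards against non-separable input.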
Conversion of LDF into the MapReduce paradigm
Requires the following tasks:
1. Analyze the given dataset so that it can be divided into pairs of two classes, for all possible pairs of classes.
2. Transform the computation algorithm into MapReduce format.
3. Store the final separating vectors obtained from the computation algorithm for classifying new samples.
4. Write a MapReduce job which uses the obtained separating vectors to classify samples given by the user.
LDF contd…
Task 1: divide the data into n(n-1)/2 class pairs
◦ The first Mapper sorts the input file according to the class labels and outputs class labels and records
  The Reducer collects the sorted records in an output file
◦ The second Mapper takes the sorted file as input and outputs the class label and byte offset of each record
  The Reducer takes a class label and a vector of byte offsets as input and collects the class label and the minimum and maximum byte offset for that class
◦ n(n-1)/2 files are created, one for each pair of classes, to store the start and end byte offsets of the records of those classes
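The reduce step of that second pass can be sketched in plain Java: for each class label, the list of record byte offsets emitted by the mapper is collapsed to its minimum and maximum, giving the start and end of that class's block in the sorted file. The method and parameter names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Task 1 reducer: class label -> {min offset, max offset}.
public class OffsetRanges {
    public static Map<String, long[]> reduce(Map<String, List<Long>> offsetsByClass) {
        Map<String, long[]> ranges = new HashMap<>();
        for (Map.Entry<String, List<Long>> e : offsetsByClass.entrySet()) {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            for (long off : e.getValue()) {
                min = Math.min(min, off);
                max = Math.max(max, off);
            }
            // start and end byte offset of this class's records
            ranges.put(e.getKey(), new long[]{min, max});
        }
        return ranges;
    }
}
```

Pairing up these per-class ranges then yields the n(n-1)/2 class-pair files the slide describes.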
LDF contd…
Task 2: transformation of the Fixed-Increment Single-Sample Perceptron algorithm into MapReduce format
◦ n(n-1)/2 input files create the same number of Map tasks in the new job
  all the n(n-1)/2 LDFs are created in parallel
◦ The Mapper takes class label pairs along with their offsets as input and produces two class labels and the corresponding separating vector as output
  The Reducer simply collects these in an output file
LDF contd…
Task 3: store the output file created by Task 2
◦ contains the parameters of the trained classifier, in the form of class label pairs and the corresponding separating vectors
Testing: classify test samples
◦ The Mapper reads a sample from the user-given input file and finds the class label of the sample
  The Reducer outputs the samples with their class label names
Flow chart
1st phase of MapReduce
2nd phase of MapReduce
Final phase of MapReduce
Testing of Data
Read Classifier/part00000
Store the weight vectors for each plane
The input test file is divided into chunks; each chunk is given to the Mapper class in Classifier.java
Depending upon the output (correct or incorrect classification), the output is collected
In the Reducer class, the correct and incorrect classifications for each class are summed up, and finally the result is written to Result/part00000
ResultAnalyzer.java reads this file and shows the output to the user
LDFDriver.java runs the Mapper, Reducer, Classifier, and ResultAnalyzer classes
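The summing done in the Reducer can be sketched in plain Java. The encoding here — mapper output as (class label, correct?) pairs, totals as an int[]{correct, incorrect} per class — is an illustrative assumption, not the actual Classifier.java code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the test-phase reducer: sum correct/incorrect counts per class.
public class ResultReducer {
    // mapperOutput: one (classLabel, wasCorrect) pair per classified sample.
    public static Map<String, int[]> reduce(List<Map.Entry<String, Boolean>> mapperOutput) {
        Map<String, int[]> totals = new HashMap<>();
        for (Map.Entry<String, Boolean> e : mapperOutput) {
            // totals value: index 0 = correct count, index 1 = incorrect count
            int[] t = totals.computeIfAbsent(e.getKey(), k -> new int[2]);
            if (e.getValue()) t[0]++; else t[1]++;
        }
        return totals;
    }
}
```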
Classes Implemented
LDFMapper.java
LDFReducer.java
LDFDriver.java
LDFClassifier.java
LDFResultAnalyzer.java
Advantages from MapReduce
Computation of the n(n-1)/2 LDFs is automatically parallelized, which reduces the training time of the classifier
Replication of the input data is avoided by sorting it using MapReduce and storing only the line-offset information
◦ helps in reducing the storage requirement during training
Installation of Mahout
Download the tar files of both the apache-mahout and apache-maven projects
Unzip the tar files in a directory
Set the path variables for maven
Configure settings.xml in the conf directory of the maven folder to add the HTTP proxy and port
Set the present working directory to mahout's core folder
Compile the project with 'mvn compile'
Build the project with 'mvn install'
Troubleshooting during installation
By default the proxy address and port are not set
◦ the build fails due to unavailability of packages
Solution: set the proxy address and port before building the packages
Once the build is complete, running the examples is the same as running any Java program
Integration with the Mahout library
Make wrapper classes which encapsulate the working of the Map and Reduce functions
Make classes to set variables for the dataset
Make classes to model exceptions that can possibly occur
Put all files related to an algorithm in one folder and add it as a package in the library
Thanks