MapReduce for Machine Learning

by

Pranya Prabhakar, S4 MCA

05

CONTENTS

• Introduction
• Machine Learning
• MapReduce
• ML on MapReduce
• Apache Mahout and its installation steps
• Conclusion

Introduction

• Data is increasing rapidly

• It is necessary to process and analyze the data

• The data should be analyzed by a machine as a human being would analyze it. …Different

Machine Learning

Supervised Learning: Generates a function, based on assigned labels, that maps inputs to desired outputs.

Unsupervised Learning: Looks for patterns native to a dataset and models them, as in clustering (e.g. data mining & knowledge discovery).

Reinforcement Learning: Learns how to act given reward (or punishment) from the world.

http://nkonst.com/wp-content/uploads/2014/03/ml-eng.png

Types of problems

Classification: data is labeled, meaning each item is assigned a class
- Learn a model from manually classified data
- Predict the class of a new object based on its features and the learned model
e.g. spam/non-spam, fraud/non-fraud

Clustering: data is not labeled, but can be divided into groups based on similarity
- Group similar-looking objects
- Notion of similarity: a distance measure
e.g. organizing pictures by faces without names

Regression: data is labeled with a real value rather than a class
e.g. time series data like the price of a stock over time
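The idea of predicting a class from labeled examples with a distance measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two features (number of links, number of capitalised words), the sample values, and the function names `euclidean` and `nearest_label` are all hypothetical, and nearest-neighbour lookup is just one of many possible classifiers.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(labeled_points, query):
    """Classification: return the label of the labeled point closest to query."""
    _, label = min(labeled_points, key=lambda item: euclidean(item[0], query))
    return label


# Hypothetical, manually classified messages:
# features are (number of links, number of capitalised words).
training = [
    ((8.0, 9.0), "spam"),
    ((7.0, 8.0), "spam"),
    ((1.0, 0.0), "non-spam"),
    ((0.0, 2.0), "non-spam"),
]

# Predict the class of a new, unlabeled message from its features.
predicted = nearest_label(training, (6.0, 7.0))
```

The same distance function could serve clustering as well; the difference is that clustering has no labels to look up and must form the groups itself.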


Supervised Learning Algorithms

• Decision Trees
• k-Nearest Neighbours
• Naive Bayes
• Logistic Regression
• Perceptron and Multi-level Perceptrons
• Neural Networks
• SVM and Kernel estimation


Unsupervised Learning Algorithms

• Clustering
  ◦ k-Means, MinHash, Hierarchical Clustering
• Hidden Markov Models
• Feature Extraction methods
• Self-organizing Maps (Neural Nets)


Uses

• Spam filtering
• Credit card fraud detection
• Face recognition (computer vision)
• Speech understanding
• Medical diagnosis and so on…


Current state of ML libraries

• Lack scalability
• Lack documentation and examples
• Lack Apache licensing
• Are not well tested
• Are research oriented
• Not built over existing production-quality libraries
• Lack “deployability”


MapReduce

• It’s a programming framework
• Used for parallel processing over large data sets
• Application is divided into small fragments of work and distributed across the cluster
• Computation unit of Hadoop
• Two functions: Map() and Reduce()
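To make the two functions concrete, here is a minimal single-process sketch of a word-count job in Python. It only imitates the programming model: the names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are illustrative and are not Hadoop APIs, and on a real cluster the framework would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map() and Reduce()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce(): combine all values emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle and reduce sequentially over a list of input lines."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data needs big clusters", "data grows"])
```

Each call to `map_fn` depends only on its own line, and each call to `reduce_fn` only on the values for its own key, which is what allows the work to be split into small fragments.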


Apache Mahout

The starting place for MapReduce-based machine learning

A disparate collection of algorithms for

• Recommendation
• Clustering
• Classification
• Frequent itemset mining


Mahout installation

Prerequisites

• Java
• Hadoop
• Maven

Java installation

1. Install the Sun Java JDK package with sudo apt-get install (the exact package name depends on the release, e.g. sun-java6-jdk)
2. sudo gedit .bashrc, then set JAVA_HOME in the .bashrc file

Installation of Maven

1. sudo apt-get install maven2
2. Open .bashrc and add the lines

   ############## Apache-Maven #########
   export M2_HOME=/usr/local/apache-maven-3.0.4
   export M2=$M2_HOME/bin
   export PATH=$M2:$PATH
   export JAVA_HOME=$HOME/programs/jdk


Run mvn --version to verify that it is correctly installed.


Hadoop installation

A single-node Hadoop cluster is set up in the same way Java was installed.

Installation of Mahout

1. Download Mahout from http://www.apache.org/dyn/closer.cgi/lucene/mahout/
2. Create a folder and move the downloaded file into it, e.g. mkdir /usr/local/mahout
3. Run mvn install. [screenshot of the build output]


Example using the 20 Newsgroups dataset

Applications of Mahout

• Collaborative Filtering
  ◦ Matrix factorization based recommenders
  ◦ A user-based recommender
• Clustering
  ◦ Canopy Clustering
  ◦ K-Means Clustering
  ◦ Fuzzy K-Means
  ◦ Affinity Propagation Clustering
• Classification
  ◦ Naive Bayes
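K-Means clustering, listed above, is a common illustration of how a learning algorithm fits the MapReduce model: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster to produce the new centroid. The Python sketch below runs one such iteration in a single process. It is not Mahout's implementation; the function names and sample points are hypothetical, and a real job would repeat the iteration until the centroids stop moving.

```python
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Map(): emit (cluster index, (point, 1)) for one input point."""
    return closest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reduce(): average all points assigned to one cluster."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce pass that returns the updated centroids."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A cluster that received no points keeps its previous centroid.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
```

The map step needs only the current centroids and one point at a time, so the points can be spread across the cluster; only the small per-cluster sums have to be brought together in the reduce step.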

Conclusion

By using the MapReduce framework, a wide range of machine learning algorithms can be parallelized, and Apache Mahout provides a platform for machine learning in the MapReduce paradigm.
