MapReduce for Machine Learning
by
Pranya Prabhakar
S4 MCA
05
CONTENTS
• Introduction
• Machine Learning
• MapReduce
• ML on MapReduce
• Apache Mahout and its installation steps
• Conclusion
Introduction
• Data is increasing rapidly
• It is necessary to process and analyze the data
• The data should be analyzed by a machine the way a human being would. …Different
Machine Learning
• Supervised Learning: Generates a function, based upon assigned labels, that maps inputs to desired outputs.
• Unsupervised Learning: Looks for patterns native to a dataset and models them, as in clustering (e.g. data mining & knowledge discovery).
• Reinforcement Learning: Learns how to act given reward (or punishment) from the world.
Types of problems
Classification:
- Data is labeled, meaning each item is assigned a class
- Learn a model from manually classified data
- Predict the class of a new object based on its features and the learned model
- e.g. spam/non-spam, fraud/non-fraud
Clustering:
- Data is not labeled, but can be divided into groups based on similarity
- Group similar-looking objects
- Notion of similarity: a distance measure
- e.g. organizing pictures by faces without names
Regression:
- Data is labeled with a real value rather than a class
- e.g. time series data such as the price of a stock over time
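The classification and distance-measure points above can be sketched in a few lines. This is only a minimal illustration, assuming a 1-nearest-neighbour rule with Euclidean distance; the feature meanings and toy data are hypothetical and not taken from the slides.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(training, features):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(training, key=lambda ex: euclidean(ex[0], features))
    return nearest_label


# Hypothetical toy data: (number of links, number of exclamation marks) -> label
training = [
    ((8.0, 6.0), "spam"),
    ((7.0, 9.0), "spam"),
    ((1.0, 0.0), "non-spam"),
    ((0.0, 1.0), "non-spam"),
]

# Predict the class of a new, unlabeled message from its features.
label = predict_1nn(training, (6.0, 7.0))
print(label)
```

Here the "learned model" is simply the stored labeled examples; the same distance function is the kind of similarity notion a clustering algorithm would rely on.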
Supervised Learning Algorithms
• Decision Trees
• k-Nearest Neighbours
• Naive Bayes
• Logistic Regression
• Perceptron and Multi-layer Perceptrons
• Neural Networks
• SVM and Kernel estimation
Unsupervised Learning Algorithms
• Clustering
 ◦ k-Means, MinHash, Hierarchical Clustering
• Hidden Markov Models
• Feature Extraction methods
• Self-organizing Maps (Neural Nets)
Uses
• Spam filtering
• Credit card fraud detection
• Face recognition (computer vision)
• Speech understanding
• Medical diagnosis and so on…
Current state of ML libraries
• Lack scalability
• Lack documentation and examples
• Lack Apache licensing
• Are not well tested
• Are research oriented
• Not built over existing production quality libraries
• Lack “Deployability”
MapReduce
• It's a programming framework
• Used for parallel processing over large data sets
• Application is divided into small fragments of work that are distributed across the cluster
• Computation unit of Hadoop
• Two functions: Map() and Reduce()
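The two functions can be illustrated with a word count, the usual introductory example. The sketch below is a single-process simulation in plain Python, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the grouping step stands in for the shuffle that a real cluster performs between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key into a single result."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["map reduce map", "reduce the data"]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

Because each call to `map_fn` looks at one record only and each call to `reduce_fn` looks at one key only, the calls could in principle run on different machines, which is what makes the model suitable for a cluster.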
Apache Mahout
• The starting place for MapReduce-based machine learning
• A disparate collection of algorithms for
 ◦ Recommendation
 ◦ Clustering
 ◦ Classification
 ◦ Frequent itemset mining
Mahout installation
Prerequisites
• Java
• Hadoop
• Maven
Java installation
1. sudo apt-get install sun java jdk
2. sudo gedit .bashrc
   Set JAVA_HOME in the .bashrc file
Installation of Maven
1. sudo apt-get install maven2
2. Open .bashrc and add the lines
   ############## Apache-Maven #########
   export M2_HOME=/usr/local/apache-maven-3.0.4
   export M2=$M2_HOME/bin
   export PATH=$M2:$PATH
   export JAVA_HOME=$HOME/programs/jdk
Run mvn --version to verify that it is correctly installed.
Hadoop installation
A single-node Hadoop cluster is set up in the same manner as the Java installation.
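Before moving on to Mahout itself, it can help to confirm that the three prerequisites are reachable from the shell. The following is a small sketch, assuming the executables are named `java`, `mvn` and `hadoop` on this system; the helper name `check_prerequisites` is hypothetical.

```python
import shutil


def check_prerequisites(tools=("java", "mvn", "hadoop")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path if path else 'NOT FOUND on PATH'}")
```

A tool reported as not found usually means the corresponding export lines in .bashrc have not been picked up yet by the current shell.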
Installation of Mahout
1. Download Mahout from http://www.apache.org/dyn/closer.cgi/lucene/mahout/
2. Create a folder and move the downloaded file to the created directory, say, mkdir usr/local/mahout
3. Run mvn install
[Screenshot: output of mvn install]
Example using the 20 Newsgroups dataset
Application of Mahout
Collaborative Filtering
• Matrix factorization based recommenders
• A user based Recommender
Clustering
• Canopy Clustering
• K-Means Clustering
• Fuzzy K-Means
• Affinity Propagation Clustering
Classification
• Naive Bayes
Conclusion
By using the MapReduce framework, we can parallelize a wide range of machine learning algorithms, and Apache Mahout provides a platform for machine learning in the MapReduce paradigm.
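As one illustration of the parallelization mentioned above, a single k-means iteration can be phrased as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each centroid). This is a plain-Python sketch under those assumptions, not Mahout's implementation; all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Map: emit (centroid index, (point, 1)) for one input point."""
    return nearest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reduce: average all points that were assigned to one centroid."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, n in values:
        for d in range(dims):
            total[d] += point[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by centroid index, then reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

The map calls are independent of one another, so the points can be split across machines; only the small per-centroid sums need to be brought together in the reduce step, and the iteration is repeated until the centroids stop moving.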