Intro to Apache Mahout – Grant Ingersoll, Lucid Imagination – http://www.lucidimagination.com


TRANSCRIPT

Intro to Apache Mahout

Grant Ingersoll

Lucid Imagination

http://www.lucidimagination.com

Anyone Here Use Machine Learning?

• Any users of:
  – Google
    • Search
    • Priority Inbox
  – Facebook
  – Twitter
  – LinkedIn

Topics

• Background and Use Cases
• What can you do in Mahout?
• Where's the community at?
• Resources
• K-Means in Hadoop (time permitting)

Definition

• "Machine Learning is programming computers to optimize a performance criterion using example data or past experience"
  – Intro to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
• Lots of related fields:
  – Information Retrieval
  – Stats
  – Biology
  – Linear algebra
  – Many more

Common Use Cases

• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content
• Find associations/patterns in actions/behaviors
• Identify key topics/summarize text
  – Documents and Corpora
• Detect anomalies/fraud
• Ranking search results
• Others

Apache Mahout

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• Why Mahout?
  – Many Open Source ML libraries either:
    • Lack community
    • Lack documentation and examples
    • Lack scalability
    • Lack the Apache License
    • Or are research-oriented

Definition: http://dictionary.reference.com/browse/mahout

What does scalable mean to us?

• Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  – Some algorithms won't scale to massive machine clusters
  – Others fit logically on a MapReduce framework like Apache Hadoop
  – Still others will need different distributed programming models
  – Others are already fast (SGD)
• Be pragmatic

Sampling of Who uses Mahout

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

What Can I do with Mahout Right Now?

3C + FPM + O = Mahout (Collaborative filtering, Clustering, Categorization + Frequent Pattern Mining + Other)

Collaborative Filtering

• Extensive framework for collaborative filtering (recommenders)
• Recommenders:
  – User based
  – Item based
• Online and Offline support
  – Offline can utilize Hadoop
• Many different similarity measures
  – Cosine, LLR, Tanimoto, Pearson, others
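
As a concrete sketch of the user-based path: a minimal example assuming a hypothetical ratings.csv file in Taste's userID,itemID,preference format; the classes are from Mahout's org.apache.mahout.cf.taste packages.

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedRecommenderSketch {
      public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical comma-separated file: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);  // pluggable: Tanimoto, LLR, ...
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 recommendations for user 42
        for (RecommendedItem item : recommender.recommend(42L, 5)) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

The item-based variant swaps in GenericItemBasedRecommender with an ItemSimilarity; the offline, Hadoop-based recommender jobs live under org.apache.mahout.cf.taste.hadoop.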

Clustering

• Document level
  – Group documents based on a notion of similarity
  – K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
  – All MapReduce
  – Distance Measures
    • Manhattan, Euclidean, other (see the sketch after this slide)
• Topic Modeling
  – Cluster words across documents to identify topics
  – Latent Dirichlet Allocation (MR)
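
The distance measure is the pluggable piece shared by the clustering algorithms above; here is a minimal sketch of using two of Mahout's measures directly (toy values, no Hadoop involved):

    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DistanceMeasureSketch {
      public static void main(String[] args) {
        // Two toy document vectors (e.g., term weights over a 3-term vocabulary)
        Vector a = new DenseVector(new double[] {1.0, 0.0, 2.0});
        Vector b = new DenseVector(new double[] {0.0, 1.0, 2.0});

        // The same DistanceMeasure implementations are what K-Means, Canopy, etc. take as a parameter
        System.out.println("Euclidean: " + new EuclideanDistanceMeasure().distance(a, b));
        System.out.println("Cosine:    " + new CosineDistanceMeasure().distance(a, b));
      }
    }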

Categorization

• Place new items into predefined categories:
  – Sports, politics, entertainment
  – Recommenders
• Implementations:
  – Naïve Bayes (MR)
  – Complementary Naïve Bayes (MR)
  – Decision Forests (MR)
  – Logistic Regression via SGD (sequential, but fast)
• See Chapter 17 of Mahout in Action for the Shop It To Me use case
  – http://awe.sm/5FyNe

Freq Pattern Mining

• Identify frequently co-occurring items
• Useful for:
  – Query Recommendations
    • Apple -> iPhone, orange, OS X
  – Related product placement
    • Basket Analysis
• MapReduce

http://www.amazon.com

Other

• Primitive Collections
• Collocations (MR)
• Math library (see the sketch after this slide)
  – Vectors, Matrices, etc.
• Noise Reduction via Singular Value Decomposition (MR)
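
A small sketch of the math library with a sparse vector and a dense matrix; the classes come from the mahout-math module and the numbers are made up for illustration:

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class MathLibrarySketch {
      public static void main(String[] args) {
        // A sparse vector over a 10,000-dimensional space with only two non-zero entries
        Vector sparse = new RandomAccessSparseVector(10000);
        sparse.set(42, 1.5);
        sparse.set(7000, 0.5);
        System.out.println("L2 norm: " + sparse.norm(2));

        // A small dense matrix and a dot product against one of its rows
        Matrix m = new DenseMatrix(new double[][] {{1, 2}, {3, 4}});
        Vector ones = new DenseVector(new double[] {1, 1});
        System.out.println("row 0 dot (1,1): " + m.viewRow(0).dot(ones));
      }
    }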

Prepare Data from Raw content

• Data Sources:
  – Lucene integration
    • bin/mahout lucene.vector …
  – Document Vectorizer
    • bin/mahout seqdirectory …
    • bin/mahout seq2sparse …
  – Programmatically (see the sketch after this slide)
    • See the Utils module in Mahout and the Iterator<Vector> classes
  – Database
  – File system
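
For the programmatic route, the clustering jobs consume a SequenceFile of Mahout VectorWritables. Here is a minimal sketch that writes a few toy points, assuming Hadoop's older SequenceFile.Writer constructor and a hypothetical output path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteVectorsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("input/points/part-00000");  // hypothetical input dir for a clustering job

        SequenceFile.Writer writer =
            new SequenceFile.Writer(fs, conf, output, Text.class, VectorWritable.class);
        try {
          double[][] points = {{1, 1}, {2, 1}, {8, 8}, {9, 8}};
          for (int i = 0; i < points.length; i++) {
            // Each record is <document/point id, Mahout vector>
            writer.append(new Text("point-" + i), new VectorWritable(new DenseVector(points[i])));
          }
        } finally {
          writer.close();
        }
      }
    }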

How to: Command Line

• Most algorithms have a Driver program
  – $MAHOUT_HOME/bin/mahout.sh helps with most tasks
• Prepare the data
  – Different algorithms require different setup
• Run the algorithm
  – Single Node
  – Hadoop
• Print out the results or incorporate into application
  – Several helper classes:
    • LDAPrintTopics, ClusterDumper, etc.
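
Results can also be read back programmatically instead of (or in addition to) using the helper classes. A rough sketch that scans a SequenceFile with plain Hadoop, here the Text/VectorWritable file from the earlier data-prep sketch; the key/value classes of a real Mahout job's output vary by algorithm, which is what ClusterDumper and friends handle for you:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    public class ReadVectorsSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("input/points/part-00000");  // hypothetical path from the earlier sketch

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
        try {
          Text key = new Text();
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {               // iterate <key, vector> records
            System.out.println(key + " => " + value.get());
          }
        } finally {
          reader.close();
        }
      }
    }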

What's Happening Now

• Unified Framework for Clustering and Classification
• 0.5 release on the horizon (May)
• Working towards 1.0 release by focusing on:
  – Tests, examples, documentation
  – API cleanup and consistency
• Gearing up for Google Summer of Code
  – New MR work for Hidden Markov Models

Summary

• Machine learning is all over the web today
• Mahout is about scalable machine learning
• Mahout has functionality for many of today's common machine learning tasks
• Many Mahout implementations use Hadoop

Resources

• http://mahout.apache.org
• http://cwiki.apache.org/MAHOUT
• user|dev@mahout.apache.org
• http://svn.apache.org/repos/asf/mahout/trunk
• http://hadoop.apache.org

Resources

• "Mahout in Action" – Owen, Anil, Dunning, and Friedman
  – http://awe.sm/5FyNe
• "Introducing Apache Mahout"
  – http://www.ibm.com/developerworks/java/library/j-mahout/
• "Taming Text" by Ingersoll, Morton, Farris
• "Programming Collective Intelligence" by Toby Segaran
• "Data Mining: Practical Machine Learning Tools and Techniques" by Ian H. Witten and Eibe Frank
• "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer

K-Means

• Clustering algorithm
  – Nicely parallelizable

http://en.wikipedia.org/wiki/K-means_clustering
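
Before the MapReduce decomposition on the next slides, here is a plain single-node sketch of the K-Means step itself (assign each point to its nearest centroid, then recompute each centroid as the mean of its points); no Mahout classes are involved, and the data and initial centroids are made up:

    import java.util.Arrays;

    public class KMeansStepSketch {

      /** One K-Means iteration: assign points to their nearest centroid, then recompute centroids as means. */
      static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        for (double[] p : points) {                       // assignment step
          int nearest = 0;
          double best = Double.MAX_VALUE;
          for (int c = 0; c < k; c++) {
            double d = 0;
            for (int j = 0; j < dim; j++) {
              d += (p[j] - centroids[c][j]) * (p[j] - centroids[c][j]);
            }
            if (d < best) { best = d; nearest = c; }
          }
          counts[nearest]++;
          for (int j = 0; j < dim; j++) sums[nearest][j] += p[j];
        }

        for (int c = 0; c < k; c++) {                     // update step: mean of the assigned points
          if (counts[c] == 0) { sums[c] = centroids[c].clone(); continue; }  // empty cluster keeps its centroid
          for (int j = 0; j < dim; j++) sums[c][j] /= counts[c];
        }
        return sums;
      }

      public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {8, 8}, {9, 8}};  // toy data
        double[][] centroids = {{0, 0}, {10, 10}};             // initial guesses (Canopy can supply these)
        for (int iter = 0; iter < 10; iter++) centroids = step(points, centroids);
        System.out.println(Arrays.deepToString(centroids));    // ~[[1.5, 1.0], [8.5, 8.0]]
      }
    }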

K-Means in Map-Reduce

• Input:
  – Mahout Vectors representing the original content
  – Either:
    • A predefined set of initial centroids (can be from Canopy)
    • --k – the number of clusters to produce
• Iterate:
  – Do the centroid calculation (more in a moment)
• Clustering step (optional)
• Output:
  – Centroids (as Mahout Vectors)
  – Points for each centroid (if the clustering step was taken)

Map-Reduce Iteration

• Each iteration calculates the centroids using:
  – KMeansMapper
  – KMeansCombiner
  – KMeansReducer
• Clustering step:
  – Calculate the points for each centroid using:
    • KMeansClusterMapper

KMeansMapper

• During setup:
  – Load the initial centroids (or the centroids from the last iteration)
• Map phase:
  – For each input:
    • Calculate its distance from each centroid and output the closest one
  – Distance measures are pluggable:
    • Manhattan, Euclidean, Squared Euclidean, Cosine, others
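
This is not Mahout's actual KMeansMapper, just a simplified sketch of the same idea with Hadoop's mapreduce API: load the current centroids in setup(), then key each point by its nearest centroid. The sketch.centroids config string is a made-up stand-in for reading the previous iteration's cluster files from HDFS:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class SimpleKMeansMapper extends Mapper<Text, VectorWritable, IntWritable, VectorWritable> {

      private final List<Vector> centroids = new ArrayList<Vector>();
      private final DistanceMeasure measure = new EuclideanDistanceMeasure();  // pluggable

      @Override
      protected void setup(Context context) {
        // Sketch only: centroids arrive as a config string "x,y;x,y;...";
        // the real job loads the previous iteration's clusters from HDFS.
        for (String c : context.getConfiguration().get("sketch.centroids").split(";")) {
          String[] parts = c.split(",");
          double[] coords = new double[parts.length];
          for (int i = 0; i < parts.length; i++) coords[i] = Double.parseDouble(parts[i]);
          centroids.add(new DenseVector(coords));
        }
      }

      @Override
      protected void map(Text key, VectorWritable value, Context context)
          throws IOException, InterruptedException {
        Vector point = value.get();
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {      // find the closest centroid
          double d = measure.distance(centroids.get(i), point);
          if (d < best) { best = d; nearest = i; }
        }
        context.write(new IntWritable(nearest), value);   // emit <closest centroid id, point>
      }
    }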

KMeansReducer

• Setup:
  – Load up clusters
  – Convergence information
  – Partial sums from KMeansCombiner (more in a moment)
• Reduce phase:
  – Sum all the vectors in the cluster to produce a new centroid
  – Check for convergence
• Output cluster
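
Again a simplified sketch rather than Mahout's actual KMeansReducer (and ignoring the combiner's partial sums): average the points that arrive for each cluster id to form the next centroid; the convergence check is only noted in a comment:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class SimpleKMeansReducer extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      @Override
      protected void reduce(IntWritable clusterId, Iterable<VectorWritable> points, Context context)
          throws IOException, InterruptedException {
        Vector sum = null;
        int count = 0;
        for (VectorWritable vw : points) {                // sum the member points
          sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
          count++;
        }
        Vector newCentroid = sum.divide(count);           // new centroid = mean of its points
        // A real reducer would also compare newCentroid to the previous centroid and
        // mark the cluster converged once the distance falls below a delta threshold.
        context.write(clusterId, new VectorWritable(newCentroid));
      }
    }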

KMeansCombiner

• Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

• Some applications only care about what the centroids are, so this step is optional
• Setup:
  – Load up the clusters and the DistanceMeasure used
• Map phase:
  – Calculate which cluster the point belongs to
  – Output <ClusterId, Vector>


Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Clustering

bull Document level

ndash Group documents based on a notion of similarity

ndash K-Means Fuzzy K-Means Dirichlet Canopy Mean-Shift EigenCuts (Spectral)

ndash All MapReduce

ndash Distance Measuresbull Manhattan Euclidean

other

bull Topic Modeling

ndash Cluster words across documents to identify topics

ndash Latent DirichletAllocation (MR)

Categorizationbull Place new items into

predefined categoriesndash Sports politics

entertainment

ndash Recommenders

bull Implementationsndash Naiumlve Bayes (MR)

ndash Compl Naiumlve Bayes (MR)

ndash Decision Forests (MR)

ndash Linear Regression (Seq but Fast)

bullSee Chapter 17 of Mahout in Action for Shop It To Me use case

bullhttpawesm5FyNe

Freq Pattern Mining

bull Identify frequently co-occurrent items

bull Useful forndash Query

Recommendationsbull Apple -gt iPhone orange

OS X

ndash Related product placementbull Basket Analysis

bull MapReduce

httpwwwamazoncom

Other

bull Primitive Collections

bull Collocations (MR)

bull Math libraryndash Vectors Matrices etc

bull Noise Reduction via Singular Value Decomp (MR)

Prepare Data from Raw content

bull Data Sourcesndash Lucene integration

bull binmahout lucenevector hellip

ndash Document Vectorizerbull binmahout seqdirectory hellip

bull binmahout seq2sparse hellip

ndash Programmaticallybull See the Utils module in Mahout and the IteratorltVectorgt

classes

ndash Database

ndash File system

How to Command Line

bull Most algorithms have a Driver programndash $MAHOUT_HOMEbinmahoutsh helps with most tasks

bull Prepare the Datandash Different algorithms require different setup

bull Run the algorithmndash Single Nodendash Hadoop

bull Print out the results or incorporate into applicationndash Several helper classes

bull LDAPrintTopics ClusterDumper etc

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Categorizationbull Place new items into

predefined categoriesndash Sports politics

entertainment

ndash Recommenders

bull Implementationsndash Naiumlve Bayes (MR)

ndash Compl Naiumlve Bayes (MR)

ndash Decision Forests (MR)

ndash Linear Regression (Seq but Fast)

bullSee Chapter 17 of Mahout in Action for Shop It To Me use case

bullhttpawesm5FyNe

Freq Pattern Mining

bull Identify frequently co-occurrent items

bull Useful forndash Query

Recommendationsbull Apple -gt iPhone orange

OS X

ndash Related product placementbull Basket Analysis

bull MapReduce

httpwwwamazoncom

Other

bull Primitive Collections

bull Collocations (MR)

bull Math libraryndash Vectors Matrices etc

bull Noise Reduction via Singular Value Decomp (MR)

Prepare Data from Raw content

bull Data Sourcesndash Lucene integration

bull binmahout lucenevector hellip

ndash Document Vectorizerbull binmahout seqdirectory hellip

bull binmahout seq2sparse hellip

ndash Programmaticallybull See the Utils module in Mahout and the IteratorltVectorgt

classes

ndash Database

ndash File system

How to Command Line

bull Most algorithms have a Driver programndash $MAHOUT_HOMEbinmahoutsh helps with most tasks

bull Prepare the Datandash Different algorithms require different setup

bull Run the algorithmndash Single Nodendash Hadoop

bull Print out the results or incorporate into applicationndash Several helper classes

bull LDAPrintTopics ClusterDumper etc

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Freq Pattern Mining

bull Identify frequently co-occurrent items

bull Useful forndash Query

Recommendationsbull Apple -gt iPhone orange

OS X

ndash Related product placementbull Basket Analysis

bull MapReduce

httpwwwamazoncom

Other

bull Primitive Collections

bull Collocations (MR)

bull Math libraryndash Vectors Matrices etc

bull Noise Reduction via Singular Value Decomp (MR)

Prepare Data from Raw content

bull Data Sourcesndash Lucene integration

bull binmahout lucenevector hellip

ndash Document Vectorizerbull binmahout seqdirectory hellip

bull binmahout seq2sparse hellip

ndash Programmaticallybull See the Utils module in Mahout and the IteratorltVectorgt

classes

ndash Database

ndash File system

How to Command Line

bull Most algorithms have a Driver programndash $MAHOUT_HOMEbinmahoutsh helps with most tasks

bull Prepare the Datandash Different algorithms require different setup

bull Run the algorithmndash Single Nodendash Hadoop

bull Print out the results or incorporate into applicationndash Several helper classes

bull LDAPrintTopics ClusterDumper etc

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Other

bull Primitive Collections

bull Collocations (MR)

bull Math libraryndash Vectors Matrices etc

bull Noise Reduction via Singular Value Decomp (MR)

Prepare Data from Raw content

bull Data Sourcesndash Lucene integration

bull binmahout lucenevector hellip

ndash Document Vectorizerbull binmahout seqdirectory hellip

bull binmahout seq2sparse hellip

ndash Programmaticallybull See the Utils module in Mahout and the IteratorltVectorgt

classes

ndash Database

ndash File system

How to Command Line

bull Most algorithms have a Driver programndash $MAHOUT_HOMEbinmahoutsh helps with most tasks

bull Prepare the Datandash Different algorithms require different setup

bull Run the algorithmndash Single Nodendash Hadoop

bull Print out the results or incorporate into applicationndash Several helper classes

bull LDAPrintTopics ClusterDumper etc

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Prepare Data from Raw content

bull Data Sourcesndash Lucene integration

bull binmahout lucenevector hellip

ndash Document Vectorizerbull binmahout seqdirectory hellip

bull binmahout seq2sparse hellip

ndash Programmaticallybull See the Utils module in Mahout and the IteratorltVectorgt

classes

ndash Database

ndash File system

How to Command Line

bull Most algorithms have a Driver programndash $MAHOUT_HOMEbinmahoutsh helps with most tasks

bull Prepare the Datandash Different algorithms require different setup

bull Run the algorithmndash Single Nodendash Hadoop

bull Print out the results or incorporate into applicationndash Several helper classes

bull LDAPrintTopics ClusterDumper etc

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

How to Command Line

bull Most algorithms have a Driver programndash $MAHOUT_HOMEbinmahoutsh helps with most tasks

bull Prepare the Datandash Different algorithms require different setup

bull Run the algorithmndash Single Nodendash Hadoop

bull Print out the results or incorporate into applicationndash Several helper classes

bull LDAPrintTopics ClusterDumper etc

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Whatrsquos Happening Now

bull Unified Framework for Clustering and Classification

bull 05 release on the horizon (May)

bull Working towards 10 release by focusing on

ndash Tests examples documentation

ndash API cleanup and consistency

bull Gearing up for Google Summer of Code

ndash New MR work for Hidden Markov Models

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Summary

bull Machine learning is all over the web today

bull Mahout is about scalable machine learning

bull Mahout has functionality for many of todayrsquos common machine learning tasks

bull Many Mahout implementations use Hadoop

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Resources

bull httpmahoutapacheorg

bull httpcwikiapacheorgMAHOUT

bull user|devmahoutapacheorg

bull httpsvnapacheorgreposasfmahouttrunk

bull httphadoopapacheorg

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Resources

bull ldquoMahout in Actionrdquo ndash Owen Anil Dunning and Friedman

ndash httpawesm5FyNe

bull ldquoIntroducing Apache Mahoutrdquo ndash httpwwwibmcomdeveloperworksjavalibraryj-mahout

bull ldquoTaming Textrdquo by Ingersoll Morton Farris

bull ldquoProgramming Collective Intelligencerdquo by Toby Segaran

bull ldquoData Mining - Practical Machine Learning Tools and Techniquesrdquo by Ian H Witten and Eibe Frank

bull ldquoData-Intensive Text Processing with MapReducerdquo by Jimmy Lin and Chris Dyer

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

K-Means

bull Clustering Algorithm

ndash Nicely parallelizable

httpenwikipediaorgwikiK-means_clustering

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

K-Means in Map-Reduce

bull Inputndash Mahout Vectors representing the original content

ndash Eitherbull A predefined set of initial centroids (Can be from Canopy)

bull --k ndash The number of clusters to produce

bull Iteratendash Do the centroid calculation (more in a moment)

bull Clustering Step (optional)

bull Outputndash Centroids (as Mahout Vectors)

ndash Points for each Centroid (if Clustering Step was taken)

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

Map-Reduce Iteration

bull Each Iteration calculates the Centroids using

ndash KMeansMapper

ndash KMeansCombiner

ndash KMeansReducer

bull Clustering Step

ndash Calculate the points for each Centroid using

ndash KMeansClusterMapper

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

KMeansMapper

bull During Setupndash Load the initial Centroids (or the

Centroids from the last iteration)

bull Map Phasendash For each input

bull Calculate itrsquos distance from each Centroid and output the closest one

bull Distance Measures are pluggablendash Manhattan Euclidean Squared

Euclidean Cosine others

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

KMeansReducer

bull Setupndash Load up clustersndash Convergence informationndash Partial sums from

KMeansCombiner (more in a moment)

bull Reduce Phasendash Sum all the vectors in the

cluster to produce a new Centroid

ndash Check for Convergence

bull Output cluster

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

KMeansCombiner

bull Just like KMeansReducer but only produces partial sum of the cluster based on the data local to the Mapper

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt

KMeansClusterMapper

bull Some applications only care about what the Centroids are so this step is optional

bull Setupndash Load up the clusters and the DistanceMeasure

used

bull Map Phasendash Calculate which Cluster the point belongs to

ndash Output ltClusterId Vectorgt