apache mahout. mahout introduction machine learning clustering k-means canopy clustering fuzzy...

Apache Mahout

• Mahout Introduction
• Machine Learning
• Clustering
• K-means
• Canopy Clustering
• Fuzzy K-Means
• Conclusion

What is Mahout?

• Distributed machine learning libraries
– “scalable to reasonably large data sets”
– Runs on Hadoop

What?

• Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault-tolerance
• Mahout brings:
– Library of machine learning algorithms
– Examples

Why Mahout?

• Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented

Clustering

• Unsupervised
• Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses

Types

• Supervised
– Using labeled training data, create a function that predicts the output of unseen inputs
• Unsupervised
– Using unlabeled data, create a function that predicts output
• Semi-Supervised
– Uses labeled and unlabeled data

Example: Clustering

Google News

K-means Algorithm

1) Pick a number (k) of cluster centers
2) Assign every element to its nearest cluster center
3) Move each cluster center to the mean of its assigned elements
4) Repeat 2-3 until convergence

Figure 1: K-means algorithm. Training examples are shown as dots, and cluster centroids are shown as crosses.
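The four steps above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not Mahout's MapReduce implementation; the function name `kmeans`, the choice of Euclidean distance, the random choice of initial centers, and the "assignments stopped changing" convergence test are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: pick k cluster centers (assumption: k distinct input points).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every element to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned elements.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

The result depends on the initial centers, which is one reason a cheap pre-clustering step is sometimes used to choose them.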

K-means Example

K-means Example

Invocation using the command line takes the form:

Canopy Clustering

• Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters.

Define two thresholds
    Tight: T1
    Loose: T2
Put all records into a set S
While S is not empty
    Remove any record r from S and create a canopy centered at r
    For each other record ri, compute cheap distance d from r to ri
        If d < T2, place ri in r’s canopy
        If d < T1, remove ri from S
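The pseudocode above translates almost line for line into Python. The sketch below keeps the naming used here (T1 is the tight threshold, T2 the loose one, so T1 < T2); note that some references name the two thresholds the other way round. The function name `canopy`, the use of Euclidean distance as the "cheap" distance, and the decision to compare r only against records still in S are assumptions made for illustration.

```python
import math


def canopy(points, t1, t2, distance=math.dist):
    """Canopy clustering sketch; t1 is the tight threshold, t2 the loose one."""
    if not t1 < t2:
        raise ValueError("expected tight threshold t1 < loose threshold t2")
    remaining = list(points)  # the set S
    canopies = []
    while remaining:
        # Remove any record r from S and create a canopy centered at r.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t2:
                members.append(p)  # within the loose threshold: joins the canopy
            if d >= t1:
                still_remaining.append(p)  # only records within t1 leave S
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
    for center, members in canopy(data, t1=2, t2=5):
        print(center, members)
```

A record that falls between the two thresholds joins the canopy but stays in S, so it can later become a center itself or join other canopies; canopies may therefore overlap.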

Canopy Clustering

SequenceFile (WritableComparable, VectorWritable)

Invocation using the command line takes the form:

Fuzzy K-Means

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique.

Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space for which a distance measure is defined. The algorithm is similar to k-means.

Initialize k clusters
Until converged
    Compute the probability of a point belonging to a cluster, for every (point, cluster) pair
    Re-compute the cluster centers using the above probability membership values of points to clusters
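The loop above can be sketched in plain Python as follows. The source does not give the membership formula, so this example assumes the standard Fuzzy C-Means update with a fuzziness parameter m (here m = 2), Euclidean distance, and a convergence test based on how far the centers move. The names `fuzzy_kmeans`, `init`, `m`, and `tol` are hypothetical and not taken from Mahout.

```python
import math
import random


def fuzzy_kmeans(points, k, m=2.0, init=None, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy k-means sketch using the standard fuzzy c-means updates (assumed)."""
    rng = random.Random(seed)
    # Initialize k clusters (assumption: given centers, or k random input points).
    centers = list(init) if init is not None else rng.sample(points, k)
    exponent = 2.0 / (m - 1.0)
    dims = len(points[0])
    memberships = []
    for _ in range(max_iter):
        # Membership of every point in every cluster; each row sums to 1.
        memberships = []
        for p in points:
            # Clamp distances to avoid dividing by zero when p sits on a center.
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append([
                1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                for d_i in dists
            ])
        # Re-compute each center as a mean weighted by membership ** m.
        new_centers = []
        for c in range(k):
            weights = [row[c] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # converged: the centers barely moved
            break
    # Memberships are those computed just before the final center update.
    return centers, memberships
```

Unlike k-means, every point contributes to every center; points far from a center simply contribute with a small weight.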

Fuzzy K-Means

Invocation using the command line takes the form:

Conclusion

• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable
• For performance and efficiency, it is better to
– Understand the data set
– Understand data mining
– Understand the methodology

Thank you!
