Post on 19-Dec-2015


Introducing Apache Mahout

Scalable Machine Learning for All!

Grant Ingersoll

Lucid Imagination

Overview

• What is Machine Learning?

• Mahout

Definition

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” – Intro. to Machine Learning by E. Alpaydin

• Subset of Artificial Intelligence

– Many other fields: comp sci., biology, math, psychology, etc.

Types

• Supervised

– Using labeled training data, create function that predicts output of unseen inputs

• Unsupervised

– Using unlabeled data, create function that predicts output

• Semi-Supervised

– Uses labeled and unlabeled data
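To make the supervised case concrete, here is a minimal sketch of a one-nearest-neighbor classifier: it takes labeled training points and predicts the label of an unseen input. The function name, the toy data, and the choice of nearest neighbor as the learner are illustrative assumptions, not something from the talk or from Mahout.

```python
import math

def nearest_neighbor_predict(training, query):
    """Return the label of the labeled training point closest to `query`."""
    best_label, best_dist = None, math.inf
    for features, label in training:
        dist = math.dist(features, query)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical labeled training data: (features, label)
labeled = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"), ((9.0, 9.5), "large"),
]

print(nearest_neighbor_predict(labeled, (2.0, 1.0)))
print(nearest_neighbor_predict(labeled, (7.5, 9.0)))
```

The "function" learned here is simply the stored training set plus a distance rule; other supervised learners compress the training data into a model instead.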

Characterizations

• Lots of Data

• Identifiable Features in that Data

• Too big/costly for people to handle

– People still can help

Clustering

• Unsupervised

• Find Natural Groupings

– Documents

– Search Results

– People

– Genetic traits in groups

– Many, many more uses
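As one illustration of finding natural groupings without labels, the sketch below runs a basic k-means loop (alternating assignment and centroid update) on a handful of 2-D points. It is a single-machine toy written for this transcript, assuming fixed starting centroids and a fixed iteration count; it is not Mahout's implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if the cluster is empty
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```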

Example: Clustering

Google News

Collaborative Filtering

• Unsupervised

• Recommend people and products

– User-User: User likes X, you might too

– Item-Item: People who bought X also bought Y
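The item-item idea ("people who bought X also bought Y") can be sketched by counting how often two items appear in the same purchase. The basket data and function names below are made up for illustration, and plain co-occurrence counting is only one simple way to score item similarity.

```python
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    counts = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts

def also_bought(counts, item, n=2):
    """Return up to n items most often bought together with `item`."""
    return [other for other, _ in counts[item].most_common(n)]

# Hypothetical purchase history, one list per order
purchases = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "lamp"],
    ["lamp", "desk"],
]

counts = cooccurrence(purchases)
print(also_bought(counts, "book"))
```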

Example: Collab Filtering

Amazon.com

Classification/Categorization

• Many, many types

• Spam Filtering

• Named Entity Recognition

• Phrase Identification

• Sentiment Analysis

• Classification into a Taxonomy
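For a feel of how a text classifier such as a spam filter can work, here is a small multinomial Naive Bayes sketch with add-one (Laplace) smoothing. The training messages, labels, and function names are invented for this example; real filters use far more data and better tokenization.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns per-label doc counts, word counts, vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, text):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the probability
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled messages
model = train([
    ("win cash now", "spam"),
    ("cheap cash prize", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])

print(classify(model, "cash prize now"))
print(classify(model, "meeting tomorrow"))
```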

Example: NER

NER?

Excerpt from Yahoo News

Example: Categorization

Info. Retrieval

• Learning Ranking Functions

• Learning Spelling Corrections

• User Click Analysis and Tracking

Other

• Image Analysis

• Robotics

• Games

• Higher level natural language processing

• Many, many others

What is Apache Mahout?

• A Mahout is an elephant trainer/driver/keeper, hence…

+Machine Learning

=

(and other distributed techniques)

What?

• Hadoop brings:

– Map/Reduce API

– HDFS

– In other words, scalability and fault-tolerance

• Mahout brings:

– Library of machine learning algorithms

– Examples
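The Map/Reduce programming model can be sketched in a few lines: a map step emits key/value pairs, a shuffle groups values by key, and a reduce step combines each group. The code below simulates that flow in a single process for a word count; it does not use the Hadoop API, and all names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())

print(run_job(["big data big", "data"]))
```

On a real cluster the map and reduce calls run on different machines over different input splits, which is where the scalability and fault tolerance come from.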

Why Mahout?

• Many Open Source ML libraries either:

– Lack Community

– Lack Documentation and Examples

– Lack Scalability

– Lack the Apache License ;-)

– Or are research-oriented

Why Mahout?

• Intelligent Apps are the Present and Future

• Thus, Mahout’s Goal is:

– Scalable Machine Learning with Apache License

Current Status

• What’s in it:

– Simple Matrix/Vector library

– Taste Collaborative Filtering

– Clustering: Canopy, K-Means, Fuzzy K-Means, Mean-shift, Dirichlet

– Classifiers: Naïve Bayes, Complementary NB

– Evolutionary: Integration with Watchmaker for fitness function
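Of the clustering algorithms listed, canopy clustering is probably the least familiar, so here is a rough single-machine sketch of the idea: pick a point as a canopy center, add every point within a loose threshold T1 to that canopy, and remove every point within a tight threshold T2 from further consideration as a center. Thresholds, data, and names are assumptions for illustration; this is not Mahout's code.

```python
import math

def canopy(points, t1, t2):
    """Canopy clustering with loose threshold t1 and tight threshold t2 (t1 > t2)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every remaining point within t1 joins this canopy (canopies may overlap)
        members = [center] + [p for p in remaining if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within t2 are too close to become centers of their own canopy
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies

# Hypothetical 2-D points
points = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
for center, members in canopy(points, t1=4, t2=2):
    print(center, members)
```

Canopy centers found this way are often used as starting centroids for k-means, since the pass is cheap and gives a reasonable guess at the number of clusters.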

How?

• Examples

– Taste

– Clustering

– Classification

– Evolutionary

Taste: Movie Recommendations

• Given ratings by users of movies, recommend other movies

• http://lucene.apache.org/mahout/taste.html#demo
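The recommendation task described here (given user ratings of movies, suggest other movies) can be sketched with a user-based approach: find users whose ratings resemble yours, then predict your rating for unseen movies as a similarity-weighted average of theirs. The ratings, names, and the use of cosine similarity below are assumptions made for this example; the code does not use Taste's API.

```python
import math

def similarity(a, b):
    """Cosine similarity over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, n=1):
    """Rank movies the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {movie: scores[movie] / weights[movie] for movie in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:n]

# Hypothetical ratings on a 1-5 scale
ratings = {
    "ann": {"Alien": 5, "Brazil": 1},
    "bob": {"Alien": 5, "Brazil": 1, "Casablanca": 5, "Dune": 2},
    "cat": {"Alien": 1, "Brazil": 5, "Casablanca": 1, "Dune": 4},
}

print(recommend(ratings, "ann", n=2))
```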

Taste Demo

• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true

• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true

Clustering: Synthetic Control Data

• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples

– o.a.mahout.clustering.syntheticcontrol.*

• Outputs clusters…

Classification: NB and CNB Examples

• 20 Newsgroups

– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups

• Wikipedia

– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample

Evolutionary

• Traveling Salesman

– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman

• Class Discovery

– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery

What’s Next?

• More Examples

• Winnow/Perceptron (MAHOUT-85)

• Text Clustering

• Association Rules (MAHOUT-108)

• Logistic Regression

• Solr Integration (SOLR-769)

• GSOC

When, Who

• When? Now!

– Mahout is growing

• Who? You!

– We want programmers who are comfortable with math and like to work on hard problems

– We want others to kick the tires

Where?

• http://lucene.apache.org/mahout

– Hadoop - http://hadoop.apache.org

• http://cwiki.apache.org/MAHOUT

• mahout-{user|dev}@lucene.apache.org

– http://www.lucidimagination.com/search/p:mahout

Resources

• “Programming Collective Intelligence” by Segaran

• “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank

• “Taming Text” by Ingersoll and Morton