from data to informationdec., 16th 2009: hadoop* get together in berlin – richard hutton...

From Data to Information Apache Mahout Speaker: Isabel Drost

Post on 14-Jul-2020

Page 1: From Data to InformationDec., 16th 2009: Hadoop* Get Together in Berlin – Richard Hutton (nugg.ad): “Moving from five days to one hour.” – Jörg Möllenkamp (Sun): “Jörg

From Data to Information

Apache Mahout

Speaker: Isabel Drost

Isabel Drost

Nighttime: Co-founder of Apache Mahout.

Organizer of Berlin Hadoop Get Together.

Daytime: Software developer @ Berlin

Hello ApacheCon visitors!

Agenda

● Motivation.

● HowTo: A path from data to information.

● Introduction to Mahout.

January 3, 2006 by Matt Callow, http://www.flickr.com/photos/blackcustard/81680010

News aggregation

Today: Read newspapers, blogs, Twitter, RSS feeds.

Wish: Aggregate sources and track emerging topics.

September 10, 2008 by Alex Barth, http://www.flickr.com/photos/a-barth/2846621384

Go to cinema

Today: IMDB, zitty, movie review pages, Twitter, blogs, ask friends.

Wish: Reviews, sentiment detection, recommendations.

March 22, 2008 by Crystian Cruz, http://www.flickr.com/photos/crystiancruz/2353895708

HowTo: From data to information.

From data to information.

● Start collecting and storing data.

● Analyse and understand data.

● Answer more complex questions.

Collecting and storing data.

By Lab2112, http://www.flickr.com/photos/lab2112/462388595/

Data storage options

● Structured, relational.

– Customer data.

– Bug database.

By bareform, http://www.flickr.com/photos/bareform/2483573213/

Data storage options

● Continuous files.

– Log data.

– Document stream.

January 8, 2008 by Pink Sherbet Photography, http://www.flickr.com/photos/pinksherbet/2177961471/

Data storage options

● Semi-structured data:

– Documents.

– Independent rows.

From data to information.

● Start collecting and storing your data.

● Analyse and understand your data.

● Answer more complex questions.

Understanding your data

● Data profiling.

● Goals:

– Identify usual behaviour.

– Find exceptional cases.

● Exact questions depend on domain.

Example: Shopping sessions

● Average amount of money spent.

● Number of customers per state.

● Min/Max age of customers.

● Number of shopping sessions.

● Words associated with product.
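
A minimal sketch of how the numeric statistics above could be computed on a small in-memory sample. The record layout and field names (`customer`, `state`, `age`, `spent`) are assumptions made for illustration; they are not from the talk.

```python
from collections import Counter
from statistics import mean

# Hypothetical session records; the field names are assumptions for illustration.
sessions = [
    {"customer": "c1", "state": "Berlin", "age": 34, "spent": 20.0},
    {"customer": "c2", "state": "Bavaria", "age": 51, "spent": 35.5},
    {"customer": "c1", "state": "Berlin", "age": 34, "spent": 4.5},
    {"customer": "c3", "state": "Berlin", "age": 19, "spent": 60.0},
]

def profile(sessions):
    """Compute a few simple profile statistics over shopping sessions."""
    # Keep one record per customer so each customer counts once per state and age.
    customers = {s["customer"]: s for s in sessions}
    ages = [c["age"] for c in customers.values()]
    return {
        "sessions": len(sessions),
        "avg_spent": mean(s["spent"] for s in sessions),
        "customers_per_state": Counter(c["state"] for c in customers.values()),
        "min_age": min(ages),
        "max_age": max(ages),
    }

print(profile(sessions))
```

On real data volumes the same aggregates would be computed by a distributed job rather than in one process.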

Example: Access Logs

● Average session length.

● Entry-/ exit-pages.

● Average number of hits/ day.

● Clean data.
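
A small sketch of these access-log statistics, assuming the log has already been parsed and cleaned into tuples of (session id, day, timestamp in seconds, page). That tuple layout is a hypothetical choice for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical parsed log entries: (session id, day, timestamp in seconds, page).
hits = [
    ("s1", "2009-11-02", 0, "/home"),
    ("s1", "2009-11-02", 40, "/products"),
    ("s1", "2009-11-02", 100, "/checkout"),
    ("s2", "2009-11-03", 10, "/home"),
    ("s2", "2009-11-03", 30, "/about"),
]

def profile_log(hits):
    """Average session length, entry and exit pages, average hits per day."""
    by_session = defaultdict(list)
    for session, _day, ts, page in hits:
        by_session[session].append((ts, page))

    lengths, entries, exits = [], Counter(), Counter()
    for visits in by_session.values():
        visits.sort()                                  # order hits by timestamp
        lengths.append(visits[-1][0] - visits[0][0])   # last hit minus first hit
        entries[visits[0][1]] += 1
        exits[visits[-1][1]] += 1

    hits_per_day = Counter(day for _, day, _, _ in hits)
    return {
        "avg_session_seconds": sum(lengths) / len(lengths),
        "entry_pages": entries,
        "exit_pages": exits,
        "avg_hits_per_day": len(hits) / len(hits_per_day),
    }

print(profile_log(hits))
```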

Example: Textual documents

● Average length of documents.

● Distribution of document topics.

● Distribution of authors.

Understanding your data

● Analysing data in HDFS/ HBase/ CouchDB:

– Write analysis code as Map/Reduce jobs.

– Use a higher-level language.
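
To illustrate what writing analysis code as a Map/Reduce job means, here is a single-process sketch that counts hits per day. The `run_job` helper only imitates the map, shuffle and reduce phases in memory; it is not the Hadoop API, and the log line format is an assumption.

```python
from collections import defaultdict

def map_hit(line):
    """Map: emit (day, 1) for one log line of the assumed form 'day page'."""
    day, _page = line.split()
    yield day, 1

def reduce_count(day, ones):
    """Reduce: sum all counts emitted for one day."""
    return day, sum(ones)

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

log = ["2009-11-02 /home", "2009-11-02 /products", "2009-11-03 /home"]
print(run_job(log, map_hit, reduce_count))
```

The mapper sees one record at a time and the reducer sees all values for one key, which is what lets a framework spread both phases over many machines.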

From data to information.

● Start collecting and storing your data.

● Analyse and understand your data.

● Answer more complex questions.

Analyse shopping lists

By tanakawho, http://www.flickr.com/photos/28481088@N00/349049527/

Interactive web search

Show most relevant ads

Find emerging news topics

Machine learning – what's that?

Archimedes taking a Warm Bath

Image by John Leech, from: The Comic History of Rome by Gilbert Abbott A Beckett. Bradbury, Evans & Co, London, 1850s

Archimedes' model of nature

June 25, 2008 by chase-me, http://www.flickr.com/photos/sasy/2609508999

An SVM's model of nature

The challenge

● Large amounts of data.

● Structured and unstructured data.

● Diverse tasks.

Mission

Provide scalable data mining algorithms.

● Commercially friendly license.

● Scalable to large amounts of data.

● Well documented.

● Healthy community.

● Targeted to developers.

What does Mahout have to offer?

Discover groups of items

● Group items by similarity.

● Examples:

– Group news articles by topic.

– Find developers with similar interests.

Discover groups of similar items

● Canopy.

● k-Means.

● Fuzzy k-Means.

● Dirichlet based.

● Others upcoming.
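
As an illustration of the first algorithm in the list, here is a sketch of the textbook canopy clustering procedure with a loose threshold `t1` and a tight threshold `t2`. It is a single-process illustration under that assumption, not Mahout's implementation; the function and parameter names are made up for this example.

```python
def canopy(points, t1, t2, distance):
    """Group points into possibly overlapping canopies (requires t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unprocessed point starts a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:                     # loosely close: belongs to this canopy
                members.append(p)
            if d >= t2:                    # not tightly close: may start or join others
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

values = [1.0, 1.5, 2.5, 8.0, 8.4]
result = canopy(values, t1=2.0, t2=1.0, distance=lambda a, b: abs(a - b))
for center, members in result:
    print(center, members)
```

Canopy centers found this way are often used as starting centers for k-Means.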

Identify dominant topics

● Given a dataset of texts, identify main topics.

● Examples:

– Dominant topics in set of mails.

– Identify news message categories.

Algorithms: Parallel LDA

Assign items to defined categories.

● Given pre-defined categories, assign items to them.

● Examples:

– Spam mail classification.

– Discovery of images depicting humans.

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/

Assign items to defined categories.

● Naïve Bayes.

● Complementary Naïve Bayes.

● Random forests.

● Others upcoming.
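
A minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing, to make the first algorithm concrete. The training data and function names are invented for illustration and do not reflect Mahout's classifier API.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, text). Returns label counts, word counts, vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches"),
    ("ham", "meeting agenda for monday"),
    ("ham", "hadoop meetup in berlin"),
])
print(classify(model, "buy cheap pills"))
```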

Assign items to defined categories

● Examples based on “standard” datasets:

● 20 Newsgroups: http://cwiki.apache.org/MAHOUT/twentynewsgroups.html

● Wikipedia: http://cwiki.apache.org/MAHOUT/wikipediabayesexample.html

Recommendation mining.

● Recommend items to users.

● Examples:

– Find books related to the book I am buying.

– Find movies I might like.
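
One simple way to find related items is to count how often two items appear in the same user's history. The sketch below assumes that approach and uses invented data and function names; it is not the API of Mahout's recommender library.

```python
from collections import Counter
from itertools import permutations

def cooccurrences(histories):
    """Count how often each ordered pair of items shares a user history."""
    counts = Counter()
    for items in histories.values():
        counts.update(permutations(sorted(set(items)), 2))
    return counts

def recommend(user, histories, top_n=2):
    """Recommend items that co-occur most often with what the user already has."""
    counts = cooccurrences(histories)
    owned = set(histories[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

histories = {
    "alice": ["hadoop book", "lucene book"],
    "bob": ["hadoop book", "lucene book", "solr book"],
    "carol": ["hadoop book", "lucene book"],
    "dave": ["hadoop book", "solr book", "cooking book"],
    "erin": ["hadoop book"],
}
print(recommend("erin", histories))
```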

Recommending places

Recommending people

Recommendation mining.

● Integrated Taste.

● Mature Java library.

● Java-based, web service / HTTP bindings.

● Batch mode based on EC2 and Hadoop.

Frequent pattern mining

● Given groups of items, find commonly co-occurring items.

● Examples:

– In shopping carts find items bought together.

– In query logs find queries issued in one session.
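
The shopping cart example can be sketched by counting item pairs directly and keeping those that reach a minimum support. This brute-force pair counting is an assumption chosen for brevity; it is not the algorithm Mahout implements, and the names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical order; set() ignores repeats.
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
print(frequent_pairs(baskets, min_support=2))
```

Counting every pair becomes expensive for large baskets, which is one reason dedicated frequent pattern algorithms exist.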

By crypto, http://www.flickr.com/photos/crypto/3201254932/sizes/l/

By libraryman, http://www.flickr.com/photos/libraryman/78337046/sizes/l/

By quinnanya, http://www.flickr.com/photos/quinnanya/2806883231/

Upcoming

● More algorithms.

● Optimization of existing implementations.

● More examples.

● Release 0.2

Upcoming

● “TU Winter of Code”

– Crawl and store blog postings.

– Group posts and identify emerging topics.

– Index results with Solr.

Database Systems and Information Management

Prof. Dr. Volker Markl

TU Winter of Code

● 6 students, 5 months.

● http://github.org/MaineC/Playground

Why go for Apache Mahout?

Jumpstart your project with proven code.

January 8, 2008 by dreizehn28, http://www.flickr.com/photos/1328/2176949559

Discuss ideas and problems online.

November 16, 2005 by [phil h], http://www.flickr.com/photos/hi-phi/64055296

Become part of the community.

[email protected]

[email protected]

Interest in solving hard problems.

Being part of lively community.

Engineering best practices.

Bug reports, patches, features.

Documentation, code, examples.

July 9, 2006 by trackrecord, http://www.flickr.com/photos/trackrecord/185514449

Dec., 16th 2009: Hadoop* Get Together in Berlin

– Richard Hutton (nugg.ad): “Moving from five days to one hour.”

– Jörg Möllenkamp (Sun): "Hadoop on Sun."

– Nikolaus Pohle (nurago): "M/R for MR - Online Market Research powered by Apache Hadoop. Enable consultants to analyze online behavior for audience segmentation, advertising effects and usage patterns."

http://upcoming.yahoo.com/event/4842528/

* UIMA, Hbase, Lucene, Solr, katta, Mahout, CouchDB, pig, Hive, Cassandra, Cascading, JAQL, ... talks welcome as well.

March 2009: Hadoop* Get Together in Berlin

– Dragan Milosevic ( ): TBA

– YOU!

newthinking store Berlin

Tucholskystr. 48

* UIMA, Hbase, Lucene, Solr, katta, Mahout, CouchDB, pig, Hive, Cassandra, Cascading, JAQL, ... talks welcome as well.

Mahout Meetup this evening

Going parallel: k-Means

Until stable.

Data intensive.

Output: Cluster assignment.

Pre-compute centers.

Done in Map.

Done in Reduce.
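
The slide labels are terse, so the following is a sketch under an assumption: the data-intensive assignment of points to clusters is the map step, computing the new centers is the reduce step, and both repeat until the centers are stable. It runs in a single process and is not Mahout's implementation; all names are illustrative.

```python
from collections import defaultdict

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def map_assign(point, centers):
    """Map: emit (cluster index, point) for one input point."""
    return nearest(point, centers), point

def reduce_center(points):
    """Reduce: the new center is the mean of all points assigned to a cluster."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def kmeans(points, centers, max_iterations=100):
    """Repeat the map and reduce steps until the centers no longer change."""
    for _ in range(max_iterations):
        groups = defaultdict(list)
        for point in points:
            index, value = map_assign(point, centers)
            groups[index].append(value)
        # A cluster that received no points keeps its old center.
        new_centers = [reduce_center(groups[i]) if groups[i] else centers[i]
                       for i in range(len(centers))]
        if new_centers == centers:   # "until stable"
            break
        centers = new_centers
    return centers

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
print(kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)]))
```

Each map call needs only one point plus the current centers, so the assignment step can be spread across the whole data set in parallel.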

Make searching the web easier