talk bedcon berlin 2012

Upload: samiie30

Post on 04-Apr-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/31/2019 Talk BedCon Berlin 2012

    1/42

    How to build a recommender system based on Mahout and

    Java EE

    Berlin Expert Days 29. 30. March 2012 Manuel Blechschmidt CTO Apaxo GmbH

  • 7/31/2019 Talk BedCon Berlin 2012

    2/42

    All the web content will be personalized in threeto five years.

    Sheryl Sandberg COO Facebook 09.2010

  • 7/31/2019 Talk BedCon Berlin 2012

    3/42

    Agenda

    - What is personalization?

    - What algorithms can be used?- Architecture of a recommender system

    - How to bundle Mahout into an Java EE

    application- Interface with the recommender

    - Conclusion

  • 7/31/2019 Talk BedCon Berlin 2012

    4/42

    What is personalization?

    Personalization involves using technology toaccommodate the differences betweenindividuals. Once confined mainly to the Web, itis increasingly becoming a factor in education,health care (i.e. personalized medicine),television, and in both "business to business"and "business to consumer" settings.

    Source:https://en.wikipedia.org/wiki/Personalization

  • 7/31/2019 Talk BedCon Berlin 2012

    5/42

    Amazon.com

  • 7/31/2019 Talk BedCon Berlin 2012

    6/42

    TripAdvisor.com

  • 7/31/2019 Talk BedCon Berlin 2012

    7/42

    eBay

  • 7/31/2019 Talk BedCon Berlin 2012

    8/42

    criteo.com - Retargeting

  • 7/31/2019 Talk BedCon Berlin 2012

    9/42

    Zalando

  • 7/31/2019 Talk BedCon Berlin 2012

    10/42

    Plista

  • 7/31/2019 Talk BedCon Berlin 2012

    11/42

    YouTube

  • 7/31/2019 Talk BedCon Berlin 2012

    12/42

    Naturideen.de (coming soon)

  • 7/31/2019 Talk BedCon Berlin 2012

    13/42

    Recommender

    This talk will concentrate on recommender technology based on collaborative filtering (cf)to personalize a web site

    - a lot of research is going on

    - cf has shown great success in movie and musicindustry

    - recommenders can collect data silently and useit without manual maintenance

  • 7/31/2019 Talk BedCon Berlin 2012

    14/42

    What is a recommender?Let U be a set of users of the recommendation system and I be the

    set of items from which the users can choose. A recommender r is a function which produces for a user u

    ia set of recommended

    items R kwith k entries and a binary, transitive, antisymmetric and

    total relation prefers_over ui

    which can be used for sorting the

    recommendations for the user. The recommender r is oftencalled a top-k recommender.

  • 7/31/2019 Talk BedCon Berlin 2012

    15/42

    WARNING!

    The next slides contain math which takes normally3 years to understand so don't be disappointedif you don't get it.

  • 7/31/2019 Talk BedCon Berlin 2012

    16/42

    What should wolf and sheep eat?Ms. Baumgartner your sun is ignoringagain the food chain.

    SON!

    Source: http://static.nichtlustig.de/comics/full/081009.jpg

    http://static.nichtlustig.de/comics/full/081009.jpghttp://static.nichtlustig.de/comics/full/081009.jpg
  • 7/31/2019 Talk BedCon Berlin 2012

    17/42

    Demo Data

    Carrots Grass Pork Beef Corn FishRabbit 10 7 1 2 ? 1

    Cow 7 10 ? ? ? ?

    Dog ? 1 10 10 ? ?

    Pig 5 6 4 ? 7 3Chicken 7 6 2 ? 10 ?

    Pinguin 2 2 ? 2 2 10

    Bear 2 ? 8 8 2 7

    Lion ? ? 9 10 2 ?Tiger ? ? 8 ? ? 5

    Antilope 6 10 1 1 ? ?

    Wolf 1 ? ? 8 ? 3

    Sheep ? 8 ? ? ? 2

  • 7/31/2019 Talk BedCon Berlin 2012

    18/42

    Characteristics of Demo Data

    Ratings from 1 10

    Users: 12

    Items: 6

    Ratings: 43 (unusual normally 100,000 100,000,000)

    Matrix filled: ~60% (unusual normally sparse around 0.5-2%)

    Average Number of Ratings per User: ~3.58 Average Number of Ratings per Item: ~7.17

    Average Rating: ~5.579

    https://github.com/ManuelB/facebook-recommender-demo/tree/master/docs/BedConExamples.R

    https://github.com/ManuelB/facebook-recommender-demo/tree/master/docs/BedConExamples.Rhttps://github.com/ManuelB/facebook-recommender-demo/tree/master/docs/BedConExamples.R
  • 7/31/2019 Talk BedCon Berlin 2012

    19/42

  • 7/31/2019 Talk BedCon Berlin 2012

    20/42

  • 7/31/2019 Talk BedCon Berlin 2012

    21/42

    YouTube Rating Distribution

    http://techcrunch.com/2009/09/22/youtube-comes-to-a-5-star-realization-its-ratings-are-useless/

    http://techcrunch.com/2009/09/22/youtube-comes-to-a-5-star-realization-its-ratings-are-useless/http://techcrunch.com/2009/09/22/youtube-comes-to-a-5-star-realization-its-ratings-are-useless/
  • 7/31/2019 Talk BedCon Berlin 2012

    22/42

    Model and Memory Approaches

    - Item- or User- Based Collaborative Filtering

    - Idea suggest what similair users did

    - Matrix Factorization e.g Singular Value Decomposition- Try to create a profile based of factors for users and items

    Main difference:

    A model base approach tries to extract the underlying logic fromthe data.

  • 7/31/2019 Talk BedCon Berlin 2012

    23/42

    User Based Approach

    - Find similar or opposite animals like wolf

    - Checkout what these other animals like- Recommend this to wolf

  • 7/31/2019 Talk BedCon Berlin 2012

    24/42

    Find animals which voted for beef,fish and carrots too

    Carrots Grass Pork Beef Corn Fish

    Wolf 1 ? ? 8 ? 3

    Pinguin 2 2 ? 2 2 10

    Bear 2 ? 8 8 2 7

    Rabbit 10 7 1 2 ? 1

    Cow 7 10 ? ? ? ?

    Dog ? 1 10 10 ? ?

    Pig 5 6 4 ? 7 3

    Chicken 7 6 2 ? 10 ?Lion ? ? 9 10 2 ?

    Tiger ? ? 8 ? ? 5

    Antilope 6 10 1 1 ? ?

    Sheep ? 8 ? ? ? ?

  • 7/31/2019 Talk BedCon Berlin 2012

    25/42

    Pearson Correlation

    - 1 = very similar

    - (-1) = complete opposite votings

    - similarity between wolf and pinguin: -0.2401922

    - cor(c(8,3,1),c(2,10,2))

    - similarity between wolf and bear: 0.8196562

    - cor(c(8,3,1),c(8,7,2))- similarity between wolf and rabbit: -0.6465846

    - cor(c(8,3,1),c(2,1,10))

  • 7/31/2019 Talk BedCon Berlin 2012

    26/42

    Weighted sum of the ratings

    Pork (10):

    (0.8196 * 8 + (-0.6466) * 1) / (0.8196 + (-0.6465)

    = 34 ~ 10

    Grass (5.65):

    (2*(-0.2401)+7*(-0.6465)) / ((-0.2401) + (-0.6465))= 5,65

    Corn (2,0):

    2*(-0.2401)+2*(0.8196)) / (-0.2401+0.8196) = 2,0

  • 7/31/2019 Talk BedCon Berlin 2012

    27/42

    Predicted ratings

    - Wolf should eat: Pork Rating: 10.0

    - Wolf should eat: Grass Rating: 5.645701

    - Wolf should eat: Corn Rating: 2.0

  • 7/31/2019 Talk BedCon Berlin 2012

    28/42

    SVD

    http://public.lanl.gov/mewall/kluwer2002.html

    http://public.lanl.gov/mewall/kluwer2002.htmlhttp://public.lanl.gov/mewall/kluwer2002.html
  • 7/31/2019 Talk BedCon Berlin 2012

    29/42

    Factorized Matrixes

  • 7/31/2019 Talk BedCon Berlin 2012

    30/42

    Predicted Matrix (k = 2)

  • 7/31/2019 Talk BedCon Berlin 2012

    31/42

    Predicted Ratings

    - Sheep should eat: Carrots Rating: 6.01

    - Sheep should eat: Corn Rating: 5.83

  • 7/31/2019 Talk BedCon Berlin 2012

    32/42

    What other algorithms can be used?

    Similarity Measures for Item or User based:

    - LogLikelihood Similarity

    - Cosine Similarity

    - Pearson Similarity

    - etc.

    Estimating algorithms for SVD:- ALSWRFactorizer

    - ExpectationMaximizationSVDFactorizer

  • 7/31/2019 Talk BedCon Berlin 2012

    33/42

    Let's built it

  • 7/31/2019 Talk BedCon Berlin 2012

    34/42

    Mahout - Taste

    Taste is a flexible, fast collaborative filteringengine for Java.

    It was primarly written by Sean Owen and is partof the Apache Mahout library.

    Another important author is: Sebastian Schelter

    It offers all the algorithms introduced here and isOpen Source

    http://mahout.apache.org/taste.html

    http://mahout.apache.org/taste.htmlhttp://mahout.apache.org/taste.html
  • 7/31/2019 Talk BedCon Berlin 2012

    35/42

    Architecture of the recommender

  • 7/31/2019 Talk BedCon Berlin 2012

    36/42

    Packaging

  • 7/31/2019 Talk BedCon Berlin 2012

    37/42

    Eclipse Project

  • 7/31/2019 Talk BedCon Berlin 2012

    38/42

    Maven pom.xml

  • 7/31/2019 Talk BedCon Berlin 2012

    39/42

  • 7/31/2019 Talk BedCon Berlin 2012

    40/42

    Demo and live Debugging

  • 7/31/2019 Talk BedCon Berlin 2012

    41/42

    Conclusion

    Recommendation is a lot of math

    You shouldn't implement the algorithms again

    There are a lot of unsanswered questions

    - Scalibility, Performance, Usability

    You can gain a lot from good personalization

  • 7/31/2019 Talk BedCon Berlin 2012

    42/42

    More sources

    http://www.apaxo.de /

    http://mahout.apache.org /

    http://research.yahoo.com /

    http://www.grouplens.org/

    http://recsys.acm.org/

    https://github.com/ManuelB/facebook-recommender-demo/

    http://docs.oracle.com/javaee/6/tutorial/doc/

    http://www.apaxo.de/http://mahout.apache.org/http://research.yahoo.com/http://www.grouplens.org/http://recsys.acm.org/https://github.com/ManuelB/facebook-recommender-demo/http://docs.oracle.com/javaee/6/tutorial/doc/http://docs.oracle.com/javaee/6/tutorial/doc/https://github.com/ManuelB/facebook-recommender-demo/http://recsys.acm.org/http://www.grouplens.org/http://research.yahoo.com/http://mahout.apache.org/http://www.apaxo.de/