Map Reduce: An Example (James Grant at Big Data Brighton)
DESCRIPTION
Presentation by Brandwatch Developer James Grant at the second Big Data Brighton meetup, hosted by Brandwatch: www.brandwatch.com
TRANSCRIPT
Map Reduce: An Example
Who am I?
My name is James Grant.
I'm a developer here at Brandwatch.
For the last three years I've been a Data Engineer at Last.fm and the maintainer of their Hadoop Cluster.
Coming up…
● What happens during MapReduce?
● Plays and Reach from music listening data
● The Mapper pseudo code
● The Reducer pseudo code
● The result
● What if…?
What happens during MapReduce?
[Diagram: Input Data → Data Fragments → Mapper → Map Output → Sort → Reducer Input → Reducer → Reduce Output (Data Fragments)]
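The flow on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the phases, not how Hadoop is implemented: the function name `run_mapreduce`, the `num_fragments` parameter and the letter-counting toy job are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer, num_fragments=3):
    """Simulate the map, sort and reduce phases in a single process."""
    # Split the input into fragments, as the framework does before mapping.
    fragments = [records[i::num_fragments] for i in range(num_fragments)]

    # Map phase: each fragment is processed independently; the mapper
    # returns zero or more (key, value) pairs per record.
    map_output = []
    for fragment in fragments:
        for record in fragment:
            map_output.extend(mapper(record))

    # Sort phase: order the pairs by key so equal keys sit together.
    map_output.sort(key=itemgetter(0))

    # Reduce phase: call the reducer once per key with all of its values.
    reduce_output = []
    for key, pairs in groupby(map_output, key=itemgetter(0)):
        values = (value for _, value in pairs)
        reduce_output.extend(reducer(key, values))
    return reduce_output


if __name__ == "__main__":
    # Hypothetical toy job: count how many times each letter occurs.
    letters = ["a", "b", "a", "c", "b", "a"]
    counts = run_mapreduce(
        letters,
        mapper=lambda letter: [(letter, 1)],
        reducer=lambda letter, ones: [(letter, sum(ones))],
    )
    print(counts)  # [('a', 3), ('b', 2), ('c', 1)]
```

In a real cluster the fragments are processed on different machines and the sort also moves data between them; here everything stays in one list so the three phases are easy to see.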
Plays and Reach from music listening data
● Plays - The number of times that song has been played
● Reach - The number of unique listeners to that song
● Similar to hits and uniques for web properties
● Input data has columns for user id and song id (amongst others)
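To make the two definitions concrete, here is a small Python sketch that computes both figures directly, without MapReduce. The sample data and the name `plays_and_reach` are invented for illustration, and the input is assumed to be already reduced to (user id, song id) pairs.

```python
from collections import Counter, defaultdict


def plays_and_reach(listens):
    """Return {song_id: (plays, reach)} for an iterable of (user, song) pairs."""
    plays = Counter()             # song -> total number of plays
    listeners = defaultdict(set)  # song -> set of distinct user ids
    for user, song in listens:
        plays[song] += 1
        listeners[song].add(user)
    return {song: (plays[song], len(listeners[song])) for song in plays}


if __name__ == "__main__":
    # Hypothetical listening log: one (user_id, song_id) pair per play.
    listens = [(1, 10), (2, 10), (1, 10), (3, 20), (1, 20), (3, 20), (3, 20)]
    print(plays_and_reach(listens))  # {10: (3, 2), 20: (4, 2)}
```

Song 10 was played three times by two different users, so its plays are 3 and its reach is 2.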
The Mapper
function map(Integer user, Integer song):
    emit(song, user);
The Reducer
function reduce(Integer song, Iterator users):
    Integer plays = 0;
    Set uniqueUsers = [];
    foreach user in users:
        increment plays;
        if user not within uniqueUsers:
            uniqueUsers.add(user);
    result.plays = plays;
    result.reach = uniqueUsers.cardinality();
    emit(song, result);
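The pseudo code above translates fairly directly into runnable Python. The sketch below is one possible rendering, not the code used in the talk: the names `map_listen`, `reduce_song` and `run` are made up, and `run` is only a tiny in-memory stand-in for the framework's sort step so the example can be executed on its own.

```python
from itertools import groupby
from operator import itemgetter


def map_listen(user, song):
    """Mapper: re-key each listen by song so one reducer call sees every listener."""
    yield song, user


def reduce_song(song, users):
    """Reducer: count all plays and the distinct users for one song."""
    plays = 0
    unique_users = set()
    for user in users:
        plays += 1
        unique_users.add(user)  # a set ignores repeats, so no membership test is needed
    yield song, {"plays": plays, "reach": len(unique_users)}


def run(listens):
    """In-memory stand-in for the framework: map, sort by key, then reduce."""
    pairs = sorted(
        (pair for user, song in listens for pair in map_listen(user, song)),
        key=itemgetter(0),
    )
    results = {}
    for song, group in groupby(pairs, key=itemgetter(0)):
        for key, result in reduce_song(song, (user for _, user in group)):
            results[key] = result
    return results


if __name__ == "__main__":
    print(run([(1, 10), (2, 10), (1, 10), (3, 20)]))
    # {10: {'plays': 3, 'reach': 2}, 20: {'plays': 1, 'reach': 1}}
```

Note that the reducer keeps the set of distinct users for a single song in memory while it iterates, exactly as the pseudo code does.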
What if…?
You often hear that in nearly all cases you should use a higher-level tool like Pig or Hive to solve problems.
So what does the Pig script look like for this problem?
Using Pig
subs = LOAD 'submissions.tsv' USING PigStorage() AS (user:int, song:int);
songs = GROUP subs BY song;
songs = FOREACH songs GENERATE group AS song, subs.user;
songs = FOREACH songs GENERATE song, COUNT($1.user), COUNT(Distinct($1.user));
STORE songs INTO 'playsreach.tsv';
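For readers who do not know Pig, the script can be read step by step as in the following Python sketch. It is a rough single-machine analogue, not a description of how Pig executes the script: the function name `plays_reach_tsv` is hypothetical, the input is assumed to be tab-separated text with exactly two integer columns (user, song), and the result is returned as a string rather than stored to a file.

```python
import csv
import io
from collections import defaultdict


def plays_reach_tsv(tsv_text):
    """Mirror the Pig script's LOAD, GROUP, FOREACH and STORE steps."""
    # LOAD ... AS (user:int, song:int): parse tab-separated rows.
    subs = [
        (int(user), int(song))
        for user, song in csv.reader(io.StringIO(tsv_text), delimiter="\t")
    ]

    # GROUP subs BY song: collect the bag of users for each song.
    bags = defaultdict(list)
    for user, song in subs:
        bags[song].append(user)

    # FOREACH ... GENERATE song, COUNT(users), COUNT(distinct users).
    songs = [
        (song, len(users), len(set(users)))
        for song, users in sorted(bags.items())
    ]

    # STORE: write the result back out as tab-separated text.
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(songs)
    return out.getvalue()


if __name__ == "__main__":
    print(plays_reach_tsv("1\t10\n2\t10\n1\t10\n3\t20\n"), end="")
```

Each output row holds the song id, its plays and its reach, in the same column order the final FOREACH generates.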
Questions?