hack reduce introduction
TRANSCRIPT
What is hack/reduce?
• A Home for the Big Data Community
• 24/7 Access to Cluster Compute Power
• Regular Hackathons
hack/reduce: Boston’s Big Data Hackspace
hack/reduce: Montreal, Ottawa, Toronto, Boston (2011–2012)
Why should you care?
• Work with millions and billions of records
• Find patterns in Big Data sets
• Use data to detect, predict, forecast
• Extract new information from raw data
APIs Suck
In Big Data there are:
• no requests,
• no predefined parameters,
• no structured responses.
You are free to intersect anything with anything.
You can analyse, mutate, group, split, reorder in any way you can imagine.
What you can do today
• Access the hack/reduce GoGrid Cluster:
  • 240 Cores
  • 240 GB of RAM
  • 10 TB of Disk
What you can do today
Use Hadoop to explore big Open Data sets, like:
• 20 Years of the Federal Parliament Hansard
• Hourly Canadian Weather, 1953 to 2001
• The 1881 Census: details about 4.3M people
• One Summer of Bixi Station Status Updates
What is Map/Reduce?
• A framework for distributed computing on large data sets across clusters of computers
• MapReduce was patented by Google
• Hadoop’s implementation is Googlesque
• Michael Stonebraker hates it
What is Map/Reduce?
• Map = a function applied in parallel to every item in the dataset
• Reduce = a function applied in parallel to each group of values emitted by the Map function
What is Map/Reduce?
map(String docId, String document):
  for each word w in document:
    emit(w, 1);

reduce(String word, Iterator counts):
  int sum = 0;
  for each count in counts:
    sum += count;
  emit(word, sum);
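The word-count pseudocode above can be sketched as a runnable single-machine simulation. This is only an illustration of the map → shuffle → reduce flow (the function names and the toy documents below are made up for the example); a real job on the Hadoop cluster would implement the same logic as Mapper and Reducer classes in Java:

```python
from collections import defaultdict

def map_phase(doc_id, document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: sum all the counts emitted for one word.
    yield (word, sum(counts))

def map_reduce(documents):
    # Shuffle: group every value emitted by Map under its key.
    # On a cluster, Map and Reduce calls run in parallel; here they run serially.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_phase(doc_id, text):
            groups[word].append(count)
    result = {}
    for word, counts in groups.items():
        for key, total in reduce_phase(word, counts):
            result[key] = total
    return result

# Hypothetical toy input, just to show the shape of the output:
print(map_reduce({"doc1": "big data big cluster", "doc2": "big hadoop"}))
# {'big': 3, 'data': 1, 'cluster': 1, 'hadoop': 1}
```

The key property is that each Reduce call sees only one word's counts, so on a cluster the groups can be processed on different machines with no coordination.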
MapReduce: http://cluster-1-master.gg.hackreduce.net:50030
SSH: ssh -i hackreduce [email protected]
private key (“hackreduce”): http://bit.ly/X13pNh
wiki: http://github.com/hackreduce/Hackathon