“introducing hadoop on azure:
DESCRIPTION
hello Map-Reduce!”. “Introducing Hadoop on Azure:. Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago. Materials: http ://www.joehummel.net/downloads.html Email: [email protected]. A little history…. - PowerPoint PPT PresentationTRANSCRIPT
“Introducing Hadoop on Azure:
Joe Hummel, PhD
Visiting Researcher: U. of California, IrvineAdjunct Professor: U. of Illinois, Chicago &
Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.htmlEmail: [email protected]
hello Map-Reduce!”
Hadoop on Azure 2
Map-Reduce is from functional programming
A little history…
// function returns 1 if i is prime, 0 if not:let isPrime(i) = ...
// sums 2 numbers:let sum(x, y) = return x + y
// count the number of primes in 1..N:let countPrimes(N) = let L = [ 1 .. N ] // [ 1, 2, 3, 4, 5, 6, ... ] let T = map isPrime L // [ 0, 1, 1, 0, 1, 0, ... ] let count = reduce sum T // 42 return count
3
Hadoop:◦ Created by to drive internet search◦ Parallelism
◦ Data partitioning
◦ Fault tolerance
A little more history…
BIGData
pagehits
4
Freely-available framework for big data◦ http://hadoop.apache.org/
Based on concept of Map-Reduce:
Hadoop today
BIGdata
Map
Map
Map
Map...
Reduce R
map function reduce intermediate results
...
5
Workflow
Map
Sort
Reduce
Merge
[ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
[ <key1, value>, <key2, value>… ] R
Data
Map
Sort
Map
Sort
[ <key1,value>, <key4,value>, <key2,value>, … ]
[ <key1,value>, <key1,value>, … ]
6
We’ll be working with Chicago crime data…◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html
Data set for demo
1 GB
5M rows
7
Compute top-10 crimes…
Goal?
0486 3669030820 308074...0890 166916
IUCR Count
IUCR = Illinois Uniform Crime Codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
8
Hadoop on Azure…
Demo
Hadoop on Azure
// Javascript version:
var map = function (key, value, context){ var values = value.split(","); context.write(values[4], 1);};
var reduce = function (key, values, context) { var sum = 0; while (values.hasNext()) { sum += parseInt(values.next()); } context.write(key, sum);};
0486 3669030820 308074...
9
Rich ecosystem around Hadoop ◦ Pig ◦ Hive◦ HBASE◦ …
Hadoop on Azure
Hadoop++
// interactive PIG with explicit Map-Reduce functions:
pig.from("CC-from-2001.txt"). mapReduce("IUCR-Count.js", "IUCR, Count:long"). orderBy("Count DESC"). take(10). to("output-from-2001")
// interactive PIG without explicit Map-Reduce:
schema = "ID,Case Number,Date,Block,IUCR,..."pig.from("CC-from-2001.txt", schema). groupBy("IUCR"). select("group, SUM($1.Count"). orderBy("Count DESC"). take(10). to("output-from-2001")
10
Microsoft is offering free access to Hadoop◦ Request invitation @ http://www.hadooponazure.com/
Hadoop connector for Excel◦ Process data using Hadoop, analyze/visualize using Excel
Hadoop on Azure
Hadoop on Azure
Big Data Processing, Cheap
11
Freely-available plugin for Excel 2010◦ http://www.powerpivot.com/
Turns Excel into an in-memory database◦ More precisely, turns spreadsheet into an OLAP cube
Note: ◦ If you have 32-bit Excel, install 32-bit PowerPivot
◦ If you have 64-bit Excel, install 64-bit PowerPivot
◦ GBs of data will require 64-bit
◦ [ How to tell what version of Excel you have? File menu, help… ]
PowerPivot
12
PowerPivot…◦ Install
◦ PowerPivot menu
◦ PowerPivot Window
◦ Get Data...
◦ PivotTable…
Demo
Big Data Processing, Cheap
Compare and contrastApproach Pros Cons Target Scalable?
PowerPivot
No programming, built-in UI and visualization
Lack of scalability GBsLimited by
RAM
LINQ Flexibility of analysis ProgrammingGBs, few
TBsLimited by
local resources
Hadoop Scalability, ease of programming
Must fit into Map-Reduce
framework; not necessarily fast
GBs, TBs, PBs
Yes!(via cluster or
cloud)
Big Data Processing, Cheap
15
That’s it!
Big Data Processing, Cheap
16
Presenter: Joe Hummel◦ Email: [email protected]◦ Materials: http://www.joehummel.net/downloads.html
Keep an eye for final release of:◦ Hadoop on Azure
◦ Hadoop on Windows
◦ PowerView plugin for Excel 2013
Thank you for attending
Big Data Processing, Cheap