Hadoop: A Hands-on Introduction
Post on 11-May-2015
Hadoop: A Hands-on Introduction
Claudio Martella, Elia Bruni
9 November 2011
Tuesday, November 8, 11
Outline
• What is Hadoop
• Why is Hadoop
• How is Hadoop
• Hadoop & Python
• Some NLP code
• A more complicated problem: Eva
A bit of Context
• 2003: first MapReduce library @ Google
• 2003: GFS paper
• 2004: MapReduce paper
• 2005: Apache Nutch uses MapReduce
• 2006: Hadoop was born
• 2007: first 1,000-node cluster at Yahoo!
An Ecosystem
• HDFS & MapReduce
• Zookeeper
• HBase
• Pig & Hive
• Mahout
• Giraph
• Nutch
Traditional way
• Design a high-level schema
• Store the data in an RDBMS
• Which has very poor write throughput
• And doesn't scale well
• Once you are talking about terabytes of data
• Expensive data warehouses
BigData & NoSQL
• Store first, think later
• Schema-less storage
• Analytics
• Petabyte scale
• Offline processing
Vertical Scalability
• Extremely expensive
• Requires expertise in distributed systems and concurrent programming
• Lacks real fault-tolerance
Horizontal Scalability
• Built on top of commodity hardware
• Easy-to-use programming paradigms
• Fault-tolerance through replication
1st Assumptions
• Data to process does not fit on one node.
• Each node is commodity hardware.
• Failure happens.
Spread your data among your nodes and replicate it.
2nd Assumptions
• Moving computation is cheap.
• Moving data is expensive.
• Distributed computing is hard.
Move computation to data, with simple paradigm.
3rd Assumptions
• Systems run on spinning hard disks.
• Disk seek >> disk scan.
• Many small files are expensive.
Base the paradigm on scanning large files.
Typical Problem
• Collect and iterate over many records
• Filter and extract something from each
• Shuffle & sort these intermediate results
• Group-by and aggregate them
• Produce final output set
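In plain Python, the five steps above can be sketched on one machine (a toy illustration, not Hadoop code; the sample records are made up):

```python
from itertools import groupby

records = ["a 1", "b 2", "a 3", "b 4", "a 5"]  # many input records

# Filter/extract: turn each record into a (key, value) pair
pairs = [(r.split()[0], int(r.split()[1])) for r in records]

# Shuffle & sort: bring identical keys next to each other
pairs.sort(key=lambda kv: kv[0])

# Group-by and aggregate: one aggregation per distinct key
output = {k: sum(v for _, v in g) for k, g in groupby(pairs, key=lambda kv: kv[0])}

print(output)  # {'a': 9, 'b': 6}
```

The sort is the crucial step: once equal keys are adjacent, each group can be aggregated independently, which is exactly what makes the pattern parallelizable.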
Typical Problem
• Collect and iterate over many records (MAP)
• Filter and extract something from each (MAP)
• Shuffle & sort these intermediate results
• Group-by and aggregate them (REDUCE)
• Produce final output set (REDUCE)
Quick example
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
• (frank, index.html)
• (index.html, 10/Oct/2000)
• (index.html, http://www.example.com/start.html)
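A mapper would emit pairs like those with a few lines of parsing; here is a sketch using a regular expression (the regex and the variable names are ours, tailored to this one log line):

```python
import re

LOG = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
       '"GET /index.html HTTP/1.0" 200 2326 '
       '"http://www.example.com/start.html" '
       '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

# Combined log format: host ident user [timestamp] "request" status bytes "referrer" "agent"
pattern = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) [^"]*" \d+ \d+ "([^"]*)"')
host, ident, user, ts, method, path, referrer = pattern.match(LOG).groups()

# The three key-value pairs from the slide
print((user, path))              # ('frank', '/index.html')
print((path, ts.split(':')[0]))  # ('/index.html', '10/Oct/2000')
print((path, referrer))          # ('/index.html', 'http://www.example.com/start.html')
```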
MapReduce
• Programmers define two functions:
★ map (key, value) → (key', value')*
★ reduce (key', [value'+]) → (key'', value'')*
• Can also define:
★ combine (key, value) → (key', value')*
★ partitioner: k' → partition
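As a concrete sketch of the two functions, here is word count written in the Hadoop Streaming style, with the framework's shuffle & sort simulated by a plain sort (our illustration, not code from the slides):

```python
from itertools import groupby

def mapper(lines):
    # map(key, value) -> (key', value')*: emit (word, 1) for every word
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # reduce(key', [value'+]) -> (key'', value'')*: sum the counts per word;
    # the framework delivers pairs grouped by key, so groupby suffices
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Simulate the shuffle & sort the framework performs between the two phases
documents = ["to be or not to be", "to do is to be"]
shuffled = sorted(mapper(documents))
counts = dict(reducer(shuffled))
print(counts["to"], counts["be"])  # 4 3
```

In a real Streaming job the two functions live in separate scripts reading stdin and writing tab-separated pairs to stdout; only the sort between them is done by Hadoop.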
MapReduce
• Programmers specify two functions:
★ map (k, v) → <k', v'>*
★ reduce (k', v') → <k', v'>*
All values with the same key are reduced together.
• Usually, programmers also specify:
★ partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g. hash(k') mod n; it allows reduce operations for different keys to run in parallel.
★ combine (k', v') → <k', v'>*
Mini-reducers that run in memory after the map phase, used as an optimization to reduce network traffic.
• Implementations: Google has a proprietary implementation in C++; Hadoop is an open-source implementation in Java.
[Diagram: map tasks transform input (k, v) pairs into intermediate (k', v') pairs; "Shuffle and Sort" aggregates values by key; reduce tasks produce the final output pairs.]
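The default partitioner is just the hash(k') mod n from the bullet above; a minimal sketch (Hadoop actually uses the key's deterministic Java hashCode, while Python's string hash is randomized per process but stable within one run, which is enough for the illustration):

```python
def partition(key, num_reducers):
    # hash(k') mod n: every value for a given key lands on the same reducer,
    # while different keys can be reduced in parallel
    return hash(key) % num_reducers

# The same key always maps to the same partition within a run
buckets = [partition(k, 4) for k in ["index.html", "start.html", "frank"]]
print(all(0 <= b < 4 for b in buckets))  # True
```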
MapReduce daemons
• JobTracker: the Master; it schedules jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.
• TaskTracker: the Worker; it runs on each slave node and executes (multiple) Mappers and Reducers, each in its own JVM.
[Diagram redrawn from (Dean and Ghemawat, OSDI 2004): (1) the user program forks the Master and the workers; (2) the Master assigns map and reduce tasks; (3) map workers read their input splits and (4) write intermediate files to local disk; (5) reduce workers remotely read those intermediate files and (6) write the final output files.]
How do we get data to the workers?
[Diagram: compute nodes fetching data from NAS/SAN storage over the network]
What's the problem here?
HDFS daemons
• NameNode: the Master; it keeps the filesystem metadata in memory together with the file-block-node mapping, decides replication and block placement, and collects heart-beats from the nodes.
• DataNode: the Slave; it stores the blocks (64MB) of the files and serves reads and writes directly.
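The 64MB block size makes the NameNode's bookkeeping easy to estimate; a back-of-the-envelope sketch (the 3-way replication factor is a common default, our assumption here):

```python
import math

BLOCK_SIZE = 64 * 1024 ** 2   # the 64MB block size from the slide
REPLICATION = 3               # common default replication factor (assumption)

def block_count(file_size):
    # The NameNode keeps one in-memory entry per block; the last block may be partial
    return math.ceil(file_size / BLOCK_SIZE)

one_tb = 1024 ** 4
print(block_count(one_tb))                # 16384 blocks for a 1TB file
print(block_count(one_tb) * REPLICATION)  # 49152 block replicas across the DataNodes
```

Large blocks are the point: a 1TB file costs the NameNode only ~16K metadata entries, which is why many small files are expensive (each file costs at least one block entry regardless of size).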
GFS: Design Decisions
• Files stored as chunks: fixed size (64MB)
• Reliability through replication: each chunk replicated across 3+ chunkservers
• Single master to coordinate access and keep metadata: simple centralized management
• No data caching: little benefit due to large data sets and streaming reads
• Simplified API: push some of the issues onto the client
[Diagram redrawn from (Ghemawat et al., SOSP 2003): the application's GFS client sends (file name, chunk index) to the GFS master, which holds the file namespace (e.g. /foo/bar → chunk 2ef0) and replies with (chunk handle, chunk location); the client then asks a GFS chunkserver for (chunk handle, byte range) and receives the chunk data, which chunkservers store on their local Linux file systems.]
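The client-side translation from a byte offset to a (file name, chunk index) request is simple arithmetic over the fixed chunk size (a sketch of the idea, not the real GFS client API):

```python
CHUNK_SIZE = 64 * 1024 ** 2  # fixed 64MB chunks

def chunk_request(file_name, byte_offset):
    # The client computes (file name, chunk index), asks the master for the
    # (chunk handle, chunk location), then reads the byte range directly
    # from a chunkserver, bypassing the master for the data itself.
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return file_name, chunk_index, offset_in_chunk

print(chunk_request("/foo/bar", 200 * 1024 ** 2))
# ('/foo/bar', 3, 8388608): byte 200MB falls 8MB into the fourth chunk
```

Keeping data transfer off the master is what lets a single centralized master scale: it only ever serves small metadata lookups.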
Transparent to the programmer
• Worker-to-data assignment
• Map/Reduce task assignment to nodes
• Management of synchronization
• Management of communication
• Fault-tolerance and restarts
Take home recipe
• Scan-based computation (no random I/O)
• Big datasets
• Divide-and-conquer class algorithms
• No communication between tasks
Not good for
• Real-time / Stream processing
• Graph processing
• Computation without locality
• Small datasets
Questions?
Baseline solution
What we attacked
• You don’t want to parse the file many times
• You don’t want to re-calculate the norm
• You don’t want to compute products with zero entries (0*n)
Our solution
line format: <string> <norm> [<col> <value>]*

The dense matrix (one row per word):
0    1.3  0    0    7.1  1.1
1.2  0    0    0    0    3.4
0    5.7  0    0    1.1  2
5.1  0    0    4.6  0    10
0    0    0    1.6  0    0

Each row keeps only its nonzero entries (1.3 7.1 1.1 / 1.2 3.4 / 5.7 1.1 2 / 5.1 4.6 10 / 1.6), together with its precomputed norm.

for example: cat 12.1313 0 5.1 3 4.6 5 10
(the fourth row: norm ≈ 12.1313, then <col> <value> pairs for the nonzero columns 0, 3 and 5)
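Producing that line format takes a few lines of Python; the fourth matrix row reproduces the cat example (the word label is from the slide, the function name is ours):

```python
import math

def to_sparse_line(word, row):
    # <string> <norm> [<col> <value>]*: keep only nonzero columns, and
    # precompute the norm so it is never recalculated during similarity jobs
    nonzero = [(col, val) for col, val in enumerate(row) if val != 0]
    norm = math.sqrt(sum(val * val for _, val in nonzero))
    pairs = " ".join(f"{col} {val:g}" for col, val in nonzero)
    return f"{word} {norm:.4f} {pairs}"

print(to_sparse_line("cat", [5.1, 0, 0, 4.6, 0, 10]))
# cat 12.1314 0 5.1 3 4.6 5 10  (the slide's 12.1313 truncates rather than rounds)
```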
Benchmarking
• serial Python (single core): 7 minutes
• Java+Hadoop (single core): 2 minutes
• serial Python (big file): 18 days
• Java+Hadoop (parallel, big file): 8 hours
• it makes sense: 18 days ÷ 3.5 ≈ 5.14 days; ÷ 14 ≈ 8.8 hours
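The last bullet is a sanity check: 3.5x is the single-core speedup observed on the small file (7 min vs 2 min), and 14 is the assumed number of concurrent tasks in the cluster run:

```python
days_serial = 18             # serial Python on the big file
speedup_single_core = 7 / 2  # 7 min vs 2 min on the small file -> 3.5x
parallel_tasks = 14          # assumed number of concurrent Hadoop tasks

hours = days_serial * 24 / speedup_single_core / parallel_tasks
print(round(hours, 1))  # 8.8, in line with the observed 8 hours
```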