hadoop introduction berlin buzzwords 2011
DESCRIPTION
Hadoop introduction berlin buzzwords 2011 from Berlin Buzzwords 2011TRANSCRIPT
An IntroductionKai Voigt, Cloudera Inc
Berlin Buzzwords, June 6th 2011
Montag, 6. Juni 2011
Big Data
• Capture
• Storage
• Search
• Analytics
Montag, 6. Juni 2011
hadoop.apache.org
Google Filesystem (GFS)
Hadoop Distributed Filesystem (HDFS)
MapReduce MapReduce
Montag, 6. Juni 2011
Hadoop Distributed Filesystem
• Easy Access
• Distributed
• Redundant
• Scalable
Montag, 6. Juni 2011
File Splits
Block1
Block2
Block3
Block4
Block5
Block6 ... Block
100
Large File
Block 101
64M 64M 64M 64M 64M 64M 64M 40M
6440M
Montag, 6. Juni 2011
Block Placement
Node 1 Node 2 Node 3 Node 4 Node 5
Block1
Block1
Block1
Block2
Block3
Block2
Block3
Block2
Block3
Montag, 6. Juni 2011
MapReduceBlock
1Input Block2
Block3
Block4
map()Inter-
mediateData
Output
reduce()
Montag, 6. Juni 2011
WordCount Example
map (offset, line) { foreach word in line { emit (word, 1) }}
Montag, 6. Juni 2011
WordCount Example
reduce (word, count[]) { total = 0; foreach number in count[] { total += number } emit (word, total)}
Montag, 6. Juni 2011
( , 1)
map()
( , 1)( , 1)( , 1)( , 1)
( , 1)( , 1)( , 1)
( , 1)( , 1)( , 1)( , 1)
Montag, 6. Juni 2011
Sort & Shuffle( , 1)( , 1)( , 1)( , 1)
( , 1)( , 1)( , 1)( , 1)
( , 1)( , 1)( , 1)( , 1)
( , (1,1))( , (1,1))
( , (1))
( , (1,1,1,1))( , (1,1))
Montag, 6. Juni 2011
( , 4)( , 2)( , 1)
( , 2)( , 2)
reduce()
( , (1,1))( , (1,1))
( , (1))
( , (1,1,1,1))( , (1,1))
Montag, 6. Juni 2011
Use Case: Recommendations
• "People looking as this article also looked at these articles"
• "You might also know these people"
• iTunes Genius Playlist
• Banner Placement
Montag, 6. Juni 2011
Use Case: Text Processing
• Document Indexing
• Semantic Analytics
Montag, 6. Juni 2011
Use Case: Machine Learning
• Spam vs No Spam
• Credit Card Fraud
• "People of Interest"
Information
DataDataData
Montag, 6. Juni 2011
Use Case: Graphs
• Shortest Paths
• Bottleneck Nodes
• Flow Optimization
• Spanning Trees
• Spanning Routes
Montag, 6. Juni 2011
Hadoop Ecosystem
Hive & Pig
HBase
Sqoop
Flume
Oozie
Mahout
High Level Language Access
Real Time Access
SQL to/from Hadoop
Distributed Data Collection
Job Workflow
Machine Learning Library
many more
Montag, 6. Juni 2011
Homework
• Cloudera's Distribution including Hadoop (CDH3)
• Online Tutorials
• WordCount Example
• Conference Demo Cluster
Montag, 6. Juni 2011
Quick Demo!
Montag, 6. Juni 2011
Thank you!
• Kai Voigt
• http://www.cloudera.com/
• http://apache.hadoop.org/
Montag, 6. Juni 2011