hadoop introduction berlin buzzwords 2011

20
An Introduction Kai Voigt, Cloudera Inc Berlin Buzzwords, June 6th 2011 Montag, 6. Juni 2011

Upload: iammutex

Post on 19-May-2015

3.060 views

Category:

Technology


0 download

DESCRIPTION

Hadoop introduction berlin buzzwords 2011 from Berlin Buzzwords 2011

TRANSCRIPT

Page 1: Hadoop introduction   berlin buzzwords 2011

An IntroductionKai Voigt, Cloudera Inc

Berlin Buzzwords, June 6th 2011

Montag, 6. Juni 2011

Page 2: Hadoop introduction   berlin buzzwords 2011

Big Data

• Capture

• Storage

• Search

• Analytics

Montag, 6. Juni 2011

Page 3: Hadoop introduction   berlin buzzwords 2011

hadoop.apache.org

Google Filesystem (GFS)

Hadoop Distributed Filesystem (HDFS)

MapReduce MapReduce

Montag, 6. Juni 2011

Page 4: Hadoop introduction   berlin buzzwords 2011

Hadoop Distributed Filesystem

• Easy Access

• Distributed

• Redundant

• Scalable

Montag, 6. Juni 2011

Page 5: Hadoop introduction   berlin buzzwords 2011

File Splits

Block1

Block2

Block3

Block4

Block5

Block6 ... Block

100

Large File

Block 101

64M 64M 64M 64M 64M 64M 64M 40M

6440M

Montag, 6. Juni 2011

Page 6: Hadoop introduction   berlin buzzwords 2011

Block Placement

Node 1 Node 2 Node 3 Node 4 Node 5

Block1

Block1

Block1

Block2

Block3

Block2

Block3

Block2

Block3

Montag, 6. Juni 2011

Page 7: Hadoop introduction   berlin buzzwords 2011

MapReduceBlock

1Input Block2

Block3

Block4

map()Inter-

mediateData

Output

reduce()

Montag, 6. Juni 2011

Page 8: Hadoop introduction   berlin buzzwords 2011

WordCount Example

map (offset, line) { foreach word in line { emit (word, 1) }}

Montag, 6. Juni 2011

Page 9: Hadoop introduction   berlin buzzwords 2011

WordCount Example

reduce (word, count[]) { total = 0; foreach number in count[] { total += number } emit (word, total)}

Montag, 6. Juni 2011

Page 10: Hadoop introduction   berlin buzzwords 2011

( , 1)

map()

( , 1)( , 1)( , 1)( , 1)

( , 1)( , 1)( , 1)

( , 1)( , 1)( , 1)( , 1)

Montag, 6. Juni 2011

Page 11: Hadoop introduction   berlin buzzwords 2011

Sort & Shuffle( , 1)( , 1)( , 1)( , 1)

( , 1)( , 1)( , 1)( , 1)

( , 1)( , 1)( , 1)( , 1)

( , (1,1))( , (1,1))

( , (1))

( , (1,1,1,1))( , (1,1))

Montag, 6. Juni 2011

Page 12: Hadoop introduction   berlin buzzwords 2011

( , 4)( , 2)( , 1)

( , 2)( , 2)

reduce()

( , (1,1))( , (1,1))

( , (1))

( , (1,1,1,1))( , (1,1))

Montag, 6. Juni 2011

Page 13: Hadoop introduction   berlin buzzwords 2011

Use Case: Recommendations

• "People looking as this article also looked at these articles"

• "You might also know these people"

• iTunes Genius Playlist

• Banner Placement

Montag, 6. Juni 2011

Page 14: Hadoop introduction   berlin buzzwords 2011

Use Case: Text Processing

• Document Indexing

• Semantic Analytics

Montag, 6. Juni 2011

Page 15: Hadoop introduction   berlin buzzwords 2011

Use Case: Machine Learning

• Spam vs No Spam

• Credit Card Fraud

• "People of Interest"

Information

DataDataData

Montag, 6. Juni 2011

Page 16: Hadoop introduction   berlin buzzwords 2011

Use Case: Graphs

• Shortest Paths

• Bottleneck Nodes

• Flow Optimization

• Spanning Trees

• Spanning Routes

Montag, 6. Juni 2011

Page 17: Hadoop introduction   berlin buzzwords 2011

Hadoop Ecosystem

Hive & Pig

HBase

Sqoop

Flume

Oozie

Mahout

High Level Language Access

Real Time Access

SQL to/from Hadoop

Distributed Data Collection

Job Workflow

Machine Learning Library

many more

Montag, 6. Juni 2011

Page 18: Hadoop introduction   berlin buzzwords 2011

Homework

• Cloudera's Distribution including Hadoop (CDH3)

• Online Tutorials

• WordCount Example

• Conference Demo Cluster

Montag, 6. Juni 2011

Page 19: Hadoop introduction   berlin buzzwords 2011

Quick Demo!

Montag, 6. Juni 2011

Page 20: Hadoop introduction   berlin buzzwords 2011

Thank you!

• Kai Voigt

[email protected]

• http://www.cloudera.com/

• http://apache.hadoop.org/

Montag, 6. Juni 2011