
DESCRIPTION

In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.

TRANSCRIPT

Page 1: Hadoop Overview kdd2011

Modeling with Hadoop

Vijay K. Narayanan Principal Scientist, Yahoo! Labs, Yahoo!

Milind Bhandarkar Chief Architect, Greenplum Labs, EMC2

Page 2

Session 1: Overview of Hadoop

• Motivation

•  Hadoop

• Map-Reduce

•  Distributed File System

• Next Generation MapReduce

• Q & A

Page 3

Session 2: Modeling with Hadoop

• Types of learning in MapReduce

• Algorithms in MapReduce framework

• Data parallel algorithms

• Sequential algorithms

• Challenges and Enhancements


Page 4

Session 3: Hands On Exercise

•  Spin-up Single Node Hadoop cluster in a Virtual Machine

• Write a regression trainer

•  Train model on a dataset


Page 5

Overview of Apache Hadoop

Page 6

Hadoop At Yahoo! (Some Statistics)

•  40,000 + machines in 20+ clusters

•  Largest cluster is 4,000 machines

•  170 Petabytes of storage

•  1000+ users

•  1,000,000+ jobs/month

Page 7

(Slide graphic: "Behind Every Click")

Page 8

Page 9

Who Uses Hadoop ?

Page 10

Why Hadoop ?

Page 11

Big Datasets (Data-Rich Computing theme proposal, J. Campbell et al., 2007)

Page 12

Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)

Page 13

Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)

Page 14

Motivating Examples

Page 15

Yahoo! Search Assist

Page 16

Search Assist

•  Insight: Related concepts appear close together in text corpus

•  Input: Web pages

•  1 Billion Pages, 10K bytes each

•  10 TB of input data

• Output: List(word, List(related words))

Page 17

// Input: List(URL, Text)
foreach URL in Input :
    Words = Tokenize(Text(URL));
    foreach word in Words :
        Insert (word, Next(word, Words)) in Pairs;
        Insert (word, Previous(word, Words)) in Pairs;
// Result: Pairs = List(word, RelatedWord)
Group Pairs by word;
// Result: List(word, List(RelatedWords))
foreach word in GroupedPairs :
    Count RelatedWords in GroupedPairs;
// Result: List(word, List(RelatedWords, count))
foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    Choose Top 5 Pairs;
// Result: List(word, Top5(RelatedWords))

Search Assist
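The pseudocode above can be sketched as runnable Python on a toy corpus. The function name `related_words`, the sample `pages` dict, and the whitespace tokenization are illustrative assumptions, not part of the original tutorial:

```python
from collections import Counter, defaultdict

def related_words(pages, top_n=5):
    """For each word, count its immediate neighbors, then keep the top N."""
    counts = defaultdict(Counter)
    for text in pages.values():
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            if i + 1 < len(tokens):
                counts[word][tokens[i + 1]] += 1   # Next(word, Words)
            if i > 0:
                counts[word][tokens[i - 1]] += 1   # Previous(word, Words)
    # Group by word and choose the top-N related words by count
    return {w: [r for r, _ in c.most_common(top_n)] for w, c in counts.items()}

pages = {"url1": "apache hadoop mapreduce", "url2": "apache hadoop streaming"}
print(related_words(pages))
```

On the full 10 TB corpus this same per-word counting is what the Map/Shuffle/Reduce phases distribute across the cluster.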

Page 18

People You May Know

Page 19

People You May Know

•  Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith

•  if you don’t know Joe Smith already

• Numbers:

•  100 MM users

•  Average connections per user is 100

Page 20

// Input: List(UserName, List(Connections))
foreach u in UserList :                      // 100 MM
    foreach x in Connections(u) :            // 100
        foreach y in Connections(x) :        // 100
            if (y not in Connections(u)) :
                Count(u, y)++;               // 1 Trillion Iterations
    Sort (u, y) in descending order of Count(u, y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;

People You May Know
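A runnable Python version of the friend-of-friend counting above, on a toy graph. The function name `people_you_may_know` and the sample `conns` graph are illustrative:

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """For each user, count friends-of-friends who are not already friends."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:                      # each connection of u
            for y in connections.get(x, []):   # each connection of x
                if y != u and y not in friends:
                    counts[y] += 1             # one more mutual friend
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions

conns = {
    "ann": ["bob", "carl"],
    "bob": ["ann", "joe"],
    "carl": ["ann", "joe"],
    "joe": ["bob", "carl"],
}
print(people_you_may_know(conns))
```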

Page 21

Performance

•  101 Random accesses for each user

•  Assume 1 ms per random access

•  100 ms per user

•  100 MM users

•  100 days on a single machine
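The slide's arithmetic can be checked directly; all values are taken from the bullets above:

```python
# Back-of-envelope check of the single-machine estimate
users = 100_000_000        # 100 MM users
accesses_per_user = 101    # own connection list + 100 friends' lists
ms_per_access = 1          # 1 ms per random access

total_ms = users * accesses_per_user * ms_per_access
days = total_ms / 1000 / 3600 / 24
print(f"~{days:.0f} days on a single machine")
```

This comes to about 117 days, which the slide rounds down to 100.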

Page 22

MapReduce Paradigm

Page 23

Map & Reduce

•  Primitives in Lisp (and other functional languages) since the 1970s

•  Google Paper 2004

•  http://labs.google.com/papers/mapreduce.html

Page 24

Output_List = Map (Input_List)

Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

Map

Page 25

Output_Element = Reduce (Input_List)

Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385

Reduce
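Python has both primitives built in; a quick illustration of the two slides above (the snippet itself is not from the tutorial):

```python
from functools import reduce

# Map: apply a function to every list element independently
squares = list(map(lambda x: x * x, range(1, 11)))
print(squares)    # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# Reduce: fold the list down to a single value
total = reduce(lambda a, b: a + b, squares)
print(total)      # 385
```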

Page 26

Parallelism

• Map is inherently parallel

•  Each list element processed independently

•  Reduce is inherently sequential

•  Unless processing multiple lists

•  Grouping to produce multiple lists

Page 27

// Input: http://hadoop.apache.org
Pairs = Tokenize_And_Pair ( Text ( Input ) )

Output = {(apache, hadoop) (hadoop, mapreduce) (hadoop, streaming) (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline) (hadoop, java) (DFS, namenode) (datanode, block) (replication, default)...}

Search Assist Map

Page 28

// Input: GroupedList (word, GroupedList(words))
CountedPairs = CountOccurrences (word, RelatedWords)

Output = {(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ...}

Search Assist Reduce

Page 29

Issues with Large Data

• Map Parallelism: Chunking input data

•  Reduce Parallelism: Grouping related data

•  Dealing with failures & load imbalance

Page 30

Page 31

Apache Hadoop

•  January 2006: Subproject of Lucene

•  January 2008: Top-level Apache project

•  Stable Version: 0.20.203

•  Latest Version: 0.22 (Coming soon)

Page 32

Apache Hadoop

•  Reliable, Performant Distributed file system

• MapReduce Programming framework

•  Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...

Page 33

Problem: Bandwidth to Data

•  Scan 100TB Datasets on 1000 node cluster

•  Remote storage @ 10MB/s = 165 mins

•  Local storage @ 50-200MB/s = 33-8 mins

• Moving computation is more efficient than moving data

• Need visibility into data placement
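A quick check of the slide's numbers (assuming 1 GB = 1000 MB for simplicity):

```python
# Scan 100 TB across 1,000 nodes: time depends on where the bytes are read
dataset_gb = 100 * 1000              # 100 TB expressed in GB
nodes = 1000
gb_per_node = dataset_gb / nodes     # 100 GB to scan on each node

for mb_per_s in (10, 50, 200):       # remote storage vs. local disk speeds
    minutes = gb_per_node * 1000 / mb_per_s / 60
    print(f"{mb_per_s:>3} MB/s -> {minutes:.0f} min")
```

Remote storage at 10 MB/s works out to about 167 minutes (the slide rounds to 165); local disk at 50–200 MB/s gives 33 down to about 8 minutes.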

Page 34

Problem: Scaling Reliably

•  Failure is not an option, it’s a rule!

•  1000 nodes, MTBF < 1 day

•  4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM)

• Need fault tolerant store with reasonable availability guarantees

•  Handle hardware faults transparently

Page 35

Hadoop Goals

•  Scalable: Petabytes (10^15 bytes) of data on thousands of nodes

•  Economical: Commodity components only

•  Reliable

•  Engineering reliability into every application is expensive

Page 36

Hadoop MapReduce

Page 37

Think MapReduce

•  Record = (Key, Value)

•  Key : Comparable, Serializable

•  Value: Serializable

•  Input, Map, Shuffle, Reduce, Output

Page 38

cat /var/log/auth.log* | \
grep "session opened" | \
cut -d' ' -f10 | \
sort | \
uniq -c > ~/userlist

Seems Familiar ?

Page 39

Map

•  Input: (Key1, Value1)

• Output: List(Key2, Value2)

•  Projections, Filtering, Transformation

Page 40

Shuffle

•  Input: List(Key2, Value2)

• Output

•  Sort(Partition(List(Key2, List(Value2))))

•  Provided by Hadoop

Page 41

Reduce

•  Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

•  Aggregation
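The Input → Map → Shuffle → Reduce → Output flow of the last three slides can be simulated in a few lines of Python. The word-count example and all function names here are illustrative, not the tutorial's own code:

```python
from itertools import groupby

def map_phase(records):
    # (Key1, Value1) -> List(Key2, Value2): emit (word, 1) per token
    for _, text in records:
        for word in text.split():
            yield (word, 1)

def shuffle(pairs):
    # Sort and group values by key, as Hadoop's shuffle does
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    # (Key2, List(Value2)) -> (Key3, Value3): aggregate per key
    for key, values in grouped:
        yield key, sum(values)

records = [("doc1", "hadoop mapreduce"), ("doc2", "hadoop streaming")]
print(dict(reduce_phase(shuffle(map_phase(records)))))
# {'hadoop': 2, 'mapreduce': 1, 'streaming': 1}
```

In real Hadoop only the map and reduce functions are user code; the shuffle (sort and partition) is provided by the framework.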

Page 42

Hadoop Streaming

•  Hadoop is written in Java

•  Java MapReduce code is “native”

•  What about non-Java programmers?

•  Perl, Python, Shell, R

•  grep, sed, awk, uniq as Mappers/Reducers

•  Text Input and Output

Page 43

Hadoop Streaming

•  Thin Java wrapper for Map & Reduce Tasks

•  Forks actual Mapper & Reducer

•  IPC via stdin, stdout, stderr

•  Key.toString() \t Value.toString() \n

•  Slower than Java programs

•  Allows for quick prototyping / debugging
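A minimal sketch of what a streaming mapper/reducer pair might look like in Python, with the framework's sort step simulated in-process. The function names and sample input are hypothetical; in a real job, each function would be a standalone script reading stdin and writing stdout:

```python
from itertools import groupby

def mapper(lines):
    # What a streaming mapper.py would print: one "word \t 1" per token
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming reducers see input sorted by key; sum the counts per word
    parsed = (line.split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the pipeline: cat input | mapper.py | sort | reducer.py
mapped = sorted(mapper(["hadoop streaming", "hadoop java"]))
for out in reducer(mapped):
    print(out)
```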

Page 44

$ bin/hadoop jar hadoop-streaming.jar \
    -input in-files -output out-dir \
    -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2 "\t" $1}'

Hadoop Streaming

Page 45

Hadoop Distributed File System (HDFS)

Page 46

HDFS

•  Data is organized into files and directories

•  Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes

•  HDFS exposes block placement so that computation can be migrated to data

Page 47

HDFS

•  Blocks are replicated (default 3) to handle hardware failure

•  Replication for performance and fault tolerance (Rack-Aware placement)

•  HDFS keeps checksums of data for corruption detection and recovery

Page 48

HDFS

• Master-Worker Architecture

•  Single NameNode

• Many (Thousands) DataNodes

Page 49

HDFS Master (NameNode)

• Manages filesystem namespace

•  File metadata (i.e. “inode”)

• Mapping inode to list of blocks + locations

•  Authorization & Authentication

•  Checkpoint & journal namespace changes

Page 50

Namenode

• Mapping of datanode to list of blocks

• Monitor datanode health

•  Replicate missing blocks

•  Keeps ALL namespace in memory

•  60M objects (File/Block) in 16GB
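The last bullet implies a rough per-object memory footprint on the NameNode, which we can compute (assuming the slide means GiB):

```python
# Per-object memory on the NameNode, from the slide's figures
objects = 60_000_000                         # files + blocks in namespace
ram_bytes = 16 * 2**30                       # 16 GiB of heap
bytes_per_object = ram_bytes / objects
print(f"~{bytes_per_object:.0f} bytes per namespace object")
```

Around 286 bytes per file or block object, which is why keeping the entire namespace in memory caps how many files a single NameNode can manage.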

Page 51

Datanodes

•  Handle block storage on multiple volumes & block integrity

•  Clients access the blocks directly from data nodes

•  Periodically send heartbeats and block reports to Namenode

•  Blocks are stored as underlying OS’s files

Page 52

HDFS Architecture

Page 53

Next Generation MapReduce

Page 54

MapReduce Today (Courtesy: Arun Murthy, Hortonworks)

Page 55

Why ?

•  Scalability Limitations today

• Maximum cluster size: 4000 nodes

• Maximum Concurrent tasks: 40,000

•  Job Tracker SPOF

•  Fixed map and reduce containers (slots)

•  Punishes pleasantly parallel apps

Page 56

Why ? (contd)

• MapReduce is not suitable for every application

•  Fine-Grained Iterative applications

•  HaLoop: Hadoop in a Loop

• Message passing applications

•  Graph Processing

Page 57

Requirements

• Need scalable cluster resources manager

•  Separate scheduling from resource management

• Multi-Lingual Communication Protocols

Page 58

Bottom Line

• @techmilind: #mrng (MapReduce, Next Gen) is, in reality, #rmng (Resource Manager, Next Gen)

•  Expect different programming paradigms to be implemented

•  Including MPI (soon)

Page 59

Architecture (Courtesy: Arun Murthy, Hortonworks)

Page 60

The New World

•  Resource Manager

•  Allocates resources (containers) to applications

•  Node Manager

•  Manages containers on nodes

•  Application Master

•  Specific to paradigm, e.g. MapReduce application master, MPI application master, etc.

Page 61

Container

•  In current terminology: A Task Slot

•  Slice of the node’s hardware resources

•  # of cores, virtual memory, disk size, disk and network bandwidth, etc.

•  Currently, only memory usage is sliced

Page 62

Questions?