Asbury Hadoop Overview


Slides from a Hadoop workshop on 3/8/2014.


Page 1: Asbury Hadoop Overview

Hadoop Overview: Overview, HDFS and Map/Reduce

Page 2: Asbury Hadoop Overview

The Required Intro…
- SW Engineer who has worked as designer / developer on NoSQL (Mongo, Cassandra, Hadoop)

- Consultant – HBO, ACS, ACXIOM (GM), AT&T

- Specialize in SW Development, architecture and training

- Currently working with Cassandra, Storm, Kafka, Node.js and MongoDB

- [email protected]

Page 3: Asbury Hadoop Overview

What’s The Plan
• Intros, Agenda and software

• Hadoop Overview, Ecosystem and Use Cases

• HDFS – hadoop distributed file system

• Map/Reduce – how to create and use

• Hadoop Streaming and Pig

• A little bit about the supporting cast, the Hadoop Eco-System

Page 4: Asbury Hadoop Overview

And you…

What are your experiences and interests with Big Data, NoSQL, Hadoop and Software Development?

Page 6: Asbury Hadoop Overview


Big Data

• The reason for products like Hadoop is the emergence of Big Data.

• Big Data is not new, but it is more common now because we are more capable of collecting it.

• Big Data is not just about size. It is also about the frequency of delivery and the unstructured nature of the data.

• The RDBMS was about structure and definition. It fails to handle these data requirements in many cases.

Page 7: Asbury Hadoop Overview

The 3 V’s (Volume, Velocity, Variety)

Page 8: Asbury Hadoop Overview


Or is it 4?

Page 9: Asbury Hadoop Overview


NoSQL

• Not Only SQL. Doesn’t mean “No” SQL

• Category of products developed to handle big data demands.

• Built to handle distributed architecture

• Take advantage of large clusters of commodity hardware

• There are trade-offs; you have to live with the CAP Theorem

Page 10: Asbury Hadoop Overview


Hadoop & NoSQL

• Hadoop itself is not a NoSQL product

• HBase, which is part of the Hadoop eco-system, is a NoSQL product.

• Hadoop is a computing framework (a lot more to come) and typically runs on a cluster of machines

• Example NoSQL products: Cassandra, MongoDB, Riak, CouchDB and HBase.

Page 11: Asbury Hadoop Overview


Hadoop

Where does the name come from?

“The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”

• This is from Doug Cutting, inventor of Hadoop (also of Lucene, the search engine library, and, along with Mike Cafarella, Nutch, a web crawler, plus various other items). Hadoop was created to process the data generated by Nutch.

Page 12: Asbury Hadoop Overview

Hadoop In More Detail
• Hadoop is an open source framework for processing large amounts of data in batch.

• Designed from the ground up to scale out as you add hardware.

• Hadoop Core has three main components:
• Commons

• HDFS

• MapReduce

Page 13: Asbury Hadoop Overview

Components
• The first is called Commons and contains (as the name implies) common functionality.

• In addition, Hadoop has its own file system, HDFS, which is designed to be fault tolerant and supplies the cornerstone that lets Hadoop run on commodity hardware.

• The final component making up the Hadoop system is called MapReduce, and it implements the model that allows the data to be processed in parallel.

Page 14: Asbury Hadoop Overview

Storage Layer - HDFS
• Hadoop Distributed File System (aka HDFS, or Hadoop DFS)

• Runs on top of regular OS file system, typically Linux ext3

• Fixed-size blocks (64MB by default) that are replicated

• Write once, read many; optimized for streaming in and out
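As an illustration (my own sketch, not from the workshop slides; the path is a placeholder), the block size and replication factor can also be set per file from client code, using the FileSystem.create overload that takes them explicitly:

// Sketch: creating an HDFS file with an explicit replication factor and block size.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
                true, 4096, (short) 3, 64L * 1024 * 1024); // 3 replicas, 64MB blocks
        out.writeUTF("hello hdfs");
        out.close();
    }
}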

Page 15: Asbury Hadoop Overview

Execution Layer - MapReduce
• Responsible for running a job in parallel on many servers

• Handles re-trying a task that fails

• Validates that results are complete

• Jobs consist of special “map” and “reduce” operations

Page 16: Asbury Hadoop Overview

Hadoop Installation, Typical
• Has one “master” server - high quality, beefy box.
  • NameNode process - manages the file system
  • JobTracker process - manages tasks

• Has multiple “slave” servers - commodity hardware.
  • DataNode process - manages file system blocks on local drives
  • TaskTracker process - runs tasks on the server

• Uses high speed network between all servers

Page 17: Asbury Hadoop Overview

How does this look?

Page 18: Asbury Hadoop Overview

Eco-System
• Hadoop is a lot more than HDFS and MapReduce.

• There is a large (and ever-growing) eco-system built up around Hadoop.

• These tools are for making MapReduce less complex, monitoring and job control, adding persistence systems, and adding real-time capabilities.

Page 19: Asbury Hadoop Overview

Hadoop Eco-System
• Avro
  A serialization system for efficient, cross-language RPC and persistent data storage.
• Mahout
  A machine learning system.
• Pig
  A high-level query language (Pig Latin) that creates MapReduce jobs.
• Hive
  A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.

Page 20: Asbury Hadoop Overview

More Eco-System…
• HBase
  A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• ZooKeeper
  A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
• Sqoop
  A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.
• Oozie
  A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).

Page 21: Asbury Hadoop Overview

Hadoop 2
• A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. MR2 replaces the “classic” runtime in previous releases.

• HDFS federation, which partitions the HDFS namespace across multiple namenodes to support clusters with very large numbers of files.

• HDFS high-availability, which removes the namenode as a single point of failure by supporting standby namenodes for failover.

Page 22: Asbury Hadoop Overview

YARN
• YARN was added in Hadoop 2

• It is a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications

• YARN can run applications that do not follow the MapReduce model, unlike the original MapReduce model (MR1)

• YARN provides the daemons and APIs necessary to develop generic distributed applications of any kind, handles and schedules resource requests (such as memory and CPU) from such applications, and supervises their execution

Page 23: Asbury Hadoop Overview


On to HDFS…

Any questions on the Hadoop Overview?

Page 24: Asbury Hadoop Overview


HDFS

• Hadoop Distributed File System

• A single NameNode and a cluster of DataNodes

• Stores large files (GBs and up) across nodes

• Reliability is through replication; the default replication factor is 3.

Page 25: Asbury Hadoop Overview

NameNode
• The master, and in Hadoop 1 it is a SPOF.
  • In Hadoop 2, failover was added to remove the SPOF aspect

• Contains
  • The directory tree of all the files
  • Keeps track of where all the data resides across the cluster

• For really large clusters, tracking these files becomes an issue; Hadoop 2 added NameNode Federation to fix that.

Page 26: Asbury Hadoop Overview

Web UI
• http://localhost:50070/

• Can view information on the namenode and blocks, and view logs.

Page 27: Asbury Hadoop Overview

DataNode
• Files stored on HDFS are chunked and stored as blocks on DataNodes

• DataNodes manage the storage attached to the nodes they run on.

• Data never flows through the NameNode, only through DataNodes

Page 28: Asbury Hadoop Overview

Blocks
• Typically, file system blocks are a few kilobytes.

• HDFS uses a block size of typically 64MB or 128MB.

• The large size minimizes the cost of seeking data.

• Blocks of a file do not need to be on the same node, so large files in HDFS often span multiple nodes.

• >hadoop fsck / -files -blocks

Page 29: Asbury Hadoop Overview


Read Path

Page 30: Asbury Hadoop Overview


Write Path

Page 31: Asbury Hadoop Overview

HDFS Command Line
• You can interact with HDFS from the command line by use of “hadoop fs”

• >hadoop fs -help

• >hadoop fs -ls /

https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html

Page 32: Asbury Hadoop Overview

More command line
• Get some data
  • http://www.gutenberg.org/ebooks/20417
  • http://www.gutenberg.org/ebooks/5000
  • http://www.gutenberg.org/ebooks/4300

• Save to /tmp/gutenberg locally

Page 33: Asbury Hadoop Overview

Copy to HDFS
• hadoop fs -mkdir /tmp

• hadoop fs -mkdir /tmp/gutenberg

• hadoop fs -copyFromLocal /tmp/gutenberg/* /tmp/gutenberg

Check it:

hadoop fs -ls /tmp/gutenberg/
Or http://localhost:50075/

Page 34: Asbury Hadoop Overview

Copy From HDFS
• Let’s rename them

hadoop fs -mv /tmp/gutenberg/pg20417.txt /tmp/gutenberg/pg20417.txt.bak

hadoop fs -mv /tmp/gutenberg/pg4300.txt /tmp/gutenberg/pg4300.txt.bak

hadoop fs -mv /tmp/gutenberg/pg5000.txt /tmp/gutenberg/pg5000.txt.bak

• Then

hadoop fs -copyToLocal /tmp/gutenberg/*.bak /tmp/gutenberg

• And again

ls /tmp/gutenberg

Page 35: Asbury Hadoop Overview

Java API
• Quite simple, like normal Java file I/O

• In general you write against the FileSystem abstract class; you can use DistributedFileSystem

• A simple way to read from the file system is by using java.net.URL. For example:

InputStream in = new URL("hdfs://host/path").openStream();
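Note that java.net.URL only understands the hdfs:// scheme once Hadoop’s stream handler factory has been registered (once per JVM). A minimal sketch of my own (the host and path are assumptions, reusing the /tmp/gutenberg data from this workshop):

// Sketch: streaming an HDFS file to stdout via java.net.URL.
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class UrlCat {
    static {
        // May only be called once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL("hdfs://localhost/tmp/gutenberg/pg20417.txt").openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // copy the stream to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}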

Page 36: Asbury Hadoop Overview

Let’s look at some examples
• Listing Files

• Getting Status

• Copying to HDFS

• Copying from HDFS

• Viewing
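The live code from the workshop is not included in this transcript; the sketch below is my own guess at what such examples might look like using the FileSystem API (the paths reuse the /tmp/gutenberg data):

// Sketch: listing, status, copying to/from HDFS and viewing a file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExamples {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Listing files
        for (FileStatus s : fs.listStatus(new Path("/tmp/gutenberg"))) {
            System.out.println(s.getPath());
        }

        // Getting status
        FileStatus status = fs.getFileStatus(new Path("/tmp/gutenberg/pg20417.txt"));
        System.out.println(status.getLen() + " bytes, replication " + status.getReplication());

        // Copying to HDFS
        fs.copyFromLocalFile(new Path("/tmp/gutenberg/pg20417.txt"),
                new Path("/tmp/gutenberg/pg20417-copy.txt"));

        // Copying from HDFS
        fs.copyToLocalFile(new Path("/tmp/gutenberg/pg20417-copy.txt"),
                new Path("/tmp/pg20417-copy.txt"));

        // Viewing
        IOUtils.copyBytes(fs.open(new Path("/tmp/gutenberg/pg20417.txt")), System.out, 4096, true);
    }
}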

Page 37: Asbury Hadoop Overview


On to Map Reduce…

Any questions on HDFS?

Page 38: Asbury Hadoop Overview

Map Reduce
• Splits Processing Into Steps
  • Mapper
  • Reducer

• These are by nature easily parallelized and can take advantage of available hardware.

Page 39: Asbury Hadoop Overview

Some Key Concepts
• Key Value Pair: two pieces of data, exchanged between the Map and Reduce phases. Also sometimes called a TUPLE

• Map: the user-defined ‘map’ function in the MapReduce algorithm converts each input Key Value Pair to 0...n output Key Value Pairs

• Reduce: the user-defined ‘reduce’ function in the MapReduce algorithm converts each input Key plus all of its Values to 0...n output Key Value Pairs

• Group: a built-in operation that happens between Map and Reduce and ensures each Key passed to Reduce includes all of its Values
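To make these roles concrete, here is a word count sketch of my own (not taken from the workshop materials; the class names are hypothetical) using the new MapReduce API:

// Sketch: the classic word count Mapper and Reducer.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: each input line becomes 0...n (word, 1) pairs.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: each unique word plus all of its counts becomes one (word, total) pair.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}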

Page 40: Asbury Hadoop Overview


Here is how it all works together

Page 41: Asbury Hadoop Overview

Who Does What
• Map translates input keys and values to new keys and values

• System Groups each unique key with all its values

• Reduce translates the values of each unique key to new keys and values

Page 42: Asbury Hadoop Overview

Example
• Let’s look at combining English to foreign language translations

Page 43: Asbury Hadoop Overview


Divide and Conquer

Page 44: Asbury Hadoop Overview

JobTracker
• Is a single point of failure

• Determines # Mapper Tasks from file splits via InputFormat

• Uses predefined value for # Reducer Tasks

• Client applications use JobClient to submit jobs and query status

• On the command line, use hadoop job <commands>

• For the web status console, use http://jobtracker-server:50030/
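A rough sketch of my own (not from the slides) of a driver that configures and submits the hypothetical word count job from the earlier illustration, using the Hadoop 2-era Job API:

// Sketch: configuring and submitting a MapReduce job from a client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/gutenberg"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/gutenberg-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // blocks until the job finishes
    }
}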

Page 45: Asbury Hadoop Overview

Task Tracker
• Spawns each Task as a new child JVM

• Max # mapper and reducer tasks set independently

• Can pass child JVM opts via mapred.child.java.opts

• Can re-use JVM to avoid overhead of task initialization
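As a small illustration (the values are arbitrary assumptions, and these are the classic MR1 property names), these knobs are ordinary job configuration properties:

// Sketch: tuning child task JVMs through the job configuration (MR1-era property names).
import org.apache.hadoop.conf.Configuration;

public class TaskTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapred.child.java.opts", "-Xmx512m");   // heap for each child task JVM
        conf.set("mapred.job.reuse.jvm.num.tasks", "10"); // re-use each JVM for up to 10 tasks
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}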

Page 46: Asbury Hadoop Overview


Testing with MRUnit

Page 47: Asbury Hadoop Overview

How It Works
• MRUnit’s MapDriver and ReduceDriver are the key classes

• Configure which mapper is under test

• The input value

• The expected output key

• The expected output value

Page 48: Asbury Hadoop Overview

Let’s Look At An Example
• Testing A Mapper

• Testing A Reducer
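The worked example is not captured in this transcript; a minimal MRUnit sketch of my own (reusing the hypothetical WordCount classes from the earlier illustration) might look like:

// Sketch: MRUnit tests for a mapper and a reducer.
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void testMapper() throws Exception {
        // Configure the mapper under test, the input, and the expected output pairs.
        MapDriver.newMapDriver(new WordCount.WordCountMapper())
                .withInput(new LongWritable(0), new Text("cat cat dog"))
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("dog"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void testReducer() throws Exception {
        // The reducer sees a key with all of its values, and should emit the sum.
        ReduceDriver.newReduceDriver(new WordCount.WordCountReducer())
                .withInput(new Text("cat"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("cat"), new IntWritable(2))
                .runTest();
    }
}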

Page 49: Asbury Hadoop Overview

Thoughts
• Not for quick jobs

• Scripts / R programs are often better

• Doesn’t work well with a lot of small files in HDFS

• Doesn’t have that real-time aspect (see Apache Storm)

• Don’t let people tell you Hadoop is the answer for everything

• Hive & Pig are good alternatives (especially for SQL speakers)

Page 50: Asbury Hadoop Overview


That’s All

• Questions?

• Contact Me

[email protected]