Hadoop Masterclass from Big Data Partnership

Apache Hadoop Masterclass: An introduction to concepts & ecosystem for all audiences
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.

DESCRIPTION

Here are the slides from a presentation delivered by Big Data Partnership (@BigDataExperts). This masterclass on Hadoop is an hour-long version of the one-day course run by Big Data Partnership (http://www.bigdatapartnership.com/wp-content/uploads/2013/11/Hadoop-Masterclass.pdf). "This one-day course is designed to help both IT professionals and decision-makers understand the concepts and benefits of Apache Hadoop and how it can help them meet business goals. You will gain an understanding of the Hadoop technology stack, including MapReduce, HDFS, Hive, Pig, and HBase, with an initial introduction to Mahout and other common utilities." If you have any questions or would like to learn more about big data (including the consultancy, training and support we offer), please get in touch: contact [at] bigdatapartnership dot com.

TRANSCRIPT

Page 1

Apache Hadoop Masterclass: An introduction to concepts & ecosystem for all audiences

Page 2

Who We Are

Page 3

Agenda

1. What is Hadoop? -- Why is it so popular?

2. Hardware

3. Ecosystem Tools

4. Summary & Recap

Page 4

1. What is Hadoop? How is it different? Why is it so popular?

Page 5

1.1 First, a question… A conceptual design exercise

Page 6

What is Big Data? Why people are turning to Hadoop

[Figure: data size (MB → GB → TB → PB) plotted against data complexity (variety and velocity). ERP/CRM (megabytes to gigabytes): payables, payroll, inventory, contacts, deal tracking, sales pipeline. Web 2.0 (gigabytes to terabytes): web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce. Big Data (terabytes to petabytes): log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click streams, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video.]

Page 7

Problem: Legacy database experiencing huge growth in data volumes

[Figure: Scale UP. A large application database or data warehouse grows from TB toward PB, and the cost per GB escalates from €/GB to €€€€/GB as data volume rises. Axes: Data Volume, Performance, Cost.]

Page 8

How would you re-design from the basics?

Scale OUT: each of x nodes holds 1/x of the data.

✓ Commodity hardware
✓ Low, predictable cost per unit (€ per node)

Requirements: less cost, less complexity, less risk; more scalable, more reliable, more flexible.

How do we store data? Partition the data into chunks (e.g. key ranges 0-5, 6-10, 11-15, 16-20, 21-25) and replicate each chunk across nodes.
How do we query data? Run the query (Q) on every node in parallel.

✓ Distributed Filesystem
✓ Distributed Computation

Page 9

Engineering for efficiency & reliability: What issues exist for coordinating between nodes?

▪ Limited network bandwidth ⇒ do more locally before crossing the network
▪ Data loss / re-transmission ⇒ use reliable transfers
▪ Server death ⇒ make servers repeatable elsewhere; make nodes identity-less so any other can take over
▪ Failover scenarios ⇒ job co-ordination

These are our system requirements

Page 10

Engineering for efficiency & reliability: What issues exist within each node?

▪ Hardware: MTBF (Mean Time Between Failures) ⇒ expect failure; nodes have no identity and data is replicated
▪ Software crash ⇒ as above
▪ Disk seek time ⇒ avoid random seeks; read sequentially (around 10x faster than seeking)
▪ Disk contention ⇒ use multiple disks in parallel, with multiple cores

These are our system requirements

Page 11

Engineering for flexibility: What issues exist for a query interface?

▪ Schema not obvious up front ⇒ support both schema and schema-less operation
▪ Schema changes after implementation ⇒ as above, plus support for sparse/columnar storage
▪ Language may not suit all users ⇒ pluggable query frameworks and/or a programming API
▪ Language may not support advanced use cases ⇒ generic programming API

These are our system requirements

Page 12

To sum up:

▪ Original legacy assumptions of traditional enterprise systems:
  ▪ Hardware can be made reliable by spending more on it
  ▪ Each machine has a unique identity
  ▪ Datasets must be centralised

▪ New paradigm for data management with Hadoop:
  ▪ Distribute and duplicate chunks of each data file across many nodes
  ▪ Use local compute resources to process each chunk in parallel
  ▪ Minimise elapsed seek time
  ▪ Expect failure
  ▪ Handle failover elegantly, and automatically

Page 13

1.2 What is Hadoop? Architecture Overview

Page 14

What is Hadoop? Logical Architecture

[Diagram: logical architecture of the core distribution. Storage / data management: HDFS (Hadoop Distributed Filesystem). Computation / data analysis: MapReduce. User tools sit on top. Underneath, a distributed system: many nodes connected by a fast, private network.]

Page 15

What is Hadoop? Physical Architecture

[Diagram: physical architecture. Worker nodes each run a DataNode (storage), a TaskTracker (computation) and, for HBase, a RegionServer. Master services run on their own nodes: NameNode (HDFS), JobTracker (MapReduce), HMaster (HBase). User tools access the cluster via client libraries.]

Page 16

1.3 HDFS: Distributed Storage

Page 17

Storage: writing data

[Figure: writing data. Six nodes of 8 TB each are pooled into a single 48 TB HDFS filesystem; the written file is split into chunks and distributed across the nodes.]

Page 18

Storage: reading data

[Figure: reading data. The same six 8 TB nodes serve the file back out of the 48 TB HDFS pool.]
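To make the storage interface concrete, here is a minimal sketch using Hadoop's standard org.apache.hadoop.fs.FileSystem API. The file path is hypothetical, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration; the client code looks the same whether the pool is 48 TB or 48 PB.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Cluster location comes from the standard configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical path

        // Write: HDFS splits the file into blocks and replicates each block
        // across DataNodes; the client just sees an ordinary output stream.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello, hadoop");
        }

        // Read: the NameNode supplies block locations, and the client streams
        // each block from a (preferably nearby) DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}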

Page 19

1.4 MapReduce: Distributed Computation

Page 20

Computation

[Figure: computation. MapReduce ships the function F(x) to the nodes, so each node runs F(x) locally over its own slice of the 48 TB HDFS pool.]

Page 21

Computation: MapReduce
High-level MapReduce process

Input file (split into chunks) → Map → Shuffle & sort → Reduce → Output files (one per Reducer)

Page 22

Computation: MapReduce
High-level MapReduce process

Input (lines from the input file):

Then there were two cousins laid up; when the one should
be lamed with reasons and the other mad without
any. But is all this for your father? No, some of it is for my
child’s father. O, how full of briers is this working day
world! They are but burs, cousin, thrown upon thee in holiday
foolery: if we walk not in the trodden paths our very

Map (each mapper emits word,1 pairs):
the,1 other,1 … | were,1 the,1 … | this,1 some,1 … | father,1 this,1 … | they,1 upon,1 … | the,1 paths,1 …

Shuffle & Sort (values grouped by key):
father,{1} other,{1} paths,{1} some,{1} the,{1,1,1} they,{1} this,{1,1} upon,{1} where,{1}

Reduce (each group summed) → Output:
father,1 other,1 paths,1 some,1 the,3 they,1 this,2 upon,1 where,1
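Written against the classic org.apache.hadoop.mapreduce API, this word count is just a mapper, a reducer and a driver. A minimal sketch follows; the class names and the simple whitespace tokenisation are illustrative choices, not part of the original slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in the input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: the framework has already grouped the values by word; sum them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}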

Page 23

1.5 Key Concepts: Points to Remember

Page 24

Linear Scalability

[Figure: linear scalability. Adding nodes to the HDFS + MapReduce cluster grows storage capacity and compute capacity in direct proportion to cluster size, and compute time falls as the cluster grows; hardware, software and staffing costs, and therefore total € and €/TB, follow the same straight, predictable line.]

Page 25

HDFS for Storage: What do I need to remember?

Functional Qualities:

▪ Store many files into a large pooled filesystem

▪ Familiar interface: standard UNIX filesystem

▪ Organise by any arbitrary naming scheme: files/directories

Non-Functional Qualities:

▪ Files can be very very large (>> single disk)

▪ Distributed, fault-tolerant, reliable – replicas of chunks are stored

▪ High aggregate throughput: parallel reads/writes

▪ Scalable

Page 26

Computation: Not only MapReduce

▪ With Hadoop 2, the platform introduced YARN (Yet Another Resource Negotiator)

▪ Enhanced MapReduce:
  ▪ Makes MapReduce just one of many pluggable framework modules
  ▪ Existing MapReduce programs continue to run as before
  ▪ Previous MapReduce API versions run side-by-side

▪ Takes us way beyond MapReduce:
  ▪ Any new programming model can now be plugged into a running Hadoop cluster on demand
  ▪ Including: MPI, BSP, graph processing, and others…

Page 27

1.6 Why is it so popular? Key Differentiators

Page 28

Why is Hadoop so popular? The story so far

1. Scalability
  ▪ Democratisation: the ability to store and manage petabytes of files, and to analyse them across thousands of nodes, has never been available to the general population before
  ▪ Why should the HPC world care? More users = more problems solved for you by others
  ▪ Predictable scalability curve

2. Cost
  ▪ It can be done on cheap hardware, with minimal staff
  ▪ Free software

3. Ease of use
  ▪ Familiar filesystem interface
  ▪ Simple programming model

Page 29

Why is Hadoop so popular?

▪ What about flexibility?

▪ How does it compare to a traditional database system?
  ▪ Tables consisting of rows and columns => 2D
  ▪ Foreign keys linking tables together
  ▪ One-to-one, one-to-many and many-to-many relations

Page 30

Traditional Database Approach

▪ Data model tightly coupled to storage

▪ Have to model like this up-front before anything can be inserted

▪ Only choice of model

▪ May not be appropriate for your application

▪ We’ve grown up with this as the normal general-purpose approach, but is it really?

[Diagram: Customers → Orders → Order Lines, a fixed relational schema. SCHEMA ON WRITE.]

Page 31

Traditional database approach: Sometimes it doesn’t fit

▪ What if we want to store graph data?
  ▪ A network of nodes & edges
  ▪ e.g. a social graph

▪ Tricky!
  ▪ Need to store lots of self-references to far-away rows
  ▪ Slow performance
  ▪ What if we need changing attributes for different node types?
  ▪ Loss of intuitive schema design

Page 32

Traditional database approach: Sometimes it doesn’t fit

▪ What if we want to store a sparse matrix?
  ▪ To avoid a very complicated schema, we would need to store a table with NULL for all empty matrix cells

▪ Lots of NULLs:
  ▪ Wastes memory and disk space
  ▪ Very poor performance
  ▪ Loss of intuitive schema design

Page 33

Traditional database approach: Sometimes it doesn’t fit

▪ What if we want to store binary files from a scientific instrument… images… videos?
  ▪ We could store metadata in the first few columns, then pack the binary data into one huge column

▪ Problems:
  ▪ Inefficient
  ▪ Slow
  ▪ Might run out of space in the binary column
  ▪ No internal structure
  ▪ The database driver becomes a bottleneck

Page 34

Hadoop data model: Types of data

▪ Structured: pre-defined schema; highly relational. Example: a relational database system.
▪ Semi-structured: inconsistent structure; cannot be stored in rows and tables in a typical database. Examples: logs, tweets, sensor feeds.
▪ Unstructured: lacks structure, or parts of it lack structure. Examples: free-form text, reports, customer feedback forms.

SCHEMA ON READ

Page 35

Hadoop approach: A more flexible option?

▪ Just load your data files into the cluster as they are
  ▪ Don’t worry about the format right now
  ▪ Leave files in their native/application format

▪ Since they are now in a cluster with distributed computing power local to each storage shard anyway:
  ▪ Query the data files in place
  ▪ Bring the computation to the data
  ▪ This saves shipping data around

▪ Use a more flexible model: don’t worry about the schema up front, figure it out later (see the sketch below)

▪ Use alternative programming models to query the data: MapReduce is more abstract than SQL, so it can support a wider range of customisation

▪ Beyond that, MapReduce can be replaced altogether
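A minimal sketch of schema-on-read: the raw file sits in HDFS untouched, and a map function imposes whatever structure the query needs at read time. The pipe-delimited log format and field layout here are hypothetical, invented purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical input line: "2014-01-27|GET|/products/42|200"
// The schema (timestamp, method, URL, status) lives in this parser,
// not in the storage layer, so it can change without reloading any data.
public class StatusCodeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\|");
        if (fields.length == 4) {
            // Interpret the fourth field as an HTTP status code; count hits per code.
            context.write(new Text(fields[3]), ONE);
        }
        // Malformed lines are simply skipped: no load-time schema ever rejected them.
    }
}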

Page 36

3. Hardware Considerations: The 50,000 ft picture

Page 37

Hardware Considerations: The Basics

▪ Cluster infrastructure design is a complex topic

▪ A full treatment is beyond the scope of this talk

▪ But, essentially:
  ▪ Prefer cheap commodity hardware
  ▪ Failure is planned and expected; Hadoop will deal with it gracefully
  ▪ The only exception is a service called the “NameNode”, which coordinates the filesystem; for some distributions this requires a high-availability node

▪ The more memory, the better

Page 38

Hardware Considerations: Performance

▪ Try to pack as many disks and cores into a node as is reasonable:
  ▪ 8 or more disks
  ▪ Cheap SATA drives, NOT expensive SAS drives
  ▪ Cheap CPUs, not expensive server-grade high-performance chips

▪ Match one disk to one CPU core, generally
  ▪ So for 12 disks, get 12 CPU cores
  ▪ Adding more cores => add more memory

Page 39

Hardware Considerations: Bottlenecks

▪ You will hit bottlenecks in the following order:
  1. Disk I/O throughput
  2. Network throughput
  3. Running out of memory
  4. CPU speed

▪ Therefore money spent on expensive high-speed CPUs is wasted, and those CPUs increase power consumption costs

▪ Disk I/O is best addressed by following the one-disk-per-CPU-core ratio guideline

Page 40

4. Hadoop Tools: Ecosystem tools & utilities

Page 41

Hadoop Subprojects & Tools: Overview

[Diagram: the ecosystem stack.
▪ Distributed Storage (HDFS)
▪ Distributed Processing (MapReduce)
▪ Scripting (Pig)
▪ Query (Hive)
▪ Metadata Management (HCatalog)
▪ NoSQL Column DB (HBase)
▪ Data Extraction & Load (Sqoop, WebHDFS, 3rd-party tools)
▪ Workflow & Scheduling (Oozie)
▪ Machine Learning & Predictive Analytics (Mahout)
▪ Management & Monitoring (Ambari, ZooKeeper, Nagios, Ganglia)]

Page 42

Pig: What do I need to remember?

▪ High-level dataflow scripting tool

▪ Best for exploring data quickly or prototyping

▪ When you load data, you can:
  ▪ Specify a schema at load time
  ▪ Not specify one and let Pig guess
  ▪ Not specify one and tell Pig later, once you work it out

▪ Pig eats any kind of data (hence the name "Pig")

▪ Pluggable load/save adapters are available (e.g. text, binary, HBase)

Page 43

Pig: Sample script

> A = LOAD 'student' USING PigStorage() AS (name:chararray, year:int, gpa:float);
> DUMP A;
(John,2005,4.0F)
(John,2006,3.9F)
(Mary,2006,3.8F)
(Mary,2007,2.7F)
(Bill,2009,3.9F)
(Joe,2010,3.8F)

> B = GROUP A BY name;
> C = FOREACH B GENERATE group, AVG(A.gpa);
> STORE C INTO 'grades' USING PigStorage();

Page 44

Hive: What do I need to remember?

▪ Data Warehouse-like front end & SQL for Hadoop

▪ Best for managing known data with fixed schema

▪ Stores data in HDFS and metadata in a local database (MySQL)

▪ Supports two table types:
  ▪ Internal tables (Hive manages the data files & format)
  ▪ External tables (you manage the data files & format)

▪ Not fully safe to point inexperienced native-SQL users at

Page 45

Hive: Sample script

> CREATE TABLE page_view(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User')
> COMMENT 'This is the page view table'
> PARTITIONED BY(dt STRING, country STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> STORED AS SEQUENCEFILE;

> SELECT * FROM page_view WHERE viewTime > 1400;

> SELECT * FROM page_view
> JOIN users ON page_view.userid = users.userid
> ORDER BY users.subscription_date;

Page 46

Oozie: What do I need to remember?

▪ Workflow engine & scheduler for Hadoop

▪ Best for building repeatable production workflows of common Hadoop jobs

▪ Great for composing smaller jobs into larger more complex ones

▪ Stores job metadata in a local database (MySQL)

▪ Complex recurring schedules easy to create

▪ Some counter-intuitive cluster setup steps required, due to the way job actions for Hive and Pig are launched

Page 47

Sqoop: What do I need to remember?

▪ Data import/export tool for Hadoop + external database

▪ Suitable for importing/exporting flat simple tables

▪ Suitable for importing arbitrary query as table

▪ Does not have any configuration or state

▪ Can easily overload a remote database if too much parallelism is requested (think DDoS attack)

▪ Can load directly to/from Hive tables as well as HDFS files

Page 48

5. Summary

Page 49

Summary: Hadoop as a concept

▪ Driving factors:
  ▪ Why it was designed the way it was
  ▪ Why that is important to you as a user
  ▪ What makes it popular
  ▪ What it isn’t so suitable for

Page 50

Summary: Hadoop as a platform

▪ Key concepts:
  ▪ Distributed Storage
  ▪ Distributed Computation
  ▪ Bring the computation to the data
  ▪ Variety in data formats (store files, analyse in place later)
  ▪ Variety in data analysis (more abstract programming models)
  ▪ Linear Scalability

Page 51

Summary: Hadoop Ecosystem

▪ Multiple vendors
  ▪ Brings choice and competition
  ▪ Feature differentiators
  ▪ Variety of licensing models

▪ Rich set of user tools
  ▪ Exploration
  ▪ Querying
  ▪ Workflow & ETL
  ▪ Distributed Database
  ▪ Machine Learning

Page 52

Further Training

▪ Apache Hadoop 2.0 Developing Java Applications (4 days)

▪ Apache Hadoop 2.0 Development for Data Analysts (4 days)

▪ Apache Hadoop 2.0 Operations Management (3 days)

▪ MapR Hadoop Fundamentals of Administration (3 days)

▪ Apache Cassandra DevOps Fundamentals (3 days)

▪ Apache Hadoop Masterclass (1 day)

▪ Big Data Concepts Masterclass (1 day)

▪ Machine Learning at scale with Apache Mahout (1 day)

Page 53

Contact Details

Tim Seears, CTO, Big Data Partnership
[email protected]
@BigDataExperts