How to Set Up a Hadoop Cluster Using Oracle Solaris (Hands-On Lab)

How to Set Up a Hadoop Cluster with Oracle Solaris [HOL10182]
Orgad Kimchi, Principal Software Engineer

Post on 05-Dec-2014

DESCRIPTION

This is the slide deck from the Oracle OpenWorld 2013 Hands-On Lab "How to Set Up a Hadoop Cluster Using Oracle Solaris": http://www.oracle.com/technetwork/systems/hands-on-labs/hol-setup-hadoop-solaris-2041770.html

TRANSCRIPT

Page 1: How to Set Up a Hadoop Cluster Using Oracle Solaris (Hands-On Lab)

How to Set Up a Hadoop Cluster with Oracle Solaris [HOL10182]
Orgad Kimchi
Principal Software Engineer

Copyright © 2013, Oracle and/or its affiliates. All rights reserved.

Disclaimer

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle Corporation.

Agenda

Lab Overview

Hadoop Overview

The Benefits of Using Oracle Solaris Technologies for a Hadoop Cluster

Lab Overview

In this hands-on lab we will present, and demonstrate through exercises, how to set up a Hadoop cluster using Oracle Solaris 11 technologies such as Zones, ZFS, DTrace, and network virtualization.

Key topics include the Hadoop Distributed File System and MapReduce.

We will also cover the Hadoop installation process and the cluster building blocks: NameNode, a secondary NameNode, and DataNodes.

Lab Overview – Cont’d

During the lab, users will learn how to load data into the Hadoop cluster and run a MapReduce job.

This hands-on training lab is for system administrators and others responsible for managing Apache Hadoop clusters in production or development environments.

Lab Main Topics

This hands-on lab consists of 13 exercises covering various Oracle Solaris and Apache Hadoop technologies:

1. Install Hadoop.

2. Edit the Hadoop configuration files.

3. Configure the Network Time Protocol.

4. Create the virtual network interfaces (VNICs).

5. Create the NameNode and the secondary NameNode zones.

6. Set up the DataNode zones.

7. Configure the NameNode.

8. Set up SSH.

9. Format HDFS from the NameNode.

10. Start the Hadoop cluster.

11. Run a MapReduce job.

12. Secure data at rest using ZFS encryption.

13. Use Oracle Solaris DTrace for performance monitoring.

What is Big Data

Big Data is both large and variable datasets and a new set of technologies:

• Large and variable datasets: extremely large files of unstructured or semi-structured data, and large, highly distributed datasets that are otherwise difficult to manage as a single unit of information

• A new set of technologies that can economically acquire, organize, store, analyze, and extract value from Big Data datasets, thus facilitating better, more informed business decisions

Data is Everywhere! Facts & Figures

• 234M Web sites; 7M new sites in 2010

• Facebook: 500M users, 40M photos per day, 30 billion new pieces of content per month

• New York Stock Exchange: 1 TB of data per day

• Web 2.0: 147M blogs and growing

• Twitter: 12 TB of data per day

Introduction To Hadoop

What is Hadoop?

• Based on work that originated at Google (2003) for generating search indexes and web scores

• Top-level Apache project that consists of two key services:

1. Hadoop Distributed File System (HDFS): highly scalable, fault-tolerant, and distributed

2. MapReduce API (Java), which can also be scripted in other languages

Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure.

Components of Hadoop

HDFS

HDFS is the file system responsible for storing data on the cluster.

• Written in Java (based on Google’s GFS)

• Sits on top of a native file system (ext3, ext4, xfs, etc.)

• POSIX-like file permissions model

• Provides redundant storage for massive amounts of data

• Optimized for large, streaming reads of files

The Five Hadoop Daemons

Hadoop comprises five separate daemons:

• NameNode: Holds the metadata for HDFS

• Secondary NameNode: Performs housekeeping functions for the NameNode

• DataNode: Stores actual HDFS data blocks

• JobTracker: Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker, and coordinates MapReduce stages

• TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks

Hadoop Architecture

MapReduce

Map:

– Accepts input key/value pair

– Emits intermediate key/value pair

Reduce:

– Accepts intermediate key/value* pair

– Emits output key/value pair

Data flow (diagram): Very big data → MAP → Partitioning Function → REDUCE → Result
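The partitioning function in the diagram decides which reducer receives each intermediate key. The following is a minimal Python sketch of that idea, not Hadoop's own implementation: the function names `partition` and `route` are hypothetical, and a CRC32 checksum stands in for whatever hash a real cluster would use. The only property it relies on is that equal keys always land on the same reducer.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map an intermediate key to a reducer index in [0, num_reducers)."""
    # Assumption: CRC32 is used only because it is deterministic across runs;
    # Python's built-in hash() for strings is salted per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Place each intermediate (key, value) pair in its reducer's bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the reducer index depends only on the key, every value emitted for a given key reaches the same reducer, which is what allows a reducer to see the complete list of values for that key.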

MapReduce Example

Counting word occurrences in a document:

how many chucks could a woodchuck chuck if a woodchuck could chuck wood

4 Node Map:

how,1 many,1 chucks,1 could,1 a,1 woodchuck,1 chuck,1 if,1 a,1 woodchuck,1 could,1 chuck,1 wood,1

Group by Key:

a,1:1 chuck,1:1 chucks,1 could,1:1 how,1 if,1 many,1 wood,1 woodchuck,1:1

2 Node Reduce (output):

a,2 chuck,2 chucks,1 could,2 how,1 if,1 many,1 wood,1 woodchuck,2
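The three stages of the word-count example can be traced in a few lines of Python. This is a single-process sketch for illustration only: it does not use the Hadoop API, and the function names (`map_words`, `group_by_key`, `reduce_counts`, `word_count`) are hypothetical. On a real cluster the map and reduce steps would run as separate tasks on different nodes.

```python
from collections import defaultdict

DOCUMENT = "how many chucks could a woodchuck chuck if a woodchuck could chuck wood"


def map_words(text):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in text.split()]


def group_by_key(pairs):
    """Group by key: collect all values emitted for the same word."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped):
    """Reduce: sum the values for each word, in key order."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


def word_count(text):
    return reduce_counts(group_by_key(map_words(text)))


if __name__ == "__main__":
    for word, count in word_count(DOCUMENT).items():
        print(f"{word},{count}")
```

Running the script prints one `word,count` pair per line, matching the reduce output listed in the example (for instance `a,2` and `woodchuck,2`).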

MapReduce Functions

MapReduce partitions data into 64 MB chunks (default)

Distributes data and jobs across thousands of nodes

Tasks scheduled based on location of data

Master writes periodic checkpoints

If a map worker fails, the master restarts the job on a new node

Barrier - no reduce can begin until all maps are complete

HDFS manages data replication for redundancy

MapReduce library does the hard work for us!
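To get a feel for the numbers, the short Python sketch below works out how many 64 MB chunks an input of a given size is split into, and how much raw disk space the replicated data occupies. The helper names are hypothetical, and the replication factor of 3 is an assumption that does not appear in the slides; treat both results as back-of-the-envelope estimates rather than what a particular cluster will report.

```python
BLOCK_SIZE_MB = 64       # default chunk size stated in the slide
REPLICATION_FACTOR = 3   # assumption: not stated in the slides


def num_chunks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of chunks needed to hold the file (the last one may be partial)."""
    return (file_size_mb + block_size_mb - 1) // block_size_mb


def raw_storage_mb(file_size_mb: int, replication: int = REPLICATION_FACTOR) -> int:
    """Disk space consumed across the cluster once every block is replicated."""
    return file_size_mb * replication


if __name__ == "__main__":
    one_tb_in_mb = 1024 * 1024
    print(num_chunks(one_tb_in_mb))      # 16384 chunks for a 1 TB input
    print(raw_storage_mb(one_tb_in_mb))  # 3145728 MB of raw storage
```

A 1 TB input therefore yields 16,384 chunks, which gives a rough idea of how much parallel work is available to spread across the nodes.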

RDBMS compared to MapReduce

                Traditional RDBMS            MapReduce

Data size       Gigabytes                    Petabytes

Access          Interactive and batch        Batch

Updates         Read and write many times    Write once, read many times

Structure       Static schema                Dynamic schema

Integrity       High                         Low

Scaling         Nonlinear                    Linear

The benefits of using Oracle Solaris technologies for a Hadoop cluster

Architecture Layout

Oracle Solaris Zones Benefits

The benefits of using Oracle Solaris Zones for a Hadoop cluster:

• Fast provisioning of new cluster members using the Solaris Zones cloning feature

• Very high network throughput between the zones for DataNode replication

Oracle Solaris ZFS Benefits

The benefits of using Oracle Solaris ZFS for a Hadoop cluster:

• Immense data capacity: a 128-bit file system, well suited to big datasets

• Optimized disk I/O utilization for better I/O performance with ZFS built-in compression

• Secure data at rest using ZFS encryption

The benefits of using Oracle Solaris technologies for a Hadoop cluster

• Multithread awareness - Oracle Solaris understands the correlation between cores and the threads, and it provides a fast and efficient thread implementation.

• DTrace - a comprehensive, advanced tracing tool for troubleshooting systemic problems in real time.

• SMF - allows you to build dependencies between Hadoop services (e.g., starting the MapReduce daemons after the HDFS daemons).
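The effect of declaring dependencies between services can be illustrated with a topological sort. The Python sketch below is not SMF and does not touch any Solaris API; it only shows how a start order follows from a dependency graph. The daemon names come from the slides, but the specific edges in `DEPENDENCIES` and the helper `start_order` are illustrative assumptions. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Each daemon maps to the set of daemons that must be running before it starts.
# Assumption: these edges are illustrative; a real deployment would declare
# its own dependencies in the service configuration.
DEPENDENCIES = {
    "NameNode": set(),
    "Secondary NameNode": {"NameNode"},
    "DataNode": {"NameNode"},
    "JobTracker": {"NameNode", "Secondary NameNode", "DataNode"},
    "TaskTracker": {"JobTracker"},
}


def start_order(dependencies):
    """Return one valid start order in which every daemon follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for daemon in start_order(DEPENDENCIES):
        print(daemon)
```

With these edges, every valid order starts the three HDFS daemons before the JobTracker and TaskTracker, which mirrors the example of starting the MapReduce daemons after the HDFS daemons.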
