Introduction to Hadoop


Page 1: Introduction to Hadoop

Copyright © 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Introduction to Hadoop

1010.01

SAP Open Source Summit - October 2010
Lars George – [email protected]

Page 2: Introduction to Hadoop

Traditional Large-Scale Computation

Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Significant amounts of complex processing performed on that data

For many years, the primary push was to increase the computing power of a single machine
– Faster processor, more RAM

Page 3: Introduction to Hadoop

Distributed Systems

Moore’s Law: roughly stated, processing power doubles every two years

Even that hasn’t always proved adequate for very CPU-intensive jobs

Distributed systems evolved to allow developers to use multiple machines for a single job
– MPI
– PVM
– Condor

MPI: Message Passing Interface
PVM: Parallel Virtual Machine

Page 4: Introduction to Hadoop

Distributed Systems: Problems

Programming for traditional distributed systems is complex
– Data exchange requires synchronization
– Only finite bandwidth is available
– Temporal dependencies are complicated
– It is difficult to deal with partial failures of the system

Ken Arnold, CORBA designer:
– “Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure”
– Developers spend more time designing for failure than they do actually working on the problem itself

CORBA: Common Object Request Broker Architecture

Page 5: Introduction to Hadoop

Distributed Systems: Data Storage

Typically, data for a distributed system is stored on a SAN
At compute time, data is copied to the compute nodes
Fine for relatively limited amounts of data

SAN: Storage Area Network

Page 6: Introduction to Hadoop

The Data-Driven World

Modern systems have to deal with far more data than was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded

Examples:
– Facebook: over 15PB of data
– eBay: over 10PB of data

Many organizations are generating data at a rate of terabytes (or more) per day

Page 7: Introduction to Hadoop

Data Becomes the Bottleneck

Getting the data to the processors becomes the bottleneck
A quick calculation:
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the processor: approximately 22 minutes!
– Assuming sustained reads
– Actual time will be worse, since most servers have less than 100GB of RAM available

A new approach is needed
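The slide's arithmetic can be checked with a quick sketch (Python here purely for illustration; the 75MB/sec transfer rate and 100GB of data are the slide's own assumptions):

```python
# Back-of-the-envelope check of the slide's numbers (illustrative, not a benchmark).
data_bytes = 100 * 1024**3          # 100 GB of data to move
rate_bytes_per_sec = 75 * 1024**2   # 75 MB/sec sustained disk read rate

seconds = data_bytes / rate_bytes_per_sec
minutes = seconds / 60
print(f"{minutes:.1f} minutes")     # prints "22.8 minutes"
```

This is why Hadoop moves the computation to the data rather than the data to the computation.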

Page 8: Introduction to Hadoop

Partial Failure Support

The system must support partial failure
– Failure of a component should result in a graceful degradation of application performance
– Not complete failure of the entire system

Page 9: Introduction to Hadoop

Data Recoverability

If a component of the system fails, its workload should be assumed by still-functioning units in the system
– Failure should not result in the loss of any data

Page 10: Introduction to Hadoop

Component Recovery

If a component of the system fails and then recovers, it should be able to rejoin the system
– Without requiring a full restart of the entire system

Page 11: Introduction to Hadoop

Consistency

Component failures during execution of a job should not affect the outcome of the job

Page 12: Introduction to Hadoop

Scalability

Adding load to the system should result in a graceful decline in performance of individual jobs
– Not failure of the system

Increasing resources should support a proportional increase in load capacity

Page 13: Introduction to Hadoop

Apache Hadoop: Open Source Core Storage and Processing Engine

[Diagram: Apache Hadoop = MapReduce + Hadoop Distributed File System (HDFS)]

• Consolidates Everything
– Move complex and relational data into a single repository
• Stores Inexpensively
– Keep raw data always available
– Use commodity hardware
• Processes at the Source
– Eliminate ETL bottlenecks
– Mine data first, govern later

Page 14: Introduction to Hadoop

The Hadoop Project

Hadoop is an open-source project overseen by the Apache Software Foundation

Originally based on papers published by Google in 2003 and 2004

Hadoop committers work at several different organizations– Including Facebook, Yahoo!, Cloudera

Page 15: Introduction to Hadoop

Origin of Hadoop: how does an elephant sneak up on you?

Timeline, 2002–2010:
– Open source web crawler project created by Doug Cutting
– Google publishes the MapReduce and GFS papers
– Open source MapReduce & HDFS project created by Doug Cutting
– A 4,000-node Hadoop cluster runs in production
– Hadoop wins the terabyte sort benchmark
– SQL support for Hadoop launches
– CDH3 and Cloudera Enterprise are released

Page 16: Introduction to Hadoop

What is Hadoop?

A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)

Scalable data processing engine
– Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
– MapReduce: fault-tolerant distributed processing

Key value
– Flexible -> store data without a schema, and add one later as needed
– Affordable -> cost / TB at a fraction of traditional options
– Broadly adopted -> a large and active ecosystem
– Proven at scale -> dozens of petabyte-plus implementations in production today

Page 17: Introduction to Hadoop

Apache Hadoop: The Importance of Being Open

No Lock-In - investments in skills, services & hardware are preserved regardless of vendor choice

Rich Ecosystem - dozens of complementary software, hardware and services firms

Community Development - Hadoop & related projects are expanding at a rapid pace

Page 18: Introduction to Hadoop

Hadoop in the Enterprise

[Diagram: logs, files and web data flow into Hadoop; Hadoop feeds the enterprise data warehouse, relational databases and web applications; analysts, business users and customers access the data via enterprise reporting tools and Hive/Pig]

Page 19: Introduction to Hadoop

Apache Hadoop - Not Without Challenges

Ease of use – command line interface only; data import and access requires development skills

Complexity – more than a dozen different components, all with different versions, dependencies and patch requirements

Manageability – Hadoop is challenging to configure, upgrade, monitor and administer

Interoperability – limited support for popular databases and analytical tools

Page 20: Introduction to Hadoop

Hadoop Components

Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce

There are many other projects based around core Hadoop
– Often referred to as the ‘Hadoop Ecosystem’
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.

A set of machines running HDFS and MapReduce is known as a Hadoop cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several thousand
– More nodes = better performance!

Page 21: Introduction to Hadoop

Hadoop Components: HDFS

HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster

Data is split into blocks and distributed across multiple nodes in the cluster
– Each block is typically 64MB or 128MB in size

Each block is replicated multiple times
– Default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability
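The splitting-and-replication scheme can be sketched as follows. This is a simplified model: the file size, block size and node names are invented for illustration, and real HDFS uses a rack-aware placement policy rather than this naive round-robin one.

```python
import math

# Illustrative sketch of HDFS-style block splitting and replication.
# A 1 GB file with 128 MB blocks and a replication factor of 3.
file_size_mb = 1024
block_size_mb = 128
replication = 3
nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

num_blocks = math.ceil(file_size_mb / block_size_mb)

# Place each block's replicas on `replication` distinct nodes (round-robin).
placement = {}
for b in range(num_blocks):
    placement[f"block-{b}"] = [nodes[(b + r) % len(nodes)] for r in range(replication)]

print(num_blocks)                # 8 blocks
print(num_blocks * replication)  # 24 stored block replicas in total
```

Losing any single node leaves at least two replicas of every block, which is what gives HDFS its reliability and availability.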

Page 22: Introduction to Hadoop

Hadoop Components: MapReduce

MapReduce is the system used to process data in the Hadoop cluster

Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the sort and shuffle

Each Map task operates on a discrete portion of the overall dataset
– Typically one HDFS block of data

After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase

Page 23: Introduction to Hadoop

HDFS Basic Concepts

HDFS is a filesystem written in Java
– Based on Google’s GFS

Sits on top of a native filesystem
– ext3, XFS, etc.

Provides redundant storage for massive amounts of data
– Using cheap, unreliable computers

Page 24: Introduction to Hadoop

HDFS Basic Concepts (cont’d)

HDFS performs best with a ‘modest’ number of large files
– Millions, rather than billions, of files
– Each file typically 100MB or more

Files in HDFS are ‘write once’
– No random writes to files are allowed
– Append support is available in Cloudera’s Distribution for Hadoop (CDH) and in Hadoop 0.21

HDFS is optimized for large, streaming reads of files
– Rather than random reads

Page 25: Introduction to Hadoop

MapReduce Execution

Page 26: Introduction to Hadoop

MapReduce Example: Word Count

Count the number of occurrences of each word in a large amount of input data
– This is the ‘hello world’ of MapReduce programming

map(String input_key, String input_value)
  foreach word w in input_value:
    emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)

Page 27: Introduction to Hadoop

MapReduce Example: Word Count (cont’d)

Input to the Mapper:

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

Output from the Mapper:

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

Page 28: Introduction to Hadoop

MapReduce Example: Word Count (cont’d)

Intermediate data sent to the Reducer:

('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

Final Reducer output:

('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
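The whole word-count flow can be simulated in a few lines of Python: map, then sort/shuffle (grouping values by key), then reduce. This is an in-memory sketch only; a real Hadoop job distributes these phases across the cluster.

```python
from collections import defaultdict

# Minimal in-memory simulation of the MapReduce word-count flow.
def map_phase(records):
    """Emit (word, 1) for every word in every input line."""
    for _key, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the sort-and-shuffle stage does."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the list of counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

records = [(3414, "the cat sat on the mat"),
           (3437, "the aardvark sat on the sofa")]
result = reduce_phase(shuffle(map_phase(records)))
print(result["the"])  # 4
print(result["sat"])  # 2
```

Running this reproduces the final reducer output shown above, e.g. ('the', 4) and ('on', 2).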

Page 29: Introduction to Hadoop

Basic Cluster Configuration

Page 30: Introduction to Hadoop

What is common across Hadoop-able problems?

Nature of the data:
– Complex data
– Multiple data sources
– Lots of it

Nature of the analysis:
– Batch processing
– Parallel execution
– Spread data over a cluster of servers and take the computation to the data

Page 31: Introduction to Hadoop

What Analysis is Possible With Hadoop?

Text mining

Index building

Graph creation and analysis

Pattern recognition

Collaborative filtering

Prediction models

Sentiment analysis

Risk assessment

Page 32: Introduction to Hadoop

Benefits of Analyzing With Hadoop

• Analysis that was previously impossible or impractical becomes feasible

• Analysis conducted at lower cost

• Analysis conducted in less time

• Greater flexibility

Page 33: Introduction to Hadoop

1. Modeling True Risk

Page 34: Introduction to Hadoop

1. Modeling True Risk

Solution with Hadoop
• Source, parse and aggregate disparate data sources to build a comprehensive data picture
– e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
– Sentiment analysis, graph creation, pattern recognition

Typical Industry
• Financial Services (Banks, Insurance)

Page 35: Introduction to Hadoop

2. Customer Churn Analysis

Page 36: Introduction to Hadoop

2. Customer Churn Analysis

Solution with Hadoop
• Rapidly test and build behavioral models of customers from disparate data sources
• Structure and analyze with Hadoop
– Traversing
– Graph creation
– Pattern recognition

Typical Industry
• Telecommunications, Financial Services

Page 37: Introduction to Hadoop

3. Recommendation Engine

Page 38: Introduction to Hadoop

3. Recommendation Engine

Solution with Hadoop
• Batch processing framework
– Allows execution in parallel over large datasets
• Collaborative filtering
– Collecting ‘taste’ information from many users
– Using that information to predict what similar users will like

Typical Industry
• Ecommerce, Manufacturing, Retail
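As a toy illustration of the collaborative-filtering idea mentioned above (the similarity measure, user names and items here are invented for the sketch; production recommenders on Hadoop are far more sophisticated):

```python
# Minimal user-based collaborative filtering sketch (illustrative only).
# 'Taste' data: user -> set of liked items. We recommend the items liked by
# the most similar other user, with Jaccard similarity on liked-item sets.
def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def recommend(user, tastes):
    others = {u: s for u, s in tastes.items() if u != user}
    most_similar = max(others, key=lambda u: jaccard(tastes[user], others[u]))
    # Recommend what the most similar user likes that this user hasn't seen.
    return sorted(others[most_similar] - tastes[user])

tastes = {
    "alice": {"book-a", "book-b", "book-c"},
    "bob":   {"book-a", "book-b", "book-d"},
    "carol": {"book-e"},
}
print(recommend("alice", tastes))  # ['book-d'] -- bob is most similar to alice
```

In a MapReduce setting, the pairwise similarity computation is what gets parallelized across the cluster.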

Page 39: Introduction to Hadoop

4. Ad Targeting

Page 40: Introduction to Hadoop

4. Ad Targeting

Solution with Hadoop
• Data analysis can be conducted in parallel, reducing processing times from days to hours
• With Hadoop, as data volumes grow the only expansion cost is hardware
– Add more nodes without a degradation in performance

Typical Industry
• Advertising

Page 41: Introduction to Hadoop

5. Point of Sale Transaction Analysis

Page 42: Introduction to Hadoop

5. Point of Sale Transaction Analysis

Solution with Hadoop
• Batch processing framework
– Allows execution in parallel over large datasets
• Pattern recognition
– Optimizing over multiple data sources
– Using that information to predict demand

Typical Industry
• Retail

Page 43: Introduction to Hadoop

6. Analyzing Network Data to Predict Failure

Page 44: Introduction to Hadoop

6. Analyzing Network Data to Predict Failure

Solution with Hadoop
• Take the computation to the data
• Expand the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
– How previously discrete-seeming anomalies may, in fact, be interconnected
• Identify leading indicators of component failure

Typical Industry
• Utilities, Telecommunications, Data Centers

Page 45: Introduction to Hadoop

7. Threat Analysis

Page 46: Introduction to Hadoop

7. Threat Analysis

Solution with Hadoop
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e. threats

Typical Industry
• Security
• Financial Services
• General: spam fighting, click fraud

Page 47: Introduction to Hadoop

8. Trade Surveillance

Page 48: Introduction to Hadoop

8. Trade Surveillance

Solution with Hadoop
• Batch processing framework
– Allows execution in parallel over large datasets
• Pattern recognition
– Detect trading anomalies and harmful behavior

Typical Industry
• Financial Services
• Regulatory bodies

Page 49: Introduction to Hadoop

9. Search Quality

Page 50: Introduction to Hadoop

9. Search Quality

Solution with Hadoop
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
– Browsing patterns of users performing searches in different categories

Typical Industry
• Web
• Ecommerce

Page 51: Introduction to Hadoop

10. Data “Sandbox”

Page 52: Introduction to Hadoop

10. Data “Sandbox”

Solution with Hadoop
• With Hadoop, an organization can “dump” all of this data into an HDFS cluster
• Then use Hadoop to start trying out different analyses on the data
• See patterns or relationships that allow the organization to derive additional value from the data

Typical Industry
• Common across all industries

Page 53: Introduction to Hadoop

The Hadoop Ecosystem

Started as a stand-alone project
Turned into a thriving ecosystem
A full stack of tools covers a complete set of business problems, including:
– data aggregation and collection
– data input and output to legacy systems
– random access to data
– high-level query languages
– a workflow system
– a UI framework to enable easy access
– ...

Page 54: Introduction to Hadoop

Hive and Pig

Although MapReduce is very powerful, it can also be complex to master

Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code

Many organizations have programmers who are skilled at writing code in scripting languages

Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!

Cloudera offers a two-day course, Data Manipulation with Hive and Pig

Page 55: Introduction to Hadoop

HBase: ‘The Hadoop Database’

HBase is a column-oriented database layered on top of HDFS
– Based on Google’s Bigtable
– Provides interactive access to data

Can store massive amounts of data
– Multiple gigabytes, up to terabytes of data

High write throughput
– Thousands of writes per second (per node)

Copes well with sparse data
– Tables can have many thousands of columns
– Even if a given row only has data in a few of the columns

Has a constrained access model
– Limited to lookup of a row by a single key
– No transactions
– Single-row operations only
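The constrained access model can be pictured as a sparse, row-keyed map of column/value maps. This is a hypothetical in-memory sketch to illustrate the model, not the actual HBase API; the class and method names are invented.

```python
# Illustrative sketch of an HBase-like access model: lookup by a single row
# key, single-row puts and gets only, and sparse rows (rows need not share
# columns). Not the real HBase client API.
class SparseTable:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Single-row write: set one column of one row."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Single-row read: look up a row by its one key."""
        return self.rows.get(row_key, {})

t = SparseTable()
t.put("user42", "profile:name", "Lars")
t.put("user42", "metrics:visits", 7)
t.put("user99", "profile:name", "Amr")   # sparse: no metrics column here

print(t.get("user42")["profile:name"])   # Lars
```

Note what is absent by design: no joins, no secondary indexes, no multi-row transactions; that restriction is what lets the real system scale.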

Page 56: Introduction to Hadoop

Flume

Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
– Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated

Developed in-house by Cloudera, and released as open-source software

Design goals:
– Reliability
– Scalability
– Manageability
– Extensibility

Page 57: Introduction to Hadoop

ZooKeeper: Distributed Consensus Engine

ZooKeeper is a ‘distributed consensus engine’
– A quorum of ZooKeeper nodes exists
– Clients can connect to any node and be assured that they will receive the correct, up-to-date information
– Elegantly handles a node crashing

Used by many other Hadoop projects
– HBase, for example

Page 58: Introduction to Hadoop

Fuse-DFS

Fuse-DFS allows mounting HDFS volumes via the Linux FUSE filesystem
– Enables applications which can only write to a ‘standard’ filesystem to write to HDFS with no application modification

Caution: this does not imply that HDFS can be used as a general-purpose filesystem
– It is still constrained by HDFS limitations
– For example, no modifications to existing files

Useful for legacy applications which need to write to HDFS

FUSE: Filesystem in Userspace

Page 59: Introduction to Hadoop

Ganglia: Cluster Monitoring

Ganglia is an open-source, scalable, distributed monitoring product for high-performance computing systems
– Specifically designed for clusters of machines

Not, strictly speaking, a Hadoop project
– But very useful for monitoring Hadoop clusters

Collects, aggregates, and provides time-series views of metrics
Integrates with Hadoop’s metrics-collection system
Note: Ganglia doesn’t provide alerts
– But it is easily paired with alerting systems such as Nagios

Page 60: Introduction to Hadoop

Sqoop: Retrieving Data From RDBMSs

Sqoop: SQL to Hadoop
Extracts data from RDBMSs and inserts it into HDFS
– Also works the other way around

A command-line tool which works with any RDBMS
– Optimizations available for some specific RDBMSs

Generates Writable classes for use in MapReduce jobs
Developed at Cloudera, released as open source

Page 61: Introduction to Hadoop

Hue

Hue: The Hadoop User Experience
A graphical front-end to developer and administrator functionality
– Uses a Web browser as its front-end

Developed by Cloudera, released as open source
Extensible
– Publicly available API

Cloudera Enterprise includes extra functionality
– Advanced user management
– Integration with LDAP, Active Directory
– Accounting

Page 62: Introduction to Hadoop

About Cloudera

Cloudera is “the commercial Hadoop company”
Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo!
Provides consulting and training services for Hadoop users
Staff includes several committers to Hadoop projects

Page 63: Introduction to Hadoop

Cloudera Overview

Hadoop… Jeff Hammerbacher, Amr Awadallah, Doug Cutting
…meets enterprise: Mike Olson (CEO), Omer Trajman (VP, Customer Solutions), John Kreisa (VP, Marketing), Charles Zedlewski (Sr. Director, Product Management), Ed Albanese (Sr. Director, Business Development)

Investors: Greylock Partners, Accel Ventures
Product category: Data management
Business model: Cloudera offers software, support, training, and professional services
Employees: 40+
Customers: 40+
Headquarters: Palo Alto, California
Elevator pitch: Provider of the leading Hadoop platform for the enterprise
Vision: We enable organizations to profit from all of their data

Page 64: Introduction to Hadoop

Recent News & Awards

News
– Partnership with Netezza to integrate TwinFin with Hadoop
– Partnership with Quest to connect Oracle and Hadoop (Ora-Oop)
– Partnership with NTT DATA to accelerate Hadoop adoption in the APAC region
– Partnership with Teradata, Aster Data, EMC Greenplum and more
– Launched Cloudera Enterprise and Cloudera’s Distribution for Hadoop v3
– $25 million in Series C funding, led by Meritech Capital Partners

Awards
– Venture Capital Journal: #1 of 20 most promising startups
– Best Young Tech Entrepreneurs 2010: Jeff Hammerbacher
– Selected as an OnDemand 100 company

Page 65: Introduction to Hadoop

Cloudera’s Distribution for Hadoop, Version 3

• Open source – 100% Apache licensed
• Simplified – component versions & dependencies managed for you
• Integrated – all components & functions interoperate through standard APIs
• Reliable – patched with fixes from future releases to improve stability
• Supported – employs project founders and committers for >70% of components

[Diagram: CDH3 components, including Hue and the Hue SDK, Oozie, HBase, Flume, Sqoop, ZooKeeper, Hive and Pig]

The Industry’s Leading Hadoop Distribution

Page 66: Introduction to Hadoop

Cloudera Software (All Open-Source)

Cloudera’s Distribution for Hadoop (CDH)
– A single, easy-to-install package from the Apache Hadoop core repository
– Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version

Hue
– A browser-based tool for cluster administration and job development
– Supports managing internal clusters as well as those running on public clouds
– Helps decrease development time

Page 67: Introduction to Hadoop

Cloudera Enterprise

• Increases reliability and consistency of the Hadoop platform
• Improves Hadoop’s conformance to important IT policies and procedures
• Lowers the cost of management and administration

Enterprise Support and Management Tools

Page 68: Introduction to Hadoop

Cloudera Enterprise

Management Tools

Authorization Management and Provisioning
– Description: manage hierarchies of groups and related permissions, including files and applications, on a delegated basis; LDAP integration for compliance and streamlined provisioning
– Benefits: increased control (all user access and actions are logged for future analysis); improved security (passwords can be managed via LDAP vs. local passwords)

Resource Management
– Description: track resource consumption across groups and users for capacity planning and intercompany billing; manage quotas for groups
– Benefits: resource efficiency (better enable multiple groups to share a single cluster); reduced cost of administration (delegated group management and unification of file, quota and application administration work)

Integration Configuration and Monitoring
– Description: monitor and exception-handle event data integrations into Hadoop; set configurations and reliability levels for groups of incoming event data
– Benefits: lower cost of administration (easy to configure & update across hundreds of sources); greater reliability (identify trouble points early and often); visibility (analyze current and historical patterns for future optimization)

Page 69: Introduction to Hadoop

Cloudera Enterprise

Cloudera Enterprise
– A complete package of software and support
– Built on top of CDH
– Includes tools for monitoring the Hadoop cluster
  – Resource consumption tracking
  – Quota management
  – Etc.
– Includes management tools
  – Authorization management and provisioning
  – Integration with LDAP
  – Etc.
– 24 x 7 support

Page 70: Introduction to Hadoop

Cloudera Training

Current Curriculum
– Developer Training (3 days): hands-on training for developers who want to analyze their data but are new to Hadoop
– System Administrator Training (2 days): hands-on training for administrators who will be responsible for setting up, configuring and monitoring a Hadoop cluster
– HBase Training (2 days): for developers who want to use HBase to store data because they need one or more of (1) TB+ data sets, (2) low-latency random reads & writes, or (3) high throughput (e.g. tens of thousands of writes per second)
– Hive Training (2 days): for analysts or developers who want to use an SQL-like language to analyze data in Hadoop

Public classes and curriculum can be found at www.cloudera.com/events

Public and Private Classes

Page 71: Introduction to Hadoop

Cloudera Professional Services

Cloudera Solution Architects (SAs) deliver expertise putting Hadoop in production
– Function as tech leads and trainers
– Can code and prototype, but the goal is to enable customers to do the work

Services include
– Architecture review
– Good practices for developing and operationalizing Hadoop
– Customer-specific code and templates
– Custom integrations and patches

Enable and Accelerate Hadoop Deployments

Page 72: Introduction to Hadoop

Cloudera Professional Services

Provides consultancy services to many key users of Hadoop
– Including Netflix, Samsung, Trulia, ComScore, LinkedIn, University of Phoenix, …

Solutions Architects are experts in Hadoop and related technologies
– Several are committers to the Apache Hadoop project

Provides training in key areas of Hadoop administration and development
– Courses include Hadoop for System Administrators, Hadoop for Developers, Data Analysis with Hive and Pig, and Learning HBase
– Custom course development available
– Both public and on-site training available

Page 73: Introduction to Hadoop

Questions?