Hadoop Update
Big Data Analytics

May 23rd, 2012
Matt Mead, Cloudera

What is Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.

CORE HADOOP SYSTEM COMPONENTS

Hadoop Distributed File System (HDFS) – Self-healing, high-bandwidth clustered storage

MapReduce – Distributed computing framework

Provides storage and computation in a single, scalable system.

Why Use Hadoop?

Move beyond rigid legacy frameworks.

Hadoop handles any data type, in any quantity.

Structured, unstructured

Schema, no schema

High volume, low volume

All kinds of analytic applications

Hadoop is 100% Apache® licensed and open source.

No vendor lock-in

Community development

Rich ecosystem of related projects

Hadoop grows with your business.

Proven at petabyte scale

Capacity and performance grow simultaneously

Leverages commodity hardware to mitigate costs

Hadoop helps you derive the complete value of all your data.

Drives revenue by extracting value from data that was previously out of reach

Controls costs by storing data more affordably than any other platform

The Need for CDH

1. The Apache Hadoop ecosystem is complex
– Many different components, lots of moving parts
– Most companies require more than just HDFS and MapReduce
– Creating a Hadoop stack is time-consuming and requires specific expertise
  • Component and version selection
  • Integration (internal & external)
  • System test w/end-to-end workflows

2. Enterprises consume software in a certain way
– System, not silo
– Tested and stable
– Documented and supported
– Predictable release schedule

Core Values of CDH

Storage

Computation

Integration

Coordination

Access

Components of the CDH Stack

A Hadoop system with everything you need for production use.

Coordination – APACHE ZOOKEEPER
Data Integration – APACHE FLUME, APACHE SQOOP
Fast Read/Write Access – APACHE HBASE
Languages / Compilers – APACHE PIG, APACHE HIVE, APACHE MAHOUT
Workflow – APACHE OOZIE
Scheduling – APACHE OOZIE
Metadata – APACHE HIVE
File System Mount – FUSE-DFS
UI Framework – HUE
SDK – HUE SDK
Core Hadoop – HDFS, MAPREDUCE

The Need for CDH

A set of open source components, packaged into a single system.

CORE APACHE HADOOP

HDFS – Distributed, scalable, fault tolerant file system

MapReduce – Parallel processing framework for large data sets

QUERY / ANALYTICS

Apache Hive – SQL-like language and metadata repository

Apache Pig – High level language for expressing data analysis programs

Apache HBase – Hadoop database for random, real-time read/write access

Apache Mahout – Library of machine learning algorithms for Apache Hadoop

DATA INTEGRATION

Apache Sqoop – Integrating Hadoop with RDBMS

Apache Flume – Distributed service for collecting and aggregating log and event data

Fuse-DFS – Module within Hadoop for mounting HDFS as a traditional file system

WORKFLOW / COORDINATION

Apache Oozie – Server-based workflow engine for Hadoop activities

Apache ZooKeeper – Highly reliable distributed coordination service

GUI / SDK

Hue – Browser-based desktop interface for interacting with Hadoop

CLOUD

Apache Whirr – Library for running Hadoop in the cloud
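
To make the "MapReduce – Parallel processing framework" entry above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using the classic word count. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and on a real cluster the framework would run the mappers and reducers on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # Map every line, group the emitted pairs by word, then reduce each group.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `word_count(["big data", "Big Hadoop"])` counts "big" twice and the other words once. Each line is mapped independently, which is what lets the work be spread across machines.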

Core Hadoop Use Cases

Two Core Use Cases Applied Across Verticals

VERTICAL | 1. ADVANCED ANALYTICS (INDUSTRY TERM) | 2. DATA PROCESSING (INDUSTRY TERM)
Web | Social Network Analysis | Clickstream Sessionization
Media | Content Optimization | Engagement
Telco | Network Analytics | Mediation
Retail | Loyalty & Promotions Analysis | Data Factory
Financial | Fraud Analysis | Trade Reconciliation
Federal | Entity Analysis | SIGINT
Bioinformatics | Sequencing Analysis | Genome Mapping

FMV & Image Processing

Data Processing – Full Motion Video & Image Processing

• Record by record -> Easy Parallelization

– “Unit of work” is important

– Raw data in HDFS

• Adaptation of existing image analyzers to Map Only / Map Reduce

• Scales horizontally

• Simple detections

– Vehicles

– Structures

– Faces
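
The "record by record" point above can be sketched as a map-only job: each frame is an independent unit of work, so there is no shuffle and no reducer. This is a plain Python illustration, not Hadoop code. The `detect` function is a hypothetical stand-in for an existing image analyzer, the brightness threshold is invented for the example, and a thread pool stands in for the cluster's parallel map tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def detect(frame):
    """Hypothetical analyzer: count bright pixels in one frame (one unit of work)."""
    frame_id, pixels = frame
    bright = sum(1 for p in pixels if p > 200)  # threshold is an assumption
    return frame_id, {"bright_pixels": bright, "flagged": bright >= 2}


def map_only_job(frames, workers=4):
    """Apply the analyzer to every frame independently; no reduce step is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the frames.
        return list(pool.map(detect, frames))
```

Because no frame depends on any other, adding workers (or nodes) increases throughput without changing the analyzer, which is the sense in which this use case scales horizontally.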

Cybersecurity Analysis

Advanced Analytics – Cybersecurity Analysis

• Rates and flows – ingest can exceed multiple gigabytes per second

• Can be complex because of mixed-workload clusters

• Typically involves ad-hoc analysis

– Question-oriented analytics

• “Productionized” use cases allow insight by non-analysts

• Existing open source solution: SHERPASURFING

– Focuses on the cybersecurity analysis underpinnings for common data-sets (pcap, netflow, audit logs, etc.)

– Provides a means to ask questions without reinventing all the plumbing
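
As one example of the question-oriented analysis described above, the sketch below answers "which source hosts sent the most bytes?" over netflow-like records. It is not SHERPASURFING's API; the record layout (`src`, `dst`, `bytes` keys) and the function name `top_talkers` are assumptions made for illustration.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Return the n source addresses with the highest total bytes sent."""
    totals = Counter()
    for flow in flows:
        # Aggregate bytes per source; in MapReduce terms, key = src, value = bytes.
        totals[flow["src"]] += flow["bytes"]
    return totals.most_common(n)
```

The same aggregate-by-key shape maps directly onto a MapReduce job or a Hive `GROUP BY`, so an analyst can rephrase the question without rebuilding the ingest and parsing plumbing.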

Index Preparation

Data Processing – Index Preparation

• Hadoop’s Seminal Use Case

• Dynamic Partitioning -> Easy Parallelization

• String Interning

• Inverted Index Construction

• Dimensional data capture

• Destination indices

– Lucene/Solr (and derivatives)

– Endeca

• Existing solution: USA Search (http://usasearch.howto.gov/)
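
The string interning and inverted index bullets above can be illustrated with a small in-memory sketch. It is an assumption about what those bullets mean in practice, not the pipeline used by any of the systems named here: `build_inverted_index` is a hypothetical name, tokenization is a simple whitespace split, and `sys.intern` stands in for whatever interning scheme a production job would use.

```python
import sys
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Interning keeps a single shared copy of each repeated term.
            index[sys.intern(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

In a MapReduce setting the mapper would emit (term, doc_id) pairs and the reducer would assemble each posting list, with the output then loaded into a destination index such as Lucene/Solr.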

Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone

• Begins as storage, light ingest processing, retrieval

• Capacity scales horizontally

• Schema-less -> holds arbitrary content

• Schema-less -> allows ad-hoc fusion and analysis

• Additional analytic workload forces decisions
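
The "schema-less" bullets above are often described as schema-on-read: raw records land unchanged, and each analysis projects only the fields it needs at read time. The sketch below assumes JSON lines as the landed format and uses a hypothetical helper name, `read_with_schema`; neither comes from the slides.

```python
import json


def read_with_schema(raw_lines, fields):
    """Project the requested fields from raw JSON lines at read time."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record come back as None instead of failing.
        yield {field: record.get(field) for field in fields}
```

Two analyses can read the same landed files with different field lists, which is what makes ad-hoc fusion possible without reloading the data.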

Data Landing Zone

Hadoop: Getting Started

• Reactive

– Forced by scale or cost of scaling

• Proactive

– Seek talent ahead of need to build

– Identify data-sets

– Determine high-value use cases that change organizational outcomes

– Start with 10-20 nodes and 10+TB unless data-sets are super-dimensional

• Either way

– Talent a major challenge

– Start with “Data Processing” use cases

– Physical infrastructure is complex; make the software infrastructure simple to manage

Customer Success

Self-Source Deployment vs. Cloudera Enterprise – 500 node deployment

[Chart: cost in $ millions ($1M to $5M) plotted against time required for production deployment in months (1 to 6)]

Option 1: Use Cloudera Enterprise
Estimated Cost: $2 million
Deployment Time: ~2 months

Option 2: Self-Source
Estimated Cost: $4.8 million
Deployment Time: ~6 months

Note: Cost estimates include personnel, software & hardware
Source: Cloudera internal estimates

Customer Success

Cloudera Enterprise Subscription vs. Self-Source

Item | Cloudera Enterprise | Self-Source or Contract
Support Offering | World-Class, Global, Dedicated Contributors and Committers | Must recruit, hire, train and retain Hadoop experts
Monitoring and Management | Fully Integrated application for Hadoop Intelligence | Must be developed and maintained in house
Support for the Full Hadoop Stack | Full Stack* | Unknown
Regular Scheduled Releases | Yearly Major, Quarterly Minor, Hot Fixes? | N/A
Training and Certification for the Full Hadoop Stack | Available Worldwide | None
Support for Full Lifecycle | All Inclusive Development through Production | Community support
Rich Knowledge-base | 500+ Articles | None
Production Solution Guides | Included | None

* Flume, FuseDFS, HBase, HDFS, Hive, Hue, Mahout, MR1, MR2, Oozie, Pig, Sqoop, Zookeeper

Contact Us

• Erin Hawley
– Business Development, Cloudera DoD Engagement
– [email protected]

• Matt Mead
– Sr. Systems Engineer, Cloudera Federal Engagements
– [email protected]
