Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics. Strata Conference, Sept 22nd 2011, New York, NY. Dr. Amr Awadallah, Founder, CTO, VP of Engineering [email protected], twitter: @awadallah


DESCRIPTION

Cloudera CTO Dr. Amr Awadallah explains how Apache Hadoop is revolutionizing the business intelligence and data analytics environment.

TRANSCRIPT

Page 1: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics

Strata Conference, Sept 22nd 2011, New York, NY

Dr. Amr Awadallah, Founder, CTO, VP of Engineering [email protected], twitter: @awadallah

Page 2: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Business Intelligence Before Adopting Apache Hadoop

Copyright © 2011, Cloudera, Inc. All Rights Reserved. 2

Storage Only Grid (original raw data)

Instrumentation

Collection

RDBMS (processed data)

BI Reports + Interactive Apps

Mostly Append

ETL Compute Grid

Moving Data To Compute Doesn’t Scale

Can’t Explore Original High Fidelity Raw Data

Archiving = Premature Data Death

Page 3: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Business Intelligence After Adopting Apache Hadoop


Hadoop: Storage + Compute Grid

Instrumentation

Collection

RDBMS

BI Reports + Interactive Apps

Complex Data Processing

Mostly Append

Data Exploration & Advanced Analytics

ETL and Aggregations

Keep Data Alive Forever

Page 4: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

So What is Apache Hadoop?

• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)

• Core Hadoop has two main components:

• Hadoop Distributed File System: self-healing high-bandwidth clustered storage

• MapReduce: fault-tolerant distributed processing

• Key business values:

• Flexible – Store any data, Run any analysis (Mine First, Govern Later)

• Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes

• Affordable – Cost per TB at a fraction of traditional options

• Open Source – No Lock-In, Rich Ecosystem, Large developer community

• Broadly adopted – A large and active ecosystem, Proven to run at scale


Page 5: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

The Main Benefit: Agility/Flexibility


Schema-on-Write (RDBMS):

• Schema must be created before data is loaded

• Explicit load operation has to take place which transforms data to database internal structure

• New columns must be added explicitly before data for such columns can be loaded into the database

Benefits:

• Read is Fast

• Standards/Governance

Schema-on-Read (Hadoop):

• Data is simply copied to the file store, no special transformation is needed

• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns

• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them

Benefits:

• Load is Fast

• Flexibility/Agility
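The schema-on-read idea can be illustrated with a small sketch. This is not Hive's SerDe API (which is a Java interface); the names `RAW_STORE`, `serde_v1`, `serde_v2` and `read_table` are hypothetical, and a Python list of JSON lines stands in for files in the file store. It only shows the behaviour described on the slide: data is stored untouched, columns are extracted at read time, and a field that was already flowing in becomes visible retroactively once the deserializer is updated.

```python
import json

# Raw events are kept exactly as they arrived (assumption: JSON lines in a
# list standing in for files in the file store). Nothing is validated on load.
RAW_STORE = [
    '{"user": "alice", "url": "/home"}',
    '{"user": "bob", "url": "/cart", "referrer": "/home"}',
]


def serde_v1(line):
    """Hypothetical deserializer: extracts only the columns it knows about."""
    record = json.loads(line)
    return {"user": record.get("user"), "url": record.get("url")}


def serde_v2(line):
    """Updated deserializer: additionally exposes the 'referrer' column."""
    record = json.loads(line)
    return {
        "user": record.get("user"),
        "url": record.get("url"),
        "referrer": record.get("referrer"),
    }


def read_table(store, serde):
    """Apply the schema at read time, one raw line at a time."""
    return [serde(line) for line in store]


rows_v1 = read_table(RAW_STORE, serde_v1)  # 'referrer' is stored but not visible
rows_v2 = read_table(RAW_STORE, serde_v2)  # same raw data, new column appears
```

Note that the raw store is never rewritten between the two reads; only the deserializer changes, which is why the load step stays cheap while each read pays the parsing cost.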

Page 6: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What is Complex Data Processing?

1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).

2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.

3. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.

4. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDe.

5. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
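As an illustration of option 2, here is a minimal word-count sketch in the style of a streaming job, where the mapper and reducer exchange tab-separated `key<TAB>value` text lines and the framework sorts by key in between. The function names and the `run_locally` helper are assumptions made for this example; in a real streaming job the mapper and reducer would be separate scripts reading standard input and printing to standard output, and the cluster would do the sorting.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    mapped = sorted(mapper(lines), key=lambda l: l.split("\t", 1)[0])
    return list(reducer(mapped))


result = run_locally(["Hadoop stores data", "Hadoop processes data"])
```

Because the contract is just lines of text, the same two functions could be written in any language, which is the trade-off the slide describes: convenience in exchange for the parsing overhead of a text interface.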


Page 7: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What This Means For You: Agility


Up Front Design vs. Just in Time

Page 8: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What This Means For You: Innovation


Data Committee vs. Data Scientist

Page 9: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What This Means For You: Consolidation


Silos vs. Sharing

Page 10: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What This Means For You: Extract Value from Latent Data


Archive to Tape vs. Keep Data Alive

Page 11: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Benefit #2: Scalability


What This Means For You: Ability to Grow Fluidly

Page 12: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What This Means For You: Data Beats Algorithm


Smarter Algos vs. More Data

Page 13: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Where Does Hadoop Fit in the Enterprise Data Stack?


[Diagram: Hadoop in the enterprise data stack. Labels shown: Logs, Files, Web Data; Enterprise Data Warehouse; Relational Databases; Low-Latency Serving Systems; Web Application; Enterprise Reporting; BI, Analytics; Cloudera Mgmt Suite; IDEs; Development Tools; ETL Tools; Business Intelligence Tools; Analysts; Business Users; Customers; Data Scientists; System Operators; Data Architects.]

Page 14: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Use The Right Tool For The Right Job


Relational Databases:

Use when:

• Interactive OLAP Analytics (<1sec)

• Multistep ACID Transactions

• 100% SQL Compliance

Hadoop:

Use when:

• Structured or Not (Agility)

• Scalability of Storage/Compute

• Complex Data Processing

Page 15: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Two Core Use Cases Common Across Many Industries


Industry: ADVANCED ANALYTICS use case / DATA PROCESSING use case

• Web: Social Network Analysis / Clickstream Sessionization

• Media: Content Optimization / Clickstream Sessionization

• Telco: Network Analytics / Mediation

• Retail: Loyalty & Promotions / Data Factory

• Financial: Fraud Analysis / Trade Reconciliation

• Federal: Entity Analysis / SIGINT

• Bioinformatics: Sequencing Analysis / Genome Mapping

• Manufacturing: Product Quality / Mfg Process Tracking

Page 16: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

CDH: Cloudera’s Distribution Including Apache Hadoop


• Open Source – 100% Apache licensed, 100% Open Source, 100% Free.

• Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA

• Integrated – All required component versions & dependencies are managed for you

• Industry Standard – Existing RDBMS, ETL and BI systems work best with it

• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc

Components by function:

• Coordination: ZOOKEEPER

• Data Integration: FLUME, SQOOP, ODBC

• Fast Read/Write Access: HBASE

• Languages / Compilers: PIG, HIVE

• Workflow: OOZIE

• Scheduling: OOZIE

• Metadata: HIVE

• UI Framework: HUE

• SDK: HUE SDK

Page 17: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

SCM Express: Simplifies Installation and Configuration


Service & Configuration Manager (SCM) Express takes the complexity out of deploying and configuring CDH.

Provision a complete Hadoop stack in minutes

Centrally manage system services through a user-friendly interface

Manages services for up to 50 nodes

FREE to download

KEY FEATURES

1. Automated, wizard-based installation of the complete Hadoop stack

2. Central, real-time dashboard for configuration management

3. Ability to configure the cluster while it’s running

4. Incorporates comprehensive validation and error checking

5. Automates the expansion of services to new nodes when they come online

Page 18: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What is Cloudera Enterprise?


Simplify and Accelerate Hadoop Deployment

Reduce Adoption Costs and Risks

Lower the Cost of Administration

Increase the Transparency & Control of Hadoop

Leverage the Experience of Our Experts

Cloudera Enterprise makes open source Apache Hadoop enterprise-easy

EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments

EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production

CLOUDERA ENTERPRISE COMPONENTS

• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration

• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera Enterprise

Page 19: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Hadoop World 2011

The largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.


• 1400 attendees, 25+ sponsors

• 60 sessions across 5 tracks for:

– Business Decision Makers

– Enterprise Architects

– IT Operators

– Data Scientists

– Developers

• Cloudera Training and Certification (November 7, 10, 11)

November 8-9, Sheraton New York Hotel & Towers, NYC

Learn more and register at www.hadoopworld.com

$50 discount for Strata attendees

Page 20: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

What I Would Like You To Remember:

• The Key Benefits of the Apache Hadoop Data Platform:

• Agility/Flexibility (Enables Innovation/Exploration).

• Complex Data Processing (Any Language, Any Problem).

• Scalability of Storage/Compute (Freedom to Grow).

• Economical Active Archive (Keep All Your Data Alive).

• Cloudera Enterprise enables:

• Lower the Cost of Management and Administration.

• Simplify and Accelerate Hadoop Deployment.

• Increase the Transparency & Control of Hadoop.

• Firm SLAs on Issue Resolution.


Page 21: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Contact Information:


Amr Awadallah

[email protected]

650-644-3921

http://twitter.com/awadallah


Page 23: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Appendix


Page 24: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Hadoop Timeline


Timeline axis: 2002 2003 2004 2005 2006 2007 2008 2009

• Doug Cutting & Mike Cafarella started working on Nutch

• Google publishes GFS & MapReduce papers

• Doug Cutting adds DFS & MapReduce support to Nutch

• Yahoo! hires Cutting, Hadoop spins out of Nutch

• Facebook launches Hive: SQL Support for Hadoop

• Fastest sort of a TB, 3.5 mins over 910 nodes

• NY Times converts 4TB of image archives over 100 EC2s

• Fastest sort of a TB, 62 secs over 1,460 nodes

• Sorted a PB in 16.25 hours over 3,658 nodes

• Hadoop Summit 2009, 750 attendees

• Cloudera Founded

• Doug Cutting joins Cloudera

Page 25: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Cloudera’s Track Record


• Customers: Multiple customers with >1,000 Hadoop nodes under management

• Supporting dozens of diverse production use cases including ones that are revenue critical with tight SLAs

• Community: years of demonstrated leadership in the Apache Hadoop ecosystem. Cloudera employees are:

• The largest contributor to the Hadoop ecosystem in patches

• Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache Hadoop itself

• The first to build & integrate what is now the reference Hadoop stack

• Industry: Multiple years of experience providing Hadoop solutions across industries:

• 2 of the top 5 payments companies run Cloudera

• 3 of the top 5 commercial banks run Cloudera

• 2 of the top 4 online travel companies run Cloudera

Page 26: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

Cloudera Enterprise Management Suite


Activity Monitor

It Helps You…

• Consolidate all user activities into a real-time view

• Diagnose user performance

• Track activity metrics

So You Can…

• Improve performance

• Improve conformance to SLAs

• Improve QOS

It’s Like…

• MySQL Enterprise Monitor

• Quest Foglight for Oracle / SQL Server

Service & Configuration Manager

It Helps You…

• Manage system services

• Automate changes

• Validate settings

• 1-click security

So You Can…

• Lower cost of administration

• Improve uptime

It’s Like…

• Red Hat Satellite Server

• Microsoft System Center

• Oracle Enterprise Manager

Resource Manager

It Helps You…

• Report on the usage of scarce resources

• Plan for capacity expansion

So You Can…

• Improve quality of service

• Extend the life of the cluster

It’s Like…

• VMware vCenter

Authorization Manager

It Helps You…

• Centralize management of all users, groups and privileges

• Manage permissions via delegated administration

So You Can…

• Lower the costs of administration

• Improve compliance

It’s Like…

• Teradata security administration

Page 27: Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

CDH Integrates with Existing IT Infrastructure

[Diagram: integration partners grouped by category: Databases, BI/Analytics, ETL, Cloud/OS, Hardware.]
