how does this stuff work?. - ohdsi · what is hadoop? how does this stuff work?. shared ....

Post on 05-Jul-2018

213 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1 © Cloudera, Inc. All rights reserved.

What is Hadoop? How does this stuff work?.

Shared Processing

Shared Data

2 © Cloudera, Inc. All rights reserved.

Cloudera Technology Making Hadoop Fast, Easy, and Secure for the Modernized Architecture

Hadoop is a new kind of data platform. • One place for unlimited data • Unified data access

Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise

3 © Cloudera, Inc. All rights reserved.

MapReduce: A great tool for its day

The original scalable, general, processing engine of Hadoop ecosystem - Useful across diverse problem domains - Fueled initial ecosystem explosion

MapReduce Execution Engine

Hive Pig Mahout Solr Crunch

4 © Cloudera, Inc. All rights reserved.

Enter Apache Spark Flexible, in-memory data processing for Hadoop

Easier Development

Flexible, Extensible API

Faster Processing (Batch & Streaming)

• Rich APIs for Scala, Java, and Python

• Interactive shell

• APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph

• In-Memory processing and caching

5 © Cloudera, Inc. All rights reserved.

The Spark Ecosystem & Hadoop

Spark Streaming MLlib SparkSQL GraphX Data-

frames SparkR

STORAGE HDFS, HBase

RESOURCE MANAGEMENT YARN

Spark Impala MR Others Search

6 © Cloudera, Inc. All rights reserved.

John Hope – Senior Solutions Engineer hope@cloudera.com

Back Slides Hadoop and Spark Essentials and a few other topics

7 © Cloudera, Inc. All rights reserved.

Our relationship with data is changing - forever.

8 © Cloudera, Inc. All rights reserved.

Data can be a powerful strategic asset

data helps achieve your business vision.

…only if...

9 © Cloudera, Inc. All rights reserved.

Data is Transforming Business Drive Customer

Insights +revenue Improve Product & Services

Efficiency -costs Lower Business

Risk

10 © Cloudera, Inc. All rights reserved.

• ETL Offload • Too much data, too little time, too

costly • Active Archive

• Save all data vs. moving to slow archival storage

• Mainframe Migration • Move costly CPU loads

• Real-time streaming • Fraud detection, patient care,

transactions

• Data Discovery • Search All Types of Data • Predictive Analytics • Scalable BI • 360 View of xyz

• Eliminate siloed data • Anomaly detection • PB-scale platform for cyber security • Behavioral analytics • Multi-tenant / shared resources

Exploring Use Cases

11 © Cloudera, Inc. All rights reserved.

Data Management

Yesterday and Today

12 © Cloudera, Inc. All rights reserved.

80% of data available to most companies is not used to make data driven decisions Existing solutions are not enough. Only capture structured data. Challenging to adapt to unstructured or

semi-structured sources and diverse types. Optimized for clean, organized data,

not uncertain data quality and value. structured

other data

20%

2005 2010 2015

Volu

me

of D

ata

Gen

erat

ed

80%

13 © Cloudera, Inc. All rights reserved.

Working the Data Analyzing the Data

13

Discovery & Data Management Lifecycle Too little time finding insights that change the business

Data Select Data

Prepare Data

Transform Data

See Patterns

Interpret & Evaluate Knowledge

Insights

Traditional Approach Model First vs. Data First

Time

f(x) ?

a = A

80% 20% 80% 20%

14 © Cloudera, Inc. All rights reserved.

What if we could make data preparation 20% of the effort so you can focus 80% of your time on executing and improving your business?

Inverting Data Access Cycles

15 © Cloudera, Inc. All rights reserved.

Adopt an Agile Approach Successful projects start small, fail often, and iterate to success

1. Get data you already have, or create new data.

2. Explore and analyze, quickly.

3. Deploy your application.

…and repeat

Add: new data sources, more users, more use cases, more complex analytics, go real-time

Collect, Create, Manage

unlimited data

Explore, Analyze data in many ways

Operationalize insights to drive action

16 © Cloudera, Inc. All rights reserved.

Working the Data Analyzing the Data

16

Discovery & Data Management Lifecycle Move quickly to business changing insights

Data Select Data

Prepare Data

Transform Data

See Patterns

Interpret & Evaluate Knowledge

Insights

Cloudera + Open Source Frameworks Data First vs Model First

Time

f(x) ?

a = A

80% 20% 20% 80%

17 © Cloudera, Inc. All rights reserved.

Current Data Management Architectures Limited data. Single access. Platform silos.

REPORT

ETL

MODEL

Limited Data Structured Only Slow Performance Restricted Use Difficult Redundancy Sometimes SPOF Not Real-Time Many more….. 20% of Available Data Used Today

18 © Cloudera, Inc. All rights reserved.

The Cloudera Solution

Fast, Easy and Secure

19 © Cloudera, Inc. All rights reserved.

Cloudera Enterprise Data Hub Unlimited data. Diverse access. One platform.

REPORT

ETL

MODEL

Yesterday Today

20 © Cloudera, Inc. All rights reserved.

Cloudera Enterprise Data Hub Unlimited data. Diverse access. One platform.

21 © Cloudera, Inc. All rights reserved.

Cloudera Enterprise Data Hub How does this stuff work?.

Shared Processing

Shared Data

22 © Cloudera, Inc. All rights reserved.

Cloudera Enterprise Data Hub Making Hadoop Fast, Easy, and Secure for the Modernized Architecture

Hadoop is a new kind of data platform. • One place for unlimited data • Unified data access

Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise

23 © Cloudera, Inc. All rights reserved.

MapReduce: A great tool for its day

The original scalable, general, processing engine of Hadoop ecosystem - Useful across diverse problem domains - Fueled initial ecosystem explosion

MapReduce Execution Engine

Hive Pig Mahout Solr Crunch

24 © Cloudera, Inc. All rights reserved.

Enter Apache Spark Flexible, in-memory data processing for Hadoop

Easier Development

Flexible, Extensible API

Faster Processing (Batch & Streaming)

• Rich APIs for Scala, Java, and Python

• Interactive shell

• APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph

• In-Memory processing and caching

25 © Cloudera, Inc. All rights reserved.

The Spark Ecosystem & Hadoop

Spark Streaming MLlib SparkSQL GraphX Data-

frames SparkR

STORAGE HDFS, HBase

RESOURCE MANAGEMENT YARN

Spark Impala MR Others Search

26 © Cloudera, Inc. All rights reserved.

One platform. Many applications. Future Proof.

Data Engineering

Data Discovery & Analytics

Real-Time Data Applications

Increase Revenue

Reduce Costs

Mitigate Risks

Cloudera Enterprise: Fast, Easy, Secure

Business Value

Technology Use Cases

27 © Cloudera, Inc. All rights reserved.

Apache Hadoop … This will change before we leave the room.

28 © Cloudera, Inc. All rights reserved.

29 © Cloudera, Inc. All rights reserved.

Hadoop Isn’t Just Hadoop Anymore

2006-07 2008 2009 2010 2011 2012 Present

Core Hadoop (HDFS, MR)

HBase ZooKeeper

Core Hadoop

Hive Mahout HBase

ZooKeeper Core Hadoop

Sqoop Whirr Avro Hive

Mahout HBase

ZooKeeper Core Hadoop

Flume Bigtop Oozie

MRUnit HCatalog

Sqoop Whirr Avro Hive

Mahout HBase

ZooKeeper Core Hadoop

+ YARN

Spark Tez

Impala Kafka Flume Bigtop Oozie

MRUnit HCatalog

Sqoop Whirr Avro Hive

Mahout HBase

ZooKeeper Core Hadoop

+ YARN

Kudu RecordService

Spark SparkSQL Parquet Sentry

SparkSQL Tez

Impala Kafka Flume Bigtop Oozie

MRUnit HCatalog

Sqoop Whirr Avro

Hive/HoS Mahout HBase

ZooKeeper Core Hadoop +

YARN

30 © Cloudera, Inc. All rights reserved.

Apache Hadoop

Scalable

Flexible

Open (future proof)

Cost-Effective

top related