What is Hadoop? How does this stuff work? - OHDSI

Upload: hoangthuy

Post on 05-Jul-2018

1 © Cloudera, Inc. All rights reserved.

What is Hadoop? How does this stuff work?

Shared Processing

Shared Data

Cloudera Technology
Making Hadoop Fast, Easy, and Secure for the Modernized Architecture

Hadoop is a new kind of data platform.
• One place for unlimited data
• Unified data access

Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise

MapReduce: A great tool for its day

The original scalable, general processing engine of the Hadoop ecosystem
- Useful across diverse problem domains
- Fueled initial ecosystem explosion

[Diagram: Hive, Pig, Mahout, Solr, and Crunch sitting on the MapReduce Execution Engine]
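
To make the MapReduce model concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. It only shows the three steps the engine coordinates across a cluster: map each record to key/value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (the framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["shared data shared processing", "shared data"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run in parallel on many machines and the framework performs the shuffle; the shape of the computation is the same.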

Enter Apache Spark
Flexible, in-memory data processing for Hadoop

Easier Development
• Rich APIs for Scala, Java, and Python
• Interactive shell

Flexible, Extensible API
• APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph

Faster Processing (Batch & Streaming)
• In-memory processing and caching
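
The benefit of in-memory caching can be sketched with a toy example. The class below is a hypothetical stand-in written in plain Python, not the Spark API: `LazyDataset`, `cache`, `collect`, and `compute_calls` are illustrative names. It assumes only that transformations are recorded lazily and that a cached result is reused instead of being recomputed.

```python
class LazyDataset:
    """Toy, single-machine stand-in for a distributed dataset (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces a list
        self._use_cache = False    # set by cache()
        self._cached = None        # holds the materialized list once cached
        self.compute_calls = 0     # how many times this dataset was recomputed

    def map(self, fn):
        # Lazy: nothing runs until collect() is called on the result.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        return LazyDataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def demo():
    source = LazyDataset(lambda: list(range(10)))  # stands in for a slow read
    squares = source.map(lambda x: x * x).cache()
    evens = squares.filter(lambda x: x % 2 == 0).collect()
    total = sum(squares.collect())
    return evens, total, squares.compute_calls


if __name__ == "__main__":
    print(demo())
```

Without the `cache()` call, `squares` would be recomputed for each downstream use, which is the cost that in-memory caching is meant to avoid in iterative and interactive workloads.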

The Spark Ecosystem & Hadoop

Spark components: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR

Engines on the platform: Spark, Impala, MR, Search, Others

Resource management: YARN

Storage: HDFS, HBase

John Hope – Senior Solutions Engineer [email protected]

Back Slides
Hadoop and Spark Essentials and a few other topics

Our relationship with data is changing - forever.

Data can be a powerful strategic asset

…only if…

data helps achieve your business vision.

Data is Transforming Business

• Drive Customer Insights (+revenue)
• Improve Product & Services Efficiency (-costs)
• Lower Business Risk

Exploring Use Cases

• ETL Offload
  - Too much data, too little time, too costly
• Active Archive
  - Save all data vs. moving to slow archival storage
• Mainframe Migration
  - Move costly CPU loads
• Real-time streaming
  - Fraud detection, patient care, transactions

• Data Discovery
• Search All Types of Data
• Predictive Analytics
• Scalable BI
• 360 View of xyz

• Eliminate siloed data
• Anomaly detection
• PB-scale platform for cyber security
• Behavioral analytics
• Multi-tenant / shared resources

Data Management

Yesterday and Today

80% of data available to most companies is not used to make data-driven decisions.

Existing solutions are not enough:
• Only capture structured data.
• Challenging to adapt to unstructured or semi-structured sources and diverse types.
• Optimized for clean, organized data, not uncertain data quality and value.

[Chart: Volume of Data Generated, 2005 to 2015; structured data 20%, other data 80%]

Discovery & Data Management Lifecycle
Too little time finding insights that change the business

Traditional Approach: Model First vs. Data First

[Diagram: Data → Select Data → Prepare Data → Transform Data → See Patterns → Interpret & Evaluate Knowledge → Insights, plotted against Time. Phases: Working the Data, Analyzing the Data. Percentages shown: 80% / 20%, 80% / 20%.]

Inverting Data Access Cycles

What if we could make data preparation 20% of the effort so you can focus 80% of your time on executing and improving your business?

Adopt an Agile Approach
Successful projects start small, fail often, and iterate to success

1. Get data you already have, or create new data. (Collect, Create, Manage unlimited data)

2. Explore and analyze, quickly. (Explore, Analyze data in many ways)

3. Deploy your application. (Operationalize insights to drive action)

…and repeat

Add: new data sources, more users, more use cases, more complex analytics, go real-time

Discovery & Data Management Lifecycle
Move quickly to business changing insights

Cloudera + Open Source Frameworks: Data First vs Model First

[Diagram: Data → Select Data → Prepare Data → Transform Data → See Patterns → Interpret & Evaluate Knowledge → Insights, plotted against Time. Phases: Working the Data, Analyzing the Data. Percentages shown: 80% / 20%, 20% / 80%.]

Current Data Management Architectures
Limited data. Single access. Platform silos.

[Diagram labels: REPORT, ETL, MODEL]

• Limited Data
• Structured Only
• Slow Performance
• Restricted Use
• Difficult Redundancy
• Sometimes SPOF
• Not Real-Time
• Many more…

20% of Available Data Used Today

The Cloudera Solution

Fast, Easy and Secure

Cloudera Enterprise Data Hub
Unlimited data. Diverse access. One platform.

[Diagram labels: REPORT, ETL, MODEL; Yesterday, Today]

Cloudera Enterprise Data Hub

How does this stuff work?

Shared Processing

Shared Data

Cloudera Enterprise Data Hub
Making Hadoop Fast, Easy, and Secure for the Modernized Architecture

Hadoop is a new kind of data platform.
• One place for unlimited data
• Unified data access

Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise

One platform. Many applications. Future Proof.

Technology Use Cases:
• Data Engineering
• Data Discovery & Analytics
• Real-Time Data Applications

Business Value:
• Increase Revenue
• Reduce Costs
• Mitigate Risks

Cloudera Enterprise: Fast, Easy, Secure

Apache Hadoop … This will change before we leave the room.

Hadoop Isn’t Just Hadoop Anymore

Components in the stack by year (each stack includes everything from the one before):

2006-07: Core Hadoop (HDFS, MR)
2008: adds HBase, ZooKeeper
2009: adds Hive, Mahout
2010: adds Sqoop, Whirr, Avro
2011: adds Flume, Bigtop, Oozie, MRUnit, HCatalog, YARN
2012: adds Spark, Tez, Impala, Kafka
Present: adds Kudu, RecordService, SparkSQL, Parquet, Sentry; Hive becomes Hive/HoS

Apache Hadoop

Scalable

Flexible

Open (future proof)

Cost-Effective