how does this stuff work?. - ohdsi · what is hadoop? how does this stuff work?. shared ....
Post on 05-Jul-2018
213 Views
Preview:
TRANSCRIPT
1 © Cloudera, Inc. All rights reserved.
What is Hadoop? How does this stuff work?.
Shared Processing
Shared Data
2 © Cloudera, Inc. All rights reserved.
Cloudera Technology Making Hadoop Fast, Easy, and Secure for the Modernized Architecture
Hadoop is a new kind of data platform. • One place for unlimited data • Unified data access
Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise
3 © Cloudera, Inc. All rights reserved.
MapReduce: A great tool for its day
The original scalable, general, processing engine of Hadoop ecosystem - Useful across diverse problem domains - Fueled initial ecosystem explosion
MapReduce Execution Engine
Hive Pig Mahout Solr Crunch
4 © Cloudera, Inc. All rights reserved.
Enter Apache Spark Flexible, in-memory data processing for Hadoop
Easier Development
Flexible, Extensible API
Faster Processing (Batch & Streaming)
• Rich APIs for Scala, Java, and Python
• Interactive shell
• APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph
• In-Memory processing and caching
5 © Cloudera, Inc. All rights reserved.
The Spark Ecosystem & Hadoop
Spark Streaming MLlib SparkSQL GraphX Data-
frames SparkR
STORAGE HDFS, HBase
RESOURCE MANAGEMENT YARN
Spark Impala MR Others Search
6 © Cloudera, Inc. All rights reserved.
John Hope – Senior Solutions Engineer hope@cloudera.com
Back Slides Hadoop and Spark Essentials and a few other topics
7 © Cloudera, Inc. All rights reserved.
Our relationship with data is changing - forever.
8 © Cloudera, Inc. All rights reserved.
Data can be a powerful strategic asset
data helps achieve your business vision.
…only if...
9 © Cloudera, Inc. All rights reserved.
Data is Transforming Business Drive Customer
Insights +revenue Improve Product & Services
Efficiency -costs Lower Business
Risk
10 © Cloudera, Inc. All rights reserved.
• ETL Offload • Too much data, too little time, too
costly • Active Archive
• Save all data vs. moving to slow archival storage
• Mainframe Migration • Move costly CPU loads
• Real-time streaming • Fraud detection, patient care,
transactions
• Data Discovery • Search All Types of Data • Predictive Analytics • Scalable BI • 360 View of xyz
• Eliminate siloed data • Anomaly detection • PB-scale platform for cyber security • Behavioral analytics • Multi-tenant / shared resources
Exploring Use Cases
11 © Cloudera, Inc. All rights reserved.
Data Management
Yesterday and Today
12 © Cloudera, Inc. All rights reserved.
80% of data available to most companies is not used to make data driven decisions Existing solutions are not enough. Only capture structured data. Challenging to adapt to unstructured or
semi-structured sources and diverse types. Optimized for clean, organized data,
not uncertain data quality and value. structured
other data
20%
2005 2010 2015
Volu
me
of D
ata
Gen
erat
ed
80%
13 © Cloudera, Inc. All rights reserved.
Working the Data Analyzing the Data
13
Discovery & Data Management Lifecycle Too little time finding insights that change the business
Data Select Data
Prepare Data
Transform Data
See Patterns
Interpret & Evaluate Knowledge
Insights
Traditional Approach Model First vs. Data First
Time
f(x) ?
a = A
80% 20% 80% 20%
14 © Cloudera, Inc. All rights reserved.
What if we could make data preparation 20% of the effort so you can focus 80% of your time on executing and improving your business?
Inverting Data Access Cycles
15 © Cloudera, Inc. All rights reserved.
Adopt an Agile Approach Successful projects start small, fail often, and iterate to success
1. Get data you already have, or create new data.
2. Explore and analyze, quickly.
3. Deploy your application.
…and repeat
Add: new data sources, more users, more use cases, more complex analytics, go real-time
Collect, Create, Manage
unlimited data
Explore, Analyze data in many ways
Operationalize insights to drive action
16 © Cloudera, Inc. All rights reserved.
Working the Data Analyzing the Data
16
Discovery & Data Management Lifecycle Move quickly to business changing insights
Data Select Data
Prepare Data
Transform Data
See Patterns
Interpret & Evaluate Knowledge
Insights
Cloudera + Open Source Frameworks Data First vs Model First
Time
f(x) ?
a = A
80% 20% 20% 80%
17 © Cloudera, Inc. All rights reserved.
Current Data Management Architectures Limited data. Single access. Platform silos.
REPORT
ETL
MODEL
Limited Data Structured Only Slow Performance Restricted Use Difficult Redundancy Sometimes SPOF Not Real-Time Many more….. 20% of Available Data Used Today
18 © Cloudera, Inc. All rights reserved.
The Cloudera Solution
Fast, Easy and Secure
19 © Cloudera, Inc. All rights reserved.
Cloudera Enterprise Data Hub Unlimited data. Diverse access. One platform.
REPORT
ETL
MODEL
Yesterday Today
20 © Cloudera, Inc. All rights reserved.
Cloudera Enterprise Data Hub Unlimited data. Diverse access. One platform.
21 © Cloudera, Inc. All rights reserved.
Cloudera Enterprise Data Hub How does this stuff work?.
Shared Processing
Shared Data
22 © Cloudera, Inc. All rights reserved.
Cloudera Enterprise Data Hub Making Hadoop Fast, Easy, and Secure for the Modernized Architecture
Hadoop is a new kind of data platform. • One place for unlimited data • Unified data access
Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise
23 © Cloudera, Inc. All rights reserved.
MapReduce: A great tool for its day
The original scalable, general, processing engine of Hadoop ecosystem - Useful across diverse problem domains - Fueled initial ecosystem explosion
MapReduce Execution Engine
Hive Pig Mahout Solr Crunch
24 © Cloudera, Inc. All rights reserved.
Enter Apache Spark Flexible, in-memory data processing for Hadoop
Easier Development
Flexible, Extensible API
Faster Processing (Batch & Streaming)
• Rich APIs for Scala, Java, and Python
• Interactive shell
• APIs for different types of workloads: • Batch • Streaming • Machine Learning • Graph
• In-Memory processing and caching
25 © Cloudera, Inc. All rights reserved.
The Spark Ecosystem & Hadoop
Spark Streaming MLlib SparkSQL GraphX Data-
frames SparkR
STORAGE HDFS, HBase
RESOURCE MANAGEMENT YARN
Spark Impala MR Others Search
26 © Cloudera, Inc. All rights reserved.
One platform. Many applications. Future Proof.
Data Engineering
Data Discovery & Analytics
Real-Time Data Applications
Increase Revenue
Reduce Costs
Mitigate Risks
Cloudera Enterprise: Fast, Easy, Secure
Business Value
Technology Use Cases
27 © Cloudera, Inc. All rights reserved.
Apache Hadoop … This will change before we leave the room.
28 © Cloudera, Inc. All rights reserved.
29 © Cloudera, Inc. All rights reserved.
Hadoop Isn’t Just Hadoop Anymore
2006-07 2008 2009 2010 2011 2012 Present
Core Hadoop (HDFS, MR)
HBase ZooKeeper
Core Hadoop
Hive Mahout HBase
ZooKeeper Core Hadoop
Sqoop Whirr Avro Hive
Mahout HBase
ZooKeeper Core Hadoop
Flume Bigtop Oozie
MRUnit HCatalog
Sqoop Whirr Avro Hive
Mahout HBase
ZooKeeper Core Hadoop
+ YARN
Spark Tez
Impala Kafka Flume Bigtop Oozie
MRUnit HCatalog
Sqoop Whirr Avro Hive
Mahout HBase
ZooKeeper Core Hadoop
+ YARN
Kudu RecordService
Spark SparkSQL Parquet Sentry
SparkSQL Tez
Impala Kafka Flume Bigtop Oozie
MRUnit HCatalog
Sqoop Whirr Avro
Hive/HoS Mahout HBase
ZooKeeper Core Hadoop +
YARN
30 © Cloudera, Inc. All rights reserved.
Apache Hadoop
Scalable
Flexible
Open (future proof)
Cost-Effective
top related