Building multi-disciplinary analytics applications on a shared data platform
TRANSCRIPT
1© Cloudera, Inc. All rights reserved.
How to build multi-disciplinary analytics
applications on a shared data platform
Frank Hereygers | Systems engineer
Challenges in data management
Many data silos, each requiring its own proprietary tools and infrastructure
Different vendors, products, and services on-premises versus in cloud
A fragmented approach is difficult, expensive, and risky
• SQL analytic databases
• NoSQL and real-time databases
• Data engineering and ETL environments
• Data warehouses and data marts
Traditional applications
• One data type
• One analytic function
• Hard to integrate
[Diagram: five separate application stacks (Data Exploration; SQL & BI Analytics; Operational Real-Time DB; ETL & Data Processing; Custom Functions), each with its own Storage, Security, Governance, Workload Management, Ingest & Replication, and Data Catalog layers]
Negative consequences for your business
• Increased operational costs: many distinct environments to buy and build
• Increased staff overhead: many distinct tools to learn and support
• Increased security risks: many distinct frameworks to enforce
• Decreased business insights: narrow data sets and analytics rigidity
• Decreased business agility: outdated and limiting for applications
• Decreased governance capability: no common visibility across stores
Key design goals for today’s data management teams
• Support multi-function analytics
• Minimize time to add workloads
• Support elastic workloads
• Enable self-service
• Provide a scalable model for sharing data
• Reduce cost
• Increase tenant isolation
• Secure the environment
Traditional on-premises deployments perform reasonably well … but not good enough going forward
One physical cluster provides a shared data experience to multiple workloads and tenants
[Diagram: Shared Data Experience (Metadata, Security, Governance) on top of Shared Storage (HDFS, Kudu)]
Strong multi-function support
Strong shared data experience
Strong information security model
Moderate cost management
Moderate tenant isolation
Moderate workload elasticity
Weak on self service
Weak on speed of deployment
Traditional cloud deployments are strong where on-premises is weak, but at the expense of creating workload silos
Moderate multi-function support
Weak on shared data experience
Weak information security model
Moderate cost management
Strong on tenant isolation
Strong on workload elasticity
Strong on self service
Strong on speed of deployment
This is the experience of cloud house offerings… but not good enough going forward
[Diagram: Shared Storage (Cloud)]
Today: One platform. Multiple workloads
Workloads: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
• Store and process unlimited data fast and cost-effectively: “Programmatic data processing and machine learning”
• Explore, analyze, and understand all your data: “Fast, flexible, open source parallel database”
• Build data-driven applications to deliver real-time insights: “Online applications, lambda/kappa architectures”
What is a workload?
Data + Data Context + Compute
Data Context:
• HMS: Schema definitions
• Sentry: Security authorizations
• Navigator: Audit logs
• Navigator: Business glossary
• Navigator: Business metadata
• Navigator: Lineage
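As a rough illustration of the definition above, a workload can be modeled as three parts: data, data context, and compute. This is only a minimal sketch. The class names, field names, table name, and storage path are hypothetical and do not correspond to any Cloudera API.

```python
from dataclasses import dataclass, field


@dataclass
class DataContext:
    """Hypothetical model of the information that describes and governs the data."""
    schemas: dict = field(default_factory=dict)            # HMS: table name -> column names
    permissions: dict = field(default_factory=dict)        # Sentry: role -> set of readable tables
    audit_log: list = field(default_factory=list)          # Navigator: audit events
    glossary: dict = field(default_factory=dict)           # Navigator: business glossary terms
    business_metadata: dict = field(default_factory=dict)  # Navigator: tags per table
    lineage: dict = field(default_factory=dict)            # Navigator: table -> upstream tables


@dataclass
class Workload:
    """Workload = data + data context + compute."""
    data_location: str    # where the data lives (illustrative path)
    context: DataContext  # how the data is described and governed
    compute: str          # engine that runs the job, e.g. "spark"

    def can_read(self, role: str, table: str) -> bool:
        # Allowed only if the table is defined and the role has been granted access.
        granted = self.context.permissions.get(role, set())
        return table in self.context.schemas and table in granted


ctx = DataContext(
    schemas={"trucks": ["vin", "lat", "lon"]},
    permissions={"analyst": {"trucks"}},
)
etl = Workload(data_location="s3://example-bucket/trucks/", context=ctx, compute="spark")
```

The point of the model is that compute can be swapped or discarded without touching the other two parts, while the data is of little use without its context.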
What about multiple workloads?
[Diagram: a single cluster running Spark with its own Hive/HMS, Sentry, Navigator, and Keys, on top of HDFS, Kudu, S3, Private Cloud Storage]
Data context with multiple workloads
Traditional Hadoop clusters contain compute, data, and data context
Transient Hadoop clusters contain compute and data context, but externalize data
HDFS, Kudu, S3, Private Cloud Storage
Why is data context stored in each cluster, and not alongside the data?
The data context consistency problem
Compute and data are becoming further separated
• Compute is stateless: cloud-based or on-prem, either transient or long-running
• Data is stateful: cloud-based or on-prem in HDFS, Kudu, S3, ADLS, Isilon, etc.
What about data context?
• Schema Definitions (Hive Metastore)
• Permissions (Apache Sentry)
• Encryption Keys (KMS)
• Governance (Cloudera Navigator)
Data context should be stateful, but currently is stateless
• This creates synchronization and usability challenges for admins and end users alike
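The synchronization challenge can be made concrete with a small sketch. It assumes each cluster keeps a private copy of its data context, modeled here as plain dictionaries. The table, role, and function names are hypothetical.

```python
import copy

# Hypothetical data context as it might be held inside one cluster.
base_context = {
    "schemas": {"trucks": ["vin", "lat", "lon"]},
    "permissions": {"analyst": ["trucks"]},
}

# Each cluster starts from its own private copy of the data context.
cluster_a = copy.deepcopy(base_context)
cluster_b = copy.deepcopy(base_context)

# An admin adds a column on cluster A only; cluster B is never updated.
cluster_a["schemas"]["trucks"].append("speed")


def find_drift(left: dict, right: dict) -> list:
    """Return the data-context sections whose contents differ between two clusters."""
    return sorted(key for key in left.keys() | right.keys() if left.get(key) != right.get(key))


drift = find_drift(cluster_a, cluster_b)
print(drift)  # lists the sections that no longer match
```

With per-cluster copies, every change has to be replayed on every cluster, and any missed replay shows up as drift like this.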
Solution: Shared Data Experience
Externalize data context services as a shared service
Workloads: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
Benefits:
• Common schemas, access permissions, classifications, and governance across all workloads
• Reduced cost of ownership: less hardware and software to manage
• Increased end-user productivity: data is presented consistently in every cluster
• Faster expansion: admins don’t have to recreate data context services with each new cluster
[Diagram: KEYS, HMS, SENTRY, and NAVIGATOR as shared services on top of HDFS, Kudu, S3, Private Cloud Storage]
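The idea of externalizing data context can be sketched as several clusters holding a reference to one shared context object instead of private copies. This is an illustrative model only: `SharedDataContext` and `Cluster` are hypothetical names, not the SDX implementation or API.

```python
class SharedDataContext:
    """Hypothetical stand-in for externalized schema and permission services."""

    def __init__(self):
        self.schemas = {}      # table name -> list of column names
        self.permissions = {}  # role -> set of readable tables

    def define_table(self, table, columns):
        self.schemas[table] = list(columns)

    def grant(self, role, table):
        self.permissions.setdefault(role, set()).add(table)


class Cluster:
    """A compute cluster that references the shared context instead of copying it."""

    def __init__(self, name, context):
        self.name = name
        self.context = context  # the same object is handed to every cluster

    def columns(self, table):
        return self.context.schemas[table]

    def can_read(self, role, table):
        return table in self.context.permissions.get(role, set())


sdx = SharedDataContext()
sdx.define_table("trucks", ["vin", "lat", "lon"])
sdx.grant("analyst", "trucks")

etl_cluster = Cluster("data-engineering", sdx)
bi_cluster = Cluster("analytic-database", sdx)

# A schema change is made once and is visible to every cluster.
sdx.schemas["trucks"].append("speed")

# A new cluster needs no context setup of its own.
new_cluster = Cluster("data-science", sdx)
```

Because every cluster reads the same context, a change is made once, and adding a cluster does not require recreating schemas or permissions.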
The modern platform for machine learning and analytics optimized for the cloud
[Diagram: Cloudera Enterprise]
EXTENSIBLE SERVICES
CORE SERVICES: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
DATA CATALOG, INGEST & REPLICATION, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT
STORAGE SERVICES: S3, ADLS, HDFS, KUDU
Two deployment options
Cloudera SDX
Cloudera SDX: Customer-managed
• RDS-backed Hive Metastore
• RDS-backed Apache Sentry
• Customer-managed Cloudera Navigator
Ideal for:
• Director-launched workloads
• CM-managed workloads
Cloudera Altus SDX: Cloudera-managed
• Serverless Hive Metastore
• Serverless Apache Sentry
• Serverless Cloudera Navigator
Ideal for:
• Altus SDX workloads
• Hybrid workloads
Cloud deployments with SDX optimize for all design goals
Shared Data Experience (Metadata, Security, Governance)
One logical cluster provides a shared data experience to multiple workloads and tenants
SDX makes it possible to transfer on-premises design wins to cloud
[Diagram: Shared Object Storage (Cloud)]
Strong multi-function support
Strong shared data experience
Strong information security model
Strong on cost management
Strong on tenant isolation
Strong on workload elasticity
Strong on self service
Strong on speed of deployment
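The three scorecards in this deck (on-premises, traditional cloud, cloud with SDX) can be compared side by side. The sketch below copies the ratings as stated on the slides. The dictionary keys, goal labels, and the `gaps` helper are hypothetical names chosen for illustration.

```python
# Design goals in the order they appear on the scorecard slides.
GOALS = [
    "multi-function support",
    "shared data experience",
    "information security model",
    "cost management",
    "tenant isolation",
    "workload elasticity",
    "self service",
    "speed of deployment",
]

# Ratings as stated on the slides, one entry per goal above.
RATINGS = {
    "on_premises": ["Strong", "Strong", "Strong", "Moderate",
                    "Moderate", "Moderate", "Weak", "Weak"],
    "traditional_cloud": ["Moderate", "Weak", "Weak", "Moderate",
                          "Strong", "Strong", "Strong", "Strong"],
    "cloud_with_sdx": ["Strong"] * 8,
}


def gaps(deployment: str) -> list:
    """Return the design goals that a deployment model does not rate Strong on."""
    return [goal for goal, rating in zip(GOALS, RATINGS[deployment]) if rating != "Strong"]


for name in RATINGS:
    print(name, gaps(name))
```

Read this way, the on-premises and traditional cloud models have largely complementary gaps, which is the argument the slides make for combining them.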
Positive business outcomes
• Increased business insights: diverse data together with analytics flexibility
• Increased business agility: modern and nimble application innovation
• Increased governance capability: one common viewpoint and store
• Decreased operational costs: one environment for all needs
• Decreased staff overhead: one set of controls for everything
• Decreased security risks: comprehensive controls everywhere
Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime
• OnCommand Connection is collecting telematics and geolocation data across the fleet
• Reduced maintenance costs to $0.03 per mile from $0.12-$0.15 per mile
• Centralizing data from 13 systems with varying frequency and semantic definitions
• Real-time visibility of 300,000+ trucks in order to improve uptime and vehicle performance