How to build multi-disciplinary analytics applications on a shared data platform

Frank Hereygers | Systems engineer

Upload: cloudera-inc

Post on 23-Jan-2018


© Cloudera, Inc. All rights reserved.


Challenges in data management

• Many data silos, each requiring its own proprietary tools and infrastructure
• Different vendors, products, and services on-premises versus in cloud

A fragmented approach is difficult, expensive, and risky

• SQL analytic databases
• NoSQL and real-time databases
• Data engineering and ETL environments
• Data warehouses and data marts


Traditional applications

• One data type
• One analytic function
• Hard to integrate

[Diagram: five separate application stacks (Data Exploration, SQL & BI Analytics, Operational Real-Time DB, ETL & Data Processing, Custom Functions), each with its own Storage, Security, Governance, Workload Management, Ingest & Replication, and Data Catalog layers]


Negative consequences for your business

• Increased operational costs: many distinct environments to buy and build
• Increased staff overhead: many distinct tools to learn and support
• Increased security risks: many distinct frameworks to enforce
• Decreased business insights: narrow data sets and analytics rigidity
• Decreased business agility: outdated and limiting for applications
• Decreased governance capability: no common visibility across stores


Key design goals for today’s data management teams

• Support multi-function analytics
• Minimize time to add workloads
• Support elastic workloads
• Enable self-service
• Provide a scalable model for sharing data
• Reduce cost
• Increase tenant isolation
• Secure the environment


Traditional on-premises deployments perform reasonably well

… but not good enough going forward

One physical cluster provides a shared data experience to multiple workloads and tenants

• Strong multi-function support
• Strong shared data experience
• Strong information security model
• Moderate cost management
• Moderate tenant isolation
• Moderate workload elasticity
• Weak on self service
• Weak on speed of deployment

[Diagram layers: Shared Data Experience (Metadata, Security, Governance); Shared Storage (HDFS, Kudu)]


Traditional cloud deployments are strong where on-premises is weak, but at the expense of creating workload silos

… but not good enough going forward

This is the experience of cloud house offerings

• Moderate multi-function support
• Weak on shared data experience
• Weak information security model
• Moderate cost management
• Strong on tenant isolation
• Strong on workload elasticity
• Strong on self service
• Strong on speed of deployment

[Diagram layers: Shared Storage (Cloud)]


In the beginning…


Today: One platform. Multiple workloads

Workloads: Data Engineering, Operational Database, Analytic Database, Data Science

• Store and process unlimited data fast and cost-effectively: “Programmatic data processing and machine learning”
• Explore, analyze, and understand all your data: “Fast, flexible, open source parallel database”
• Build data-driven applications to deliver real-time insights: “Online applications, lambda/kappa architectures”


What is a workload?

Data + Data Context + Compute

Data Context:

• HMS: Schema definitions

• Sentry: Security authorizations

• Navigator: Audit logs

• Navigator: Business glossary

• Navigator: Business metadata

• Navigator: Lineage
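
The "Data + Data Context + Compute" definition can be pictured as a small data structure. The following is only an illustrative sketch: the class names, field names, and the `can_read` check are hypothetical and are not part of HMS, Sentry, or Navigator.

```python
from dataclasses import dataclass, field


@dataclass
class DataContext:
    """Hypothetical stand-in for the slide's data context services."""
    schemas: dict = field(default_factory=dict)         # HMS: table -> column definitions
    authorizations: dict = field(default_factory=dict)  # Sentry: role -> set of readable tables
    audit_log: list = field(default_factory=list)       # Navigator: audit events
    glossary: dict = field(default_factory=dict)        # Navigator: business term -> definition
    lineage: dict = field(default_factory=dict)         # Navigator: dataset -> upstream datasets


@dataclass
class Workload:
    """A workload is data plus data context plus compute."""
    name: str
    data_location: str   # where the data lives (illustrative path)
    context: DataContext
    compute_engine: str  # illustrative engine label

    def can_read(self, role: str, table: str) -> bool:
        # Readable only if the schema is known AND the role holds a grant.
        known = table in self.context.schemas
        granted = table in self.context.authorizations.get(role, set())
        return known and granted


context = DataContext(
    schemas={"sales.orders": ["order_id BIGINT", "amount DOUBLE"]},
    authorizations={"analyst": {"sales.orders"}},
)
workload = Workload("nightly-etl", "s3://example-bucket/sales/", context, "spark")
print(workload.can_read("analyst", "sales.orders"))
```

The point of the sketch is that data alone is not usable: without the schema and the grant held in the context, the compute engine has nothing meaningful to read.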


What about multiple workloads?

[Diagram: a cluster containing compute (Spark) and data context services (Hive/HMS, Sentry, Navigator, Keys), with storage in HDFS, Kudu, S3, or Private Cloud Storage]


Data context with multiple workloads

• Traditional Hadoop clusters contain compute, data, and data context
• Transient Hadoop clusters contain compute and data context, but externalize data

Why is data context stored in each cluster, and not alongside the data?

[Diagram: storage layer of HDFS, Kudu, S3, Private Cloud Storage]


The data context consistency problem

Compute and data are becoming further separated

• Compute is stateless: cloud-based or on-prem, either transient or long-running

• Data is stateful: cloud-based or on-prem in HDFS, Kudu, S3, ADLS, Isilon, etc.

What about data context?

• Schema Definitions (Hive Metastore)

• Permissions (Apache Sentry)

• Encryption Keys (KMS)

• Governance (Cloudera Navigator)

Data context should be stateful, but currently is stateless

• This creates synchronization and usability challenges for admins and end users alike
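
A toy model of the synchronization challenge: when every cluster keeps its own copy of the data context, a change made on one cluster is invisible to the others until someone copies it across. The dictionary keys, cluster names, and `context_drift` helper below are hypothetical and only illustrate the idea.

```python
import copy


def context_drift(a: dict, b: dict) -> dict:
    """Return the entries whose values differ between two context snapshots."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


# Both clusters start from the same schema definitions and permissions.
baseline = {
    "schema:sales.orders": ("order_id BIGINT", "amount DOUBLE"),
    "grant:analyst": ("SELECT on sales.orders",),
}
etl_cluster = copy.deepcopy(baseline)
bi_cluster = copy.deepcopy(baseline)

# An admin evolves the schema on the ETL cluster only.
etl_cluster["schema:sales.orders"] += ("currency STRING",)

# The BI cluster still sees the old schema until the change is synchronized.
drift = context_drift(etl_cluster, bi_cluster)
for key, (etl_value, bi_value) in drift.items():
    print(f"{key}: ETL={etl_value} BI={bi_value}")
```

Every additional cluster adds another copy that can fall out of step, which is the usability problem the slide describes.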


Solution: Shared Data Experience

Externalize data context services as a shared service

Workloads: Data Engineering, Operational Database, Analytic Database, Data Science

Benefits:

• Common schemas, access permissions, classifications, and governance across all workloads

• Reduced cost of ownership: less hardware and software to manage

• Increased end-user productivity: data is presented consistently in every cluster

• Faster expansion: admins don’t have to recreate data context services with each new cluster

[Diagram: Keys, HMS, Sentry, and Navigator provided as shared services over storage (HDFS, Kudu, S3, Private Cloud Storage)]
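
The benefits listed on this slide follow from clusters holding a reference to one shared context rather than a private copy. A minimal sketch of that idea, with hypothetical class and cluster names that do not correspond to any Cloudera API:

```python
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """Hypothetical externalized data context service."""
    schemas: dict = field(default_factory=dict)  # table -> column definitions
    grants: dict = field(default_factory=dict)   # role -> set of tables


@dataclass
class Cluster:
    name: str
    context: SharedContext  # a reference to the shared service, not a copy

    def describe(self, table: str):
        return self.context.schemas.get(table)


sdx = SharedContext()
sdx.schemas["sales.orders"] = ["order_id BIGINT", "amount DOUBLE"]

data_engineering = Cluster("data-engineering", sdx)
analytic_db = Cluster("analytic-db", sdx)

# One schema change, made once, is visible to every attached cluster.
sdx.schemas["sales.orders"].append("currency STRING")

# A new cluster attaches to the existing context; nothing is recreated.
data_science = Cluster("data-science", sdx)

for cluster in (data_engineering, analytic_db, data_science):
    print(cluster.name, cluster.describe("sales.orders"))
```

Because there is a single source of truth, the schema is presented consistently in every cluster, and a newly launched cluster only needs a pointer to the shared service.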


The modern platform for machine learning and analytics optimized for the cloud

Cloudera Enterprise

• Extensible services: Data Engineering, Operational Database, Analytic Database, Data Science
• Core services: Data Catalog, Ingest & Replication, Security, Governance, Workload Management
• Storage services: S3, ADLS, HDFS, Kudu


Two deployment options

Cloudera SDX: Customer-managed

• RDS-backed Hive Metastore

• RDS-backed Apache Sentry

• Customer-managed Cloudera Navigator

Ideal for:

• Director-launched workloads

• CM-managed workloads

Cloudera Altus SDX: Cloudera-managed

• Serverless Hive Metastore

• Serverless Apache Sentry

• Serverless Cloudera Navigator

Ideal for:

• Altus SDX workloads

• Hybrid workloads


Cloud deployments with SDX optimize for all design goals

SDX makes it possible to transfer on-premises design wins to cloud

One logical cluster provides a shared data experience to multiple workloads and tenants

• Strong multi-function support
• Strong shared data experience
• Strong information security model
• Strong on cost management
• Strong on tenant isolation
• Strong on workload elasticity
• Strong on self service
• Strong on speed of deployment

[Diagram layers: Shared Data Experience (Metadata, Security, Governance); Shared Object Storage (Cloud)]
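
The ratings on this slide can be set side by side with those on the earlier on-premises and traditional cloud slides. The sketch below simply transcribes the three rating lists from the deck into one table and reports which design goals each model leaves short of "strong"; the `RATINGS` layout and the `gaps` helper are illustrative choices, not part of the presentation.

```python
MODELS = ("on-premises", "traditional cloud", "cloud with SDX")

# Design goal -> rating per model, in the order given by MODELS.
RATINGS = {
    "multi-function support":     ("strong",   "moderate", "strong"),
    "shared data experience":     ("strong",   "weak",     "strong"),
    "information security model": ("strong",   "weak",     "strong"),
    "cost management":            ("moderate", "moderate", "strong"),
    "tenant isolation":           ("moderate", "strong",   "strong"),
    "workload elasticity":        ("moderate", "strong",   "strong"),
    "self service":               ("weak",     "strong",   "strong"),
    "speed of deployment":        ("weak",     "strong",   "strong"),
}


def gaps(model: str) -> list:
    """List the design goals that a deployment model does not rate 'strong' on."""
    column = MODELS.index(model)
    return [goal for goal, row in RATINGS.items() if row[column] != "strong"]


for model in MODELS:
    print(f"{model}: {gaps(model) or 'no gaps'}")
```

Read this way, the on-premises and traditional cloud models have largely complementary gaps, and the SDX column is the only one the deck rates strong throughout.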


Positive business outcomes

• Increased business insights: diverse data together with analytics flexibility
• Increased business agility: modern and nimble application innovation
• Increased governance capability: one common viewpoint and store
• Decreased operational costs: one environment for all needs
• Decreased staff overhead: one set of controls for everything
• Decreased security risks: comprehensive controls everywhere


Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime

• OnCommand Connection is collecting telematics and geolocation data across the fleet

• Reduced maintenance costs to $.03 per mile from $.12-$.15 per mile

• Centralizing data from 13 systems with varying frequency and semantic definitions

• Real-time visibility of 300,000+ trucks in order to improve uptime and vehicle performance
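
As a worked check of the maintenance cost figures above, the relative reduction can be computed directly from the quoted per-mile costs. The helper name is illustrative, and `Decimal` is used only to keep the cent values exact.

```python
from decimal import Decimal


def reduction_pct(before: Decimal, after: Decimal) -> Decimal:
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100


after = Decimal("0.03")                        # $.03 per mile after
low = reduction_pct(Decimal("0.12"), after)    # against the $.12 baseline
high = reduction_pct(Decimal("0.15"), after)   # against the $.15 baseline

print(f"Maintenance cost per mile fell by {low:.0f}% to {high:.0f}%")
```

In other words, the quoted drop from $.12-$.15 to $.03 per mile corresponds to a 75% to 80% reduction.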


Thank you