a data lake is more than hadoop. hadoop is more than a data lake · 2016-08-30 · • to get their...

#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER A Data Lake is more than Hadoop. Hadoop is more than a Data Lake Dan Graham Teradata Director Technical Marketing

Post on 29-Mar-2020

TRANSCRIPT

Page 1: A Data Lake is more than Hadoop. Hadoop is more than a Data Lake · 2016-08-30 · • To get their job done, users abscond with data daily •Bypass IT, governance, and security

#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER

A Data Lake is more than Hadoop. Hadoop is more than a Data Lake

Dan Graham

Teradata Director Technical Marketing

Page 2

What’s the Big Idea?

Big idea #1: “store all data” (whatever “all” means)

Big idea #2: “un-washed, raw data” (NoETL / late-binding)

Big idea #3: “resolve the nagging problem of accessibility and data integration”

Big idea #4: Data access/integration. Isn’t that in the data warehouse?

Page 3

What is a Data Lake?

A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.

Data sources: sensors, email, transactions, machine logs, geolocation, media

Downstream: BI tools, IDW, data marts, analysis, apps, other data lakes

Page 4

Data Lake is a Design Pattern

Data Lake Design Pattern

• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage

Data Warehouse Design Pattern

• Subject oriented
• Data model of the business
• Integrated
• Consolidated
• Consistent data formats
• Nonvolatile persisted data
• Time variant
• High concurrency levels

Page 5

Design Patterns vis-à-vis Technologies

Data Lake Design Pattern

• Scalability at low cost
• Original raw data fidelity
• Refine data for exploration
• Loosely coupled, late binding
• Serves downstream systems
• Long term storage

Data Lake Technologies: S3, 1800

Page 6

Who is this Guy? What’s he Doing?

Data treatments

Capture, refine, explore original raw data and metadata

Data scientists

Programmers

Business users

Batch jobs

Page 7

Multiple Data Lakes

Sensor data capture, refining

New product design

Market pricing

Page 8

Hadoop is more than a data lake. A data lake is more than Hadoop.

Page 9

What the Data Lake is Not

• Not a single central repository for all data
  • Unless you rebuild half the data center
  • 100s of reasons data bypasses the lake
• Not the only system feeding the data warehouse
  • Data goes direct or through ETL servers
• Not an archive
  • Policies, audits, immutability, extreme security, expirations
• Not dashboards and data marts

Diagram labels: ETL, analysis, data lake

Page 10

Data Manufacturing

DATA R&D, DATA LAKE, DATA PRODUCTS

Page 11

Data Manufacturing & Hadoop Cluster

DATA R&D, DATA LAKE, DATA PRODUCTS

Page 12

Data Integration: Just Say No to your Inner DBA (and some users)

Levels of data trust (data integration)

• Certified: 100%
• Trustworthy: 80%
• Proven: 60%
• Experimental: 40%
• Raw/high risk: 20%

Axis label: Investment (Low to High)
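The trust levels on this slide can be sketched as a small data structure. This is only an illustration: the `Dataset` class and the `promote` and `usable_for` helpers are hypothetical names, not part of any Teradata or Hadoop product, and the one-step promotion rule is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrustLevel(IntEnum):
    """Trust levels from the slide; the values are the slide's percentages."""
    RAW = 20
    EXPERIMENTAL = 40
    PROVEN = 60
    TRUSTWORTHY = 80
    CERTIFIED = 100


@dataclass
class Dataset:
    name: str                            # hypothetical dataset name
    trust: TrustLevel = TrustLevel.RAW   # new data starts as raw/high risk


def promote(dataset: Dataset) -> Dataset:
    """Move a dataset up one trust level; certified data stays certified."""
    levels = sorted(TrustLevel)
    index = levels.index(dataset.trust)
    next_level = levels[min(index + 1, len(levels) - 1)]
    return Dataset(dataset.name, next_level)


def usable_for(dataset: Dataset, required: TrustLevel) -> bool:
    """A consumer states the minimum trust it needs; not every use needs 100%."""
    return dataset.trust >= required
```

An exploratory job might accept `TrustLevel.EXPERIMENTAL`, while a regulatory report would require `TrustLevel.CERTIFIED`, which is where the higher integration investment goes.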

Page 13

Use Cases

Page 14

Data Integration Optimization

• Reference data look-ups
• Joins for derived data
• Lots of derived data
• Service-level goals to meet
• High velocity data
• Unstructured data
• Low value data
• Cost savings ROI

Page 15

Dark Data Insights

• Dark data, data exhaust deleted
• New unstructured data
• Expensive, no ROI, unknown value
• Low user demand
• Dark data often contains insights
• Data lake costs are much lower
• Explore, research, discover
• Promote some to production

Diagram labels: sensors, email, weblogs, logins, tweets, GPS, mobile; Production

Page 16

Complex/Iterative Processing

• Extensive CPU usage
  • Iterative processing
  • Non-sequential loops & branches
• Complex algorithms
  • Video content analysis
  • Photo analysis
  • Text analysis
  • Random forests
  • Monte Carlo methods
• Scientific research
  • Weather simulation
  • Electromagnetic modeling
  • Physics, DNA, etc.

Diagram labels: Complex processing, Set processing
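As a small illustration of the iterative, CPU-bound work listed on this slide, here is a minimal Monte Carlo sketch. The slide names Monte Carlo methods only in general; estimating pi is an assumed stand-in chosen for brevity, and `estimate_pi` is a hypothetical function name.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    inside = 0
    for _ in range(samples):   # a loop with a branch, not a set operation
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / samples
```

Each sample is cheap, but accuracy improves only slowly with the sample count, which is why this style of processing is CPU-heavy and maps poorly onto a single set-based query.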

Page 17

Managing Shadow IT

• To get their job done, users abscond with data daily
  • Bypass IT, governance, and security
  • Data-mart-under-my-desk
• Dispensing data reliably
  • HELP users get needed data
  • Improve data quality
  • Get some control versus none
  • Add some governance, security, audit

Diagram label: Data Lake

Page 18

Offloading the Coldest Data

• Offload coldest rows
  • Free up IDW storage
• Temperature = usage
  • Date stamp often irrelevant
• Archive, compliance
• Accessible with QueryGrid

Diagram labels: Hot/warm data, Coldest data, ETL, QueryGrid move
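The point that temperature means usage rather than age can be sketched as follows. The `Row` type, its `reads_last_90_days` counter, and the 25% cold fraction are all assumptions made for illustration; the slide does not say how usage is measured, and this is not how QueryGrid itself selects rows.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Row:
    key: str
    created: date             # date stamp, deliberately ignored below
    reads_last_90_days: int   # hypothetical usage counter


def split_by_temperature(rows, cold_fraction=0.25):
    """Return (hot_warm, coldest): the least-read rows are the coldest."""
    ranked = sorted(rows, key=lambda r: r.reads_last_90_days)
    cutoff = int(len(ranked) * cold_fraction)
    return ranked[cutoff:], ranked[:cutoff]


rows = [
    Row("a", date(2010, 1, 1), 500),  # old but heavily read: stays hot
    Row("b", date(2016, 6, 1), 0),    # recent but never read: coldest
    Row("c", date(2015, 1, 1), 20),
    Row("d", date(2014, 1, 1), 7),
]
hot_warm, coldest = split_by_temperature(rows)
```

Here the oldest row stays in the warehouse and the newest one is offloaded, which is the sense in which the date stamp is often irrelevant.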

Page 19

Single Subject Data Analysis

• Analytics
  • Query and reporting
  • Data mining
  • Dashboards
• Single subject star schema
  • 1-2 raw data fact tables
  • Structured + unstructured data
  • Non-cleansed data
  • Non-integrated data
  • Dimension tables

Raw data files

#Version: 1.0
#GMT-Offset: -0800
#Software: MyCorpTopaz Web Cache 2.0.0.2.0;
#Start-Date: 2015-06-21 00:00:18
#Fields: c-ip c-dns c-auth-id date time cs-method cs-uri sc-status-ctrl bytes cs(Cookie) cs(Referrer) time-taken cs(User-Agent)
#date: 2015-07-31; ”buyer”=“Willcox”; order”=“lingerie”; DMS.user; GET /images/bottom.gif 200A17x 350 "BIGipServer_webcache”=“217”; ORA_UCM_AGID=%2fMP%2f8M7%3etSHPV%40%2fS%3f%3fDh3V“; "http://www.myDBl.com/nl.html" 37087 "Mozilla/4.5 [en] (WinNT;)"

Diagram labels: store, address, date, type

Page 20

Big Pictures

Page 21

Data Lake Architecture

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access: Search, Profiling, Tagging, Analytics

Preparation: Cleansing, Validation, Aggregation, Materialization

Acquisition: Ingest, Conversion, Encryption

Security, Metadata/Lineage, Administration

Distributed Storage

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs
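The three layers in this architecture (acquisition, preparation, access) can be sketched as a tiny pipeline. This is a toy under stated assumptions: the JSON sensor records, the field names `sensor` and `value`, and the three function names are hypothetical, and a real lake would sit on distributed storage rather than Python lists.

```python
import json


def acquire(raw_lines):
    """Acquisition: ingest raw text and convert it, keeping the original line."""
    records = []
    for line in raw_lines:
        try:
            records.append({"raw": line, "data": json.loads(line)})
        except json.JSONDecodeError:
            # Unparseable input is still captured in raw form.
            records.append({"raw": line, "data": None})
    return records


def prepare(records):
    """Preparation: validate records, then aggregate the mean value per sensor."""
    values = {}
    for rec in records:
        data = rec["data"]
        if not isinstance(data, dict) or "sensor" not in data or "value" not in data:
            continue  # fails validation; the raw line remains stored
        values.setdefault(data["sensor"], []).append(data["value"])
    return {sensor: sum(v) / len(v) for sensor, v in values.items()}


def access(aggregates, sensor):
    """Access: a simple lookup standing in for search."""
    return aggregates.get(sensor)
```

Keeping the raw line next to the converted record reflects the "original raw data fidelity" property of the design pattern: preparation can be rerun later with different rules.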

Page 22

Data Lake Architecture

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Layer labels: Streams, Search, Aggregations, Msg. queues, Cleansing, Access, Experiments, Governance, Files

Security, Metadata/Lineage, Administration

Distributed Storage

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Page 23

Hadoop Data Lake Technologies

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

YARN, Ambari, Navigator, HCatalog, Sentry

HDFS, S3: raw data, derived views

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Page 24

Data Lake: Teradata 1800

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Teradata Parallel Data Environment

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs

Data Lab Studio

QueryGrid

SAS mining Fuzzy Logix

SPSS Revolution R

Informatica DataStage Oracle DI

SAS DI Studio Ab Initio

Microsoft

TPT Data-mover

Listener REST APIs Attunity

Informatica, IBM Data Stage, Oracle Data Integrator, Talend

Viewpoint, Ecosystem Manager, Unity

Page 25

Data Lake Definition Summary

• The data lake is a design pattern
  • Requires and uses many technologies
• The data lake is more than Hadoop
  • Amazon S3, Cassandra, Teradata
  • Other tools and technologies
• Hadoop is more than a data lake
• The data lake manages raw data
  • Refined in downstream processes

Diagram labels: Data sources, Downstream consumers

Page 26

Thank You

Questions/Comments

Email: [email protected]

Follow Me: Twitter @DanGraham_

Rate This Session #417 with the PARTNERS Mobile App -- rate it a 5 please

Remember To Share Your Virtual Passes

Page 27

Page 28

Data Lake Platforms

Data lake definition | Hadoop | Amazon EMR | Cassandra | Teradata 1800
Long term data containers | X | X | X | X
Capture, refine, and explore | X | X | X | X
Raw data at scale | X | X | X | X
Low cost technologies | X | X | X | X
Feeds downstream uses | X | X | X | X

Options | Hadoop | Amazon EMR | Cassandra | Teradata 1800
Schema-on-read | X | X | X | JSON, NVPs
File system | HDFS | S3 | CFS | RDBMS
SQL, Java, Python, Ruby, scripts | X | X | X | X

Search engines: Solr (listed for two of the four platforms)

Page 29

Value Creation via Data Integration

DATA LAKE

• Data integration on demand
• Data value assumed
• Typically schema-on-read

INTEGRATED DATA WAREHOUSE

• Data integration up front
• Data value manufactured
• Typically schema-on-write

Sources: SCM, CRM, ERP
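The contrast between schema-on-read and schema-on-write can be sketched in a few lines. The log lines and the five-field layout are hypothetical, simplified stand-ins loosely modeled on the web log sample shown earlier in the deck; `read_with_schema` and `write_with_schema` are illustrative names, not a product API.

```python
# Hypothetical, simplified raw log lines.
RAW_LINES = [
    "2015-07-31 GET /images/bottom.gif 200 350",
    "2015-07-31 GET /index.html 404 oops",  # dirty bytes value
]

FIELDS = ["date", "method", "uri", "status", "bytes"]  # assumed layout


def read_with_schema(line, fields=FIELDS):
    """Schema-on-read: raw text is stored as-is; names and types are applied at query time."""
    record = dict(zip(fields, line.split()))
    try:
        record["bytes"] = int(record["bytes"])
    except (KeyError, ValueError):
        record["bytes"] = None  # tolerate dirty raw data when reading
    return record


def write_with_schema(line, table, fields=FIELDS):
    """Schema-on-write: the row is typed and validated before it is stored."""
    record = dict(zip(fields, line.split()))
    record["bytes"] = int(record["bytes"])  # raises ValueError on bad input
    table.append(record)
```

With schema-on-read the dirty second line is still stored and queryable (its `bytes` is `None`); with schema-on-write it is rejected at load time, which is the up-front integration cost the warehouse pays to manufacture data value.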

Page 30

Teradata’s Hadoop Data Lake Products

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts

Access, Preparation, Acquisition

Listener, App Center

HDFS

Viewpoint

SOURCES: Sensors, email, Social, Telemetry, Mobile, Tabular Data, Machine logs