The Great Lakes: How to Approach a Big Data Implementation

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Uploaded by: inside-analysis

Posted on: 15-Jul-2015

Category: Technology


TRANSCRIPT

Page 1: The Great Lakes: How to Approach a Big Data Implementation

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: The Great Lakes: How to Approach a Big Data Implementation

The Briefing Room

The Great Data Lakes: How to Approach a Big Data Implementation

Page 3: The Great Lakes: How to Approach a Big Data Implementation

Twitter Tag: #briefr The Briefing Room

Welcome

Host: Eric Kavanagh

[email protected] @eric_kavanagh

Page 4: The Great Lakes: How to Approach a Big Data Implementation

Mission

•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today’s innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!

Page 5: The Great Lakes: How to Approach a Big Data Implementation

Topics

April: BIG DATA

May: CLOUD

June: INNOVATORS

Page 6: The Great Lakes: How to Approach a Big Data Implementation

Will History Repeat Itself Again?

•  Partitioning matters
•  File formats matter
•  Metadata matters
•  Access patterns matter

Hadoop may be schema-agnostic, but that doesn’t mean you shouldn’t carefully plan your implementation! (See the sketch after the quote below.)

“I’ve always found that plans are useless, but planning is indispensable.” – Dwight D. Eisenhower
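To make the partitioning and file-format points concrete, here is a minimal PySpark sketch; the paths and the event_date column are invented for illustration, not from the deck:

```python
# A hedged sketch, not the presenter's code: land raw JSON, then store it as
# date-partitioned Parquet so downstream queries prune partitions and read a
# columnar format instead of scanning raw files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Assumes the raw drop zone holds JSON events carrying an event_date field.
events = spark.read.json("/landing/clickstream/2015-04-07/")

(events.write
    .mode("append")
    .partitionBy("event_date")       # partitioning matters: enables pruning
    .parquet("/lake/clickstream/"))  # file formats matter: columnar cuts I/O
```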

Page 7: The Great Lakes: How to Approach a Big Data Implementation

Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

[email protected] @robinbloor

Page 8: The Great Lakes: How to Approach a Big Data Implementation

Think Big, A Teradata Company

•  Last year Teradata acquired Think Big Analytics, Inc., a consulting and solutions company focused on big data solutions

•  Think Big has expertise in implementing a variety of open source technologies, such as Hadoop, HBase, Cassandra, MongoDB and Storm, as well as experience with Hortonworks, Cloudera and MapR

•  Its consultants can assist with the planning, management and deployment of big data implementations

Page 9: The Great Lakes: How to Approach a Big Data Implementation

Guest: Rick Stellwagen

Rick Stellwagen is Data Lake Program Director at Think Big, A Teradata Company. Rick is responsible for defining and rolling out a Data Lake Solution portfolio, identifying and integrating internal and external best-in-class technologies. He is defining the deployment model, offerings, skills, career paths and integrated capabilities required for data lake construction and rollout. He also works with product management, engineering, marketing and external partner alliances to define thought leadership positions and shape product plans, both internally and externally.

Page 10: The Great Lakes: How to Approach a Big Data Implementation

MAKING BIG DATA COME ALIVE

Data Lake Deployment Best Practices

Rick Stellwagen, Data Lake Program Director
April 7, 2015

Page 11: The Great Lakes: How to Approach a Big Data Implementation

What is a Data Lake?

A centralized repository of raw data into which all data-producing streams flow and from which downstream facilities may draw.

Information Sources → Data Lake → Downstream Facilities

Data Variety is the driving factor in building a Data Lake.

Page 12: The Great Lakes: How to Approach a Big Data Implementation

Data Lake: Swamp or Reservoir?

(Slide shows two contrasting images: a swamp and a reservoir.)

Page 13: The Great Lakes: How to Approach a Big Data Implementation

Primary Data Lake Use Cases

•  Corporate Data Sourcing – Repository – System of Record
   -  Govern who, what and when data is accessed or provisioned
   -  Track usage, resolve anomalies, visualize, optimize and clarify data lineage

•  Historical Data Offload
   -  Offload history of operational and analytical data platforms
   -  Centralized control of restore capabilities; leverage deep data history

•  Data Discovery, Organization and Identification
   -  Gain ultimate flexibility in data use and access: schema on read (see the sketch below)
   -  Lightly conditioned, un-modeled, flexible modeling

•  ETL Offload
   -  Foundation for Data Integration – push staging to Hadoop
   -  Data Quality and validation

•  Business Reporting
   -  OLAP analysis sourced & processed directly from the data lake
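A hedged PySpark sketch of schema on read: the lake stores lightly conditioned files, and each consumer supplies its own schema at query time (the path and field names are assumptions, not from the deck):

```python
# Schema on read: the schema travels with the reader, not with the stored data.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# One consumer's view of the raw files; another consumer may read the same
# bytes with a different schema entirely.
orders_schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer", StringType()),
    StructField("status", StringType()),
])

orders = spark.read.schema(orders_schema).json("/lake/raw/orders/")
orders.where(orders.status == "OPEN").show()
```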

Page 14: The Great Lakes: How to Approach a Big Data Implementation

Data Lake: Swamp or Reservoir?

•  A Data Reservoir is a managed Data Lake that seeks to guarantee quality, access, provenance, and governance.

•  An important extra guarantee that makes a Data Reservoir is the presence of metadata that might enable non-subject-matter experts to easily know the location of, and entitlements to, the various forms of stored data within.

•  Schema Metadata is always a given, but...

Page 15: The Great Lakes: How to Approach a Big Data Implementation

Business-Ontology

•  How does this data relate to other data?
•  How do we classify this data within the business?
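As a toy illustration (every name below is invented), business classification can be as simple as tagging datasets with concepts, so "related data" is answerable by concept rather than by path:

```python
# A minimal, hypothetical business-ontology mapping over lake paths.
ONTOLOGY = {
    "/lake/raw/orders/":    {"domain": "sales",     "concepts": {"order", "customer"}},
    "/lake/raw/shipments/": {"domain": "logistics", "concepts": {"order", "carrier"}},
    "/lake/raw/campaigns/": {"domain": "marketing", "concepts": {"customer", "channel"}},
}

def related(path):
    """Datasets sharing at least one business concept with `path`."""
    mine = ONTOLOGY[path]["concepts"]
    return [p for p, meta in ONTOLOGY.items()
            if p != path and mine & meta["concepts"]]

print(related("/lake/raw/orders/"))
# -> ['/lake/raw/shipments/', '/lake/raw/campaigns/']
```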

Page 16: The Great Lakes: How to Approach a Big Data Implementation

Business-Security

•  Who owns the data?
•  Who can read the data?
•  Who can see a column?
•  Who belongs to what group?

Supporting technologies: LDAP, Argus, Unix bitmask permissions
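At the file level, "who can read the data" comes down to the Unix permission bitmask the slide names; HDFS applies the same POSIX-style model. A small sketch of inspecting those bits (column-level visibility sits above this, in a policy layer such as Argus):

```python
# Answer the ownership and read-access questions from the POSIX bitmask.
import os
import stat

def describe_access(path):
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),         # e.g. '-rw-r-----'
        "owner_uid": st.st_uid,                    # who owns the data?
        "group_gid": st.st_gid,                    # who belongs to what group?
        "group_can_read": bool(st.st_mode & stat.S_IRGRP),
    }

print(describe_access("/etc/hosts"))               # any readable file works
```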

Page 17: The Great Lakes: How to Approach a Big Data Implementation

Operational

•  Where did my data come from?
•  Any environmental context about the landing zone, OS, where my data came from?
•  What processes touched my data?
•  When did my data get ingested? ... get transformed? ... get exported?
•  Identity?
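These operational questions map naturally onto a per-ingest metadata record. A hedged sketch, with invented field names, of what an ingest job might log so the questions stay answerable later:

```python
import json
import platform
import socket
from datetime import datetime, timezone

def operational_record(source_uri, landing_path, process_name):
    """Capture where data came from, what touched it, and when."""
    return {
        "source_uri": source_uri,        # where did my data come from?
        "landing_path": landing_path,    # landing-zone context
        "host": socket.gethostname(),    # environmental context
        "os": platform.platform(),
        "process": process_name,         # what process touched my data?
        "ingested_at": datetime.now(timezone.utc).isoformat(),  # when?
    }

print(json.dumps(operational_record(
    "sftp://partner/feeds/orders.csv", "/lake/ingest/orders/", "bulk-loader"),
    indent=2))
```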

Page 18: The Great Lakes: How to Approach a Big Data Implementation

Business-Index

•  What contents are in a file?
•  What is the data serialization?
•  Where can we find certain content in the file?
•  What terms are in the contents?

Supporting technologies: e-Discovery, Solr, a lot of NoSQL, file magic numbers
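The serialization question can often be answered from a file's leading bytes alone. A small sketch; the signatures below are the real on-disk magic numbers of these formats, while the example path is hypothetical:

```python
# Sniff a file's serialization from its magic number.
MAGIC = {
    b"PAR1": "Parquet",
    b"ORC": "ORC",
    b"Obj\x01": "Avro object container",
    b"SEQ": "Hadoop SequenceFile",
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
}

def sniff_format(path):
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "unknown (possibly plain text)"

# print(sniff_format("/lake/raw/orders/part-00000"))  # hypothetical path
```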

Page 19: The Great Lakes: How to Approach a Big Data Implementation

Business-Schema

•  How should I interpret my data?
•  What are my column names?
•  How does my data denormalize?
•  Are there any “important” dimensions?

Supporting technologies: metarepository, HCatalog
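HCatalog exposes the Hive metastore, so capturing column names usually means registering a schema over files already sitting in the lake. A hedged sketch using Spark SQL with Hive support; the table, columns and location are invented, and a configured metastore is assumed:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-schema")
         .enableHiveSupport()   # requires a configured Hive metastore
         .getOrCreate())

# Register column names/types over existing lake files; the data itself stays
# where it is (schema on read, but catalogued for everyone to discover).
spark.sql("CREATE DATABASE IF NOT EXISTS lake")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake.orders (
        order_id BIGINT,
        customer STRING,
        status   STRING
    )
    PARTITIONED BY (event_date STRING)  -- an "important" dimension
    STORED AS PARQUET
    LOCATION '/lake/raw/orders/'
""")
```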

Page 20: The Great Lakes: How to Approach a Big Data Implementation

Assembling the Reservoir

(Slide diagram: information sources flow into the Data Lake and on to downstream facilities, with perimeter, authentication and authorization around the Data Hub. Stages shown: Evaluate Source, Prepare Source Metadata, Prepare Data for Ingest, Ingest, Collect & Manage Metadata, Profile - Structure, Sequence, Compress, Automate, Protect, Discovery Signals, and Generate Reports.)
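Several of those stages can fold into a single streaming pass per file. A hedged sketch; only the stage names come from the slide, while the paths and layout are invented:

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest(source: Path, lake_root: Path) -> dict:
    """Land one file: compress, checksum ("protect"), and capture metadata."""
    landed = lake_root / "ingest" / (source.name + ".gz")
    landed.parent.mkdir(parents=True, exist_ok=True)

    sha = hashlib.sha256()
    with open(source, "rb") as src, gzip.open(landed, "wb") as dst:
        while chunk := src.read(1 << 20):   # one streaming pass
            sha.update(chunk)               # protect: tamper-evident checksum
            dst.write(chunk)                # compress

    meta = {                                # collect & manage metadata
        "source": str(source),
        "landed": str(landed),
        "bytes": source.stat().st_size,
        "sha256": sha.hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(str(landed) + ".meta.json").write_text(json.dumps(meta, indent=2))
    return meta

# Example: ingest(Path("/dropbox/orders.csv"), Path("/lake"))
```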

Page 21: The Great Lakes: How to Approach a Big Data Implementation

Enterprise Data Lake Architecture

•  Each Region has different “areas”
•  Three areas for three types of usage:
   -  Data Treatment
   -  Data Reservoir
   -  Data Lab

(Slide diagram: a Regional Data Treatment Facility, with Collection Pools feeding continuous and bulk ingest into Ingest, SOR and Export Zones on a Master Compute Cluster, plus an Operational Metadata Index, Orchestration VM, Orchestration DB and Monitoring; its processes include HAR compaction, ingestion/SOR reconciliation, de-duplication and key generation. A Regional Reservoir holds Lake Master Data with a Business Metadata Index, <LOB> and Export Zones, and processes to correlate, co-locate, cleanse and de-identify. A Regional Lab runs per-insight Virtual Compute Clusters (VCCs) against Lake Master Data. Metadata capture occurs at every stage, along with de-identification.)

Key: Validate that ingestion captures metadata.

Page 22: The Great Lakes: How to Approach a Big Data Implementation

Data Treatment

•  Used by Operations only
•  Restricted
•  Non-business process
•  Lowest-Common-Denominator Data Serialization
•  The entry point for ALL your data

(Slide diagram: the Regional Data Treatment Facility – Collection Pools with continuous and bulk ingest into Ingest, SOR and Export Zones on the Master Compute Cluster, with metadata capture, an Operational Metadata Index, Orchestration VM/DB and Monitoring.)

Make sure you capture metadata! Or you risk a swamp downstream.

Page 23: The Great Lakes: How to Approach a Big Data Implementation

Data Reservoir

•  Used by Business AND Operations
•  Marting!
•  Business processes
•  DSS
•  No Ad Hoc
•  Business Restricted
•  First Introduction of SME

(Slide diagram: the Regional Reservoir – Lake Master Data with a Business Metadata Index, <LOB> and Export Zones on the Master Compute Cluster, MPP fast analytics, Orchestration VM/DB and Monitoring, and processes to correlate, co-locate, cleanse and de-identify.)

Don’t let in un-vetted data! (A promotion-gate sketch follows below.)
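The warning about un-vetted data implies a gate between the treatment area and the reservoir. A minimal sketch; the required fields and path scheme are invented, and only the idea of conditional promotion comes from the slide:

```python
# Promote a dataset into the reservoir only if its metadata is complete.
REQUIRED_METADATA = {"source", "sha256", "ingested_at", "schema", "owner"}

def vetted(meta: dict) -> bool:
    """Reservoir-ready means the vetting metadata is all present."""
    return REQUIRED_METADATA.issubset(meta)

def promote(meta: dict, name: str) -> str:
    """Return the reservoir path for a vetted dataset, or refuse."""
    if not vetted(meta):
        missing = sorted(REQUIRED_METADATA - meta.keys())
        raise ValueError(f"refusing promotion, missing metadata: {missing}")
    return f"/lake/reservoir/{meta['owner']}/{name}"

# A record missing `schema` and `owner` is rejected at the gate:
try:
    promote({"source": "sftp://partner/orders.csv",
             "sha256": "deadbeef", "ingested_at": "2015-04-07T12:00:00Z"},
            "orders")
except ValueError as err:
    print(err)   # refusing promotion, missing metadata: ['owner', 'schema']
```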

Page 24: The Great Lakes: How to Approach a Big Data Implementation

Data Lab

•  Used by business primarily
•  “Un-Safe” Data
•  Ephemeral (think virtualization)
•  Highly experimental
•  New technologies
•  Ad Hoc

(Slide diagram: the Regional Lab – Lake Master Data feeding per-insight Virtual Compute Clusters (VCCs).)

Page 25: The Great Lakes: How to Approach a Big Data Implementation

Data Lake Best Practices

•  Know where you are headed – build on Roadmap or Optimizer Planning
•  Quickly put into practice references for company-wide Data Lake ingest
•  Establish data lineage and governance tracking with metadata services
•  Establish standards and practices to scale out your data ingest
•  Develop standards for doing profiling and discovery
•  Build out a pipeline framework for data transformations
•  Develop a Security Plan (perimeter, authentication & authorization)
•  Develop an archive and information security approach
•  Plan out next steps and approach for discovery and reporting

Page 26: The Great Lakes: How to Approach a Big Data Implementation

Perceptions & Questions

Analyst: Robin Bloor

Page 27: The Great Lakes: How to Approach a Big Data Implementation

Robin Bloor, PhD

Page 28: The Great Lakes: How to Approach a Big Data Implementation

There Has Been a Clear Shift

Analytics & BI were previously EDW-centric

They are becoming Data Lake-centric

Page 29: The Great Lakes: How to Approach a Big Data Implementation

Hadoop vs Data Mgmt Engine

Hadoop              | DBMS/EDW
--------------------|----------------------
Inexpensive (?)     | Expensive
Any data            | Prepared data
May have metadata   | Will have metadata
Poor performance    | Optimized performance
Weak scheduling     | Optimized scheduling
Weak data mgmt      | Good data mgmt
Security?           | Secure
Data Lake           | Data workhorse

Page 30: The Great Lakes: How to Approach a Big Data Implementation

Big Data Architecture - 1

Think Logical, Implement Physical

Page 31: The Great Lakes: How to Approach a Big Data Implementation

Big Data Architecture - 2

Page 32: The Great Lakes: How to Approach a Big Data Implementation

Big Data Architecture - 3

Page 33: The Great Lakes: How to Approach a Big Data Implementation

Straws in the Wind

Operational Concerns:

•  Multiple local instances of Hadoop
•  Weak data placement
•  Metadata chaos
•  Lack of tuning capability
•  Security (expense)
•  User self-service becoming a file system nightmare

Page 34: The Great Lakes: How to Approach a Big Data Implementation

The Need for Best Practices

This is clear:

Data Lake is a new idea

Page 35: The Great Lakes: How to Approach a Big Data Implementation

•  Is a data lake really just a multiplicity of data marts growing wild?

•  Aside from performance-critical workloads, what should Hadoop not be used for?

•  Do you have any specific recommendations for metadata management in a data lake?

•  Is there a need for enforced provenance & lineage?

Page 36: The Great Lakes: How to Approach a Big Data Implementation

•  Security question: Encryption?

•  Where does streaming fit into the picture?

Page 37: The Great Lakes: How to Approach a Big Data Implementation

Page 38: The Great Lakes: How to Approach a Big Data Implementation

Upcoming Topics

www.insideanalysis.com

April: BIG DATA

May: CLOUD

June: INNOVATORS

Page 39: The Great Lakes: How to Approach a Big Data Implementation

THANK YOU for your ATTENTION!

Some images provided courtesy of Wikimedia Commons