The Great Lakes: How to Approach a Big Data Implementation

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Uploaded by: inside-analysis

Posted on: 15-Jul-2015

Category: Technology


TRANSCRIPT

Page 1: The Great Lakes: How to Approach a Big Data Implementation

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: The Great Lakes: How to Approach a Big Data Implementation

The Briefing Room

The Great Data Lakes: How to Approach a Big Data Implementation

Page 3: The Great Lakes: How to Approach a Big Data Implementation

Twitter Tag: #briefr The Briefing Room

Welcome

Host: Eric Kavanagh

[email protected] @eric_kavanagh

Page 4: The Great Lakes: How to Approach a Big Data Implementation

Mission

•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today’s innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!

Page 5: The Great Lakes: How to Approach a Big Data Implementation

Topics

April: BIG DATA

May: CLOUD

June: INNOVATORS

Page 6: The Great Lakes: How to Approach a Big Data Implementation

Will History Repeat Itself Again?

•  Partitioning matters
•  File formats matter
•  Metadata matters
•  Access patterns matter

Hadoop may be schema-agnostic, but that doesn’t mean you shouldn’t carefully plan your implementation! (See the sketch after the quote below.)

“I’ve always found that plans are useless, but planning is indispensable.” – Dwight D. Eisenhower
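To make the partitioning and file-format points concrete, here is a minimal PySpark sketch; the paths and the event_date column are invented for illustration, not from the deck:

```python
# A hedged sketch, not the presenter's code: land raw JSON, then store it as
# date-partitioned Parquet so downstream queries prune partitions and read a
# columnar format instead of scanning raw files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Assumes the raw drop zone holds JSON events carrying an event_date field.
events = spark.read.json("/landing/clickstream/2015-04-07/")

(events.write
    .mode("append")
    .partitionBy("event_date")       # partitioning matters: enables pruning
    .parquet("/lake/clickstream/"))  # file formats matter: columnar cuts I/O
```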

Page 7: The Great Lakes: How to Approach a Big Data Implementation

Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

[email protected] @robinbloor

Page 8: The Great Lakes: How to Approach a Big Data Implementation

Think Big, A Teradata Company

•  Last year Teradata acquired Think Big Analytics, Inc., a consulting and solutions company focused on big data solutions

•  Think Big has expertise in implementing a variety of open source technologies, such as Hadoop, HBase, Cassandra, MongoDB and Storm, as well as experience with Hortonworks, Cloudera and MapR

•  Its consultants can assist with the planning, management and deployment of big data implementations

Page 9: The Great Lakes: How to Approach a Big Data Implementation

Guest: Rick Stellwagen

Rick Stellwagen is Data Lake Program Director at Think Big, A Teradata Company. Rick is responsible for defining and rolling out a Data Lake Solution portfolio, identifying and integrating internal and external best-in-class technologies. He is defining the deployment model, offerings, skills, career paths and integrated capabilities required for data lake construction and rollout. He also works with product management, engineering, marketing and external partner alliances to define thought leadership positions and shape product plans, both internally and externally.

Page 10: The Great Lakes: How to Approach a Big Data Implementation

MAKING BIG DATA COME ALIVE

Data Lake Deployment Best Practices

Rick Stellwagen, Data Lake Program Director
April 7, 2015

Page 11: The Great Lakes: How to Approach a Big Data Implementation

What is a Data Lake?

A centralized repository of raw data into which all data-producing streams flow and from which downstream facilities may draw.

Information Sources → Data Lake → Downstream Facilities

Data Variety is the driving factor in building a Data Lake.

Page 12: The Great Lakes: How to Approach a Big Data Implementation

Data Lake: Swamp or Reservoir?

(Slide shows two contrasting images: a swamp and a reservoir.)

Page 13: The Great Lakes: How to Approach a Big Data Implementation

Primary Data Lake Use Cases

•  Corporate Data Sourcing – Repository – System of Record
   -  Govern who, what and when data is accessed or provisioned
   -  Track usage, resolve anomalies, visualize, optimize and clarify data lineage

•  Historical Data Offload
   -  Offload history of operational and analytical data platforms
   -  Centralized control of restore capabilities; leverage deep data history

•  Data Discovery, Organization and Identification
   -  Gain ultimate flexibility in data use and access: schema on read (see the sketch below)
   -  Lightly conditioned, un-modeled, flexible modeling

•  ETL Offload
   -  Foundation for Data Integration – push staging to Hadoop
   -  Data Quality and validation

•  Business Reporting
   -  OLAP analysis sourced & processed directly from the data lake
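A hedged PySpark sketch of schema on read: the lake stores lightly conditioned files, and each consumer supplies its own schema at query time (the path and field names are assumptions, not from the deck):

```python
# Schema on read: the schema travels with the reader, not with the stored data.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# One consumer's view of the raw files; another consumer may read the same
# bytes with a different schema entirely.
orders_schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer", StringType()),
    StructField("status", StringType()),
])

orders = spark.read.schema(orders_schema).json("/lake/raw/orders/")
orders.where(orders.status == "OPEN").show()
```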

Page 14: The Great Lakes: How to Approach a Big Data Implementation

Data Lake: Swamp or Reservoir?

•  A Data Reservoir is a managed Data Lake that seeks to guarantee quality, access, provenance, and governance.

•  An important extra guarantee that makes a Data Reservoir is the presence of metadata that might enable non-subject-matter experts to easily know the location of, and entitlements to, the various forms of stored data within.

•  Schema Metadata is always a given, but...

Page 15: The Great Lakes: How to Approach a Big Data Implementation

Business-Ontology

•  How does this data relate to other data?
•  How do we classify this data within the business?
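As a toy illustration (every name below is invented), business classification can be as simple as tagging datasets with concepts, so "related data" is answerable by concept rather than by path:

```python
# A minimal, hypothetical business-ontology mapping over lake paths.
ONTOLOGY = {
    "/lake/raw/orders/":    {"domain": "sales",     "concepts": {"order", "customer"}},
    "/lake/raw/shipments/": {"domain": "logistics", "concepts": {"order", "carrier"}},
    "/lake/raw/campaigns/": {"domain": "marketing", "concepts": {"customer", "channel"}},
}

def related(path):
    """Datasets sharing at least one business concept with `path`."""
    mine = ONTOLOGY[path]["concepts"]
    return [p for p, meta in ONTOLOGY.items()
            if p != path and mine & meta["concepts"]]

print(related("/lake/raw/orders/"))
# -> ['/lake/raw/shipments/', '/lake/raw/campaigns/']
```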

Page 16: The Great Lakes: How to Approach a Big Data Implementation

Business-Security

•  Who owns the data?
•  Who can read the data?
•  Who can see a column?
•  Who belongs to what group?

Supporting technologies: LDAP, Argus, Unix bitmask permissions
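At the file level, "who can read the data" comes down to the Unix permission bitmask the slide names; HDFS applies the same POSIX-style model. A small sketch of inspecting those bits (column-level visibility sits above this, in a policy layer such as Argus):

```python
# Answer the ownership and read-access questions from the POSIX bitmask.
import os
import stat

def describe_access(path):
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),         # e.g. '-rw-r-----'
        "owner_uid": st.st_uid,                    # who owns the data?
        "group_gid": st.st_gid,                    # who belongs to what group?
        "group_can_read": bool(st.st_mode & stat.S_IRGRP),
    }

print(describe_access("/etc/hosts"))               # any readable file works
```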

Page 17: The Great Lakes: How to Approach a Big Data Implementation

Operational

•  Where did my data come from?
•  Any environmental context about the landing zone, OS, where my data came from?
•  What processes touched my data?
•  When did my data get ingested? ... get transformed? ... get exported?
•  Identity?
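These operational questions map naturally onto a per-ingest metadata record. A hedged sketch, with invented field names, of what an ingest job might log so the questions stay answerable later:

```python
import json
import platform
import socket
from datetime import datetime, timezone

def operational_record(source_uri, landing_path, process_name):
    """Capture where data came from, what touched it, and when."""
    return {
        "source_uri": source_uri,        # where did my data come from?
        "landing_path": landing_path,    # landing-zone context
        "host": socket.gethostname(),    # environmental context
        "os": platform.platform(),
        "process": process_name,         # what process touched my data?
        "ingested_at": datetime.now(timezone.utc).isoformat(),  # when?
    }

print(json.dumps(operational_record(
    "sftp://partner/feeds/orders.csv", "/lake/ingest/orders/", "bulk-loader"),
    indent=2))
```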

Page 18: The Great Lakes: How to Approach a Big Data Implementation

Business-Index

•  What contents are in a file?
•  What is the data serialization?
•  Where can we find certain content in the file?
•  What terms are in the contents?

Supporting technologies: e-Discovery, Solr, a lot of NoSQL, file magic numbers
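The serialization question can often be answered from a file's leading bytes alone. A small sketch; the signatures below are the real on-disk magic numbers of these formats, while the example path is hypothetical:

```python
# Sniff a file's serialization from its magic number.
MAGIC = {
    b"PAR1": "Parquet",
    b"ORC": "ORC",
    b"Obj\x01": "Avro object container",
    b"SEQ": "Hadoop SequenceFile",
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
}

def sniff_format(path):
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "unknown (possibly plain text)"

# print(sniff_format("/lake/raw/orders/part-00000"))  # hypothetical path
```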

Page 19: The Great Lakes: How to Approach a Big Data Implementation

Business-Schema

•  How should I interpret my data?
•  What are my column names?
•  How does my data denormalize?
•  Are there any “important” dimensions?

Supporting technologies: metarepository, HCatalog
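HCatalog exposes the Hive metastore, so capturing column names usually means registering a schema over files already sitting in the lake. A hedged sketch using Spark SQL with Hive support; the table, columns and location are invented, and a configured metastore is assumed:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-schema")
         .enableHiveSupport()   # requires a configured Hive metastore
         .getOrCreate())

# Register column names/types over existing lake files; the data itself stays
# where it is (schema on read, but catalogued for everyone to discover).
spark.sql("CREATE DATABASE IF NOT EXISTS lake")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake.orders (
        order_id BIGINT,
        customer STRING,
        status   STRING
    )
    PARTITIONED BY (event_date STRING)  -- an "important" dimension
    STORED AS PARQUET
    LOCATION '/lake/raw/orders/'
""")
```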

Page 20: The Great Lakes: How to Approach a Big Data Implementation

Assembling the Reservoir

(Slide diagram: information sources flow into the Data Lake and on to downstream facilities, with perimeter, authentication and authorization around the Data Hub. Stages shown: Evaluate Source, Prepare Source Metadata, Prepare Data for Ingest, Ingest, Collect & Manage Metadata, Profile - Structure, Sequence, Compress, Automate, Protect, Discovery Signals, and Generate Reports.)
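Several of those stages can fold into a single streaming pass per file. A hedged sketch; only the stage names come from the slide, while the paths and layout are invented:

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest(source: Path, lake_root: Path) -> dict:
    """Land one file: compress, checksum ("protect"), and capture metadata."""
    landed = lake_root / "ingest" / (source.name + ".gz")
    landed.parent.mkdir(parents=True, exist_ok=True)

    sha = hashlib.sha256()
    with open(source, "rb") as src, gzip.open(landed, "wb") as dst:
        while chunk := src.read(1 << 20):   # one streaming pass
            sha.update(chunk)               # protect: tamper-evident checksum
            dst.write(chunk)                # compress

    meta = {                                # collect & manage metadata
        "source": str(source),
        "landed": str(landed),
        "bytes": source.stat().st_size,
        "sha256": sha.hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(str(landed) + ".meta.json").write_text(json.dumps(meta, indent=2))
    return meta

# Example: ingest(Path("/dropbox/orders.csv"), Path("/lake"))
```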

Page 21: The Great Lakes: How to Approach a Big Data Implementation

Enterprise Data Lake Architecture

•  Each Region has different “areas”
•  Three areas for three types of usage:
   -  Data Treatment
   -  Data Reservoir
   -  Data Lab

(Slide diagram: a Regional Data Treatment Facility, with Collection Pools feeding continuous and bulk ingest into Ingest, SOR and Export Zones on a Master Compute Cluster, plus an Operational Metadata Index, Orchestration VM, Orchestration DB and Monitoring; its processes include HAR compaction, ingestion/SOR reconciliation, de-duplication and key generation. A Regional Reservoir holds Lake Master Data with a Business Metadata Index, <LOB> and Export Zones, and processes to correlate, co-locate, cleanse and de-identify. A Regional Lab runs per-insight Virtual Compute Clusters (VCCs) against Lake Master Data. Metadata capture occurs at every stage, along with de-identification.)

Key: Validate that ingestion captures metadata.

Page 22: The Great Lakes: How to Approach a Big Data Implementation

Data Treatment

•  Used by Operations only
•  Restricted
•  Non-business process
•  Lowest-Common-Denominator Data Serialization
•  The entry point for ALL your data

(Slide diagram: the Regional Data Treatment Facility – Collection Pools with continuous and bulk ingest into Ingest, SOR and Export Zones on the Master Compute Cluster, with metadata capture, an Operational Metadata Index, Orchestration VM/DB and Monitoring.)

Make sure you capture metadata! Or you risk a swamp downstream.

Page 23: The Great Lakes: How to Approach a Big Data Implementation

Data Reservoir

•  Used by Business AND Operations
•  Marting!
•  Business processes
•  DSS
•  No Ad Hoc
•  Business Restricted
•  First Introduction of SME

(Slide diagram: the Regional Reservoir – Lake Master Data with a Business Metadata Index, <LOB> and Export Zones on the Master Compute Cluster, MPP fast analytics, Orchestration VM/DB and Monitoring, and processes to correlate, co-locate, cleanse and de-identify.)

Don’t let in un-vetted data! (A promotion-gate sketch follows below.)
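The warning about un-vetted data implies a gate between the treatment area and the reservoir. A minimal sketch; the required fields and path scheme are invented, and only the idea of conditional promotion comes from the slide:

```python
# Promote a dataset into the reservoir only if its metadata is complete.
REQUIRED_METADATA = {"source", "sha256", "ingested_at", "schema", "owner"}

def vetted(meta: dict) -> bool:
    """Reservoir-ready means the vetting metadata is all present."""
    return REQUIRED_METADATA.issubset(meta)

def promote(meta: dict, name: str) -> str:
    """Return the reservoir path for a vetted dataset, or refuse."""
    if not vetted(meta):
        missing = sorted(REQUIRED_METADATA - meta.keys())
        raise ValueError(f"refusing promotion, missing metadata: {missing}")
    return f"/lake/reservoir/{meta['owner']}/{name}"

# A record missing `schema` and `owner` is rejected at the gate:
try:
    promote({"source": "sftp://partner/orders.csv",
             "sha256": "deadbeef", "ingested_at": "2015-04-07T12:00:00Z"},
            "orders")
except ValueError as err:
    print(err)   # refusing promotion, missing metadata: ['owner', 'schema']
```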

Page 24: The Great Lakes: How to Approach a Big Data Implementation

Data Lab

•  Used by business primarily
•  “Un-Safe” Data
•  Ephemeral (think virtualization)
•  Highly experimental
•  New technologies
•  Ad Hoc

(Slide diagram: the Regional Lab – Lake Master Data feeding per-insight Virtual Compute Clusters (VCCs).)

Page 25: The Great Lakes: How to Approach a Big Data Implementation

Data Lake Best Practices

•  Know where you are headed – build on Roadmap or Optimizer Planning
•  Quickly put into practice references for company-wide Data Lake ingest
•  Establish data lineage and governance tracking with metadata services
•  Establish standards and practices to scale out your data ingest
•  Develop standards for doing profiling and discovery
•  Build out a pipeline framework for data transformations
•  Develop a Security Plan (perimeter, authentication & authorization)
•  Develop an archive and information security approach
•  Plan out next steps and approach for discovery and reporting

Page 26: The Great Lakes: How to Approach a Big Data Implementation

Perceptions & Questions

Analyst: Robin Bloor

Page 27: The Great Lakes: How to Approach a Big Data Implementation

Robin Bloor, PhD

Page 28: The Great Lakes: How to Approach a Big Data Implementation

There Has Been a Clear Shift

Analytics & BI were previously EDW-centric

They are becoming Data Lake-centric

Page 29: The Great Lakes: How to Approach a Big Data Implementation

Hadoop vs Data Mgmt Engine

Hadoop              | DBMS/EDW
--------------------|----------------------
Inexpensive (?)     | Expensive
Any data            | Prepared data
May have metadata   | Will have metadata
Poor performance    | Optimized performance
Weak scheduling     | Optimized scheduling
Weak data mgmt      | Good data mgmt
Security?           | Secure
Data Lake           | Data workhorse

Page 30: The Great Lakes: How to Approach a Big Data Implementation

Big Data Architecture - 1

Think Logical, Implement Physical

Page 31: The Great Lakes: How to Approach a Big Data Implementation

Big Data Architecture - 2

Page 32: The Great Lakes: How to Approach a Big Data Implementation

Big Data Architecture - 3

Page 33: The Great Lakes: How to Approach a Big Data Implementation

Straws in the Wind

Operational Concerns:

•  Multiple local instances of Hadoop
•  Weak data placement
•  Metadata chaos
•  Lack of tuning capability
•  Security (expense)
•  User self-service becoming a file system nightmare

Page 34: The Great Lakes: How to Approach a Big Data Implementation

The Need for Best Practices

This is clear:

Data Lake is a new idea

Page 35: The Great Lakes: How to Approach a Big Data Implementation

•  Is a data lake really just a multiplicity of data marts growing wild?

•  Aside from performance-critical workloads, what should Hadoop not be used for?

•  Do you have any specific recommendations for metadata management in a data lake?

•  Is there a need for enforced provenance & lineage?

Page 36: The Great Lakes: How to Approach a Big Data Implementation

•  Security question: Encryption?

•  Where does streaming fit into the picture?

Page 37: The Great Lakes: How to Approach a Big Data Implementation

Page 38: The Great Lakes: How to Approach a Big Data Implementation

Upcoming Topics

www.insideanalysis.com

April: BIG DATA

May: CLOUD

June: INNOVATORS

Page 39: The Great Lakes: How to Approach a Big Data Implementation

THANK YOU for your ATTENTION!

Some images provided courtesy of Wikimedia Commons