The Great Lakes: How to Approach a Big Data Implementation
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
The Great Data Lakes: How to Approach a Big Data Implementation
Twitter Tag: #briefr The Briefing Room
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
April: BIG DATA
May: CLOUD
June: INNOVATORS
Will History Repeat Itself Again?
• Partitioning matters
• File formats matter
• Metadata matters
• Access patterns matter
Hadoop may be schema-agnostic, but that doesn’t mean you shouldn’t carefully plan your implementation!
“I’ve always found that plans are useless, but planning is indispensable.” – Dwight D. Eisenhower
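The partitioning point can be made concrete. As a minimal sketch (the `events` dataset name and `/lake/raw` root are illustrative, not from the deck), a Hive-style date-partitioned layout lets query engines prune directories instead of scanning everything:

```python
from datetime import date

def partition_path(dataset: str, day: date) -> str:
    """Build a Hive-style partitioned path: a query bounded to a few
    days only has to read the matching directories."""
    return (f"/lake/raw/{dataset}/year={day.year}"
            f"/month={day.month:02d}/day={day.day:02d}")

print(partition_path("events", date(2015, 4, 7)))
# /lake/raw/events/year=2015/month=04/day=07
```

The same planning mindset applies to the other bullets: columnar file formats such as Parquet or ORC reward exactly the access patterns the slide says you should think through up front.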
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
[email protected] @robinbloor
Think Big, A Teradata Company
Last year Teradata acquired Think Big Analytics, Inc., a consulting and solutions company focused on big data solutions
Think Big has expertise in implementing a variety of open source technologies, such as Hadoop, HBase, Cassandra, MongoDB and Storm, as well as experience with Hortonworks, Cloudera and MapR
Its consultants can assist with the planning, management and deployment of big data implementations
Guest: Rick Stellwagen
Rick Stellwagen is Data Lake Program Director at Think Big, A Teradata Company. Rick is responsible for defining and rolling out a Data Lake Solution portfolio, identifying and integrating best-in-class internal and external technologies. He is defining the deployment model, offerings, skills, career paths and integrated capabilities required for data lake construction and rollout. He also works with product management, engineering, marketing and external partner alliances to define thought-leadership positions and shape product plans both internally and externally.
MAKING BIG DATA COME ALIVE
Data Lake Deployment Best Practices
Rick Stellwagen, Data Lake Program Director April 7, 2015
CONFIDENTIAL | 11
What is a Data Lake?
A centralized repository of raw data into which all data-producing streams flow and from which downstream facilities may draw.
[Diagram: Information Sources → Data Lake → Downstream Facilities]
Data Variety is the driving factor in building a Data Lake
Data Lake: Swamp or Reservoir?
[Images: a swamp vs. a reservoir]
Primary Data Lake Use Cases
• Corporate Data Sourcing – Repository – System of Record
  - Govern who, what and when data is accessed or provisioned
  - Track usage, resolve anomalies, visualize, optimize and clarify data lineage
• Historical Data Offload
  - Offload history of operational and analytical data platforms
  - Centralized control of restore capabilities; leverage deep data history
• Data Discovery, Organization and Identification
  - Gain ultimate flexibility in data use and access: schema on read
  - Lightly conditioned, un-modeled, flexible modeling
• ETL Offload
  - Foundation for data integration – push staging to Hadoop
  - Data quality and validation
• Business Reporting
  - OLAP analysis sourced and processed directly from the data lake
Data Lake: Swamp or Reservoir?
• A Data Reservoir is a managed Data Lake that seeks to guarantee quality, access, provenance, and governance.
• The extra guarantee that makes a Data Reservoir is the presence of metadata that enables non-subject-matter experts to easily find the location of, and entitlements to, the various forms of stored data within.
• Schema metadata is always a given, but…
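The reservoir's defining metadata guarantee can be sketched as a tiny catalog. The paths, owners and tags below are invented for illustration, not any product's API:

```python
# Toy metadata catalog: just enough for a non-expert to locate
# datasets by business tag and see who owns them.
catalog = {}

def register(path, owner, columns, tags):
    catalog[path] = {"owner": owner, "columns": columns, "tags": set(tags)}

def find(tag):
    """Locate datasets by business tag, no subject-matter expert needed."""
    return sorted(p for p, m in catalog.items() if tag in m["tags"])

register("/lake/sor/orders", "sales-ops", ["order_id", "amount"], ["orders", "revenue"])
register("/lake/sor/returns", "sales-ops", ["order_id", "reason"], ["orders"])
print(find("orders"))  # ['/lake/sor/orders', '/lake/sor/returns']
```

Without some registry like this, stored data is findable only by the people who already know where it is — which is the swamp.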
Business-Ontology
• How does this data relate to other data?
• How do we classify this data within the business?
Business-Security
• Who can read the data?
• Who owns the data?
• Who belongs to what group?
• Who can see a column?
[Diagram tools: LDAP, Argus, Unix bitmask permissions]
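In practice, "who can see a column?" is answered by the tools the slide names (LDAP groups, Argus — later renamed Apache Ranger — policies, Unix permissions). Purely as a toy model of the shape of that check, with invented groups and tables:

```python
# Toy column-level authorization model; real enforcement belongs in
# LDAP/Argus/file-permission policies, not application code like this.
grants = {
    ("analysts", "orders"): {"order_id", "amount"},
    ("finance", "orders"): {"order_id", "amount", "customer_ssn"},
}

def visible_columns(group, table, requested):
    """Intersect the requested columns with what the group is granted."""
    allowed = grants.get((group, table), set())
    return sorted(set(requested) & allowed)

print(visible_columns("analysts", "orders", ["order_id", "customer_ssn"]))
# ['order_id']
```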
Operational
• Where did my data come from?
• Any environmental context about the landing zone, OS, where my data came from?
• What processes touched my data?
• When did my data get ingested? ... get transformed? ... get exported?
• Identity?
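These operational questions are answerable only if every process that touches data appends to a lineage log. A minimal sketch — the event names, datasets and process names are invented:

```python
import time

# Hypothetical append-only operational-metadata log: who touched
# the data, which process, and when.
lineage = []

def record(event, dataset, process, **context):
    lineage.append({"event": event, "dataset": dataset,
                    "process": process, "ts": time.time(), **context})

record("ingested", "/lake/ingest/clicks", "collector-3", source_host="web-01")
record("transformed", "/lake/sor/clicks", "sessionize-job",
       input_path="/lake/ingest/clicks")

# Lineage for one dataset is then just a filter over the log.
history = [e["event"] for e in lineage if "clicks" in e["dataset"]]
print(history)  # ['ingested', 'transformed']
```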
Business-Index
• What contents are in a file?
• What is the data serialization?
• Where can we find certain content in the file?
• What terms are in the contents?
[Diagram tools: e-Discovery, Solr, a lot of NoSQL, file magic numbers]
Business-Schema
• How does my data denormalize?
• How should I interpret my data?
• What are my column names?
• Are there any “important” dimensions?
[Diagram tools: metarepository, HCatalog]
Assembling the Reservoir
[Diagram: information sources flow into the data lake behind a perimeter of authentication and authorization — evaluate the source, prepare source metadata, prepare data for ingest, then ingest; collect and manage metadata, profile structure and sequence, discover signals, compress, automate, protect, and generate reports as the data hub feeds downstream facilities.]
Enterprise Data Lake Architecture
• Each Region has different “areas”
• Three areas for three types of usage:
  - Data Treatment
  - Data Reservoir
  - Data Lab
[Diagram: a Regional Data Treatment Facility (collection pools for continuous and bulk feeds; ingest, SOR and export zones; operational metadata index; orchestration VM and DB; monitoring; processes such as HAR compaction, ingestion/SOR reconciliation, de-duplication and key generation), a Regional Reservoir (<LOB> zone, lake master data, business metadata index, processes to correlate, co-locate, cleanse and de-identify), and a Regional Lab (virtual compute clusters per insight); metadata is captured at every stage.]
Key: Validate that Ingestion captures Metadata
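The key point — validate that ingestion captures metadata — can be enforced mechanically at the entry point. A sketch of an ingest step that refuses data arriving without its minimum metadata record (the required fields here are an assumption for illustration, not Think Big's actual list):

```python
# Assumed minimum metadata; a real lake would define its own contract.
REQUIRED = {"source", "owner", "schema"}

def ingest(path, metadata):
    """Refuse any landing file that arrives without required metadata."""
    missing = REQUIRED - metadata.keys()
    if missing:
        raise ValueError(f"refusing {path}: missing metadata {sorted(missing)}")
    return {"path": path, **metadata}

ok = ingest("/lake/ingest/clicks/2015-04-07",
            {"source": "weblogs", "owner": "ops", "schema": "clicks_v1"})
```

Rejecting under-described data at ingest is cheap; discovering it years later in the middle of the lake is the swamp the next slide warns about.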
Data Treatment
• Used by Operations only
• Restricted
• Non-business process
• Lowest-common-denominator data serialization
• The entry point for ALL your data

Make sure you capture metadata! Or you risk a swamp downstream.
Data Reservoir
• Used by Business AND Operations
• Marting!
• Business processes
• DSS
• No ad hoc
• Business restricted
• First introduction of SME
[Diagram: regional reservoir with lake master data, business metadata index, MPP fast analytics, and processes to correlate, co-locate, cleanse and de-identify.]

Don’t let in un-vetted data!
Data Lab
• Used by business primarily
• “Un-safe” data
• Ephemeral (think virtualization)
• Highly experimental
• New technologies
• Ad hoc
[Diagram: regional lab with per-insight virtual compute clusters drawing on the lake’s master data.]
Data Lake Best Practices
• Know where you are headed – build on Roadmap or Optimizer Planning
• Quickly put into practice references for company-wide Data Lake ingest
• Establish data lineage and governance tracking with metadata services
• Establish standards and practices to scale out your data ingest
• Develop standards for doing profiling and discovery
• Build out a pipeline framework for data transformations
• Develop a security plan (perimeter, authentication & authorization)
• Develop an archive and information-security approach
• Plan out next steps and approach for discovery and reporting
Perceptions & Questions
Analyst: Robin Bloor
Robin Bloor, PhD
There Has Been a Clear Shift
Analytics & BI were previously EDW-centric
They are becoming Data Lake-centric
Hadoop vs Data Mgmt Engine

| Hadoop | DBMS/EDW |
| --- | --- |
| Inexpensive (?) | Expensive |
| Any data | Prepared data |
| May have metadata | Will have metadata |
| Poor performance | Optimized performance |
| Weak scheduling | Optimized scheduling |
| Weak data mgmt | Good data mgmt |
| Security? | Secure |
| Data Lake | Data workhorse |
Big Data Architecture - 1
Think Logical, Implement Physical
Big Data Architecture - 2
Big Data Architecture - 3
Straws in the Wind: Operational Concerns
• Multiple local instances of Hadoop
• Weak data placement
• Metadata chaos
• Lack of tuning capability
• Security (expense)
• User self-service becoming a file system nightmare
The Need for Best Practices
This is clear:
Data Lake is a new idea
• Is a data lake really just a multiplicity of data marts growing wild?
• Aside from performance-critical workloads, what should Hadoop not be used for?
• Do you have any specific recommendations for metadata management in a data lake?
• Is there a need for enforced provenance & lineage?
• Security question: Encryption?
• Where does streaming fit into the picture?
Upcoming Topics
www.insideanalysis.com
April: BIG DATA
May: CLOUD
June: INNOVATORS
THANK YOU for your ATTENTION!
Some images provided courtesy of Wikimedia Commons