webinar - data lake management: extending storage and lifecycle of data
TRANSCRIPT
Future-proofing your Data Lake: Extending Storage and Lifecycle of Data
Scott Gidley, Zaloni and Gus Horn, NetApp
Webinar: October 05 2016
• Award-winning provider of enterprise data lake management solutions:
Integrated data lake management platform
Self-service data preparation
• Data Lake Design and Implementation Services: POC, Pilot, Production, Operations, Training
• Data Science Professional Services
3 Zaloni Proprietary
Increased Agility
New Insights
Improved Scalability
Data lakes are central to the modern data architecture
Data architecture modernization
[Diagram: Traditional architecture. Labels: Sources, ETL, EDW, Data Discovery, Analytics, BI]
[Diagram: Modern architecture. Labels: Various Sources, Streaming, Unstructured Data, Data Lake, Derived (Transformed), Discovery Sandbox, EDW, Data Science, Data Discovery, Analytics, BI]
Data Lake Promise
• Stores all types of data (structured and unstructured) in its raw format
• Stores data for longer periods of time to enable historical analysis
• Manages real-time, streaming, and reference data all in the same environment
• Integrates storage and compute environments
Data Lake Reality
• Homogeneous data storage degrades performance and efficiency
• Aged or non-relevant data pollutes the data lake
• Lack of business-driven SLAs for data archival impacts compliance and automated initiatives
Zaloni Confidential and Proprietary - Provided under NDA
Big data opportunities come with challenges
Data Lake 360°: Zaloni’s holistic approach to actionable big data
1. Enable the lake
• Leverage the full power of a scale-out architecture with an actionable, scalable data lake
2. Govern the data
• Improve data visibility, reliability and quality to reduce time-to-insight
• Safeguard sensitive data and enable regulatory compliance
3. Engage the business
• Foster a data-driven business through self-service data discovery and preparation
Data lakes show promise but success can be short-lived!
▪ Internet retailer relies on data lake to enable:
  ▪ Real-time inventory analytics
  ▪ Customer next-best-offer programs
▪ Initial implementation shows promise and delivers measurable business value
▪ Increasing costs and decreasing performance due to unmanaged data growth limit long-term ROI
Real-Time Inventory Management
Customer 360: Next Best Offer
Data Lake Reference Architecture
• Data required for LOB specific views - transformed from existing certified data
• Consumers are anyone with appropriate role-based access
• Standardized on corporate governance/ quality policies
• Consumers are anyone with appropriate role-based access
• Single version of truth
[Diagram labels] Data Lake zones: Transient Landing Zone, Raw Zone, Refined Zone, Analytic Zone, Sandbox
• Temporary store of source data
• Consumers are IT, Data Stewards
• Implemented in highly regulated industries
• Original source data ready for consumption
• Consumers are ETL developers, data stewards, some data scientists
• Single source of truth with history
Sensors (or other time series data)
Relational Data Stores (OLTP/ODS/DW)
Logs (or other unstructured data)
Social and shared data
Data Lake Reference Architecture with Zaloni
[Diagram labels]
Source System: File Data, DB Data, ETL Extracts, Streaming, APIs
Data Lake: Transient Landing Zone, Raw Zone, Refined Zone, Analytic Zone, Sandbox
DATA LAKE MANAGEMENT & GOVERNANCE PLATFORM: Metadata Management, Data Quality, Data Catalog, Security
Consumption Zone: Business Analysts, Researchers, Data Scientists
Sources: Sensors (or other time series data); Relational Data Stores (OLTP/ODS/DW); Logs (or other unstructured data); Social and shared data
Bedrock Data Lifecycle Management – Policy Execution
Zaloni DLM – Future proof your data lake
[Diagram labels]
Source System: File Data, DB Data, ETL Extracts, Streams, APIs
Stages: INGEST, ORGANIZE, ENRICH, ENGAGE
Data Lake zones and DLM policies:
• Raw Data Zone, DLM Policy: < 360 Days = Warm; > 360 Days = S3 Vault
• Refined Data Zone, DLM Policy: < 30 Days = Hot; > 30 & < 120 Days = Warm; > 120 Days = S3 Vault
• Analytic Data Zone, DLM Policy: < 30 Days = Hot; > 30 Days = S3 Vault
Data Tier and Data Storage:
• Hot: E-Series Flash
• Warm: E-Series Disk
• S3 Vault: StorageGRID Webscale
Consumption Zone: Business Analysts, Researchers, Data Scientists, Applications
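The age-based DLM policies on this slide can be read as a lookup from zone and data age to a storage tier. The following is a minimal Python sketch of that reading, not Bedrock's actual policy engine or API. The names `Rule`, `POLICIES`, `TIER_STORAGE` and `tier_for` are hypothetical, the pairing of each policy with a zone is assumed from the order in which they appear on the slide, and because the slide does not say where data of exactly 30, 120 or 360 days belongs, the sketch assumes it moves to the older tier.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Rule:
    # Exclusive upper bound on age in days; None means "no upper bound".
    max_age_days: Optional[int]
    tier: str


# Hypothetical encoding of the slide's three policies (zone pairing assumed).
POLICIES = {
    "raw": [Rule(360, "warm"), Rule(None, "s3_vault")],
    "refined": [Rule(30, "hot"), Rule(120, "warm"), Rule(None, "s3_vault")],
    "analytic": [Rule(30, "hot"), Rule(None, "s3_vault")],
}

# Tier-to-storage mapping as labeled on the slide.
TIER_STORAGE = {
    "hot": "E-Series Flash",
    "warm": "E-Series Disk",
    "s3_vault": "StorageGRID Webscale",
}


def tier_for(zone: str, created: date, today: date) -> str:
    """Return the tier a data set should live on, given its zone and age."""
    age_days = (today - created).days
    for rule in POLICIES[zone]:
        # Rules are ordered youngest to oldest; the first match wins.
        if rule.max_age_days is None or age_days < rule.max_age_days:
            return rule.tier
    raise ValueError(f"no rule matched for zone {zone!r}")


if __name__ == "__main__":
    today = date(2016, 10, 5)
    for zone in POLICIES:
        tier = tier_for(zone, today - timedelta(days=45), today)
        print(zone, "->", tier, "on", TIER_STORAGE[tier])
```

Keeping the rules as ordered data rather than nested conditionals means a policy change is an edit to `POLICIES`, which mirrors the slide's split between policy definition and policy execution.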
Bedrock Data Lifecycle Management – Policy Definition
The complexities of the connected vehicle
The classic problems associated with Big Data: Volume, Velocity, Variability & Privacy!
© 2016 NetApp, Inc. All rights reserved. --- NETAPP CONFIDENTIAL ---11
The promise of a Connected Car’s Data Lake
How to manage billions of unstructured records
INGEST: Manage data ingestion so you know what is in your Hadoop Data Lake
ORGANIZE: Define and capture metadata for ease of searching and browsing
ENRICH: Orchestrate and manage the data preparation process
ENGAGE: Self-service data preparation
Validated Certified Designs with all Distributions of Hadoop
• MapR
• Cloudera
• Hortonworks
Uses high performance storage
• Resilient, compact footprint
• Protection of data: DDP, RAID 5/6/10
• Less network congestion
Higher capacity and density
• 480TB in 4U
• Expandable to 3.1 PB / Controller
• Fully serviceable storage system
• No architectural limit
Reliability
• 99.9999% reliability (<35 sec / year)
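The "<35 sec / year" figure follows from the percentage if "reliability" is read as availability, which is an assumption about the slide's meaning. A short Python check of that arithmetic, using a hypothetical helper named `downtime_seconds_per_year` and a 365-day year:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected unavailable seconds per year at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    # Six nines leaves one millionth of the year, roughly 31.5 seconds.
    print(round(downtime_seconds_per_year(99.9999), 1))
```

One millionth of 31,536,000 seconds is about 31.5 seconds, which is consistent with the slide's rounded bound of under 35 seconds per year.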
The NetApp Solution for Hadoop
Insight © 2015 NetApp, Inc. All rights reserved. NetApp Confidential – Limited Use Only
Enterprise Grade Hadoop (Consistent performance during all modes of operation)
[Diagram labels: 12 Gb/s SAS; Data Nodes (4:1 ratio); 10 Gb Ethernet network, data-intensive side; 1 or 10 Gb Ethernet network, management; Resource Manager; Name Node(s); NFS Connector for Hadoop]
Hadoop Analytic Platform
- High Performance HDFS
- Heterogeneous File system
- Tiered HOT/WARM/COLD Storage
- Tested Validated Architecture
High Performance Building Block
- High Performance HDFS
- Scale to Thousands of Nodes
- Exabytes of Capacity
Fully connected Building Block
- High Performance NFS optimized
- Augment existing Hadoop Cluster
- Exabytes of Capacity
Fleet maintenance
▪ Large commercial hauling company in US has over 400,000 leased vehicles
▪ Trucks are under warranty
▪ Fleet must operate at maximum efficiency to maintain profits
▪ Truck drivers have predictable behavior
  ▪ They will continue to drive even with warning lights indicating problems with the vehicle; they keep on trucking
▪ Minor problems often escalate to major ones if not addressed early in the failure process
▪ Drivers perceive that the vehicle is under warranty, so as long as it is driving they continue to the final destination, i.e. they complete the delivery before addressing any issue
Proactive maintenance is much more cost effective than reactive
Solution for fleet maintenance
▪ Placed cellular data telemetry devices in all leased vehicles
▪ Collected all telemetry
  ▪ Speeds of vehicle and GPS coordinates
  ▪ All mechanical sensor data
  ▪ Could identify employee
▪ Alerts driver to mechanical issue immediately and schedules proactive maintenance, with an appointment at the next rest stop and a predicted time out of service
▪ Minor problems do not escalate to major failures
▪ Immediate improvement in fleet uptime and reduction in warranty expense and out-of-service situations
▪ Saved over $5M in the first year of operation
Maintenance and vehicle readiness are correlated
Large Strip-mining Operation in Midwest
▪ Vehicles were large Caterpillar earth movers
▪ Maintenance cost in millions (oils, hydraulics, engines, transmissions, etc.)
▪ Vehicles only make money when moving product
▪ Maintenance based on the Hobbs meter (hours of operation) was replaced with telemetry-based maintenance
▪ Minor issues never progressed to major downtime issues
▪ Driver behavior had a direct correlation to vehicle damage and wear (brakes and suspension)
▪ Maintenance cost reduction paid for Hadoop cluster and related software within the first quarter of operation
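The difference between hour-meter maintenance and telemetry-based maintenance can be sketched in a few lines of Python. This is an illustration only, not the mine's actual system: the `Reading` fields, the 250-hour service interval and the sensor limits in `LIMITS` are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    vehicle_id: str
    engine_hours: float      # what an hour meter would show
    oil_pressure_kpa: float  # hypothetical telemetry channel
    coolant_temp_c: float    # hypothetical telemetry channel


SERVICE_INTERVAL_HOURS = 250.0  # hypothetical fixed service interval

# Hypothetical (low, high) limits per channel; None means "no limit".
LIMITS = {
    "oil_pressure_kpa": (200.0, None),
    "coolant_temp_c": (None, 105.0),
}


def hour_meter_due(reading: Reading, last_service_hours: float) -> bool:
    """Schedule-based rule: service only after a fixed number of hours."""
    return reading.engine_hours - last_service_hours >= SERVICE_INTERVAL_HOURS


def telemetry_issues(reading: Reading) -> List[str]:
    """Condition-based rule: flag any channel outside its limits."""
    issues = []
    for field, (low, high) in LIMITS.items():
        value = getattr(reading, field)
        if low is not None and value < low:
            issues.append(f"{field} below {low}")
        if high is not None and value > high:
            issues.append(f"{field} above {high}")
    return issues


if __name__ == "__main__":
    r = Reading("hauler-7", engine_hours=1040.0,
                oil_pressure_kpa=150.0, coolant_temp_c=90.0)
    print("hour meter says service due:", hour_meter_due(r, 1000.0))
    print("telemetry issues:", telemetry_issues(r))
```

In this example the vehicle is only 40 hours past its last service, so the hour-meter rule stays silent, while the condition-based rule flags the low oil pressure right away. That is the mechanism by which minor issues are caught before they become major downtime.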
Telemetry proved benefits beyond the vehicle
Savings extended beyond pure Maintenance
▪ Vehicle load sensors transmitted load in real time to the production plant
▪ Suspension load sensors transmitted road conditions
  ▪ Abnormal angles were detected in real time
  ▪ Potholes and terrain requiring re-grading were detected before causing excessive strain to the suspension of earth movers
  ▪ Prior to telemetry, the mine guessed where to maintain the road and often missed major issues, causing excessive suspension strain and out-of-limit failures costing millions of dollars in downtime and repairs
▪ Driver behavior had a direct correlation to vehicle damage and wear (brakes and suspension)
  ▪ Drivers were better trained in how to brake and accelerate with the vehicles, saving millions in unneeded repairs
▪ The side effect of telemetry produced more than $10M in cost reduction in vehicle and road maintenance, with greater fleet uptime
Route maintenance, driver behaviors and real-time product tracking
DATA LAKE MANAGEMENT AND GOVERNANCE PLATFORM
SELF-SERVICE DATA PREPARATION