Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World San Jose 2016
TRANSCRIPT
• Award-winning provider of enterprise data lake management solutions:
  - Integrated data lake management platform
  - Self-service data preparation
• Data Lake Design and Implementation Services
• Data Science Professional Services
2 Zaloni Proprietary
Delivering on the business of big data
Funded by top-tier technology investors:
Data lakes will be central to the modern data architecture
Agility Insight Scalability
• Store all types of data: structured and unstructured
• Store raw data in its original form for an extended period of time
• Use various tools to correlate, enrich, and query the data for insights
• Provide democratized access via a single unified view across the enterprise
The promise of a data lake: All data is welcome…
Data architecture modernization
Traditional: Sources, ETL, EDW; consumed through Data Discovery, Analytics, BI
New: Various Sources, Streaming, Unstructured Data; Data Lake with Derived (Transformed) data, Discovery Sandbox, EDW; consumed through Data Science, Data Discovery, Analytics, BI
Data lake challenges and complications
• Ingestion
• Lack of Visibility
• Privacy and Compliance
• Quality Issues
• Reliance on IT
• Reusability
• Rate of Change
• Skills Gap
• Complexity
Building: Managing: Delivering:
Engage the business
• Discover • Enrich • Provision
Govern the data in the lake
• Cleanse • Secure • Operationalize
Enable the data lake
• Ingest • Organize • Catalog
Data lake reference architecture
Source System: File Data, DB Data, ETL Extracts, Streaming
Source types: OLTP or ODS, Enterprise Data Warehouse, Logs (or other unstructured data), Cloud Services
Data Lake zones: Transient Loading Zone, Raw Data, Trusted Data, Refined Data, Discovery Sandbox
Zone annotations: Original unaltered data attributes; Tokenized Data; Reference Data; Master Data; Integrate to common format; Data Validation; Data Cleansing; Aggregations; Data Wrangling; Data Discovery; Exploratory Analytics
Across the Data Lake: Metadata, Data Quality, Data Catalog, Security
Consumption Zone: APIs; Business Analysts, Researchers, Data Scientists
Data lake management platform
Unified Data Management
Managed Ingestion
Data Reliability
Data Visibility
Data Security and Privacy
Integrated Data Lake Management
• Ability to ingest vast amounts of data
• Ability to handle a wide variety of formats (streaming, files, custom)
• Ability to handle a wide variety of sources
• Capture operational metadata implicitly as new data arrives
• Build in repeatability through automation to pick up incoming data and apply pre-defined processing
First things first… managed ingestion
Sources: Various Sources, Streaming, Unstructured Data
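The ingestion bullets above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function names (`ingest_file`, `ingest_incoming`), the directory layout, and the metadata fields are assumptions chosen for illustration, and it assumes one record per line. It shows operational metadata being captured as a side effect of ingestion, with the same step applied to every incoming file.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def ingest_file(source: Path, landing_dir: Path, metadata_log: list) -> dict:
    """Copy one file into the landing directory and record operational metadata."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    target = landing_dir / source.name
    shutil.copy2(source, target)

    data = target.read_bytes()
    record = {
        "source": str(source),
        "target": str(target),
        "size_bytes": len(data),
        # Counts newline characters; assumes one newline-terminated record per line.
        "record_count": data.count(b"\n"),
        "sha256": hashlib.sha256(data).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    metadata_log.append(record)
    return record


def ingest_incoming(incoming_dir: Path, landing_dir: Path, metadata_log: list) -> int:
    """Pick up every file waiting in incoming_dir and apply the same ingestion step."""
    count = 0
    for path in sorted(incoming_dir.iterdir()):
        if path.is_file():
            ingest_file(path, landing_dir, metadata_log)
            count += 1
    return count
```

Because the metadata is written inside the ingestion step itself, no file can land in the lake without a corresponding entry in the log.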
• Reduced time to insight for analytics
• File and record level watermarking provides data lineage
Capture metadata to improve data visibility and reliability
Type of Metadata | Description | Example
Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
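One way to picture the three metadata types is as a single catalog entry with three parts. The sketch below is illustrative only: the class and field names are assumptions drawn from the examples in the table, not an actual product schema.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    data_format: str                  # e.g. "text", "JSON", "Avro"
    fields: dict[str, str]            # field name -> field type


@dataclass
class OperationalMetadata:
    source_location: str
    target_location: str
    size_bytes: int
    record_count: int
    lineage: list[str] = field(default_factory=list)   # upstream data set names


@dataclass
class BusinessMetadata:
    business_name: str
    description: str
    tags: list[str] = field(default_factory=list)
    masking_rules: dict[str, str] = field(default_factory=dict)  # field -> rule


@dataclass
class CatalogEntry:
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata


def search_by_tag(catalog: list[CatalogEntry], tag: str) -> list[str]:
    """Return the business names of all entries carrying the given tag."""
    return [e.business.business_name for e in catalog if tag in e.business.tags]
```

Keeping the business part alongside the technical and operational parts is what lets a non-technical user search by tag or business name rather than by file path.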
Diagram derived from Gartner report on Self Service Data Preparation
• Interactive data preparation to address errors, corrupted formats, duplicates
• Data enrichment to go from raw to refined
• Self service to prepare data without IT request/SQL knowledge
Data ready: Data preparation required for actionable data
Diagram labels: Raw Data; Data Preparation (Explore, Transform); Refined Data; Orchestrate and automate workflows; Automation; Reusable Transformations; BI Reports; Enterprise Data Integrations; Data Science; Data Discovery; Analytics
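The idea of reusable transformations that take data from raw to refined can be sketched as small functions composed into a pipeline. This is a minimal illustration under stated assumptions: records are plain dictionaries with hashable values, and the step names (`trim_whitespace`, `normalize_date`, `drop_duplicates`, `run_pipeline`) are hypothetical.

```python
from datetime import datetime


def trim_whitespace(records):
    """Strip leading and trailing whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in records
    ]


def normalize_date(field_name, input_format="%m/%d/%Y"):
    """Build a step that rewrites one date field into ISO 8601 form."""
    def transform(records):
        out = []
        for r in records:
            fixed = dict(r)
            try:
                parsed = datetime.strptime(r[field_name], input_format)
                fixed[field_name] = parsed.date().isoformat()
            except (KeyError, TypeError, ValueError):
                pass  # leave values that are missing or not in input_format
            out.append(fixed)
        return out
    return transform


def drop_duplicates(records):
    """Keep the first occurrence of each identical record (values must be hashable)."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_pipeline(records, steps):
    """Apply each transformation in order and return the refined records."""
    for step in steps:
        records = step(records)
    return records
```

Because each step has the same shape (records in, records out), a list of steps can be saved and rerun on the next batch, which is what makes the transformations reusable.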
• Data lakes enable multiple groups to share access to centrally stored data
• Differing permissions require enhanced data security
  - Mask or tokenize data before it is published in the lake for consumption
  - Policy-based security
• Metadata management enables audit and traceability
• End result: more open and democratized access to data in the lake for those with permission
Protect sensitive data
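Masking and tokenizing under a policy, as described in the bullets above, can be sketched as follows. This is an illustration, not the platform's mechanism: the policy format (field name mapped to an action) and the function names are assumptions. The tokenization here uses a keyed hash (HMAC-SHA256), so it is deterministic but one-way; a reversible, vault-based tokenization scheme would work differently.

```python
import hashlib
import hmac


def tokenize(value: str, secret: bytes) -> str:
    """Replace a value with a deterministic, non-reversible token."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def apply_policy(record: dict, policy: dict, secret: bytes) -> dict:
    """Return a protected copy of the record; fields not in the policy pass through."""
    protected = {}
    for name, value in record.items():
        action = policy.get(name, "allow")
        if action == "tokenize":
            protected[name] = tokenize(str(value), secret)
        elif action == "mask":
            protected[name] = mask(str(value))
        elif action == "drop":
            continue
        else:
            protected[name] = value
    return protected
```

Since the same input always yields the same token, tokenized columns can still be joined across data sets without exposing the original values.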
Discover, Enrich, Provision
Self Service Data Preparation for Analytics: Catalog, Wrangling, Collaboration
• See what data is available across your enterprise
• Blend data in the lake without a costly IT project
• Perform interactive data-driven transformations
• Collaborate and share data assets and transformations with peers
EXPLORE PREPARE OPERATIONALIZE
Catalog with KPIs
• Seeing rapid increase of big data in the cloud
• Leverage cloud platforms as complementary to on-premises
• Support sensitive data on premises and external data in the cloud (e.g. client data, machine-generated)
“Ground to Cloud” hybrid architectures
Key data challenges for hybrid environments:
VISIBILITY: Need enterprise-wide data catalog (logical data lake)
GOVERNANCE: Need consistent data governance requirements for hybrid platforms
INGEST: Manage data ingestion so you know what is in your Hadoop Data Lake
ORGANIZE: Define and capture metadata for ease of searching and browsing
ENRICH: Orchestrate and manage the data preparation process
ENGAGE: Data visibility and self-service data preparation
Manage the complete data pipeline
Network Data Lake architecture
Network Data: CDR, DPI, IPFIX, SNMP, RADIUS
Enterprise Data: CRM, Billing, Inventory
Network Data Lake: Manage Ingestion; Manage Metadata; Manage, Monitor, Schedule; Operations and Metadata Store; Data Quality & Rules Engine; Transformation Engine; Workflow Executor
Outputs: BI Tools, Custom Apps, Data Warehouse, Enterprise Data Warehouse
Custom Applications: Subscriber Usage, Network Usage
Exploration & Ad-hoc Analytics
Managed data lake for healthcare payers
Data Sources: Relational, Streaming, Files
Data Sets: Claims, EMR, Lab/Pathology, Pharmacy, Member, Social, Enterprise Data
Ingestion: Batch Ingestion, Streaming Ingestion, Change Data Capture
Platform: Edge Node, Data Lake
Data Lake Management: Configure Ingestion; Administer Metadata; Manage, Monitor, Schedule; Operations and Metadata Store; Data Quality & Rules Engine; Transformation Engine; Workflow Executor
Consumers: Analytical Applications, Enterprise Data Warehouse
Applications: HEDIS Reporting, Bundle Payments, Medical Benefits Management, Scorecards, Enterprise Reports
Data Lake for BCBS239 Compliance (RDARR)
Source Systems: RDBMS, Mainframes, Flat files, Binary files
Metadata repositories / Metadata Management solution: Register/update metadata; Extract/read metadata
Data Acquisition Automation: Data Ingestion; Data Quality and Validation; Layout Standardization; Operational Metadata Generation
Data at Rest
• Automated Data Acquisition Framework providing timeliness of data
• Capture Metadata in all phases: Ingestion, Transformation
• Integration with Enterprise Metadata Management
• Integrated Data Quality Analysis
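Integrated data quality analysis during acquisition can be pictured as a set of named rules applied to every record, with rejects kept alongside the reasons they failed. The sketch below is hypothetical: the `validate` function, the rule format, and the example fields (`account_id`, `exposure`) are assumptions for illustration, not part of BCBS 239 or of any specific product.

```python
def validate(records, rules):
    """Split records into passed and rejected, and summarize the outcome.

    `rules` is a list of (rule_name, check) pairs, where check(record)
    returns True when the record satisfies the rule.
    """
    passed, rejected = [], []
    for r in records:
        failures = [name for name, check in rules if not check(r)]
        if failures:
            rejected.append({"record": r, "failed_rules": failures})
        else:
            passed.append(r)
    summary = {
        "total": len(records),
        "passed": len(passed),
        "rejected": len(rejected),
    }
    return passed, rejected, summary


# Example rules (illustrative field names).
RULES = [
    ("account_id present", lambda r: bool(r.get("account_id"))),
    (
        "exposure is a non-negative number",
        lambda r: isinstance(r.get("exposure"), (int, float)) and r["exposure"] >= 0,
    ),
]
```

The summary counts are the kind of operational metadata that can be registered with each load, so data quality is traceable per ingestion run rather than assessed after the fact.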
Getting Started
1. Questions: Business drivers and business questions (Where is fraud occurring? How to optimize inventory?)
2. Inputs: Data, Use Cases, Platform (subject areas, source system; capabilities, process; ingest, organize, enrich, explore)
3. Outcomes: Analytics Strategy, Prototype, Roadmap
Stop by booth #1335 and ask for a copy of our new book and a free t-shirt!
DON’T GO IN THE DATA LAKE WITHOUT US