sören eickhoff, informatica gmbh, "informatica intelligent data lake – self service for data...
Informatica Intelligent Data Lake Self Service for Data Analysts
February 2017
Sören Eickhoff, Sales Consultant Central [email protected]
Data Security
Cloud Data Management
Big Data Management
Data Integration
Master Data Management
Data Quality
#1 in 6 Data Categories …
Data Platform
Data Lake
Use Case: Data Lake / Data Platform Reference Architecture
Landing Zone: Structured and unstructured enterprise and external data is landed in its raw form, normalized and ready for use
Data Analyst, Data Scientist, Business, Data Steward, Data Modeler, Data Engineer
Discovery Zone: User sandbox for self-serve access to data for exploration, data blending, hypothesis testing, analytics, and collaboration
Production Zone: Sanitized transactional, master, and reference data & enriched data models certified for enterprise use
Machine Device, Cloud
Documents and Emails
Relational, Mainframe
Social Media, Web Logs
Improve Predictive Maintenance
Increase Operational Efficiency
Increase Customer Loyalty
Reduce Security Risk
Improve Fraud Detection
Challenges Faced by the Business and IT Today
Data Analysts
• Can’t easily find trusted data
• Limited access to the data
• Frustrated by slow response from IT due to long backlog
• Constrained by disparate desktop tools, manual steps
• No way to collaborate, share, and update curated datasets
IT
• Can’t cope with growing demand from the business
• No visibility into what the business is doing with the data
• Struggling to deliver value to the business
• Losing the ability to govern and manage data as an asset
Informatica Data Lake Management
Data Lake Management
Enterprise Information Catalog
Intelligent Data Lake
Secure@Source
TITAN
Blaze
Big Data Management
Intelligent Streaming
Live Data Map (metadata integration)
Big Data Management (data integration)
Data Architect / Steward
Data Scientist / Analyst
InfoSec Analyst
Data Engineer
Enterprise Information Catalog
Unified view into enterprise information assets
• Business-user oriented solution
• Semantic search with dynamic facets
• Detailed Lineage and Impact Analysis
• Business Glossary Integration
• Relationships discovery
• High level data profiling
• Automatic Classifications with Data domains
• Business classifications with Custom Attributes
• Broad metadata source connectivity
• Big data scale
Intelligent Data Lake
Self-service data preparation with collaborative data governance
• Collaborative project workspaces
• Automated data ingestion
• Search data asset catalog
• Rapid blend of datasets
• Crowd-sourced data asset tagging & data sharing
• Automated data asset discovery & recommendations
• Rapid ‘industrialization’ of preparation steps into re-usable workflows
• Complete tracking of usage, lineage, and security
• Easily support Data Discovery Platforms
Secure@Source
Enterprise-wide visibility into sensitive data risks
• Sensitive data classification & discovery
• Sensitive data proliferation analysis
• Who has access to sensitive data
• User activity on sensitive data
• Sensitive data policy-based alerting
• Multi-factor risk scoring
• Identification of highest risk areas
• Integrates data security information from 3rd parties:
- Data stores, owner, classification
- Protection status
- User access info (LDAP, IAM) and activity logs (DB, Hadoop, Salesforce, DAM)
Big Data Management
Easily integrate more data faster from more data sources
[Diagram: Informatica Big Data Management]
Capabilities: Data Connectivity, Data Integration, Data Masking, Data Quality, Data Governance
Smart Executor
ETL/DI Servers: Informatica Data Transformation Engine on dedicated DI servers
Hadoop Cluster (YARN, HDFS, cluster aware): MapReduce, Hive on MapReduce, Hive on Tez, Tez, Spark Core, Spark, Blaze
• Visual development interface accelerates developer productivity
• Near universal data connectivity
• Complex data parsing on Hadoop
• Data profiling on Hadoop
• High-speed data ingestion and extraction
• Process and deliver data at scale on Hadoop
• Dynamic schemas and mapping templates
• Data Quality and Data Governance on Hadoop
Take Big Data Management to the Next Level
Improving developer productivity: Dynamic Mappings
Re-use PowerCenter & SQL Logic
Automatically profit from new technologies and choose best option: Smart Optimizer
MapReduce, Spark, Blaze
Generic source, rule-based logic, generic target
Informatica Intelligent Streaming
• Streaming analytics capability into the Intelligent Data Platform
• Unified UI with multiple engines under the covers
• Frictionless conversion/extension of batch mappings into a streaming context
• Abstracted from runtime framework
Collect, ingest and process data in real time and streaming
Real-time source
Real-time target
Window transformation
Spark Streaming code generated
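To make the window transformation between the real-time source and target concrete, here is a minimal sketch in plain Python. It is not the Spark Streaming code that Informatica generates; the event format `(timestamp, key, value)`, the function name, and the choice of a tumbling (fixed, non-overlapping) window are all assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_sum(events, window_seconds):
    """Sum values per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key, value) tuples.
    The result maps (window_start, key) to the summed value.
    """
    totals = defaultdict(float)
    for ts, key, value in events:
        # Every event belongs to exactly one window, identified by its start time.
        window_start = (ts // window_seconds) * window_seconds
        totals[(window_start, key)] += value
    return dict(totals)


# Hypothetical sensor readings arriving from a real-time source.
events = [
    (0, "sensor-1", 2.0),
    (4, "sensor-1", 3.0),
    (9, "sensor-2", 1.0),
    (10, "sensor-1", 5.0),
    (14, "sensor-2", 4.0),
]
result = tumbling_window_sum(events, window_seconds=10)
```

With a 10-second window, the first three events fall into the window starting at 0 and the last two into the window starting at 10. A streaming engine applies the same grouping incrementally as events arrive.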
Intelligent Data Lake – Deep Dive
Intelligent Data Lake
Who? Data Analyst / Scientist
How? Search & Discover, Prepare & Publish, Share and Collaborate
Intelligent Data Lake
Applications & Databases, Internet of Things, 3rd Party Data, Data Modeling Tools, BI Tools, Custom, Cloud
Data Access & Metadata Connectivity
Intelligent Metadata Foundation: Catalog, Classify, Index, Data Lineage, Data Relationships, Smart Domains, Data Profile
Data Discovery & Analysis Process (Data Analyst / Scientist): Recommend, Discover, Collaborate, Publish, Operationalize/Monitor, Prepare
Intelligent Data Lake: Terminology
Data Asset - Data you work with as a unit.
Project - A project contains data assets and worksheets.
Recipe - The steps taken to prepare data in a worksheet.
Data Preparation - The process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis.
Data Publication - The process of making prepared data available in the data lake.
Search and Discovery
Data discovery through a powerful search engine to find relevant data
Semantic search
Facet filtering by asset, resource type, latest, size, custom attributes…
Data Asset Overview
Overview with asset attributes and integrated profiling stats
Asset attributes collected from the source system
Asset attributes enriched by users to add business context
Column profiling stats including null/unique/duplicate percentages, inferred data types, and data domains
Detailed stats include value and pattern distributions
Add a data asset to a project from any exploration view
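The column profiling stats listed above can be illustrated with a small sketch. This is not the catalog's profiling engine; the type-inference rules, the pattern notation (`9` for digits, `A` for letters), and all function names are assumptions chosen only to show what such statistics contain.

```python
from collections import Counter


def infer_type(values):
    """Guess a type from non-null string values (hypothetical, simplified rules)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    return "string"


def pattern(value):
    """Map digits to 9 and letters to A, keeping other characters as they are."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)


def profile_column(values):
    """Compute null/unique/duplicate percentages, inferred type, and distributions."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)

    def pct(n):
        return round(100.0 * n / total, 1) if total else 0.0

    return {
        "null_pct": pct(total - len(non_null)),
        # Rows whose value occurs exactly once in the column.
        "unique_pct": pct(sum(1 for c in counts.values() if c == 1)),
        # Rows whose value occurs more than once in the column.
        "duplicate_pct": pct(sum(c for c in counts.values() if c > 1)),
        "inferred_type": infer_type(non_null),
        "value_distribution": dict(counts),
        "pattern_distribution": dict(Counter(pattern(v) for v in non_null)),
    }


# Hypothetical sample column with nulls and repeated values.
sample = ["10", "20", "20", None, "35", "", "20", "7"]
stats = profile_column(sample)
```

Here the null, unique, and duplicate percentages are all computed against the total row count, so they add up to 100. A real profiler may define these ratios differently.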
Business Glossary Integration
View Business Glossary Assets like Terms, Policies and Categories in the Catalog
View and navigate to related technical and business assets in the catalog
Data Lineage
Interactively trace data origin through summarized lineage views for analysts
Use Lineage and Impact Sliders to drill down to desired lineage levels on either side of the seed object.
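The effect of the two sliders can be sketched as a depth-limited graph walk around the seed object: lineage follows edges upstream, impact follows them downstream. The asset names, the edge table, and the function names below are hypothetical, and the sketch makes no claim about how the catalog stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage edges: source asset -> assets derived from it.
EDGES = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["lake.customer_orders"],
    "staging.orders": ["lake.customer_orders"],
    "lake.customer_orders": ["report.churn", "report.revenue"],
}


def invert(edges):
    """Reverse the edge direction so the graph can be walked upstream."""
    reverse = {}
    for source, targets in edges.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse


def walk(graph, seed, max_levels):
    """Breadth-first walk from seed; returns {asset: level} up to max_levels hops."""
    levels = {}
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_levels:
            continue  # slider limit reached, do not expand further
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                levels[neighbour] = depth + 1
                queue.append((neighbour, depth + 1))
    return levels


def lineage_view(seed, lineage_levels, impact_levels):
    """Combine upstream (lineage) and downstream (impact) views of one seed asset."""
    return {
        "lineage": walk(invert(EDGES), seed, lineage_levels),
        "impact": walk(EDGES, seed, impact_levels),
    }


view = lineage_view("lake.customer_orders", lineage_levels=2, impact_levels=1)
```

Raising `lineage_levels` exposes earlier sources, and raising `impact_levels` exposes further downstream consumers, which mirrors drilling down on either side of the seed object.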
Relationship View
Shows the ecosystem of the asset in the enterprise based on its associations with other assets
Get a 360-degree view of a data asset using the relationship view, including related tables, views, domains, reports, users, etc.
Ability to zoom, find specific assets in the view, and filter by asset type
Expand relationship circles to get more details on relationship types and objects.
Data Preparation continued…
Excel-based data preparation on sample data
New formula definition with type-ahead
Large number of functions available for all types of data: string, numeric, date, statistical, math, etc.
Advanced functionality such as Join, Merge, Aggregate, Filter, Sort etc.
New values are calculated and shown right away
Data Preparation continued…
Excel-based data preparation on sample data
Column level summary
Column value distributions
Column level Suggestions
Data preparation steps captured as “Recipe”
Data Publication
Execution of data preparation steps on actual data using an Informatica mapping
Publish the output of data preparation steps back to the lake
Recipe steps are translated into an Informatica mapping
The Informatica mapping is handed over to the BDM platform for execution on the actual data sources
The BDM platform uses either MapReduce, Blaze, or Spark to execute the mapping
The mapping is available to ETL specialists, who can open it in the Informatica Developer tool to operationalize it
The user's credentials are used to access the underlying database.
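The idea of capturing preparation steps on a sample and then running the same steps on the actual data can be sketched as follows. This is a simplified illustration only: in the product the recipe is translated into an Informatica mapping and executed by BDM, whereas here the steps are plain Python functions replayed in order. The `Recipe` class, the step helpers, and the sample rows are all hypothetical.

```python
def filter_rows(predicate):
    """Build a step that keeps only the rows matching the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def add_column(name, formula):
    """Build a step that adds a derived column without mutating the input rows."""
    return lambda rows: [{**r, name: formula(r)} for r in rows]


class Recipe:
    """Ordered list of preparation steps that can be replayed on any dataset."""

    def __init__(self):
        self.steps = []  # (description, function) pairs, in recording order

    def record(self, description, step):
        self.steps.append((description, step))
        return self

    def run(self, rows):
        for _, step in self.steps:
            rows = step(rows)
        return rows


# Hypothetical data asset; the analyst only sees a sample while preparing.
full_data = [
    {"customer": "A", "amount": 120, "country": "DE"},
    {"customer": "B", "amount": 40, "country": "FR"},
    {"customer": "C", "amount": 300, "country": "DE"},
    {"customer": "D", "amount": 75, "country": "DE"},
]
sample = full_data[:2]

recipe = (
    Recipe()
    .record("keep German customers", filter_rows(lambda r: r["country"] == "DE"))
    .record("flag high-value rows", add_column("high_value", lambda r: r["amount"] >= 100))
)

preview = recipe.run(sample)       # immediate feedback on the sample
published = recipe.run(full_data)  # the same steps replayed on the actual data
```

Keeping the recipe as data (an ordered list of described steps) rather than as one-off edits is what makes it possible to hand the preparation logic to another engine or another person later.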
Organizations need ONE solution that helps them…
Easily Find & Catalog Data & Discover Relationships
Rapidly Prepare & Share Data Exactly When it is Needed
Get Instant Access to Trusted & Secure Data for Advanced Analytics
Ingest, Cleanse, Integrate & Protect Data at Scale
Forrester Wave™: Big Data Fabric, Q4 ’16
Questions?