Organising The Data Lake
- Information Management In A Big Data World
Mike Ferguson
Managing Director
Intelligent Business Strategies
Hadoop Summit
Dublin, April 2016
Copyright © Intelligent Business Strategies 1992-2016
About Mike Ferguson
Mike Ferguson is Managing Director of
Intelligent Business Strategies Limited. As an
analyst and consultant he specialises in
business intelligence, data management and
enterprise business integration. With over 34
years of IT experience, Mike has consulted for
dozens of companies, spoken at events all over
the world and written numerous articles.
Formerly he was a principal and co-founder of
Codd and Date Europe Limited – the inventors
of the Relational Model, a Chief Architect at
Teradata on the Teradata DBMS and European
Managing Director of DataBase Associates.
www.intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Topics
The data integration complexity
The siloed approach to managing and governing data
A new inclusive approach to governing and managing data
Introducing the data reservoir and data refinery
How does a data reservoir and data refinery work?
Mapping new data and insights into your shared business vocabulary
The mission critical importance of an information catalog in a distributed data
landscape
Integrating data reservoirs and data refineries into your existing environment
The Changing Landscape – We Now Have Different Platforms Optimised For
Different Analytical Workloads
[Diagram: platforms now used for analytical workloads. Data stores shown: streaming data; Hadoop data store; EDW with DW and marts on a data warehouse RDBMS; DW appliance on an analytical RDBMS (advanced analytics on structured data); advanced analytic mart (multi-structured data); NoSQL graph DB; MDM (Prod, Asset, Cust master data with CRUD access). Workloads shown: traditional query, reporting and analysis; real-time stream processing and decision management; data mining and model development; investigative analysis and data refinery; graph analysis.]
Big Data workloads result in multiple platforms now being needed for analytical processing
Data Integration Today Has Become Much More Complex
- Popular Data Integration Paths Between Platforms
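[Diagram: data integration paths between platforms. Sources shown: ERP, CRM, SCM and operational systems (transaction data); XML/JSON, social, web logs and web data; cloud data may also be part of it. Platforms shown: EDW and data marts; DW appliance (analytical DBMS); MDM system (Prod, Asset, Cust master data with CRUD access); graph DBMS; NoSQL DBs (column family DB, document DB). Flows are labelled as transactions (Txns) and insights.]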
Issues: Siloed Analytics - Different Tools To Manage And Integrate Data For
Each Type Of Analytical And MDM Store
[Diagram: five silos, each with its own analytical tools or applications and its own data management tools: (1) EDW, DW and marts holding structured data from CRM, ERP and SCM; (2) a store of multi-structured data; (3) a DW appliance for advanced analytics on structured data from CRM, ERP and SCM; (4) a NoSQL DB, e.g. a graph DB, holding multi-structured and structured data; (5) MDM (Prod, Asset, Cust master data with CRUD access) for master data management, serving applications and fed from CRM, ERP and SCM.]
Issues: Data Deluge - Data Is Arriving Faster Than We Can Consume It
[Diagram: a data filter standing between incoming data and enterprise systems.]
With 000’s Of Data Sources, IT And Business Need To Work Together As IT
Will Likely Become A Bottleneck
[Diagram: IT builds every DQ/DI job between the data sources (OLTP systems, web logs, web, open data, IoT machine data, social and web, cloud, Big Data) and the targets (MDM with prod, cust and asset master data; data warehousing; data virtualisation).]
Bottleneck? Should IT be expected to do everything?
Can business analysts and data scientists help?
Issues: Have You Got Self-Service Data Integration Causing Chaos In The
Enterprise?
[Diagram: self-service data integration running alongside ETL/DQ built by IT. Sources shown: social, web logs, web, cloud, SCM, CRM, ERP. Data scientists each work in their own sandbox on HDFS, with their own ETL/DQ, producing new insights that are reached through SQL on Hadoop. IT-built ETL/DQ jobs feed the DW and marts. Self-service BI tools with ETL add further ETL/DQ on top of the DW and the marts.]
Problems With The Current Approach
Project oriented siloed approach to DI/DQ with limited collaboration
Cost of data integration is too high
Slow speed of development
Multiple DI/DQ technologies and techniques being used that are not integrated
Lots of re-invention rather than re-use
Fractured metadata across multiple tools or no metadata at all in some cases
Risk of duplicate inconsistent DI/DQ rules for same data
Metadata lineage is unavailable in many places especially with hand-coded Big Data DI/DQ applications
Multiple skill sets fractured across different projects
Repetition of our mistakes, e.g. Big Data preparation
[Diagram: many separate DQ/DI jobs around the EDW, MDM (Prod, Asset, Cust master data with CRUD access), cloud, data virtualisation and self-service tools.]
There has to be a better, more governed
way to fuel productivity and agility without
causing data inconsistency and chaos
[Diagram: many separate DQ/DI jobs around the EDW, MDM (Prod, Asset, Cust master data with CRUD access), cloud, data virtualisation and self-service tools.]
Tools are available but are not well integrated
Also the whole collaborative, metadata and information catalog piece is incomplete
IT IS NOT ENOUGH – THE WHOLE THING HAS TO BE CO-ORDINATED
We Are All In The Same Boat!
– Everyone For Themselves Is Not An Option
IT Data Architect, Data Scientist, IT Developer, Business Analyst
Information Management
– Introducing The Data Lake (Reservoir)
What Is A Data Reservoir? - A Collaborative, Governed Environment Aimed At
Rapidly Producing Information
[Diagram: linked communities of IT data architects, data scientists, domain experts, business analysts and IT developers.]
Need to work together for competitive advantage
Chaos Is NOT An Option – Business Alignment Of Information Being Produced
Is Critical To Success
[Diagram: Big Data projects, a DW project, an MDM project and other projects, all aligned to strategic objectives that come from the business strategy.]
• What problem are you
trying to solve?
• What data do you need?
• What kind(s) of analytic
workload are needed?
We need co-ordinated
“info producer” projects in
a managed environment
Key Capabilities In A Managed Data Reservoir - 1
Data collection
• Automated discovery of the structure and formatting
• Data structure inferred by machine learning
• Automated cataloging, infinite storage and processing
Data classification
• Determines how data should be governed
• Support is needed for different types of classification schemes, e.g.
– Retention: Unclassified, Temporary, Project Lifetime, Managed period, Permanent
– Confidential: Unclassified, Internal use, Business confidential, Supplier confidential, Sensitive (PII), Sensitive (Financial), Sensitive (Operations), Restricted (Trade secret)
– Confidence: Unclassified, Raw (original), Obsolete, Archived, Trusted
– Business Value: Unclassified, Unimportant, Marginal, Important, Critical, Catastrophic
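The classification schemes above determine how data should be governed. A minimal Python sketch of that idea follows. The enum values come from the slide; the `Classification` record, the `governance_actions` function and the three rules inside it are illustrative assumptions, not part of any product. The Business Value scheme is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    UNCLASSIFIED = "Unclassified"
    TEMPORARY = "Temporary"
    PROJECT_LIFETIME = "Project Lifetime"
    MANAGED_PERIOD = "Managed period"
    PERMANENT = "Permanent"


class Confidentiality(Enum):
    UNCLASSIFIED = "Unclassified"
    INTERNAL_USE = "Internal use"
    BUSINESS_CONFIDENTIAL = "Business confidential"
    SUPPLIER_CONFIDENTIAL = "Supplier confidential"
    SENSITIVE_PII = "Sensitive (PII)"
    SENSITIVE_FINANCIAL = "Sensitive (Financial)"
    SENSITIVE_OPERATIONS = "Sensitive (Operations)"
    RESTRICTED_TRADE_SECRET = "Restricted (Trade secret)"


class Confidence(Enum):
    UNCLASSIFIED = "Unclassified"
    RAW = "Raw (original)"
    OBSOLETE = "Obsolete"
    ARCHIVED = "Archived"
    TRUSTED = "Trusted"


@dataclass(frozen=True)
class Classification:
    retention: Retention = Retention.UNCLASSIFIED
    confidentiality: Confidentiality = Confidentiality.UNCLASSIFIED
    confidence: Confidence = Confidence.UNCLASSIFIED


def governance_actions(c):
    """Derive handling actions from a classification (hypothetical rules)."""
    actions = []
    if c.confidentiality in (Confidentiality.SENSITIVE_PII,
                             Confidentiality.SENSITIVE_FINANCIAL):
        actions.append("mask")
    if c.retention is Retention.TEMPORARY:
        actions.append("purge-after-use")
    if c.confidence is not Confidence.TRUSTED:
        actions.append("block-publish")
    return actions
```

For example, temporary raw data classified Sensitive (PII) would be masked, purged after use and blocked from publication under these assumed rules, while trusted, otherwise unclassified data would attract no actions.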
Key Capabilities In A Managed Data Reservoir - 2
Collaborative data governance
• Data quality
• Data trustworthiness (confidence)
• Data protection
– Data privacy, access authorisation, lifecycle management
• Compliance
Data refinery
• Systematically clean and refine data through various stages
• Manual and guided data preparation
• “Sandbox” analyse data to produce high value insights
Data as a Service (DaaS)
• Published high value insights available for consumption
• Search for and discover trusted insights, subscribe to receive them
Data consumption
• Provision refined, trusted commonly understood data into any tool or application
A Data Reservoir Is An Organised Collection Of Raw, In-Progress And Trusted
Data (Multiple Data Stores)
[Diagram: the Data Reservoir (not a data store but a collection of stores), with data virtualisation services on top and the Info Catalog alongside. Governance bands: raw untrusted data (some governance), in-progress data, and refined, trusted and integrated data (strong governance). Stores shown: DW, data marts, MDM (Prod, Asset, Cust; CRUD), RDM (code sets; CRUD), ODSs, ECM, staging areas, cloud object storage, archived DW data, Hive tables, text/image/video, filtered sensor data, published trusted data, search indexes. Sources shown: IoT feeds, XML/JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services, NoSQL.]
Data sources and ingested reservoir
data are all known to the catalog
Raw Data Is Being Collected In Multiple Places Across The Enterprise – We
Need To Know What’s Happening!
[Diagram: multiple collection points fed by replication, streaming, batch load and archive.]
We need to avoid unconnected silos
But we HAVE TO know what is being collected and
filtered and where that is happening
Also who is doing it, for what business purpose?
If Multiple Collection Points Exist Then Something Has To Catalog What Data Is
Available, Its Status And Where It Is
All data entering a
reservoir needs to
be catalogued and
organised
You need to know what data is available across the enterprise, where it
came from, what state it is in, whether to trust it and whether you can order it
Information Catalogue
A Distributed Data Reservoir Requires Information Management Software To
Work Across Multiple Data Stores
Enterprise Information Management (Catalog, DQ, ETL, Security, Privacy…)
The Data Reservoir is distributed but it should be managed
and function as if it were centralised
Key requirements
Define once, execute anywhere
Centralised metadata
Distributed execution of policies associated with data quality, ETL, security and lifecycle
management across the landscape (multiple execution engines)
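The "define once, execute anywhere" requirement can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory `Catalog` that holds policies centrally and hypothetical `Engine` objects that each sit next to one data store; none of these names refer to a real product API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Policy:
    name: str
    rule: Callable[[dict], bool]  # returns True when a record complies


@dataclass
class Engine:
    """Hypothetical execution engine located next to one data store."""
    name: str
    records: List[dict]

    def run(self, policy: Policy) -> int:
        # Apply the centrally defined rule locally; return the violation count.
        return sum(1 for r in self.records if not policy.rule(r))


@dataclass
class Catalog:
    """Centralised metadata: each policy is defined exactly once, here."""
    policies: Dict[str, Policy] = field(default_factory=dict)

    def define(self, policy: Policy) -> None:
        self.policies[policy.name] = policy

    def execute_everywhere(self, name: str, engines: List[Engine]) -> Dict[str, int]:
        policy = self.policies[name]
        return {e.name: e.run(policy) for e in engines}


catalog = Catalog()
catalog.define(Policy("email-present", lambda r: bool(r.get("email"))))
engines = [
    Engine("hadoop", [{"email": "a@example.com"}, {"email": ""}]),
    Engine("dw", [{"email": "b@example.com"}]),
]
violations = catalog.execute_everywhere("email-present", engines)
```

The data never leaves its store in this sketch; only the policy definition travels to each engine, which is the point of centralised metadata with distributed execution.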
A Distributed Data Reservoir Requires Management And Governance As If It
Was Centralised
[Diagram: multiple collection points fed by replication, streaming, batch load and archive.]
The data in the reservoir is distributed but the reservoir
is managed and operated as if it were centralised
Information Production Is A Process That Involves Refining And Integrating Data
[Diagram: raw data held in reservoir storage is refined into in-progress data and then trusted data, ending as high value information and/or insights available for consumption.]
Collaboration is needed to perform many tasks in producing information,
e.g. selecting and transforming data
The Information Production Process Works Across Zones In The Reservoir –
Zones Created By Tagging Files
[Diagram: Data Reservoir zones under data reservoir management, with the Info Catalog at the centre. Data sources shown: IoT, RDBMS, office docs, social, cloud, clickstream, web logs, XML/JSON, web services, NoSQL, files, DW data, streams. Zones shown: Data Ingestion zone (transient data); Raw Data Zone; Refinery zone (prepare and analyse data) with in-progress data, sandboxes, ETL, data preparation, DQ and exploratory analysis; Trusted Data Zone with master and reference data and the DW archive; Refined data and Insights zone with a data marketplace.]
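The slide title says zones are created by tagging files rather than by moving them. A minimal sketch of that idea, assuming a hypothetical `ZoneTags` registry (a real catalog would persist the tags and audit every change); the zone names are shortened from the slide and the file paths are made up:

```python
ZONES = ("ingestion", "raw", "refinery", "trusted", "refined")


class ZoneTags:
    """Hypothetical tag registry: a file's zone is a tag, not a location."""

    def __init__(self):
        self._zone_of = {}

    def tag(self, path, zone):
        if zone not in ZONES:
            raise ValueError(f"unknown zone: {zone}")
        # Retagging changes the zone without moving the file.
        self._zone_of[path] = zone

    def files_in(self, zone):
        return sorted(p for p, z in self._zone_of.items() if z == zone)


tags = ZoneTags()
tags.tag("/landing/clicks/2016-04-13.json", "raw")
tags.tag("/landing/crm/customers.csv", "raw")
tags.tag("/landing/crm/customers.csv", "trusted")  # promoted after cleansing
```

Promotion between zones is then a metadata update, which is why the catalog has to be the authority on which zone a file is in.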
Organising Data In A Reservoir – The Catalog Knows About Data Sources Plus
Data In All Zones And Sandboxes
[Diagram: Data Reservoir zones (Data Ingestion zone for transient data, Raw Data Zone, Refinery zone with in-progress data and sandboxes, Trusted Data Zone with master and reference data and the DW archive, Refined data and Insights zone with a data marketplace) and their data sources (IoT, RDBMS, office docs, social, cloud, clickstream, web logs, XML/JSON, web services, NoSQL, files, DW data, streams), all registered in the Info Catalog.]
Operating A Data Reservoir – The Information Production Process Is A
Production Line That Spans Reservoir Zones
[Diagram: Data Reservoir zones (Data Ingestion zone for transient data, Raw Data Zone, Refinery zone with in-progress data and exploratory analysis, Trusted Data Zone with master and reference data and the DW archive, Refined data and Insights zone with a data marketplace) and their data sources, with the Info Catalog at the centre and the following steps marked along the production line.]
• Nominate new data
• Classify sensitivity, quality, retention
• Tag data (what’s it mean?)
• Assign governance policies based on classification
• Collaborate about processing
• Track data freshness
• Rate its value (★★★★)
• Analyse, consume
• Map to shared business vocabulary
Reservoir operations are controlled via the catalog and workflow processes
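One way to read "reservoir operations are controlled via the catalog and workflow processes" is that a catalog entry gates each step of the production line. The sketch below assumes a hypothetical `CatalogEntry` and a hypothetical fixed ordering of five of the steps named on the slide; the slide itself does not prescribe this order.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Steps named on the slide, in an assumed (hypothetical) required order.
REQUIRED_STEPS = ["nominate", "classify", "tag", "assign_policies", "map_to_vocabulary"]


@dataclass
class CatalogEntry:
    name: str
    completed: List[str] = field(default_factory=list)
    rating: Optional[int] = None  # 1-5 stars, set by collaborators

    def complete(self, step: str) -> None:
        # Only the next expected step may be recorded.
        done = len(self.completed)
        expected = REQUIRED_STEPS[done] if done < len(REQUIRED_STEPS) else None
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.completed.append(step)

    def ready_to_consume(self) -> bool:
        # Data is offered for analysis only after every step has been done.
        return self.completed == REQUIRED_STEPS


entry = CatalogEntry("web_logs")
for step in REQUIRED_STEPS:
    entry.complete(step)
entry.rating = 4
```

A workflow engine would call `complete` as each task finishes, so the catalog always shows how far a data set has travelled along the line.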
Operating A Data Reservoir – Workflows Are Everywhere And Are Components
Of An Information Production Process
[Diagram: Data Reservoir zones (Data Ingestion zone for transient data, Raw Data Zone, Refinery zone with in-progress data and sandboxes, Trusted Data Zone with master and reference data and the DW archive, Refined data and Insights zone with a data marketplace), their data sources and the Info Catalog. Workflow types shown across the zones: ingest, stream, movement, refinery, analytical, governance (Gov), publish and provision.]
Trends – Data And Analytical Workflow (Pipeline) Products Requiring No
Programming Are Emerging Everywhere
Talend
Alteryx
Microsoft Azure Data Factory
Hortonworks Dataflow (NiFi)
Dell Statistica
Who is using what tools?
Any reinvention?
Operating A Data Reservoir – All Workflows Should Be Approved And
Registered In The Information Catalog
[Diagram: Data Reservoir zones (Data Ingestion zone for transient data, Raw Data Zone, Refinery zone with in-progress data and sandboxes, Trusted Data Zone with master and reference data and the DW archive, Refined data and Insights zone with a data marketplace), their data sources and the Info Catalog. Workflow types shown: ingest, stream, movement, refinery, analytical, governance (Gov), publish and provision. Three virtual views are shown over the reservoir.]
Convert SSDI workflows to data
virtualisation views to minimise re-
invention and enforce governance
Data Strategy Requirements – We Need To Enable Information Producers And
Information Consumers
Need to make use of
• A business glossary and information catalog
• Re-usable services to manage and process data
• Collaboration and social computing to manage, process and rate data
• Role-based data management tools aimed at IT AND business
[Diagram: information producers (data scientists, IT professionals) use clean and integrate services to turn raw data into trusted data, which is published to the information catalog. Information consumers (business analysts) search, find, shop, order and consume that data in a BI tool or application, like a “corporate iTunes” for data.]
A ‘Production Line’ Publish And Subscribe Approach Is Used To Accelerate
Information And Insight Production
[Diagram: a chain of publish and subscribe stages.
• Data sources are crawled, discovered and profiled; the discovered data is published to the info catalog.
• Data is acquired, prepared (clean, transform, filter) and integrated, then published to the info catalog as trusted data as a service and as trusted, integrated data as a service.
• Subscribers analyse that data (e.g. score) and publish new predictive analytic pipelines (as a service) to an analytics catalog.
• Subscribers to those pipelines publish new prescriptive analytic pipelines and new analytic applications to a solutions catalog.
• Consumers subscribe in order to visualise, decide, act, or use the results in other ways, e.g. embedded in analytic applications.]
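The publish and subscribe production line can be sketched with a tiny in-memory catalog. Everything here is an illustrative assumption: the `InfoCatalog` class, the topic names and the toy scoring rule (length of the customer name) are made up to show the chaining of stages, not to represent a real catalog product.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InfoCatalog:
    """Hypothetical in-memory catalog with publish/subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str, dict], None]]] = defaultdict(list)
        self.published: Dict[str, dict] = {}

    def subscribe(self, topic: str, handler: Callable[[str, dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, dataset: dict) -> None:
        # Register the data set, then notify every subscriber of the topic.
        self.published[topic] = dataset
        for handler in self._subscribers[topic]:
            handler(topic, dataset)


catalog = InfoCatalog()
scores = []


def score_customers(topic, dataset):
    # Analytic stage: consumes trusted data and publishes a new insight.
    result = {"rows": [dict(r, score=len(r["name"])) for r in dataset["rows"]]}
    catalog.publish("insights.customer_score", result)


catalog.subscribe("trusted.customers", score_customers)
catalog.subscribe("insights.customer_score", lambda t, d: scores.extend(d["rows"]))
catalog.publish("trusted.customers", {"rows": [{"name": "Ann"}, {"name": "Bob"}]})
```

Each stage only knows the topic it subscribes to and the topic it publishes, so new analytic stages can be added without changing the stages upstream.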
Cataloging, Automated Discovery And Collaboration Are All Needed When Data
Is Ingested
[Diagram: Data Reservoir zones (Data Ingestion zone for transient data, Raw Data Zone, Refinery zone with in-progress data and exploratory analysis, Trusted Data Zone with master and reference data and the DW archive, Refined data and Insights zone with a data marketplace) and their data sources, with the Info Catalog at the centre, annotated as follows.]
• Automated relationship discovery, data profiling, and document clustering
• Catalog, tag and describe data/files (what’s it about?)
• Collaborative appraisal
• Analyse, consume
Descriptive metadata is critical to keeping things organised
Governance In A Data Reservoir Is Controlled By Classification And Metadata In
The Information Catalog
Classifications drive the governance
[Diagram (source: IBM): governance metamodel. Entities shown: Policy, Governance Rule, Information Governance Rule, Classification (e.g. Sensitive), Business Attribute, Physical Data Description, Governance Action, Engine, IT Landscape, Process, Metrics, Operational Log, and Data store/Document/File/API. Relationship labels shown: governs, implemented by, classified by, actioned by, mapped to, assessed by, deployed to, describes, accesses, measures, feeds, logs activity.]
IBM Are Creating ‘Governance Aware’ Runtimes To Verify And Enforce Policies
In A Data Reservoir
Source: IBM
They access the information
catalog to determine what to
do at run time
We Need A Data Refinery To Process, Clean And Analyse Data To Produce
Consumable High Value Insight
[Diagram: candidate places to refine data, in the cloud or on-premises: DW, analytical RDBMS, ETL server, data virtualisation server.]
A data refinery should be able to choose where to best refine data to produce the information needed
A Key Requirement In A Distributed Data Reservoir Is Centralised Development,
Distributed Execution
[Diagram: an EIM tool suite (profiling, cleansing, ELT) with an IT user interface and a self-service UI sits above the Data Reservoir (not a data store but a collection of stores), with an execution engine beside each store. Data virtualisation services and the Info Catalog span the reservoir. Governance bands: raw untrusted data (some governance), in-progress data, and refined, trusted and integrated data (strong governance). Stores shown: DW, DW staging area, data marts, MDM (Prod, Asset, Cust; CRUD), RDM (code sets; CRUD), ODSs, ECM, staging areas, cloud object storage, archived DW data, Hive tables, text/image/video, filtered sensor data, published trusted data, search indexes. Sources shown: IoT feeds, XML/JSON, RDBMS, files, office docs, social, cloud, clickstream, web logs, web services, NoSQL.]
If A Data Reservoir Is Distributed With Data Too Big To Move Then Processing
Needs To Go To The Data
[Diagram: tasks are sent to execution engines located at on-premises storage, the DW staging area and cloud storage.]
Not centralised,
Not distributed
But Federated
Options For Refining Data
IT developed ETL processing using EIM tool suites
Self-service data integration
Multi-role EIM tool suites
• Can be used by both IT AND business users
Data virtualisation server
A combination of the above
Scaling ETL Transformations For In-Hadoop ELT Processing
Data Cleansing and Integration Tool
Extract, Parse, Clean, Transform, Analyse, Load, Insights
Option 1
ETL tool generates HQL or
converts generated SQL to HQL
Option 2
ETL tool generates Pig
(compiler converts every
transform to a MapReduce
job) or JAQL
Option 3
ETL tool generates 3GL MR
or Spark code
Option 4 – Other
Native massively parallel transformation and
integration bypassing any Hadoop execution
engine
E.g. Talend, IBM BigIntegrate, Informatica
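Option 1 (the tool generates HQL) can be illustrated with a small generator that renders a transformation spec as a HiveQL `INSERT OVERWRITE TABLE ... SELECT` statement. This is a sketch under stated assumptions: `generate_hql`, the table names and the column spec are hypothetical, there is no identifier quoting or validation, and it does not reflect how any named vendor's tool generates code.

```python
def generate_hql(source, target, columns, where=None):
    """Render a transform spec as one HiveQL INSERT ... SELECT statement.

    columns maps each output column name to a HiveQL expression.
    """
    select_list = ",\n  ".join(f"{expr} AS {name}" for name, expr in columns.items())
    hql = f"INSERT OVERWRITE TABLE {target}\nSELECT\n  {select_list}\nFROM {source}"
    if where:
        hql += f"\nWHERE {where}"
    return hql


hql = generate_hql(
    source="raw_customers",
    target="clean_customers",
    columns={"customer_id": "id", "email": "lower(trim(email))"},
    where="email IS NOT NULL",
)
print(hql)
```

Because the transformation is pushed down as a query, the cleansing runs inside Hadoop next to the data (ELT) rather than on a separate ETL server.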
Self-Service Data Integration Tool Vendors
Actian Dataflow
Alteryx
ClearStory Data
Datameer
IBM DataWorks
Informatica Rev
Paxata
SAS Data Loader for Hadoop
Tamr
Trifacta
[Diagram: the same pipeline drawn twice: Acquire, Data Preparation (clean, transform, filter), Data Integration, Analyse (e.g. score), Visualise, Decide, Act, Embed. Some tools cover data preparation, integration, analysis and visualisation; others cover data preparation and integration only.]
Some Data Management Vendors Are Trying To Cover All Roles And Integrate
With Other Vendors, e.g. Informatica
[Diagram: the Informatica EIM tool suite, used by IT data architects, data scientists and business analysts, sharing metadata with Informatica Rev (self-service) and Cloud DI. Components shown: Catalog and Live Data Map; Analyst tool; Data Governance/Management Console; and services for data and metadata relationship discovery, data quality profiling and monitoring, data modeling, data cleansing and matching, data integration, business glossary / info catalog, data privacy and lifecycle management, and data audit and protection.]
Some Vendors Are Opening Up Their Service Oriented Data Management
Platforms To IT AND Business Users
Role-based UIs to the same data management platform
[Diagram: business users (self-service), IT developers, IT data architects and applications reach one data management platform whose information services are exposed over an enterprise service bus (ESB), with workflow and shared metadata. Services shown: data and metadata relationship discovery; data quality profiling and monitoring; data modeling; data cleansing and matching; data integration (virtualisation and ETL); business glossary / info catalog; data privacy and lifecycle management; data audit and protection; Data Governance/Management Console. Data stores shown: MDM (prod, cust, asset; CRUD), data warehousing (DW), Big Data, cloud, data virtualisation.]
Alternatively Interoperability Is Needed Across Tools To Use Data Preparation
Jobs Developed By Different Users
[Diagram: an EIM tool suite, used by IT data architects, data scientists and business analysts, exchanging metadata with other tools for interoperability: stand-alone data wrangling tools; self-service DI embedded in self-service BI tools (PowerQuery); and cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev). EIM components shown: Data Governance/Management Console and services for data and metadata relationship discovery, data quality profiling and monitoring, data modeling, data cleansing and matching, data integration, business glossary / info catalog, data privacy and lifecycle management, and data audit and protection.]
Metadata Management In A Data Reservoir
- EIM Platform Information Catalog And Apache Atlas
[Diagram: the Information Catalog of an EIM tool suite (services plus Data Governance/Management Console, used by IT data architects, data scientists and business analysts) exchanging metadata with Apache Atlas and its graph store. Metadata links are also shown for stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (PowerQuery), and cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev).]
Metadata Management In A Data Reservoir
- Stand-Alone Information Catalog And Apache Atlas
[Diagram: a stand-alone Information Catalog exchanging metadata with Apache Atlas and its graph store. Metadata links are also shown for the EIM tool suite (services plus Data Governance/Management Console, used by IT data architects, data scientists and business analysts), stand-alone data wrangling tools, self-service DI embedded in self-service BI tools (PowerQuery), and cloud DI (Microsoft Data Factory, Dell Boomi, SnapLogic, IBM DataWorks, Informatica Rev).]
New Trusted Data Produced By Refining Un-Modelled Data Should Be Defined
In A Business Glossary
[Diagram: inside the corporate firewall, a data refinery sandbox takes raw data (untrusted) through in-progress data to refined data (trusted, fit for use). The refined data is defined in the Business Glossary and exposed through data virtualisation.]
Could implement the SBV in a data
virtualisation server
The Critical Importance Of An Information Catalog
– We MUST Be Able To Answer This Question
Business user: What information exists
about……….?
Where is that likely to be documented?
An Information Catalogue
The Information Catalog
- What Else Do I Want To Know?
Can I search for information? (faceted search via your SBV)
Does the data exist?
Is the data trusted? (what is the rating)
Is the data sensitive? (what is the rating)
Is it high business value? (what is the rating)
Can I order it?
Can I specify where to deliver it to and in what format?
Can I see where it is used and who owns it?
Information Catalogue
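The questions above amount to faceted search over catalog entries. A minimal sketch follows, assuming hypothetical entries whose facets (`trust`, `sensitivity`, `value`) reuse the classification values from earlier in the deck; a real catalog would index these rather than scan a list.

```python
from collections import Counter

# Hypothetical catalog entries; facet names follow the questions on the slide.
ENTRIES = [
    {"name": "customer_master", "trust": "Trusted",
     "sensitivity": "Sensitive (PII)", "value": "Critical"},
    {"name": "web_logs_raw", "trust": "Raw (original)",
     "sensitivity": "Internal use", "value": "Marginal"},
    {"name": "churn_scores", "trust": "Trusted",
     "sensitivity": "Internal use", "value": "Important"},
]


def facet_search(entries, **selected):
    """Return the entries that match every selected facet value."""
    return [e for e in entries if all(e.get(k) == v for k, v in selected.items())]


def facet_counts(entries, facet):
    """Count how many entries fall under each value of one facet."""
    return Counter(e[facet] for e in entries)


trusted = facet_search(ENTRIES, trust="Trusted")
trusted_internal = facet_search(ENTRIES, trust="Trusted", sensitivity="Internal use")
```

The counts returned by `facet_counts` are what an e-commerce style interface shows beside each filter, so a user can see how many data sets remain before narrowing the search further.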
Information Catalog Example - Waterline Data
Faceted Navigation Used In E-Commerce (e.g. Amazon) Is About To Get A
Much Bigger Role In Data Management
Add it to
your cart
Select the
products you
want
Ordered Parcel Delivery – The Same Thing Will Happen To Provision Ordered
Data
Ordered data
Virtual Information Provisioning Needs Policy Awareness At Runtime To Create
Virtual Views That Enforce Governance
[Diagram: a data reservoir of “finished-goods” refined data, where all data has SBV, exposed through data virtualisation. In one example an information provisioning service applies a security policy and returns either a virtual data subset (some data not permitted to be seen) or a virtual full data set (all data permitted to be seen). In a second example an information provisioning service applies a compliance policy and returns either a virtual data subset (some data not allowed to be provisioned outside the country) or a virtual full data set (all data provisioned inside the country).]
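Policy awareness at runtime can be sketched as a view function that evaluates the security and compliance policies on each request instead of storing a filtered copy. The `Consumer` record, the row-level metadata fields (`sensitivity`, `export_restricted`, `home_country`) and both rules are illustrative assumptions, not a description of any data virtualisation product.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Consumer:
    name: str
    country: str
    clearances: FrozenSet[str]


# Hypothetical refined data; each row carries its own governance metadata.
ROWS = [
    {"id": 1, "home_country": "IE", "sensitivity": "public", "export_restricted": False},
    {"id": 2, "home_country": "IE", "sensitivity": "pii", "export_restricted": True},
    {"id": 3, "home_country": "IE", "sensitivity": "public", "export_restricted": True},
]


def virtual_view(rows: List[dict], consumer: Consumer) -> List[dict]:
    """Evaluate policies at request time; no filtered copy is stored."""
    def permitted(row: dict) -> bool:
        # Security policy: sensitive rows need a matching clearance.
        secure = row["sensitivity"] == "public" or row["sensitivity"] in consumer.clearances
        # Compliance policy: restricted rows stay inside their home country.
        compliant = not row["export_restricted"] or row["home_country"] == consumer.country
        return secure and compliant
    return [r for r in rows if permitted(r)]


analyst_ie = Consumer("analyst", "IE", frozenset({"pii"}))
partner_de = Consumer("partner", "DE", frozenset())
full_set = virtual_view(ROWS, analyst_ie)
subset = virtual_view(ROWS, partner_de)
```

The cleared analyst inside the country gets the virtual full data set, while the partner outside the country gets a virtual data subset, both from the same underlying rows.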
Conclusions
The challenge is now to manage data in the entire analytical ecosystem
Invest in new skills and training needed in this environment
Data needs to be organised in a data reservoir to prevent chaos
Hadoop is becoming a platform to accelerate cleansing and ETL processing to conduct
exploratory analytics
Multiple options exist to allow IT and business users to clean and integrate data in preparation
for analysis
• Data integration vendors have added functionality to support Hadoop
• Self-service data cleansing and integration tools also exist
The ideal solution is a single platform that supports IT and business user self-service data
integration
An information catalog is critical for end-to-end data governance
• Understanding what data is available (descriptive metadata)
• Understanding how it was transformed (metadata lineage)
Data virtualisation is needed to see across multiple data reservoirs
Start small and build out incrementally – don’t just load data and hope
www.intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Thank You!