IBM Industry Models and Data Lake

© 2016 IBM Corporation. IBM Industry Models and the IBM Data Lake, January 2017. Pat O’Sullivan – IBM Analytics. Email: [email protected]. Twitter: @PatOSullivanIBM. © 2017 IBM Corporation

Upload: pat-osullivan

Post on 19-Feb-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: IBM Industry Models and Data Lake

© 2016 IBM Corporation

IBM Industry Models and the IBM Data Lake
January 2017

Pat O’Sullivan – IBM Analytics
Email: [email protected]
Twitter: @PatOSullivanIBM

© 2017 IBM Corporation

Page 2: IBM Industry Models and Data Lake


Disclaimer

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.


Page 3: IBM Industry Models and Data Lake


The broadening scope of analytics

[Diagram labels: SOA, Master Data Management Hub, Applications, Data Warehouse, Operational Data Store, Pattern Discovery for Analytics, Hadoop, Sandboxes, Reporting, Search for Data, Analyze Values, Data Lake]

Add a business desire for real-time analytics and self-service data, together with increasing regulation of individual privacy, and it becomes necessary to have a well-defined, managed and governed approach to information architecture. We call this IBM’s Data Lake.

Page 4: IBM Industry Models and Data Lake


Big Data Lakes or Swamps?

As we collect data
• Can we preserve clarity?
• Do we know what we are collecting?
• Can we find the data we need?

Are we creating a data swamp?

How do we build trust in big data?
• Do we know what data is being used for?

Page 5: IBM Industry Models and Data Lake


The Data Lake

Data Lake = Efficient Management, Governance, Protection and Access.

[Diagram labels: Data Lake, Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories]

Page 6: IBM Industry Models and Data Lake


Users supported by the Data Lake

[Diagram: the Data Lake (System of Insight), made up of the Information Management and Governance Fabric, Data Lake Services and Data Lake Repositories. Surrounding labels: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Lake Operations; Enterprise IT; Other Data Lakes; Systems of Engagement; Systems of Automation; Systems of Record; New Sources.]

Page 7: IBM Industry Models and Data Lake


The Data Lake subsystems

[Diagram: the Data Lake (System of Insight) with its subsystems: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Data Lake Repositories, and the Information Management and Governance Fabric. Surrounding labels: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Lake Operations; Enterprise IT; Other Data Lakes; Systems of Engagement; Systems of Automation; Systems of Record; New Sources.]

Page 8: IBM Industry Models and Data Lake


Data lake repositories

[Diagram labels]
• System-level Data (Landing Area)
• Structured and Optimized
• Specialist Processing
• Accumulation of Context for Master and Reference Data
• Self-managed Data
• Metadata
• Refined data formatted for particular consumers

Page 9: IBM Industry Models and Data Lake


IBM Industry Data Models

IBM Industry Data Models provide pre-defined data structures that help accelerate data warehouse, data lake and business intelligence projects.

Industry specific issues being addressed

Integrated set of Models from business requirements to low level design

Predefined and pretested deployment to RDBMS and HDFS environments

[Diagram: IBM Industry Data Models.
Industries: Banking, Insurance, Fin Markets, Retail, Healthcare, Telecom, E&U.
Business issues: Customer Insight, Profitability, Risk, Regulatory Compliance.
Model content labels: Business Vocabulary, KPIs, Supportive Terms, Data Classifications, Business Models, Analysis Models, Design Models, Atomic DW Models, Dimensional Models.
Deployment labels: Data Warehouse, Operational Data Store, Data Marts, Big Data, Information Integration & Governance.
Other labels: Project Acceleration, Business, Technical.]

Page 10: IBM Industry Models and Data Lake


IBM Industry Models and main data lake deployment paths

Business Vocabulary is deployed to the Data Lake Catalog via tools such as InfoSphere Information Governance Catalog (IGC)

Atomic (Inmon) and Dimensional (Kimball) Data Models are deployed to the data lake via tools such as InfoSphere Data Architect (IDA) and ERwin

Supporting collateral: model-specific white papers and best-practice documents outlining the main deployment patterns and implementation considerations

Page 11: IBM Industry Models and Data Lake


Overall set of Models

[Diagram labels:
Business Vocabulary (IGC): Business Terms/FSDM, Supportive Content, Analytical Requirements
Data Models, Analysis level Models (IDA): Business Data Model
Data Models, Design level Models (IDA): Atomic Warehouse Model, Dimensional Warehouse Models]

Page 12: IBM Industry Models and Data Lake


Big Data Landscape – main components touched by the IBM Data Models

[Diagram: IBM Data Lake reference architecture.
Repository labels: Descriptive Data, CATALOG, Harvested Data, INFORMATION WAREHOUSE, DEEP DATA, Historical Data, OPERATIONAL HISTORY, Published Data, REPORTING DATA MARTS, SANDBOXES.
Interaction and service labels: Enterprise IT Interaction, Raw Data Interaction, Curation Interaction, View-based Interaction, Catalog Interfaces, Management, Decision Model Management, Information Integration & Governance, Information Service Calls, Data In, Data Out, Data Access, Data Deposit, Search Requests, Report Requests, Notifications, Events to Evaluate, Understand Information Sources, Advertise Information Source, Understand Compliance, Report Compliance, Deploy Decision Models, Deploy Real-time Decision Models.
Source and consumer labels: Enterprise IT, System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources, Other Systems of Insight, Line of Business Applications, Simple Ad Hoc Discovery and Analysis, Reporting, Governance, Risk and Compliance Team, Information Curator, Data Reservoir Operations.]

For full information on the IBM Data Lake Reference Architecture, see the IBM Redbook "Designing and Operating a Data Reservoir": http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248274.html?Open

Page 13: IBM Industry Models and Data Lake


Options regarding common models/glossaries to encourage standardization and reuse

[Diagram: IBM Data Lake reference architecture, annotated with the callouts listed below.
Repository labels: Descriptive Data, CATALOG, Shared Operational Data, ASSET HUB, Harvested Data, DEEP DATA, INFORMATION WAREHOUSE, Historical Data, OPERATIONAL HISTORY, Published, OBJECT CACHE, REPORTING DATA MARTS, SANDBOXES.
Interaction and service labels: Enterprise IT Interaction, Curation Interaction, View-based Interaction, Management, Information Integration & Governance, EXECUTION ENGINES, WORKFLOW MONITOR, Information Service Calls, Service Interfaces, Data In, Data Ingestion, Data Out, Publishing Feeds, Data Access, Data Deposit, Search Requests, Report Requests, Deploy Decision Models, Deploy Real-time Decision Models.
Source and consumer labels: Enterprise IT, System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources, Line of Business Applications, Consumers of Insight, Simple Ad Hoc Discovery and Analysis, Reporting, Analytical Insight Applications, Data Analysts/Data Scientists, Analytics Tools, Data Management Operations.]

Shared set of term and physical asset definitions in the Catalog that underpin all queries by all users

Data Scientists can make use of predefined catalogs and are likely to create new catalog entries during their daily activities

Business Users use specific subsets of the same shared Catalog as other users, to ensure consistency of language and meaning

Any published structures required by the Business are based on the same standard definitions and structures as those used elsewhere

Standardized set of Business Term and Data Model definitions used to enforce both the meaning and, where appropriate, the structure of stored data

Data Management Operations use the same shared set of models and catalog entries to build the necessary production ETL assets
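The idea of one shared Catalog with audience-specific subsets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Term` class, the audience names and the three sample terms are hypothetical and are not taken from IGC or from any IBM Industry Model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One business term; every audience sees the same definition."""
    name: str
    definition: str
    audiences: frozenset  # user groups whose glossary includes this term


# Hypothetical catalog entries, invented for this sketch.
CATALOG = (
    Term("Customer", "A party that holds at least one agreement.",
         frozenset({"business", "data_science", "operations"})),
    Term("Claim", "A request for payment under an agreement.",
         frozenset({"business", "data_science", "operations"})),
    Term("Surrogate Key", "A system-generated identifier for a row.",
         frozenset({"data_science", "operations"})),
)


def glossary_for(audience, catalog=CATALOG):
    """Return the subset of the shared catalog visible to one audience."""
    return {t.name: t.definition for t in catalog if audience in t.audiences}


business = glossary_for("business")
data_science = glossary_for("data_science")

# The business glossary is a subset, and shared terms mean the same thing.
assert set(business) <= set(data_science)
assert all(business[name] == data_science[name] for name in business)
```

Because each audience's glossary is derived from the same entries rather than copied, a definition can only change in one place, which is the consistency property the callouts describe.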

Page 14: IBM Industry Models and Data Lake


Catalog Deployment - Models in the Descriptive Data Zone

[Diagram labels:
Business Vocabulary (IGC): Business Terms/FSDM, Supportive Content, Analytical Requirements
Analysis level Models (IDA): Business Data Model
Design level Models (IDA): Atomic Warehouse Model, Dimensional Warehouse Models]

Purpose
Provide a standard business language and information model that can be used when discussing business concepts and related technical components.

Steps
1. Business Vocabulary Models are deployed to the Catalog (IGC), where they are used and maintained by business analysts and data stewards.
2. The logical data models (e.g. Business, Atomic and Dimensional Warehouse Models) are imported into the catalog. However, they are mastered in a modelling tool such as InfoSphere Data Architect.

Considerations
• Evolving patterns/best practices for the overall management of enterprise and LOB glossaries

[Diagram: the Data Lake reference architecture with the Descriptive Data Zone (Descriptive Data, CATALOG) highlighted and numbered callouts 1 and 2. Other labels: Repositories, Shared Operational Data, ASSET HUB, Harvested Data, DEEP DATA, INFORMATION WAREHOUSE, Historical Data, OPERATIONAL HISTORY, REPORTING DATA MARTS, SANDBOXES, Information Integration & Governance, Enterprise IT Interaction, Information Service Calls, Service Interfaces, Data In, Data Ingestion, Data Out, Publishing Feeds, Enterprise IT, System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources, Business Users, Data Scientists, Business Data Model.]

Page 15: IBM Industry Models and Data Lake


Warehouse and Marts – Models in Integrated Warehouse Zone

[Diagram: the Data Lake reference architecture. Labels: Repositories, Descriptive Data, CATALOG, Shared Operational Data, ASSET HUB, Harvested Data, Historical Data, OPERATIONAL HISTORY, Information Integration & Governance, Enterprise IT Interaction, Information Service Calls, Service Interfaces, Data In, Data Ingestion, Data Out, Publishing Feeds, Enterprise IT, System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources.
Model labels: Business Vocabulary (IGC): Business Terms, Supportive Content, Analytical Requirements; Atomic Warehouse Model; Dimensional Warehouse Models.]

Purpose
Provide data modellers with consistent data structures for deployment across the different aspects of an integrated Information Warehouse and Marts zone.

Steps
1. The Atomic Warehouse Model is used as the basis for the Inmon-style central relational Information Warehouse.
2. The Dimensional Warehouse Model is used as the basis for the Kimball-style Dimensional Information Warehouse.
3. The Dimensional Warehouse Model provides the business-issue-specific structures to enable the deployment of Reporting Data Marts.

[Diagram: the Integrated Warehouse & Marts Zone is highlighted, with numbered callouts 1, 2 and 3. Labels in and around the zone: DEEP DATA, INFORMATION WAREHOUSE, REPORTING DATA MARTS, Business Users, Analysis level Models (IDA), Design level Models (IDA).]

Page 16: IBM Industry Models and Data Lake


Big Data Deployment – Models in the Landing Area Zone

[Diagram: the Data Lake reference architecture. Labels: Repositories, Shared Operational Data, ASSET HUB, Harvested Data, INFORMATION WAREHOUSE, Historical Data, Information Integration & Governance, Enterprise IT Interaction, Information Service Calls, Service Interfaces, Data In, Data Ingestion, Data Out, Publishing Feeds, Enterprise IT, System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources.
Model labels: Business Vocabulary (IGC): Business Terms, Supportive Content, Analytical Requirements; Atomic Warehouse Model; Dimensional Warehouse Models.]

Purpose
Provide the basis for a consistent and appropriate use of schemas in the different repositories in the Landing Area Zone.

Steps
1. The Atomic Warehouse Model is used as the basis for the deployment of both schema-at-write and schema-at-read Hadoop Deep Data structures.
2. The Atomic Warehouse Model may provide the basis for schema-at-read deployment of Operational History raw data structures.

Considerations
• Further investigation is needed into the potential role for DWM deployments to Hadoop-based technology.
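The difference between schema-at-write and schema-at-read can be illustrated with a small, tool-independent Python sketch. It assumes a hypothetical three-column patient entity standing in for a structure generated from the Atomic Warehouse Model; the column names, sample records and helper functions are invented for illustration and do not come from the models themselves.

```python
import json

# Hypothetical column list standing in for an entity generated from the
# Atomic Warehouse Model; names and types are illustrative only.
PATIENT_SCHEMA = {"patient_id": int, "family_name": str, "birth_year": int}


def conform(record, schema):
    """Return the record cast to the schema, or raise ValueError."""
    missing = [name for name in schema if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: cast(record[name]) for name, cast in schema.items()}


def write_with_schema(raw_lines, schema):
    """Schema-at-write: every record must conform before it is stored."""
    return [conform(json.loads(line), schema) for line in raw_lines]


def read_with_schema(stored_lines, schema):
    """Schema-at-read: lines are stored untouched; the schema is applied
    only when they are read, and non-conforming records are skipped."""
    for line in stored_lines:
        try:
            yield conform(json.loads(line), schema)
        except (ValueError, TypeError):
            continue


raw = [
    '{"patient_id": "1", "family_name": "Byrne", "birth_year": "1970"}',
    '{"patient_id": "2", "family_name": "Walsh"}',  # birth_year is missing
]

# Schema-at-read keeps both raw lines but returns only the conforming one.
readable = list(read_with_schema(raw, PATIENT_SCHEMA))
```

With schema-at-write, the second record would be rejected at load time; with schema-at-read, it stays in the landing area and is simply not visible through this particular schema. In both cases the same model-derived column list does the work.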

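[Diagram: the Landing Area Zone is highlighted, with numbered callouts 1 and 2. Labels in and around the zone: DEEP DATA, OPERATIONAL HISTORY, REPORTING DATA MARTS, SANDBOXES, Descriptive Data, CATALOG, Business Users, Data Scientists, Analysis level Models (IDA), Design level Models (IDA).]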

Page 17: IBM Industry Models and Data Lake

Summary Picture

[Diagram: the Data Lake reference architecture overlaid with the model layers.
Model layers: Business Vocabulary, Logical Model Atomic, Logical Model Dimensional, Physical Model Hadoop, Physical Model RDBMS, Physical Model Dimensional.
Architecture labels: Repositories, Descriptive Data, CATALOG, Shared Operational Data, ASSET HUB, Harvested Data, DEEP DATA, INFORMATION WAREHOUSE, Historical Data, OPERATIONAL HISTORY, REPORTING DATA MARTS, SANDBOXES, Information Integration & Governance, Enterprise IT Interaction, Information Service Calls, Service Interfaces, Data In, Data Ingestion, Data Out, Publishing Feeds, Enterprise IT, System of Record Applications, Enterprise Service Bus, New Sources, Third Party Feeds, Third Party APIs, Systems of Engagement, Internal Sources, Business Users, Data Scientists.]

Legend
• Mappings to inform common Business Meaning using the Business Vocabulary in IGC
• Generation of Technical Structure using the ER Data Models in an ER tool (e.g. IDA)

Use of Business Vocabulary to understand Business Meaning by Users
• The Business Vocabulary Terms in IGC can be used to enforce common business meaning throughout the Data Lake landscape.
• The output of the various Logical Models can be used to define the technical structure of assets that need to be created in the lake, where a predefined schema is required (e.g. schema-at-write).

[The diagram carries numbered callouts 1 to 10.]

Page 18: IBM Industry Models and Data Lake


Three different lifecycles relating to the evolution of the models with the Data Lake

1. Maintenance of the Business Language
   Stages: Requirement, Analysis, Refine, Deploy, Review. Artifacts: AR, BT, SG.
   The use of the Industry Models Business Vocabularies to enable a common business meaning of language for all Data Lake users.

2. Development of the ER/UML Models
   Stages: Requirement, Analysis, Design, Generate, Review. Artifacts: BDM, AWM, DWM.
   The use of the ER and UML models to enforce a common structure of artifacts where required in the Data Lake.

3. Management of the runtime production environment
   Artifacts: BT, AWM (Physical), DWM (Physical). Labels: Data Lake Catalog, Data Lake Repositories, Data, Data Lake Users.
   The use of the Industry Models Business Vocabularies and derived physical assets in the creation and ongoing management of the Data Lake.

Legend
BT - Business Terms
AR - Analytical Requirements
SG - Supportive Glossaries
BDM - Business Data Model
AWM - Atomic Warehouse Model
DWM - Dimensional Warehouse Model

Page 19: IBM Industry Models and Data Lake


The Repositories used by the Data Lake Lifecycles

[Diagram labels:
Environments: Business Language Environment, Modelling Environment, Runtime Data Lake Environment.
Repositories: IGC Dev Repository, Collaboration/Versioning Repository (e.g. RTC), IGC Production Repository, Data Lake Catalog, Data Lake Repositories, Data Repositories RDBMS, Data Repositories HDFS.
Tools and interfaces: IGC Browser, IGC for Eclipse, IDA, IGC Anywhere/REST.
Other labels: IMAM, IMAM IDA Import, Physical Data Model, IGC Workflow.]

Page 20: IBM Industry Models and Data Lake


Lifecycle 1 - Maintaining the Business Language of the Data Lake

Objective: The creation and ongoing maintenance of the common Business Language that underpins the Data Lake and is used by all users to describe its various components

Roles Involved: Business user reps, Business SMEs, Business Language Stakeholders

[Diagram: Maintenance of the Business Language. Stages: Requirement, Analysis, Refine, Deploy, Review. Artifacts: AR, BT, SG.]

Considerations:
• Determining the needs of the different users of the Data Lake (different uses, need for different dialects, amount of technical metadata in the Language)
• Determining the approach to building the business language, and the overall flow for creation, promotion and maintenance of terms
• Defining the specific glossary suitable for pure business users versus Business Analysts, Data Scientists, Data Modellers and IT staff
• Determining the role of IBM Industry Models in building out the Business Language

Page 21: IBM Industry Models and Data Lake


Lifecycle 2 - Developing the technical Models

Objective: The use of the ER and UML models to enforce a common structure of artifacts where required in the Data Lake

Roles Involved: Modellers, Business SMEs

Considerations:
• Ensuring appropriate communication between the Data Modellers and the Business Users
• Determining when to use, and when not to use, Data Models for the Data Lake repositories
• Determining the ongoing use of a canonical, platform-independent Logical Model as a basis for deploying the different types of platform-specific physical Models required across the Data Lake Repositories
• Determining the specific data modelling approaches and scenarios for deploying to the different Data Lake repositories

[Diagram: Development of the ER/UML Models. Stages: Requirement, Analysis, Design, Generate, Review. Artifacts: BDM, AWM, DWM.]

Page 22: IBM Industry Models and Data Lake


Lifecycle 3 - Deploying the Models into the runtime Data Lake environment

Objective: The use of the Industry Models Business Vocabularies and derived physical assets in the creation and ongoing management of the Data Lake

Roles Involved: Business user reps, Modellers, Data Lake Ops staff

Considerations:
• Determining how to deploy the Business Language for optimal use by the different Data Lake users (managing access to the different terms, handling ongoing updates)
• Determining the strategy for the ongoing association of the Business Terms with Data Assets (which users tag new data elements with the Business Language, and when)
• Determining the approach for the Data Lake ops staff to deploy the physical Data Models, and how feedback to the Data Modellers is handled
• Determining how to incorporate the Data Model artifacts into ongoing Data Lake governance

[Diagram: Management of the runtime production environment. Artifacts: BT, AWM (Physical), DWM (Physical). Labels: Data Lake Catalog, Data Lake Repositories, Data, Data Lake Users.]

Page 23: IBM Industry Models and Data Lake


Industry Models Hadoop deployment example – low level

[Diagram labels:
Sample Source Data: Claim File, Patient Information File
Data Transformation Process (Hive, Spark, Pig, ETL, ..), shown twice
HDFS: /data/udmh/patient/<date>/<version>/.. Data files.. and /data/udmh/claim/<date>/<version>/.. Data files..
Hive Metastore: Patient party ext Table, Claim ext Table
HIVE, Vendor SQL for Hadoop interface
Downstream Data Transformation processes
Logical Data Model: Patient, Claim
Physical Data Model: Patient, Claim, Patient / Claim
Numbered callouts 1, 2 and 3]

Three possible deployment paths
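One way to picture the step from a physical data model to a Hive external table over the HDFS landing directories is the Python sketch below. It is a sketch under assumptions, not the deck's method: the column names and types, the comma-delimited text format, and the sample `<date>` and `<version>` values are all hypothetical, and only the `/data/udmh/<entity>/<date>/<version>` layout and the external-table idea come from the slide.

```python
def landing_path(entity, date, version):
    """Build a directory following the /data/udmh/<entity>/<date>/<version> layout."""
    return f"/data/udmh/{entity}/{date}/{version}"


def hive_external_table_ddl(table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement for files already in HDFS.

    columns is a list of (name, hive_type) pairs taken from a physical model.
    The delimited text format is an assumption made for this sketch.
    """
    column_sql = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical physical-model columns for the patient entity.
patient_columns = [
    ("patient_id", "BIGINT"),
    ("family_name", "STRING"),
    ("birth_date", "DATE"),
]

# "2017-01-31" and "v1" are made-up values for <date> and <version>.
ddl = hive_external_table_ddl(
    "patient_party_ext",
    patient_columns,
    landing_path("patient", "2017-01-31", "v1"),
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metastore entry and leaves the data files in HDFS, which suits a landing area that other transformation processes also read.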

Page 24: IBM Industry Models and Data Lake


Mapping of incoming new structures in the Data Lake

[Diagram labels:
Runtime Data Lake Environment
Repositories: IGC Dev Repository, IGC Production Repository, Data Lake Catalog, Data Lake Repositories, Data Repositories RDBMS, Data Repositories HDFS, New HDFS Structure.
Tools and interfaces: IDA, IGC for Eclipse, IGC Browser, IGC Anywhere/REST.
Other labels: IMAM, IMAM IDA Import, Physical Data Model, IGC Workflow.
Numbered callouts 1, 2a, 2b and 2c]

Question: what are the best practices for the “bottom-up” mapping of a new structure in the data lake that was not originally derived from a Data Model?

1. Direct mapping from the Physical Asset to the appropriate Term in the Catalog
2. Indirect mapping via a specifically created data model (actual mapping done either via BGE or in BG Browser)
   a. Reverse engineer a new model from the HDFS Structure
   b. Import the Data model into the Catalog
   c. Import the mappings into the Catalog from IDA (is the mapping done in IDA via BGE?)
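Option 1, direct mapping from a physical asset to catalog terms, can be sketched as a simple name-matching pass. This is a hypothetical illustration in plain Python and does not use any IGC or IDA API; the column names, the term names and the matching rule (case-insensitive, separators ignored) are assumptions made for the example.

```python
import re


def normalize(name):
    """Lower-case a name and collapse separators, so 'Birth-Date' matches 'Birth Date'."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def propose_mappings(columns, terms):
    """Match physical column names to business terms by normalized name.

    Returns (mapped, unmapped): a dict of column -> term, and the list of
    columns that found no term and need a data steward's attention.
    """
    by_key = {normalize(term): term for term in terms}
    mapped, unmapped = {}, []
    for column in columns:
        term = by_key.get(normalize(column))
        if term is None:
            unmapped.append(column)
        else:
            mapped[column] = term
    return mapped, unmapped


# Hypothetical columns of a new HDFS structure and hypothetical catalog terms.
new_columns = ["patient_identifier", "Birth-Date", "clm_amt"]
catalog_terms = ["Patient Identifier", "Birth Date", "Claim Amount"]

mapped, unmapped = propose_mappings(new_columns, catalog_terms)
```

Exact-name matching only covers the easy cases; an abbreviated column such as `clm_amt` falls through to the unmapped list, which is where the indirect route via a reverse-engineered model, or manual stewardship, would come in.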

Page 25: IBM Industry Models and Data Lake


Model artifacts in the Data Lake Runtime environment – main usage patterns

There are three main ways in which the data model artifacts are used in, or impact, the Data Lake runtime environment:

• Industry Model artifacts are deployed into the Data Lake runtime environment
  • Most likely as an output from the two lifecycles “Maintaining the Business Language” and “Deploying the Technical Models”
• Industry Model artifacts deployed in the Data Lake are used by and affected by Data Lake users
  • For example, Data Lake users provide feedback on changes/corrections/additions to the model artifacts
• Industry Model artifacts deployed in the Data Lake are impacted by new or changed data coming into the Data Lake Repositories
  • The most obvious example is the need for new mappings to a new or changed Repository brought into the Data Lake.

Page 26: IBM Industry Models and Data Lake


REFERENCE MATERIAL
New Information Architectures and Capabilities

Page 28: IBM Industry Models and Data Lake


IBM Industry Models and Data lake publications so far :

http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14877USEN

http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14872USEN


https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IMW14911IEEN&