Denodo DataFest 2017: Multi-zone Data Virtualization for Data Lakes

Multi-zone Data Virtualization for Data Lakes

How to share data with other government agencies while preserving privacy and security guidelines

Paul Grooten, October 26th, 2017



Statistics Netherlands (CBS) Key Characteristics

• Autonomous public body with a legal entity (“ZBO”)

• Official statistics: economic, social, census

• National and regional

• 180 mEur | 2000+

• Founded in 1899 (5 fte, 2 rooms), now 3 offices: The Hague, Heerlen, Bonaire

• Ambition: to become the Data Hub of the Dutch Government

• Statistical process: Data Collection, Data Processing, Publishing

Which problems do we want to solve

• Current methods and technologies are no longer sufficient to share data easily on a larger scale

• We want to share more statistical data (also with external parties)

• We want to become faster and need a shorter time to market

• We need to reduce costs (storage, infrastructure)

• We need to work on secure, privacy-preserving data sharing

• Data sets should be easy to find


The layered Data Architecture

[Figure: the layered data architecture, running from supply (data sources) to demand (consumers). Source: CIO office | Version 1.81]

• Data Source Layer (DSL): (legacy) data sources: CSV, SQL DB, Web Srv, XLS, App, CBDS; ETL tooling

• Data Virtualization (Denodo): Data Transformation Layer (DTL) and Data Provisioning Layer (DPL); Building Blocks 1 to 4; Web-Service A, OData Web-Service B, Web-Service C

• Consumer Layer (CL): Web Page, S2S, Tooling (P = Data Prep, V = Data Visualization, A = Data Analytics)

• Alongside the layers: Security, Security & Authorisation, Data Governance, Metadata Management (Tech Meta, Import Conceptual Meta, Conn. String, User Que.)

• Legend: Existing | New
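The layering in the diagram can be illustrated with a small sketch. This is not Denodo code or the CBS implementation: the function names, the sample CSV, and its values are all hypothetical, chosen only to show how a consumer request flows through a provisioning layer and a transformation layer down to the source, without the data being copied in advance.

```python
import csv
import io

# Data Source Layer (DSL): a hypothetical CSV source with made-up values.
RAW_CSV = "region,year,count\nA,2016,100\nB,2016,50\n"


def source_layer():
    """Read rows from the underlying source only when a query arrives."""
    return csv.DictReader(io.StringIO(RAW_CSV))


def transformation_layer(rows):
    """DTL: turn raw string fields into typed records (a virtual view)."""
    for row in rows:
        yield {
            "region": row["region"],
            "year": int(row["year"]),
            "count": int(row["count"]),
        }


def provisioning_layer(region):
    """DPL: a building block offered to consumers, filtered by region."""
    return [r for r in transformation_layer(source_layer()) if r["region"] == region]


# Consumer Layer (CL): a consumer asks the provisioning layer, never the source.
result = provisioning_layer("B")
print(result)
```

The consumer only sees `provisioning_layer`; the source format can change behind the transformation layer without affecting it, which is the decoupling the layered diagram suggests.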

…towards a multi zone DaaS Architecture

[Figure: multi-zone DaaS architecture. Each zone has its own building blocks and web service on top of the shared layers (DSL, DTL, DPL, CL), together with Security, Security & Authorisation, Data Governance, and Metadata Management (Tech Meta, User Que.). Other labels: EHB; Existing | New.]

• Zone CBS: Building Blocks 1 and 2; Web-Service A; CBDS, DSC

• Zone DDC1: Building Blocks 7 and 8; Web-Service D; DDC1; secured VPN

• Zone UDC1: Building Blocks 3 and 4; Web-Service B; UDC1; secured VPN

• Zone UDC2: Building Blocks 5 and 6; Web-Service C; UDC2; secured VPN

DDC = Departmental Data Center | UDC = Urban Data Center | CL = Consumer Layer | DPL = Data Provisioning Layer | DTL = Data Transformation Layer | DSL = Data Source Layer

P = Data Prep | V = Data Visualization | A = Data Analytics

So what is a Zone?

[Figure: the multi-zone DaaS architecture diagram from the previous slide, with zones CBS, UDC1, and UDC2, and with the DDC1 zone here labelled “Zone EZ”.]

A Zone:

• Is a virtual container in which a specified set of Data Governance rules apply

• Has a specific user group

• Contains virtual datasets

• Has its own authorization (which can and will differ from other zones)

• Has an owner

• Has its own Change Advisory Board

• Can have its own cache database on its own hardware
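The zone properties listed above can be sketched as a small data structure. This is an illustration only, not how Denodo or CBS models zones: the class, its field names, and the example users and datasets are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Zone:
    """Illustrative model of a zone; every field name here is an assumption."""

    name: str
    owner: str
    change_advisory_board: str
    governance_rules: set[str] = field(default_factory=set)
    user_group: set[str] = field(default_factory=set)
    virtual_datasets: set[str] = field(default_factory=set)
    # Authorization is kept per zone: dataset name -> users allowed to read it.
    authorization: dict[str, set[str]] = field(default_factory=dict)
    # Optional cache database on the zone's own hardware.
    cache_database: Optional[str] = None

    def can_read(self, user: str, dataset: str) -> bool:
        """Allow access only if the dataset is in this zone, the user is in
        the zone's user group, and the zone's authorization grants it."""
        return (
            dataset in self.virtual_datasets
            and user in self.user_group
            and user in self.authorization.get(dataset, set())
        )


# Two zones with different users, datasets, and authorization (made-up values).
cbs = Zone(
    name="CBS",
    owner="CBS",
    change_advisory_board="CAB CBS",
    user_group={"alice"},
    virtual_datasets={"building_block_1"},
    authorization={"building_block_1": {"alice"}},
)
udc1 = Zone(
    name="UDC1",
    owner="UDC1",
    change_advisory_board="CAB UDC1",
    user_group={"bob"},
    virtual_datasets={"building_block_3"},
    authorization={"building_block_3": {"bob"}},
    cache_database="udc1-cache",
)

print(cbs.can_read("alice", "building_block_1"))
print(udc1.can_read("alice", "building_block_3"))
```

Keeping authorization inside each zone, rather than in one global table, mirrors the point that authorization can and will differ between zones.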

What do we want to achieve with the Data Lake

[Figure: goals of the Data Lake, grouped under “reduce” and “stimulate”. Labels: Cost, Risk, Time to Market, data access, Statistical, Growth, Re-use.]

What are our next steps

• Finish Proof-of-Concept by end 2017

• Develop product (MVP)

• Get approval from Board of Management

• Implement Minimum Viable Product in 2018/H2

• Enhance MVP with new functionalities, such as disclosure control (on-the-fly confidentiality protection)

Recommendations

• Check whether your strategy is in line with your plans (and vice versa)

• Start experimenting with Data Virtualization at an early stage (start with the express version)

• Build a culture that embraces change and communicate your plans as often as possible