Destroying Data Silos
Hellmar Becker
Senior IT Specialist
Hadoop Summit 2015, Brussels
Who am I?
Datalake in ING NL
Integrate all data sources
within the bank into
one processing platform
• Batch data streams
• Live transactions
• Model building for customer interaction
Open source software where possible!
Zoom in: Datalake Archive
Today, let’s focus on one specific part of the story:
• Collect data in a unified format
• Store these data securely, protected from manipulation and unauthorized access
• Make data available to analytical applications
• Business Intelligence, Data Science
A Hadoop-based cluster is a good solution to address these targets
Circa 2000: Data Warehouse
• Based on relational database technology (Oracle, DB2, …)
• Challenge 1: Data model is difficult to adapt after the fact
• Challenge 2: Resilience and fault tolerance are not built in
• Challenge 3: Scaling proves difficult and expensive (specialized hardware)
• Challenge 4: RDBMS brings a lot of overhead – e.g. referential integrity
Modern data platforms (Hadoop, Spark, Cassandra) address many of these issues
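One common way Hadoop-style platforms ease Challenge 1 is schema-on-read: raw records are stored as they arrive, and a schema is applied only when the data are read. The talk does not name this technique, so the following is only a minimal sketch of the idea. The field names, the sample records, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw records as they might land in an archive. The second record carries a
# field ("channel") that the first one lacks. All field names are hypothetical.
RAW_LINES = [
    '{"account": "A1", "amount": "12.50"}',
    '{"account": "B2", "amount": "7.00", "channel": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields, cast them, default missing ones."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# Version 1 of the schema knows two fields.
schema_v1 = {"account": (str, ""), "amount": (float, 0.0)}
# Version 2 adds a field later; the stored raw data are not rewritten.
schema_v2 = {**schema_v1, "channel": (str, "unknown")}

rows_v1 = list(read_with_schema(RAW_LINES, schema_v1))
rows_v2 = list(read_with_schema(RAW_LINES, schema_v2))
```

Adding `channel` in `schema_v2` changes only the reading code. In a relational warehouse the equivalent change would typically mean altering tables and reloading data.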
Old World vs. New World
[Diagram: data warehouse architecture. Labels: Operational data, Staging files, ETL, Operational data, Data Marts, Metadata, Detail data, Aggregated data, Reporting, Analytics, Predictive Modeling]
Target: Data Lake Architecture
Pick your battles
• Toolset in the bank has grown around RDBMS and mainframe
• We cannot sweep out everything; we have to handle legacy systems
• Plant a seed: Replace one component and connect it to all legacy interfaces
• Grow from there!
[Diagram: data warehouse architecture. Labels: Operational data, Staging files, ETL, Operational data, Data Marts, Metadata, Detail data, Aggregated data, Reporting, Analytics, Predictive Modeling]
Challenges
• Zero Touch Deployment
• Risk issues with deployment tools that require admin (root) access to servers
• Policies within the organization
• Example: The unit of consideration is a single server, but we need to look at entire clusters
• Legacy protocols – Mainframe data formats, e.g. character sets
• Security is paramount – protect sensitive data
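The character-set issue with mainframe data can be illustrated with a small conversion step. This is a sketch under assumptions, not the pipeline used at ING: it assumes the source text is encoded in EBCDIC code page 037, which Python's standard library exposes as the `cp037` codec. The actual code page depends on the mainframe configuration, and the `ebcdic_to_utf8` helper is hypothetical.

```python
# Assumption: the mainframe export uses EBCDIC code page 037 (US/Canada).
# Other sites may use a different code page, e.g. cp500 or cp1140.
EBCDIC_CODEC = "cp037"


def ebcdic_to_utf8(raw: bytes, codec: str = EBCDIC_CODEC) -> bytes:
    """Decode EBCDIC text bytes and re-encode them as UTF-8."""
    text = raw.decode(codec)
    return text.encode("utf-8")


# The five bytes below spell "HELLO" in code page 037.
sample = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])
converted = ebcdic_to_utf8(sample)

# Round trip: encode a string to EBCDIC, then convert it back.
round_trip = ebcdic_to_utf8("ING 2015".encode(EBCDIC_CODEC))
```

This handles character fields only. Binary and packed-decimal fields in mainframe records are not text and need separate handling.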
Security Concept
Authentication Management
• Using Kerberos – proven technology, secure but hard to configure
• Need to align access with HR database – connect to corporate directory
Authorization Management
• Uniform views across all components of a cluster
• Using Ranger to secure all services with a uniform set of policies
Auditing
• Ranger logs all interactions so that threats can be identified and eliminated
Connecting the Pieces
• Sideline challenge: Linux world and Windows world need to be connected
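The idea of one uniform policy set plus an audit trail can be sketched in a few lines. This is a toy model for illustration only. It is not Ranger's API or policy format, and the `Policy` and `PolicyEngine` names, the group names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    resource: str        # resource prefix the policy covers (hypothetical paths)
    groups: frozenset    # directory groups the policy applies to
    actions: frozenset   # actions the policy permits


@dataclass
class PolicyEngine:
    policies: list
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, user_groups, resource, action):
        """Check one request against all policies and record the decision."""
        allowed = any(
            resource.startswith(p.resource)
            and action in p.actions
            and bool(p.groups & set(user_groups))
            for p in self.policies
        )
        # Every decision is logged, whether access was granted or denied.
        self.audit_log.append((user, resource, action, allowed))
        return allowed


engine = PolicyEngine(
    [Policy("/archive/payments", frozenset({"analysts"}), frozenset({"read"}))]
)
granted = engine.is_allowed("alice", ["analysts"], "/archive/payments/2015", "read")
denied = engine.is_allowed("bob", ["interns"], "/archive/payments/2015", "read")
```

The same engine instance answers every request, so all services see identical rules, and the audit log captures denials as well as grants.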
Agile Working
• Setup of this kind of project requires interdisciplinary cooperation
• DevOps teams provide a lot of the required skills with short communication paths
• Cooperation across department boundaries can be a challenge
• Agile delivery vs. expectations and timelines
• Manage external dependencies in a Scrum setting
Shaping the Future
Existing standards do not always fit our goals and tools
Work with interdepartmental teams – DevOps, Infra, DBAs, Business, Risk, Legal
We are influencing the standards that the bank will set for coming systems!
Attributions
• Hellmar in Nîmes / With Python in Mindanao, by the author
• Domtoren in het oranje licht by helena_is_here is licensed under CC BY 2.0
• Data Pipeline, ING OIB Image Bank
• Data Pipeline, ING OIB Image Bank, edited (cropped) by the author
• Baby Elephant with mother by David Rosen is licensed under CC BY 2.0
• Bruarfoss Waterfall in winter, Iceland by Diana Robinson is licensed under CC BY-ND 2.0
• Elephants at Pinnawala by Jan Arendtsz is licensed under CC BY-NC 2.0