TRANSCRIPT
Data Vault Automation at
de Bijenkorf PRESENTED BY
ROB WINTERS ANDREI SCORUS
Presentation agenda ◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions
About the presenters Rob Winters
Head of Data Technology, de Bijenkorf
Project role: ◦ Project Lead ◦ Systems architect and administrator
◦ Data modeler
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
◦ Joined project September 2014
Andrei Scorus
BI Consultant, Incentro
Project role: ◦ Main ETL Developer ◦ Modeling support
◦ Source system expert
◦ Joined project November 2014
Project objectives
◦ Information requirements
  ◦ Have one place as the source for all reports
  ◦ Security and privacy
  ◦ Information management
  ◦ Integrate with production
◦ Non-functional requirements
  ◦ System quality
  ◦ Extensibility
  ◦ Scalability
  ◦ Maintainability
  ◦ Security
  ◦ Flexibility
  ◦ Low cost
Technical Requirements
• One environment to quickly generate customer insights
• Then feed those insights back to production
• Then measure the impact of those changes in near real time
Source system landscape
Source Type | # Sources | Examples                    | Load Frequency | Data Structure
Oracle DB   | 2         | Virgo ERP                   | 2x/hour        | Partial 3NF
MySQL       | 3         | Product DB, Web Orders, DWH | 10x/hour       | 3NF (Web Orders), improperly normalized
Event bus   | 1         | Web/email events            | 1x/minute      | Tab-delimited with JSON fields
Webhook     | 1         | Transactional emails        | 1x/minute      | JSON
REST APIs   | 5+        | GA, DotMailer               | 1x/hour-1x/day | JSON
SOAP APIs   | 5+        | AdWords, Pricing            | 1x/day         | XML
Architectural overview Tools
AWS ◦ S3 ◦ Kinesis ◦ ElastiCache ◦ Elastic Beanstalk ◦ EC2 ◦ DynamoDB
Open Source ◦ Snowplow Event Tracker ◦ Rundeck Scheduler ◦ Jenkins Continuous Integration ◦ Pentaho PDI
Other ◦ HP Vertica ◦ Tableau ◦ GitHub ◦ RStudio Server
DWH internal architecture
• Traditional three-tier DWH
• ODS generated automatically from staging
• Ops mart reflects data in original source form
  • Helps offload queries from source systems
• Business marts materialized exclusively from vault
Bijenkorf Data Vault overview Data volumes
• ~1 TB base volume
• 10-12 GB daily
• ~250 source tables
Aligned to Data Vault 2.0
• Hash keys
• Hashes used for CDC
• Parallel loading
• Maximum utilization of available resources
• Data loaded unchanged into the vault
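The slide only names the DV 2.0 techniques; as a hedged sketch of how hash keys and CDC hashes are typically produced (the MD5 algorithm, the "||" delimiter, and the normalization rules here are assumptions, not confirmed details of the Bijenkorf implementation):

```python
import hashlib

def hash_key(*business_keys):
    """Build a deterministic hash key from one or more business keys.
    Keys are trimmed, upper-cased, and joined with a delimiter before
    hashing, so independent parallel loaders produce identical keys
    for the same business entity (assumed convention)."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def row_hash(attributes):
    """Hash all descriptive attributes of a satellite row for change
    data capture: a new satellite row is written only when this hash
    differs from the latest stored hash for the same hash key."""
    payload = "||".join(str(attributes[k]) for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Because both hashes depend only on the row itself, hub, link, and satellite loads can run in parallel without sequence generators coordinating across jobs.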
Some statistics
18 hubs • 34 loading scripts
27 links • 43 loading scripts
39 satellites • 43 loading scripts
13 reference tables • 1 script per table
Model contains
• Sales transactions
• Customer and corporate locations
• Customers
• Products
• Payment methods
• E-mail
• Phone
• Product grouping
• Campaigns
• De Bijenkorf card
• Social media
Excluded from the vault ◦ Event streams
◦ Server logs ◦ Unstructured data
Deep dive: Transactions in DV • Transactions
Deep dive: Customers in DV • Same as link on customer
Challenges encountered during data modeling

Challenge: Source issues
Issue details:
• Source systems and original data unavailable for most information
• Data often transformed 2-4 times before access was available
• Business keys (e.g. SKU) typically replaced with sequences
Resolution:
• Business keys rebuilt in staging prior to vault loading

Challenge: Modeling returns
Issue details:
• Retail returns can appear in ERP in 1-3 ways across multiple tables with inconsistent keys
• Online returns appear as a state change on the original transaction and may or may not appear in ERP
Resolution:
• Original model showed sale state on the line-item satellite
• Revised model recorded "negative sale" transactions and used a new link to connect to the original sale when possible

Challenge: Fragmented knowledge
Issue details:
• Information about the systems was held by multiple people
• Documentation was out of date
Resolution:
• Talking to as many people as possible and testing hypotheses on the data
Targeted benefits of DWH automation

Objective: Speed of development
• Integration of new sources or data from existing sources takes 1-2 steps
• Adding a new vault dependency takes one step

Objective: Simplicity
• Five jobs handle all ETL processes across the DWH

Objective: Traceability
• Every record/source file is traced in the database and every row automatically identified by source file in the ODS

Objective: Code simplification
• Replaced most common key definitions with dynamic variable replacement

Objective: File management
• Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date
• Entire source systems, periods, etc. can be replayed in minutes
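The archive layout the slide describes (sorted by source, table, and date) amounts to a deterministic S3 key scheme; replaying a system or a period is then just a prefix listing. A minimal sketch, with the function name and exact key format being assumptions for illustration:

```python
from datetime import date

def archive_key(source, table, file_name, load_date):
    """Build the S3 object key under which a source file is archived.
    Partitioning by source system, table, and load date means an
    entire system or period can be replayed by listing one prefix."""
    return "archive/{}/{}/{:%Y/%m/%d}/{}".format(
        source, table, load_date, file_name)
```

For example, replaying all OMS customer files for March 2015 would mean listing everything under `archive/OMS/customer/2015/03/`.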
Source loading automation
◦ Design of the loader focused on process abstraction, traceability, and minimization of "moving parts"
◦ Final process consisted of two base jobs working in tandem: one generating incremental extracts from source systems, one loading flat files from all sources to staging tables
◦ Replication was desired but rejected due to limited access to source systems

Workflow of source integration:
• Source tables duplicated in staging with the addition of loadTs and sourceFile columns
• Metadata for the source file added
• Loader automatically generates the ODS and begins tracking source files for duplication and data quality
• Query generator automatically executes a full duplication on first execution and incrementals afterward
CREATE TABLE stg_oms.customer (
    customerId      int
  , customerName    varchar(500)
  , customerAddress varchar(5000)
  , loadTs          timestamp NOT NULL
  , sourceFile      varchar(255) NOT NULL
)
ORDER BY customerId
PARTITION BY date(loadTs);

INSERT INTO meta.source_to_stg_mapping
    (targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
VALUES
    ('stg_oms', 'customer', 'OMS', 'OMS_CUSTOMER', 'TAB', 'NULL');
Example: Adding an additional table from an existing source
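The query generator's full-then-incremental behavior can be sketched as follows. This is a hypothetical illustration: the function, its parameters, and the use of a timestamp column for incremental filtering are assumptions, not the actual Bijenkorf generator.

```python
def extract_query(schema, table, ts_column, last_load_ts=None):
    """Generate the extract SQL for one source table: a full dump on
    the first run (no recorded load timestamp), an incremental
    extract of changed rows on every run after that."""
    base = "SELECT *, now() AS loadTs FROM {}.{}".format(schema, table)
    if last_load_ts is None:
        # First execution: full duplication of the source table.
        return base
    # Subsequent executions: only rows changed since the last load.
    return "{} WHERE {} > '{}'".format(base, ts_column, last_load_ts)
```

Driving this from the `meta.source_to_stg_mapping` metadata is what lets one generic job serve every source table without per-table code.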
Vault loading automation

All staging tables checked for changes
• New sources automatically added
• Last change epoch based on load stamps, advanced each time all dependencies execute successfully

List of dependent vault loads identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized

Loads planned in hub, link, sat order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order

Loads executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages

◦ Loader is fully metadata driven with a focus on horizontal scalability and management simplicity
◦ To support speed of development and performance, variable-driven SQL templates are used throughout
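The hub-link-satellite ordering with per-wave parallelism can be sketched as a simple wave planner. A minimal sketch under assumed names; the real queueing was dynamic and metadata driven, which this deliberately simplifies:

```python
from collections import defaultdict

# Hub -> link -> satellite load order, as Data Vault loading requires:
# links reference hub keys, satellites reference hub or link keys.
ORDER = {"hub": 0, "link": 1, "sat": 2}

def plan_loads(jobs):
    """Group vault load jobs into waves: all hub loads first, then all
    link loads, then all satellite loads.  Jobs inside one wave can run
    in parallel across tables; the waves themselves run serially."""
    waves = defaultdict(list)
    for job in jobs:
        waves[ORDER[job["type"]]].append(job["name"])
    return [sorted(waves[level]) for level in sorted(waves)]
```

Because hash keys are computed, not looked up, nothing inside a wave needs to wait on anything else in that wave.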
Design goals for mart loading automation

Requirement                  | Solution                                      | Benefit
Simple, standardized models  | Metadata-driven Pentaho PDI                   | Easy development using parameters and variables
Easily extensible            | Plugin framework                              | Rapid integration of new functionality
Rapid new job development    | Recycle standardized jobs and transformations | Limited moving parts, easy modification
Low administration overhead  | Leverage built-in logging and tracking        | Easily integrated mart loading reporting with other ETL reports
Information mart automation flow

Retrieve commands
• Each dimension and fact is processed independently

Get dependencies
• Based on the defined transformation, get all related vault tables: links, satellites, or hubs

Retrieve changed data
• From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension
• Store the data in the database until further processing

Execute transformations
• Multiple Pentaho transformations can be processed per command using the data captured in previous steps

Maintenance
• Logging happens throughout the whole process
• Cleanup after all commands have been processed
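The "retrieve changed data" step above reduces to one question: which hash keys in any dependent vault table have a load timestamp newer than the mart's last update? A hedged sketch (function name and in-memory row shape are illustrative stand-ins for database reads):

```python
def changed_keys(vault_rows, last_update):
    """Collect the hash keys that changed since the mart's last update.
    `vault_rows` maps each related vault table (link, satellite, or
    hub) to its rows as (hash_key, load_ts) pairs."""
    keys = set()
    for rows in vault_rows.values():
        keys.update(hk for hk, load_ts in rows if load_ts > last_update)
    return keys
```

Only facts and dimensions touching these keys are then rebuilt, which keeps incremental mart loads small regardless of vault size.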
Primary uses of Bijenkorf DWH

Customer analysis
• Provided first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed for segmentation of customers based on a combination of in-store and online activity

Personalization
• DV drives recommendation engine and customer recommendations (updated nightly)
• Data pipeline supports near-real-time updating of customer recommendations based on web activity

Business intelligence
• DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI
Biggest drivers of success

AWS infrastructure
• Cost: entire infrastructure for less than one server in the data center
• Toolset: most services available off the shelf, minimizing administration
• Freedom: no dependency on IT for development support
• Scalability: systems automatically scaled to match DWH demands

Automation
• Speed: enormous time savings after initial investment
• Simplicity: able to run and monitor 40k+ queries per day with minimal effort
• Auditability: enforced tracking and archiving without developer involvement

PDI framework
• Ease of use: adding new commands takes at most 45 minutes
• Agile: building the framework took 1 day
• Low profile: average memory usage of 250 MB
Biggest mistakes along the way

Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation/models, which were rarely accurate
• Current users of sources should have been engaged earlier to explain undocumented caveats

Late utilization of templates and variables
• Variables were utilized late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates will significantly reduce development time in the mid/long run

Aggressive overextension of resources
• We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
• We have shifted focus to continuous release of new information rather than waiting for completeness
Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cautious with design automation!
◦ Automation can enormously simplify and accelerate data warehousing. Don't be afraid to roll your own
◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
◦ Cloud-based architecture built on column-store DBs is extremely scalable, cheap, and highly performant
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!
Rob Winters [email protected]
Andrei Scorus [email protected]