A Data Driven ETL Test Framework – SQLSat Madison

SQL Saturday Madison March 29 th , 2014

Upload: terry-bunio


TRANSCRIPT

Page 1: A data driven etl test framework sqlsat madison

SQL Saturday Madison – March 29th, 2014

Page 2: A data driven etl test framework sqlsat madison

Who Am I?

• Terry Bunio

• Database Administrator

- Oracle; SQL Server 6, 6.5, 7, 2000, 2005, 2012; Informix; ADABAS

• SharePoint fan

• Data Modeler/Architect

- Investors Group, LPL Financial, Manitoba Blue Cross, Assante Financial, CI Funds, Mackenzie Financial

- Normalized and Dimensional

• Agilist

- Innovation Gamer, Team Member, SQL Developer, Test Writer, Sticky Sticker, Project Manager, PMO on SAP Implementation

- Currently all of these at the same time on an Agile SAP implementation for an insurance company

Page 10: A data driven etl test framework sqlsat madison

Protegra

• In Business 13+ years

• Offices in Winnipeg and San Diego

• Agile and Lean key to our practices

• 70+ consultants

• Software Development and Business Performance Consulting

• Microsoft Gold Partner

• Culture and Community are cornerstones of the company

Page 12: A data driven etl test framework sqlsat madison

Our Mission

Page 13: A data driven etl test framework sqlsat madison

Learning Objectives

• Why do we automate tests?

• Why is Automated Database Testing Difficult?

• What was the problem?

• What is tSQLt and SQLtest?

• How did we create a Data Driven ETL Test Framework?

Page 14: A data driven etl test framework sqlsat madison

Agile

Page 16: A data driven etl test framework sqlsat madison

Waterfall

Page 17: A data driven etl test framework sqlsat madison

Agile

Page 18: A data driven etl test framework sqlsat madison

Waterfall versus Agile

Page 19: A data driven etl test framework sqlsat madison

Waterfall

• Do as much as you can before you fall off that cliff, because it is difficult to go back up

- Thorough Business Case

- Detailed Requirements

- Detailed Documentation

- Extensive Construction Phase

- Extensive Test Phase

- Critical Deployment Phase

Page 21: A data driven etl test framework sqlsat madison

Agile

• Do as little as you can before you proceed, because we want to deliver ASAP and are going to have another cycle anyway…

- Lean Start-up Business Case

- Minimal Requirements (some in the form of Tests)

- Minimal Design

- Minimal Construction

- Minimal Test

- Minimal Deployment

Page 23: A data driven etl test framework sqlsat madison

Automation

• We usually try to automate tasks that:

- Take a lot of time

- Need to be done frequently

- Are error prone

- Require expertise that can be captured and duplicated

Page 24: A data driven etl test framework sqlsat madison

What is more awesome than a robot Ninja?

Page 25: A data driven etl test framework sqlsat madison

Agile Automation

• In Agile, the tasks that lend themselves to automation are:

- Testing – Automated Testing

- Deployment – Continuous Integration

• This is primarily due to:

- The number of times they need to be run

- The potential for people to introduce errors

- The ability to test and catch errors early

- The ease of automation: the ease of transferring human logic into software

Page 26: A data driven etl test framework sqlsat madison

Agile

Page 27: A data driven etl test framework sqlsat madison

Dangers of Automated Testing

Page 28: A data driven etl test framework sqlsat madison

Dangers of automated testing

• Automated testing can provide great benefit, but you need to ensure that it is flexible, nimble, and easy to change and adapt

• Consider how you can create your tests to make them as easy as possible to maintain and adapt to change

• If you design for this, the chances are your automated tests will be more valuable and will evolve with the software

- Otherwise some may deem them ‘too expensive’ to modify and enhance, and they will be left behind

Page 29: A data driven etl test framework sqlsat madison

Automated Database Testing

Page 30: A data driven etl test framework sqlsat madison

Automated Database Testing

• Unit Testing databases is hard

- Requires modification of the data in the database

- Complex Querying of Multiple Tables

- Multiple types of objects

• Views

• Functions

• Stored Procedures

• Triggers

- Many types of Database Technologies

- Need to test structure (schema), state (data), and process (Stored Procedures and ETL processes)

Page 31: A data driven etl test framework sqlsat madison

Automated Database Testing

• Lots of existing frameworks out there for automated testing in programming languages

- Java – JUnit

- .NET – NUnit

- FitNesse

- Etc…

• Although you could use these frameworks to perform automated database testing, it certainly isn’t easy to adapt them

- Refer to the lion-on-ice-floe cartoon

Page 32: A data driven etl test framework sqlsat madison

What we need..

• Is an Automated Database testing framework that:

- Doesn’t require another programming language to run

- Handles the complexity of adding and deleting data for tests

- Provides additional functionality that makes database testing easy

• For example, a table-to-table comparison test or the ability to create fake data and tables

- Makes it easy to manage the large number of tests that can be created in a database

- Allows us to create database tests on:

• Structure

• Data

• Metadata

Page 33: A data driven etl test framework sqlsat madison

The Project

• Agile Data Warehouse project for SAP Implementation

• Health Benefits Provider

• I’d worked on multiple Data Warehouse projects with little or no test automation in the past

- The #1 item from those retrospectives was that we would create automated tests next time for the ETL process

Page 34: A data driven etl test framework sqlsat madison

The Project

• Operational Data Store and Data Warehouse

• SQL Server 2012

- SSIS for ETL process

• We also used the CDC functionality

- SSRS for reporting, with SharePoint as the report portal

• SharePoint Foundation 2010

• 400+ tables

• 2.5 terabytes of data

- Just for the initial load

• 3640 columns in the Operational Data Store

• 2520 columns in the Data Warehouse

Page 35: A data driven etl test framework sqlsat madison

The quest

Page 36: A data driven etl test framework sqlsat madison

Results

• We reviewed several candidates and none of them really fit what we required

• We looked at:

- JUnit

- DbUnit

- FitNesse

- Other SQL-specific ones

• Until…

Page 38: A data driven etl test framework sqlsat madison

tSQLt

Page 39: A data driven etl test framework sqlsat madison

tSQLt

• tSQLt is an open source database unit testing framework for Microsoft SQL Server. tSQLt is compatible with SQL Server 2005 (Service Pack 2 required) and above, on all editions.

- Currently using it on SQL Server 2012

- No issues

Page 40: A data driven etl test framework sqlsat madison

tSQLt

• Tests are automatically run within transactions – this keeps tests independent and reduces any cleanup work you need

• Tests can be grouped together within a schema – allowing you to organize your tests and use common setup methods

- Highly recommended

• Output can be generated in plain text or XML – making it easier to integrate with a continuous integration tool

• Provides the ability to fake tables and views and to create stored procedure spies – allows you to be notified when developers aren’t following your standards!

Page 43: A data driven etl test framework sqlsat madison

tSQLt Creation and Execution Commands

• NewTestClass

- tSQLt.NewTestClass creates a new test class. A test class is simply a schema where the user can create test case procedures and any other related objects.

• DropClass

- tSQLt.DropClass drops a schema and all objects belonging to it. If the schema does not exist, nothing happens.

• RunAll

- tSQLt.RunAll executes all tests in all test classes created with tSQLt.NewTestClass in the current database. If a test class schema contains a stored procedure called SetUp, it is executed before each test case in that class.
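Taken together, a minimal usage sketch (the class name ETLTests and the test name here are hypothetical, not from the talk):

```sql
-- Create a test class (a schema) to hold related test cases
EXEC tSQLt.NewTestClass 'ETLTests';
GO
-- A test case is a stored procedure in that schema whose name starts with 'test'
CREATE PROCEDURE ETLTests.[test something important]
AS
BEGIN
    EXEC tSQLt.Fail 'Not implemented yet';
END;
GO
-- Execute every test in every test class in the current database
EXEC tSQLt.RunAll;
GO
-- Remove the class and everything in it when no longer needed
EXEC tSQLt.DropClass 'ETLTests';
```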

Page 44: A data driven etl test framework sqlsat madison

tSQLt Creation and Execution Commands

• Run

tSQLt.Run is a flexible procedure allowing three different ways of executing test cases:

1. Providing a test class name executes all tests in that test class. If a SetUp stored procedure exists in that test class, it is executed before each test.

2. Providing a test case name executes that single test.

3. Providing no parameter value executes tSQLt.Run the same way the previous call to tSQLt.Run was made when a parameter was provided.
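The three invocation styles look like this (the class and test names are illustrative):

```sql
-- 1. Run all tests in one test class (SetUp, if present, runs before each)
EXEC tSQLt.Run 'ETLTests';

-- 2. Run a single test case
EXEC tSQLt.Run 'ETLTests.[test row counts match]';

-- 3. Repeat whatever the previous parameterized call to tSQLt.Run did
EXEC tSQLt.Run;
```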

Page 45: A data driven etl test framework sqlsat madison

tSQLt additional commands

• ApplyConstraint

- We want to be able to test constraints individually. We can use FakeTable to remove all the constraints on a table, and ApplyConstraint to add back in the one we want to test.

- ApplyConstraint in combination with FakeTable allows constraints to be tested in isolation from other constraints on a table.

• ApplyTrigger

- We want to be able to test triggers individually. We can use FakeTable to remove all the constraints and triggers from a table, and ApplyTrigger to add back in the one we want to test.

- ApplyTrigger in combination with FakeTable allows triggers to be tested in isolation from constraints and other triggers on a table.
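A sketch of testing one constraint in isolation; the table, column, and constraint names are made up, and tSQLt.ExpectException assumes a tSQLt version recent enough to include it:

```sql
CREATE PROCEDURE ETLTests.[test CK_ClaimStatus rejects unknown codes]
AS
BEGIN
    -- Replace dbo.Claim with a constraint-free copy...
    EXEC tSQLt.FakeTable 'dbo.Claim';
    -- ...then re-apply only the constraint under test
    EXEC tSQLt.ApplyConstraint 'dbo.Claim', 'CK_ClaimStatus';

    -- The insert below should now fail because of CK_ClaimStatus alone
    EXEC tSQLt.ExpectException;
    INSERT INTO dbo.Claim (ClaimStatus) VALUES ('??');
END;
```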

Page 46: A data driven etl test framework sqlsat madison

tSQLt Additional Commands

• FakeTable

- We want to keep our test cases focused and do not want to insert data that is irrelevant to the current test. However, table and column constraints can make this difficult.

- FakeTable allows tests to be written in isolation from the constraints on a table. FakeTable creates an empty, constraint-free version of the table in place of the specified table. Therefore any statements that access the table during the execution of the test case are actually working against the fake table with no constraints. When the test case completes, the original table is put back in place because of the rollback that tSQLt performs at the end of each test case.

Page 47: A data driven etl test framework sqlsat madison

tSQLt Demo

• Installation procedures

• Simple Test Demonstration

• Simple tSQLt Demonstration

Page 48: A data driven etl test framework sqlsat madison

tSQLt installation procedures

• Download tSQLt from tSQLt.org.

• Unzip the file to a location on your hard drive.

• Make sure CLRs are enabled on your development server by running the following SQL:

- EXEC sp_configure 'clr enabled', 1;

- RECONFIGURE;

• Execute the Example.sql file from the zip file to create an example database (tSQLt_Example) with tSQLt and test cases.

Page 49: A data driven etl test framework sqlsat madison

tSQLt installation

• Your database must be set to TRUSTWORTHY for tSQLt to run. Execute the following script in your development database:

- DECLARE @cmd NVARCHAR(MAX); SET @cmd = 'ALTER DATABASE ' + QUOTENAME(DB_NAME()) + ' SET TRUSTWORTHY ON;'; EXEC(@cmd);

• Execute the tSQLt.class.sql script (included in the zip file) in your development database.

Page 50: A data driven etl test framework sqlsat madison

Test Demonstration

Page 51: A data driven etl test framework sqlsat madison

Test Demonstration

• Old School Testing

- Open Enterprise Manager

- Open /SDEC12/SDEC solution

- Open solution explorer

- 1 testStatus1.sql

- 2 testStatusAll.sql

• Boring…

Page 52: A data driven etl test framework sqlsat madison

Test Demonstration

• testTable

- Traditional test

• testTableNew

- tSQLt test

• AssertEqualsTable test success

• Insert extra record

• AssertEqualsTable test failure
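An AssertEqualsTable test along the lines of this demo might look like the following sketch (the Status table and its columns are hypothetical, not the demo's actual script):

```sql
CREATE PROCEDURE ETLTests.[test status codes load correctly]
AS
BEGIN
    -- Arrange: fake the target table and load one known row
    EXEC tSQLt.FakeTable 'dbo.Status';
    INSERT INTO dbo.Status (StatusCode, StatusDesc) VALUES ('A', 'Active');

    -- Build the expected result set with the same structure
    SELECT StatusCode, StatusDesc
    INTO #Expected
    FROM dbo.Status
    WHERE 1 = 0;                       -- copy the structure only, no rows
    INSERT INTO #Expected VALUES ('A', 'Active');

    -- Assert: compare expected versus actual, row for row
    EXEC tSQLt.AssertEqualsTable '#Expected', 'dbo.Status';
END;
```

Inserting an extra record into either table before the assert produces the failure case shown in the demo.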

Page 53: A data driven etl test framework sqlsat madison

tSQLt Demo

• 4 tSQLt-1.sql

Page 54: A data driven etl test framework sqlsat madison

SQLTest

Page 55: A data driven etl test framework sqlsat madison

SQLTest

• Developed by Red Gate

• SQL Test is a unit test add-in for SQL Server

Management Studio

• Part of the SQL Developer Bundle

• $369

Page 56: A data driven etl test framework sqlsat madison

SQLTest Demo

• Open SQL Test

• Test Class TstContainer_sdec12

• TestCase1 – count rows

• TestCase2 – Fake Table and Assert Empty Table

• TestCase3 – Fake Table and Assert Empty Table - Error

• TestCase4 - AssertLike

• Run All

Page 58: A data driven etl test framework sqlsat madison

SQLTest Additional Features

• Template for adding tests

• Easier graphical interface to enter and review tests

• One click to run tests

- Individually

- Schema

- Database

• Click through on TestCases to debug and modify

Page 59: A data driven etl test framework sqlsat madison

tSQLtclr Weird Error

Page 60: A data driven etl test framework sqlsat madison

tSQLtclr Weird Error

• It is possible that when you restore a database from a different server, you can get this issue when trying to load the tSQLtclr assembly

• KB link: http://support.microsoft.com/kb/918040

• The easiest solution is to ensure sa is the database owner of the database running the tSQLt tests

- EXEC sp_changedbowner 'sa'

Page 61: A data driven etl test framework sqlsat madison

ETL Data Driven Test Framework

Page 62: A data driven etl test framework sqlsat madison

ETL Test Categories

- Data completeness – validates that all data is loaded correctly. This involves a field-by-field comparison of expected versus actual results.

- Data transformation – validates that all data is transformed correctly according to specifications.

- Data quality – validates that the ETL application correctly corrects, ignores, or rejects critical data errors, substitutes default values, and reports invalid data.

- Process integrity – validates that the ETL process executes correctly, reports data errors, generates log messages, and is able to restart and recover from an abnormal termination correctly.

- Test automation – validates that the ETL testing process can be re-run at any time in an automated fashion to confirm that the process is operating as required.

Page 63: A data driven etl test framework sqlsat madison

ETL Testing Additional Value

- Performance and scalability – validates that the ETL process performs within expected time frames and that the technical architecture is scalable and extendable.

- Integration testing – validates that the ETL process functions well within the entire environment, and that other applications can interact with the ETL process and databases as required.

- Regression testing – validates that the automated tests can be utilized to confirm existing functionality continues to operate as required for each new release.

Page 64: A data driven etl test framework sqlsat madison

ETL Test Objectives

• We need an Agile Test Framework so it is easy to adapt to changes

• We can’t afford for the test framework to take material time to manage on top of the Data Models and Databases

- Data Driven

• Although SQLTest is good, you can imagine how long it would take to add tests for every field

- Not feasible

- Would also create a maintenance nightmare as the schema changes

• So what is a DBA to do?

Page 65: A data driven etl test framework sqlsat madison

ETL Test Objectives

• To create an automated ETL test framework that is Data Driven by:

- NEGATIVE TESTING – Information_schema metadata in SQL Server, to loop through the tables and fields and execute health checks

• Table level

• Have we missed tables or columns in our mapping?

- POSITIVE TESTING – Data Mapping metadata in SQL Server, to loop through all fields and validate the ETL process

• Column level

• Have we mapped tables and columns correctly?
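For the negative side, a minimal sketch of an INFORMATION_SCHEMA-driven health check that flags empty tables (the exact checks in the talk's framework are not shown; this illustrates the looping idea only):

```sql
-- Build one dynamic existence check per base table and flag any that
-- loaded no rows at all
DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql = @sql + N'
IF NOT EXISTS (SELECT 1 FROM ' + QUOTENAME(TABLE_SCHEMA) + N'.'
                               + QUOTENAME(TABLE_NAME) + N')
    PRINT ''Empty table: ' + TABLE_SCHEMA + N'.' + TABLE_NAME + N''';'
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE';

EXEC sp_executesql @sql;
```

The same pattern applied to INFORMATION_SCHEMA.COLUMNS drives the column-level checks.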

Page 66: A data driven etl test framework sqlsat madison

ETL Smells

Page 67: A data driven etl test framework sqlsat madison

ETL Smells

• Larger smells that indicate problems in the ETL

- Empty tables

- Empty columns

- Columns with all Nulls

- Columns with only 1 value

- Numeric columns with all zeros

• Usually detected by negative testing

Page 68: A data driven etl test framework sqlsat madison

Negative Tests

• TstTableCount: Compares record counts between source data and target data. This is done on a table-to-table basis, starting from the target table and comparing to the associated source table or tables.

• TstTableColumnDistinct: Compares counts of distinct values in columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.

• TstTableColumnNull: Generates a report of all columns where the contents of the field are all null. This typically highlights situations where the column was not assigned in the ETL process.
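A TstTableCount-style check can be generated from the mapping metadata roughly as follows; the snake_case column names of table_data_mapping are assumed from the slide, and the fixed dbo schema is a simplification:

```sql
-- Emit one count comparison per table-to-table mapping row
DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql = @sql + N'
IF (SELECT COUNT(*) FROM ' + QUOTENAME(source_database) + N'.dbo.'
                           + QUOTENAME(source_table) + N')
   <> (SELECT COUNT(*) FROM ' + QUOTENAME(target_database) + N'.dbo.'
                              + QUOTENAME(target_table) + N')
    EXEC tSQLt.Fail ''Row count mismatch on ' + target_table + N''';'
FROM dbo.table_data_mapping;

EXEC sp_executesql @sql;
```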

Page 69: A data driven etl test framework sqlsat madison

ETL Aromas

Page 70: A data driven etl test framework sqlsat madison

Positive Tests

• TstColumnDataMapping: Compares columns directly assigned from a source column, on a field-by-field basis, for specified rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.

• TstColumnConstantMapping: Compares columns directly assigned from a constant, on a field-by-field basis, for specified rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
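Written by hand, one TstColumnDataMapping-style check amounts to something like this, joining source rows to target rows through a cross_reference table that converts old PK values to new ones (the Client tables and LastName column are purely illustrative):

```sql
-- Fail if any row's directly-assigned column differs between source and target
IF EXISTS (
    SELECT 1
    FROM SourceDb.dbo.Client AS s
    JOIN TargetDb.dbo.cross_reference AS x
      ON  x.source_system_table = 'Client'
      AND x.source_system_val  = s.ClientId
    JOIN TargetDb.dbo.Client AS t
      ON  t.ClientId = x.target_id
    WHERE ISNULL(s.LastName, '') <> ISNULL(t.LastName, '')
)
    EXEC tSQLt.Fail 'LastName was not mapped correctly for one or more rows';
```

The framework's value is that loops over column_data_mapping generate this check for every "=" column, instead of one hand-written test per field.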

Page 71: A data driven etl test framework sqlsat madison

Positive Tests

• TstColumnFKDataMapping: Compares whether transformed columns in source and target are both not null. This is a key indicator of whether Foreign Key values for related tables could not be located in the target database.

• TstInvalidColumnDataMapping: Tests that an invalid value results in the value being assigned a default value or Null. This record is added to the staging table to test the process.

Page 72: A data driven etl test framework sqlsat madison

Advanced Negative Tests – not yet implemented

• TstInvalidParentFKColumn: Tests that an invalid parent FK value results in the records being logged and bypassed. This record is added to the staging table to test the process.

• TstInvalidFKColumn: Tests that an invalid FK value results in the value being assigned a default value or Null. This record is added to the staging table to test the process.

Page 73: A data driven etl test framework sqlsat madison

Data Driven

Page 74: A data driven etl test framework sqlsat madison

Data Mapping Framework Design

• We can create two tables that, when used in conjunction with Information_schema, can generate all negative tests and simple positive tests

- table_data_mapping

- column_data_mapping

• We can then create two more tables that can be used to generate more complicated positive tests

- etl_test_case

- etl_test_case_transform

• The database also requires a cross_reference table that converts old PK values to new PK values

Page 75: A data driven etl test framework sqlsat madison

ETL Test Framework Data Model

Page 76: A data driven etl test framework sqlsat madison

table_data_mapping

Column Name – Description

- Table Data Mapping ID – Surrogate Key

- Source Database – Database that the source data exists in

- Source Table – Table that the source data exists in

- Source Primary Key – Source Primary Key column name

- Target Database – Database that the data will be loaded into

- Target Table – Table that the data will be loaded into

- Target Primary Key – Target Primary Key column name

- Target Table Type – Type of table. Valid values are “Main” (Main Table) and “Reference” (Reference Table)

Purpose: Defines Table to Table ETL Mapping
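One possible DDL for this table; the snake_case column names and data types are inferred from the slide, since the actual definition is not shown:

```sql
CREATE TABLE dbo.table_data_mapping (
    table_data_mapping_id INT IDENTITY(1, 1) PRIMARY KEY,  -- surrogate key
    source_database       SYSNAME     NOT NULL,  -- database the source data exists in
    source_table          SYSNAME     NOT NULL,
    source_primary_key    SYSNAME     NOT NULL,  -- source PK column name
    target_database       SYSNAME     NOT NULL,  -- database the data loads into
    target_table          SYSNAME     NOT NULL,
    target_primary_key    SYSNAME     NOT NULL,  -- target PK column name
    target_table_type     VARCHAR(20) NOT NULL
        CONSTRAINT ck_target_table_type
        CHECK (target_table_type IN ('Main', 'Reference'))
);
```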

Page 77: A data driven etl test framework sqlsat madison

column_data_mapping

Column Name – Description

- Column Data Mapping ID – Surrogate Key

- Table Data Mapping ID – ID of the related Table Data Mapping record

- Source Column – Column that the source data exists in

- Source Constant – Constant value that is to be assigned/expected in the target column

- Target Column – Column that the data will be loaded into

- Data Mapping Type – Formula for the column assignment. Typical examples: “=” (Assignment), “Constant” (Literal Assignment), “Transform” (Column Transformation), “Generated” (Column Generated)

Purpose: Defines Column to Column ETL Mapping

Page 78: A data driven etl test framework sqlsat madison

Data Mapping Complex Framework Design

• We need two additional tables that can be used to generate the more complex test cases

• These tables are used to store primary key values that select which rows we will use as test cases

- We can then choose to test 1 row or 1,000 rows

Page 79: A data driven etl test framework sqlsat madison

etl_test_case

Column Name – Description

- ETL Test Case ID – Surrogate Key

- Table Data Mapping ID – ID of the related Table Data Mapping record

- Target Primary Key Value – Target Primary Key value for the test

- Successful Test Ind – Indicates if the test was successful

Purpose: Defines row to row ETL Mapping for a test case

Page 80: A data driven etl test framework sqlsat madison

etl_test_case_transform

Column Name – Description

- ETL Test Case Transform ID – Surrogate Key

- ETL Test Case ID – ID of the related Test Case record

- Target Column – Target column name

- Expected Value – Expected value in the target column

- Successful Test Ind – Indicates if the test was successful

Purpose: Defines the expected target column value for each test case defined in etl_test_case

Page 81: A data driven etl test framework sqlsat madison

One more table… cross_reference

Column Name – Description

- cross_reference_id – Surrogate key

- source_system_code – Source system name

- source_system_table – Table the row originated from in the source system

- source_system_val – Primary key value from the source system

- target_table_name – Table the row was stored to in the target system

- target_id – Primary key value from the target system

- updated_date – Date the row was updated

- modified_by – User that modified the row

- deleted_ind – Indicates if the row is deleted

Page 82: A data driven etl test framework sqlsat madison

Data Mapping table in Excel that is used to populate the tables

Column Name – Description

- Target_db – Target database that the target column exists in

- Target_table – Target table that the target column exists in

- Target_table_type – Target table type. Values are Main or Reference

- Target_column – Target column that is being loaded by the ETL process

- Source_db – Source database that the source column is being extracted from

- Source_table – Source table that the source column is being extracted from

- Source_column – Source column that is being extracted

- Source_PK – Primary key of the source table that the source column exists in. This field is used in conjunction with the cross_reference table to determine the target_pk value so equivalent records can be compared between the source and target databases

- Mapping_type – Type of data mapping for the column. The values are: Constant / “=” / Not Populated / Transform / Generated

- Data_type – Data type of the target column. Valid values are: integer / bigint / nvarchar / datetime / indicator

- Nullable_ind – Indicates whether the target column is nullable

- Invalid_test – Contains the values that can be tested to cause exceptions

Page 83: A data driven etl test framework sqlsat madison

The Goods

Page 84: A data driven etl test framework sqlsat madison

TstTableCount Demo – 1 minute

• TstColumnCountLoop

• TstColumnCount

Page 85: A data driven etl test framework sqlsat madison

TstTableColumnNull Demo – 15 minutes

• TstColumnColumnNullLoop

• TstColumnColumnNull

Page 86: A data driven etl test framework sqlsat madison

TstTableColumnDistinct Demo – 20 minutes

• TstColumnColumnDistinctLoop

• TstColumnColumnDistinct

• Set ansi_warnings off

Page 87: A data driven etl test framework sqlsat madison

TstColumnDataMapping Demo – 1 minute

• TstColumnDataMappingLoop

• TstColumnDataMapping

Page 88: A data driven etl test framework sqlsat madison

TstColumnConstantMapping Demo – 1 minute

• TstColumnConstantMappingLoop

• TstColumnConstantMapping

Page 89: A data driven etl test framework sqlsat madison

Questions?