Data Warehouse Test Effectiveness: It’s All about the Planning! Assuring Data Warehouse Content, Structure and Quality. Wayne Yaddow, 2013

DESCRIPTION

Data warehouse testing is more effective when it has been planned according to a well-structured process. These charts pave the way to successful testing results.

TRANSCRIPT

Data Warehouse Test Effectiveness: It’s All about the Planning!

Assuring Data Warehouse Content, Structure and Quality

Wayne Yaddow, 2013

Wayne Yaddow, 2013 2

Agenda

Challenges of DWH testing
Planning for DWH tests
Tester skills for DWH testing
Basic ETL verifications
Defects you can expect to find
Testing tools identified

Wayne Yaddow, 2013 3

DWH – Definition

A data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources.

Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.

Source: Wikipedia.org, 2013

Wayne Yaddow, 2013 4

Got DWH Quality Issues?

Reprinted with permission from Informatica Corp., 2013

Wayne Yaddow, 2013 5

DWH – Typical Structure: Sources and DWH Targets

Wayne Yaddow, 2013 6

The Data Testing Process

Wayne Yaddow, 2013 7

Plan QA for typical DWH phases

Data Model Example

Wayne Yaddow, 2013 8

Source to Target Mapping Example

Wayne Yaddow, 2013 9

Plan QA for DWH Lifecycle
Primary goals for verification:

– Data completeness

– Data transformations

– Data quality

– Performance and scalability

– Integration testing

– User-acceptance testing

– Regression testing
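As an illustration of the first goal in this list, data completeness is commonly verified by reconciling row counts and key values between source and target. A minimal SQL sketch, assuming hypothetical tables SRC_CUSTOMER (source) and DWH_CUSTOMER (target) joined on CUSTOMER_ID:

-- Row-count reconciliation: both counts should match after the load
SELECT COUNT(*) AS src_rows FROM SRC_CUSTOMER;
SELECT COUNT(*) AS tgt_rows FROM DWH_CUSTOMER;

-- Key-level completeness: source keys that never reached the target
SELECT s.CUSTOMER_ID
FROM SRC_CUSTOMER s
LEFT JOIN DWH_CUSTOMER t ON t.CUSTOMER_ID = s.CUSTOMER_ID
WHERE t.CUSTOMER_ID IS NULL;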

Wayne Yaddow, 2013 10

Wayne Yaddow, 2013 11

Challenges for DWH Testers (1)

1. Often inadequate ETL design documents
2. Source table field values unexpectedly null
3. Excessive ETL errors discovered after entry to QA
4. Source data does not meet table mapping specs (e.g., dirty data)
5. Source to target mappings:
   1. Often not reviewed by all stakeholders
   2. Not consistently maintained through dev lifecycle
   3. Therefore, in error

Wayne Yaddow, 2013 12

Challenges for DWH Testers (2)

6. Data models not maintained
7. Target data does not meet mapping specifications
8. Duplicate field values when defined to be DISTINCT
9. ETL SQL / transformation errors that lead to missing rows and invalid field values
10. Constraint violations in source data
11. Table keys are incorrect for important RDB linkages

Wayne Yaddow, 2013 13

Challenges for DWH Testers (3)

12. Huge source data volumes and data types
13. Source data quality that must be profiled before loading to DWH
14. Redundancy, duplicate source data
15. Many source data records to be rejected
16. ETL logs w/ messages to be acted upon
17. Source field values may be missing where they should always be present

Wayne Yaddow, 2013 14

Challenges for DWH Testers (4)

19. SMEs and business rules may not be available
20. Since data ETLs must often pass through multiple phases, transaction-level traceability will be difficult to attain in a data warehouse
21. The data warehouse will be a strategic enterprise resource and heavily relied upon

Wayne Yaddow, 2013 15

Planning the DWH QA Strategy

Carefully review:

– Requirements documentation
– Data models for source and target schemas
– Source to target mappings
– ETL / stored proc design & logic
– QA deployment tasks / steps
– Required QA tools

Wayne Yaddow, 2013 16

Planning for DWH QA (1)

Data integration planning (Data model, LLD’s)

1. Gain an understanding of the data to be reported by the application (e.g., through profiling) and the tables on which each user report will be based

2. Review, understand the data model – gain an understanding of keys and flows from source to target

3. Review, understand data LLDs and mappings: add, update sequences for all sources of each target table

Wayne Yaddow, 2013 17

Planning for DWH QA (2)

ETL planning and testing (source inputs & ETL design)

1. Participate in ETL design reviews

2. Gain in-depth knowledge of ETL sessions, the order of execution, constraints, and transformations

3. Participate in development ETL test case reviews

4. After ETL’s are run, use checklists for QA assessments of rejects, session failures, errors

Wayne Yaddow, 2013 18

Planning for DWH QA (3)

Assess ETL logs: session, workflow, errors

1. Review ETL workflow outputs, source to target counts

2. Verify source to target mapping docs with loaded tables using TOAD and other tools

3. After ETL runs or manual data loads, assess data in every table with focus on key fields (dirty data, incorrect formats, duplicates, etc.). Use TOAD, Excel tools. (SQL queries, filtering, etc.)
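Step 2 above can be backed with a set-difference comparison of the mapped columns; rows returned in either direction are discrepancies to investigate. A sketch in Oracle-style SQL (MINUS; use EXCEPT on SQL Server or DB2), with hypothetical staging and target names; it applies only to columns mapped straight through, since transformed columns need the transformation rule applied on the source side:

-- Source rows whose mapped values are missing or changed in the target
SELECT ORDER_ID, ORDER_DT, ORDER_AMT FROM STG_ORDERS
MINUS
SELECT ORDER_ID, ORDER_DT, ORDER_AMT FROM DWH_FACT_ORDERS;

-- Run the reverse direction as well to catch unexpected extra target rows
SELECT ORDER_ID, ORDER_DT, ORDER_AMT FROM DWH_FACT_ORDERS
MINUS
SELECT ORDER_ID, ORDER_DT, ORDER_AMT FROM STG_ORDERS;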

Wayne Yaddow, 2013 19

Planning for DWH QA (4)

GUI and report validations

1. Compare report data with target data.

2. Verify that reporting meets user expectations

Analytics test team data validation

3. Test data as it is integrated into application

4. Provide tools and tests for data validation

Wayne Yaddow, 2013 20

Valuable Books

Wayne Yaddow, 2013 21

Plan for QA Methodology & Tools

Wayne Yaddow, 2013 22

Data Profiling

Column / attribute / field profiling provides statistical measurements associated with:

– frequency distribution of data values
– number of records
– number of null (i.e., blank) values
– data types (e.g., integers, characters)
– field length
– unique values
– patterns in the data
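Before a dedicated profiling tool is available, most of these measurements can be gathered with plain SQL. A minimal sketch for one column, assuming a hypothetical staging table STG_CUSTOMER with a CUSTOMER_NAME field (LENGTH is Oracle syntax; SQL Server uses LEN):

-- Basic column profile: volume, nulls, distinct values, value lengths
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN CUSTOMER_NAME IS NULL THEN 1 ELSE 0 END) AS null_values,
       COUNT(DISTINCT CUSTOMER_NAME) AS distinct_values,
       MIN(LENGTH(CUSTOMER_NAME)) AS min_length,
       MAX(LENGTH(CUSTOMER_NAME)) AS max_length
FROM STG_CUSTOMER;

-- Frequency distribution of values
SELECT CUSTOMER_NAME, COUNT(*) AS occurrences
FROM STG_CUSTOMER
GROUP BY CUSTOMER_NAME
ORDER BY occurrences DESC;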

Wayne Yaddow, 2013 23

Identify QA skills (1)

• Understanding fundamental DWH and DB concepts
• High skill w/ SQL queries and stored procedures
• Understanding of data used by the business
• Data profiling
• Developing strategies, test plans and test cases specific to DWH and the business
• Creating effective ETL test cases / scenarios based on loading technology and business requirements

Wayne Yaddow, 2013 24

Identify QA skills (2)

• Understanding of data models, data mapping documents, ETL design and ETL coding; ability to provide feedback to designers and developers

• Experience with Oracle, SQL Server, Sybase, DB2 technology

• Informatica session troubleshooting
• Deploying DB code to databases
• Unix scripting, Autosys, Anthill, etc.
• SQL editors
• Use of Excel & MS Access for data analysis

Wayne Yaddow, 2013 25

Valuable Book

Wayne Yaddow, 2013 26

Basic ETL Verifications (1)

• Verify data mappings, source to target
• Verify that all table fields were loaded from source to staging
• Verify that keys were properly generated using the sequence generator
• Verify that not-null fields were populated
• Verify no data truncation in each field
• Verify data types and formats are as specified in the design phase
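Several of these checks translate directly into SQL. A sketch of the not-null and truncation verifications, assuming hypothetical tables STG_CUSTOMER and DWH_CUSTOMER sharing a CUSTOMER_ID key:

-- Not-null verification: should return zero for a mandatory field
SELECT COUNT(*) AS null_violations
FROM DWH_CUSTOMER
WHERE CUSTOMER_NAME IS NULL;

-- Truncation check: target values shorter than their source counterparts
SELECT s.CUSTOMER_ID, s.CUSTOMER_NAME AS src_value, t.CUSTOMER_NAME AS tgt_value
FROM STG_CUSTOMER s
JOIN DWH_CUSTOMER t ON t.CUSTOMER_ID = s.CUSTOMER_ID
WHERE LENGTH(t.CUSTOMER_NAME) < LENGTH(s.CUSTOMER_NAME);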

Wayne Yaddow, 2013 27

Basic ETL Verifications (2)

• Verify no duplicate records in target tables
• Verify transformations based on data low-level designs (LLDs)
• Verify that numeric fields are populated with correct precision
• Verify that every ETL session completed with only planned exceptions
• Verify all cleansing, transformation, error and exception handling
• Verify PL/SQL calculations and data mappings
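The duplicate and precision checks above can likewise be expressed as queries. A sketch, assuming a hypothetical target table DWH_FACT_ORDERS whose ORDER_ID must be distinct and whose ORDER_AMT is defined with two decimal places:

-- Duplicate check: a key defined as distinct should never appear more than once
SELECT ORDER_ID, COUNT(*) AS occurrences
FROM DWH_FACT_ORDERS
GROUP BY ORDER_ID
HAVING COUNT(*) > 1;

-- Precision check: amounts should survive rounding to two decimals unchanged
SELECT ORDER_ID, ORDER_AMT
FROM DWH_FACT_ORDERS
WHERE ORDER_AMT <> ROUND(ORDER_AMT, 2);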

Wayne Yaddow, 2013 28

Examples of DWH Defects
1. Inadequate ETL and stored procedure design documents

2. Field values are null when specified as “Not Null”.

3. Field constraints and SQL not coded correctly for Informatica ETL

4. Excessive ETL errors discovered after entry to QA

5. Source data does not meet table mapping specifications (ex., dirty data)

6. Source to target mappings: 1) often not reviewed, 2) in error, and 3) not consistently maintained through the dev lifecycle

Wayne Yaddow, 2013 29

Examples of DWH Defects
7. Data models are not adequately maintained during the development lifecycle

8. Target data does not meet mapping specifications

9. Duplicate field values when defined to be DISTINCT

10. ETL SQL / transformation errors leading to missing rows and invalid field values

11. Constraint violations in source

12. Target data is incorrectly stored in nonstandard formats

13. Table keys are incorrect for important relationship linkages

Wayne Yaddow, 2013 30

Verifying Data Loads (from RTTS)

Wayne Yaddow, 2013 31

DQ tools / techniques for QA team

TOAD / SQL Navigator
• Data profiling for value range & boundary analysis
• Null field analysis
• Row counting
• Data type analysis
• Referential integrity analysis
• Distinct value analysis by field
• Duplicate data analysis (fields and rows)
• Cardinality analysis
• Stored procedures & package verification

Excel
• Data filtering for profile analysis
• Data value sampling
• Data type analysis

MS Access
• Table and data analysis across schemas

Testing Automation
• Informatica’s Data Validation Option (DVO)
• RTTS QuerySurge

Analytics Tools
• J – statistics, visualization, data manipulation
• Perl – data manipulation, scripting
• R – statistics
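The referential integrity and cardinality analyses listed under TOAD can be scripted as ordinary queries. A sketch of an orphan-key check, assuming a hypothetical fact table DWH_FACT_ORDERS that references DWH_DIM_CUSTOMER through CUSTOMER_KEY:

-- Referential integrity: fact rows whose customer key has no dimension row
SELECT f.ORDER_ID, f.CUSTOMER_KEY
FROM DWH_FACT_ORDERS f
LEFT JOIN DWH_DIM_CUSTOMER d ON d.CUSTOMER_KEY = f.CUSTOMER_KEY
WHERE d.CUSTOMER_KEY IS NULL;

-- Cardinality analysis: distribution of fact rows per dimension member
SELECT f.CUSTOMER_KEY, COUNT(*) AS fact_rows
FROM DWH_FACT_ORDERS f
GROUP BY f.CUSTOMER_KEY
ORDER BY fact_rows DESC;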

Wayne Yaddow, 2013 32

Bottom Line Recommendations

• Involve test team in entire DWH SDLC
• Profile source and target data
• Remember: DWH QA is much more than source and target record counts
• Develop testers’ SQL and DWH structure skills
• Assure availability of source to target mapping documents
• Plan for regression and automated testing

Wayne Yaddow, 2013 33

Planning Dev/Unit Tests

Unit testing checklist

• Some programmers are not well trained as testers. They may like to program, deploy the code, and move on to the next development task without a thorough unit test. A checklist will aid database programmers to systematically test their code before formal QA testing.

• Check the mapping of fields that support data staging and in data marts.
• Check for duplication of values generated using sequence generators.
• Check the correctness of surrogate keys that uniquely identify rows of data.
• Check for data-type constraints of the fields present in staging and core levels.
• Check the data loading status and error messages after ETLs (extracts, transformations, loads).
• Look for string columns that are incorrectly left- or right-trimmed.
• Make sure all tables and specified fields were loaded from source to staging.
• Verify that not-null fields were populated.
• Verify that no data truncation occurred in each field.
• Make sure data types and formats are as specified during database design.
• Make sure there are no duplicate records in target tables.
• Make sure data transformations are correctly based on business rules.
• Verify that numeric fields are populated precisely.
• Make sure every ETL session completed with only planned exceptions.
• Verify all data cleansing, transformation, and error and exception handling.
• Verify stored procedure calculations and data mappings.
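Two of the checklist items, untrimmed string columns and duplicated sequence-generated keys, can be scripted as simple queries. A sketch with hypothetical staging names:

-- Strings incorrectly left- or right-trimmed (leading or trailing whitespace)
SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM STG_CUSTOMER
WHERE CUSTOMER_NAME <> TRIM(CUSTOMER_NAME);

-- Surrogate keys generated by a sequence must be unique
SELECT CUSTOMER_KEY, COUNT(*) AS occurrences
FROM STG_CUSTOMER
GROUP BY CUSTOMER_KEY
HAVING COUNT(*) > 1;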

Wayne Yaddow, 2013 34

Planning for Performance Tests

• As the volume of data in the warehouse grows, ETL execution times can be expected to increase, and the performance of queries often degrades. These changes can be mitigated by having a solid technical architecture and an efficient ETL design. The aim of performance testing is to point out potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. A performance and scalability testing checklist helps discover performance issues.

• Load the database with peak expected production volumes to help ensure that the volume of data can be loaded by the ETL process within the agreed-on window.
• Compare ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues.
• Compare the ETL processing times component by component to pinpoint any areas of weakness.
• Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
• Perform simple and multiple-join queries to validate query performance on large database volumes.
• Work with business users to develop sample queries and acceptable performance criteria for each query.
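Component-by-component timing comparisons usually come from the ETL tool's run logs or from an audit table that each session writes to. A sketch against a hypothetical ETL_RUN_LOG audit table (the table and its columns are assumptions, not a standard Informatica structure), comparing two load dates:

-- Elapsed seconds per ETL component across two load volumes (Oracle DATE arithmetic)
SELECT SESSION_NAME,
       RUN_DATE,
       ROWS_LOADED,
       (END_TIME - START_TIME) * 86400 AS elapsed_seconds
FROM ETL_RUN_LOG
WHERE RUN_DATE IN (DATE '2013-06-01', DATE '2013-06-15')
ORDER BY SESSION_NAME, RUN_DATE;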

Wayne Yaddow, 2013 35

Recommendations for data verifications

Detailed Recommendations for Data Development and QA

1. Need analysis of (a) source data quality and (b) data field profiles before input to Informatica and other data-build services.

2. QA should participate in all data model and data mapping reviews.

3. Need complete review of ETL error logs and resolution of errors by ETL teams before DB turn-over to QA.

4. Early use of QC during ETL and stored procedure testing to target vulnerable process areas.

5. Substantially improved documentation of PL/SQL stored procedures.

6. QA needs a dev or separate environment for early data testing. QA should be able to modify data in order to perform negative tests. (QA currently does only positive tests because the application and database tests run in parallel in the same environment.)

7. Need substantially enhanced verification of target tables after each ETL load before data turn-over to QA.

8. Need mandatory maintenance of data models and source to target mapping / transformation rules documents from elaboration until transition.

9. Investments in more Informatica and off-the-shelf data quality analysis tools for pre and post ETL.

10. Investments in automated DB regression test tools and training to support frequent data loads.

Wayne Yaddow, 2013 36

Plan QA for All DWH Dev. Phases

Wayne Yaddow, 2013 37

Plan methods & tools for testing