Become a Big Data Quality Hero

T8 Concurrent Class 10/3/2013 11:15:00 AM "Become a Big Data Quality Hero" Presented by: Jason Rauen LexisNexis Brought to you by: 340 Corporate Way, Suite 300, Orange Park, FL 32073 888-268-8770 ∙ 904-278-0524 ∙ [email protected] www.sqe.com

Upload: techwellpresentations

Post on 15-Jan-2015


DESCRIPTION

Many believe that regression testing an application with minimal data is sufficient. With big data applications, the data testing methodology becomes far more complex. Testing can now be done within the data fabrication process as well as in the data delivery process. Today, comprehensive testing is often mandated by regulatory agencies—and more importantly by customers. Finding issues before deployment and saving your company’s reputation—and in some cases preventing litigation—are critical. Jason Rauen presents an overview of the architecture, processes, techniques, and lessons learned by an original big data company. Detecting defects up-front is vital. Learn how to test thousands, millions, and in some cases billions—yes, billions—of records directly, rendering sampling procedures obsolete. Save time and money for your organization with better data test coverage than ever before.

TRANSCRIPT

Page 1: Become a Big Data Quality Hero

T8 Concurrent Class

10/3/2013 11:15:00 AM

"Become a Big Data Quality Hero"

Presented by:

Jason Rauen

LexisNexis

Brought to you by:

340 Corporate Way, Suite 300, Orange Park, FL 32073

888-268-8770 ∙ 904-278-0524 ∙ [email protected] ∙ www.sqe.com

Page 2: Become a Big Data Quality Hero

Jason Rauen

LexisNexis

Jason Rauen is a senior quality test analyst at Georgia-based LexisNexis Risk Solutions. With more than fifteen years of experience, Jason has led the data testing team in big data from its inception. He has presented big data scripting techniques at HPCC Systems national Data Summit. His background includes working at companies including Microsoft, AT&T, and LexisNexis, and instructing at Intel, Boeing, Executrain, and the Department of the Navy.

Page 3: Become a Big Data Quality Hero

9/19/2013

1

Interesting Quotes

“Quality isn’t measured by how many clients you obtain; it’s measured by how many clients you retain.”

“QA isn’t the bottom of the totem pole; it’s the dirt holding it up.”

Become a Big Data Quality Hero

A look inside QA for Big Data

Presented by 01001010 01100001 01110011 01101111 01101110 00100000 01010010 01100001 01110101 01100101 01101110 (Jason Rauen)

Page 4: Become a Big Data Quality Hero


Overview

• Why Test and How it’s Different

– Issues

– Benefits

• Architecture and why you need to know

– HPCC Systems/Hadoop

– Know Your Data/Environment

• Strategies and Concepts

– What to look for

– Sample Gathering (AUB)

– Stats

– Profiling

Why Test and How it’s Different

Why Test Data:

• Traditional methods are not adequate: traditional sampling is scenario based and needs improvement (not enough samples, human error, etc.)

• Tied into current environment

• Government regulatory compliance

• Auditing requirements

• Company-wide initiatives

Page 5: Become a Big Data Quality Hero


Why Test and How it’s Different

Want to keep your customers?

Why Test and How it’s Different

• When?

o Testing - SDLC

o Routine Testing

o Frequency - Yearly/Monthly/Weekly/Daily/Hourly/On Demand

• What? Types of Testing

o New Project – Source to Target (Transform)

o Standard - Production Validation

o Emergency releases

• How?

o Using what you have available

o Freebies – Profiling tools, etc.

Page 6: Become a Big Data Quality Hero


Why Test and How it’s Different

Issues:

• Lack of control

o Timing of builds

o Samples and location of samples

• 3rd Party Apps

o Lack of licenses, costs, training, and existing knowledge

• Extra hardware

• Upgrades

Why Test and How it’s Different

Benefits:

• Cost savings

• Better Coverage

o No Samples

o Increased Sampling

o Focused Samples

• Faster (Time is $)

• Quicker to diagnose issues

• Better Data Integrity

• Collaboration with other groups

Page 7: Become a Big Data Quality Hero


Architecture and why you need to know

Typical Generic Architecture

[Diagram: typical generic architecture; surviving label: input DB]

Architecture and why you need to know

Data Fabrication Engines

• Hadoop (HDFS) and HPCC Thor

• Made of several nodes

• Where the ETL happens

• Where the Keys are made

Data Delivery Engines

• HPCC Roxie, HBase, etc.

• Keys moved to and referenced here

• Where the queries reside

Page 8: Become a Big Data Quality Hero


Architecture and why you need to know

[Diagram: Hadoop stack; surviving labels: HDFS, Hadoop MapReduce, HBase]

Page 9: Become a Big Data Quality Hero


Architecture and why you need to know

[Diagram: Hadoop processing; surviving labels: HDFS, Map, Shuffle, Reduce]
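The Map, Shuffle, and Reduce stages named on this slide can be illustrated with a small single-process sketch. This is not Hadoop code: it only mimics the three stages in plain Python to count records per key, and the record layout and the `state` field are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records; the field names are illustrative only.
records = [
    {"id": 1, "state": "FL"},
    {"id": 2, "state": "GA"},
    {"id": 3, "state": "FL"},
]

def map_phase(rows):
    """Map: emit one (key, 1) pair per input record."""
    return [(row["state"], 1) for row in rows]

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

On a real cluster each stage runs in parallel across nodes, which is why knowing where the data lives matters when planning a test.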

Page 10: Become a Big Data Quality Hero


Architecture and why you need to know

[Diagram: HPCC Systems processing; surviving labels: DISTRIBUTE/PROJECT/TRANSFORM, ROLLUP]

Strategies and Concepts

• What to look for

o Brand-new, incomplete, or missing builds (Data Cops)

o Data progression: today vs. yesterday, father key vs. grandfather key

o Count of deltas in release/deploy

o Keys updated

o Missing keys/new keys

o Field validations, indexed and non-indexed

o Key layout issues

o Corruption: unprintable or invalid characters

o Duplicate records among new and existing records

o Data fabrication engine to data delivery engine deploys/sync

o Queries with new data
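A few of the checks listed above (missing and new keys, duplicate records, unprintable characters) can be sketched in a handful of lines. This is a minimal in-memory illustration, not the presenter's tooling: the `key` and `name` fields and the sample rows are hypothetical, and at real volumes the same logic would run inside the data fabrication engine.

```python
from collections import Counter

# Hypothetical builds: yesterday's (father) key and today's key.
father = [{"key": "A1", "name": "Alpha"}, {"key": "B2", "name": "Bravo"}]
today = [
    {"key": "A1", "name": "Alpha"},
    {"key": "C3", "name": "Char\x00lie"},  # contains an unprintable character
    {"key": "C3", "name": "Char\x00lie"},  # exact duplicate of the row above
]

def key_deltas(old_rows, new_rows):
    """Return (missing, new) key sets between two builds."""
    old_keys = {r["key"] for r in old_rows}
    new_keys = {r["key"] for r in new_rows}
    return old_keys - new_keys, new_keys - old_keys

def duplicate_rows(rows):
    """Return rows that occur more than once, with their counts."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return {row: n for row, n in counts.items() if n > 1}

def unprintable_fields(rows):
    """Return (key, field) pairs whose string value has unprintable characters."""
    return [
        (r["key"], field)
        for r in rows
        for field, value in r.items()
        if isinstance(value, str) and not value.isprintable()
    ]

missing, new = key_deltas(father, today)
print("missing keys:", missing)
print("new keys:", new)
print("duplicates:", duplicate_rows(today))
print("unprintable:", unprintable_fields(today))
```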

Page 11: Become a Big Data Quality Hero


Strategies and Concepts

JOIN

• Sample gathering

• New Key for testing

• Deployment Validation - Data Fabrication

• Deployment Validation - Data Delivery

And get a free cookie…

Strategies and Concepts

AUB for JOIN

A = Left key (New)

B = Right key (Old)

Types of JOINs:

• Inner Join

• Left Outer Join

• Right Outer Join

• Full Outer Join

• Minus or Left Only
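When each key value is unique, the join types above reduce to set operations on the key values of A and B. The sketch below assumes exactly that; the key values are hypothetical, and reading "AUB" as "A union B" is an interpretation rather than something the slide states.

```python
# Hypothetical key values; A is the new (left) key, B is the old (right) key.
a_new = {"k1", "k2", "k3"}
b_old = {"k2", "k3", "k4"}

def join_key_sets(left, right):
    """Return the key values each join type would keep, by set arithmetic."""
    return {
        "inner": left & right,       # keys present in both builds
        "left_outer": set(left),     # every key in A, matched or not
        "right_outer": set(right),   # every key in B, matched or not
        "full_outer": left | right,  # every key in either build
        "left_only": left - right,   # "minus": keys only in the new build
    }

result = join_key_sets(a_new, b_old)
for name, keys in result.items():
    print(name, sorted(keys))
```

For test sampling, the left-only set is a natural source of focused samples, since it contains only records introduced by the new build.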

Page 12: Become a Big Data Quality Hero


Strategies and Concepts

AUB for JOIN

A = Left key (New)

B = Right key (Old)

[Venn diagram of A and B]

Strategies and Concepts

Statistics: What you try to remember with this swimming behind you.

Page 13: Become a Big Data Quality Hero


Strategies and Concepts

Statistics:

• On data sets and keys

- Gives you a high level look at the release

- Ranges

- You’ll start to notice a trend line

• On Releases

- Done over time you’ll see the trend of new data sets and keys

- Done over time you’ll see the trend of changed or modified data sets and keys

Strategies and Concepts

[Chart: values by release number (x-axis: RELEASE NUMBERS 1 to 15; y-axis: 0 to 400), with reference lines AVERAGE 175.4, CEILING 210.6, FLOOR 135.1]
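A trend band like the one in this chart can be approximated as follows. The slide does not say how its CEILING and FLOOR lines were derived, so the rule used here (mean plus or minus k population standard deviations) is an assumption, and the `history` counts are hypothetical rather than the chart's data.

```python
from statistics import mean, pstdev

# Hypothetical record counts (in thousands) for past releases; not the slide's data.
history = [170, 182, 165, 190, 178, 160, 185, 175]

def control_band(values, k=2.0):
    """Return (floor, average, ceiling) as mean -/+ k population std deviations.

    The k-sigma rule is an assumption; the slide does not state how its
    CEILING and FLOOR lines were calculated.
    """
    avg = mean(values)
    spread = k * pstdev(values)
    return avg - spread, avg, avg + spread

def is_outlier(value, values, k=2.0):
    """Flag a new release whose count falls outside the band."""
    floor, _, ceiling = control_band(values, k)
    return value < floor or value > ceiling

floor, avg, ceiling = control_band(history)
print(f"floor={floor:.1f} average={avg:.1f} ceiling={ceiling:.1f}")
print(is_outlier(390, history))  # a release far above the usual range
print(is_outlier(176, history))  # a release close to the average
```

A release flagged this way is not necessarily defective; it is a prompt to find out why the count moved.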

Page 14: Become a Big Data Quality Hero


Strategies and Concepts

Data Profiling:

• Data Profiling Summary Report

• Data Profiling Field Detail Report

o http://www.hpccsystems.com/demos/data-profiling-demo

• Data Profiling Field Combination Report
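The reports themselves appear only as screenshots in the deck, so the following is a rough sketch of the kind of per-field summary a profiling pass might produce, not the HPCC Systems profiling tool. The chosen metrics (fill rate, distinct count, value lengths) and the sample rows are assumptions made for illustration.

```python
# Hypothetical records; field names and values are illustrative only.
rows = [
    {"productcode": "R2D2C3PO", "price": "19.99"},
    {"productcode": "BB8", "price": ""},
    {"productcode": "R2D2C3PO", "price": "5"},
]

def profile(records):
    """Return per-field fill rate, distinct count, and min/max value length."""
    fields = sorted({name for r in records for name in r})
    report = {}
    for name in fields:
        values = [str(r.get(name, "")) for r in records]
        filled = [v for v in values if v.strip()]  # ignore blank values
        lengths = [len(v) for v in filled]
        report[name] = {
            "fill_rate": len(filled) / len(records) if records else 0.0,
            "distinct": len(set(filled)),
            "min_len": min(lengths, default=0),
            "max_len": max(lengths, default=0),
        }
    return report

report = profile(rows)
for field, stats in report.items():
    print(field, stats)
```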

Strategies and Concepts

Data Profiling Summary Report

Page 15: Become a Big Data Quality Hero


Strategies and Concepts

Data Profiling Field Detail Report

Strategies and Concepts

Data Profiling Field Combination Report

Page 16: Become a Big Data Quality Hero


Strategies and Concepts

SQL

SELECT * FROM Products;

SELECT * FROM Products WHERE productcode = 'R2D2C3PO';

SELECT COUNT(*) FROM Products;

Pig

-- Show every record
DUMP Products;

-- Filter on one product code (Pig compares with ==)
Filtered = FILTER Products BY productcode == 'R2D2C3PO';
DUMP Filtered;

-- Count all records
Grouped = GROUP Products ALL;
Counted = FOREACH Grouped GENERATE COUNT(Products);
DUMP Counted;

ECL

// Show every record
Products;

// Filter on one product code
Products(productcode = 'R2D2C3PO');

// Count all records
COUNT(Products);

Strategies and Concepts

SQL

SELECT * FROM Products ORDER BY productcode;

SELECT * FROM Products FULL OUTER JOIN OtherProducts ON Products.col1 = OtherProducts.col1;

Pig

-- Sort by product code
Sorted = ORDER Products BY productcode;
DUMP Sorted;

-- Full outer join on col1
Joined = JOIN Products BY col1 FULL OUTER, OtherProducts BY col1;
DUMP Joined;

ECL

// Sort by product code
SORT(Products, productcode);

// Full outer join on col1
JOIN(Products, OtherProducts, LEFT.col1 = RIGHT.col1, FULL OUTER);

Page 17: Become a Big Data Quality Hero


Summary

• Why Test and How it’s Different

• Architecture and why you need to know

• Strategies and Concepts

Questions?

Page 18: Become a Big Data Quality Hero


Contact / Useful links

www.linkedin.com/in/jasonrauen

• HPCC Systems/ECL Links:

http://hpccsystems.com

http://hpccsystems.com/demos

• Hadoop/Pig Latin Links:

http://pig.apache.org

http://hadoop.apache.org

• SQL Links:

http://sql.org/

http://msdn.microsoft.com/en-US/sqlserver/default.aspx