Testing Big Data: Automated ETL Testing of Hadoop


DESCRIPTION

Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.

TRANSCRIPT

Page 1: Testing Big Data: Automated ETL Testing of Hadoop


Bill Hayduk, CEO/President

RTTS

Testing Big Data: Automated ETL Testing of Hadoop

Jeff Bocarsly, Ph.D., Chief Architect

RTTS

Laura Poggi, Marketing Manager

RTTS

Webinar

Page 2: Testing Big Data: Automated ETL Testing of Hadoop

Today’s Agenda

• About Big Data and Hadoop

• Data Warehouse refresher

• Hadoop and DWH Use Case

• How to test Big Data

• Demo of QuerySurge & Hadoop


AGENDA

Topic: Testing Big Data: Automated ETL Testing of Hadoop

Host: RTTS

Date: Thursday, January 30, 2014

Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)

Session number: 630 771 732

Page 3: Testing Big Data: Automated ETL Testing of Hadoop

About RTTS

RTTS is the leading provider of software quality for critical business systems.

FACTS

Founded: 1996

Primary focus: consulting services, software

Locations: New York, Atlanta, Philly, Phoenix

Geographic region: North America

Customer profile: Fortune 1000, > 600 clients

Software: QuerySurge

Page 4: Testing Big Data: Automated ETL Testing of Hadoop


Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.

- TechCrunch

The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017.

- Research firm IDC

65% of…advanced analytics will have Hadoop embedded (in them) by 2015.

- Gartner

Page 5: Testing Big Data: Automated ETL Testing of Hadoop

about Big Data

Big data – defined as having too much volume, velocity and variety to work on normal database architectures.

Size – defined as 5 petabytes or more:
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes

Page 6: Testing Big Data: Automated ETL Testing of Hadoop

What is Hadoop?

Hadoop is an open source project that develops software for scalable, distributed computing.

• easily deals with the complexities of high volume, velocity and variety of data

• is a framework for distributed processing of large data sets across clusters of computers using simple programming models

• scales up from single servers to 1,000s of machines, each offering local computation and storage

• detects and handles failures at the application layer

Page 7: Testing Big Data: Automated ETL Testing of Hadoop

Key Attributes of Hadoop

• Redundant and reliable

• Extremely powerful

• Easy to program distributed apps

• Runs on commodity hardware


Page 8: Testing Big Data: Automated ETL Testing of Hadoop

Basic Hadoop Architecture

MapReduce – the processing layer that manages the programming jobs (a.k.a. Task Tracker).

HDFS (Hadoop Distributed File System) – stores the data on the machines (a.k.a. Data Node).

[Diagram: a single machine running MapReduce (Task Tracker) on top of HDFS (Data Node)]

Page 9: Testing Big Data: Automated ETL Testing of Hadoop

Basic Hadoop Architecture (continued)

Cluster – add more machines for scaling, from 1 to 100 to 1,000.

Job Tracker – accepts jobs, assigns tasks, identifies failed machines.

Name Node – coordination for HDFS. Inserts and extractions are communicated through the Name Node.

[Diagram: one Name Node and Job Tracker coordinating a cluster of Task Tracker / Data Node machines]

Page 10: Testing Big Data: Automated ETL Testing of Hadoop

Apache Hive

Apache Hive – a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:

• create
• insert
• update
• delete
• select

[Diagram: HiveQL statements passing through MapReduce (Task Tracker) to HDFS (Data Node)]
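As an illustration of how HiveQL mirrors familiar SQL, here is a minimal sketch (the table and column names are hypothetical, and UPDATE/DELETE support varies by Hive version):

```sql
-- Define a Hive table over files already sitting in HDFS (hypothetical schema)
CREATE EXTERNAL TABLE web_logs (
  log_ts   STRING,
  user_id  BIGINT,
  url      STRING,
  bytes    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/web_logs';

-- Query with SQL-style syntax; Hive compiles this into MapReduce jobs
SELECT user_id, COUNT(*) AS hits, SUM(bytes) AS total_bytes
FROM web_logs
GROUP BY user_id;
```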

Page 11: Testing Big Data: Automated ETL Testing of Hadoop

Data Warehouse Review


Page 12: Testing Big Data: Automated ETL Testing of Hadoop

about Data Warehouses…

Data Warehouse

• typically a relational database that is designed for query and analysis rather than for transaction processing

• a place where historical data is stored for archival, analysis and security purposes.

• contains either raw data or formatted data

• combines data from multiple sources


Typical data sources:

• sales
• salaries
• operational data
• human resource data
• inventory data
• web logs
• social networks
• Internet text and docs
• other

Page 13: Testing Big Data: Automated ETL Testing of Hadoop

Data Warehousing: the ETL process

ETL = Extract, Transform, Load

Why ETL? The data warehouse must be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.

Extract – data is extracted from one or more OLTP systems and copied into the warehouse.

Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data.

Load – map the data and load it into the DWH.
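To make the three steps concrete, here is a minimal sketch of one ETL pass in plain SQL (the staging and warehouse table names are hypothetical; production pipelines usually run this inside a dedicated ETL tool):

```sql
-- Extract: copy raw rows from an OLTP system into a staging table
INSERT INTO stg_orders
SELECT order_id, customer_id, order_date, amount, currency
FROM oltp_orders;                           -- hypothetical source table

-- Transform + Load: clean, standardize and derive fields on the way
-- into the warehouse fact table
INSERT INTO dwh_fact_orders (order_id, customer_id, order_date, amount_usd)
SELECT order_id,
       customer_id,
       order_date,
       CASE WHEN currency = 'USD' THEN amount
            ELSE amount * 1.10              -- hypothetical conversion rate
       END
FROM stg_orders
WHERE amount IS NOT NULL;                   -- drop inconsistent rows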

Page 14: Testing Big Data: Automated ETL Testing of Hadoop

Data Warehouse – the ETL process

[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target DWH]

Page 15: Testing Big Data: Automated ETL Testing of Hadoop

Data Warehouse & Hadoop: A Use Case

Page 16: Testing Big Data: Automated ETL Testing of Hadoop

USE CASE***

Use Hadoop as a landing zone for big data & raw data

1) bring all raw, big data into Hadoop

2) perform some pre-processing of this data

3) determine which data goes to the EDWH

4) extract, transform and load (ETL) the pertinent data into the EDWH

DWH & Hadoop: A Use Case

***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
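A minimal sketch of steps 2 and 3 in HiveQL (all table names and the filter condition are hypothetical; the actual load into the EDWH in step 4 would typically be performed by an ETL tool):

```sql
-- Step 2: pre-process raw data landed in Hadoop (hypothetical tables)
CREATE TABLE clicks_clean AS
SELECT LOWER(TRIM(url)) AS url, user_id, event_ts
FROM clicks_raw
WHERE user_id IS NOT NULL;        -- drop malformed rows

-- Step 3: determine which data is pertinent to the EDWH
CREATE TABLE clicks_for_edwh AS
SELECT url, user_id, event_ts
FROM clicks_clean
WHERE event_ts >= '2014-01-01';   -- hypothetical business rule

-- Step 4: the ETL process then extracts this table from Hadoop,
-- transforms it, and loads it into the EDWH
```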

Page 17: Testing Big Data: Automated ETL Testing of Hadoop

DWH & Hadoop: A Use Case

Use case data flow: Source Data → Hadoop → ETL Process → Target DWH

Page 18: Testing Big Data: Automated ETL Testing of Hadoop

Testing Big Data


Page 19: Testing Big Data: Automated ETL Testing of Hadoop

Testing Big Data: Entry Points

Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).

The goal: provide rapid localization of data issues between points

[Diagram: test entry points at each stage of the flow – Source Data → Hadoop → ETL Process → Target DWH → BI]
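A classic manual check at two adjacent entry points is a pair of comparison queries (the table names here are hypothetical; this is the hand-written "minus query" work that the tooling discussed below automates):

```sql
-- Row counts should agree between an entry point and the next one downstream
SELECT COUNT(*) FROM hadoop_landing.orders;   -- entry point: Hadoop landing zone
SELECT COUNT(*) FROM dwh.fact_orders;         -- entry point: target DWH

-- A minus query localizes rows that did not arrive intact
-- (MINUS in Oracle; EXCEPT in ANSI SQL)
SELECT order_id, amount FROM source_db.orders
MINUS
SELECT order_id, amount FROM dwh.fact_orders;
```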

Page 20: Testing Big Data: Automated ETL Testing of Hadoop

Testing Big Data: 3 Big Issues

- we need to verify more data and to do it faster

- we need to automate the testing effort

- we need to be able to test across different platforms


We need a testing tool!

Page 21: Testing Big Data: Automated ETL Testing of Hadoop


About QuerySurge


Page 22: Testing Big Data: Automated ETL Testing of Hadoop

What is QuerySurge?

QuerySurge is the premier test tool built to automate Data Warehouse testing and the ETL testing process.

Page 23: Testing Big Data: Automated ETL Testing of Hadoop

What does QuerySurge™ do?

QuerySurge finds bad data

• Most firms test < 1% of their data

• BI apps sit on top of DWHs that have, at best, untested data and, at worst, bad data

• CEOs, CFOs, CTOs and other executives rely on BI apps to make strategic decisions

• Bad data will cause execs to make decisions that cost them $ millions

• QuerySurge tests up to 100% of your data quickly & finds bad data

Page 24: Testing Big Data: Automated ETL Testing of Hadoop

QuerySurge Roles & Uses

Testers – functional testing, regression testing

ETL Developers – unit testing

Data Analysts – review and analyze data, verify mappings and failures

Operations teams – monitoring


Page 25: Testing Big Data: Automated ETL Testing of Hadoop

QuerySurge™ Architecture

[Architecture diagram: QuerySurge connecting to multiple source systems and the target DWH]

Page 26: Testing Big Data: Automated ETL Testing of Hadoop

QuerySurge™ Modules

Design Library – create Query Pairs (source & target queries); a sketch of a Query Pair follows below

Scheduling – build groups of Query Pairs and schedule test runs
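A Query Pair is simply a source query and a target query whose result sets should agree. A minimal sketch, with hypothetical table names (the source side might run as HiveQL against Hadoop, the target side as SQL against the DWH):

```sql
-- Source query (e.g., HiveQL against the Hadoop landing zone)
SELECT customer_id, SUM(amount) AS total_amount
FROM hadoop_landing.orders
GROUP BY customer_id;

-- Target query (SQL against the data warehouse); the tool compares
-- this result set against the source result set, row by row
SELECT customer_id, SUM(amount_usd) AS total_amount
FROM dwh.fact_orders
GROUP BY customer_id;
```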

Page 27: Testing Big Data: Automated ETL Testing of Hadoop

QuerySurge™ Modules (continued)

Deep-Dive Reporting – examine and automatically email test results

Run Dashboard – view real-time execution and analyze real-time results

Page 28: Testing Big Data: Automated ETL Testing of Hadoop

the QuerySurge solution…

• automates the testing effort: the kickoff, the tests, the comparison, emailing the results

• speeds up testing: up to 1,000 times faster than manual testing

• tests across different platforms: any JDBC-compliant db, DWH, DMart, flat file, XML, Hadoop

• verifies more data: upwards of 100% of all data, quickly

Page 29: Testing Big Data: Automated ETL Testing of Hadoop

QuerySurge Value-Add

QuerySurge provides value through either:

• an increase in testing data coverage, from < 1% to upwards of 100%

• a decrease in testing time, by as much as 1,000x

• a combination of increased test coverage and decreased testing time

Page 30: Testing Big Data: Automated ETL Testing of Hadoop

Return on Investment (ROI)

• redeployment of head count because of an increase in coverage

• a savings over manual testing (writing minus queries, manual compares, other)

• an increase in data quality due to a shorter, more thorough testing cycle, possibly saving $ millions by preventing key decisions based on bad data

Page 31: Testing Big Data: Automated ETL Testing of Hadoop

Ensuring Data Warehouse Quality

Demonstration

Jeff Bocarsly, Ph.D., Chief Architect

RTTS