Testing Big Data: Automated ETL Testing of Hadoop
DESCRIPTION
Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.

TRANSCRIPT
Bill Hayduk, CEO/President
RTTS
Testing Big Data: Automated ETL Testing of Hadoop
Jeff Bocarsly, Ph.D., Chief Architect
RTTS
Laura Poggi, Marketing Manager
RTTS
Webinar
Today’s Agenda
• About Big Data and Hadoop
• Data Warehouse refresher
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge & Hadoop
AGENDA
Topic: Testing Big Data: Automated ETL Testing of Hadoop
Host: RTTS
Date: Thursday, January 30, 2014
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732
About
RTTS is the leading provider of software quality for critical business systems
FACTS
Founded: 1996
Primary Focus:
consulting services, software
Locations: New York, Atlanta, Philly, Phoenix
Geographic region: North America
Customer profile: Fortune 1000, > 600 clients
Software:
QuerySurge
Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.
- TechCrunch
The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017. - Research firm IDC
65% of…advanced analytics will have Hadoop embedded (in them) by 2015. - Gartner
Big data – defined as having too much volume, velocity, and variety to work on normal database architectures.
Size: defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
about Big Data
What is Hadoop?
• easily deals with the complexities of high volume, velocity, and variety of data
Hadoop is an open source project that develops software for scalable, distributed computing.
• is a framework for distributed processing of large data sets across clusters of computers using simple programming models.
• scales up from single servers to thousands of machines, each offering local computation and storage
• detects and handles failures at the application layer
Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Basic Hadoop Architecture
MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node)
[Diagram: a single machine running both a MapReduce Task Tracker and an HDFS Data Node]
Basic Hadoop Architecture (continued)
Cluster – add more machines for scaling, from 1 to 100 to 1,000
Job Tracker – accepts jobs, assigns tasks, identifies failed machines
Name Node – coordination for HDFS; inserts and extractions are communicated through the Name Node
[Diagram: the Job Tracker and Name Node coordinating a cluster of machines, each running a Task Tracker and a Data Node]
Apache Hive
Apache Hive – a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:
• create
• insert
• update
• delete
• select
[Diagram: HiveQL statements processed by MapReduce (Task Tracker) against HDFS (Data Node)]
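As a quick illustration, here is a minimal HiveQL sketch (the table, columns, and HDFS path are hypothetical, not from the webinar):

    -- define a Hive table over raw files already sitting in HDFS (hypothetical schema/path)
    CREATE EXTERNAL TABLE web_logs (
      log_time TIMESTAMP,
      user_id  BIGINT,
      url      STRING,
      bytes    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/web_logs';

    -- query it with familiar SQL-style syntax; Hive compiles this into MapReduce jobs
    SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM web_logs
    GROUP BY url;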
Data Warehouse Review
about Data Warehouses…
Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis and security purposes.
• contains either raw data or formatted data
• combines data from multiple sources
Typical sources:
• sales
• salaries
• operational data
• human resource data
• inventory data
• web logs
• social networks
• Internet text and docs
• other
Data Warehousing: the ETL Process
ETL = Extract, Transform, Load
Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
Extract – data is read from one or more OLTP systems and copied into the warehouse
Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data, and deriving new fields to store calculated data
Load – map the data and load it into the DWH
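To make the Transform step concrete, here is a hedged SQL sketch (all table and column names are hypothetical): it removes inconsistencies, derives a calculated field, and summarizes detail rows before the load.

    -- hypothetical transform: standardize, derive, and summarize source rows
    -- before loading them into a summary table in the DWH
    INSERT INTO dwh.daily_sales_summary (sale_date, region, order_count, total_revenue)
    SELECT
        CAST(o.order_ts AS DATE)       AS sale_date,     -- assemble to a common format
        UPPER(TRIM(o.region_code))     AS region,        -- remove inconsistencies
        COUNT(*)                       AS order_count,   -- summarize detailed data
        SUM(o.quantity * o.unit_price) AS total_revenue  -- derive a calculated field
    FROM staging.orders o
    GROUP BY CAST(o.order_ts AS DATE), UPPER(TRIM(o.region_code));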
Data Warehouse – the ETL Process
[Diagram: source systems (Legacy DB, CRM/ERP DB, Finance DB) → Extract → Transform → Load → Target DWH]
Data Warehouse & Hadoop: A Use Case
USE CASE***
[Diagram: big data and raw data landing in Hadoop, with pertinent data flowing on to the DWH]
Use Hadoop as a landing zone for big data & raw data:
1) bring all raw, big data into Hadoop
2) perform some pre-processing of this data
3) determine which data goes to the EDWH
4) extract, transform, and load (ETL) the pertinent data into the EDWH (see the HiveQL sketch below)
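A hedged HiveQL sketch of steps 2 through 4 (table names, columns, and filters are all hypothetical):

    -- step 2 (hypothetical): pre-process the raw data landed in Hadoop,
    -- keeping only well-formed rows
    CREATE TABLE clicks_clean AS
    SELECT user_id, page_url, CAST(event_ts AS TIMESTAMP) AS event_ts
    FROM clicks_raw
    WHERE user_id IS NOT NULL;

    -- steps 3 and 4 (hypothetical): summarize and stage only the pertinent data
    -- that the ETL process will move into the EDWH
    CREATE TABLE edwh_staging AS
    SELECT user_id, COUNT(*) AS page_views, MAX(event_ts) AS last_seen
    FROM clicks_clean
    GROUP BY user_id;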
DWH & Hadoop: A Use Case
***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
[Diagram: Source Data → Hadoop (landing zone) → ETL Process → Target DWH]
DWH & Hadoop: A Use Case
Use case data flow
Testing Big Data
Testing Big Data: Entry Points
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: test entry points at each stage – Source Data → Hadoop → ETL Process → Target DWH → BI]
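One common way to localize issues between two adjacent entry points is a minus query. A hedged sketch (the schemas are hypothetical; MINUS is Oracle-style syntax, other databases use EXCEPT):

    -- rows present at the upstream entry point but missing downstream;
    -- an empty result means the two points agree for these columns
    SELECT order_id, customer_id, amount FROM source_stage.orders
    MINUS
    SELECT order_id, customer_id, amount FROM target_stage.orders;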
Testing Big Data: 3 Big Issues
• we need to verify more data and to do it faster
• we need to automate the testing effort
• we need to be able to test across different platforms
We need a testing tool!
About QuerySurge
What is QuerySurge?
QuerySurge is the premier test tool built to automate Data Warehouse testing and the ETL testing process.
What does QuerySurge™ do?
QuerySurge finds bad data
• Most firms test < 1% of their data
• BI apps sit on top of DWHs that have, at best, untested data and, at worst, bad data
• CEOs, CFOs, CTOs, and other executives rely on BI apps to make strategic decisions
• Bad data will cause execs to make decisions that can cost millions of dollars
• QuerySurge tests up to 100% of your data quickly and finds bad data
QuerySurge Roles & Uses
Testers – functional testing, regression testing
ETL Developers – unit testing
Data Analysts – review and analyze data, verify mappings and failures
Operations teams – monitoring
QuerySurge™ Architecture
[Diagram: QuerySurge connecting to the data Sources and the Target DWH]
QuerySurge™ Modules
• Design Library – create Query Pairs (source & target queries); a sketch of a Query Pair follows below
• Scheduling – build groups of Query Pairs, schedule test runs
• Run Dashboard – view real-time execution, analyze real-time results
• Deep-Dive Reporting – examine and automatically email test results
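To illustrate the Query Pair idea, here is a hedged sketch (the schemas and names are hypothetical, not QuerySurge's own). A source query runs against the source system and a target query runs against the warehouse; if the ETL is correct, the two result sets match row for row.

    -- source query (hypothetical OLTP schema): what the ETL should have produced
    SELECT customer_id, UPPER(TRIM(region_code)) AS region, SUM(amount) AS total_amount
    FROM oltp.orders
    GROUP BY customer_id, UPPER(TRIM(region_code));

    -- target query (hypothetical DWH schema): what actually landed
    SELECT customer_id, region, total_amount
    FROM dwh.customer_order_summary;

    -- the tool compares the two result sets and flags any mismatched rows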
the QuerySurge solution…
• automates the testing effort – the kickoff, the tests, the comparison, emailing the results
• speeds up testing – up to 1,000 times faster than manual testing
• tests across different platforms – any JDBC-compliant DB, DWH, data mart, flat file, XML, Hadoop
• verifies more data – upwards of 100% of all data, quickly
QuerySurge Value-Add
QuerySurge provides value through either:
• an increase in testing data coverage, from < 1% to upwards of 100%
• a decrease in testing time, by as much as 1,000x
• a combination of increased test coverage and decreased testing time
Return on Investment (ROI)
• redeployment of head count made possible by an increase in coverage
• savings over manual testing (minus queries, manual compares, other)
• better data due to shorter, more thorough testing cycles, possibly saving millions of dollars by preventing key decisions made on bad data
Ensuring Data Warehouse Quality
Demonstration
Jeff Bocarsly, Ph.D., Chief Architect
RTTS