testing big data: automated testing of hadoop with querysurge

built by

Bill Hayduk, CEO/President

RTTS

Testing Big Data: Automated ETL Testing of Hadoop

Jeff Bocarsly, Ph.D., Chief Architect

QuerySurge Division, RTTS


QuerySurge ™

Automate your Data Warehouse & Big Data Testing and Reap the Benefits


Today’s Agenda

• About Big Data and Hadoop

• Data Warehouse refresher

• Hadoop and DWH Use Case

• How to test Big Data

• Demo of QuerySurge & Hadoop

AGENDA

Topic: Testing Big Data: Automated ETL Testing of Hadoop

Host: RTTS

Date: Thursday, January 30, 2014

Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)

Session number: 630 771 732


About FACTS

Founded: 1996

Locations: New York (HQ), Atlanta, Philadelphia, Phoenix

Strategic Partners: IBM, Microsoft, HP, Oracle, Teradata, HortonWorks, Cloudera, Amazon

Software:

QuerySurge

RTTS is the leading provider of software & data quality for critical business systems


Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.

- TechCrunch

The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017.

- Research Firm IDC

65% of…advanced analytics will have Hadoop embedded (in them) by 2015.

- Gartner


[Diagram: Source Data → ETL Process → Data Warehouse / Big Data → Business Intelligence (BI) software]

CxOs are using Business Intelligence & Analytics to make critical business decisions – with the assumption that the underlying data is fine.

“The average organization loses $8.2 million annually through poor Data Quality.”

- Gartner

Data Architecture

The Executive Office and Big Data

potential problem areas

Big data – defined as too much volume, velocity and variety to work on normal database architectures.

Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
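The size arithmetic above can be checked with a few lines of Python. This is only a sketch: the `convert` helper and the `UNITS` list are illustrative names, and the factor of 1,000 per step follows the decimal units used on the slide.

```python
# Decimal storage units, smallest to largest; each step is a factor of 1,000.
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes"]

def convert(value, from_unit, to_unit):
    """Convert a size between decimal storage units."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        return value * 1000 ** steps      # going to a smaller unit
    return value / 1000 ** (-steps)       # going to a larger unit

# The 5-petabyte threshold from the slide, expressed in terabytes.
threshold_tb = convert(5, "petabytes", "terabytes")
print(threshold_tb)
```

The same helper reproduces the chain on the slide: one petabyte comes out as 1,000,000,000 megabytes.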

about Big Data


Big Data Impact

Handles more than 1 million customer transactions every hour.
• data imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of Congress.

Facebook handles 40 billion photos from its user base.

Google processes 1 Terabyte per hour

Twitter processes 85 million tweets per day

eBay processes 80 Terabytes per day

others


Requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.

Technologies include:
• massively parallel processing (MPP) databases
• data warehouses
• data mining grids
• distributed file systems
• distributed databases
• cloud computing platforms
• the Internet, and
• scalable storage systems

Big Data Solutions


What is Hadoop?

Hadoop is an open source project that develops software for scalable, distributed computing. It:

• is a framework for distributed processing of large data sets across clusters of computers using simple programming models

• easily deals with the complexities of high volume, velocity and variety of data

• scales up from single servers to 1,000s of machines, each offering local computation and storage

• detects and handles failures at the application layer


Key Attributes of Hadoop

• Redundant and reliable

• Extremely powerful

• Easy to program distributed apps

• Runs on commodity hardware

Top Vendors


“Spending on Hadoop software and subscriptions will increase to approximately $677 million by the end of 2017, with overall big data market anticipated to reach the $50 billion mark.”

- Wikibon


Basic Hadoop Architecture

MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker).

HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node).

[Diagram: a single machine running MapReduce (Task Tracker) and HDFS (Data Node)]
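To make the MapReduce processing model concrete, here is a minimal single-process sketch of a word count in Python. It only imitates the map, shuffle and reduce steps in memory. In a real Hadoop cluster these steps run as distributed tasks and the input is read from HDFS. All function names are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big tests", "data tests data"])
print(counts)
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, the work can be split across many machines, which is the property the architecture relies on.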


Cluster – add more machines for scaling, from 1 to 100 to 1,000.

Job Tracker – accepts jobs, assigns tasks, identifies failed machines.

Name Node – coordination for HDFS. Inserts and extractions are communicated through the Name Node.

[Diagram: a cluster of machines, each running a Task Tracker and a Data Node, with a Name Node and a Job Tracker]

Basic Hadoop Architecture (continued)


Apache Hive - a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:

• create
• insert
• update
• delete
• select

Apache Hive


Data Warehouse Review

about Data Warehouses…

Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis & security purposes
• contains either raw or formatted data
• combines data from multiple sources:
  o sales
  o salaries
  o operational data
  o human resource data
  o inventory data
  o web logs
  o social networks
  o internet text and docs
  o other


Data Warehouse: the ETL process

ETL: Extract, Transform, Load

Why ETL?
The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.

Extract – data is extracted from one or more OLTP systems and copied into the warehouse.

Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data and deriving new fields to store calculated data.

Load – map the data and load it into the DW.
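The three ETL steps can be sketched end to end in Python. This is a toy illustration under stated assumptions: the source rows, the `sales` table, the default currency and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Extract: hypothetical rows pulled from an OLTP system (customer, amount, currency).
source_rows = [
    (" alice ", "10.50", "usd"),
    ("BOB", "3", None),            # currency is missing in the source
    ("alice", "4.50", "USD"),
]

def transform(row):
    """Clean one row: common format, fill a missing field, derive a new field."""
    customer, amount, currency = row
    customer = customer.strip().title()           # remove inconsistencies
    currency = (currency or "USD").upper()        # assumed default for missing values
    amount_cents = round(float(amount) * 100)     # derived, calculated field
    return customer, currency, amount_cents

def load(conn, rows):
    """Load: map the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, currency TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, [transform(row) for row in source_rows])

# Summarize the detailed data, as a business analysis query might.
summary = warehouse.execute(
    "SELECT customer, SUM(amount_cents) FROM sales "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(summary)
```

Every rule inside `transform` is a place where data can be silently altered, which is why the transformed output is worth verifying against the source.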


Data Warehouse: the Marketplace

“The data warehousing market will see a compound annual growth rate of 11.5% …to reach a total of $13.2 billion in revenue.”

- consulting specialist The 451 Group

Data Warehouse size
Small data warehouses: < 5 TB
Midsize data warehouses: 5 TB - 20 TB
Large data warehouses: > 20 TB

- Analyst firm Gartner

Leaders in Data Warehouse Data Management Systems

- Analyst firm Gartner’s ‘Magic Quadrant for Data Warehouse Database Management Systems’


Testing the Data Warehouse: the ETL process

[Diagram: Source Data (Legacy DB, CRM/ERP DB, Finance DB) → ETL Process (Extract, Transform, Load) → Target Data Warehouse]


Testing Big Data


Data Warehouse & Hadoop: 2 Use Cases

[Diagram labels: Data Warehouse, Hadoop, NoSQL]


Use Case #1: Data Warehouse & Hadoop

USE CASE 1***: Use Hadoop as a landing zone for big data & raw data

1) Bring all raw, big data into Hadoop

2) Perform some pre-processing of this data

3) Determine which data goes to the Data Warehouse

4) Extract, transform and load (ETL) pertinent data into the Data Warehouse

*** Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013


Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).

The goal: provide rapid localization of data issues between points
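One way to localize a data issue between entry points is to take a fingerprint of the data at each point and compare neighbouring stages. The sketch below assumes the records are expected to pass between stages unchanged. Where a transformation is expected, the comparison would have to be made on the transformed form instead. The stage names, sample rows and helper functions are hypothetical.

```python
import hashlib

def fingerprint(rows):
    """Return (row count, order-independent checksum) for a collection of rows."""
    digests = sorted(hashlib.sha256(repr(row).encode()).hexdigest() for row in rows)
    combined = hashlib.sha256("".join(digests).encode()).hexdigest()
    return len(rows), combined

def first_mismatch(stages):
    """Return the first pair of adjacent stages whose data differs, or None."""
    names = list(stages)
    for upstream, downstream in zip(names, names[1:]):
        if fingerprint(stages[upstream]) != fingerprint(stages[downstream]):
            return upstream, downstream
    return None

# Hypothetical snapshots of the same records captured at three entry points.
stages = {
    "source_feed": [(1, "alice"), (2, "bob"), (3, "carol")],
    "hadoop":      [(2, "bob"), (1, "alice"), (3, "carol")],  # same data, new order
    "warehouse":   [(1, "alice"), (2, "bob")],                # a row was lost
}
print(first_mismatch(stages))
```

Testing at every entry point means a failure is reported against a specific hop, here between Hadoop and the warehouse, rather than against the pipeline as a whole.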

[Diagram for Use Case #1: Source Data → Source Hadoop → ETL Process → Target DWH → Business Intelligence software, with test entry points marked]

Use Case #2: MongoDB, Hadoop, DWH &

[Diagram labels: Source Data, Ingestion, Relational DB & Data Warehousing, BI, Analytics & Reporting, with five test entry points marked]


Testing Big Data: 3 Big Issues

- we need to verify more data and to do it faster

- we need to automate the testing effort

- we need to be able to test across different platforms

We need a testing tool!


About QuerySurge ™

The Testing Solution


What is QuerySurge ™?

The collaborative Big Data testing solution that finds bad data & provides a holistic view of your data’s health.


the QuerySurge advantage


Automate the entire testing cycle – automate kickoff, tests, comparison, auto-emailed results

Create tests easily with no SQL programming – ensures minimal time & effort to create tests / obtain results

Test across different platforms – Hadoop, data warehouses, NoSQL, database, flat file, XML

Collaborate with team – Data Health dashboard, shared tests & auto-emailed reports

Verify more data & do it quickly – verifies up to 100% of all data up to 1,000x faster

Integrate for Continuous Delivery – integrates with most Build, ETL & QA management software

QuerySurge™ Architecture

Web-based…

Installs on...

Linux

Connects to…

…or any other JDBC compliant data source


QuerySurge Controller

QuerySurge Server

QuerySurge Agents

Flat Files


QuerySurge™ Modules

Design Library

Scheduling

Deep-Dive Reporting

Run Dashboard

Query Wizards

Data Health Dashboard

Fast and Easy. No programming needed.



• Perform 80% of all data tests - no SQL coding needed

• Opens up testing to novices & non-technical team members

• Speeds up testing for skilled SQL coders

• Provides a huge Return-On-Investment

Design Library

• Create Query Pairs (source & target SQLs)

• Great for team members skilled with SQL
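The idea behind a Query Pair, one SQL statement against the source and one against the target with the two result sets compared, can be sketched in plain Python. This is not QuerySurge's implementation. The `run_query_pair` function, the table names and the sample data are hypothetical, and two in-memory SQLite databases stand in for the source and target data stores.

```python
import sqlite3

def run_query_pair(source_conn, source_sql, target_conn, target_sql):
    """Run both queries and return the rows that appear on only one side."""
    source_rows = set(source_conn.execute(source_sql).fetchall())
    target_rows = set(target_conn.execute(target_sql).fetchall())
    return {
        "missing_in_target": sorted(source_rows - target_rows),
        "unexpected_in_target": sorted(target_rows - source_rows),
    }

# Source store: names are held in two columns.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, first TEXT, last TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "Ann", "Lee"), (2, "Bo", "Kim")])

# Target store: the ETL should have concatenated the two name columns.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_orders (id INTEGER, full_name TEXT)")
target.executemany("INSERT INTO dim_orders VALUES (?, ?)",
                   [(1, "Ann Lee"), (2, "Bo Kin")])  # second row was loaded incorrectly

result = run_query_pair(
    source, "SELECT id, first || ' ' || last FROM orders",
    target, "SELECT id, full_name FROM dim_orders",
)
print(result)
```

The source query applies the expected transformation so that both sides return comparable rows. Note that comparing sets ignores duplicate rows, so a fuller check would also compare row counts.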


Scheduling

• Build groups of Query Pairs

• Schedule Test Runs


Deep-Dive Reporting

• Examine and automatically email test results

Run Dashboard

• View real-time execution

• Analyze real-time results



Data Health Dashboard

• view data reliability & pass rate

• add, move, filter, zoom-in on any data widget & underlying data

• verify build success or failure


(1) Trial in the Cloud of QuerySurge™, including a self-learning tutorial that works with sample data, for 3 days

(2) Downloaded Trial of QuerySurge™, including a self-learning tutorial with sample data or your data, for 15 days

for more information on our Trials, please visit: www.querysurge.com/compare-trial-options

TRIAL IN THE CLOUD


QuerySurge™ Free Trials & Training

Big Data Testing Courses
Filled with examples and labs, this hands-on training teaches concepts and HQL techniques used in Big Data testing.

For more information on our Big Data Testing classes, please visit: http://www.rttsweb.com/training/courses/big-data-testing-courses



a last word about Hadoop…


To see the video of this webinar, please visit: http://www.querysurge.com/solutions/testing-big-data/big-data-testing-for-hadoop

Big Data and Hadoop are on the verge of revolutionizing enterprise data management architectures.

- DeZyre
