Testing Big Data: Automated ETL Testing of Hadoop
DESCRIPTION
Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.

TRANSCRIPT
Bill Hayduk, CEO/President
RTTS
Testing Big Data: Automated ETL Testing of Hadoop
Jeff Bocarsly, Ph.D., Chief Architect
RTTS
Laura Poggi, Marketing Manager
RTTS
Webinar
Today’s Agenda
• About Big Data and Hadoop
• Data Warehouse refresher
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge & Hadoop
AGENDA
Topic: Testing Big Data: Automated ETL Testing of Hadoop
Host: RTTS
Date: Thursday, January 30, 2014
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732
About
RTTS is the leading provider of software quality for critical business systems
FACTS
Founded: 1996
Primary Focus:
consulting services, software
Locations: New York, Atlanta, Philly, Phoenix
Geographic region: North America
Customer profile: Fortune 1000, > 600 clients
Software:
QuerySurge
Facebook handles 300 million photos a day and about 105 terabytes of data every 30 minutes.
- TechCrunch
The big data market will grow from $3.2 billion in 2010 to $32.4 billion in 2017. - Research firm IDC
65% of…advanced analytics will have Hadoop embedded (in them) by 2015. - Gartner
Big data – defined as having too much volume, velocity, and variety to work on normal database architectures.
Size: defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
about Big Data
What is Hadoop?
• easily deals with the complexities of high volume, velocity, and variety of data
Hadoop is an open source project that develops software for scalable, distributed computing.
• is a framework for distributed processing of large data sets across clusters of computers using simple programming models.
• scales up from single servers to thousands of machines, each offering local computation and storage
• detects and handles failures at the application layer
Key Attributes of Hadoop
• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware
Basic Hadoop Architecture
MapReduce – the processing part that manages the programming jobs (a.k.a. Task Tracker)
HDFS (Hadoop Distributed File System) – stores data on the machines (a.k.a. Data Node)
[Diagram: a single machine running both a MapReduce Task Tracker and an HDFS Data Node]
Basic Hadoop Architecture (continued)
Cluster – add more machines for scaling, from 1 to 100 to 1,000
Job Tracker – accepts jobs, assigns tasks, identifies failed machines
Name Node – coordination for HDFS; inserts and extractions are communicated through the Name Node
[Diagram: the Job Tracker and Name Node coordinating a cluster of machines, each running a Task Tracker and a Data Node]
Apache Hive
Apache Hive – a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language called HiveQL that interacts with the HDFS files:
• create
• insert
• update
• delete
• select
[Diagram: HiveQL statements processed by MapReduce (Task Tracker) against HDFS (Data Node)]
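As a quick illustration, here is a minimal HiveQL sketch (the table, columns, and HDFS path are hypothetical, not from the webinar):

    -- define a Hive table over raw files already sitting in HDFS (hypothetical schema/path)
    CREATE EXTERNAL TABLE web_logs (
      log_time TIMESTAMP,
      user_id  BIGINT,
      url      STRING,
      bytes    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/web_logs';

    -- query it with familiar SQL-style syntax; Hive compiles this into MapReduce jobs
    SELECT url, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM web_logs
    GROUP BY url;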
Data Warehouse Review
about Data Warehouses…
Data Warehouse
• typically a relational database that is designed for query and analysis rather than for transaction processing
• a place where historical data is stored for archival, analysis and security purposes.
• contains either raw data or formatted data
• combines data from multiple sources
Typical sources:
• sales
• salaries
• operational data
• human resource data
• inventory data
• web logs
• social networks
• Internet text and docs
• other
Data Warehousing: the ETL Process
ETL = Extract, Transform, Load
Why ETL? The data warehouse needs to be loaded regularly (daily/weekly) so that it can serve its purpose of facilitating business analysis.
Extract – data is read from one or more OLTP systems and copied into the warehouse
Transform – removing inconsistencies, assembling to a common format, adding missing fields, summarizing detailed data, and deriving new fields to store calculated data
Load – map the data and load it into the DWH
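To make the Transform step concrete, here is a hedged SQL sketch (all table and column names are hypothetical): it removes inconsistencies, derives a calculated field, and summarizes detail rows before the load.

    -- hypothetical transform: standardize, derive, and summarize source rows
    -- before loading them into a summary table in the DWH
    INSERT INTO dwh.daily_sales_summary (sale_date, region, order_count, total_revenue)
    SELECT
        CAST(o.order_ts AS DATE)       AS sale_date,     -- assemble to a common format
        UPPER(TRIM(o.region_code))     AS region,        -- remove inconsistencies
        COUNT(*)                       AS order_count,   -- summarize detailed data
        SUM(o.quantity * o.unit_price) AS total_revenue  -- derive a calculated field
    FROM staging.orders o
    GROUP BY CAST(o.order_ts AS DATE), UPPER(TRIM(o.region_code));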
Data Warehouse – the ETL Process
[Diagram: source systems (Legacy DB, CRM/ERP DB, Finance DB) → Extract → Transform → Load → Target DWH]
Data Warehouse & Hadoop: A Use Case
USE CASE***
[Diagram: big data and raw data landing in Hadoop, with pertinent data flowing on to the DWH]
Use Hadoop as a landing zone for big data & raw data:
1) bring all raw, big data into Hadoop
2) perform some pre-processing of this data
3) determine which data goes to the EDWH
4) extract, transform, and load (ETL) the pertinent data into the EDWH (see the HiveQL sketch below)
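A hedged HiveQL sketch of steps 2 through 4 (table names, columns, and filters are all hypothetical):

    -- step 2 (hypothetical): pre-process the raw data landed in Hadoop,
    -- keeping only well-formed rows
    CREATE TABLE clicks_clean AS
    SELECT user_id, page_url, CAST(event_ts AS TIMESTAMP) AS event_ts
    FROM clicks_raw
    WHERE user_id IS NOT NULL;

    -- steps 3 and 4 (hypothetical): summarize and stage only the pertinent data
    -- that the ETL process will move into the EDWH
    CREATE TABLE edwh_staging AS
    SELECT user_id, COUNT(*) AS page_views, MAX(event_ts) AS last_seen
    FROM clicks_clean
    GROUP BY user_id;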
DWH & Hadoop: A Use Case
***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
[Diagram: Source Data → Hadoop (landing zone) → ETL Process → Target DWH]
DWH & Hadoop: A Use Case
Use case data flow
Testing Big Data
Testing Big Data: Entry Points
Recommended functional test strategy: Test every entry point in the system (feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: test entry points at each stage – Source Data → Hadoop → ETL Process → Target DWH → BI]
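One common way to localize issues between two adjacent entry points is a minus query. A hedged sketch (the schemas are hypothetical; MINUS is Oracle-style syntax, other databases use EXCEPT):

    -- rows present at the upstream entry point but missing downstream;
    -- an empty result means the two points agree for these columns
    SELECT order_id, customer_id, amount FROM source_stage.orders
    MINUS
    SELECT order_id, customer_id, amount FROM target_stage.orders;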
Testing Big Data: 3 Big Issues
• we need to verify more data and to do it faster
• we need to automate the testing effort
• we need to be able to test across different platforms
We need a testing tool!
About QuerySurge
What is QuerySurge?
QuerySurge is the premier test tool built to automate Data Warehouse testing and the ETL testing process.
What does QuerySurge™ do?
QuerySurge finds bad data
• Most firms test < 1% of their data
• BI apps sit on top of DWHs that have, at best, untested data and, at worst, bad data
• CEOs, CFOs, CTOs, and other executives rely on BI apps to make strategic decisions
• Bad data will cause execs to make decisions that can cost millions of dollars
• QuerySurge tests up to 100% of your data quickly and finds bad data
QuerySurge Roles & Uses
Testers – functional testing, regression testing
ETL Developers – unit testing
Data Analysts – review and analyze data, verify mappings and failures
Operations teams – monitoring
QuerySurge™ Architecture
[Diagram: QuerySurge connecting to the data Sources and the Target DWH]
QuerySurge™ Modules
• Design Library – create Query Pairs (source & target queries); a sketch of a Query Pair follows below
• Scheduling – build groups of Query Pairs, schedule test runs
• Run Dashboard – view real-time execution, analyze real-time results
• Deep-Dive Reporting – examine and automatically email test results
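To illustrate the Query Pair idea, here is a hedged sketch (the schemas and names are hypothetical, not QuerySurge's own). A source query runs against the source system and a target query runs against the warehouse; if the ETL is correct, the two result sets match row for row.

    -- source query (hypothetical OLTP schema): what the ETL should have produced
    SELECT customer_id, UPPER(TRIM(region_code)) AS region, SUM(amount) AS total_amount
    FROM oltp.orders
    GROUP BY customer_id, UPPER(TRIM(region_code));

    -- target query (hypothetical DWH schema): what actually landed
    SELECT customer_id, region, total_amount
    FROM dwh.customer_order_summary;

    -- the tool compares the two result sets and flags any mismatched rows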
the QuerySurge solution…
• automates the testing effort – the kickoff, the tests, the comparison, emailing the results
• speeds up testing – up to 1,000 times faster than manual testing
• tests across different platforms – any JDBC-compliant DB, DWH, data mart, flat file, XML, Hadoop
• verifies more data – upwards of 100% of all data, quickly
QuerySurge Value-Add
QuerySurge provides value through either:
• an increase in testing data coverage, from < 1% to upwards of 100%
• a decrease in testing time, by as much as 1,000x
• a combination of increased test coverage and decreased testing time
Return on Investment (ROI)
• redeployment of head count made possible by an increase in coverage
• savings over manual testing (minus queries, manual compares, other)
• better data due to shorter, more thorough testing cycles, possibly saving millions of dollars by preventing key decisions made on bad data
Ensuring Data Warehouse Quality
Demonstration
Jeff Bocarsly, Ph.D., Chief Architect
RTTS