Audit and Validation Testing For Big Data Applications

Ravi Shukla, Specialist Senior, Deloitte Consulting Pvt. Ltd.


Post on 20-May-2020




Abstract

In today’s world, we are awash in a flood of data. Across a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Big Data analytics now drives a whole range of applications that impact us on a daily basis, such as retail, manufacturing, healthcare, mobile services and financial services.

Organizations see big data analytics as a means to reduce cost and improve coordination, quality and outcomes. To manage their businesses better, organizations are ensuring that their data, present in different systems, is migrated to a Distributed File System.

Good data is helpful in providing insights. Businesses, when armed with this, can improve the day-to-day decisions they make. If the accuracy of data is low at the beginning of the process, it leads to a lack of insight, and hence the decisions it influences are also likely to be poor. Therefore, organizations must realize the criticality of data and understand that quality is more important than quantity. Most people prioritize gathering information without giving importance to its accuracy or to whether and how it could be used for further processing.

In Big Data testing, QA engineers verify the successful processing of petabytes of data using commodity clusters and other supporting components. This demands a high level of testing skill because the processing is very fast.

Together, these factors place an immense focus on testing data migration activities, a topic of great interest in many industries these days.

This paper attempts to highlight the significance of the Audit & Validation testing approach in the big data application landscape.


What is Big Data?

Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. This data is so large that it is difficult to process using traditional database and software techniques.

Big data can help organizations improve operations by helping in making better decisions and providing accurate insights leading to strategic business moves.

While the term “Big data” is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s when industry analysts articulated the now-mainstream definition of big data as the three Vs:

Fig. Three V’s of Big Data

1. Volume: Organizations collect data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden.

2. Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.

3. Variety: Data comes in all types of formats from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.


Testing challenges in Big Data Analytics

“Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” – Geoffrey Moore, Big Data Author and Consultant

Data is the lifeline of an organization and is getting bigger with each day. In 2011, experts predicted that Big Data would become “the next frontier of competition, innovation and productivity”. Today, businesses face data challenges in terms of volume, variety and sources. Structured business data is supplemented with unstructured data, and with semi-structured data from social media and other third parties. Finding the essential data in such a large volume is becoming a real challenge for businesses, and quality analysis is the only option.

Fig. Audit & Validation Key Challenges

QA teams face multiple challenges in testing Big Data. These are detailed below:

• Large Volumes of diversified data

Testing any large volume of data is the biggest challenge in itself. A decade ago, a data pool of 10 million records was considered gigantic. Today, businesses have to store petabytes or exabytes of data, extracted from various online and offline sources, to conduct their daily business. Testers are required to audit such voluminous data to ensure that it is fit for business purposes and consumption.

• Data Analysis

For the Big Data testing strategy to be effective, testers need to continuously monitor and validate the 3Vs of data: Volume, Variety and Velocity. Understanding the data and its impact on the business is the real challenge faced by any Big Data tester. It is not easy to estimate the testing effort and define the strategy without proper knowledge of the nature of the available data. Testers need to understand the business rules and the relationships between different subsets of data.

• Inefficient data

The data that is available for big data applications is extracted from a wide variety of sources. This data is generally complex and potentially inaccurate. The data needs to be tested in order to ensure its efficiency and accuracy before being loaded onto the target big data systems.

• Need of Technical Expertise

Technology is growing, and everyone is struggling to understand the algorithms for processing Big Data. Big Data testers need to understand the components of the Big Data ecosystem thoroughly. Today, testers understand that they have to think beyond the regular parameters of automated and manual testing. Big Data, with its unexpected formats, can cause problems that automated test cases fail to handle. Creating automated test cases for such a Big Data pool requires expertise and coordination between team members. The testing team should coordinate with the development and marketing teams to understand data extraction from different resources, data filtering, and pre- and post-processing algorithms.

• Additional costs and resources

The big data testing process involves spending time and money on an additional set of resources to perform and test data validation and verification activities. If the testing process is not standardized and strengthened for reuse and optimization of test case sets, the test cycle or test suite will run beyond what was intended, which in turn causes increased costs, maintenance issues and delivery slippages. With manual testing, test cycles might stretch into weeks or even longer.


Audit & Validation process – How can it be a solution?

To help resolve the challenges above, an Audit and Validation (A&V) process verifies and validates the data flowing into big data systems. Under the A&V process, the QA team sets up processes that perform similar transformation logic to that of the development team responsible for extracting data from different sources and loading large datasets into target systems. The extracts generated by the A&V process are then compared with those generated by the development team. The process categorizes the testing as follows:

• Auditing test results: Validates the extract criteria and tests whether all the data has been extracted from the source systems and loaded into the big data system (target).

• Validating test results: Verifies the transformation logic applied to the data during conversion. This ensures that the transformation rules have been applied correctly to the source data extracted for load to the target.

Objective of A&V:

The objective of the Audit and Validation testing process is to ensure that the data being migrated is both validated and verified.

Fig. Audit & Validation checkpoints

Once the implementation is done, we need to verify the data against the implemented solution. In parallel, we also need to ensure that the migrated data satisfies the business need. We can do this by validating the data against the business case. These critical steps of verification and validation are done through the Audit and Validation process.

[Figure labels: Business needs, Solution, Data; check points: Validation, Verification]


The Audit & Validation process involves the following:

The QA team sets up processes that implement similar transformation logic to that of the development team.

Audit Process:

• This involves comparing the number of records extracted by the A&V team with the number extracted by the development team.

• This process detects two kinds of errors:
  o More records are pulled than the expected number of records
  o Fewer records are pulled than the expected number of records

• This process is performed after the data is extracted and before the data is loaded.

Validation Process:

• The validation process involves comparing the data (field by field) which is loaded by the development team against the data extracted by the Audit and Validation team.

• This testing validates the quality of the data and identifies the fallouts that occur while loading the data.
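The audit and validation checks described above can be sketched roughly as follows. This is only an illustration under stated assumptions: both extracts are taken to be small in-memory lists of records that share a key field, and the `id` key, the sample rows, and the `audit` and `validate` helper names are hypothetical rather than part of the process described in this paper.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    dev_count: int
    av_count: int
    excess_keys: set    # pulled by the development extract but not expected
    missing_keys: set   # expected by the A&V extract but not pulled

    @property
    def passed(self) -> bool:
        return (
            self.dev_count == self.av_count
            and not self.excess_keys
            and not self.missing_keys
        )


def audit(dev_rows, av_rows, key="id"):
    """Audit: compare record counts and report excess or missing records."""
    dev_keys = {row[key] for row in dev_rows}
    av_keys = {row[key] for row in av_rows}
    return AuditResult(
        dev_count=len(dev_rows),
        av_count=len(av_rows),
        excess_keys=dev_keys - av_keys,
        missing_keys=av_keys - dev_keys,
    )


def validate(loaded_rows, av_rows, key="id"):
    """Validation: field-by-field comparison of records present in both extracts.

    Returns a list of fallouts as (key, field, loaded_value, expected_value).
    """
    expected = {row[key]: row for row in av_rows}
    fallouts = []
    for row in loaded_rows:
        reference = expected.get(row[key])
        if reference is None:
            continue  # excess record, already reported by the audit
        for name in sorted(reference):
            if row.get(name) != reference[name]:
                fallouts.append((row[key], name, row.get(name), reference[name]))
    return fallouts


# Hypothetical sample data: the A&V extract is the independent reference.
av_extract = [
    {"id": 1, "name": "ALICE", "amount": 10.0},
    {"id": 2, "name": "BOB", "amount": 20.5},
    {"id": 3, "name": "CAROL", "amount": 7.25},
]
dev_extract = [
    {"id": 1, "name": "ALICE", "amount": 10.0},
    {"id": 2, "name": "Bob", "amount": 20.5},  # upper-casing rule not applied
    {"id": 4, "name": "DAVE", "amount": 3.0},  # unexpected record
]

audit_result = audit(dev_extract, av_extract)
fallouts = validate(dev_extract, av_extract)
print(audit_result)
print(fallouts)
```

In this sample the audit reports record 4 as excess and record 3 as missing even though the two counts match, and the validation flags the `name` field of record 2 as a fallout. At big data volumes the same comparison would presumably run as a distributed job rather than in memory.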

The A&V process flow can be classified into 2 steps:

[Figure labels: Data Sources (emails, videos, twitter, fitbit, semi-structured data); Data Ingestion; Data Storage Layer; Main Node, Data Node 1, Data Node 2; Statistical Analytics, Predictive Modeling, Text Analytics, Semantic Analytics, Real Time Processing; Data Visualization Layer; Audit & Validation Process; AV Health Report]

Fig. Audit & Validation process flow in a Big Data application

# 1: The first level of testing is performed as soon as the data is extracted

# 2: The second level of testing is performed after the data is loaded into the big data system
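The two levels of testing can be pictured as two checkpoints around the load step. The sketch below is an assumption-laden illustration, not the author's implementation: the row fingerprinting with SHA-256, the decision to skip the load when the first checkpoint fails, and the names `run_av`, `checkpoint_after_extract`, `checkpoint_after_load` and `faulty_load` are all hypothetical.

```python
import hashlib
import json


def fingerprint(row):
    """Stable SHA-256 digest of one record (keys sorted so field order is irrelevant)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def checkpoint_after_extract(expected_count, extracted_rows):
    """Checkpoint 1: audit the extract before anything is loaded."""
    return {
        "checkpoint": "after extract",
        "expected": expected_count,
        "actual": len(extracted_rows),
        "passed": expected_count == len(extracted_rows),
    }


def checkpoint_after_load(extracted_rows, loaded_rows):
    """Checkpoint 2: confirm that what was loaded matches what was extracted."""
    sent = {fingerprint(r) for r in extracted_rows}
    stored = {fingerprint(r) for r in loaded_rows}
    return {
        "checkpoint": "after load",
        "not_loaded": len(sent - stored),
        "altered_or_extra": len(stored - sent),
        "passed": sent == stored,
    }


def run_av(expected_count, extracted_rows, load):
    """Run both checkpoints; the load is skipped when the first one fails (an assumption)."""
    report = [checkpoint_after_extract(expected_count, extracted_rows)]
    if report[0]["passed"]:
        loaded_rows = load(extracted_rows)
        report.append(checkpoint_after_load(extracted_rows, loaded_rows))
    return report


def faulty_load(rows):
    """Hypothetical loader that silently truncates a field of the last record."""
    loaded = [dict(r) for r in rows]
    loaded[-1]["city"] = loaded[-1]["city"][:3]
    return loaded


rows = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Hyderabad"}]
report = run_av(expected_count=2, extracted_rows=rows, load=faulty_load)
for entry in report:
    print(entry)
```

With the sample loader, the first checkpoint passes and the second one fails because one record was altered during the load. The resulting list of entries is a simple stand-in for the kind of health report that the process flow figure mentions.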


Relevance of Audit & Validation process

“You can’t manage what you don’t measure.” – Peter Drucker

For organizations handling large amounts of big data, managers can now measure, and know, radically more about their businesses, and directly translate that knowledge into improved decision making and performance.

When a huge amount of organizational data is moved, a quality check to ensure that the data has moved as expected from the source systems to the target big data application becomes imperative. Among the different testing techniques for big data, which include performance and functional testing, data quality testing is growing in importance, and it can be achieved most effectively by building a well-constructed Audit and Validation process.

The Audit and Validation approach provides a check point to ensure that the data loaded into big data systems such as Hadoop is accurate and consistent with the data sent from the different source systems.

Fig. Relevance of Audit & Validation process

[Figure labels: Heterogeneous Data; Audit & Validation; Refined & Validated Data]


Case Study

Netflix, the world’s leading internet television network, uses big data analytics (Amazon Kinesis System) to analyze billions of bytes of data across more than 150,000 application instances daily in real time, enabling it to optimize user experience, reduce costs, and improve application resilience.

Netflix is said to account for one third of peak-time internet traffic in the USA. Data from users is collected and monitored in an attempt to understand viewing habits. But its data isn’t just “big” in the literal sense. It is the combination of this data with cutting-edge analytical techniques that makes Netflix a true Big Data company. The key to Netflix’s success has always been to predict what its customers will enjoy watching. Big Data analytics is the fuel that fires the “recommendation engines” designed to serve this purpose.

Netflix uses big data analytics for:

• Predicting viewing habits

• Improving search quality

• Finding the next smash hit series

• Ensuring a high-quality experience for end users

• Recommendation engines

• Improved ratings

Netflix has made use of Audit and Validation techniques to ensure that all data from various sources, such as devices and program searches, is collected and loaded into the target systems. The availability of such data ensures correct data analytics for predicting user viewing habits and making recommendations based on users’ search criteria.


Benefits of Audit & Validation

Audit and validation implementation provides the following capabilities to the testing team:

Fig. Benefits of Audit & Validation

• Independent validation ensures that the correct number of records is available in target big data systems for analysis

• Ensures data quality

• Results in accurate data analysis. In doing so, it results in more data-driven, customer-centric marketing, which provides the opportunity to deliver more targeted messages and develop a one-to-one relationship with customers.

• Allows organizations to perform risk analysis

• Reduces maintenance costs, as the massive amount of data available for analysis allows organizations to spot issues and predict when they might occur. This results in a much more cost-effective replacement strategy for the utility and less downtime.


Conclusion

All in all, Big Data testing has much prominence for today’s businesses. If the right test strategies are embraced and best practices are followed, defects can be identified in the early stages and overall testing costs can be reduced while achieving high Big Data quality. The Audit and Validation process empowers testing teams to accurately determine whether there are any inconsistencies in the data flowing into the system, and helps organizations take corrective measures when discrepancies are identified. A&V provides a capability to analyze astonishingly large data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and profitability. The Age of Big Data is here, and these are truly revolutionary times if both business and technology professionals continue to work together and deliver on the promise.




Author Biography

Ravi Shukla

Ravi is a software professional with 12 years’ experience in the industry, having worked primarily on data warehousing projects and in the healthcare domain. He is currently working as a Test Program Manager for a leading healthcare provider based in California, USA. He is an avid traveler and loves hiking and playing volleyball.
