Optimizing Big Data Value Across Hadoop and Relational

White Paper 10.13 EB 7873

Optimize the Business Value of All Your Enterprise Data: An Integrated Approach Incorporates Relational Databases and Apache Hadoop to Provide a Framework for the Enterprise Data Architecture

By Chad Meley, Director of eCommerce & Digital Media


Executive Summary

Few industries have evolved as quickly as data processing, thanks to the effect of Moore's Law coupled with Silicon Valley-style software innovation. So it comes as no surprise that innovations in data analysis have led to new data, new tools, and new demands to remain competitive. Market leaders in many industries are adopting these new capabilities, fast followers are on their heels, and the mainstream is not far behind.

This renaissance has affected the data warehouse in powerful ways. In the 1990s and early 2000s, the massively parallel processing (MPP) relational data warehouse was the only proven and scalable place to hold corporate memory. In the late 2000s, an explosion of new data types and enabling technologies led some to claim the demise of the traditional data warehouse. A more pragmatic view has emerged recently: a one-size-fits-all approach, whether a traditional data warehouse or Apache™ Hadoop®, is insufficient by itself in a time when datasets and usage patterns vary widely. Technology advances have expanded the options to include permutations of the data warehouse in what are referred to as built-for-purpose solutions.

Yet even seasoned practitioners who embrace multiplatform data environments still struggle to decide which technology is the best choice for each use case. By analogy, consider the transformations that have occurred in moving physical goods around the world in the past century: first cargo ships, then rail and trucks, and finally airplanes. Because of our familiarity with these modes, we know intrinsically which use cases are best for each transportation option, and nobody questions the need for all of them to exist within a global logistics framework. Knowing the value propositions and economics for each, it would be foolish for someone to say, "Why would anyone ever use an airplane to ship goods when rail is a fraction of the cost per pound?" or "Why would I ever consider using a cargo ship to move oil when I can get it to market faster by air?"

But the best fit for data platform technologies is not yet as universally understood.

This paper will not bring instant clarity to this complex subject; rather, the intent is to define a framework of capabilities and costs for various options to encourage informed dialogue that will accelerate a more comprehensive understanding in the industry.

Teradata has defined the Teradata® Unified Data Architecture™, a solution that allows the analytics renaissance to flourish while controlling costs and enabling the discovery of new analytics. As guideposts in this expansion, we have identified workloads that fit into built-for-purpose zones of activity:

~ Integrated data warehouse

~ Interactive discovery

~ Batch data processing

~ General-purpose file system

By making use of this array of analytical environments, companies can extract significant value from a broader range of data—much of which would have been discarded just a few years ago. As a result, business users can solve more high-value business problems, achieve greater operational efficiencies, and execute faster on strategic initiatives.

While the big data landscape is spawning new and innovative products at an astonishing pace, a great deal of attention continues to be focused on one of the seminal technologies that launched the big data analytics expansion: Hadoop. An open source software framework that supports the processing of large datasets in a distributed applications environment, Hadoop uses parallelism over raw files through its MapReduce framework. It has the momentum and community support that make it the most likely of the new breed of data technologies to become the dominant enterprise standard in its space.

The Teradata Unified Data Architecture

Teradata offers a hybrid enterprise data architecture that integrates Hadoop and massively parallel processing (MPP) relational database management systems (RDBMS). Known as the Teradata Unified Data Architecture™, this solution relies on input from Teradata subject-matter experts and Teradata customers who are experienced practitioners with both Hadoop and traditional data warehousing. This architecture has also been validated with leading industry analysts and provides a strong foundation for designing next-generation enterprise data architectures.

The essence of the Teradata Unified Data Architecture™ is captured in a comprehensive infographic that is intended to be a reference for database architects and strategic planners as they develop their next-generation enterprise data architectures (Figure 1). The graphic, along with the more detailed explanations in this paper, provides objective criteria for deciding which technology is best suited to particular needs within the organization.

To provide a framework for understanding the use cases, the following sections describe a number of important concepts, such as business value density (BVD), stable and evolving schemas, and query and data volumes. Many different concepts interplay within the graphic, so it is broken down here in a logical order.

Business Value Density

One of the most important concepts for understanding the Teradata Unified Data Architecture™ is BVD, defined as the amount of business relevance per gigabyte of data (Figure 2). Put another way, how many business insights can be extracted for a given amount of data? A number of factors influence BVD, including when the data was captured, the amount of detail in the data, the percentage of inaccurate or corrupt records (data hygiene), and how often the data is accessed and reused (see table).

Before the big data revolution, organizations established clear guidelines to determine what data would be captured and how long it would be retained. As a result, only the dense data (high BVD) was retained. Lower-BVD data was discarded, a practice compounded by the absence of identified use cases and tools to exploit it.

Factors Affecting Business Value Density

Data Parameter    High BVD     Low BVD
Age               Recent       Older
Form              Modeled      Raw
Hygiene           Clean        Raw
Access            Frequent     Rare
Reuse             Frequent     Rare
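The factors in the table are qualitative, but the idea of "relevance per gigabyte" can be made concrete with a rough score. The following Python sketch is purely illustrative and hypothetical: the DatasetProfile fields, the weights in bvd_score, and the example numbers are assumptions chosen only to show how a recent, modeled, frequently reused orders dataset scores far higher per gigabyte than a large archive of raw web logs.

```python
# Toy illustration of "business value density" as relevance per gigabyte.
# The profile fields, weights, and example numbers are invented for this
# sketch; they are not part of any Teradata formula.
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    name: str
    size_gb: float            # physical size of the dataset
    queries_per_month: int    # how often the data is accessed
    reuse_groups: int         # functional groups (marketing, finance, ...) reusing it
    pct_clean_records: float  # data hygiene, 0.0 to 1.0
    is_modeled: bool          # modeled form vs. raw files

def bvd_score(p: DatasetProfile) -> float:
    """Rough score: estimated business relevance divided by gigabytes."""
    relevance = (
        p.queries_per_month
        + 50.0 * p.reuse_groups
        + 100.0 * p.pct_clean_records
        + (100.0 if p.is_modeled else 0.0)
    )
    return relevance / max(p.size_gb, 1.0)

orders = DatasetProfile("orders_last_90_days", 500, 20_000, 4, 0.98, True)
weblogs = DatasetProfile("raw_weblogs_5_years", 200_000, 300, 1, 0.70, False)

for d in (orders, weblogs):
    print(f"{d.name}: BVD score ~ {bvd_score(d):.2f}")
```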

The big data movement has brought a fundamental shift in data capture, retention, and processing philosophies. Declining storage costs and file-based data capture and processing now allow enterprises to capture and retain most, if not all, of the information generated by business activities. Why capture so much lower-BVD data? Because low BVD does not mean no value. In fact, many organizations are discovering that sparse data that was routinely discarded not so long ago now holds tremendous potential business value—but only if it can be accessed efficiently.

Figure 1. The Teradata Unified Data Architecture. The infographic plots enterprise data by business value density (the ratio of business relevance to the size of the data), data volume, query volume, cross-functional reuse, and schema type (stable, evolving, no schema). It places the four built-for-purpose zones (data warehouse, interactive discovery, batch data processing, and general-purpose file system) on either side of an RDBMS/Hadoop partition and rates each zone's relative cost (low/medium/high) for hardware/software, development/maintenance, usage, and resource consumption. The higher the business value density, the more it makes sense to manage the data using relational techniques; the lower the business value density, the more it makes sense to manage it using Hadoop.


To illustrate the concept of BVD, consider a dataset made up of cleansed and packaged online order information for a given time period, such as the previous three months. This dataset is relatively small and yet highly valuable to business users in operations, marketing, finance, and other functional areas. This order data is considered to have high BVD; in other words, it contains a high level of useful business insights per gigabyte.

In contrast, imagine capturing web log data representing every click on the company's web site over the past five years. Compared to the order data described previously, this dataset is significantly larger. While there is potentially a treasure trove of business insights within this dataset, the number of people and applications interrogating it in its raw form would be smaller than for the cleansed and packaged orders. So this raw web site data has sparse BVD, but it is still highly valuable.

Figure 2. Business value density, the ratio of business relevance to the size of the data. In the graphic, data volume is represented by the thickness of the circular band: it is greatest at point A and decreases counterclockwise around the circle, while BVD is lowest at point A and increases around the circle. Sparse data is represented by light rows with a few dark blue squares at point A; dense data is represented by darker blue rows at point B.


Stable and Evolving Schemas

The ability to handle evolving schemas is an important capability. In contrast to stable schemas that change slowly (e.g., order records and product information), evolving schemas change continually—think of new columns being added frequently, as in web log data (Figure 3).

All data has structure. Instead of the oft-used (and misused) terms structured, semi-structured, and unstructured, the more useful concepts are stable and evolving schemas. For example, even though XML and JSON formats are often classified as semi-structured, the schema for an individual event such as an order checkout can be highly stable over long periods of time. As a result, this information can be easily accessed using standard ETL (extract, transform, and load) tools with little maintenance overhead. Conversely, XML and JSON feeds frequently—and unexpectedly, from the viewpoint of a data platform engineer—capture a new event type such as "hovered over a particular image with the pointer." This scenario describes an evolving schema, which is particularly challenging for traditional relational tools.
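To make the distinction concrete, here is a small, hypothetical Python sketch (the event names and fields are invented, not taken from the paper). A rigid, schema-on-write style loader that expects a fixed set of columns fails as soon as a new event type appears, while a schema-on-read style reader simply keeps whatever fields arrive and leaves interpretation to query time.

```python
import json

# Hypothetical clickstream events; the third record carries a new event type
# and a new field ("hover_ms") that the original schema never anticipated.
raw_events = [
    '{"event": "checkout", "order_id": 1001, "amount": 59.90}',
    '{"event": "checkout", "order_id": 1002, "amount": 12.50}',
    '{"event": "image_hover", "sku": "RL-RED-6", "hover_ms": 840}',
]

STABLE_COLUMNS = ["event", "order_id", "amount"]  # the "stable schema" view

def load_stable(record: dict) -> tuple:
    # Rigid, schema-on-write style: every expected column must be present.
    return tuple(record[col] for col in STABLE_COLUMNS)  # KeyError on new events

def load_schema_on_read(record: dict) -> dict:
    # Tolerant, schema-on-read style: keep everything, interpret at query time.
    return record

for line in raw_events:
    rec = json.loads(line)
    try:
        print("stable load:", load_stable(rec))
    except KeyError as missing:
        print(f"stable load failed, unexpected shape (missing {missing}); "
              f"kept raw: {load_schema_on_read(rec)}")
```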

Figure 3. Stable and evolving schemas. Stable-schema data is the blue section of the band; note that the areas of high BVD are composed entirely of stable schemas. Evolving-schema data is the gray section; while much of the data volume corresponds to evolving schemas, its BVD is fairly low compared to the stable schemas.

No-Schema Data

As noted previously, all data has structure, and therefore what is frequently seen as unstructured data should be reclassified as no-schema data (Figure 4). What is interesting about no-schema data with respect to analytics is that it has analytical value in unanticipated ways. In fact, a skilled data scientist can draw substantial insights from no-schema data. Here are two real-life scenarios:

~ An online retailer is boosting revenue through image analysis. In a typical case, a merchant is marketing a red dress and supplies the search terms size 6 and Ralph Lauren along with an image of the dress itself. Using sophisticated image-analysis software, the retailer can, with a high degree of confidence, attach additional descriptors such as A-line and cardinal red, which makes searching more accurate, benefiting both merchants and buyers.

~ An innovative insurance company is using audio recordings of phone conversations between customer service representatives and policyholders to determine the likelihood of a fraudulent claim based on signals derived from voice inflections.

In both examples, the companies had made the decision to capture the data before they had a complete idea of how to use it. Business users developed the innovative uses after they had become familiar with the data structure and had access to tools to extract the hidden value.

Figure 4. No-schema data, represented by the magenta band between the evolving and stable schemas.


Usage and Query Volume

By definition, there is a strong correlation between BVD and usage volume. For example, if a company captures 100 petabytes of data, 80 percent of all queries would be addressed to just 20 petabytes—the high-BVD portion of the dataset (Figure 5).

Usage volume includes two primary access methods: ad hoc and scheduled queries. Ad hoc queries are usually initiated by the person who needs the information, using SQL interfaces, analytical tools, and business applications. Scheduled queries are set up and monitored by business analysts or data platform engineers. Applicable tools include SQL interfaces for regularly scheduled reports, automated business applications, and low-level programming scripts for scheduled analytics and data transformations.

A significant and growing portion of usage volume is due to applications such as campaign management, ad serving, search, and supply chain management that depend on insights from the data to drive more intelligent decisions.

Figure 5. Usage and query volume. In the graphic, usage volume is indicated by the amplitude of the spirals outside the band; note the inverse correlation between data volume and usage volume, with the high-BVD portion of the data drawing the bulk of the queries. Cross-functional reuse is shown in three colors representing the percentage of the data that is reused by groups such as marketing, customer service, and finance; these groups typically need access to the same high-BVD data, such as recent orders.


RDBMS or Hadoop

Building on the core concepts of BVD; query volume; and stable, evolving, and no-schema data, we can draw a line showing which data is most appropriate for an RDBMS or Hadoop and give some background about that particular placement.

In general, the higher the BVD, the more it makes sense to use relational techniques; the lower the BVD, the more likely Hadoop is the better choice. While the graphic (Figure 6) draws the line arbitrarily through the equator, every organization will have its own threshold based on its information culture and maturity. Also note that no-schema data resides solely within Hadoop because relational constructs are often less suited to managing this type of data.

RDBMS technology has clear advantages over Hadoop in terms of response time, throughput, and security, which make it more appropriate for higher-BVD data that has greater concurrency and more stringent security requirements, given the shared nature of the data.

These differentiators are due to the following:

~ Mature cost-based optimizers—When a query is submitted, the optimizer evaluates various execution plans and estimates the resource consumption for each. The optimizer then selects the plan that minimizes resource usage and thus maximizes throughput.

~ Indexing—RDBMS software offers a multitude of robust indexes with stored statistics to facilitate access, thus shortening response times.

~ Advanced partitioning—Today's RDBMS products feature a number of advanced partitioning methods and criteria to optimize database performance and improve manageability.

~ Workload management—RDBMS technology addresses the throughput problem that occurs when many queries are executing concurrently. The workload manager prioritizes the query queue so that short queries are executed quickly and long queries receive adequate resources to avoid excessively long execution times. Filters and throttles regulate database activity by rejecting or limiting requests. (A filter causes specific logon and query requests to be rejected, while a throttle limits the number of active sessions, query requests, or load utilities on the database.)

~ Extensive security features—Relational databases offer sophisticated row- and column-level security, which enables role-based security. They also include fine-grain security features such as authentication options, security roles, directory integration, and encryption, versus the more coarse-grain equivalents within Hadoop.

Cost Factors

Along with technological capabilities, cost drives the design of the enterprise data architecture. The Teradata Unified Data Architecture™ rates the relative cost of use cases using a four-factor cost analysis (a worked reading of these four factors follows the list below):

~ Hardware and software investment—The costs associated with the acquisition of the hardware and software.

~ Development and maintenance—The ongoing cost of acquiring data and packaging it for consumption, as well as the costs of implementing systemwide changes such as software upgrades and changes to code and scripts running in the environment.

~ Usage—The costs of querying and analyzing the data to derive actionable insights, based primarily on market compensation for required skills, the time to author and alter scripts and code, and wait time as it relates to productivity. These costs are often spread across multiple departments and budgets and therefore often go unnoticed; however, they are very real for business initiatives that leverage data and analytics for strategic advantage.

~ Resource consumption—The extent to which CPU, I/O, and disk resources are utilized over time. When system resources are close to full utilization, the organization is achieving the maximum value for its investment in hardware, and resource consumption costs are therefore low; underutilized systems waste resources and drive up costs without adding value, and would therefore rate medium or high.
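The sketch below is only an illustration of how to read this framework: it maps the low/medium/high ratings that appear in the cost analyses later in this paper onto the numbers 1 to 3 and prints each use case's overall cost profile. The numeric mapping and the equal weighting of factors are assumptions of this sketch, not part of the Teradata methodology.

```python
# Hypothetical reading of the four-factor cost framework.
# Ratings are taken from the per-use-case cost analyses in this paper;
# the 1/2/3 encoding and equal weighting are assumptions of this sketch.
RATING = {"low": 1, "med": 2, "high": 3}

FACTORS = ("hardware/software", "development/maintenance",
           "usage", "resource consumption")

cost_profiles = {
    "integrated data warehouse":   ("high", "med", "low", "low"),
    "interactive discovery":       ("med", "low", "low", "low"),
    "batch data processing":       ("low", "med", "med", "high"),
    "general-purpose file system": ("low", "high", "high", "high"),
}

for use_case, ratings in cost_profiles.items():
    total = sum(RATING[r] for r in ratings)
    # Pick the first factor carrying the highest rating as the dominant cost.
    dominant = FACTORS[ratings.index(max(ratings, key=RATING.get))]
    print(f"{use_case:28s} total={total:2d}  dominant cost: {dominant}")
```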

Figure 6. The RDBMS-Hadoop partition. The horizontal line partitions the business value density space between high-BVD data that can be effectively managed with an RDBMS and low-BVD data that is best suited to Hadoop; the partitioning point (the intersection of the line and the data curve) is unique to each organization and may change over time. The two arcs within the data circle represent key advantages of the RDBMS: fast response times/throughput and fine-grain security.


Use Case Overview

While there are a large number of possible data scenarios in the enterprise world today, the majority fall into four use cases:

~ Integrated data warehouse—Provides an unambiguous view of information for timely and accurate decision making

~ Interactive discovery—Addresses the challenge of exploring large datasets with less-defined or evolving schemas

~ Batch data processing—Transforms data and performs analytics against larger datasets when storage costs are valued over interactive response times and throughput

~ General-purpose file system—Ingests and stores raw data with no transformation, making this use case an economical online archive for the lowest-BVD data

Each use case is described in more detail in the following sections.

Integrated Data Warehouse

The association of the relational database and big data occurs in the integrated data warehouse (Figure 7). The integrated data warehouse is the overwhelming choice for the important data that drives organizational decision making, where a single, accurate, timely, and unambiguous version of the information is required.

The integrated data warehouse uses a well-defined schema to offer a single view of the business, enable easy data access, and ensure consistent results across the entire enterprise. It also provides a shared source for analytics across multiple departments within the enterprise. Data is loaded once and used many times, without the need for the user to repeatedly define and execute agreed-upon transformation rules such as the definitions of customer, order, and lifetime value score. The integrated data warehouse supports ANSI SQL as well as many mature third-party applications. Information in the integrated data warehouse is scalable and can be accessed by knowledge workers and business analysts across the enterprise.

The integrated data warehouse is the tried-and-true gold standard for high-BVD data, supporting cross-functional reuse and the largest number of business users with a full set of features and benefits unmatched by other approaches to data management.
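To make "load once, use many times" concrete, here is a minimal, hypothetical example (the database, view, and column names are invented): a lifetime-value definition is agreed upon once, expressed as ANSI SQL inside the warehouse, and every department then queries the same view instead of re-implementing its own transformation rules.

```python
# Hypothetical ANSI SQL illustrating a shared, agreed-upon definition in the
# integrated data warehouse. Object and column names are invented for this sketch.
CUSTOMER_LTV_VIEW = """
CREATE VIEW analytics.customer_ltv AS
SELECT c.customer_id,
       SUM(o.order_amount) AS lifetime_value,
       COUNT(o.order_id)   AS order_count,
       MAX(o.order_date)   AS last_order_date
FROM   edw.customer c
JOIN   edw.orders   o ON o.customer_id = c.customer_id
GROUP  BY c.customer_id
"""

# Marketing, finance, and customer service all reuse the same definition:
MARKETING_QUERY = (
    "SELECT customer_id FROM analytics.customer_ltv WHERE lifetime_value > 1000"
)

print(CUSTOMER_LTV_VIEW)
print(MARKETING_QUERY)
```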

Figure 7. Integrated data warehouse. The infographic from Figure 1 with the data warehouse zone highlighted on the RDBMS side of the partition, covering stable-schema, high-BVD, high-query-volume data. Its characteristics: a single view of your business, a shared source for analytics, load once/use many times, SQL and third-party applications, knowledge workers and analysts, data governance, and data quality/integrity. Relative costs: hardware/software high, development/maintenance medium, usage low, resource consumption low.


Cost Analysis

~ Hardware and software investment: High—Software development for the commercial engineering effort required to deliver the differentiated benefits described previously, as well as an optimized, integrated hardware platform, warrants a substantial initial investment.

~ Development and maintenance expense: Medium—Realizing the maximum benefit of clean, integrated, easy-to-consume information requires data modeling and ETL operations, which drive up development costs. However, the productivity tools and people skills for developing and maintaining a relational environment are readily available in the marketplace, mitigating the development costs. Also, the data warehouse has diminishing incremental development costs because it builds on existing data and transformation rules and facilitates data reuse.

~ Usage expense: Low—Users can navigate the enterprise data and create complex queries in SQL that return results quickly, minimizing the need for expensive programmers and reducing unproductive wait times. This benefit is a result of the costs incurred in development and maintenance as described previously.

~ Resource consumption: Low—Tight vertical integration across the stack enables optimal utilization of system CPU and I/O resources, so that the maximum amount of throughput can be achieved within an environment bounded by CPU and I/O.

Interactive Discovery

Interactive discovery platforms address the challenge of exploring large datasets with less-defined or evolving schemas by adapting methodologies that originate from the Hadoop ecosystem within an RDBMS (Figure 8). Some of the inherent advantages of RDBMS technology are particularly fast response times and throughput, as well as the ease of use stemming from ANSI SQL compliance. Interactive discovery requires less time spent on data governance, data quality, and data integrity because users are looking for new insights in advance of the rigor required for more formal actioning of the data and insights. The fast response times enable accelerated insight discovery, and the ANSI SQL interface democratizes the data across the widest possible user base.

This approach combines schema-on-read, MapReduce, and flexible programming languages with RDBMS features such as ANSI SQL support, low latency, fine-grain security, data quality, and reliability. Interactive discovery has cost and flexibility advantages over the integrated data warehouse, but at the expense of concurrency (usage volume) and governance control.

Figure 8. Interactive discovery. The infographic from Figure 1 with the interactive discovery zone highlighted on the RDBMS side of the partition. Its characteristics: accommodates both stable and evolving schemas, does not require extensive data modeling, supports SQL, NoSQL, MapReduce, and statistical functions along with prepackaged analytic modules, and serves analysts and data scientists. Relative costs: hardware/software medium, development/maintenance low, usage low, resource consumption low.


A key reason to use interactive discovery is analytical flexibility (also applicable to Hadoop), which is based on these features:

~ Schema-on-read—Structure is imposed when the data is read, unlike the schema-on-write approach of the integrated data warehouse. This feature allows complete freedom to transform and manipulate the data at a later time. The use cases in the Hadoop hemisphere also use schema-on-read techniques.

~ Low-level programming—Languages such as Java and Python can be used to construct complex queries and even perform row-over-row comparisons, both of which are extremely challenging with SQL. This kind of processing is most commonly needed for row-over-row comparisons such as time-series and pathing analysis.

Interactive discovery accommodates both stable and evolving schemas without extensive data modeling. It leverages SQL, NoSQL, MapReduce, and statistical functions in a single analytical process and incorporates prepackaged analytical modules. NoSQL and MapReduce are particularly useful for analyses such as time series and social graphs that require complex processing beyond the capabilities of ANSI SQL. As a result of ANSI SQL compliance and a myriad of prebuilt MapReduce analytical functions that can be incorporated into an ANSI SQL script, data scientists as well as business analysts can use interactive discovery without additional training.
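As a concrete illustration of the row-over-row, pathing-style processing described above, here is a small hypothetical sessionization written in plain Python (the field names and the 30-minute gap rule are assumptions of this sketch). In an interactive discovery platform the same logic is typically invoked from SQL through a prebuilt analytic function; in batch processing it would run as a MapReduce, Pig, or Hive job over raw web logs.

```python
from datetime import datetime, timedelta

# Hypothetical clickstream rows (user, timestamp, page); values are invented.
clicks = [
    ("u1", "2013-10-01 09:00:05", "/home"),
    ("u1", "2013-10-01 09:01:10", "/search"),
    ("u1", "2013-10-01 11:45:00", "/home"),      # > 30 min gap: new session
    ("u2", "2013-10-01 09:00:30", "/product/42"),
]

SESSION_GAP = timedelta(minutes=30)

def sessionize(rows):
    """Row-over-row comparison per user: start a new session when the gap
    since the previous click exceeds SESSION_GAP."""
    last_seen, session_id, out = {}, {}, []
    for user, ts, page in sorted(rows, key=lambda r: (r[0], r[1])):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if user not in last_seen or t - last_seen[user] > SESSION_GAP:
            session_id[user] = session_id.get(user, 0) + 1
        last_seen[user] = t
        out.append((user, session_id[user], ts, page))
    return out

for row in sessionize(clicks):
    print(row)
```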

Cost Analysis

~ Hardware and software investment: Medium—Interactive discovery platforms are less expensive than the integrated data warehouse.

~ Development and maintenance: Low—Interactive discovery uses light modeling techniques, which minimize efforts for ETL and data modeling.

~ Usage: Low—SQL is easy to use, reducing the user time required to generate queries. Built-in analytical functions reduce hundreds of lines of code to single statements. The performance characteristics of an RDBMS reduce unproductive wait times.

~ Resource consumption: Low—Commercial RDBMS software is optimized for efficient utilization of resources.

Batch Data Processing

Unlike the integrated data warehouse and interactive discovery platforms, batch processing lies within the Hadoop sphere (Figure 9). A key difference between batch data processing and interactive discovery is that batch processing involves no physical data movement as part of the transformation into a more usable model. Light data modeling is applied against the raw data files to facilitate more intuitive usage. The nature of the file system and the ability to flexibly manipulate data make batch processing an ideal environment for refining, transforming, and cleansing data, as well as performing analytics against larger datasets when storage costs are valued over fast response times and throughput.

Figure 9. Batch data processing. The infographic from Figure 1 with the batch data processing zone highlighted on the Hadoop side of the partition. Its characteristics: no transformations of data required, scripting and declarative languages, analysis against raw files, refinement, transformation, and cleansing, serving analysts and data scientists. Relative costs: hardware/software low, development/maintenance medium, usage medium, resource consumption high.


Since the underlying data is raw, the task of transforming the data must be performed when the query is processed. This is immensely valuable in that it provides a high degree of flexibility for the user.

Batch processing incorporates a wide range of declarative language processing using Pig, Hive, and other emerging access tools in the Hadoop ecosystem. These tools are especially valuable for analyzing low-BVD data when query response time is not as critical, the logic applied to the data is complex, and full scans of the data are required—for example, sessionizing web log data, counting events, and executing complex algorithms. This approach is ideal for analysts, developers, and data scientists.
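As a sketch of what such declarative processing can look like, here is a hypothetical HiveQL-style example, shown here as a Python string (the table, column, and path names are invented): an external table is declared over raw, tab-delimited web log files, and a full-scan aggregation counts events by type, work that would otherwise require a hand-written MapReduce job.

```python
# Hypothetical HiveQL over raw files; names and paths are invented for this sketch.
# The declarative query replaces hand-written MapReduce for full-scan work such
# as counting events in web logs.
EVENT_COUNTS_HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_web_logs (
  event_time STRING,
  user_id    STRING,
  event_type STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/data/raw/web_logs';

SELECT event_type, COUNT(*) AS events
FROM   raw_web_logs
WHERE  event_time >= '2013-09-01'
GROUP  BY event_type;
"""

print(EVENT_COUNTS_HQL)
```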

Cost Analysis

~ Hardware and software investment: Low—Batch processing is available through open source software and runs on commodity hardware.

~ Development and maintenance: Medium—The skills required to develop and maintain the Hadoop environment are relatively scarce in the marketplace, driving up labor costs. Optimizing code in the environment is primarily a burden on the development team.

~ Usage: Medium—Unlike the previous use cases, which are accessible to SQL users, batch processing requires new skills for authoring queries and is not compatible with the full breadth of features and functionality found in modern business intelligence tools. In addition, query run times are longer, resulting in wait times that lower productivity.

~ Resource consumption: High—In general, Hadoop software makes less efficient use of hardware resources than an RDBMS.

General-Purpose File System

As used in this context, the general-purpose file system refers to the Hadoop Distributed File System (HDFS) and flexible programming languages (Figure 10). Raw data is ingested and stored with no transformation, making this use case an economical online archive for the lowest-BVD data. Hadoop allows data scientists and engineers to apply flexible low-level programming languages such as Java, Python, and C++ against the largest datasets without any up-front characterization of the data.

Cost Analysis

~ Hardware and software investment: Low—Like batch processing, this approach benefits from open source software and commodity hardware.

~ Development and maintenance: High—Working effectively in this environment requires not only proficiency with low-level programming languages but also a working understanding of Linux and the network configuration. The lack of mature development tools and applications and the premium salaries commanded by skilled scientists and engineers all contribute to costs.

~ Usage: High—Data processing in this environment is essentially a development task, requiring the same skill set and incurring the same labor costs as described previously under development and maintenance.

~ Resource consumption: High—Hadoop is less efficient than RDBMS software in utilizing CPU and I/O processing cycles.

Figure 10. General-purpose file system. The infographic from Figure 1 with the general-purpose file system zone highlighted on the Hadoop side of the partition, covering the lowest-BVD data, including no-schema data. Its characteristics: flexible programming languages (Java, Python, C++, etc.), an economic online archive, a place to land and source operational data, serving data scientists and engineers. Relative costs: hardware/software low, development/maintenance high, usage high, resource consumption high.


Conclusion

Database technology is no longer a one-size-fits-all world: maximizing the business value of large volumes of enterprise data requires the right tool for the right job. This paper is intended to help IT architects and data platform stakeholders understand how to map available technologies—in particular, relational databases and big data frameworks such as Hadoop—to each use case. Integrating these and other tools into a single, unified data platform gives data scientists, business analysts, and other users powerful new capabilities to streamline workflows, realize operational efficiencies, and drive competitive advantage—exactly the value proposition of the Teradata Unified Data Architecture™.

The integrated data warehouse is most appropriate for the highest-BVD data, where demands for the data across the enterprise are the greatest. When deployed optimally, it strikes the right balance of hardware and software costs against the benefits of lower development, usage, and resource consumption costs.

Interactive discovery is best for capturing and analyzing both stable- and evolving-schema data through traditional set-based or advanced procedural processing when there is a premium on fast response times or on ease of access to better democratize the data.

Batch data processing is ideal for analyzing and transforming any kind of data through procedural processing by end users who possess either low-level programming or higher-order declarative language skills, and where fast response times and throughput are not essential.

The general-purpose file system offers the greatest degree of flexibility and the lowest storage costs for engineers and data scientists with the skills and patience to navigate all enterprise data.

For more information, visit www.teradata.com.

10000 Innovation Drive, Dayton, OH 45342    teradata.com

Unified Data Architecture is a trademark, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Apache is a trademark, and Hadoop is a registered trademark of the Apache Software Foundation. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or teradata.com for more information.

Copyright © 2013 by Teradata Corporation. All rights reserved. Produced in USA.

EB-7873 > 1013