User’s Guide to the Emerging Database Landscape: Row vs. Columnar vs. NoSQL
July 2011

WHITEPAPER

Overview

Businesses today are challenged by the ongoing explosion of data. Organizations capture, track, analyze and store more information than ever before—everything from mass quantities of transactional, online and mobile data, to growing amounts of machine-generated data. In fact, machine-generated data represents the fastest-growing category of Big Data.

How can you effectively address the impact of data overload on application performance, speed and reliability? Where do newer technologies such as columnar databases and NoSQL come into play?

The first thing to recognize is that, in the new data management paradigm, one size will not fit all data needs. The right IT solution may encompass one, two or even three technologies working together.

Figuring out which of the several technologies (and even subvariants of these technologies) meets your needs—while also fitting your IT staffing and budget parameters—is no small issue. We hope this User Guide will help clarify which data management approach is best for which of your company’s data challenges.

INFOBRIGHT

Corporate Headquarters

47 Colborne Street, Suite 403

Toronto, Ontario M5E1P8 Canada

Tel. 416 596 2483

Toll Free 877 596 2483

[email protected]

www.infobright.com

Sales:

North America

Tel. 312-924-1695

EMEA

Tel. +353 (0)87 743 7107



Today’s Top Data-Management Challenge

Businesses today are challenged by the ongoing explosion of data. Gartner is predicting data growth will exceed 650% over the next five years.1 Organizations capture, track, analyze and store everything from mass quantities of transactional, online and mobile data, to growing amounts of machine-generated data. In fact, machine-generated data, including sources ranging from web, telecom network and call-detail records, to data from online gaming, social networks, sensors, computer logs, satellites, financial transaction feeds and more, represents the fastest-growing category of Big Data. High-volume web sites can generate billions of data entries every month.

As volumes expand into the tens of terabytes and even the petabyte range, IT departments are being pushed by end users to provide enhanced analytics and reporting against these ever-increasing volumes of data. Managers need to be able to quickly understand this information, but, all too often, extracting useful intelligence can be like finding the proverbial ‘needle in the haystack.’

Using traditional row-based databases that were not designed to analyze this amount of data, IT managers typically try several approaches to mitigate plummeting response times. Unfortunately, each method has a significant adverse impact on analytic effectiveness and/or costs. A recent survey from Unisphere Research2 highlighted the most typical approaches:

• Tuning or upgrading the existing database, the most common response, translates into significantly increased costs, either through admin costs or licensing fees
• Upgrading hardware processing capabilities increases overall TCO
• Expanding storage systems increases overall costs in direct proportion to the growth of data
• Archiving old data translates into less data your analysts and business users can analyze at any one time. Frequently, this results in less comprehensive analysis of user patterns—and can greatly impact forward-looking analytic conclusions
• Upgrading network infrastructure leads to both increased costs and, potentially, more complex network configurations.

So, if throwing money at your database problem doesn’t really solve the issues, what should you do? How can you effectively address the impact of data overload on application performance, speed and reliability? Where do newer technologies such as columnar databases and NoSQL come into play?

1 Gartner IT Infrastructure, Operations & Management Summit 2009 Post Event Brief.

2 “Keeping Up with Ever-expanding Enterprise Data,” Joseph McKendrick, Research Analyst, Unisphere Research, October 2010.

Figure 1. Machine-Generated Data Drives Big Data


Coexistence. Not Competition.

The first thing to recognize is that, in the new data management paradigm, one size will not fit all data needs. Instead of building the one, single, ultimate database, the driving force behind the behemoth data-warehousing efforts of the last decade or so, IT managers need to identify the right technologies to solve their particular business and data problems. The right IT solution may encompass one, two or even three technologies working together. Open-source technology will coexist with proprietary software. Row-based databases will live peacefully next to columnar databases—and both will share data with NoSQL solutions.

Sounds simple, doesn’t it? Almost idyllic. Of course, there’s a bit more to it than that.

As Mike Vizard of ITBusinessEdge recently noted, “[T]here is more diversity in the database world than any time in recent memory.”3

Figuring out which of the several technologies (and even subvariants of these technologies) meets your needs—while also fitting your IT staffing and budget parameters—is no small issue. We hope this User Guide will help clarify which data management approach is best for which of your company’s data challenges.

The Ubiquity of Thinking in Rows

Organizing data in rows has been the standard approach for so long that it can seem like the only way to do it. An address list, a customer roster, and inventory information—you can just envision the neat row of fields and data going from left to right on your screen. Databases such as Oracle, MS SQL Server, DB2 and MySQL are the best known row-based databases.

Row-based databases are ubiquitous because so many of our most important business systems are transactional. Row-oriented databases are well suited for transactional environments, such as a call center where a customer’s entire record is required when their profile is retrieved and/or when fields are frequently updated. Other examples include:

• Mail merging and customized emails
• Inventory transactions
• Billing and invoicing

Where row-based databases run into trouble is when they are used to handle analytic loads against large volumes of data, especially when user queries are dynamic and ad hoc.

To see why, let’s look at a database of sales transactions with 50 days of data and 1 million rows per day. Each row has 30 columns of data. So, this database has 30 columns and 50 million rows. Say you want to see how many toasters were sold in the third week of this period. A row-based database would return 7 million rows (1 million for each day of the third week) with 30 columns for each row—or 210 million data elements. That’s a lot of data elements to crunch to find out how many toasters were sold that week. As the data set increases in size, disk I/O becomes a substantial limiting factor, since a row-oriented design forces the database to retrieve all column data for any query.

As we mentioned above, many companies try to solve this I/O problem by creating indices to optimize queries. This may work for routine reports (i.e., you always want to know how many toasters you sold in the third week of a reporting period), but there is a point of diminishing returns: load speed degrades, since indices need to be recreated as data is added. In addition, users are severely limited in their ability to quickly do ad-hoc queries (i.e., how many toasters did we sell through our first Groupon offer? Should we do it again?) that can’t depend on indices to optimize results.

Row-based Database: Transactional Powerhouse

Figure 2. Example Data Set

3 “The Rise of the Columnar Database,” Mike Vizard, IT BusinessEdge, June 14, 2011.
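The scan arithmetic in the example above can be checked directly. A trivial sketch (the figures come from the scenario in the text; the variable names are ours):

```python
# Dimensions from the worked example: 50 days of sales data,
# 1 million rows per day, 30 columns per row.
ROWS_PER_DAY = 1_000_000
DAYS_IN_WEEK = 7
COLUMNS = 30

# A row store must read every column of every row matching the date
# filter, even though the query only cares about two fields.
row_store_elements = DAYS_IN_WEEK * ROWS_PER_DAY * COLUMNS
print(row_store_elements)  # 210000000 data elements for one week
```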

Pivoting Your Perspective: Columnar Technology

Column-oriented databases allow data to be stored column-by-column rather than row-by-row. This simple pivot in perspective—looking down rather than looking across—has profound implications for analytic speed.

Column-oriented databases are better suited for analytics where, unlike transactions, only portions of each record are required. By grouping the data together this way, the database only needs to retrieve columns that are relevant to the query, greatly reducing the overall I/O.
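The pivot can be sketched in a few lines. A toy illustration with three columns instead of thirty (the field names and values are ours, invented for the sketch):

```python
# Row layout: each record carries all of its fields together.
rows = [
    {"day": 1, "product": "toaster", "units": 3},
    {"day": 1, "product": "kettle",  "units": 5},
    {"day": 2, "product": "toaster", "units": 4},
]

# Columnar layout: the same data pivoted, one list per column.
columns = {
    "day":     [1, 1, 2],
    "product": ["toaster", "kettle", "toaster"],
    "units":   [3, 5, 4],
}

# "How many toasters were sold?" touches only two columns;
# every other column can stay on disk, untouched.
sold = sum(units for product, units in
           zip(columns["product"], columns["units"])
           if product == "toaster")
print(sold)  # 7
```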

Returning to the example in the section above, we see that a columnar database would not only eliminate 43 days of data, it would also eliminate 28 columns of data. Returning only the columns for toasters and units sold, the columnar database would return only 14 million data elements, or 93% less data. By returning so much less data, columnar databases are much faster than row-based databases when analyzing large data sets.

In addition, some columnar databases (such as Infobright®) compress data at high rates because each column stores a single data type (as opposed to rows that typically contain several data types), and allow compression to be optimized for each particular data type. Row-based databases have multiple data types and a limitless range of values, making compression less efficient overall.

Read the sidebar “Infobright: Putting Intelligence in Columns” to learn how Infobright improves query speed even more, while simplifying administration and lowering costs, with its Knowledge Grid and Domain Expertise™ capabilities.
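The savings claimed above can be tallied from the same example (a trivial check; the column counts come from the scenario in the text):

```python
ROWS = 7 * 1_000_000       # the third week of data
RELEVANT_COLUMNS = 2       # the columns the query actually needs
ALL_COLUMNS = 30

columnar_elements = ROWS * RELEVANT_COLUMNS   # what a column store reads
row_elements = ROWS * ALL_COLUMNS             # what a row store reads
reduction = 1 - columnar_elements / row_elements

print(columnar_elements)   # 14000000
print(f"{reduction:.0%}")  # 93%
```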

Figure 3. Pivoting Data for Columnar View

Column-based Database: Lightning Analytics


Will the Real NoSQL Please Stand Up?

A term invented by Carlo Strozzi in 1998,4 NoSQL has been a hard term to pin down from the beginning. For one thing, while most people now translate the term to mean ‘Not Only SQL,’ there are other accepted variations. More importantly, the term refers to a broad, emerging class of non-relational database solutions.

NoSQL technologies have evolved to address specific business needs that row technologies couldn’t scale to meet and column technologies were unsuited to address. Currently, there are over 112 products or open-source projects in the NoSQL space, with each solution matching a specific business need. For example:

• Real-time data logging such as in finance or web analytics
• Web apps or any app which needs better performance without having to define columns in an RDBMS
• Storing frequently requested data for a web app

While each technology addresses different problems, they all share certain attributes: huge volumes of data and transaction rates, a distributed architecture, and often unstructured (or semi-structured) data with heavy read/write workloads. Unstructured information is typically text heavy but may contain data such as dates and other numbers as well. The resulting irregularities and ambiguities make this data unsuitable for traditional row-based or column-based structured databases.

In short, NoSQL solutions are typically beasts in terms of their data capacity, lookup speed and ability to handle streaming data, especially over highly scaled environments.

On the other hand, they generally lack a SQL interface and often come with little or no programmatic interfaces—meaning that setup and administration may require some specialized skills. In addition, NoSQL solutions can be limited in their ability to execute complex queries, restricting the types of actionable analytics they can deliver. For example, queries that JOIN two tables or employ nested SELECTs are typically not possible using these technologies.

Below, we go a bit deeper into each of the three main NoSQL subvariants: key-value stores, document stores and column stores.

Infobright: Putting Intelligence in Columns

Infobright’s high performance analytic database is designed to handle business-driven queries on large volumes of data—without IT intervention. Easy to implement and manage, Infobright provides the answers your business users need at a price you can afford.

How is this achieved?

Infobright combines a columnar database with intelligence we call the Knowledge Grid to deliver fast query response with unmatched administrative simplicity: no indexes, no data partitioning, and no manual tuning.

Infobright uses intelligence, not hardware, to drive query performance:
• Creates information about the data upon load, automatically
• Uses this to eliminate or reduce the need to access data to respond to a query
• The less data that needs to be accessed, the faster the response

What this means to customers:
• Self-managing: 90% less administrative effort
• Low-cost: More than 50% less than alternative solutions
• Scalable, high-performance: Up to 50 TB using a single industry standard server
• Fast queries: Ad-hoc queries are as fast as anticipated queries, so users have total flexibility
• Compression: Data compression of 10:1 to 40:1 that means a lot less storage is needed

Infobright offers an open source and a commercial edition of its software. Both products are designed to handle data volumes up to 50 TB.

Try it yourself—download our Community Edition at www.infobright.org, or a free trial of our Enterprise Edition at www.infobright.com.

4 Wikipedia, http://en.wikipedia.org/wiki/NoSQL


NoSQL Database: Data Beasts

Key-value Store

A key-value store does what it sounds like it does: values are stored and indexed by a key, usually built on a hash or tree data structure.5 Key-value pairs are widely used in tables and configuration files. Key-value stores allow the application to store its data without predefining a schema—there is no need for a fixed data model.

In a key-value store, for example, a record may look like:

12345 => “img456.jpg,checkout.js,20”

Companies turn to key-value stores when they require the functionality of key-values but do not require the technology overhead of a traditional RDBMS system, either because they require more efficient, cost-effective scalability or because they are working with unstructured or semi-structured data. Key-value stores are great for unstructured data centered on a single object, and where data is stored in memory with some persistent backup. Consequently, they are typically used as a cache for data frequently requested by web applications such as online shopping carts or social-media sites. As these web pages are created on the fly, the static components are quickly retrieved and served up to the user.
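The storage contract of a key-value store can be sketched with a hash map. A toy in-memory illustration (the class and method names are ours, not any particular product’s API; real products add persistence, replication and eviction):

```python
# A toy in-memory key-value store: opaque values indexed by key.
class KeyValueStore:
    def __init__(self):
        self._data = {}                  # hash-based index on the key

    def put(self, key, value):
        self._data[key] = value          # no schema: any value shape goes

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put(12345, "img456.jpg,checkout.js,20")  # the record from above
print(store.get(12345))                        # img456.jpg,checkout.js,20
```

Lookups go straight through the hash index, which is why these stores are so fast as caches; the price is that the value is opaque, so there is no way to query inside it.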

Document Store

As with a key-value store, companies turn to NoSQL document stores when they are dealing with huge volumes of data and transactions requiring massive horizontal scaling or sharding. And, similarly, there is no need for a pre-set schema. However, the data in document stores can contain several keys, so queries aren’t as limited as they are in key-value stores. For example, in a document data store a record could read:

“id” => 12345,
“name” => “Jane”,
“age” => 22,
“email” => “[email protected]”

While multiple keys increase the types of possible queries, the data stored in these ‘documents’ does not need to be predefined and can change from document to document. The tradeoff for the more complex query options is speed: queries with a key-value store are much simpler and often faster.

Document stores are often deployed for web-traffic analysis, user-behavior/action analysis, or log-file analysis in real time. However, while document stores allow more query capabilities than key-value stores, there are still limitations given the non-relational basis of the document-store database.
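The multi-key difference can be sketched the same way: each ‘document’ is a bag of named fields, and a query can filter on any of them (a toy illustration; the field values, including the placeholder email address, are ours):

```python
# A toy document store: each document carries named fields, so queries
# can filter on any field, not just a primary key.
documents = [
    {"id": 12345, "name": "Jane", "age": 22, "email": "jane@example.com"},
    {"id": 12346, "name": "Raj", "age": 35},   # schema varies per document
]

def find(docs, **criteria):
    """Return every document whose fields match all the given criteria."""
    return [doc for doc in docs
            if all(doc.get(field) == value
                   for field, value in criteria.items())]

print(find(documents, name="Jane")[0]["id"])   # 12345
```

Note that the second document simply lacks an email field; nothing forces the two documents to share a schema, which is exactly the flexibility the text describes.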

Column Store

Column stores are an emerging NoSQL option, created in response to very specific database problems involving beyond-massive amounts of data across a hugely distributed system. Think Google. Think Facebook. Imagine the colossal amount of data that Google stores in its data farms. And then imagine how many permutations of data sets need to be compiled to respond to all possible Google searches. Clearly, this task could never be accomplished in any reasonable time frame with a traditional relational database. It requires the ability to handle massive amounts of data but with more query complexity than either key-value stores or document stores would deliver.

Most column stores also use MapReduce, a fault-tolerant framework for processing huge datasets on certain kinds of distributable problems using a large number of computers. This technology is still emerging—and use cases may eventually overlap with document stores as both technologies mature. But at the moment, the use cases in production for column stores are generally limited to applications such as Google and Facebook.

5 For more on hash functions, see http://en.wikipedia.org/wiki/Hash_function. For more on tree data structures, see http://en.wikipedia.org/wiki/Tree_%28data_structure%29.
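The MapReduce pattern itself is small enough to sketch in miniature. A single-process toy, not Hadoop’s API (the function names are ours); map emits (key, value) pairs, a shuffle groups them by key, and reduce folds each group:

```python
from collections import defaultdict

# A single-process sketch of MapReduce. Hadoop distributes exactly
# these phases, plus the shuffle, over a cluster of machines.
def map_reduce(records, mapper, reducer):
    pairs = [pair for record in records for pair in mapper(record)]
    groups = defaultdict(list)
    for key, value in pairs:                 # the shuffle step
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count over a few log lines.
lines = ["error disk full", "error network down", "info disk ok"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["error"], counts["disk"])  # 2 2
```

Because each map call and each reduce group is independent, the work parallelizes naturally, which is what makes the pattern a fit for the beyond-massive data sets described above.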

A Column by Any Other Name…

It should go without saying, but we’ll say it anyway—a column store is only similar to a column-based database in that they both have the word ‘column’ in their names. A column-based database is still a structured relational database, albeit one optimized for analytics. A column store is still firmly in the NoSQL camp—this is a system for handling huge volumes of data and transactions, in a massively distributed manner, without the need to define the database structure up front—though it tends to have more SQL traits than either a key-value store or document store.

Can I Get a Hadoop From Anyone?

While this User Guide addresses the emerging database landscape, no conversation would be complete without mentioning Hadoop.

Hadoop is a scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license). It has two main parts:

• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing framework

The data typically stored with Hadoop is complex, from multiple data sources and, well, there’s always lots and lots of it. Beyond being a mass-storage system, Hadoop, through MapReduce, is also used for batch processing and computation done in parallel across a cluster of servers. While running MapReduce jobs is a common way to access data stored in Hadoop, technologies such as HBase and Hive, which sit on top of HDFS, are also used to query the data.

LiveRail: Infobright & Hadoop Power Video Advertising Analytics

LiveRail delivers technology solutions that enable and enhance the monetization of internet-distributed video. By focusing specifically on challenges and opportunities created by online video, LiveRail’s tools are designed to be easier, more efficient and more effective than traditional display ad servers at delivering and tracking advertising in this medium. Their platform enables publishers, advertisers, ad networks and media groups to manage, target, display and track advertising in online video.

The Challenge: With a growing number of customers, LiveRail was faced with managing increasingly large data volumes. They also needed to provide near real-time access to their customers for reporting and ad hoc analysis.

The Solution: LiveRail chose two complementary technologies to manage hundreds of millions of rows of data each day—Apache Hadoop and Infobright. Detail is loaded hourly into Hadoop and at the same time summarized and loaded into Infobright. Customers access Infobright 7x24 for ad-hoc reporting and analysis and can schedule time if needed to access cookie-level data stored in Hadoop.

“Infobright and Hadoop are complementary technologies that help us manage large amounts of data while meeting diverse customers’ needs to analyze the performance of video advertising investments.”
Andrei Dunca, CTO of LiveRail


Summary and Next Steps

The world of the one-size-fits-all database is done. Myriad technology approaches have been (and are being) developed to meet the challenges of Big Data. This activity impels corporate IT groups to look beyond row-based solutions to find the right fit for their analytic needs, staffing and budget requirements.

We hope that this paper, and the following Emerging Database Landscape chart, serves as a useful resource for figuring out the strengths and weaknesses of the various database approaches available today.

Infobright: High-performance Analytics for Machine-generated Data

Infobright’s high-performance database is the preferred choice for applications and data marts that analyze large volumes of “machine-generated data” such as Web data, network logs, telecom records, stock tick data and sensor data. Easy to implement and with unmatched data compression, operational simplicity and low cost, Infobright is being used by enterprises, SaaS and software companies in online businesses, telecommunications, financial services and other industries to provide rapid access to critical business data.

If you decide that a columnar database has a place in your analytic solutions, you can try it for yourself, free. Either download our Community Edition at www.infobright.org, or a free trial of our Enterprise Edition at www.infobright.com. For more information, please visit http://www.infobright.com or join our open source community at http://www.infobright.org.


The Emerging Database Landscape

This chart gives a quick overview of the strengths, weaknesses and use cases for row-based, columnar and NoSQL databases.

Row-Based
• Basic description: Data structured in rows
• Common use cases: Transaction processing; interactive transactional applications
• Strengths: Capturing and inputting new records; robust, proven technology
• Weaknesses: Scale issues—less suitable for queries, especially against large databases
• Typical database size range: Several GBs to 50 TB
• Key players: MySQL, Oracle, SQL Server, Sybase ASE

Columnar
• Basic description: Data is vertically striped and stored in columns
• Common use cases: Historical data analysis, data warehousing, business intelligence
• Strengths: Fast query support, especially for ad hoc queries on large datasets; compression
• Weaknesses: Not suited for transactions; import and export speed; heavy computing resource utilization
• Typical database size range: Several GBs to several TBs
• Key players: Infobright, Aster Data, Sybase IQ, Vertica, ParAccel

NoSQL—Key-Value Store
• Basic description: Data stored usually in memory with some persistent backup
• Common use cases: Used as a cache for storing frequently requested data for a web app
• Strengths: Scalability; very fast storage and retrieval of unstructured and partly structured data
• Weaknesses: Usually all data must fit into memory; no complex query capabilities
• Key players: MemCached, Amazon S3, Redis, Voldemort

NoSQL—Document Store
• Basic description: Persistent storage for unstructured or semi-structured data, along with some SQL-like querying functionality
• Common use cases: Web apps or any app which needs better performance and scalability without having to define columns in an RDBMS
• Strengths: Persistent store with scalability features such as sharding built in, and better query support than key-value stores
• Weaknesses: Lack of sophisticated query capabilities
• Typical database size range: Few TBs to several PBs
• Key players: MongoDB, CouchDB, SimpleDB

NoSQL—Column Store
• Basic description: Very large data storage; MapReduce support
• Common use cases: Real-time data logging, as in finance or web analytics
• Strengths: Very high throughput for Big Data; strong partitioning support; random read-write access
• Weaknesses: Low-level API; inability to perform complex queries; high latency of response to queries
• Typical database size range: Few TBs to several PBs
• Key players: HBase, BigTable, Cassandra

© Copyright 2011 Infobright Inc. Infobright is a registered trademark of Infobright Inc. All other trademarks and registered trademarks are the property of their respective owners.