[Author's note: This is my original text before GigaOM mangled it without my approval.]
The New Data Warehouse Economics
Executive Summary
The real message about the economics of data warehousing is not that
everything is less costly as a result of Moore’s Law1, but rather, that the vastly
expanded mission of data warehousing, and analytics in general, is only
feasible today because of those economics.
While “classical” data warehousing was focused on creating a process for
gathering and integrating data from many mostly internal sources and
provisioning data for reporting and analysis, today organizations want to know
not only what happened, but what is likely to happen and how to react to it.
This presents tremendous opportunities, but a very different economic model.
For twenty-five years, data warehousing has proven to be a useful approach
to report and understand an organization’s operations. Before data
warehousing, reporting and provisioning data from disparate systems was an
expensive, time-consuming and often futile attempt to satisfy an
organization’s need for actionable information. Data warehousing promised
clean, integrated data from a single repository.
The ability to connect a wide variety of reporting tools to a single model of the
data catalyzed an entire industry, Business Intelligence (BI). However, the
application of the original concepts of data warehouse architecture and
methodology are today weighed down by complicated methods and designs,
inadequate tools and high development, maintenance and infrastructure
costs. Until recently, computing resources were prohibitively costly, and data
warehousing was constrained by a sense of “managing from scarcity.”
Instead, approaches were designed to minimize the size of the databases
1 Moore observed that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented, and predicted the trend would continue for the foreseeable future; the commonly cited doubling period is now eighteen months to two years.
© 2013 Hired Brains Inc. All Rights Reserved
through aggregating data, creating complicated designs of sub-setted
databases and closely monitoring the usage of resources.
Proprietary hardware and database software are a “sticky” economic
proposition – market forces that drive down the cost of data warehousing are
blocked by the locked-in nature of these tools. Database professionals
continue to build structures to maintain adequate performance, such as data
marts, cubes, specialized schemas tuned for a single application and multi-
stage ETL2 that slow the update of information and hinder innovation.
Market forces and, in particular, the economics of data warehousing are
intervening to change the picture entirely. Today, the economics of data
warehousing are significantly different. Managing from scarcity has given way
to a rush to find the insight from all of the data, internal and external, and to
expand the scope of uses to everyone. Tools, disciplines, opportunities and
risk have been turned upside-down.
How do the new economics of data warehousing play out? There are already
enormous proven benefits to organizations, such as:
- Offloading analytical work from data warehouses to analytical platforms that are purpose built for the work
- Capturing data more quickly, in greater detail, decreasing the time to decisions
- Delaying or even foregoing expensive upgrades to existing platforms
- Shortening the lag to analyzing new sources of data, or even eliminating it
- Broadening the audience of those who perform analytics and those who benefit from it
- Shortening the time needed to add new subject areas for analysis from months to days
2 ETL (Extract, Transform and Load) tools manage the process of moving data from original source systems to data warehouses, and often include operations for profiling, cleansing and integrating the data to a single model.
Economic realities present a new era in Enterprise Business Intelligence,
disseminating analytical processing throughout and beyond the organization,
not just to a handful of analysts. With the accumulated effect of Moore’s Law,
the relentless doubling of computing and storage capacity3 every two years,
with a corresponding drop in prices, plus engineering innovation and Linux-
based and open-source software, a redrawing of the data warehouse model
is needed.
Expanding the scope and range of data warehousing would be prohibitively
expensive based on the economics of twenty-five years ago, or even ten
years ago. The numbers are in your favor.
The Numbers
Part of the driving force behind the new economics of data warehousing is
simply the advance of technology. For example:
- From 1993 to 2013, the cost of a terabyte of disk storage dropped from $2,000,000 to less than $100 (this is the cost of the drives alone; storage management hardware and software are separate)
- Dynamic RAM (DRAM), the volatile memory of the computer, has seen a similar drop in price ($250 for 4Mb of 8-bit memory then, versus $450 for 32Gb of 64-bit memory now). Today's servers can be fitted with 64,000 times as much memory for only twice the cost of twenty years ago.
- Cost/Gb is only part of the story. There has been a similarly dramatic increase in density (from 1Mb to 8Gb per chip), allowing for smaller installations that take up less space and use less energy
- A proprietary Unix server with four Intel 486 chips, 384Mb RAM and 32Gb
of storage cost $650,000 in 1993, greater than $1 million in constant
3 Though technically not a result of Moore's Law, the capacity and cost of storage follow a similar and even more dramatic curve.
dollars. Today's mid-range laptops of the same capacity cost between
$2,000 and $4,000.
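The compounding behind these numbers can be sketched with a short calculation. The doubling periods below are the commonly cited 18-month and 2-year figures, not measurements from this paper's sources:

```python
# Hypothetical illustration of Moore's-Law-style compounding over the
# 1993-2013 span discussed above.
def improvement_factor(years, doubling_period_years):
    """Multiplicative improvement after repeated doublings."""
    return 2 ** (years / doubling_period_years)

print(round(improvement_factor(20, 2)))    # -> 1024 (2-year doubling)
print(round(improvement_factor(20, 1.5)))  # ~10,000x (18-month doubling)
```

Twenty years of even the slower, two-year doubling yields a thousandfold improvement, which is the scale of change the prose above describes.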
To put these numbers in perspective, since a chart simply cannot adequately
convey the magnitude of the effect of Moore's Law over time, the table below
compares the specifications of a 1988 Ferrari Testarossa, one of the true
supercars of its era, with its successor twenty-five years later, the 2013
F12, and with the 1988 version "enhanced" by twenty-five years of Moore's
Law:
             1988 Ferrari     2013 Ferrari     1988 Ferrari "enhanced"
             Testarossa       F12              by Moore's Law
Weight       3,300 lb.        3,355 lb.        1.65 oz.
MPG          11               12               352,000
Cost New     $200,000         $330,000         $6.25
Horsepower   390              730              12,500,000
0-60         5.3 seconds      3.1 seconds      .0002 seconds
Top Speed    180 mph          211 mph          4,800,000 mph
Expensive to Build and Maintain
The reality today is that moderately priced, high-performance hardware is
widely available. The cost of CPUs, memory and storage devices have
plummeted in the past five to ten years, but data warehouse design,
completely out of step with market realities, is still based on getting extreme
optimization from expensive and proprietary hardware. A great deal of the
effort involved in data modeling a data warehouse is performance-related. In
some popular methodologies, six or seven different schema types are
employed, with twenty, thirty or more separate schemas for various use
cases.
This has the effect of providing adequate query performance for mixed
workloads, but the cost is limited access to the data and extreme inflexibility
borne of the complexity of the design. Agility is difficult due to the number of
elements that must be designed and modified over time. Under today’s cost
structure, it makes more sense to ramp up the hardware and simplify the
models for the sake of access and flexibility.
Annual data warehouse maintenance costs are considerably higher than
operational systems. This is partly due to the iterative nature of analytics, with
changing focus based on business conditions, but that is only part of the
equation. The major cause is that the architecture and best practices are no
longer aligned with the technology. It takes a great deal of time to maintain
complex structures, and the complexity undermines understanding and usability
on the front end. For those BI tools that do provide a useful semantic
layer4, allowing business models to be expressed in business terms, not
database terms, user queries (complex from the database's perspective)
quickly overwhelm resources. Many BI tools still operate
without a semantic layer between the user and the physical database objects,
forcing non-technical users to deal with concepts like tables, columns, keys
and joins which makes navigation nearly impossible. All of these problems
can be solved today by taking advantage of the tools made possible by new
market economics.
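As a toy sketch of the semantic-layer idea described above (the physical names here are invented for illustration, echoing the footnote's "dog" example), the layer is, at its simplest, a mapping from physical identifiers to business terms:

```python
# Minimal sketch of a semantic layer: physical column names from two
# hypothetical source systems map to one business-friendly term, so users
# never see tables, columns, keys or joins. Real semantic layers also
# handle joins, hierarchies and security, omitted here.
SEMANTIC_MAP = {
    "system1.CANINE_D4": "dog",
    "system2.Chien": "dog",
}

def business_name(physical_name):
    """Translate a physical column name; pass unknown names through."""
    return SEMANTIC_MAP.get(physical_name, physical_name)

assert business_name("system1.CANINE_D4") == "dog"
assert business_name("system2.Chien") == "dog"
```

Both source systems resolve to the same business term, which is what lets one report definition work across physically different schemas.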
In today’s BI environments, with “best practices” data warehouses, the
business models are burned into the circuitry of the relational database
4 A semantic layer translates physical names of data in source systems into common names for analysis. For example, System 1 and 2 may name “dog” CANINE_D4 and “Chien,” respectively, but the semantic layer abstracts these terms to a common naming convention of “dog” for understandability and consistency
schemas. The range of analysis possible is limited by the ability of the
database server to process the queries. Some BI software has the capability
to do much more, but it is constrained by the computing resources in the
database server. These tools can present a visual model to business users,
allow them to manipulate, modify, enhance and share their models. When a
request for analysis is invoked, the tools dynamically generate thousands of
lines of SQL per second, against the largest databases in the world, asking
the most difficult questions. The use of these tools, though they are precisely
what knowledge workers need, is very limited today because most data
warehouses are built on proprietary platforms that are an order of magnitude
more expensive and an order or two of magnitude slower than those available
from newer entrants into the market like Vertica.
How This Affects Operations and Strategy
The dramatic change in the cost of computing may seem compelling, and it is,
but relative hardware costs are not the entire picture. When considering the
"economics" of data warehousing, it is just as important to take into account
the whole picture: the "push" from technology and the retreat from managing
from scarcity. For example:
- A shift in labor from highly skilled data management staff to more flexible
scenarios where more powerful platforms can take over more of the work
through both brute force (faster, more complex processing) and better
software designed for the new platforms
- The amount of data captured for analytics has also grown dramatically,
partially (though not entirely) offsetting the relative cost decrease. In
addition, as today's analytical applications become more operational, the
need for speed and reduced latency (time to value) requires faster devices
such as solid-state storage devices (SSD) which cost 5x to 10x the price
of conventional disk drives (but will continue to drop in relative price)
- There is a far wider audience for reporting and analytics as a result of
better presentation tools and mobile platforms, which drive the requirement
for fresher data and analysis, particularly on mobile devices
- Data warehousing and BI were predominantly an inward-focused
operation, providing information and reporting to staff. This is no longer
the case. Now analytics are embedded in connections to customers,
suppliers, regulators and investors.
- The availability of cloud computing offers some very promising
opportunities, in many cases, particularly in agility and scale
The role of the data warehouse as the single repository of integrated
historical data, emphasizing the repository function to supply reporting
needs rather than analytics, has been reversed. BI is still useful and needed to
explain the Who, What and Where questions, but there is a driving force
across all organizations today to understand the “Why,” and to do something
about it. To understand the “Why” requires direct access to much more
detailed and diverse data not possible in first generation data warehouse
architectures, and it has to be handled by systems capable of orders of
magnitude faster processing.
New database technology designed specifically for analytics and data
warehouses such as data warehouse/analytics appliances, columnar-
orientation, optimizers designed for analytics, large memory models and
cloud orientation appeared a few years ago and are mature products today.
Most organizations are taking a fresh look at data warehouse architecture
more purpose-built for fast implementation, greater scale and concurrency,
rapid and almost continuous update, lower cost of maintenance and
dramatically more detailed data for analysis to deliver better outcomes to the
organization.
One thing is certain: the data warehousing calculus changed dramatically in
the past few years, with many options not available previously. As a result,
the cost/benefit analysis has altered. In particular, data warehouse technology
today can offer more function and value, often at lower costs, especially when
considering Total Cost of Ownership.
Looking at Total Cost of Ownership
Total Cost of Ownership is a financial metric that is part of a collection of tools
to measure the financial impact of investments. Unlike ROI, Return on
Investment or IRR, Internal Rate of Return, TCO does not take into account,
directly, the benefits of an investment. What it does is attempt to calculate,
either prospectively or retrospectively, the cost of implementing and running a
project over a fixed period of time, often three to five years. A simple TCO
calculation examines the replacement cost of something plus its operating
costs, and compares them with some other option, such as not replacing the
thing. Alternatively, it may compare the TCO of various options. People often
perform an informal TCO analysis when buying a car, considering say a five-
year TCO including purchase or lease, fuel, insurance and maintenance.
The value of a TCO model is to reduce the prominence of initial startup costs
in decision making in favor of taking a longer view. For example, of three
candidates for data warehouse platform, the one with the higher purchase
cost may have a lower longer term running cost because of maintenance
costs, energy costs or labor.
One application of TCO analysis in the new economics of data warehousing
might compare the costs of implementing Hadoop as a data integration
platform instead of existing tools such as enterprise ETL/ELT (Extraction,
Transform & Load). Because Hadoop is Open Source software it has no
purchase or license costs (though there are commercial distributions that
have conventional licensing but at modest cost) and appears to be a very
appealing alternative using TCO:
          Software License   Software Annual   Server Purchase   Server Annual
          Purchase           Maintenance       w/utilities       Maintenance
ETL       $500,000           $100,000          $175,000          $35,000
Hadoop    0                  0                 $40,000           0

                    ETL           Hadoop
At Inception        $810,000*     $40,000
Year 2              135,000       0
Year 3              135,000       0
Year 4              135,000       0
Year 5              0             0
Total 5-year TCO    $1,215,000    $40,000

* Maintenance fees are typically paid a year in advance, so the final
payment inside the five-year window falls in year 4
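Read with the prepayment footnote, the stated $1,215,000 total corresponds to four payment events inside the five-year window: one at inception and one in each of years 2 through 4. A throwaway sketch of that arithmetic, using the figures from the table (this is an interpretation of the footnote, not a standard TCO formula):

```python
# Five-year TCO sketch: license and hardware bought at inception, with the
# first year's maintenance prepaid. Because maintenance is paid a year in
# advance, the last maintenance payment inside a 5-year window falls in
# year 4, giving four payment events in total.
def five_year_tco(license_cost, annual_sw_maint, server_cost, annual_hw_maint):
    annual_maint = annual_sw_maint + annual_hw_maint
    at_inception = license_cost + server_cost + annual_maint
    return at_inception + 3 * annual_maint  # payments in years 2, 3 and 4

etl = five_year_tco(500_000, 100_000, 175_000, 35_000)  # 810,000 + 3 x 135,000
hadoop = five_year_tco(0, 0, 40_000, 0)                 # server purchase only
print(etl, hadoop)  # -> 1215000 40000
```

The point of the sketch, like the table, is only that a naive model produces a lopsided comparison; the text below explains what it leaves out.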
This example does not demonstrate the desirability of Hadoop over ETL. It
demonstrates how easy it is to make mistakes with TCO by not considering
all of the factors. For example, ETL technology is in its third decade of
development and is designed purposely for the one application of the
integration of data for data warehouses. Hadoop is a general-purpose
collection of tools for managing large sets of messy data. What the model
above excludes is the cost of programming effort needed for Hadoop to
perform what ETL already can without programming. In addition, ETL tools
have added benefits that are not often evident until a few years into the
project, such as connectors to every imaginable type of data without
programming, mature metadata models and services that ease enhancement
and agility, developer’s environments with version control, administrative
controls and even run-time robots that advise of defects or anomalies as they
occur.
Once the costs of labor, delays and downtime are factored in, the TCO model
provides a very different answer. The point of this example is not to promote
ETL over Hadoop or vice-versa, it is only to illustrate why it is so important in
this fairly turbulent time of change to build your models carefully.
Unlike ROI and its various derivatives, TCO does not discount future cash
flows for two reasons: first, because the time periods are relatively short and
second, because TCO costs are not actually investments. The goal of TCO is
to find (or measure) the cost of the implementation over time.
Keep in mind that a comparative analysis using TCO does not address the
question of whether you should make the investments; it only compares
various alternatives. The actual decision has to be based on an investment
methodology that considers benefits as well as costs.
Big Data and Data Warehouses
Until now, this discussion centered on data warehouses and BI. Big Data is
the newest complication in the broader field of analytics, but its ramifications
affect data warehousing to a great extent. Much is made of big data being a
“game changer” and the successor to data warehousing. It is not. If anything,
it represents an opportunity for data warehousing to fulfill, or at least
extend its reach closer to, its original purpose: to be the source of useful, actionable
data for all analytical processing. Data warehouse thinking will of necessity
have to abandon its vision of one physical structure to do this, however.
Instead, it will have to cooperate with many different data sources as a
virtualized structure, a sort of surround strategy of “quiet” historical data
warehouses, extreme, unconstrained analytical databases providing both
real-time update and near-time response and non-relational, big data clusters
such as Hadoop. Big data forces organizations to extend the scale of their
analytical operations, both in terms of volume and variety of inputs, and,
equally important, to stretch their vision of how IT can expand and enhance
the use of technology both within and beyond the organization’s boundaries.
Big data is useful for a wide range of applications in gathering and sifting
through data, and its most often repeated uses are for gathering and analyzing
information after the fact to draw new insights not possible previously.
However, in 3-5 years, big data will move into production of all sorts of real-
time activities such as biometric recognition, self-driving cars (maybe 10
years for that) and cognitive computing including natural language question
and answering systems. This will only be possible because computing
resources continue to expand in power and drop in relative cost. Examples
include in-memory processing, ubiquitous cloud computing, text and
sentiment analysis and better devices for using the information. However, as
crucial as the hardware and software components are, the real advances will
happen with the wide adoption of semantic technologies that use ontology5,
graph analysis6 and open data7.
Economic realities are ushering in a new era in Enterprise Business
Intelligence, disseminating analytical processing throughout and beyond the
organization. BI is currently too expensive, too limited in scope and too rigid.
The economics of computing are not reflected in current data warehousing
best practices. The solution to problems of latency and depth of data is the
application of the vast computing and storage resources now available at
reasonable prices. Due to Moore’s Law and innovations in query processing
5 Ontologies are a way to represent information in a logical relationship format. One benefit is the ability to create them and link them together with other, even external, ontologies, providing a way to use the meaning of things across domains.
6 Relational databases employ a logical concept of tables, two-dimensional arrangements of identically structured rows of records. Graphs are a completely different arrangement: things connected to other things by relationships. Graphs are capable of traversing massive amounts of data quickly and are often used in search, path analysis, etc.
7 Open data is a collaborative effort of organizations and people to provide access to clean, consistent shared information.
techniques, workarounds and limitations that have become “best practices”
can be abandoned. Forward-looking companies like H-P/Vertica are
positioned to provide the kinds of capacity, easy scalability, affordability (both
purchase cost and TCO8) and simplicity that allow for broadly inclusive
environments with far greater flexibility. Tedious design work and endless
optimization that slow a project can be eliminated with new approaches at a
much lower cost.
In addition to servicing the reporting and analysis needs of business people,
data warehouses will soon be positioned to support a new class of
applications: unattended decision support. So-called hybrid or composite
systems, which mix operational and analytical processing in real time, are
already in limited production and will become mainstream within the next few
years. Most BI architectures today are not equipped to meet these needs.
Making data warehouses more current, flexible and useful will lead to much
greater propagation of BI, improving comprehension and retention of
business information, facilitating collaboration and increasing the depth and
breadth of business intelligence.
Bringing data warehouse practices up to speed with the state of technology
requires a reversal in polarity: the focus of data warehousing is still solidly on
data and databases and needs to shift to business users and practices that
need its functionality. The preponderance of effort in delivering BI solutions
today is still highly technical and dominated by technologists. For the effort to
be relevant, the focus and control must be turned over to the stakeholders.
This is impossible with current best practices.
Unlike systems designed to support business processes such as supply chain
management, fulfillment, accounting or even human resources, where the
business processes can be clearly defined (even though it may be arduous),
BI isn’t really a business process at all, at least not yet. Traditional systems
architecture and development methodologies presuppose that a distinct set of
8 Total Cost of Ownership
requirements can be discovered at the beginning of an implementation.
Asking knowledge workers to describe their needs and “requirements” for BI
is often a fruitless effort because the most important needs are usually unmet
and largely unknown. Thinking out of the box is extremely difficult in this
context. The entire process of “requirements gathering” in technology
initiatives, and BI in particular, is questionable because the potential
beneficiaries of the new and improved capabilities are operating under a
mental model formed by the existing technology.
A popular alternative to this approach is the iterative or “spiral” methodology
that simply repeats the requirements gathering in steps, implementing partial
solutions in between. Nevertheless, this process is still a result of managing
from scarcity: models have to be built in anticipation of use. But the new
economics of data warehousing are changing that. It is now possible to gather
data and start investigations before the requirements are known and to
continuously increment the breadth of a data warehouse because the tools
are so much more capable of processing information.
Prescription for Realignment
No amount of interviewing or “requirements gathering” can break through and
find out what sorts of questions people would have if they allowed themselves
to think freely. The alternative is to ramp up the capacity, allow exploration
and discovery, guiding it over a period of time. What’s needed is a new
paradigm that breaks open the data warehouse to business users, enabling
ad hoc and iterative analyses on full sets of detailed and dynamic data.
Unfortunately, old habits die hard. When it comes to BI, the industry is largely
constrained by the drag of legacy technology. What passes as acceptable BI in
organizations today is rarely much more than re-platforming reports and
queries that are ten to twenty years old.
For data warehousing and BI to truly pay off in organizations, IT needs to shift
its focus from deciding the informational needs of the organization through
technical architecture and discipline, to one of responding to those needs as
quickly as they arise and change.
Conclusion: Walk in Green Fields
The new economics of data warehousing, more than anything else, free data
from the doctrine of scarcity. Many ordinary business questions are
dependent on examining attributes of the most atomic pieces of data. These
bits of information are lost as the data is summarized. Some examples of
queries that are simple in appearance but cannot be resolved with
summarized data are:
- Complementarity (a.k.a. market basket analysis): What options are most
likely to be purchased, sorted by profitability, in connection with others?
Extra credit: which aren’t?
- Metric Filters (a metric derived within a query is used as a filter): Who are
the top ten salespersons based on the number of times they have been in
the top ten monthly (inclusion based on % of quota)?
- Lift Analysis (what do we get from giving something away): How many
cardholders will we send a gift to for making a purchase greater than $250
on their VISA card in the next 90 days, and which of them did so because
they were given an incentive? What if the amount is $300? How about a
visualization of the lift from $50 to $500? Extra credit: which spent
greater than triple their average charge in the trailing 90 days before
they were sent the promotion?
- Portfolio Analysis (to evaluate concentration risk, for example):
Aggregating data obscures information, especially in a portfolio where the
positive or negative impact of one or a small proportion of entities can
adversely affect the whole collection. Evaluating and understanding all of
the attributes of each security, for example, is essential, not an average.
- Valuations (such as asset/liability matching in life insurance): Each
coverage and its associated attributes (face amount, age at issue, etc.) is
modeled with interest assumptions to generate cash flows that demonstrate
solvency for a book of business.
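To make the "metric filter" bullet above concrete, here is a minimal sketch over invented detail rows. It counts how often each salesperson lands in the monthly top N, a question that cannot be answered from summarized data because the monthly rankings are lost in aggregation:

```python
from collections import Counter

# Invented (month, salesperson, pct_of_quota) detail rows for illustration.
sales = [
    ("2013-01", "Ann", 1.40), ("2013-01", "Bob", 1.10), ("2013-01", "Cam", 0.90),
    ("2013-02", "Ann", 1.20), ("2013-02", "Bob", 0.80), ("2013-02", "Cam", 1.30),
    ("2013-03", "Ann", 1.60), ("2013-03", "Bob", 1.50), ("2013-03", "Cam", 1.45),
]

TOP_N = 2  # "top ten" in the text; top two for this tiny sample

appearances = Counter()
for month in {m for m, _, _ in sales}:
    # Rank that month's detail rows by % of quota, keep the top N.
    ranked = sorted((r for r in sales if r[0] == month),
                    key=lambda r: r[2], reverse=True)[:TOP_N]
    appearances.update(name for _, name, _ in ranked)

# The derived metric (appearance count) now acts as the filter.
leaders = appearances.most_common()
print(leaders)  # -> [('Ann', 3), ('Bob', 2), ('Cam', 1)]
```

The metric being filtered on (number of top-N appearances) only exists once every atomic row has been ranked within its month, which is exactly why the query fails against pre-aggregated data.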
All of these examples are just extrapolations of classic analytics, but
enhanced by the ability to make more observations (data points) and fewer
assumptions, and to do it in near real-time. In addition to these examples, the
following are examples of hybrid processing, involving the coordination of
analytics and operational processing:
- Manufacturing Quality: Real-time feeds from the shop floor are monitored
and combined with historical data to spot anomalies, resulting in
drastically reduced turnaround times for quality problems.
- Structured/Unstructured Analysis: The Web is monitored for press
releases that indicate price changes by competitors, causing the release
to be combined with relevant market share reports from the data
warehouse and broadcast to appropriate people, all on an unattended
basis.
- Personalization: B2B and B2C eCommerce dialogues are dynamically
customized based on analysis of rich integrated data from the data
warehouse in real-time and real-time feeds
- Dynamic Pricing: Operational systems, like airline reservations, embed
real-time analysis to price each transaction (also known as Yield
Management).
This list is endless. Providing people (and processes) access to this level of
detail has long been considered too difficult and too expensive. Porting these applications
to newer, commodity-based platforms and purpose-built software that exploits
the new economics can empower a multitude of powerful new analytical
applications.
Conclusion
The new economics of data warehousing provide attractive alternatives in
both costs and benefits. While big data gets most of the attention, evolved
data warehousing will play an important role for the foreseeable future. In
order to be relevant, data warehouse design and operation needs to be
simplified, taking advantage of the greatly improved hardware, software and
methods. On the benefits side, careful location of processing to separate data
integration and stewardship activities from those of an operational and
analytical nature is clearly required. The industry has moved beyond a single
relational database to do everything. There are more choices today than ever
before. New BI tools are arriving every day that take reporting and analysis to
new levels of visualization, collaboration and distribution. Embedded
analytical processes in operational systems for decision-making rely on fast,
deep and clean data from data warehouses.
Data warehousing can trace its antecedents to traditional Information
Technology. Big data, and especially its most prominent tool, Hadoop, trace
their lineage to search and e-business. While these two streams are more or
less incompatible, they are converging. Data warehouse providers are quickly
adding Hadoop distributions, or even their own versions of Hadoop, into their
architecture, adding further cost advantages to the collection of extremely
large data sets.
Finding and retaining the talent to manage in this newly converged
environment will not be easy. Current estimates are that in the U.S. alone,
there is an existing undersupply of “data scientists” with the skill and training
to be effective. To change the mix in the educational system will take more
than a decade, but the economy will adjust much more quickly. New and
better software has already arrived allowing knowledge workers without
Ph.D.’s and/or extensive background in quantitative methods to analyze data
and to build and execute models.
Driven by the new economics of data warehousing, the proper design and
placement of a data warehouse today would be almost unrecognizable to a
practitioner of a decade ago. The opportunities are there.
ABOUT THE AUTHOR
Neil Raden, based in Santa Fe, NM, is an industry analyst and active
consultant, widely published author and speaker and the founder of Hired
Brains, Inc., http://www.hiredbrains.com. Hired Brains provides consulting,
systems integration and implementation services in Data Warehousing,
Business Intelligence, “big data,” Decision Automation and Advanced
Analytics for clients worldwide. Hired Brains Research provides consulting,
market research, product marketing and advisory services to the software
industry.
Neil was a contributing author to one of the first (1995) books on designing
data warehouses and he is more recently the co-author of Smart (Enough)
Systems: How to Deliver Competitive Advantage by Automating Hidden
Decisions, Prentice-Hall, 2007. He welcomes your comments at
[email protected] or at his blog at Competing on Decisions.
Appendix
Relational Database Inverted: What Columnar Databases Do
For a complete description of the features and functions of Vertica, please
refer to their website at www.vertica.com. What follows is a more general
description of the concept. It is very slightly technical.
Classic relational databases store information in records and collections of
records in tables (actually, Codd, the originator of the relational model,
referred to them as tuples and relations, hence, relational databases). This
scheme is excellent for dealing with the insertion, deletion or retrieval of
individual records as the elements of each record, or the attributes, are
logically grouped. These records can get very wide, such as network
elements in a telecom company or register tickets from a Point of Sale
application. This physical arrangement closely mirrors the events or
transactions of a business process.
When accessing a record, however, the database accesses the entire record,
even if only one attribute is needed in the query. In fact, since records are
physically stored in “pages,” the database has to retrieve more than one
record, often dozens, to access one piece of data. Furthermore, wide
database tables are often very “sparse,” meaning there are many attributes
with null values that still have to be processed when used in a query. These
are major hurdles in using relational databases for analytics.
Columnar database technology inverts the database structure and stores
each attribute separately. This completely eliminates the wasteful retrieval
(and burden of carrying the needless data) as the query is satisfied. As an
added benefit, when dealing with just one attribute at a time, the odds are
vastly increased that instances are not unique, which often leads to dramatic
compression of the data (depending on the distinctness of each collection
of attributes). As a result, much more data can fit in memory, and moving data
into memory is much faster.
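The inversion described above can be sketched in a few lines. The rows and run-length encoding here are toy stand-ins (Vertica's actual encodings are more sophisticated), but they show why low-cardinality columns compress so well once attributes are stored separately:

```python
# Toy illustration of row vs. columnar layout. A row store reads whole
# records even when a query needs one attribute; a column store reads only
# that attribute, and repeated values compress (run-length encoding here
# stands in for real columnar compression schemes).
rows = [
    ("2013-06-01", "NY", 120.0),
    ("2013-06-01", "NY", 80.0),
    ("2013-06-02", "CA", 95.5),
]

# Invert to columnar layout: one list per attribute.
columns = {name: list(vals) for name, vals in
           zip(("date", "state", "amount"), zip(*rows))}

def run_length_encode(values):
    """Collapse consecutive duplicates into [value, count] runs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

# Three state values stored as two runs; only this column is touched.
print(run_length_encode(columns["state"]))  # -> [['NY', 2], ['CA', 1]]
```

A query like "total amount by state" touches only the `state` and `amount` columns, never the dates, which is the wasteful retrieval the text says columnar storage eliminates.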
In summary, the Vertica Analytical Database innovations include:
- Shared-nothing massively parallel processing
- Columnar architecture residing on commodity hardware
- Aggressive data compression leading to smaller, faster databases
- Auto-administration: design, optimization, failover, recovery
- Concurrent loading and querying