The Database Revolution: Opening Webcast (01-18-12)
DESCRIPTION
Robin Bloor and Mark Madsen offer their theories on where the rapidly changing database market stands today: What's new? What's standard? What is the trajectory of this evolving market? Each analyst will present for 10-15 minutes, then engage in a dialogue with the moderator and attendees. The webcast audio and video archive can be found at https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=4695777&rKey=4b284990a1db4ec0
TRANSCRIPT
Fit For Purpose: The New Database Revolution
Mark Madsen & Robin Bloor
Introduction
Significant and revolutionary changes are taking place in database technology
In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project.
This is the first webinar in a series of webinars and research activities that will make up the project
All research will be made available through our web site: Databaserevolution.com
Sponsors of This Research
General Webinar Structure
What & why
History of Database Part 1: How we got to the RDBMS
History of Database Part 2: Relational and Post- relational
Food For Thought: Issues, Problems, Assumptions, Challenges
Current Conclusions: Insofar as we have any
Change? Why?
Increased data volumes
Significant hardware changes
Database product innovation
New workloads, different data structures
Established database concepts are being challenged
Market Forces can drive change
Data Volumes: Moore’s Law Cubed
Moore’s Law suggests that CPU power increases 10-fold every 6 years (and other technologies have stayed in step to some degree)
Large database volumes have grown 1000-fold every 6 years:
In 1992, measured in megabytes
In 1998, measured in gigabytes
In 2004, measured in terabytes
In 2010, measured in petabytes
Exabytes by 2016?
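The arithmetic behind "Moore's Law cubed" is worth making concrete. A minimal sketch using the round growth factors from the slide (illustrative numbers, not measurements):

```python
# Illustrative arithmetic only, using the slide's round numbers:
# CPU power ~10x per 6 years (Moore's Law, roughly), while large
# database volumes grow ~1000x per 6 years, so the gap widens 100x
# every period.
CPU_GROWTH = 10
DATA_GROWTH = 1000

for periods, year in enumerate(range(1992, 2017, 6)):
    gap = (DATA_GROWTH // CPU_GROWTH) ** periods
    print(f"{year}: data volume has outpaced CPU power by {gap:,}x")
# 1992: 1x, 1998: 100x, 2004: 10,000x, 2010: 1,000,000x, 2016: 100,000,000x
```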
Hardware Changes
Moore's Law now proceeds by adding cores rather than by increasing clock speed.
Computer grids using commodity servers are now relatively inexpensive
Parallelism is now on the rise and will eventually become the normal mode of processing
Memory is about 1 million times faster than disk, and random reads have become very expensive in terms of latency
SSDs are augmenting, and may eventually replace, spinning disk
[Figure: transactional data over time. The active share of data shrinks from roughly 70% active / 30% static toward 10% active / 90% static; the majority of data becomes historical over time, or even all of it once no longer active. Meanwhile application performance degrades and cost ($$$ and pain) rises. Image courtesy: RainStor]
Market Forces
A new set of products appear
They include some fundamental innovations
A few are sufficiently popular to last
Fashion and marketing drive greater adoption
Product defects begin to be addressed
They eventually challenge the dominant products
Section 1: History Part 1
Pre-relational and Relational
What we had in prior technology regimes
Where we came from
What we traded away and why
The Dawn of Database
Schema defines logical structure of data
The schema enables extensive reuse
Logical structure vs Physical structure
ACID Properties
Atomicity – transactions must be atomic
Consistency – a transaction ensures consistency
Isolation – a transaction runs in isolation
Durability – a completed transaction causes permanent change to data
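A minimal sketch of atomicity and durability in practice, using Python's built-in sqlite3; the bank-transfer schema is a hypothetical example, not anything from the webcast:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical example schema
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity: both updates become visible together, or not at all.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()       # Durability: a committed change is permanent
except sqlite3.Error:
    conn.rollback()     # on failure, no partial transfer is ever visible
```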
Database Performance Bottlenecks
CPU saturation
Memory saturation
Disk I/O channel saturation
Locking
Network saturation
Parallelism – inefficient load balancing
The Joys of SQL?
SQL is a declarative query language targeted at data organized in two-dimensional tables.
It enables set operations on those tables via Select, Project, and Join operations, which can be qualified (Order By, etc.)
It imposes some limitations on the logical model of data.
It can create a barrier between the user and the data....
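To make the three set operations concrete, here is a minimal sketch run through Python's sqlite3; the customers/orders tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann', 'Austin'), (2, 'Bo', 'Boston');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 12.5);
""")

rows = conn.execute("""
    SELECT c.name, o.total                   -- Project: keep only some columns
    FROM customers c
    JOIN orders o ON o.customer_id = c.id    -- Join: combine two tables
    WHERE c.city = 'Austin'                  -- Select: keep only some rows
    ORDER BY o.total                         -- a qualification on the result
""").fetchall()
print(rows)  # [('Ann', 99.0)]
```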
The Ordering Of Data
“A data set is an unordered collection of unique, non-duplicated items.”
Data is naturally ordered by time if by nothing else.
Events are ordered by time.
Changes to entities are ordered by time
Having an inherent physical order to data can save many processing cycles in some application areas
This is particularly the case for time series applications.
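A minimal sketch of the point about physical order, assuming an event list kept sorted by timestamp (the data is made up):

```python
import bisect

# Events stored physically sorted by time: a time-range query becomes a
# binary search plus a scan of just the matching slice, not the whole set.
events = [(1, "login"), (5, "click"), (9, "buy"), (14, "logout")]
timestamps = [t for t, _ in events]

def range_query(t_start, t_end):
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_right(timestamps, t_end)
    return events[lo:hi]        # O(log n + k) rather than O(n)

print(range_query(4, 10))       # [(5, 'click'), (9, 'buy')]
```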
The RDBMS Optimizer
The database can know how to access data better and faster than any programmer…
It wasn’t true
It became true
It isn’t always true
It only optimizes for persistent data
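A small sketch of the optimizer at work, using SQLite's EXPLAIN QUERY PLAN (the table is hypothetical, and the output wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")

# Without an index, the optimizer's only option is a full table scan...
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall())
# -> plan detail reads something like 'SCAN t'

conn.execute("CREATE INDEX t_id ON t(id)")

# ...with one, it switches to an index search, no programmer hint required.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall())
# -> plan detail reads something like 'SEARCH t USING INDEX t_id (id=?)'
```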
Section 2: History Part 2
Relational and Post-relational
Where we are today: OldSQL, NewSQL, and NoSQL
The finalizing of the distributed web architecture
Rediscovery of the past, when we had purpose-built data stores of different types, with a twist.
Revisiting of old arguments
Challenging old assumptions
Database Product Innovation
Column Stores and Query-biased Workloads
Column store databases are still RDBMSs
Most SQL queries do not require all columns of a table
So partitioning data by columns (vertically) will usually be better than partitioning by rows (horizontally)
And data compression can be more efficient
Column store databases scale up [somewhat] better than traditional RDBMSs depending on workload, queries, etc.
Column store ≠ column family
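A toy sketch of why columnar partitioning helps query-biased workloads (the table and query are made up):

```python
# Row layout: whole records stored together; a SUM over one column still
# drags every field of every row through memory.
rows = [{"id": i, "name": f"n{i}", "price": i * 1.5, "qty": i % 7}
        for i in range(1000)]
total_row_store = sum(r["price"] for r in rows)

# Column layout: each column is contiguous; the same SUM touches one array.
# Similar values stored together also compress far better.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_column_store = sum(columns["price"])

assert total_row_store == total_column_store
```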
New Lamps For Old
Google, Yahoo!, Facebook and others had data management problems that established products did not cater for: Big Data, unusual data structures, new workloads
They had money to invest and some smart engineers
They built their own solutions: Bigtable, MapReduce, Cassandra, etc.
In doing so, they provoked a database revolution
In other words, the internet happened and some people noticed.
A random selection of databases:
Sybase IQ, ASE; Teradata, Aster Data; Oracle, RAC; Microsoft SQL Server, PDW; IBM DB2s, Netezza; ParAccel; Kognitio; EMC/Greenplum; Oracle Exadata; SAP HANA; Infobright; MySQL; MarkLogic; Tokyo Cabinet;
EnterpriseDB; LucidDB; Vectorwise; MonetDB; Exasol; Illuminate; Vertica; InfiniDB; 1010data; SAND; Endeca; Xtreme Data; IMS; Hive;
Algebraix; InterSystems Caché; StreamBase; SQLstream; Coral8; Ingres; Postgres; Cassandra; CouchDB; Mongo; HBase; Redis; RainStor; Scalaris
And a few hundred more…
Section 3: Database Discussion Topics
The core post-relational changes in assumptions
Key aspects of the code-database mismatch
Reclassifying pre-relational as NoSQL
Complex data, emergent structure, types and schemas
Cloud and databases, uh-oh?
Changing Assumptions
One single scalable piece of reliable hardware
You really need a schema all the time
A handful of discrete types are all anybody will ever need, and when they need more they can code UDTs and UDFs in C++
SQL is the optimal way to write and retrieve data
ACID always applies
Data integrity is a key component of a database
No SQL, New Concepts
Maybe SQL is an unacceptable constraint
Maybe SQL is unnecessary for some fit-for-purpose databases, or perhaps just unimportant
Maybe the impedance mismatch can be avoided
Maybe a formal schema is a constraint
Maybe ACID properties can be compromised
The “Impedance Mismatch”
The RDBMS stores data organized according to table structures
The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them.
The data does not simply map to the structure it has within the database
Consequently a mapping activity is necessary to get and put data
Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools
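A minimal sketch of the mapping work the mismatch forces on application code; the Order class and the flat rows are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    id: int
    lines: list = field(default_factory=list)  # nested structure: no direct tabular form

def rows_to_orders(rows):
    """Fold flat (order_id, item, qty) rows (e.g., a JOIN result) back
    into the object graph the application actually works with."""
    orders = {}
    for order_id, item, qty in rows:
        orders.setdefault(order_id, Order(order_id)).lines.append((item, qty))
    return list(orders.values())

flat = [(1, "widget", 2), (1, "gadget", 1), (2, "widget", 5)]
print(rows_to_orders(flat))   # two Order objects rebuilt from three rows
```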
NoSQL Directions: Technology Types
Some NoSQL DBs do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability)
Some NoSQL DBs deploy a distributed scale-out architecture with data redundancy.
XML DBMS using XQuery are NoSQL DBs
Some document stores are NoSQL DBs (OrientDB, Terrastore, etc.)
Object databases are NoSQL DBs (Gemstone, Objectivity, ObjectStore, etc.)
Key-value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.)
Graph DBMS (DEX, OrientDB, etc.) are NoSQL DBs
Large data pools (BigTable, HBase, Mnesia, etc.) are NoSQL DBs
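As a toy sketch of the key-value end of this spectrum: no schema, no SQL, just get/put on opaque values, with real stores adding persistence, replication, and distribution behind the same narrow interface. The class here is illustrative, not any product's API:

```python
class KVStore:
    """A dict dressed up with a key-value store's interface."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value     # no schema: the store never inspects the value

    def get(self, key):
        return self._data.get(key)  # structure and integrity live in the application

store = KVStore()
store.put("user:42", b'{"name": "Ann", "carts": [1, 7]}')
print(store.get("user:42"))
```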
The Cloud, uh-oh
Negative implications for shared-everything databases that have scalability needs
There are architectural implications and possible incompatibilities for shared-nothing databases too
Running at scale and not at scale (concurrency, ingest volumes and frequencies, etc.) are different problems
How does the database permit dynamic provisioning, elasticity (+/-), etc?
The new database problems for IT
…are probably like old problems for people who went through the Unix client-server era.
Best of breed, no standards for anything, “polyglot persistence” = silos on steroids, data integration challenges, shifting data movement architectures
Recognize Tradeoffs
Read consistency vs programmatic correction
Schema vs a program to interpret each data structure
Standard access interface vs an API for each type of store
Data integrity enforcement vs programmatic control
Query performance for arbitrary queries vs planned access paths
Space efficiency vs simplicity / latency
Network transfer performance vs simplicity / latency
For the primary goals of
Horizontal scale
Looser coupling
Flexibility for developers building and changing applications
Information Management Through Human History
New technology development creates new methods to cope; new methods create new information scale and availability; new scale and availability create… (and the cycle repeats)
Big Data
Unstructured data isn’t really unstructured.
The problem is that this data is unmodeled.
Big data?
The holy grail of databases under current market hype
The other problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, another potential mismatch.
Conclusion
Wherein all is revealed, or ignorance exposed
Best of breed is back, baby
Workload types and characteristics
The importance of understanding workload in order to select technology
Pragmatism, babies and bathwater
Solving the Problem Depends on the Diagnosis
Types of workloads
Write-biased:
▪ OLTP
▪ OLTP, batch
▪ OLTP, lite
▪ Object persistence
▪ Data ingest, batch
▪ Data ingest, real-time
Read-biased:
▪ Query
▪ Query, simple retrieval
▪ Query, complex
▪ Query, hierarchical / object / network
▪ Analytic
Mixed?
The real challenge is that few systems are all one workload.
Who said you have to write everything to one place, and read everything from the same place?
SOA offers a partial way out, and is how many apps work.
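A minimal sketch of that idea: writes land in one store, a periodic integration step feeds a read-optimized copy, and queries never touch the write path. All names are hypothetical; real systems would use ETL, log shipping, or event streams for the sync step:

```python
write_store = []    # stand-in for an OLTP database (write-biased workload)
read_store = {}     # stand-in for a query-optimized copy (read-biased workload)

def record_sale(order_id, amount):
    write_store.append((order_id, amount))   # fast append, no query work here

def sync():
    # the integration step: in practice ETL, log shipping, or events
    for order_id, amount in write_store:
        read_store[order_id] = amount

def total_sales():
    return sum(read_store.values())          # reads are eventually consistent

record_sale(1, 99.0)
record_sale(2, 12.5)
sync()
print(total_sales())   # 111.5
```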
You must understand your workload: throughput and response time requirements aren't enough.
▪ 100 simple queries accessing month-to-date data
▪ 90 simple queries accessing month-to-date data and 10 complex queries using two years of history
▪ Hazard calculation for the entire customer master
▪ Performance problems are rarely due to a single factor.
Seven Key Query Workload Elements
These characteristics help determine suitability of technologies to improve query performance.
1. Retrieval – how much data comes back?
2. Selectivity – how much data is filtered?
3. Repetition – how often for the same query?
4. Concurrency – how many queries at once?
5. Data volume – how much data is being queried?
6. Query complexity – how many joins, aggregations, columns, filters, subselects, etc.?
7. Computational complexity – how much computation is performed over the data?
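One way to put these elements to work is as a per-workload checklist. A sketch, where the field names mirror the list above and the thresholds and example values are invented:

```python
from dataclasses import dataclass

@dataclass
class QueryWorkload:
    retrieval_rows: int       # 1. how much data comes back
    selectivity: float        # 2. fraction of data filtered away
    repetitions_per_day: int  # 3. how often the same query recurs
    concurrency: int          # 4. simultaneous queries
    data_volume_gb: float     # 5. how much data is being queried
    joins: int                # 6. a rough proxy for query complexity
    compute_heavy: bool       # 7. heavy computation over the data?

dashboard = QueryWorkload(50, 0.999, 10_000, 200, 500.0, 2, False)

# High repetition, high selectivity, tiny retrieval: caching, indexes, and
# pre-aggregation pay off. compute_heavy=True would point elsewhere entirely.
if dashboard.repetitions_per_day > 1000 and dashboard.retrieval_rows < 1000:
    print("candidate for caching / aggregate tables")
```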
Characteristics of BI workloads
Workload                    Selectivity   Retrieval         Repetition    Complexity
Reporting / BI              Moderate      Low               Moderate      Moderate
Dashboards / scorecards     Moderate      Low               High          Low
Ad-hoc query and analysis   Low to high   Moderate to low   Low           Low to moderate
Analytics (batch)           Low           High              Low to high   Low*
Analytics (inline)          High          Low               High          Low*
Operational / embedded BI   High          Low               High          Low

* Low for retrieving the data, high if doing analytics in SQL
Choosing Hardware Architectures
Compute and data sizes are key requirements
[Chart: computational intensity (MF → GF → TF → PF, i.e., megaflops to petaflops) plotted against data volume (<10s GB, 100s GB, 1s TB, 10s TB, 100s TB, PB). Architecture zones, from small to large: PC; shared everything or shared disk; shared nothing; MapReduce and related.]
Choosing Hardware Architectures
[Same chart, annotated: today's reality, and true for a while in most businesses. The bulk of the market resides here, in the small-data, low-computation zones.]
Choosing Hardware Architectures
[Same chart, annotated further: the bulk of the market resides in those zones… but analytics pushes many things into the MPP zone.]
Evaluating DB Technology
1. Define the key problems: response time, throughput, scalability?
2. Examine the workloads and their requirements
3. Match those to suitable technologies
4. Look for vendors using those technologies
5. Evaluate on real data with real workloads
Copyright Third Nature, Inc.
Thank You For Your Attention
Back-Up Slides
The SQL Barrier
SQL has:
DDL (for data definition)
DML (for Select, Project and Join)
But it has no MML (Math) or TML (Time)
Usually result sets are brought to the client for further analytical manipulation, but this creates problems
Alternatively doing all analytical manipulation in the database creates problems
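A minimal sketch of the usual compromise, using Python's sqlite3 and statistics modules; the ticks table is hypothetical:

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts INTEGER, price REAL)")
conn.executemany("INSERT INTO ticks VALUES (?, ?)",
                 [(t, 100 + (t % 5)) for t in range(1000)])

# SQL has no rich math or time-series operators, so the common pattern is
# to pull the result set to the client and compute there...
prices = [p for (p,) in conn.execute("SELECT price FROM ticks ORDER BY ts")]
print(statistics.stdev(prices))

# ...which works until the result set no longer fits on the client: the
# problem this slide is pointing at.
```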
Discussion Topics
If not covered in history through today:
the core post-relational change in assumptions
nosql core drivers, persistence in cloud, finalizing of web arch, SOAizing
a NoSQL classification list (types and projects/products)
key aspects of the OR mismatch
complex data and emergent structure
database technology types
a giant list of databases
cloud and databases, uh-oh?