The Database Revolution: Opening Webcast (01-18-12)
DESCRIPTION
Robin Bloor and Mark Madsen offer their theories on where the rapidly changing database market stands today: What's new? What's standard? What is the trajectory of this evolving market? Each analyst will present for 10-15 minutes, then engage in a dialogue with the moderator and attendees. The webcast audio and video archive can be found at https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=4695777&rKey=4b284990a1db4ec0
TRANSCRIPT
Fit For Purpose: The New Database Revolution
Mark Madsen & Robin Bloor
Introduction
Significant and revolutionary changes are taking place in database technology
In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project.
This is the first webinar in a series of webinars and research activities that will make up the project
All research will be made available through our web site: Databaserevolution.com
Sponsors of This Research
General Webinar Structure
What & why
History of Database Part 1: How we got to the RDBMS
History of Database Part 2: Relational and Post- relational
Food For Thought: Issues, Problems, Assumptions, Challenges
Current Conclusions: Insofar as we have any
Change? Why?
Increased data volumes
Significant hardware changes
Database product innovation
New workloads, different data structures
Established database concepts are being challenged
Market Forces can drive change
Data Volumes: Moore’s Law Cubed
Moore’s Law suggests that CPU power increases 10-fold every 6 years (and other technologies have stayed in step to some degree)
Large database volumes have grown 1000-fold every 6 years:
In 1992, measured in megabytes
In 1998, measured in gigabytes
In 2004, measured in terabytes
In 2010, measured in petabytes
Exabytes by 2016?
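The arithmetic behind "Moore's Law cubed" is worth making concrete. A minimal sketch using the round growth factors from the slide (illustrative numbers, not measurements):

```python
# Illustrative arithmetic only, using the slide's round numbers:
# CPU power ~10x per 6 years (Moore's Law, roughly), while large
# database volumes grow ~1000x per 6 years, so the gap widens 100x
# every period.
CPU_GROWTH = 10
DATA_GROWTH = 1000

for periods, year in enumerate(range(1992, 2017, 6)):
    gap = (DATA_GROWTH // CPU_GROWTH) ** periods
    print(f"{year}: data volume has outpaced CPU power by {gap:,}x")
# 1992: 1x, 1998: 100x, 2004: 10,000x, 2010: 1,000,000x, 2016: 100,000,000x
```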
Hardware Changes
Moore's Law now proceeds by adding cores rather than by increasing clock speed.
Computer grids using commodity servers are now relatively inexpensive
Parallelism is now on the rise and will eventually become the normal mode of processing
Memory is about 1 million times faster than disk, and random reads have become very expensive in terms of latency
SSDs are augmenting, and may eventually replace, spinning disk
[Figure: transactional data over time. The active share of data shrinks from roughly 70% active / 30% static toward 10% active / 90% static; the majority of data becomes historical over time, or even all of it once no longer active. Meanwhile application performance degrades and cost ($$$ and pain) rises. Image courtesy: RainStor]
Market Forces
A new set of products appear
They include some fundamental innovations
A few are sufficiently popular to last
Fashion and marketing drive greater adoption
Product defects begin to be addressed
They eventually challenge the dominant products
Section 1: History Part 1
Pre-relational and Relational
What we had in prior technology regimes
Where we came from
What we traded away and why
The Dawn of Database
Schema defines logical structure of data
The schema enables extensive reuse
Logical structure vs Physical structure
ACID Properties
Atomicity – transactions must be atomic
Consistency – a transaction ensures consistency
Isolation – a transaction runs in isolation
Durability – a completed transaction causes permanent change to data
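A minimal sketch of atomicity and durability in practice, using Python's built-in sqlite3; the bank-transfer schema is a hypothetical example, not anything from the webcast:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical example schema
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity: both updates become visible together, or not at all.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()       # Durability: a committed change is permanent
except sqlite3.Error:
    conn.rollback()     # on failure, no partial transfer is ever visible
```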
Database Performance Bottlenecks
CPU saturation
Memory saturation
Disk I/O channel saturation
Locking
Network saturation
Parallelism – inefficient load balancing
The Joys of SQL?
SQL is a declarative query language targeted at data organized in two-dimensional tables.
It enables set operations on those tables via Select, Project, and Join operations, which can be qualified (Order By, etc.)
It imposes some limitations on the logical model of data.
It can create a barrier between the user and the data....
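To make the three set operations concrete, here is a minimal sketch run through Python's sqlite3; the customers/orders tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann', 'Austin'), (2, 'Bo', 'Boston');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 12.5);
""")

rows = conn.execute("""
    SELECT c.name, o.total                   -- Project: keep only some columns
    FROM customers c
    JOIN orders o ON o.customer_id = c.id    -- Join: combine two tables
    WHERE c.city = 'Austin'                  -- Select: keep only some rows
    ORDER BY o.total                         -- a qualification on the result
""").fetchall()
print(rows)  # [('Ann', 99.0)]
```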
The Ordering Of Data
“A data set is an unordered collection of unique, non-duplicated items.”
Data is naturally ordered by time if by nothing else.
Events are ordered by time.
Changes to entities are ordered by time
Having an inherent physical order to data can save many processing cycles in some application areas
This is particularly the case for time series applications.
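A minimal sketch of the point about physical order, assuming an event list kept sorted by timestamp (the data is made up):

```python
import bisect

# Events stored physically sorted by time: a time-range query becomes a
# binary search plus a scan of just the matching slice, not the whole set.
events = [(1, "login"), (5, "click"), (9, "buy"), (14, "logout")]
timestamps = [t for t, _ in events]

def range_query(t_start, t_end):
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_right(timestamps, t_end)
    return events[lo:hi]        # O(log n + k) rather than O(n)

print(range_query(4, 10))       # [(5, 'click'), (9, 'buy')]
```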
The RDBMS Optimizer
The database can know how to access data better and faster than any programmer…
It wasn’t true
It became true
It isn’t always true
It only optimizes for persistent data
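A small sketch of the optimizer at work, using SQLite's EXPLAIN QUERY PLAN (the table is hypothetical, and the output wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")

# Without an index, the optimizer's only option is a full table scan...
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall())
# -> plan detail reads something like 'SCAN t'

conn.execute("CREATE INDEX t_id ON t(id)")

# ...with one, it switches to an index search, no programmer hint required.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall())
# -> plan detail reads something like 'SEARCH t USING INDEX t_id (id=?)'
```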
Section 2: History Part 2
Relational and Post-relational
Where we are today: OldSQL, NewSQL, and NoSQL
The finalizing of the distributed web architecture
Rediscovery of the past, when we had purpose-built data stores of different types, with a twist.
Revisiting of old arguments
Challenging old assumptions
Database Product Innovation
Column Stores and Query-biased Workloads
Column store databases are still RDBMSs
Most SQL queries do not require all columns of a table
So partitioning data by columns (vertically) will usually be better than partitioning by rows (horizontally)
And data compression can be more efficient
Column store databases scale up [somewhat] better than traditional RDBMSs depending on workload, queries, etc.
Column store ≠ column family
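A toy sketch of why columnar partitioning helps query-biased workloads (the table and query are made up):

```python
# Row layout: whole records stored together; a SUM over one column still
# drags every field of every row through memory.
rows = [{"id": i, "name": f"n{i}", "price": i * 1.5, "qty": i % 7}
        for i in range(1000)]
total_row_store = sum(r["price"] for r in rows)

# Column layout: each column is contiguous; the same SUM touches one array.
# Similar values stored together also compress far better.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_column_store = sum(columns["price"])

assert total_row_store == total_column_store
```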
New Lamps For Old
Google, Yahoo!, Facebook and others had data management problems that established products did not cater for: Big Data, unusual data structures, new workloads
They had money to invest and some smart engineers
They built their own solutions: Bigtable, MapReduce, Cassandra, etc.
In doing so, they provoked a database revolution
In other words, the internet happened and some people noticed.
A random selection of databases:
Sybase IQ, ASE; Teradata, Aster Data; Oracle, RAC; Microsoft SQL Server, PDW; IBM DB2s, Netezza; ParAccel; Kognitio; EMC/Greenplum; Oracle Exadata; SAP HANA; Infobright; MySQL; MarkLogic; Tokyo Cabinet;
EnterpriseDB; LucidDB; Vectorwise; MonetDB; Exasol; Illuminate; Vertica; InfiniDB; 1010data; SAND; Endeca; Xtreme Data; IMS; Hive;
Algebraix; InterSystems Caché; StreamBase; SQLstream; Coral8; Ingres; Postgres; Cassandra; CouchDB; Mongo; HBase; Redis; RainStor; Scalaris
And a few hundred more…
Section 3: Database Discussion Topics
The core post-relational changes in assumptions
Key aspects of the code-database mismatch
Reclassifying pre-relational as NoSQL
Complex data, emergent structure, types and schemas
Cloud and databases, uh-oh?
Changing Assumptions
One single scalable piece of reliable hardware
You really need a schema all the time
A handful of discrete types are all anybody will ever need, and when they need more they can code UDTs and UDFs in C++
SQL is the optimal way to write and retrieve data
ACID always applies
Data integrity is a key component of a database
No SQL, New Concepts
Maybe SQL is an unacceptable constraint
Maybe SQL is unnecessary for some fit-for-purpose databases, or perhaps just unimportant
Maybe the impedance mismatch can be avoided
Maybe a formal schema is a constraint
Maybe ACID properties can be compromised
The “Impedance Mismatch”
The RDBMS stores data organized according to table structures
The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them.
The data does not simply map to the structure it has within the database
Consequently a mapping activity is necessary to get and put data
Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools
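A minimal sketch of the mapping work the mismatch forces on application code; the Order class and the flat rows are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    id: int
    lines: list = field(default_factory=list)  # nested structure: no direct tabular form

def rows_to_orders(rows):
    """Fold flat (order_id, item, qty) rows (e.g., a JOIN result) back
    into the object graph the application actually works with."""
    orders = {}
    for order_id, item, qty in rows:
        orders.setdefault(order_id, Order(order_id)).lines.append((item, qty))
    return list(orders.values())

flat = [(1, "widget", 2), (1, "gadget", 1), (2, "widget", 5)]
print(rows_to_orders(flat))   # two Order objects rebuilt from three rows
```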
NoSQL Directions: Technology Types
Some NoSQL DBs do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability)
Some NoSQL DBs deploy a distributed scale-out architecture with data redundancy.
XML DBMS using XQuery are NoSQL DBs
Some document stores are NoSQL DBs (OrientDB, Terrastore, etc.)
Object databases are NoSQL DBs (Gemstone, Objectivity, ObjectStore, etc.)
Key-value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.)
Graph DBMS (DEX, OrientDB, etc.) are NoSQL DBs
Large data pools (BigTable, HBase, Mnesia, etc.) are NoSQL DBs
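As a toy sketch of the key-value end of this spectrum: no schema, no SQL, just get/put on opaque values, with real stores adding persistence, replication, and distribution behind the same narrow interface. The class here is illustrative, not any product's API:

```python
class KVStore:
    """A dict dressed up with a key-value store's interface."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value     # no schema: the store never inspects the value

    def get(self, key):
        return self._data.get(key)  # structure and integrity live in the application

store = KVStore()
store.put("user:42", b'{"name": "Ann", "carts": [1, 7]}')
print(store.get("user:42"))
```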
The Cloud, uh-oh
Negative implications for shared-everything databases that have scalability needs
There are architectural implications and possible incompatibilities for shared-nothing databases too
Running at scale and not at scale (concurrency, ingest volumes and frequencies, etc.) are different problems
How does the database permit dynamic provisioning, elasticity (+/-), etc?
The new database problems for IT
…are probably like old problems for people who went through the Unix client-server era.
Best of breed, no standards for anything, “polyglot persistence” = silos on steroids, data integration challenges, shifting data movement architectures
Recognize Tradeoffs
Read consistency vs programmatic correction
Schema vs a program to interpret each data structure
Standard access interface vs an API for each type of store
Data integrity enforcement vs programmatic control
Query performance for arbitrary queries vs planned access paths
Space efficiency vs simplicity / latency
Network transfer performance vs simplicity / latency
For the primary goals of
Horizontal scale
Looser coupling
Flexibility for developers building and changing applications
Information Management Through Human History
New technology development creates new methods to cope; new methods create new information scale and availability; new scale and availability create… (and the cycle repeats)
Big Data
Unstructured data isn’t really unstructured.
The problem is that this data is unmodeled.
Big data?
The holy grail of databases under current market hype
The other problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, another potential mismatch.
Conclusion
Wherein all is revealed, or ignorance exposed
Best of breed is back, baby
Workload types and characteristics
The importance of understanding workload in order to select technology
Pragmatism, babies and bathwater
Solving the Problem Depends on the Diagnosis
Types of workloads
Write-biased:
▪ OLTP
▪ OLTP, batch
▪ OLTP, lite
▪ Object persistence
▪ Data ingest, batch
▪ Data ingest, real-time
Read-biased:
▪ Query
▪ Query, simple retrieval
▪ Query, complex
▪ Query, hierarchical / object / network
▪ Analytic
Mixed?
The real challenge is that few systems are all one workload.
Who said you have to write everything to one place, and read everything from the same place?
SOA offers a partial way out, and is how many apps work.
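A minimal sketch of that idea: writes land in one store, a periodic integration step feeds a read-optimized copy, and queries never touch the write path. All names are hypothetical; real systems would use ETL, log shipping, or event streams for the sync step:

```python
write_store = []    # stand-in for an OLTP database (write-biased workload)
read_store = {}     # stand-in for a query-optimized copy (read-biased workload)

def record_sale(order_id, amount):
    write_store.append((order_id, amount))   # fast append, no query work here

def sync():
    # the integration step: in practice ETL, log shipping, or events
    for order_id, amount in write_store:
        read_store[order_id] = amount

def total_sales():
    return sum(read_store.values())          # reads are eventually consistent

record_sale(1, 99.0)
record_sale(2, 12.5)
sync()
print(total_sales())   # 111.5
```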
You must understand your workload: throughput and response time requirements aren't enough.
▪ 100 simple queries accessing month-to-date data
▪ 90 simple queries accessing month-to-date data and 10 complex queries using two years of history
▪ Hazard calculation for the entire customer master
▪ Performance problems are rarely due to a single factor.
Seven Key Query Workload Elements
These characteristics help determine suitability of technologies to improve query performance.
1. Retrieval – how much data comes back?
2. Selectivity – how much data is filtered?
3. Repetition – how often for the same query?
4. Concurrency – how many queries at once?
5. Data volume – how much data is being queried?
6. Query complexity – how many joins, aggregations, columns, filters, subselects, etc.?
7. Computational complexity – how much computation is performed over the data?
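One way to put these elements to work is as a per-workload checklist. A sketch, where the field names mirror the list above and the thresholds and example values are invented:

```python
from dataclasses import dataclass

@dataclass
class QueryWorkload:
    retrieval_rows: int       # 1. how much data comes back
    selectivity: float        # 2. fraction of data filtered away
    repetitions_per_day: int  # 3. how often the same query recurs
    concurrency: int          # 4. simultaneous queries
    data_volume_gb: float     # 5. how much data is being queried
    joins: int                # 6. a rough proxy for query complexity
    compute_heavy: bool       # 7. heavy computation over the data?

dashboard = QueryWorkload(50, 0.999, 10_000, 200, 500.0, 2, False)

# High repetition, high selectivity, tiny retrieval: caching, indexes, and
# pre-aggregation pay off. compute_heavy=True would point elsewhere entirely.
if dashboard.repetitions_per_day > 1000 and dashboard.retrieval_rows < 1000:
    print("candidate for caching / aggregate tables")
```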
Characteristics of BI workloads
Workload                    Selectivity   Retrieval         Repetition    Complexity
Reporting / BI              Moderate      Low               Moderate      Moderate
Dashboards / scorecards     Moderate      Low               High          Low
Ad-hoc query and analysis   Low to high   Moderate to low   Low           Low to moderate
Analytics (batch)           Low           High              Low to high   Low*
Analytics (inline)          High          Low               High          Low*
Operational / embedded BI   High          Low               High          Low

* Low for retrieving the data, high if doing analytics in SQL
Choosing Hardware Architectures
Compute and data sizes are key requirements
[Chart: computational intensity (MF → GF → TF → PF, i.e., megaflops to petaflops) plotted against data volume (<10s GB, 100s GB, 1s TB, 10s TB, 100s TB, PB). Architecture zones, from small to large: PC; shared everything or shared disk; shared nothing; MapReduce and related.]
Choosing Hardware Architectures
[Same chart, annotated: today's reality, and true for a while in most businesses. The bulk of the market resides here, in the small-data, low-computation zones.]
Choosing Hardware Architectures
[Same chart, annotated further: the bulk of the market resides in those zones… but analytics pushes many things into the MPP zone.]
Evaluating DB Technology
1. Define the key problems: response time, throughput, scalability?
2. Examine the workloads and their requirements
3. Match those to suitable technologies
4. Look for vendors using those technologies
5. Evaluate on real data with real workloads
Copyright Third Nature, Inc.
Thank You For Your Attention
Back-Up Slides
The SQL Barrier
SQL has:
DDL (for data definition)
DML (for Select, Project and Join)
But it has no MML (Math) or TML (Time)
Usually result sets are brought to the client for further analytical manipulation, but this creates problems
Alternatively doing all analytical manipulation in the database creates problems
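A minimal sketch of the usual compromise, using Python's sqlite3 and statistics modules; the ticks table is hypothetical:

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts INTEGER, price REAL)")
conn.executemany("INSERT INTO ticks VALUES (?, ?)",
                 [(t, 100 + (t % 5)) for t in range(1000)])

# SQL has no rich math or time-series operators, so the common pattern is
# to pull the result set to the client and compute there...
prices = [p for (p,) in conn.execute("SELECT price FROM ticks ORDER BY ts")]
print(statistics.stdev(prices))

# ...which works until the result set no longer fits on the client: the
# problem this slide is pointing at.
```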
Discussion Topics
If not covered in history through today:
the core post-relational change in assumptions
nosql core drivers, persistence in cloud, finalizing of web arch, SOAizing
a NoSQL classification list (types and projects/products)
key aspects of the OR mismatch
complex data and emergent structure
database technology types
a giant list of databases
cloud and databases, uh-oh?