Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability
DESCRIPTION
2011-11-02 | 02:25 PM - 03:15 PM
In 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could access simultaneously. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence, the ODC departs from the trend set by most caching solutions by holding its data in a normalised form, making it both memory efficient and easy to change. However, it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.
TRANSCRIPT
ODC
Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Ben Stopford : RBS
How do you support low latency and high throughput?
Database Architecture is Aging
Most modern databases still follow a 1970s architecture (for example IBM’s System R)
“Because RDBMSs can be beaten by more than an order of magnitude on the standard OLTP benchmark, then there is no market where they are competitive. As such, they should be considered as legacy technology more than a quarter of a century in age, for which a complete redesign and re-architecting is the appropriate next step.”
Michael Stonebraker (Creator of Ingres and Postgres)
The lay of the land: the main architectural constructs in the database industry
The Traditional Architecture
• Data is on disk.
• Users have an allocated user space, where intermediary results are calculated.
• The database brings data, normally via indexes, into memory and performs filters, joins, reordering and aggregation operations.
• The result is sent to the user.
Improving Database Performance (1)
Shared Disk Architecture
• More ‘grunt’.
• Suffers from issues arising from disk and lock contention.
Improving Database Performance (2)
Shared Nothing Architecture
• Massive storage potential.
• Massive scalability of processing.
• Commodity hardware.
• Limited by cross-partition joins.
Improving Database Performance (3)
In Memory Databases
Memory is at least 100x faster than disk
[Chart: comparative access latencies, spanning picoseconds to milliseconds: L1 cache ref, L2 cache ref, main memory ref, 1MB read from main memory, cross-network round trip, cross-continental round trip, 1MB read from disk/network.]
* An L1 ref is about 2 clock cycles, or 0.7ns. This is roughly the time it takes light to travel 20cm.
The architecture of an in memory database
• All data is at your fingertips.
• Query plans become less important as there is no IO
• Intermediary results are just pointers.
This makes them very fast!!
The proof is in the stats. TPC-H benchmarks on a 1TB data set:
• Exasol: 4,253,937 QphH (in-memory DB)
• Oracle Database 11g (RAC): 1,166,976 QphH
• SQL Server: 173,961 QphH
So why haven’t in-memory databases taken off?
Address spaces are relatively small and of a finite, fixed size.
• Is there enough memory for all your data?
• What happens when your data grows: the ‘one more bit’ problem?
Durability
What happens when you pull the plug?
Improving Database Performance (4)
Distributed In-Memory (Shared Nothing)
Distribution solves these three problems:
• Solve the address-space problem by spreading data across many processes.
• Solve the ‘one more bit’ problem by adding more hardware.
• Solve the durability problem with backups on another machine.
Improving Database Performance (5)
NoSQL / Distributed Caching
Simplifying the Contract
Do you really need all of ACID?
• Atomic
• Consistent
• Isolated
• Durable
Databases have huge operational overheads.
Research with Shore DB indicates only 6.8% of instructions contribute to ‘useful work’.
Taken from “OLTP Through the Looking Glass, and What We Found There”, Harizopoulos et al.
Avoid all that overhead. RAM means:
• No IO
• Single threaded, so no locking / latching
[Diagram: the main routes away from the regular database]
• Regular database (Oracle, Sybase, MySQL)
• Scale out => Shared Nothing (Vertica, Greenplum); Distributed In-Memory (Exasol, VoltDB, Hana)
• Scale out and drop ACID => NoSQL (Mongo, Cassandra)
• Scale out and drop disk and ACID => Data Grid (Coherence, Terracotta)
• Stay in a single address space and drop disk => In-Memory Database (TimesTen, HSQL, KDB)
So what are the themes that underpin this progression?
• Distributed architecture
• Simplify the contract
• Stick to RAM
ODC: a distributed, shared-nothing, in-memory, semi-normalised graph DB
• 450 processes
• 2TB of RAM
• Oracle Coherence
• Messaging (topic based) as the system of record (persistence)
The Layers
[Diagram: an Access Layer of Java client APIs sits over a Query Layer, which sits over a Data Layer holding Transactions, Cashflows and MTMs, backed by a Persistence Layer.]
Three Tools of Distributed Data Architecture
Indexing
Replication
Partitioning
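In Coherence terms, these three tools look roughly like the minimal sketch below. The cache names and the getCostCentre accessor are illustrative assumptions; CacheFactory, NamedCache and addIndex are standard Coherence API.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.extractor.ReflectionExtractor;

public class ThreeTools {
    public static void main(String[] args) {
        // Replication vs partitioning is chosen per cache in the cache
        // configuration; code simply looks the caches up by name.
        NamedCache facts = CacheFactory.getCache("transactions"); // partitioned scheme
        NamedCache dims  = CacheFactory.getCache("parties");      // replicated scheme

        // Indexing: maintain an ordered index over a property of the
        // cached values so filter queries against it avoid a full scan.
        facts.addIndex(new ReflectionExtractor("getCostCentre"), true, null);
    }
}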
How should we use these tools?
Replication puts data everywhere
Wherever you go the data will be there
But your storage is limited by the memory on a node
Partitioning scales
[Diagram: each node holds a range of keys, e.g. keys Aa-Ap.]
Scalable storage, bandwidth and processing.
Associating data in different partitions implies moving it.
What about Denormalisation?
Denormalisation implies the duplication of some sub-entities…
…and that means managing consistency over lots of copies…
…and all the duplication means you run out of space really quickly.
Space issues are exacerbated further when data is versioned.
[Diagram: the same denormalised Trade/Party/Trader graph duplicated across versions 1 to 4.]
…and you need versioning to do MVCC
And reconstituting a previous time slice becomes very difficult.
[Diagram: overlapping versioned copies of Trades, Parties and Traders.]
So we want to hold entities separately (normalised) to alleviate concerns around consistency and space usage.
This means the object graph will be split across multiple machines.
[Diagram: Trade, Party and Trader held as separate entities on different machines.]
Independently versioned; each piece of data is a singleton.
Binding them back together involves a “distributed join” => lots of network hops.
[Diagram: joining Trade, Party and Trader back together across machines.]
All these network hops make it slow.
Whereas in the denormalised model the join is already done.
Hence denormalisation is FAST! (for reads)
So what we want is the advantages of a normalised store at the speed of a denormalised one!
This is what using Snowflake Schemas and the Connected Replication pattern is all about!
Looking more closely: why does normalisation mean we have to spread data around the cluster? Why can’t we hold it all together?
It’s all about the keys.
We can collocate data with common keys, but if the keys crosscut, the only way to collocate is to replicate.
[Diagram: common keys vs crosscutting keys.]
We tackle this problem with a hybrid model:
[Diagram: Trade is partitioned; Party and Trader are replicated.]
We adapt the concept of a Snowflake Schema, taking the concept of Facts and Dimensions.
Everything starts from a core Fact (Trades, for us).
Facts are big; Dimensions are small.
Facts have one key that relates them all (used to partition).
Dimensions have many keys (which crosscut the partitioning key).
Looking at the data:
• Facts => big, common keys
• Dimensions => small, crosscutting keys
We remember we are a grid. We should avoid the distributed join…
…so we only want to ‘join’ data that is in the same process.
[Diagram: Trades and MTMs share a common key, so they land in the same partition.]
Use a key assignment policy (e.g. KeyAssociation in Coherence).
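In Coherence this policy is expressed on the key class. A minimal sketch, assuming a hypothetical MtmKey whose parent trade id is the common key; the KeyAssociation interface and its getAssociatedKey method are the real Coherence contract.

import com.tangosol.net.cache.KeyAssociation;
import java.io.Serializable;

// Key for an MTM entry. Returning the trade id as the associated key
// tells Coherence to place the MTM in the same partition as its parent
// Trade, so the trade-to-MTM join never crosses a process boundary.
public class MtmKey implements KeyAssociation, Serializable {
    private final long mtmId;
    private final long tradeId; // the common (partitioning) key

    public MtmKey(long mtmId, long tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    public Object getAssociatedKey() {
        return tradeId; // entries with equal associated keys are collocated
    }

    public boolean equals(Object o) {
        if (!(o instanceof MtmKey)) return false;
        MtmKey that = (MtmKey) o;
        return that.mtmId == mtmId && that.tradeId == tradeId;
    }

    public int hashCode() {
        return (int) (mtmId ^ (mtmId >>> 32));
    }
}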
So we prescribe different physical storage for Facts and Dimensions.
[Diagram: Trade partitioned; Party and Trader replicated.]
Facts are partitioned; Dimensions are replicated.
[Diagram: the Data Layer holds the Facts (Transactions, Cashflows, MTMs) distributed across partitioned fact storage; the Dimensions (Party, Trader) are replicated into the Query Layer.]
The data volumes back this up as a sensible hypothesis:
• Facts => big => distribute
• Dimensions => small => replicate
Key Point
We use a variant on a Snowflake Schema to partition big entities that can be related via a partitioning key, and replicate small stuff whose keys can’t map to our partitioning key.
So how does this help us to run queries without distributed joins?
This query involves:
• Joins between Dimensions
• Joins between Facts
Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
What would this look like without this pattern?
[Diagram: a chain of sequential calls strung out over network time: get Cost Centres, get LedgerBooks, get SourceBooks, get Transactions, get MTMs, get Legs. Each step is a cross-network hop.]
But by balancing Replication and Partitioning we don’t need all those hops.
Stage 1: Focus on the where clause: Where Cost Centre = ‘CC1’.
Get the right keys to query the Facts by joining the Dimensions in the Query Layer.
[Diagram: the dimension join runs in the Query Layer while Transactions, Cashflows and MTMs sit in partitioned storage.]
Stage 2: Cluster join to get the Facts. Join the Facts together efficiently across the cluster, as we know they are collocated.
[Diagram: the fact join fans out across the cluster, but each Transaction/Cashflow/MTM join is local to its partition.]
Stage 3: Augment the raw Facts with the relevant Dimensions, binding them to the result in the Query Layer.
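Put together as Java against Coherence, the three stages might look like this minimal sketch. The cache name, the getTradeId accessor and the ReplicatedDimensions facade are hypothetical; CacheFactory.getCache and entrySet with an InFilter are standard Coherence calls.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.filter.InFilter;
import java.util.Set;

public class CostCentreQuery {

    // Hypothetical facade over the locally replicated dimension caches.
    public interface ReplicatedDimensions {
        Set tradeIdsFor(String costCentre); // cost centre -> books -> trade ids
        Set augment(Set facts);             // attach dimensions to the results
    }

    private final ReplicatedDimensions dims;

    public CostCentreQuery(ReplicatedDimensions dims) {
        this.dims = dims;
    }

    public Set run(String costCentre) {
        // Stage 1: resolve the where clause against local replicas to
        // derive the partitioning keys. No network hops.
        Set tradeIds = dims.tradeIdsFor(costCentre);

        // Stage 2: one parallel query fans out across the cluster; facts
        // sharing a trade id sit in the same partition (key association),
        // so the fact join is process-local.
        NamedCache transactions = CacheFactory.getCache("transactions");
        Set facts = transactions.entrySet(new InFilter("getTradeId", tradeIds));

        // Stage 3: bind the relevant dimensions to the result, again
        // joined against the local replicas in the query layer.
        return dims.augment(facts);
    }
}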
Bringing it together: a Java client API over replicated Dimensions and partitioned Facts.
We never have to do a distributed join!
So all the big stuff is held partitioned, and we can join without shipping keys around and building intermediate results.
We get to do this… [normalised entities split across machines]
…and this… [independent versions 1 to 4]
…and this… [reconstituting a previous time slice]
…without the problems of duplicated, inconsistent copies…
…or exploding space usage…
…all at the speed of the denormalised model… well, almost!
But there is a fly in the ointment…
I lied earlier. These aren’t all Facts.
[Diagram: some of the ‘Facts’ are really Dimensions.]
This is a Dimension: it has a different key to the Facts, and it’s BIG.
We can’t replicate really big stuff; we’ll run out of space => big Dimensions are a problem.
Fortunately there is a simple solution!
The Connected Replication Pattern
Whilst there are lots of these big dimensions, the large majority are never used. They are not all “connected”.
If there are no Trades for Goldmans in the data store, then a Trade query will never need the Goldmans Counterparty.
Looking at the Dimension data, some are quite large.
But Connected Dimension Data is tiny by comparison
One recent independent study from the database community showed that 80% of data remains unused
So we only replicate ‘Connected’ or ‘Used’ Dimensions.
As data is written to the data store, we keep our ‘Connected Caches’ up to date.
[Diagram: the Data Layer holds Transactions, Cashflows and MTMs in partitioned fact storage; the Processing Layer holds the replicated dimension caches.]
As new Facts are added, the relevant Dimensions that they reference are moved to the processing-layer caches.
The replicated layer is updated by recursing through the arcs on the domain model when Facts change.
Saving a Trade causes all its first-level references to be triggered.
[Diagram: a Save Trade call hits the partitioned cache in the Data Layer (all normalised); a cache store fires a trigger. The Trade references Party, Alias, SourceBook and Ccy; the Query Layer holds the connected dimension caches.]
This updates the connected caches.
[Diagram: the Dimensions the Trade references (Party, Alias, SourceBook, Ccy) are copied into the connected dimension caches in the Query Layer.]
The process recurses through the object graph.
[Diagram: the recursion follows the arcs onwards, e.g. from Party to its LedgerBook.]
‘Connected Replication’: a simple pattern which recurses through the foreign keys in the domain model, ensuring only ‘Connected’ Dimensions are replicated.
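A minimal sketch of how that recursion might be wired, assuming a Coherence MapTrigger on the partitioned fact cache (the talk shows the trigger firing via a cache store, which is an alternative wiring). The ForeignKey and DomainModel types are hypothetical stand-ins for the arcs of the domain model; MapTrigger itself is a real Coherence extension point.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.MapTrigger;
import java.util.List;

public class ConnectedReplicationTrigger implements MapTrigger {

    // Hypothetical view of one foreign-key arc in the domain model.
    public interface ForeignKey {
        Object key();
        Object resolve();         // load the referenced dimension
        String targetCacheName(); // e.g. "dim-party"
    }

    // Hypothetical metadata describing the arcs leaving an entity.
    public interface DomainModel {
        List<ForeignKey> arcsFrom(Object entity);
    }

    private final DomainModel model;

    public ConnectedReplicationTrigger(DomainModel model) {
        this.model = model;
    }

    // Fires whenever a fact is written to the partitioned cache.
    public void process(MapTrigger.Entry entry) {
        replicateConnected(entry.getValue());
    }

    // Recurse through the foreign keys so every dimension the new fact
    // references, directly or transitively, reaches a replicated cache.
    private void replicateConnected(Object entity) {
        for (ForeignKey fk : model.arcsFrom(entity)) {
            Object dimension = fk.resolve();
            NamedCache replica = CacheFactory.getCache(fk.targetCacheName());
            if (replica.put(fk.key(), dimension) == null) {
                // First time this dimension has been seen: follow its arcs.
                replicateConnected(dimension);
            }
        }
    }
}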
With ‘Connected Replication’ only 1/10th of the data needs to be replicated (on average).
Limitations of this approach
• Data set size: the size of the connected Dimensions limits scalability.
• Joins are only supported between Facts that can share a partitioning key (but any Dimension join can be supported).
Everything is Java
• Java client API
• Java schema
• Java ‘stored procedures’ and ‘triggers’
API – Queries utilise a fluent interface
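The transcript doesn't show the fluent interface itself, so the snippet below is purely illustrative: a guess at how the cost-centre query from earlier might read, with every method name an assumption rather than the actual ODC API.

// Hypothetical fluent query, mirroring:
//   Select Transaction, MTM, ReferenceData
//   From MTM, Transaction, Ref Where Cost Centre = 'CC1'
Set results = odc.query()
        .select(Transaction.class, Mtm.class, ReferenceData.class)
        .where("costCentre").eq("CC1")
        .run();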
Performance
A query with more than twenty join conditions:
• 2GB per min / 250Mb/s (per client)
• 3ms latency
Conclusion
The data warehousing, OLTP and distributed caching fields are all converging on in-memory architectures to get away from disk-induced latencies.
Conclusion
Shared-Nothing architectures are always subject to the distributed join problem if they are to retain a degree of normalisation.
Conclusion
We present a novel mechanism for avoiding the distributed join problem by using a Star Schema to define whether data should be replicated or partitioned.
Conclusion
We make the pattern applicable to ‘real’ in-memory data models by only replicating objects that are actually used: the Connected Replication pattern.
The End
• Further details online: http://www.benstopford.com (there are textual and slideshare versions of this talk)
• A big thanks to the team in both India and the UK who built this thing.
• Questions?