Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability
DESCRIPTION
2011-11-02 | 02:25 PM - 03:15 PM
In 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could access simultaneously. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence, the ODC departs from the trend set by most caching solutions by holding its data in a normalised form, making it both memory efficient and easy to change. However, it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.
TRANSCRIPT
ODC
Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Ben Stopford : RBS
How do you support low latency and high throughput?
Database Architecture is Aging
Most modern databases still follow a 1970s architecture (for example IBM’s System R)
“Because RDBMSs can be beaten by more than an order of magnitude on the standard OLTP benchmark, then there is no market where they are competitive. As such, they should be considered as legacy technology more than a quarter of a century in age, for which a complete redesign and re-architecting is the appropriate next step.”
Michael Stonebraker (Creator of Ingres and Postgres)
The lay of the land: the main architectural constructs in the database industry
The Traditional Architecture
• Data is on disk.
• Users have an allocated user space, where intermediary results are calculated.
• The database brings data, normally via indexes, into memory and performs filters, joins, reordering and aggregation operations.
• The result is sent to the user.
Improving Database Performance (1)
Shared Disk Architecture
• More ‘grunt’.
• Suffers from issues arising from disk and lock contention.
Improving Database Performance (2)
Shared Nothing Architecture
• Massive storage potential.
• Massive scalability of processing.
• Commodity hardware.
• Limited by cross-partition joins.
Improving Database Performance (3)
In Memory Databases
Memory is at least 100x faster than disk
[Chart: comparative access latencies, spanning picoseconds to milliseconds: L1 cache ref, L2 cache ref, main memory ref, 1MB read from main memory, cross-network round trip, cross-continental round trip, 1MB read from disk/network.]
* An L1 ref is about 2 clock cycles, or 0.7ns. This is roughly the time it takes light to travel 20cm.
The architecture of an in memory database
• All data is at your fingertips.
• Query plans become less important as there is no IO
• Intermediary results are just pointers.
This makes them very fast!!
The proof is in the stats. TPC-H benchmarks on a 1TB data set:
• Exasol: 4,253,937 QphH (in-memory DB)
• Oracle Database 11g (RAC): 1,166,976 QphH
• SQL Server: 173,961 QphH
So why haven’t in-memory databases taken off?
Address spaces are relatively small and of a finite, fixed size.
• Is there enough memory for all your data?
• What happens when your data grows: the ‘one more bit’ problem?
Durability
What happens when you pull the plug?
Improving Database Performance (4)
Distributed In-Memory (Shared Nothing)
Distribution solves these three problems:
• Solve the address-space problem by spreading data across many processes.
• Solve the ‘one more bit’ problem by adding more hardware.
• Solve the durability problem with backups on another machine.
Improving Database Performance (5)
NoSQL / Distributed Caching
Simplifying the Contract
Do you really need all of ACID?
• Atomic
• Consistent
• Isolated
• Durable
Databases have huge operational overheads.
Research with Shore DB indicates only 6.8% of instructions contribute to ‘useful work’.
Taken from “OLTP Through the Looking Glass, and What We Found There”, Harizopoulos et al.
Avoid all that overhead. RAM means:
• No IO
• Single threaded, so no locking / latching
[Diagram: the main routes away from the regular database]
• Regular database (Oracle, Sybase, MySQL)
• Scale out => Shared Nothing (Vertica, Greenplum); Distributed In-Memory (Exasol, VoltDB, Hana)
• Scale out and drop ACID => NoSQL (Mongo, Cassandra)
• Scale out and drop disk and ACID => Data Grid (Coherence, Terracotta)
• Stay in a single address space and drop disk => In-Memory Database (TimesTen, HSQL, KDB)
So what are the themes that underpin this progression?
• Distributed architecture
• Simplify the contract
• Stick to RAM
ODC: a distributed, shared-nothing, in-memory, semi-normalised graph DB
• 450 processes
• 2TB of RAM
• Oracle Coherence
• Messaging (topic based) as the system of record (persistence)
The Layers
[Diagram: an Access Layer of Java client APIs sits over a Query Layer, which sits over a Data Layer holding Transactions, Cashflows and MTMs, backed by a Persistence Layer.]
Three Tools of Distributed Data Architecture
Indexing
Replication
Partitioning
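In Coherence terms, these three tools look roughly like the minimal sketch below. The cache names and the getCostCentre accessor are illustrative assumptions; CacheFactory, NamedCache and addIndex are standard Coherence API.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.extractor.ReflectionExtractor;

public class ThreeTools {
    public static void main(String[] args) {
        // Replication vs partitioning is chosen per cache in the cache
        // configuration; code simply looks the caches up by name.
        NamedCache facts = CacheFactory.getCache("transactions"); // partitioned scheme
        NamedCache dims  = CacheFactory.getCache("parties");      // replicated scheme

        // Indexing: maintain an ordered index over a property of the
        // cached values so filter queries against it avoid a full scan.
        facts.addIndex(new ReflectionExtractor("getCostCentre"), true, null);
    }
}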
How should we use these tools?
Replication puts data everywhere
Wherever you go the data will be there
But your storage is limited by the memory on a node
Partitioning scales
[Diagram: each node holds a range of keys, e.g. keys Aa-Ap.]
Scalable storage, bandwidth and processing.
Associating data in different partitions implies moving it.
What about Denormalisation?
Denormalisation implies the duplication of some sub-entities…
…and that means managing consistency over lots of copies…
…and all the duplication means you run out of space really quickly.
Space issues are exacerbated further when data is versioned.
[Diagram: the same denormalised Trade/Party/Trader graph duplicated across versions 1 to 4.]
…and you need versioning to do MVCC
And reconstituting a previous time slice becomes very difficult.
[Diagram: overlapping versioned copies of Trades, Parties and Traders.]
So we want to hold entities separately (normalised) to alleviate concerns around consistency and space usage.
This means the object graph will be split across multiple machines.
[Diagram: Trade, Party and Trader held as separate entities on different machines.]
Independently versioned; each piece of data is a singleton.
Binding them back together involves a “distributed join” => lots of network hops.
[Diagram: joining Trade, Party and Trader back together across machines.]
All these network hops make it slow.
Whereas in the denormalised model the join is already done.
Hence denormalisation is FAST! (for reads)
So what we want is the advantages of a normalised store at the speed of a denormalised one!
This is what using Snowflake Schemas and the Connected Replication pattern is all about!
Looking more closely: why does normalisation mean we have to spread data around the cluster? Why can’t we hold it all together?
It’s all about the keys.
We can collocate data with common keys, but if the keys crosscut, the only way to collocate is to replicate.
[Diagram: common keys vs crosscutting keys.]
We tackle this problem with a hybrid model:
[Diagram: Trade is partitioned; Party and Trader are replicated.]
We adapt the concept of a Snowflake Schema, taking the concept of Facts and Dimensions.
Everything starts from a core Fact (Trades, for us).
Facts are big; Dimensions are small.
Facts have one key that relates them all (used to partition).
Dimensions have many keys (which crosscut the partitioning key).
Looking at the data:
• Facts => big, common keys
• Dimensions => small, crosscutting keys
We remember we are a grid. We should avoid the distributed join…
…so we only want to ‘join’ data that is in the same process.
[Diagram: Trades and MTMs share a common key, so they land in the same partition.]
Use a key assignment policy (e.g. KeyAssociation in Coherence).
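In Coherence this policy is expressed on the key class. A minimal sketch, assuming a hypothetical MtmKey whose parent trade id is the common key; the KeyAssociation interface and its getAssociatedKey method are the real Coherence contract.

import com.tangosol.net.cache.KeyAssociation;
import java.io.Serializable;

// Key for an MTM entry. Returning the trade id as the associated key
// tells Coherence to place the MTM in the same partition as its parent
// Trade, so the trade-to-MTM join never crosses a process boundary.
public class MtmKey implements KeyAssociation, Serializable {
    private final long mtmId;
    private final long tradeId; // the common (partitioning) key

    public MtmKey(long mtmId, long tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    public Object getAssociatedKey() {
        return tradeId; // entries with equal associated keys are collocated
    }

    public boolean equals(Object o) {
        if (!(o instanceof MtmKey)) return false;
        MtmKey that = (MtmKey) o;
        return that.mtmId == mtmId && that.tradeId == tradeId;
    }

    public int hashCode() {
        return (int) (mtmId ^ (mtmId >>> 32));
    }
}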
So we prescribe different physical storage for Facts and Dimensions.
[Diagram: Trade partitioned; Party and Trader replicated.]
Facts are partitioned; Dimensions are replicated.
[Diagram: the Data Layer holds the Facts (Transactions, Cashflows, MTMs) distributed across partitioned fact storage; the Dimensions (Party, Trader) are replicated into the Query Layer.]
The data volumes back this up as a sensible hypothesis:
• Facts => big => distribute
• Dimensions => small => replicate
Key Point
We use a variant on a Snowflake Schema to partition big entities that can be related via a partitioning key, and replicate small stuff whose keys can’t map to our partitioning key.
So how does this help us to run queries without distributed joins?
This query involves:
• Joins between Dimensions
• Joins between Facts
Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
What would this look like without this pattern?
[Diagram: a chain of sequential calls strung out over network time: get Cost Centres, get LedgerBooks, get SourceBooks, get Transactions, get MTMs, get Legs. Each step is a cross-network hop.]
But by balancing Replication and Partitioning we don’t need all those hops.
Stage 1: Focus on the where clause: Where Cost Centre = ‘CC1’.
Get the right keys to query the Facts by joining the Dimensions in the Query Layer.
[Diagram: the dimension join runs in the Query Layer while Transactions, Cashflows and MTMs sit in partitioned storage.]
Stage 2: Cluster join to get the Facts. Join the Facts together efficiently across the cluster, as we know they are collocated.
[Diagram: the fact join fans out across the cluster, but each Transaction/Cashflow/MTM join is local to its partition.]
Stage 3: Augment the raw Facts with the relevant Dimensions, binding them to the result in the Query Layer.
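Put together as Java against Coherence, the three stages might look like this minimal sketch. The cache name, the getTradeId accessor and the ReplicatedDimensions facade are hypothetical; CacheFactory.getCache and entrySet with an InFilter are standard Coherence calls.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.filter.InFilter;
import java.util.Set;

public class CostCentreQuery {

    // Hypothetical facade over the locally replicated dimension caches.
    public interface ReplicatedDimensions {
        Set tradeIdsFor(String costCentre); // cost centre -> books -> trade ids
        Set augment(Set facts);             // attach dimensions to the results
    }

    private final ReplicatedDimensions dims;

    public CostCentreQuery(ReplicatedDimensions dims) {
        this.dims = dims;
    }

    public Set run(String costCentre) {
        // Stage 1: resolve the where clause against local replicas to
        // derive the partitioning keys. No network hops.
        Set tradeIds = dims.tradeIdsFor(costCentre);

        // Stage 2: one parallel query fans out across the cluster; facts
        // sharing a trade id sit in the same partition (key association),
        // so the fact join is process-local.
        NamedCache transactions = CacheFactory.getCache("transactions");
        Set facts = transactions.entrySet(new InFilter("getTradeId", tradeIds));

        // Stage 3: bind the relevant dimensions to the result, again
        // joined against the local replicas in the query layer.
        return dims.augment(facts);
    }
}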
Bringing it together: a Java client API over replicated Dimensions and partitioned Facts.
We never have to do a distributed join!
So all the big stuff is held partitioned, and we can join without shipping keys around and building intermediate results.
We get to do this… [normalised entities split across machines]
…and this… [independent versions 1 to 4]
…and this… [reconstituting a previous time slice]
…without the problems of duplicated, inconsistent copies…
…or exploding space usage…
…all at the speed of the denormalised model… well, almost!
But there is a fly in the ointment…
I lied earlier. These aren’t all Facts.
[Diagram: some of the ‘Facts’ are really Dimensions.]
This is a Dimension: it has a different key to the Facts, and it’s BIG.
We can’t replicate really big stuff; we’ll run out of space => big Dimensions are a problem.
Fortunately there is a simple solution!
The Connected Replication Pattern
Whilst there are lots of these big dimensions, the large majority are never used. They are not all “connected”.
If there are no Trades for Goldmans in the data store, then a Trade query will never need the Goldmans Counterparty.
Looking at the Dimension data, some are quite large.
But Connected Dimension Data is tiny by comparison
One recent independent study from the database community showed that 80% of data remains unused
So we only replicate ‘Connected’ or ‘Used’ Dimensions.
As data is written to the data store, we keep our ‘Connected Caches’ up to date.
[Diagram: the Data Layer holds Transactions, Cashflows and MTMs in partitioned fact storage; the Processing Layer holds the replicated dimension caches.]
As new Facts are added, the relevant Dimensions that they reference are moved to the processing-layer caches.
The replicated layer is updated by recursing through the arcs on the domain model when Facts change.
Saving a Trade causes all its first-level references to be triggered.
[Diagram: a Save Trade call hits the partitioned cache in the Data Layer (all normalised); a cache store fires a trigger. The Trade references Party, Alias, SourceBook and Ccy; the Query Layer holds the connected dimension caches.]
This updates the connected caches.
[Diagram: the Dimensions the Trade references (Party, Alias, SourceBook, Ccy) are copied into the connected dimension caches in the Query Layer.]
The process recurses through the object graph.
[Diagram: the recursion follows the arcs onwards, e.g. from Party to its LedgerBook.]
‘Connected Replication’: a simple pattern which recurses through the foreign keys in the domain model, ensuring only ‘Connected’ Dimensions are replicated.
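A minimal sketch of how that recursion might be wired, assuming a Coherence MapTrigger on the partitioned fact cache (the talk shows the trigger firing via a cache store, which is an alternative wiring). The ForeignKey and DomainModel types are hypothetical stand-ins for the arcs of the domain model; MapTrigger itself is a real Coherence extension point.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.MapTrigger;
import java.util.List;

public class ConnectedReplicationTrigger implements MapTrigger {

    // Hypothetical view of one foreign-key arc in the domain model.
    public interface ForeignKey {
        Object key();
        Object resolve();         // load the referenced dimension
        String targetCacheName(); // e.g. "dim-party"
    }

    // Hypothetical metadata describing the arcs leaving an entity.
    public interface DomainModel {
        List<ForeignKey> arcsFrom(Object entity);
    }

    private final DomainModel model;

    public ConnectedReplicationTrigger(DomainModel model) {
        this.model = model;
    }

    // Fires whenever a fact is written to the partitioned cache.
    public void process(MapTrigger.Entry entry) {
        replicateConnected(entry.getValue());
    }

    // Recurse through the foreign keys so every dimension the new fact
    // references, directly or transitively, reaches a replicated cache.
    private void replicateConnected(Object entity) {
        for (ForeignKey fk : model.arcsFrom(entity)) {
            Object dimension = fk.resolve();
            NamedCache replica = CacheFactory.getCache(fk.targetCacheName());
            if (replica.put(fk.key(), dimension) == null) {
                // First time this dimension has been seen: follow its arcs.
                replicateConnected(dimension);
            }
        }
    }
}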
With ‘Connected Replication’ only 1/10th of the data needs to be replicated (on average).
Limitations of this approach
• Data set size: the size of the connected Dimensions limits scalability.
• Joins are only supported between Facts that can share a partitioning key (but any Dimension join can be supported).
Everything is Java
• Java client API
• Java schema
• Java ‘stored procedures’ and ‘triggers’
API – Queries utilise a fluent interface
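The transcript doesn't show the fluent interface itself, so the snippet below is purely illustrative: a guess at how the cost-centre query from earlier might read, with every method name an assumption rather than the actual ODC API.

// Hypothetical fluent query, mirroring:
//   Select Transaction, MTM, ReferenceData
//   From MTM, Transaction, Ref Where Cost Centre = 'CC1'
Set results = odc.query()
        .select(Transaction.class, Mtm.class, ReferenceData.class)
        .where("costCentre").eq("CC1")
        .run();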
Performance
A query with more than twenty join conditions:
• 2GB per min / 250Mb/s (per client)
• 3ms latency
Conclusion
The data warehousing, OLTP and distributed caching fields are all converging on in-memory architectures to get away from disk-induced latencies.
Conclusion
Shared-Nothing architectures are always subject to the distributed join problem if they are to retain a degree of normalisation.
Conclusion
We present a novel mechanism for avoiding the distributed join problem by using a Star Schema to define whether data should be replicated or partitioned.
Conclusion
We make the pattern applicable to ‘real’ in-memory data models by only replicating objects that are actually used: the Connected Replication pattern.
The End
• Further details online: http://www.benstopford.com (there are textual and slideshare versions of this talk)
• A big thanks to the team in both India and the UK who built this thing.
• Questions?