Copyright © 2011, Oracle and/or its affiliates. All rights reserved.

Software Architecture for Highly Available, Scalable Trading Apps:

Meeting Low-Latency Requirements Intentionally

Craig Blitz

Oracle Coherence Product Management

The following is intended to outline general product use and direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions.

The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Agenda

• Level-Setting: Why We Care and What We Mean

• Legacy Solutions and Architectural Patterns

• A New Paradigm

Why Care About Scalability?

This is a low-latency event, isn’t it?

• Growth driven by multiple factors

• All in the context of competitive pressures

• Low latency = more business

• High latency = Sucker!

[Diagram: Customer Acquisition, Product Growth, Trading Growth]

Ok, So We’ll Just Scale Up

• Application deployments deliver low-latency at given loads

• Scale-up strategies risky

– Depend on systems growing larger and larger

– Still need to ensure all components can scale up

Limits to Scale-Up

• Size of available systems

• Programming constraints

• JVM Garbage Collection

• Network capacity

What Do We Mean By Scalability?

[Chart: Throughput (axis 0 to 900) for 2, 4, 8, and 16 systems]

• Scale linearly and predictably by adding resources as load increases.

• Question: Does the system on the right scale?

What Do We Mean By Scalability?

[Chart: Throughput and Mean Latency for 2, 4, 8, and 16 systems; axes 0 to 900 and 0 to 2.5]

• Latency must not change

• Are we there yet?

What Do We Mean By Scalability?

[Chart: Throughput, Mean Latency, and Latency Std Dev for 2, 4, 8, and 16 systems; axes 0 to 900 and 0 to 2.5]

• Doh!

• Increased Std Dev means increased SLA failures

• Done? Ok. Enough.
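
The three charts above add one criterion each: throughput grows linearly, mean latency stays flat, and latency standard deviation stays flat. A minimal sketch of that combined check follows. The function name, the 10% tolerance, and the sample figures are all assumptions for illustration, not numbers from the talk.

```python
from statistics import mean, pstdev


def scales(runs, tolerance=0.10):
    """runs: list of (systems, throughput, latencies), sorted by systems.

    Returns True only if throughput per system, mean latency, and latency
    standard deviation all stay within `tolerance` of the first run.
    """
    base_n, base_tp, base_lat = runs[0]
    base = (base_tp / base_n, mean(base_lat), pstdev(base_lat))
    for n, tp, lat in runs[1:]:
        current = (tp / n, mean(lat), pstdev(lat))
        for b, c in zip(base, current):
            if abs(c - b) > tolerance * b:
                return False
    return True


# Made-up measurements: same mean latency, very different spread.
steady = [(2, 100, [1.0, 1.1, 0.9]), (4, 200, [1.0, 1.1, 0.9])]
jittery = [(2, 100, [1.0, 1.1, 0.9]), (4, 200, [1.0, 2.0, 0.0])]
print(scales(steady), scales(jittery))
```

The second data set doubles throughput and keeps the same mean latency, yet fails the check because its spread grows, which is the point of the third chart.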

Why Care About High-Availability?

• Revelation: Not everyone does

– “We’ll just stop trading if we crash”

– But if HA were free, this would be silly

– How cheap does it have to be?

• Downtime = Lost opportunity at the very least

• But, more scalability = more chance of component failure

• HA needs to be scalable, architectural and strategic
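
The link between scale and component failure can be made concrete with simple arithmetic. Assuming each node fails independently with the same probability p during some window (an illustrative assumption, not a figure from the talk), the chance that at least one of n nodes fails is 1 - (1 - p)^n:

```python
def prob_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


# p_node = 0.01 is a made-up per-node failure probability for illustration.
for n in (2, 4, 8, 16):
    print(n, round(prob_any_failure(n, 0.01), 4))
```

Under these assumptions the probability rises with every node added, which is why HA has to scale along with the system.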

Agenda

• Level-Setting: Why We Care and What We Mean

• Legacy Solutions and Architectural Patterns

• A New Paradigm

Conceptual Trading & Risk Platform

Scalable Apps, Stove-piped Systems

Simplified Process

[Diagram: Pre-Trade Analysis, Order Management, Trade Execution, Post-Trade Analysis]

• Scalable best practices applied per application

• Low-latency messaging to communicate between applications

Challenges Scaling Stovepipe Architectures

Which came first, the organization or the silo?

Technical

• Who is managing state?

– Database? Distributed Cache?

– Processing does not scale with data

– Excessive data movement

• HA managed at component level

• Low-level messaging must scale

• Deploying systems and networks for new scale requirements difficult

Organizational

• Many organizations involved

– App teams

– Networking

– QA

– Systems

– Database

• Cross-org communications difficult

• Vested interests

Scalable Apps, Stove-piped Systems

A little better

[Diagram: Pre-Trade Analysis, Order Management, Trade Execution, Post-Trade Analysis, Distributed Cache]

• Recoverable state managed on data tier

• Data tier scalable as demand or data grows

• Still expensive (will revisit this later)

Agenda

• Level-Setting: Why We Care and What We Mean

• Legacy Solutions and Architectural Patterns

• A New Paradigm

Distributed Caching and Data Grids

Distributed Caches

• Scalable object caching across multiple servers

• Possibly lossy

– It’s a cache!

– No clustering or backups

• Read-through/write-through to data sources

• Expiration

• Eviction

Data Grids

• Processing scales with data

• Event model

• Cannot be lossy

– Clustering and Backups

– Death detection and transparent recovery

• Queries

• Map-Reduce Aggregations

• Write-Behind
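
Read-through and write-behind from the lists above can be sketched in a few lines. This is a toy model under stated assumptions: the class name is hypothetical, a plain dict stands in for the database, and flushing is triggered by hand rather than by a timer. It is not Coherence's implementation.

```python
class WriteBehindCache:
    """Toy cache with read-through loads and write-behind flushes."""

    def __init__(self, store):
        self.store = store      # stands in for a database
        self.cache = {}
        self.dirty = set()      # keys written but not yet flushed

    def get(self, key):
        if key not in self.cache:           # read-through on a miss
            self.cache[key] = self.store[key]
        return self.cache[key]

    def put(self, key, value):
        self.cache[key] = value             # no store write on this path
        self.dirty.add(key)

    def flush(self):
        """Write queued updates to the store in one batch."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

Because `put` never touches the store, write latency is decoupled from the database. That only works safely if the queued updates cannot be lost, which is why write-behind sits in the data grid column rather than the cache column.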

From Distributed Cache…

[Diagram: App, Primary Partition, and Backup Partition on two Cache Servers. Numbered operations: LOCK (1), LOCK (2), GET (3), PUT (4), PUT (5), UNLOCK (6), UNLOCK (7)]

• Fast, scalable, highly available access to application objects

• App tier and data tier scale separately

• Too many network roundtrips for low latency

• Lock held across many network roundtrips

… To Data Grid …

[Diagram: Client Tier, Primary Partition (App), and Backup Partition (App) on two Cache Servers. Operations: INVOKE, BACKUP]

• Processing moved to data grid

• App tier and data tier scale together

• Lockless processing

• Transactional processing on co-located related objects (trade and orders)

• State always recoverable
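
The difference between the two designs can be sketched by counting client requests. The classes and function names below are hypothetical, and backup traffic between cache servers is ignored. In Coherence the INVOKE step corresponds to running an entry processor against the cache; the toy `invoke` here only imitates the idea.

```python
class Partition:
    """Toy cache server partition that counts requests it receives."""

    def __init__(self):
        self.data = {}
        self.requests = 0
        self.locked = set()

    def lock(self, key):
        self.requests += 1
        self.locked.add(key)

    def get(self, key):
        self.requests += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.requests += 1
        self.data[key] = value

    def unlock(self, key):
        self.requests += 1
        self.locked.discard(key)

    def invoke(self, key, fn):
        """Run fn next to the data: one request, no client-held lock."""
        self.requests += 1
        self.data[key] = fn(self.data.get(key, 0))
        return self.data[key]


def add_fill_with_locks(p, order_id, qty):
    # Distributed-cache style: four client requests, lock held throughout.
    p.lock(order_id)
    filled = p.get(order_id)
    p.put(order_id, filled + qty)
    p.unlock(order_id)


def add_fill_with_invoke(p, order_id, qty):
    # Data-grid style: the update travels to the data in one request.
    p.invoke(order_id, lambda filled: filled + qty)
```

Both functions leave the order in the same state. The first costs four requests with a lock held across all of them, the second costs one.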

… To Event Driven Architecture …

[Diagram: Client Tier, Primary Partition running Process 1, Process 2, and Process 3, and Backup Partition (App) on two Cache Servers. Operations: INVOKE, BACKUP (three times)]

• “Live Objects” listen to state changes on themselves to schedule the next process phase

• State (and hence processing) always recoverable

• Eliminates need for messaging between application processors

• Highly scalable, completely asynchronous
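
The live-object idea can be sketched as an object that reacts to its own state change by scheduling the next phase. The class, the phase names, and the list used as a task queue are all assumptions for illustration. A real grid would deliver the change event from the cache and back up each state before the next phase runs.

```python
class LiveOrder:
    """Toy "live object": reacts to its own state changes."""

    # Hypothetical phases; the talk does not name them.
    NEXT = {"received": "validated", "validated": "executed"}

    def __init__(self, tasks):
        self.tasks = tasks        # pending work, stands in for an executor
        self.history = []
        self.state = None

    def set_state(self, state):
        self.state = state
        self.history.append(state)
        self.on_state_change()

    def on_state_change(self):
        nxt = self.NEXT.get(self.state)
        if nxt is not None:
            # Schedule the next phase instead of messaging another app.
            self.tasks.append(lambda: self.set_state(nxt))


tasks = []
order = LiveOrder(tasks)
order.set_state("received")
while tasks:                      # drain scheduled phases in order
    tasks.pop(0)()
print(order.history)
```

No message passes between separate applications here: each phase is triggered by the previous state being stored, so recovering the state also recovers the position in the process.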

Oracle Coherence Data Grid

Distributed In-Memory Data Management

• Provides a reliable data tier with a single, consistent view of data

• Enables dynamic data capacity including fault tolerance and load balancing

• Ensures that data capacity scales with processing capacity

[Diagram labels: Enterprise Applications, Real Time Clients, Web Services; Data Services; Oracle Coherence Data Grid; Mainframes, Databases, Web Services]

Coherence: A Unique Approach

• In Coherence…

– Members share responsibilities (health, services, data…)

– No Single Points of Bottleneck (SPOBs)

– No Single Points of Failure (SPOFs)

– Linearly scalable to hundreds of servers by design

• Servers form a full “mesh”

– No Masters / Slaves etc.

– Data Grid members work together as a team

– Communication is almost always point-to-point

• Scalable throughput up to the limit of the backplane

How Does Oracle Coherence Work?

• Data load-balanced in-memory across a cluster of servers

• Data automatically and synchronously replicated to at least one other server for continuous availability

• Single System Image: Logical view of all data on all servers

• Servers monitor the health of each other

• In the event a server fails or is unhealthy, other servers cooperatively diagnose the state

• The healthy servers immediately assume the responsibilities of the failed server

• Continuous Operation: No interruption of service or loss of data when a server fails
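
The behavior described above (a primary copy plus at least one synchronous backup, with survivors taking over on failure) can be sketched with a toy cluster. The class, the byte-sum hash, and the re-homing step are assumptions made for illustration. This is not Coherence's partition assignment or recovery protocol.

```python
class Cluster:
    """Toy partitioned store: each key has a primary and one backup."""

    def __init__(self, servers):
        self.servers = {name: {} for name in servers}

    def owners(self, key):
        names = sorted(self.servers)
        i = sum(key.encode()) % len(names)   # deterministic toy hash
        return names[i], names[(i + 1) % len(names)]

    def put(self, key, value):
        primary, backup = self.owners(key)
        self.servers[primary][key] = value
        self.servers[backup][key] = value    # synchronous backup copy

    def get(self, key):
        primary, _ = self.owners(key)
        return self.servers[primary][key]

    def fail(self, name):
        """Drop a server, then re-home every surviving entry."""
        del self.servers[name]
        survivors = {}
        for held in self.servers.values():
            survivors.update(held)
        for held in self.servers.values():
            held.clear()
        for key, value in survivors.items():
            self.put(key, value)
```

Because every entry lives on two different servers, losing any single server leaves a copy of each entry on a survivor, and `fail` restores the primary-plus-backup layout across the remaining members.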

Trading Platform Example

• System designed as Finite State Machine

• Data Affinity co-locates Orders and Market Matching Engines in Cluster

• Coherence manages recoverable state (always recoverable)

• Used standard Java Concurrency library for asynchronous tasks

• Individual components unit testable and provable – simplifies development

• Throughput and performance dependent on cores and network

• Designed to minimize storage and network tasks
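
Two ideas from this slide, data affinity and the finite state machine, can be sketched together. Everything named below is an assumption for illustration: the partition count, the byte-sum hash, the use of the symbol as the affinity key, and the order states. Coherence expresses affinity through key association; the sketch uses a plain hash instead.

```python
PARTITIONS = 8  # arbitrary partition count for the sketch


def partition_for(affinity_key: str) -> int:
    return sum(affinity_key.encode()) % PARTITIONS


def order_partition(order_id: str, symbol: str) -> int:
    # Orders are placed by their symbol, not by their own id.
    return partition_for(symbol)


def engine_partition(symbol: str) -> int:
    return partition_for(symbol)


# Allowed order transitions; state names are hypothetical.
TRANSITIONS = {
    ("new", "accept"): "open",
    ("open", "fill"): "filled",
    ("open", "cancel"): "cancelled",
}


def step(state: str, event: str) -> str:
    """Return the next state, or raise on an illegal transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}")
```

Since every order for a symbol lands in the same partition as that symbol's matching engine, matching needs no network hop. The explicit transition table is what makes each component easy to unit test in isolation.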


Revisiting Silos

• Co-located processing elements (PE) via Coherence EDA.

– Scaling and HA architected into system.

– Messaging component between PE eliminated.

• Several teams still involved in “elastic scaling”

– Need to procure, configure, and deploy new systems

– Need to configure and test new system on network

• Latency much better than where we started

– Removed network hops, data movement

– Limited by network speed

Exabus: Exalogic I/O and Network Design

Eliminates cloud, cluster and network virtualization I/O bottlenecks

[Diagram: Exalogic X2-2 network design. Labels: Data Center Service Network (10GbE); Data Center Mgmt Network (GbE); Management Network (GbE); Ethernet Gateway Switches; Standard Oracle Database; Exabus (InfiniBand I/O Backplane); Exadata; Exalogic; SPARC SuperCluster; Management Switch; Storage; Compute Nodes; Spine Switch; ZFS Storage; IB]

Coherence Exabus Optimizations

Direct Memory I/O for Java and C++

• Leverage new Java APIs and Exalogic Elastic Cloud Software

– Low-latency support for InfiniBand

– Optimized implementation for Exalogic InfiniBand

• Scalable to massively multi-core systems

• Surfacing low-level advanced networking capabilities

4x Throughput, 6x Better Response Time

Coherence on Exalogic Engineered System

Optimized Scalability and Performance in a Box

• Coherence optimized for Exabus

• Pre-configured, pre-optimized

• Elastic Data: Expand Capacity with Flash

• Easy deployment as demand spikes

• Scale from ¼ rack to multi-rack

Risk Systems Built on Oracle Coherence

Credit Suisse

Challenges

• Achieve five millisecond or lower response time for pre-transaction credit checks against counterparties globally

• Process intraday credit checks for a large number of transactions daily and scale by up to a factor of ten without risk of increase in latency

Solutions

• Built in-memory application grid for its performance, resilience and risk-free scale-out with Oracle Coherence and JRockit to achieve consistent low latency for credit checks.

• Preferred Coherence for its simplicity, which enabled a team of four to deliver the system quickly and support it globally.

• Coherence stores intraday data and processes credit checks.

• Installed regional system instances to ensure proximity to clients and enable low latency and instant failover.

JP Morgan Chase

Challenges

• Provide traders, researchers, and financial controllers with accurate, timely risk exposure and profit and loss (P&L) figures for the rates, exotics, and hybrids business in a volatile trading environment

• Gain drill down from aggregated book-level to trade-level details and slice data in multiple dimensions while reducing preparation and run times to support real-time decisions.

• Create a fully backed-up, highly redundant lossless environment to guarantee data availability in case of IT failure.

Solutions

• Built Project Orion, a risk exposure and P&L reporting solution, on Oracle Coherence for maximum resilience and risk-free linear scalability with a distributed, in-memory data cache.

• Deployed Orion on a large (more than 200 node) cluster in Europe, the Middle East, Asia, and North America.

• Loaded data into Coherence to provide dynamic aggregations for on-demand slicing and dicing

• Reduced turnaround time for delivering trade-level risk exposure and P&L to users.

For More Information

• General Information: http://coherence.oracle.com

• Coherence YouTube Channel: http://www.youtube.com/user/OracleCoherence

• Coherence Training: http://education.oracle.com

• Coherence Discussion Forum: http://forums.oracle.com

• Coherence User Group on LinkedIn

• “Oracle Coherence 3.5” by Aleks Seovic

• My email: [email protected]

Q&A