The Data Dichotomy: Rethinking the Way We Treat Data and Services


The Data Dichotomy: Rethinking Data & Services with Streams

What are microservices really about?

(Diagram: a GUI in front of a set of microservices: UI Service, Orders Service, Returns Service, Fulfilment Service, Payment Service, Stock Service.)

Splitting the monolith? A single process / code base / deployment becomes many processes / code bases / deployments.

Autonomy? Each service (Orders, Stock, Email, Fulfilment) is independently deployable.

Independence is where services get their value.

It allows scaling, but not just in terms of data/load/throughput... scaling in terms of people.

What happens when we grow? Companies are inevitably a collection of applications, and those applications must work together to some degree. Yet interconnection is an afterthought: FTP, enterprise messaging, etc.

Services force us to consider the external world.

(Diagram: the internal world of a service, separated from the external world by the service boundary.)

The external world is something we should design for.

But this independence comes at a cost. $$$

Consider two objects in one address space: an Orders object and a Statement object exposing getOpenOrders(). Less surface area = less coupling. Encapsulate state; encapsulate functionality. Encapsulation => loose coupling.

Make change 1 and change 2, then redeploy. Singly-deployable apps are easy.

Now make them an Orders Service and a Statement Service, each independently deployable, with getOpenOrders() crossing the boundary between them. Synchronized changes are painful.

Services work best where requirements are isolated in a single bounded context.

Example: a Single Sign-On service exposing authorise() to a business service. It's unlikely that the business service would need the internal SSO state or function to change: SSO has a tightly bounded context.

But business services are different.

Most business services (catalog, authorisation, orders...) share the same core stream of facts. Most services live in here, so the futures of business services are far more tightly intertwined.

We need encapsulation to hide internal state and stay loosely coupled. But we also need the freedom to slice & dice shared data like any other dataset.

But data systems have little to do with encapsulation. A service keeps data on the inside, exposing little to the outside: its interface hides data. A database does the opposite: its interface amplifies the data it holds.

The data dichotomy: data systems are about exposing data. Services are about hiding it.

This affects services in one of two ways.

getOpenOrders(fulfilled=false, deliveryLocation=CA, orderValue=100, operator=GreaterThan)

Either (1) we constantly add to the interface as datasets grow:

getOrder(Id)
getOrder(UserId)
getAllOpenOrders()
getAllOrdersUnfulfilled(Id)
getAllOrders()

Services can end up looking like kooky, home-grown databases... and data amplifies this service-boundary problem.
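To make that amplification concrete, here is a hypothetical Java sketch; the OrdersService interface, Order record and Comparison enum are all invented for illustration, not taken from the talk:

```java
import java.util.List;

// Hypothetical sketch: each new reporting need adds another query method,
// until the service resembles a home-grown, poorly indexed database.
public interface OrdersService {
    Order getOrder(long id);
    List<Order> getOrdersByUser(long userId);
    List<Order> getAllOpenOrders();
    List<Order> getAllOrdersUnfulfilled(long userId);
    List<Order> getAllOrders();

    // ...and next sprint, yet another permutation:
    List<Order> getOpenOrders(boolean fulfilled, String deliveryLocation,
                              int orderValue, Comparison operator);
}

// Minimal invented types so the sketch stands alone.
record Order(long id, boolean fulfilled) {}
enum Comparison { GREATER_THAN, LESS_THAN }
```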

Or (2) we give up and move whole datasets en masse: getAllOrders(), with the data copied internally. This leads to a different problem: data diverges over time. Once Orders, Stock, Email and Fulfilment services each hold their own copy, the more mutable copies there are, the more the data will diverge.

Cycle of inadequacy:

Start here: nice, neat services. Over time the service contract proves too limiting. Can we change both services together, easily? No. Broaden the contract? Eek, $$$. So: "Frack it! Just give me ALL the data." Now is it a shared database? Yes, and the data diverges (many different versions of the same facts). So let's encapsulate, let's centralise to one copy... and we are back to nice, neat services.

These forces compete in the systems we build: divergence, accessibility, encapsulation. A shared database gives better accessibility; service interfaces give better independence. There is no perfect solution.

Is there a better way?

(Diagram: data on the inside vs. data on the outside, across the service boundary.)

Make data-on-the-outside a first-class citizen: separate reads & writes with immutable streams (CQRS).

•  Writes => "normalized" into each service
•  Reads => "denormalized" into services
•  One canonical set of streams
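A minimal sketch of the write side, assuming a topic named "orders", string/JSON-encoded events and a local broker (all names illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // The Orders Service is the single writer for the "orders" stream:
        // it validates the command, then appends the resulting event.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42",
                    "{\"id\":\"order-42\",\"state\":\"CREATED\"}"));
        }
    }
}
```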

(Diagram: Orders, Stock, Fulfilment, UI and Fraud services arranged around a central Orders stream.)

Writes enter via the Orders Service, which manages the workflow of orders. Readers use the streams, materialized in each service. Important: the data is not retained in each service; it's just a view over those centralized streams.
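One way to materialize such a view is with Kafka's Streams API. A sketch, where the application id, topic and store names are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;

public class OrdersView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-service"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Materialize the canonical "orders" stream as a local, queryable
        // state store: a view over the shared log, not a second system of record.
        builder.table("orders",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("orders-store"));

        new KafkaStreams(builder.build(), props).start();
    }
}
```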

Use a stream processing toolkit. Kafka is a streaming platform: the log at its core, surrounded by connectors, a producer/consumer API and a streaming engine.

Shard on the way in: producing services write into Kafka; consuming services read from it. Each shard is a queue, and consumers share the load across shards. Scaling becomes a concern of the broker, not the service.
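A sketch of how consuming services share load: every instance started with the same group.id is assigned a subset of the topic's partitions, and the broker rebalances as instances come and go (topic and group names illustrative):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FulfilmentWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All instances sharing this group.id split the topic's partitions
        // between them; the broker rebalances on failure or scale-out.
        props.put("group.id", "fulfilment-service");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("fulfilling %s%n", record.key());
                }
            }
        }
    }
}
```

Start a second instance with the same group.id and the partitions are split between the two; kill one and its partitions move back.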

The result is load-balanced, fault-tolerant services: build "always on" services by relying on the fault-tolerant broker. You can also reset to any point in the shared narrative: rewind & replay.
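Rewind & replay in terms of the consumer API might look like this sketch, which takes explicit control of one illustrative partition and seeks back to the beginning of the retained log:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class Replayer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Take explicit control of a partition, wind back to the start
            // of the shared narrative, and re-consume from there.
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));
            consumer.poll(Duration.ofSeconds(1)); // ...then process as usual
        }
    }
}
```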

A compacted log retains only the latest version of each key: where the full log holds versions 1 through 5 of a record, the compacted log keeps just the most recent one.
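In Kafka, compaction is a per-topic setting. A sketch of creating such a topic with the AdminClient; the topic name, partition count and replication factor are illustrative:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A compacted topic keeps only the latest record per key, so the
            // log stays bounded while still holding current state indefinitely.
            NewTopic stock = new NewTopic("stock", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Set.of(stock)).all().get();
        }
    }
}
```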

Service backbone: scalable, fault tolerant, concurrent, strongly ordered, retentive. A place to keep the data-on-the-outside as an immutable narrative.

Now add stream processing. Kafka's Streams API is a general, embeddable streaming engine.

What is stream processing? Think of a database engine designed to process streams. Its features are similar to a database query engine's: join, filter, aggregate, view, window. For example:

Max(price) from orders where ccy='GBP', over a 1-day window, emitting every second

Avg(o.time – p.time) from orders, payments, users, grouped by user.region, over a 1-day window, emitting every second
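A hedged Kafka Streams rendering of the first query above. It assumes an "order-prices" topic keyed by currency code with the price as the value (an invented encoding); note that Streams emits updates as records arrive rather than on a fixed one-second timer:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class MaxGbpPrice {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "max-gbp-price");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("order-prices", Consumed.with(Serdes.String(), Serdes.Double()))
                .filter((ccy, price) -> "GBP".equals(ccy))            // where ccy='GBP'
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(1))) // 1-day window
                .aggregate(() -> 0.0,                                 // Max(price), assuming
                        (ccy, price, max) -> Math.max(max, price),    // non-negative prices
                        Materialized.with(Serdes.String(), Serdes.Double()))
                .toStream()
                .foreach((window, max) ->
                        System.out.printf("%s max=%.2f%n", window, max));

        new KafkaStreams(builder.build(), props).start();
    }
}
```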

Stateful stream processing: streams flow into the stream processing engine, which derives a "table" that the domain logic can query.

A plain stream carries event data; a compacted stream carries stream-tabular data. Joining them materializes tables/views inside the service, fed from Kafka. The result is an event-driven service that joins shared streams from many other services.

Example: an Email Service uses Kafka Streams (KSTREAMS) to join the Orders, Payments and Stock streams, published into Kafka by other services or a legacy app. The shared state lives in Kafka and is only cached in the service, so there is no way for it to diverge. Kafka Streams is the tool for accessing shared, retentive streams (customers, orders, catalogue, payments).
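A sketch of such a join, assuming for illustration that both topics carry string values and are keyed by customer id:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EmailServiceTopology {
    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Orders are a stream of facts; customers are stream-tabular data
        // from a compacted topic, cached locally as a table.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> customers =
                builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // Stream-table join (both keyed by customer id in this sketch):
        // enrich each order with customer details, then apply domain logic.
        orders.join(customers, (order, customer) -> order + " / " + customer)
                .foreach((customerId, enriched) ->
                        System.out.printf("sending email: %s%n", enriched));

        return builder;
    }
}
```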

But sometimes you have to move data. When you do: replicate it, so both copies stay identical; keep it immutable; and take only the data you need today. Confluent's connectors make this easier, streaming data between the log and external systems such as Oracle and Cassandra.

So... when building services, consider more than just REST.

The data dichotomy: data systems are about exposing data. Services are about hiding it.

Remember: shared data is a reality for most organisations. Embrace the data that both lives in, and flows between, services. Avoid data services that have complex, amplifying interfaces. Share data via immutable streams, and embed function into each service.

A distributed log feeding event-driven services is a middle ground between the three classic integration options:

(A) Shared database: data & function centralized.
(B) Messaging: data & function everywhere.
(C) Service interfaces: data & function in owning services.

Middle ground: between the shared database's better accessibility and the service interfaces' better independence.

Balance the Data Dichotomy

RIDER Principles

•  Reactive - Leverage Asynchronicity. Focus on the now.

•  Immutable - Build an immutable, shared narrative.

•  Decentralized - No GOD services. Receiver driven.

•  Evolutionary - Use only the data you need today.

•  Retentive - Rely on the central event log, indefinitely.

Event-driven service backbone: the log (streams & tables) at the centre, surrounded by ingestion services, services with polyglot persistence, simple services and streaming services.

Tune in next time

• How do we actually build these things?

• Putting the micro into microservices

Discount code: kafcom17. Use the Apache Kafka community discount code to get $50 off at www.kafka-summit.org. Kafka Summit New York: May 8. Kafka Summit San Francisco: August 28.

Presented by Ben Stopford

Twitter: @benstopford

A blog version of this talk is at http://confluent.io/blog
