building real time data-driven products

Building Real-time Data-driven Products Øyvind Løkling & Lars Albertsson Version 1.3 - 2016.10.12

Upload: lars-albertsson

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Building real time data-driven products

Building Real-time Data-driven Products

Øyvind Løkling & Lars Albertsson

Version 1.3 - 2016.10.12

Page 2: Building real time data-driven products

Øyvind Løkling
Staff Software Engineer

Schibsted Product & Technology

Oslo - Stockholm - London - Barcelona - Krakow
http://jobs.schibsted.com/

Page 3: Building real time data-driven products

Lars Albertsson
Independent Consultant

Mapflat.com

Page 4: Building real time data-driven products

.. and more

Page 5: Building real time data-driven products

Presentation goals
● Spark your interest in building data driven products.
● Give an overview of components and how these relate.
● Suggest technologies and approaches that can be used in practice.

● Event Collection
● The Unified Log
● Stream Processing
● Serving Results
● Schema


Page 6: Building real time data-driven products

Data driven products
• Services and applications primarily driven by capturing and making sense of data
  • Health trackers
  • Recommendations
  • Analytics

CC BY © https://500px.com/tommrazek

Page 7: Building real time data-driven products

Data driven products
• Hosted services need to
  • Handle large volumes of data
  • Clean and structure data
  • Serve individual users fast

CC BY © https://500px.com/tommrazek

Page 8: Building real time data-driven products

Big Data, Fast Data, Smart Data
• Accelerating data volumes and speeds

• Internet of Things

• AB Testing and Experiments

• Personalised products

CC BY © https://500px.com/erfgorini

Page 9: Building real time data-driven products

Big Data, Fast Data, Smart Data

A need to make sense of data and act on it fast

• Faster development cycle

• Data driven organisations

• Data driven products

CC BY © https://500px.com/erfgorini

Page 10: Building real time data-driven products

Time scales - what parts become obsolete?


Credits: Confluent

Credits: Netflix

Page 11: Building real time data-driven products

The Log

• Other common names

• Commit log

• Journal

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Page 12: Building real time data-driven products

Jay Kreps - I <3 Logs

The state machine replication principle:

If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.
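
The principle can be illustrated with a minimal sketch. The `apply` and `replay` functions and the event shape (a key and a delta) are hypothetical, chosen only for illustration: two replicas start from the same empty state, replay the same log in the same order, and end in the same state.

```python
from functools import reduce

def apply(state, event):
    """Deterministic transition: build a new state, never mutate the input."""
    key, delta = event
    new_state = dict(state)
    new_state[key] = new_state.get(key, 0) + delta
    return new_state

def replay(log, initial=None):
    """Fold the log, in order, over an initial state."""
    return reduce(apply, log, initial or {})

log = [("alice", 10), ("bob", 5), ("alice", -3)]
replica_a = replay(log)
replica_b = replay(log)
print(replica_a == replica_b)  # True
```

Determinism is what matters here: a transition that read the wall clock or a random number would break the guarantee.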

Page 13: Building real time data-driven products

The Unified Log
Simple idea: all of the organization's data, available in one unified log that is:

• Unified: the one source of truth
• Append only: data items are immutable
• Ordered: each item is addressable by an offset that is unique per partition
• Fast and scalable: able to handle thousands of messages per second
• Distributed, robust and fault tolerant
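
A toy in-memory sketch of the append-only, ordered and offset-addressable properties follows. The `UnifiedLog` class and its byte-sum partitioner are assumptions made for illustration. This is not Kafka's API, and it says nothing about distribution or fault tolerance.

```python
from collections import defaultdict

class UnifiedLog:
    """Toy log: append-only, ordered, addressed by (partition, offset)."""

    def __init__(self, num_partitions=2):
        self.num_partitions = num_partitions
        self._partitions = defaultdict(list)

    def append(self, key, value):
        # Illustrative, stable partitioner: the same key always maps to the same partition.
        partition = sum(key.encode("utf-8")) % self.num_partitions
        records = self._partitions[partition]
        records.append((key, value))  # records are only ever appended, never changed
        return partition, len(records) - 1

    def read(self, partition, offset):
        """Return a copy of all records in a partition from `offset` onwards."""
        return list(self._partitions[partition][offset:])
```

Because offsets are per partition, ordering is only guaranteed among records that share a partition, which in this sketch means records that share a key.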

Page 14: Building real time data-driven products

Kafka
• “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”

But… :-) http://www.confluent.io/kafka-summit-2016-101-ways-to-configure-kafka-badly

Page 15: Building real time data-driven products


Page 16: Building real time data-driven products

© 2016 Apache Software Foundation

Page 17: Building real time data-driven products

© 2016 Apache Software Foundation

Page 18: Building real time data-driven products

A streaming data product’s anatomy

[Diagram. Stages: Ingress, Stream processing, Egress. Labels: Pub / sub, Unified log, Topic, Job, Pipeline, Service, DB, Export, Visualisation]

Page 19: Building real time data-driven products

Architectural patterns: The Unified Log and Lambda, Kappa architectures

Page 20: Building real time data-driven products

Lambda Architecture

Example: Recommendation Engine @ Schibsted

Page 21: Building real time data-driven products
Page 22: Building real time data-driven products
Page 23: Building real time data-driven products
Page 24: Building real time data-driven products
Page 25: Building real time data-driven products
Page 26: Building real time data-driven products

Kappa Architecture
A software architecture pattern… where the canonical data store is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving.

A Lambda architecture system with batch processing removed.

https://www.oreilly.com/ideas/questioning-the-lambda-architecture
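
A minimal sketch of the idea, under stated assumptions: the log is a plain list, the "auxiliary store" is a dict, and the job functions `count_views_v1` and `count_views_v2` are hypothetical. When the processing logic changes, a new view is rebuilt by replaying the log from the start; no separate batch layer is involved.

```python
def build_view(log, job):
    """Stream every event from offset 0 through `job` into a fresh serving view."""
    view = {}
    for event in log:
        job(view, event)
    return view

def count_views_v1(view, event):
    view[event["item"]] = view.get(event["item"], 0) + 1

def count_views_v2(view, event):
    # Changed logic: ignore events flagged as bot traffic.
    if not event.get("bot", False):
        view[event["item"]] = view.get(event["item"], 0) + 1

log = [
    {"item": "a"},
    {"item": "a", "bot": True},
    {"item": "b"},
]
view_v1 = build_view(log, count_views_v1)
view_v2 = build_view(log, count_views_v2)  # rebuilt from the same immutable log
```

Once the rebuilt view has caught up, readers can be switched over to it and the old view dropped.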

Page 27: Building real time data-driven products
Page 28: Building real time data-driven products
Page 29: Building real time data-driven products

Kappa architecture

Page 30: Building real time data-driven products

Lambda vs Kappa
● Lambda
○ Leverages existing batch infrastructure
○ Cognitive overhead of maintaining two approaches in parallel
● Kappa
○ “Real-time processing is inherently approximate, less powerful, and more lossy than batch processing.” True?
○ Simpler model

https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Page 31: Building real time data-driven products

Cold Storage
• Source of truth for replay in case of failure
• Available for ad-hoc batch querying (Spark)
• Wishlist: fast writes, reliable, cheap

• Cloud storage - S3 (with Glacier)
• Traditional - HDFS, SAN

• Consider what read performance you need for a) error recovery b) bootstrapping new deployments

Page 32: Building real time data-driven products

Capturing the data

Page 33: Building real time data-driven products

Event collection


Cluster storage
HDFS (NFS, S3, Google CS, C*)

Service

Reliable, simple, write available

Kafka
Event bus with history

(Secor, Camus)

Immediate handoff to append-only replicated log. Once in the log, events eventually arrive in storage.

Unified log
Immutable events, append-only, source of truth

Page 34: Building real time data-driven products

Event collection - guarantees


Unified log

Service (unimportant)

Events are safe from here

Replicated Topics

Non-critical data: Asynchronous fire-and-forget handoff

Critical data: Synchronous, replicated, with acknowledgement

Service (important)
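
The two handoff styles can be contrasted in a toy simulation. `ReplicatedTopic` and `Producer` are hypothetical classes written for this illustration and do not correspond to any Kafka client API. The point is only where the service gets control back: before the record is safe (fire-and-forget) or after every replica holds it (synchronous with acknowledgement).

```python
class ReplicatedTopic:
    """Toy topic that keeps a copy of every record on each replica."""

    def __init__(self, replicas=3):
        self.replicas = [[] for _ in range(replicas)]

    def append(self, record):
        for replica in self.replicas:
            replica.append(record)
        return len(self.replicas[0]) - 1  # offset, used as the acknowledgement


class Producer:
    def __init__(self, topic):
        self.topic = topic
        self._buffer = []

    def send_fire_and_forget(self, record):
        """Non-critical data: buffer locally and return at once.
        The record is lost if the service crashes before flush()."""
        self._buffer.append(record)

    def flush(self):
        while self._buffer:
            self.topic.append(self._buffer.pop(0))

    def send_sync(self, record):
        """Critical data: return only after every replica holds the record."""
        return self.topic.append(record)
```

In this sketch a record sent with `send_sync` is safe as soon as the call returns, whereas a fire-and-forget record is safe only after a later `flush()`.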

Page 35: Building real time data-driven products

Event collection
• Source types
  • Firehose APIs
  • Mobile apps and websites
  • Internet of things / embedded sensors
  • Event sourcing from existing systems

Page 36: Building real time data-driven products

Event collection

• Considerations
  • Can you control the data flow and ask the sender to wait?
  • Can clients be expected to have their logic updated?
  • Can you afford to lose some data, make tradeoffs and still solve your task?

Page 37: Building real time data-driven products

Stream Processing

Page 38: Building real time data-driven products

Pipeline graph
• Parallelised jobs read and write to Kafka topics
• Jobs are completely decoupled
• Downstream jobs do not impact upstream
• Usually an acyclic graph

CC BY © https://500px.com/thakurdalipsingh
Illustration from Apache Samza Doc - Concepts

Page 39: Building real time data-driven products

Stream processing components
● Building blocks
○ Aggregate
■ Calculate time windows
■ Aggregate state (database/in memory)
○ Filter
■ Slim down stream
■ Privacy, security concerns
○ Join
■ Enrich by joining with datasets (geoip)
○ Transform
■ Bring data into same “shape”, schema
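
The four building blocks can be sketched with plain Python generators. Everything here is an assumption for illustration: the event fields, the `GEOIP` lookup table and the 60-second tumbling window. A real job would read from and write to topics rather than lists.

```python
from collections import Counter

GEOIP = {"10.0.0.1": "NO", "10.0.0.2": "SE"}  # hypothetical lookup dataset

def filter_events(events):
    """Filter: slim down the stream and drop a privacy-sensitive field."""
    for e in events:
        if e["type"] == "page_view":
            yield {k: v for k, v in e.items() if k != "email"}

def join_geoip(events):
    """Join: enrich each event with a country from the lookup dataset."""
    for e in events:
        yield {**e, "country": GEOIP.get(e["ip"], "unknown")}

def transform(events):
    """Transform: bring events into one shape with typed fields."""
    for e in events:
        yield {"ts": int(e["ts"]), "country": e["country"]}

def aggregate(events, window_seconds=60):
    """Aggregate: count events per (window start, country), state held in memory."""
    counts = Counter()
    for e in events:
        window_start = e["ts"] - e["ts"] % window_seconds
        counts[(window_start, e["country"])] += 1
    return counts

raw = [
    {"type": "page_view", "ts": "5", "ip": "10.0.0.1", "email": "a@example.com"},
    {"type": "heartbeat", "ts": "6", "ip": "10.0.0.1"},
    {"type": "page_view", "ts": "65", "ip": "10.0.0.2"},
    {"type": "page_view", "ts": "70", "ip": "10.0.0.2"},
]
result = aggregate(transform(join_geoip(filter_events(raw))))
```

Each stage only consumes the output of the previous one, so stages can be reordered or replaced independently.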

Page 40: Building real time data-driven products

Stream Processing Platforms
• Spark Streaming
  • Ideal if you are already using Spark, same model
  • Bridges gap between data science / data engineers, batch and stream
• Kafka Streams
  • Library - new, positions itself as a lightweight alternative
  • Tightly coupled to Kafka
• Others
  ○ Storm, Flink, Samza, Google Dataflow, AWS Lambda

http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

Page 41: Building real time data-driven products

Schemas

Page 42: Building real time data-driven products

Schemas
• You always have a schema
• Schema on write
  • Requires upfront schema design before data can be received
  • Synchronised deployment of whole pipeline
• Schema on read
  • Allows data to be captured as is
  • Suitable for “data lake”
  • Often requires cleaning and transforms downstream to bring datasets into consistency

Page 43: Building real time data-driven products

Schema on read or write?


[Diagram labels: DB, Service, Export, Business intelligence]

Change agility important here

Production stability important here

Page 44: Building real time data-driven products

Schemas
• Options in streaming applications
  • Schema bundled with every record
  • Schema registry + id in record
• Schema formats
  • JSON Schema
  • Avro
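
The "schema registry + id in record" option can be sketched as follows. The in-memory `SCHEMA_REGISTRY` dict, the JSON envelope and the field-presence check are assumptions for illustration only. This is not the wire format or API of any real schema registry, and a real setup would validate against a JSON Schema or Avro schema.

```python
import json

# Hypothetical registry: schema id -> schema definition.
SCHEMA_REGISTRY = {
    1: {"fields": ["user_id", "url"]},
}

def encode(schema_id, payload):
    """Check the payload against the registered schema, then tag it with the id."""
    missing = [f for f in SCHEMA_REGISTRY[schema_id]["fields"] if f not in payload]
    if missing:
        raise ValueError(f"payload is missing fields: {missing}")
    return json.dumps({"schema_id": schema_id, "payload": payload}).encode("utf-8")

def decode(raw):
    """Look the schema up by the id carried in the record."""
    record = json.loads(raw.decode("utf-8"))
    schema = SCHEMA_REGISTRY[record["schema_id"]]
    return schema, record["payload"]
```

Compared with bundling the full schema into every record, the record carries only a small id, at the cost of the consumer needing access to the registry.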

Page 45: Building real time data-driven products

Evolving Schemas
• Declare schema version even if no guarantee. Captures intention of source.
• Be prepared for bad and non-validating data.
• Decide on strategy for bringing schema versions in alignment.
• Maintain upgrade path through transforms.
• What are the needs of the consumer?
  • Data exploration vs stable services.
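
One possible way to maintain an upgrade path through transforms is a chain of single-step upgraders. The field names, version numbers and the `upgrade` helper below are hypothetical, and returning `None` for bad records is only one of several reasonable strategies.

```python
def v1_to_v2(record):
    # Assumed change: v2 renamed "uid" to "user_id".
    return {"schema_version": 2, "user_id": record["uid"], "url": record["url"]}

def v2_to_v3(record):
    # Assumed change: v3 added an optional "referrer" field.
    return {**record, "schema_version": 3, "referrer": record.get("referrer")}

UPGRADES = {1: v1_to_v2, 2: v2_to_v3}
LATEST = 3

def upgrade(record):
    """Apply upgrades one version at a time; return None if the record cannot be upgraded."""
    try:
        while record["schema_version"] < LATEST:
            record = UPGRADES[record["schema_version"]](record)
        return record
    except (KeyError, TypeError):
        # Bad or non-validating data: a real pipeline might route this to a side output.
        return None
```

With single-step transforms, adding version 4 means writing one new function rather than touching every older version.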

Page 46: Building real time data-driven products

Results

Page 47: Building real time data-driven products

Serving Results

● As streams
○ Internal consumer
○ External consumer bridges
■ e.g. REST POST to external ingest endpoint
● As views
○ Serving index, NoSQL
○ SQL / cubes for BI

Page 48: Building real time data-driven products

Reactive Streams
• [...] an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure. [...] aimed at runtime environments (JVM and JavaScript) as well as network protocols.
• The scope [...] is to find a minimal set of interfaces, methods and protocols that will describe the necessary operations and entities to achieve this goal.
• “Glue” between libraries: Reactive Kafka -> Akka Stream -> RxJava
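
Non-blocking back pressure can be illustrated, by analogy only, with a bounded `asyncio.Queue`. This is not the Reactive Streams API (which defines its own publisher and subscriber interfaces). The function names and the buffer size are assumptions. The producer suspends when the buffer is full, without blocking the thread, so a slow consumer limits how far ahead the producer can run.

```python
import asyncio

async def producer(queue, items, log):
    for item in items:
        await queue.put(item)   # suspends, without blocking the thread, while the queue is full
        log.append(("produced", item))
    await queue.put(None)       # end-of-stream marker

async def consumer(queue, log):
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0)  # stand-in for slow downstream work
        log.append(("consumed", item))

async def run(items, buffer_size=2):
    queue = asyncio.Queue(maxsize=buffer_size)
    log = []
    await asyncio.gather(producer(queue, items, log), consumer(queue, log))
    return log

events = asyncio.run(run([1, 2, 3, 4, 5]))
```

The bounded buffer is the design choice that matters: with an unbounded queue the producer would never wait and memory use could grow without limit.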

Page 49: Building real time data-driven products

https://imgflip.com/memetemplate/17759370/dog-meditation-funny

Page 50: Building real time data-driven products

Thank you!

Page 51: Building real time data-driven products

Schemas
• You always have a schema
  • Even if you are “schemaless”
• Build tooling and workflows for handling schema changes