flink forward sf 2017: stephan ewen - convergence of real-time analytics and data-driven...
Post on 21-Apr-2017
166 Views
Preview:
TRANSCRIPT
Big thanks to everyone!
The convergence ofreal-time analytics and
event-driven applications@StephanEwen
Flink Forward San FranciscoApril 11, 2017
2
3
2016 was the year when streaming technologies became mainstream
2017 is the year to realize the full spectrum
of streaming applications
Some large scale streaming applications
4
5
Detecting fraud in real time
As fraudsters get better, need to update models without downtime
Live 24/7 service
Credit card transactions
Notificationsand alerts
Evolving fraudmodels built bydata scientists
@
6
@ Athena X SQL to define metrics Thresholds and actions to trigger Blends analytics and
actionsStreams from Hadoop, Kafka, etc
SQL, thresholds,
actions
AnalyticsAlerts
Derived streams
7
Route events to Kafka, ES, Hive Complex interaction sessions rules Mix of stateless / small state / large state
Stream Processing as a Service• Launching, monitoring, scaling, updating• DSL to define jobs
@
8
Blink based on Flink A core system in Alibaba Search
• Machine learning, search, recommendations• A/B testing of search algorithms• Online feature updates to boost conversion rate
Alibaba is a major contributor to Flink Contributing many changes back to open source
@
9
@
Complete social network implementedusing event sourcing andCQRS (Command Query Responsibility Segregation)
What can we learn from these?
10
All these applications run on Flink Applications, not just analytics
• Not just finding out what the data means but acting on that at the same time
Workloads going beyond the traditional Hadoop realm• Hadoop is possible deploy, source, and sink• Container engines and other storage systems
increasingly popular with Flink
So, what is data streaming?
11
First wave for streaming was lambda architecture• Aid batch systems to be more real-time
Second wave was analytics (real time and lag-time)• Based on distributed collections, functions, and
windows
The next wave is much broader:A new architecture for event-driven applications
12
Event–driven applications
Events, State, Time, and Snapshots
14
f(a,b)
Event-driven functionexecuted distributedly
Events, State, Time, and Snapshots
15
f(a,b)
Maintain fault tolerant local state similar toany normal application
Events, State, Time, and Snapshots
16
f(a,b)
wall clock
event time clock
Access and react tonotions of time and progress,handle out-of-order events
Events, State, Time, and Snapshots
17
f(a,b)
wall clock
event time clock
Snapshot point-in-timeview for recovery,rollback, cloning,versioning, etc.
Event–driven applications
18
Event-drivenApplications
Stream Processing
Batch Processing
Stateful, event-driven,event-time-aware processing
(event sourcing, CQRS, …)
(streams, windows, …)
(data sets)
The APIs
19
Process Function (events, state, time)
DataStream API (streams, windows)
Table API (dynamic tables)
Stream SQL
Stream- &Batch Processing
Analytics
StatefulEvent-DrivenApplications
Process Function
20
class MyFunction extends ProcessFunction[MyEvent, Result] {
// declare state to use in the program lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = { // work with event and state (event, state.value) match { … }
out.collect(…) // emit events state.update(…) // modify state
// schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) }
def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = { // handle callback when event-/processing- time instant is reached }}
Data Stream API
21
val lines: DataStream[String] = env.addSource( new FlinkKafkaConsumer09<>(…))
val events: DataStream[Event] = lines.map((line) => parse(line))
val stats: DataStream[Statistic] = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum(new MyAggregationFunction())
stats.addSink(new RollingSink(path))
Table API & Stream SQL
22
Streaming Architecturefor Event-driven Applications
23
Compute, State, and Storage
24
Classic tiered architecture
Streaming architecture
database
layer
computelayer
application state+ backup
compute+
stream storageand
snapshot storage(backup)
application state
Performance
25
synchronous reads/writesacross tier boundary
asynchronous writesof large blobs
all modificationsare local
Classic tiered architecture
Streaming architecture
Consistency
26
distributed transactions
at scale typicallyat-most / at-least once
exactly onceper state
=1 =1snapshot consistency
across states
Classic tiered architecture
Streaming architecture
Scaling a Service
27
separately provision additionaldatabase capacity
provision computeand state together
Classic tiered architecture
Streaming architecture
provision compute
Rolling out a new Service
28
provision a new database(or add capacity to an existing one)
provision compute
and state together
simply occupies someadditional backup
space
Classic tiered architecture
Streaming architecture
Time, Completeness, Out-of-order
29
?
event time clocksdefine data
completenessevent time timers
handle actions for
out-of-order data
Classic tiered architecture
Streaming architecture
Repair External State
30
Streaming architecture
streams(lets say Kafka etc) live application external state
wrong results
backed up data(HDFS, S3, etc.)
Repair External State
31
Streaming architecture
live application external state
overwritewith correct results
streams(lets say Kafka etc)
backed up data(HDFS, S3, etc.) application on backup
input
Repair External State
32
Streaming architecture
live application external state
overwritewith correct results
streams(lets say Kafka etc)
backed up date(HDFS, S3, etc.)
Each service doubles as a batch job!
application on backup input
33
Streaming has outgrown the Hadoop Stack
Event-driven applications and realtime analytics converge with Apache Flink
Event-driven applications become easierto manage, faster, and more powerful following a
streaming architecture implemented with Flink
top related