![Page 2: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/2.jpg)
2
![Page 3: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/3.jpg)
Better be safe than sorry Failures will happen EMC estimated $1.7 billion costs
due to data loss and system downtime
Recovery will save you time and costs
Switch between algorithms Live upgrade of your system
3
![Page 4: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/4.jpg)
4
Fault Tolerance
![Page 5: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/5.jpg)
Fault tolerance guarantees At most once• No guarantees at all
At least once• For many applications
sufficient Exactly once
Flink provides all guarantees
5
![Page 6: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/6.jpg)
Checkpoints Consistent snapshots of distributed
data stream and operator state
6
![Page 7: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/7.jpg)
Barriers Markers for checkpoints Injected in the data flow
7
![Page 8: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/8.jpg)
8
Alignment for multi-input operators
![Page 9: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/9.jpg)
Operator State Stateless operators System state
User defined state
9
ds.filter(_ != 0)
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
public class CounterSum implements RichReduceFunction<Long> { private OperatorState<Long> counter;
@Override public Long reduce(Long v1, Long v2) throws Exception { counter.update(counter.value() + 1); return v1 + v2; }
@Override public void open(Configuration config) { counter = getRuntimeContext().getOperatorState(“counter”, 0L, false); }}
![Page 10: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/10.jpg)
10
![Page 11: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/11.jpg)
11
![Page 12: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/12.jpg)
12
![Page 13: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/13.jpg)
13
![Page 14: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/14.jpg)
Advantages Separation of app logic from recovery• Checkpointing interval is just a config
parameter
High throughput• Controllable checkpointing overhead
Low impact on latency
14
![Page 15: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/15.jpg)
15
![Page 16: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/16.jpg)
Cluster High Availability
16
![Page 17: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/17.jpg)
Without high availability
17
JobManager
TaskManager
![Page 18: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/18.jpg)
With high availability
18
JobManager
TaskManager
Stand-byJobManager
Apache Zookeeper™
KEEP GOING
![Page 19: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/19.jpg)
Persisting jobs
19
JobManager
Client
TaskManagers
Apache Zookeeper™
Job
1. Submit job
![Page 20: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/20.jpg)
Persisting jobs
20
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Submit job2. Persist execution graph
![Page 21: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/21.jpg)
Persisting jobs
21
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Submit job2. Persist execution graph3. Write handle to ZooKeeper
![Page 22: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/22.jpg)
Persisting jobs
22
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Submit job2. Persist execution graph3. Write handle to ZooKeeper4. Deploy tasks
![Page 23: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/23.jpg)
Handling checkpoints
23
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Take snapshots
![Page 24: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/24.jpg)
Handling checkpoints
24
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Take snapshots2. Persist snapshots3. Send handles to JM
![Page 25: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/25.jpg)
Handling checkpoints
25
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Take snapshots2. Persist snapshots3. Send handles to JM4. Create global checkpoint
![Page 26: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/26.jpg)
Handling checkpoints
26
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Take snapshots2. Persist snapshots3. Send handles to JM4. Create global checkpoint5. Persist global checkpoint
![Page 27: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/27.jpg)
Handling checkpoints
27
JobManager
Client
TaskManagers
Apache Zookeeper™
1. Take snapshots2. Persist snapshots3. Send handles to JM4. Create global checkpoint5. Persist global checkpoint6. Write handle to ZooKeeper
![Page 28: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/28.jpg)
28
Conclusion
![Page 29: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/29.jpg)
29
![Page 30: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/30.jpg)
30
![Page 31: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/31.jpg)
TL;DL Job recovery mechanism with low
latency and high throughput Exactly one processing semantics No single point of failure
Flink will always keep processing your data
31
![Page 32: Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink](https://reader033.vdocuments.us/reader033/viewer/2022061306/5871382c1a28abf0568b62bf/html5/thumbnails/32.jpg)
flink.apache.org@ApacheFlink