Till Rohrmann Flink PMC member
[email protected] @stsffap
Interactive Data Analysis with Apache Flink
Data Analysis
Exploratory Data Analysis
• Visualize data
• Calculate main characteristics
• Understand data and find possibly new hypotheses
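As a minimal sketch of "calculate main characteristics", the snippet below computes a few summary statistics with plain Scala collections, so it can be pasted into any Scala REPL. The `sales` values are made up for illustration and are not taken from the talk.

```scala
// Hypothetical sample values, made up for illustration only.
val sales = Seq(22.1, 10.4, 9.3, 18.5, 12.9)

val n = sales.size
val mean = sales.sum / n

// Sample variance with Bessel's correction (divide by n - 1).
val variance = sales.map(x => math.pow(x - mean, 2)).sum / (n - 1)
val stdDev = math.sqrt(variance)

// Median: middle element of the sorted values (mean of the two middle ones if n is even).
val sorted = sales.sorted
val median =
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0

println(f"n=$n mean=$mean%.2f median=$median%.2f stdDev=$stdDev%.2f min=${sales.min}%.1f max=${sales.max}%.1f")
```

On a real data set the same quantities would be computed on a distributed collection rather than a local `Seq`; the local version only shows which characteristics are meant.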
Data Analysts
Read-Evaluate-Print Loop
• New Scala shell offers REPL
• Interactive queries
• Lets you explore data quickly
Scala Shell
Simple Scala Shell Example
Problems
• No visualization
• No saving or replaying of written code
• No assistance → Bad IDE
Notebooks
• Web-based interactive computation environment
• Combines rich text, executable code, plots and rich media
• Storytelling
Apache Zeppelin
• Web-based REPL with pluggable interpreters
• Since 2014 in the Apache Incubator
• Supported interpreters:
  • Flink
  • Spark
  • Python
  • Markdown
  • Many more …
Word Count with Zeppelin
• Find the 10 most frequent words with more than 4 letters in the King James Version of the Bible.
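The logic of this task can be sketched in a few lines. The version below is an illustration with plain Scala collections, not the Flink job from the notebook: `topWords` is a hypothetical helper, and three short lines stand in for the full text, which is not bundled here.

```scala
// Stand-in input; the task on the slide uses the complete King James Bible.
val text = Seq(
  "In the beginning God created the heaven and the earth.",
  "And the earth was without form, and void;",
  "and darkness was upon the face of the deep."
)

// Hypothetical helper: the k most frequent words with more than `minLetters` letters.
def topWords(lines: Seq[String], minLetters: Int, k: Int): Seq[(String, Int)] =
  lines
    .flatMap(_.toLowerCase.split("\\W+"))              // tokenize on non-word characters
    .filter(_.length > minLetters)                     // keep only the longer words
    .groupBy(identity)                                 // group equal words together
    .map { case (word, occurrences) => (word, occurrences.size) }
    .toSeq
    .sortBy { case (word, count) => (-count, word) }   // count descending, ties alphabetical
    .take(k)

val top10 = topWords(text, minLetters = 4, k = 10)
top10.foreach { case (word, count) => println(s"$word\t$count") }
```

The same tokenize, filter, group, count, sort chain carries over to a distributed data set; only the collection type and the way the top 10 are collected would change.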
Linear regression
• Let’s predict the influence of advertisement spending on sales
• Input data set: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
• Features:
  • TV advertisement money
  • Radio advertisement money
  • Newspaper advertisement money
• Response:
  • Sales
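To make the model concrete: sales are modelled as an intercept plus a weighted sum of the three spending features. The sketch below fits those weights by ordinary least squares via the normal equations in plain Scala. This is one possible way to fit the model and not necessarily the solver used in the notebook; the data is synthetic and noise-free, and `solve`, `fit`, `predict` and `trueWeights` are hypothetical names.

```scala
// Solve A x = b by Gaussian elimination with partial pivoting.
def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
  val n = b.length
  val m = Array.tabulate(n)(i => a(i) :+ b(i)) // augmented matrix [A | b]
  for (col <- 0 until n) {
    val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
    val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
    for (row <- col + 1 until n) {
      val factor = m(row)(col) / m(col)(col)
      for (k <- col to n) m(row)(k) -= factor * m(col)(k)
    }
  }
  val x = new Array[Double](n)
  for (i <- n - 1 to 0 by -1) { // back substitution
    val known = (i + 1 until n).map(j => m(i)(j) * x(j)).sum
    x(i) = (m(i)(n) - known) / m(i)(i)
  }
  x
}

// Least-squares weights (intercept first) from the normal equations (X^T X) w = X^T y.
def fit(features: Seq[Array[Double]], response: Seq[Double]): Array[Double] = {
  val rows = features.map(f => 1.0 +: f) // prepend a constant 1 for the intercept
  val d = rows.head.length
  val xtx = Array.tabulate(d, d)((i, j) => rows.map(r => r(i) * r(j)).sum)
  val xty = Array.tabulate(d)(i => rows.zip(response).map { case (r, y) => r(i) * y }.sum)
  solve(xtx, xty)
}

def predict(w: Array[Double], x: Array[Double]): Double =
  w(0) + x.indices.map(i => w(i + 1) * x(i)).sum

// Synthetic spending data in made-up units: TV, radio, newspaper.
val features: Seq[Array[Double]] = Seq(
  Array(1.0, 2.0, 3.0), Array(2.0, 1.0, 0.0), Array(3.0, 3.0, 1.0),
  Array(0.0, 1.0, 2.0), Array(4.0, 0.0, 1.0), Array(2.0, 2.0, 2.0)
)
// Assumed ground truth, used only to generate the toy responses.
val trueWeights = Array(3.0, 0.5, 0.2, 0.0) // intercept, TV, radio, newspaper
val sales = features.map(predict(trueWeights, _))

val weights = fit(features, sales)
println(weights.map(w => f"$w%.3f").mkString("weights: ", ", ", ""))
```

Because the toy responses are generated without noise, the fitted weights should match `trueWeights` up to floating-point error. With the real Advertising.csv the fit would only approximate the data, and the size of each weight indicates how strongly that channel is associated with sales.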
15
16
17
18
19
20
21
22
23
24
Classification
• Let’s build a classifier for insult detection
• Kaggle challenge: https://www.kaggle.com/c/detecting-insults-in-social-commentary
• Label: 1 – Insult, 0 – No insult
• Feature: Comment text
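The slide does not say which model is trained. As a minimal sketch of the setup (comment text in, label 0 or 1 out), the snippet below uses a bag-of-words naive Bayes classifier with add-one smoothing in plain Scala. The model choice is an assumption, the training comments are made up and stand in for the Kaggle data, and all names are hypothetical.

```scala
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

// Made-up training comments: label 1 = insult, 0 = no insult.
val training: Seq[(String, Int)] = Seq(
  ("you are an idiot", 1),
  ("what a stupid idiot", 1),
  ("you are so stupid", 1),
  ("thanks for the great article", 0),
  ("what a great point", 0),
  ("you are right thanks", 0)
)

val labels = Seq(0, 1)
val vocabulary: Set[String] = training.flatMap(t => tokenize(t._1)).toSet

// Word frequencies per label, and number of training comments per label.
val wordCounts: Map[Int, Map[String, Int]] = labels.map { label =>
  val words = training.filter(_._2 == label).flatMap(t => tokenize(t._1))
  label -> words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
}.toMap
val docCounts: Map[Int, Int] =
  labels.map(label => label -> training.count(_._2 == label)).toMap

// Log posterior up to a constant, with Laplace (add-one) smoothing.
// Words that never occurred in training are ignored.
def score(text: String, label: Int): Double = {
  val counts = wordCounts(label)
  val total = counts.values.sum + vocabulary.size
  val logPrior = math.log(docCounts(label).toDouble / training.size)
  val logLikelihood = tokenize(text)
    .filter(vocabulary.contains)
    .map(w => math.log((counts.getOrElse(w, 0) + 1).toDouble / total))
    .sum
  logPrior + logLikelihood
}

def classify(text: String): Int = labels.maxBy(label => score(text, label))

println(classify("you stupid idiot"))    // words seen mostly in insulting comments
println(classify("thanks, great point")) // words seen only in harmless comments
```

Six sentences are far too few to learn anything general; the point is only the pipeline shape of tokenizing the comment, turning it into word counts, and scoring each label.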
Conclusion
• Interactive data analysis is really easy with Apache Flink
• Apache Zeppelin is a great interactive notebook
• Zeppelin and Flink play well together to solve machine learning tasks and more
flink.apache.org @ApacheFlink