deep dive of flink & spark on amazon emr - february online tech talks
TRANSCRIPT
Real-time Stream Processing on EMR: Apache Flink vs Apache Spark Streaming
Keith Steward, Ph.D.
Specialist (EMR) Solution Architect
AWS
What we’ll cover:1. The need for real-time stream processing, and challenges in
accomplishing it
2. Flink stream processor (versus Spark Streaming):• What are its aims?
• How does it address real-time stream processing challenges?
• How does it differ from Spark Streaming?
• Real-world Flink examples
• When to use Flink vs Spark Streaming?
3. Flink Demo: How to deploy & run a Flink stream processing architecture in AWS?
The Need for Real-Time Stream Processing
Increasingly, data is arriving as continuous flows of events:
• cars in motion emitting GPS signals • financial transactions • interchange of signals between cell phone towers
and people busy with their smartphones • web traffic • machine logs • measurements from industrial sensors and wearable
devices
Streaming data is a better fit for the way we live.
Challenges in Processing Streams:
• Event time (rather than data processing time); out of order events
• Consistency, fault tolerance, and high availability
• Rich forms of window queries; real-time alerts
• Low latency and high throughput
Hot
Reference architecture
COLLECT STORE CONSUMEPROCESS / ANALYZEETL
Traditional Big Data Pipeline (without Stream Processing)
Reference architecture
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Elasticsearch Service
Apache Kafka
Amazon SQS
Amazon KinesisStreams
Amazon Kinesis Firehose
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
Amazon DynamoDB Streams
Hot
Hot
War
m
File
Mes
sage
Stre
amSe
arch
S
QL
N
oSQ
L
Cach
e
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Reference architecture
ETL
Amazon SQS apps
Streaming
Amazon Kinesis Analytics
KCLapps
AWS Lambda
Amazon Redshift
Amazon Machine Learning
Presto
AmazonEMR
Fast
Slow
Fast
Batc
hM
essa
geIn
tera
ctiv
eSt
ream
ML
Amazon EC2
Amazon EC2
Amazon EMR
Amazon QuickSight
Apps & Services
Anal
ysis
& v
isua
lizat
ion
Not
eboo
ks
IDE
API
Appl
icat
ions
Mobile apps
Web apps
Devices
MessagingMessage
Sensors & IoT platforms
Data centers
AWS Import/ExportSnowball
Logging
Amazon CloudWatch
AWS CloudTrail
Logg
ing
IoT
Tran
spor
tM
essa
ging
AWS Direct Connect
AWS IoT
Reference architecture
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Elasticsearch Service
Apache Kafka
Amazon KinesisStreams
Amazon DynamoDB
Amazon ElastiCache
Amazon DynamoDB Streams
Hot
Hot
War
m
Stre
amSe
arch
NoS
QL
Cac
he
RECORDS
STREAMS
Reference architecture
ETL
Streaming
Amazon Kinesis Analytics
KCLapps
AWS Lambda
Fast
Batc
hIn
tera
ctiv
eSt
ream
ML
Amazon EC2
Amazon EMR
Apps & Services
Anal
ysis
& v
isua
lizat
ion
Not
eboo
ks
IDE
API
Appl
icat
ions
Mobile apps
Web apps
Devices
Message
Sensors & IoT platforms
Data centers
Logging
Amazon CloudWatch
AWS CloudTrail
Logg
ing
IoT
Tran
spor
t
AWS Direct Connect
AWS IoT
Stream ProcessingReference architecture
Fast
“Apache Flink is an open source platform for distributed stream and batch data processing.”
Flink is a Stream-First Architecture, that happens to also do batch processing as special case of bounded stream processing.
Apache Flink
Flink Handles Challenging Scenarios:
1. Application code upgrades
2. Flink version upgrades
3. Maintenance and migration
4. What-if simulations (reinstatements)
5. A/B testing
Flink addresses all Streaming Challenges Simultaneously
Flink Performance Tests
• Amazon Kinesis Streams
• Apache Kafka
• Elasticsearch
• Twitter Streaming API
• Cassandra
There are connectors for third-party data sources:
Where is Flink being used in production?
When to use Flink vs Spark Streaming?
Flink might be best when:• workload demands true real-time
stream processing performance with low latency, high throughput, and fault-tolerance.
• not yet heavily invested in Spark Streaming (existing systems, staff training/experience)
• want convenience of replaying and reprocessing streams after code/system changes
Spark Streaming might be best when:• Primarily do batch processing• Already invested in Spark Streaming
(existing deployments, staff)• The micro-batching is acceptable for
your workload• Need to code in Python or R• Want to “wait and see” how Flink
matures before adopting.
How to deploy & run Flink in AWS ?
Flink Demo: Analyzing NYC Taxi Rides in Real Time
Demo Event Processing Architecture
“Replayable Log”
Stream Processing Visualization
Demo Event Processing Architecture
Amazon Kinesis
Amazon EMR
+
EC2 instance(bastion host)
Amazon Elasticsearch
Service
Real-time streamingHigh throughput; elasticKeeps a ‘replayable log’ of your eventsEasy to useS3, Redshift, DynamoDB Integrations
Amazon Kinesis
Dynamically Scalable transient or persistent Hadoop clusters as a serviceHadoop, Hive, Spark, Presto, Hbase, … (17 applications)
Easy to use; fully managedOn demand, reserved, spot pricingHDFS, S3, and Amazon EBS filesystemsEnd to end security: access controls, firewalls, encryption
Amazon EMR
Provisions & maintains an Elasticsearch cluster (distributed index)Complete ELK stack, including KibanaFully managed service; zero adminHighly available & reliableScalable
{show demo!}
1. For predominantly stream-processing workloads but also for batch processing, Apache Flink has much to offer:• Simultaneously addresses high throughput, low latency,
and fault-tolerance• Do both stream processing & batch processing with a
single technology• Powerful windowing functions• Convenient capabilities to pause, restart, and change
Flink applications without data loss.
2. Flink is still new and adoption is not as far advanced as Spark Streaming.
3. AWS makes it easy to run streaming workloads with Amazon Kinesis and either Spark Streaming or Flink running on EMR clusters.
Additional Details
Approximate Event Time – Kinesis details• Each Amazon Kinesis record includes an
ApproximateArrivalTimestamp
• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record
• By default the event time of Flink uses this timestamp when reading from a Kinesis stream
StreamExecutionEnvironment env =StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
Event Time and Watermarks• With event time the time of an event is determined by the
producer
• Flink measures progress in event time by means of Watermarks
• Watermarks must be ingested to each individual Kinesis shard
DataStream<Event> kinesis = env.addSource(new FlinkKinesisConsumer<>(...)).assignTimestampsAndWatermarks(new
PunctuatedAssigner())
Data Encryption with Amazon EMR and FlinkSecurity configuration supports encryption
• for data stored within the file system
• Hadoop Distributed File System (HDFS) block-transfer and RPC
• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
• Local disk (except boot volumes)
• In-transit data (no Flink support yet)
env.readTextFile("s3://...")env.setStateBackend(new FsStateBackend("hdfs://..."))
Connecting to the Flink Dashboard• Use dynamic port forwarding to the Master node
ssh -D 8157 hadoop@...
• Use FoxyProxy to redirect URLs to localhost
*ec2*.amazonaws.com*
*.compute.internal*
• Navigate to the YARN Resource Manager and select the Tracking UI
Starting Flink and Submitting JobsUse steps to interact with Flink through the AWS API
Extending Flink Functionality• Flink Elasticsearch sink merely supports TCP transport
• A custom Elasticsearch sink with HTTP support requires only a few dozens lines of code using• Jest (io.searchbox)
• aws-signing-request-interceptor (vc.inreach.aws)
Amazon SQS apps
Streaming
Amazon Kinesis Analytics
KCLapps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Machine Learning
Presto
AmazonEMR
Amazon Elasticsearch Service
Apache Kafka
Amazon SQS
Amazon KinesisStreams
Amazon Kinesis Firehose
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
Amazon DynamoDB Streams
Hot
Hot
War
m
Fast
Slow
Fast
Batc
hM
essa
geIn
tera
ctiv
eSt
ream
ML
Sear
ch
SQ
L
NoS
QL
Ca
che
File
Mes
sage
Stre
am
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
MessagingMessage
Sensors & IoT platforms
AWS IoT
Data centersAWS Direct
Connect
AWS Import/ExportSnowball
Logging
Amazon CloudWatch
AWS CloudTrail
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Amazon QuickSight
Apps & Services
Anal
ysis
& v
isua
lizat
ion
Not
eboo
ks
IDE
API
Reference architecture
Logg
ing
IoT
Appl
icat
ions
Tran
spor
tM
essa
ging
ETL
Amazon EMR