ml through streaming at - qconlondon.com · qcon london 2020 sherin thomas @doodlesmt. stopping a...

60
ML through Streaming at QCON LONDON 2020 Sherin Thomas @doodlesmt

Upload: others

Post on 28-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

ML through Streaming at QCON LONDON 2020

Sherin Thomas@doodlesmt

Page 2: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing
Page 3: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Stopping a Phishing Attack

Page 4: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Hello Alex, I’m Tracy calling from Lyft HQ. This month we’re awarding $200 to all 4.7+ star drivers. Congratulations!

Hey Tracy, thanks!

Np! And because we see that you’re in a ride, we’ll dispatch another driver so you can park at a safe location….

….Alright your passenger will be taken care of by another driver

Page 5: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Before we can credit you the award, we just need to quickly verify your identity.

We’ll now send you a verification text. Can you please tell us what those numbers are…...

12345

Page 6: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Fingerprinting Fraudulent Behaviour

Page 7: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Request Ride

...

Driver Contact

Cancel Ride

…..

Something

Sequence of User Actions

Reference: Fingerprinting Fraudulent Behaviour

Page 8: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Reference: Fingerprinting Fraudulent Behaviour

Red Flag

Request Ride

...

Driver Contact

Cancel Ride

…..

Something

Sequence of User Actions

Page 9: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

SELECTuser_id,TOP(2056, action) OVER ( PARTITION BY user_id ORDER BY event_time RANGE INTERVAL ‘90’ DAYS PRECEDING) AS client_action_sequence

FROM event_user_action

Temporally ordered user action sequence

Page 10: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

SELECTuser_id,TOP(2056, action) OVER ( PARTITION BY user_id ORDER BY event_time RANGE INTERVAL ‘90’ DAYS PRECEDING) AS client_action_sequence

FROM event_user_action

Last x events sorted by time

Temporally ordered user action sequence

Page 11: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

SELECTuser_id,TOP(2056, action) OVER ( PARTITION BY user_id ORDER BY event_time RANGE INTERVAL ‘90’ DAYS PRECEDING) AS client_action_sequence

FROM event_user_action

Historic context is also important (large lookback)

Temporally ordered user action sequence

Page 12: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

SELECTuser_id,TOP(2056, action) OVER ( PARTITION BY user_id ORDER BY event_time RANGE INTERVAL ‘90’ DAYS PRECEDING) AS client_action_sequence

FROM event_user_action

Event time processing

Temporally ordered user action sequence

Page 13: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Make streaming features accessible for ML use cases

Page 14: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing
Page 15: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Flink

Page 16: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

● Low latency stateful operations on streaming data - in the order or

milliseconds

● Event time processing - replayability, correctness

● Exactly once processing

● Failure recovery

● SQL Api

Apache Flink

Page 17: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Event Ingestion Pipeline

HDFS

S3

Event Ingestion Pipeline

KinesisKinesisKinesisKinesisFilters

(Offline/Batch)

(Streaming)

{ “ride_req”, “user_id”: 123, “event_time”: t0}

Page 18: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Credit: The Beam Model by Tyler Akidau and Frances Perry

Processing Time vs Event Time

Processing time

System time when the event is processed -> determined by processor

Event time

Logical time when the event occurred -> part of event metadata

Page 19: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

episodeIV

1977 1980 1983 1999 2002 2005 2015 2016 2018 2019

episodeV

episodeVI

episodeI

episodeII

episodeIII

episodevii

ROGUEONEIII.5

episodeviii

episodeIX

Event Time

Processing Time

Page 20: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Example: integer sum over 2 min window

Credit: The Beam Model by Tyler Akidau and Frances Perry

Page 21: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Watermark

12:09 12:08 12:03 12:05 12:04 12:01 12:02

W = 12:05 W = 12:02W = 12:10

Page 22: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Example: integer sum over 2 min window

Credit: The Beam Model by Tyler Akidau and Frances Perry

Page 23: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Example: integer sum over 2 min window

Credit: The Beam Model by Tyler Akidau and Frances Perry

Page 24: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Usability

Page 25: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

1 Model Development

3 Data Quality

5 Compute Resources

2Feature Engineering

4Scheduling, Execution,

Data Collection

What Data Scientists care about

Page 26: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Data Input Data Prep Modeling Deployment

DATA DISCOVERY

NORMALIZE AND CLEAN UP DATA

EXTRACT & TRANSFORM

FEATURES

LABEL DATA

MAINTAIN EXTERNAL

FEATURE SETS

TRAIN MODELS

EVALUATE AND OPTIMIZE

DEPLOY

MONITOR & VISUALIZE

PERFORMANCE

ML Workflow

Page 27: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Data Input Data Prep Modeling Deployment

DATA DISCOVERY

NORMALIZE AND CLEAN UP DATA

EXTRACT & TRANSFORM

FEATURES

LABEL DATA

MAINTAIN EXTERNAL

FEATURE SETS

TRAIN MODELS

EVALUATE AND OPTIMIZE

DEPLOY

MONITOR & VISUALIZE

PERFORMANCE

ML Workflow

Page 28: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

User Plane

Dryft UI

Data Plane

Kafka

DynamoDB

Druid

Hive

Control Plane

Query Analysis

Job Cluster

Data Discovery

Dryft! - Self Service Streaming Framework

Elastic Search

Page 29: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Declarative Job Definition

{ “retention”: {}, “lookback”: {}, “stream”: { “kinesis”: user_activities }, “features”: { “user_activity_per_geohash”: { “type”: “int” “version”: 1, “description”: “user activities per geohash” } }}

Job Config

SELECTgeohash,COUNT(*) AS total_events,TUMBLE_END( rowtime, INTERVAL ‘1’ hour)

FROM event_user_actionGROUP BY TUMBLE(

rowtime,INTERVAL ‘1’ hour

)

Flink SQL

Page 30: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Feature FanoutFeature Fanout

User Apps

Kinesis

Sources

Kinesis

S3

Kinesis

Sinks

DynamoDB

Hive

Page 31: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Eating our own dogfood

Page 32: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

SELECT -- this will be used in keyBy CONCAT_WS('_', feature_name, version, id), feature_data, CONCAT_WS('_', feature_name, version) AS feature_definition, occurred_atFROM features

Feature Fanout App - also uses Dryft

{ “stream”: { “kinesis”: feature_stream }, “sink”: { “feature_service_dynamodb”: { “write_rate”: 1000, “retry_count”: 5 } }}

Page 34: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Deployment

Page 35: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Previously...

● Ran on AWS EC2 using custom deployment

● Separate autoscaling groups for JobManager and

Taskmanagers

● Instance provisioning done during deployment

● Multiple jobs(60+) running on the same cluster

Page 36: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Multi tenancy hell!!

Page 37: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Kubernetes Based Deployment

Managing Flink on Kubernetes

TM TM TM

JM

TM TM TM

TM TM TM

JM

TM TM TM

TM

JM

App 1 App 2 App 3

Page 38: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Flink-K8s-Operator

Managing Flink on Kubernetes

CustomResourceDescriptor

Flink Operator

TM TM TM

TM TM TM

JM

Page 39: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Custom Resource Descriptor

apiVersion: flink.k8s.io/v1alphakind: FlinkApplicationmetadata: name: flink-speeds-working-stats namespace: flinkspec: image: ‘100,dkr.ecr.us-east-1.amazonaws.com/abc:xyz’ flinkJob: jarName: name.jar parallelism: 10taskManagerConfig: { resources: { limits: {

memory: 15Gi,cpu: 4

} }, replicas: num_task_managers, taskSlots: NUM_SLOTS_PER_TASK_MANAGER, envConfig: {...},}

● Custom resource represents Flink application

● Docker Image containsall dependencies

● CRD modifications trigger update (includes parallelism and other Flink configuration properties)

Page 40: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

ValidateCompute Resources

Generate CRD

Dryft Conf---------------------------

Flink Operator

TM TM TM

TM TM TM

JM

KubernetesCRD

Page 41: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Flink on Kubernetes

Managing Flink on Kubernetes - by Anand and Ketan

● Separate Flink cluster for each application

● Resource allocation customized per job - at job creation time

● Scales to 100s of Flink applications

● Automatic application updates

Page 42: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Bootstrapping

Page 43: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

SELECT passenger_id, COUNT(ride_id)FROM event_ride_completedGROUP BY passenger_id, HOP(rowtime, INTERVAL ‘30’ DAY, INTERVAL ‘1’ HOUR)

What is bootstrapping?

Page 44: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

-6 -3 -4 -2-5-7 645321-1

Current TimeHistoric Data Future Data

Read historic data to ‘bootstrap’ the program with 30 days worth of data. Now your program returns results on day 1. But what if the source does not have all 30 days worth of data?

Bootstrap with historic data

Page 45: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Read historic data from persistent store(AWS S3) and streaming data from Kafka/Kinesis

Solution - Consume from two sources

Bootstrapping state in Apache Flink - Hadoop Summit

(historic)

(real-time)

Business

LogicSink

< Target Time

>= Target Time

Page 46: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing
Page 47: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Job starts

Page 48: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Bootstrapping over

Page 49: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Detect Bootstrap CompletionJob sends a signal to the control plane once watermark has progressed beyond a point where we no longer need historic data

“Update” Job with lower parallelism but same job graphControl plane cancels job with savepoint and starts it again from savepoint but with a much lower parallelism

Start JobWith a higher parallelism for fast bootstrapping

Page 50: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Output volume spike during bootstrapping

Bootstrapping

Page 51: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Output volume spike during bootstrapping

● Features need to be fresh but eventually complete

● Smooth out data writes during bootstrap to match throughput

● Write features produced during bootstrapping separately

Low Priority Kinesis Stream

High Priority Kinesis Stream

bootstrap

steady state

Idempotent Sink

Page 52: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

What about skew between historic and real-time data?

Page 53: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Skew

Watermark =

Kinesis

Page 54: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Solution: Source synchronization

partition 1

partition 2

consumer

partition 3

partition 4

consumer

global watermark

global watermark

global watermark

shared state

FLINK-10887, FLINK-10921, FLIP-27

Page 55: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Now...

Page 56: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

● 120+ features

● Features available in DynamoDB(real time point lookup), Hive(offline analysis),

Druid(real time analysis) and more…

● Time to write, test and deploy a feature is < 1/2 day

● p99 latency <5 seconds

● Coming Up - Python Support!

Page 57: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Thank you!

Sherin Thomas@doodlesmt

Page 58: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Later

Page 59: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

Backfill

Real-time Scoring DataLive Data

Lorem 3Lorem 1 Training DataHistoric Data

● What if one implementation could provide the training time and scoring time feature values?○ Batch processing mode to backfill historic values for training○ Stream processing mode to generate values in real-time for model scoring

● Enable delivery of consistent features between training and scoring

Page 60: ML through Streaming at - qconlondon.com · QCON LONDON 2020 Sherin Thomas @doodlesmt. Stopping a Phishing Attack. Hello Alex, I’m Tracy calling from Lyft ... Event time processing

● Green/Blue deploy - zero downtime deploys

● “Auto” scaling of Flink cluster and/or job parallelism

● Feature library