doing it live - scale 15x... · elastic search generic, sharded & redundant search engine...

25
Doing It Live Machine learning at scale, and in an instant

Upload: nguyenkhuong

Post on 17-Mar-2018

218 views

Category:

Documents


3 download

TRANSCRIPT

Doing It Live Machine learning at scale, and in an instant

Christopher Smith

● VP of Engineering, Data Science

[email protected]

● FanGraph - Single view of the fan

3

Visible Impact

Sell fun at scale

Sell out a stadium …in minutes

▪Massive Traffic Spikes

▪Real Time Data at Scale

▪Separating Fans from Bots

Data Science?

Doing it live 6

7

8 Data Sets

Engage-

ment

Entitle-

ments

Access

Scans 10 Inventory

Avails

20 Inventory

Avails Payment

Member

Account

In

Venue Social

Event

Meta

Settle-

ment

BC Distributed

Commerce

Recos Abuse

Prevention

Resale

Fraud

Marketing &

Sponsorship

Winback Customer

Service

Analytics Verified Fan

Data From Everywhere… to Everywhere

9

Core Principles

• Source systems have one job: publish mutations/data

• Consumers don’t impact producers

• Process data one tuple at a time, in a FP-like fashion

• Data is available everywhere

• Data is always archived

• Organize data per project’s needs

10

Core Components

• Kafka: reliable, scalable data transit – one cluster per DC

• CKAN: data discovery

• Avro: efficient, extensible serialization w/schema

• Secor: archiving

• Storm/Trident: complex lambda processing

• Vowpal Wabbit: fast, online machine learning

• ElasticSearch: flexible document search

11

CKAN

12

Kafka

13

Avro

14

Secor

• Highly scalable

• Aggregate Kafka topic data by time

• Flexible output formats (sequence files, parquet, csv)

• Compression

• Store in s3 (cheap)

15

Storm/Trident

16

Storm/Trident

public class MyFunction extends BaseFunction {

public void execute(TridentTuple tuple, TridentCollector collector) { … }

}

public class MyFilter extends BaseFilter {

public boolean isKeep(TridentTuple tuple) { … }

}

17

Storm/Trident w/Kafka

18

Vowpal Wabbit

• Online/out-of-core learning

• Extraordinarily fast – handles large sparse feature spaces

• Broad selection of algorithms

• Scalable (all reduce)

• Active & reinforcement learning

• Contextual bandit FTW!

19

Elastic Search

• Generic, sharded & redundant search engine

• Real-time indexing

• Distinct master, data, ingest, coordination node roles

• Handles semi-structured data

• Kibana: secret weapon

20

Kibana

21

Abuse Prevention

22

Example - Analytics

FanGraph

Auto Delist on Entry

24

Lambda Architecture FTW!

• Reusuable functions

• Dev team autonomy/loose coupling

• Simplifies handling failure

• Simplifies scaling

• Complex computation with low latency

• Even less computationally intensive tasks become easier