Operating Samza at Skyscanner

Stream-processing with Samza

Upload: joseph-francis

Post on 19-Jan-2017


Agenda

01 Introduction
02 Unified log in Skyscanner
03 Basics of Samza
04 Use cases for stream processing in Skyscanner
05 Deployment & Local development environment
06 Monitoring
07 Future

Introduction


• Skyscanner is a travel search company with over 50 million unique monthly visitors (UMVs) and over 700 employees globally.

• Joseph Francis, Senior Software Engineer at Skyscanner

• Some use cases in Skyscanner

• Making Samza jobs easily deployable and operable in a multi-tenant cluster

Unified log in Skyscanner

Past

• One (big) monolithic SQL database for reporting and monitoring

• A central team delivered the data needs of the organization

• Had not yet jumped on the bandwagon of large-scale batch processing

Unified Log & Eco-system

Basics of Samza

Key Points

• Samza consumes one message at a time, with an at-least-once delivery guarantee

• Single thread of execution

• API offers init(), process() and window() methods

• State management with embedded key-value store
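The key points above can be modelled in a few lines. This is a toy Python sketch of the lifecycle, not the Samza API (Samza tasks are written against its Java interfaces). The class `CountingTask`, the driver `run` and the plain dict standing in for the embedded key-value store are all hypothetical names for illustration, and `window()` is triggered every N messages here only to keep the example deterministic, whereas a real job would trigger it on a time interval.

```python
class CountingTask:
    """Toy model of a stateful task with init/process/window hooks."""

    def init(self, store):
        # Called once before any message is processed. `store` stands in
        # for the embedded key-value store (here just a dict).
        self.store = store
        self.seen_since_window = 0

    def process(self, key, value):
        # Called with exactly one message at a time.
        self.store[key] = self.store.get(key, 0) + 1
        self.seen_since_window += 1

    def window(self):
        # Called periodically; reports and resets a per-window counter.
        emitted = self.seen_since_window
        self.seen_since_window = 0
        return emitted


def run(task, messages, window_every):
    """Single-threaded driver: feeds one message at a time to the task
    and calls window() after every `window_every` messages."""
    store = {}
    task.init(store)
    emitted = []
    for i, (key, value) in enumerate(messages, start=1):
        task.process(key, value)
        if i % window_every == 0:
            emitted.append(task.window())
    return store, emitted
```

Because delivery is at-least-once, a replayed message would be counted twice by this sketch; a job that needs exact counts has to make `process()` idempotent, for example by remembering message ids in the store.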

Configuration

Samza Job with State

Use cases in Skyscanner

Use Cases

• Building a user timeline

• Data enrichment downstream

• Stream join and windowed aggregations
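A stream join of the kind listed above can be sketched as follows, assuming two streams that share a key and that each key appears once per stream. The stream names `searches` and `clicks` and the function `join_streams` are hypothetical, and the in-memory dicts stand in for whatever local state a real job would keep. Whichever side arrives first is buffered until its partner shows up.

```python
_MISSING = object()  # sentinel so that a payload of None is still joinable


def join_streams(events):
    """Join two interleaved streams on their key.

    events: iterable of (stream_name, key, payload) in arrival order,
    where stream_name is "searches" or "clicks" (hypothetical names).
    Returns (joined, pending): the joined tuples and the unmatched buffer.
    """
    pending = {"searches": {}, "clicks": {}}
    other = {"searches": "clicks", "clicks": "searches"}
    joined = []
    for stream, key, payload in events:
        # Look for a buffered message with the same key on the other side.
        match = pending[other[stream]].pop(key, _MISSING)
        if match is _MISSING:
            # Nothing to join with yet: buffer this message.
            pending[stream][key] = payload
        elif stream == "searches":
            joined.append((key, payload, match))
        else:
            joined.append((key, match, payload))
    return joined, pending
```

In this sketch unmatched messages stay in `pending` forever; a long-running job would also need to expire old entries, for instance from a periodic window callback.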

Use Cases

• Indicative pricing for car hire users

Use Cases

• Real-time metrics computation off streams
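One common way to compute metrics off a stream is a tumbling-window count. The sketch below is an illustration under assumptions, not the job described in the talk: the function `tumbling_counts` and the event shape `(timestamp_seconds, metric_name)` are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import Counter


def tumbling_counts(events, window_seconds):
    """Count events per metric name in fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, metric_name), assumed to be
    in timestamp order. Returns a list of (window_start, counts_dict).
    """
    results = []
    current_start = None
    counts = Counter()
    for ts, name in events:
        # Align the timestamp to the start of its window.
        start = ts - (ts % window_seconds)
        if current_start is not None and start != current_start:
            # The window has rolled over: emit it and start a new one.
            results.append((current_start, dict(counts)))
            counts = Counter()
        current_start = start
        counts[name] += 1
    if current_start is not None:
        # Emit the final, still-open window.
        results.append((current_start, dict(counts)))
    return results
```

Windows that receive no events are simply absent from the result, so a consumer that needs explicit zeros would have to fill the gaps itself.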

Deployment & Local Development

Current Deployment Pipelines

Current Deployment

• No centralised configuration

• Restrictive source folder structure

• Ansible deployment scripts were embedded in each Samza job

Local Development Environment

New Deployment Configuration

Centralised global config

Drone Plugin

Drone reads the .drone.yml file

Reduced per-environment configuration

Monitoring

Metrics Pipeline

Job Metrics

Job Alerts

Application Logs

• Application logs are forwarded to Elasticsearch through Logstash

• Requires a shared format for logging (log4j.xml)

• The YARN UI is not the most intuitive!

Future


• More generic jobs

• Developers should only worry about writing code

• Fully automated production deployment

• Cross the boundaries of Batch vs Streaming?

Question Time

Questions?


Thank you