Building Data Pipelines in Python Marco Bonzanini QCon London 2017


Building Data Pipelines in Python

Marco Bonzanini

QCon London 2017


Nice to meet you


R&D ≠ Engineering


R&D ≠ Engineering

R&D results in production = high value


Big Data Problems vs Big Data Problems


Data Pipelines (from 30,000ft)

Data → ETL → Analytics


Data Pipelines (zooming in)

ETL:
• Extract
• Transform
• Load

Transform:
• Clean
• Augment
• Join


Good Data Pipelines

Easy to:
• Reproduce
• Productise


Towards Good Data Pipelines


Towards Good Data Pipelines (a)

Your Data is Dirty

unless proven otherwise

“It’s in the database, so it’s already good”
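One way to act on "dirty unless proven otherwise" is to validate every record at the start of the pipeline. The sketch below is only an illustration: the `user_id` and `age` fields and the accepted age range are assumptions, not taken from the talk. Rejected rows are set aside rather than dropped, so nothing is lost.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def is_clean(record: Record) -> bool:
    """Return True only if the (hypothetical) fields pass basic sanity checks."""
    user_id = record.get("user_id")
    age = record.get("age")
    if not isinstance(user_id, str) or not user_id.strip():
        return False
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        return False
    return 0 <= age <= 130


def split_clean_dirty(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Partition records into (clean, dirty) without discarding anything."""
    clean: List[Record] = []
    dirty: List[Record] = []
    for record in records:
        (clean if is_clean(record) else dirty).append(record)
    return clean, dirty


rows = [
    {"user_id": "u1", "age": 34},
    {"user_id": "", "age": 27},      # empty identifier
    {"user_id": "u3", "age": -5},    # impossible age
    {"user_id": "u4", "age": "41"},  # number stored as text
]
clean_rows, dirty_rows = split_clean_dirty(rows)
print(len(clean_rows), "clean,", len(dirty_rows), "dirty")  # prints: 1 clean, 3 dirty
```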


Towards Good Data Pipelines (b)

All Your Data is Important

unless proven otherwise


Keep it. Transform it. Don’t overwrite it.
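A minimal sketch of "keep it, transform it, don't overwrite it", assuming a small CSV input: the raw file is read-only as far as the pipeline is concerned, and each transformation writes a new file. The file names, directory layout and the whitespace-stripping step are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def transform_to_new_file(raw_path: Path, out_dir: Path) -> Path:
    """Write a cleaned copy into out_dir; the raw file is never modified."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (raw_path.stem + ".clean.csv")
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example transformation: strip whitespace from every value.
            writer.writerow({key: value.strip() for key, value in row.items()})
    return out_path


# Demo in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw" / "users.csv"
raw_file.parent.mkdir(parents=True)
raw_file.write_text("name,city\n Ada , London \nLinus,Helsinki\n")
raw_before = raw_file.read_text()

clean_file = transform_to_new_file(raw_file, workdir / "derived")
```

Because the raw file survives untouched, a bug in the transformation can be fixed and the step re-run without going back to the original source.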


Towards Good Data Pipelines (c)

Pipelines vs Script Soups


Tasty, but not a pipeline

Pic: Romanian potato soup from Wikipedia


$ ./do_something.sh

$ ./do_something_else.sh

$ ./extract_some_data.sh

$ ./join_some_other_data.sh

...

Anti-pattern: the script soup


Script soups kill replicability


$ cat ./run_everything.sh
./do_something.sh
./do_something_else.sh
./extract_some_data.sh
./join_some_other_data.sh

$ ./run_everything.sh

Anti-pattern: the master script


Towards Good Data Pipelines (d)

Break it Down

setup.py and conda
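Before anything can be packaged with setup.py, the shell scripts have to become importable code. The sketch below shows one possible shape for that, assuming a hypothetical module such as `mylib/steps.py`; the function names and the `key=value` input format are illustrative only.

```python
"""Hypothetical module layout: mylib/steps.py (all names are illustrative)."""
from typing import Dict, Iterable, List


def extract(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse 'key=value' lines into records, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            key, _, value = line.partition("=")
            records.append({"key": key, "value": value})
    return records


def transform(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return new records with normalised keys; the input is left untouched."""
    return [{"key": r["key"].lower(), "value": r["value"]} for r in records]


def run(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Single entry point that composes the steps in an explicit order."""
    return transform(extract(lines))


result = run(["Colour=red", "", "SIZE=10"])
```

Each step is now a plain function that can be imported, versioned and unit-tested on its own, which a chain of shell scripts does not allow.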


Towards Good Data Pipelines (e)

Automated Testing

i.e. why scientists don’t write unit tests


Intermezzo

Let me rant about testing

Icon by Freepik from flaticon.com


(Unit) Testing

Unit tests in three easy steps:

• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
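The first two steps can be sketched with the standard library alone. The function under test, `normalise_name`, is a hypothetical cleaning step invented for this example; the suite is run programmatically here, although `python -m unittest` from the command line is the more usual route.

```python
import unittest


def normalise_name(name: str) -> str:
    """Hypothetical cleaning step: collapse whitespace and title-case a name."""
    return " ".join(name.split()).title()


class NormaliseNameTest(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(normalise_name("  ada   lovelace "), "Ada Lovelace")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise_name(""), "")


def run_tests() -> unittest.TestResult:
    """Load and run the test case without calling sys.exit()."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseNameTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


outcome = run_tests()
```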


Benefits of (unit) testing

• Safety net for refactoring

• Safety net for lib upgrades

• Validate your assumptions

• Document code / communicate your intentions

• You’re forced to think


Testing: not convinced yet?


f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
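The slide does not show `fscore` itself. Assuming it is the usual F1 score (the harmonic mean of precision and recall), the bound check above can be exercised over a grid of inputs as sketched below. The small tolerance is an addition of this sketch: it absorbs floating-point rounding, which can otherwise push the result a hair past the bound when `p == r`.

```python
import itertools


def fscore(p: float, r: float) -> float:
    """F1 as the harmonic mean of precision and recall (assumed definition)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


def check_f1_bounds(p: float, r: float, tol: float = 1e-12) -> bool:
    """The harmonic mean should lie between its two inputs.

    The tolerance absorbs floating-point rounding, e.g. when p == r.
    """
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    return min_bound - tol <= f1 <= max_bound + tol


# Check every combination of precision and recall in steps of 0.1.
grid = [i / 10 for i in range(11)]
all_within_bounds = all(
    check_f1_bounds(p, r) for p, r in itertools.product(grid, grid)
)
```

A check like this tests a property of the result rather than a single hand-computed value, which is often easier to state for numerical code.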


Testing: I’m almost done

• Unit tests vs Defensive Programming

• Say no to tautologies

• Say no to vanity tests

• The Python ecosystem is rich: py.test, nosetests, hypothesis, coverage.py, …


</rant>


Towards Good Data Pipelines (f)

Orchestration

Don’t re-invent the wheel


You need a workflow manager

Think: GNU Make + Unix pipes + Steroids
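To make the "GNU Make" analogy concrete, here is a toy, standard-library-only sketch of the core idea: a task declares what it requires and what it produces, and work whose output already exists is skipped. It is not how any real workflow manager is implemented, and `ToyTask` and `build` are names invented for this example. Unlike Make, it checks only that the output exists, not its timestamp.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


class ToyTask:
    """A task counts as complete when its output file exists."""

    def __init__(self, output: Path, action: Callable[["ToyTask"], None],
                 requires: Sequence["ToyTask"] = ()) -> None:
        self.output = output
        self.action = action
        self.requires = list(requires)

    def complete(self) -> bool:
        return self.output.exists()


def build(task: ToyTask, log: List[str]) -> None:
    """Run dependencies first, then the task; skip anything already complete."""
    if task.complete():
        log.append("skip " + task.output.name)
        return
    for dependency in task.requires:
        build(dependency, log)
    task.action(task)
    log.append("run " + task.output.name)


workdir = Path(tempfile.mkdtemp())
extract = ToyTask(workdir / "raw.txt", lambda t: t.output.write_text("3\n1\n2\n"))
sort_task = ToyTask(
    workdir / "sorted.txt",
    lambda t: t.output.write_text(
        "".join(sorted(extract.output.read_text().splitlines(keepends=True)))
    ),
    requires=[extract],
)

first_run: List[str] = []
build(sort_task, first_run)    # runs both steps, upstream first
second_run: List[str] = []
build(sort_task, second_run)   # nothing to do: the output already exists
```

A real workflow manager adds the "steroids": scheduling, retries, parallel workers, notifications and a visual dependency graph.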


Intro to Luigi

• Task dependency management

• Error control, checkpoints, failure recovery

• Minimal boilerplate

• Dependency graph visualisation

$ pip install luigi


Luigi Task: unit of execution

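import luigi

class MyTask(luigi.Task):

    def requires(self):
        return [SomeTask()]  # tasks that must complete first

    def output(self):
        return luigi.LocalTarget(…)  # where the result is written

    def run(self):
        mylib.run()  # the actual work lives in your own library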


Luigi Target: output of a task

class MyTarget(luigi.Target):
    def exists(self):
        ...  # return bool

Great off-the-shelf support: local file system, S3, Elasticsearch, RDBMS

(also via luigi.contrib)


Intro to Airflow

• Like Luigi, just younger

• Nicer (?) GUI

• Scheduling

• Apache Project


Towards Good Data Pipelines (g)

When things go wrong

The Joy of debugging


import logging
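A minimal sketch of what that import buys you, using only the standard library. The logger name `pipeline.load` and the `load_rows` step are hypothetical; the output goes to an in-memory buffer here so the example is self-contained, whereas a real pipeline would log to stderr or a file.

```python
import io
import logging


def make_logger(name: str, stream) -> logging.Logger:
    """Create a logger with timestamped, levelled messages on the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def load_rows(rows, logger: logging.Logger):
    """Hypothetical pipeline step that logs what it keeps and what it rejects."""
    kept = []
    for index, row in enumerate(rows):
        if row is None:
            logger.warning("row %d is empty, skipping", index)
            continue
        kept.append(row)
    logger.info("loaded %d of %d rows", len(kept), len(rows))
    return kept


buffer = io.StringIO()   # stand-in for stderr or a log file
log = make_logger("pipeline.load", buffer)
kept_rows = load_rows(["a", None, "b"], log)
log_text = buffer.getvalue()
```

Compared with bare `print` calls, each message carries a timestamp, a source and a severity level, so the output can be filtered after the fact.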


Who reads the logs?

You’re not going to read the logs, unless…

• E-mail notifications (built into Luigi)
• Slack notifications

$ pip install luigi_slack # WIP


Towards Good Data Pipelines (h)

Static Analysis

The Joy of Duck Typing


If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.

— somebody on the Web


>>> 1.0 == 1 == True
True
>>> 1 + True
2


>>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly


def do_stuff(a: int, b: int) -> str:
    ...
    return something

PEP 3107 — Function Annotations (since Python 3.0)

(annotations are ignored by the interpreter)
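A small demonstration of what "ignored by the interpreter" means in practice. The body of `do_stuff` is filled in here purely for illustration (the slide elides it): the annotations are stored on the function object, but nothing checks them when the function is called.

```python
def do_stuff(a: int, b: int) -> str:
    """Annotated as (int, int) -> str, but nothing enforces this at runtime."""
    return a + b


# The annotations are stored as metadata on the function object...
annotations = do_stuff.__annotations__

# ...but the interpreter does not check them: strings are accepted for `a`
# and `b`, and an int comes back despite the declared `str` return type.
joined = do_stuff("1", "2")
summed = do_stuff(1, 2)
```

Both calls run without error, which is why a separate static checker is needed to get any value from the annotations.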


typing module: semantically coherent

PEP 484 — Type Hints (since Python 3.5)

(still ignored by the interpreter)


pip install mypy


• Add optional types
• Run: mypy --follow-imports silent mylib
• Refine gradual typing (e.g. Any)
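The steps above can be sketched on a small example using the `typing` module. The functions and the `age` field are hypothetical. `Any` marks the loosely typed boundary where raw data comes in, while the rest of the code carries precise types that a checker such as mypy can verify.

```python
from typing import Any, Dict, List, Optional


def parse_age(raw: Any) -> Optional[int]:
    """Boundary function: takes loosely typed input, returns a checked int or None."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        return None
    return age if age >= 0 else None


def mean_age(records: List[Dict[str, Any]]) -> Optional[float]:
    """Typed core: the Optional return forces callers to handle missing data."""
    ages = [
        age
        for age in (parse_age(record.get("age")) for record in records)
        if age is not None
    ]
    if not ages:
        return None
    return sum(ages) / len(ages)


average = mean_age([{"age": "30"}, {"age": None}, {"age": 40}, {"age": "n/a"}])
```

Gradual typing means the `Any` annotations can be tightened one function at a time as the shape of the data becomes better understood.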


Summary

Basic engineering principles help

(packaging, testing, orchestration, logging, static analysis, ...)


R&D is not Engineering: can we meet halfway?


Vanity Slide

• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini