Python as part of a production machine learning stack by Michael Manapat PyData SV 2014

Python as part of a production machine learning stack. Michael Manapat (@mlmanapat), Stripe

Upload: pydata

Post on 27-Jan-2015


DESCRIPTION

Over the course of three years, we've built Stripe from scratch and scaled it to process billions of dollars of transaction volume a year by making it easy and painless for merchants to get set up and start accepting payments. While the vast majority of transactions facilitated by Stripe are honest, we do need to protect our merchants from rogue individuals and groups seeking to "test" or "cash" stolen credit cards. To combat this sort of activity, Stripe uses Python (together with Scala and Ruby) as part of its production machine learning pipeline to detect and block fraud in real time. In this talk, I'll go through the scikit-learn-based modeling process for a sample data set derived from production data to illustrate how we train and validate our models. We'll also walk through how we deploy the models and monitor them in our production environment, and how Python has allowed us to do this at scale.

TRANSCRIPT

Page 1: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014

Python as part of a production machine learning stack
Michael Manapat
@mlmanapat
Stripe

Page 2:

Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Page 3:

Stripe is a technology company focusing on making payments easy
- Short application

Page 4:

Tokenization
[Diagram] Customer browser to Stripe: Stripe.js request, Token returned.
[Diagram] Merchant server to Stripe: API call, Result returned.

Page 5:

API Call

    import stripe

    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        # final argument is obscured in the transcript: [email protected]"
    )

Page 6:

Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago

Page 7:

Fraud / business violations
- Terms of service violations
  E-cigarettes, drugs, weapons, etc.
  How do we find these automatically?

Page 8:

Merchant sign up flow
Application submission -> Website scraped -> Text scored -> Application reviewed

Page 9:

Merchant sign up flow
Application submission -> Website scraped -> Text scored -> Application reviewed
Machine learning pipeline and service

Page 10:

Building a classifier: e-cigarettes

    import pandas

    data = pandas.read_pickle('ecigs')
    data.head()

                                                    text  violator
    0  " please verify your age i am 21 years or older ...     True
    1  coming soon toggle me drag me with your mouse ...      False
    2  drink moscow mules cart 0 log in or create an ...      False
    3  vapors electronic cigarette buy now insuper st...       True
    4  t-shirts shorts hawaii about us silver coll...         False
    [5 rows x 2 columns]

Page 11:

Features for text classification

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()
    features = cv.fit_transform(data['text'])

Sparse matrix of word counts from input text (omitting feature selection)

Page 12:

Features for text classification

    # sklearn.cross_validation in scikit-learn releases before 0.18
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'], test_size=0.2)

- Avoid leakage
- Other cross-validation methods
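One way to read the two bullets above together: if the vectorizer is fitted inside each cross-validation fold, the held-out text never influences the vocabulary, which is one common source of leakage. The sketch below is an illustration only, not the pipeline from the talk; the eight example texts and their labels are made up, and the choice of 4-fold cross-validation is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for scraped website text (not Stripe data).
texts = [
    "vapor electronic cigarette buy now",
    "vape starter kit and liquid sale",
    "vape pens and cigarette refills",
    "premium vapor juice and cigarette kits",
    "shirts shorts hawaii about us",
    "drink moscow mules cart log in",
    "handmade silver jewelry collection",
    "coming soon toggle me drag me",
]
labels = [True, True, True, True, False, False, False, False]

# The pipeline refits CountVectorizer on the training part of every fold,
# so the vocabulary is never built from the fold being scored.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())

# For a classifier, cv=4 uses stratified folds: one example per class is held out each time.
scores = cross_val_score(pipeline, texts, labels, cv=4, scoring="accuracy")
print(scores.mean())
```

With a data set this small the accuracy itself means little; the point is only where the vectorizer gets fitted.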

Page 13:

Training

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_train, y_train)

Serializer reads from model.intercept_ and model.coef_
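The slide does not show the serializer itself. As a rough sketch of what reading model.intercept_ and model.coef_ could look like, the example below exports a fitted model to JSON and scores text with plain Python. The JSON layout, the score function, and the toy data are all assumptions for illustration; the tokenizer regex mirrors CountVectorizer's default token pattern.

```python
import json
import math
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data (not Stripe data).
texts = [
    "vapor electronic cigarette buy now",
    "vape starter kit sale",
    "shirts shorts hawaii about us",
    "drink moscow mules cart",
]
labels = [1, 1, 0, 0]

cv = CountVectorizer()
model = LogisticRegression().fit(cv.fit_transform(texts), labels)

# Export: one weight per vocabulary word plus the intercept.
payload = json.dumps({
    "intercept": float(model.intercept_[0]),
    "weights": {word: float(model.coef_[0][idx])
                for word, idx in cv.vocabulary_.items()},
})


def score(text, serialized):
    """Score text from the serialized parameters alone, without scikit-learn."""
    params = json.loads(serialized)
    z = params["intercept"]
    # Same tokens as CountVectorizer's defaults: lowercase, words of 2+ characters.
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        z += params["weights"].get(token, 0.0)  # unknown words contribute nothing
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


print(score("buy vapor cigarette now", payload))
```

Because a logistic regression is fully described by its intercept and coefficients, a scorer in another language only needs this payload and a matching tokenizer.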

 

Page 14:

Validation

    import matplotlib.pyplot
    from sklearn.metrics import roc_curve

    probs = model.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    matplotlib.pyplot.plot(fpr, tpr)

Page 15:

ROC: Receiver operating characteristic
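The ROC plot did not survive in this transcript. As a small worked example of what the curve encodes, the sketch below computes the curve and its area for four made-up scores, then picks the threshold with the best true positive rate under a false positive budget. The labels, scores, and the max_fpr budget are hypothetical, and the threshold rule is just one possible choice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC is the probability that a random positive outranks a random negative:
# here 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc_score(y_true, y_score)

# Assumed policy: tolerate no false positives, then maximize recall.
max_fpr = 0.0
allowed = fpr <= max_fpr
chosen_threshold = thresholds[allowed][np.argmax(tpr[allowed])]

print(auc, chosen_threshold)
```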

 

Page 16:

Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
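The talk does not say how the "strip HTML" step is implemented. A minimal sketch using only the standard library might look like the following; the sanitize function, the decision to drop script and style contents, and the lowercasing are assumptions made to match the look of the sample data shown earlier.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and lowercase the result.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip().lower()


def sanitize(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><style>p{color:red}</style></head><body>"
        "<h1>Vapor Shop</h1><script>var x = 1;</script>"
        "<p>Buy now &amp; save</p></body></html>")
print(sanitize(page))
```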

Page 17:

Luigi

    import luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        def requires(self):
            return GetSnapshots()

Page 18:

Luigi runs tasks on Hadoop cluster

Page 19:

Scoring as a service
Application submission -> Website scraped -> Text scored -> Application reviewed
Thrift RPC
Scoring Service

Page 20:

Scoring as a service

    struct ScoringRequest {
      1: string text
      2: optional string model_name
    }

    struct ScoringResponse {
      1: double score  // Experiments?
      2: double request_duration
    }
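The handler behind these structs is not shown in the talk. The sketch below mirrors the two Thrift structs as Python dataclasses and shows one way a handler might pick a model and time the request. The ScoringHandler class, the model registry, and the toy model are hypothetical, and the Thrift server wiring is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct


@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds spent scoring


class ScoringHandler:
    """Hypothetical handler: looks up a model by name and times the call."""

    def __init__(self, models: Dict[str, Callable[[str], float]], default_model: str):
        self._models = models
        self._default = default_model

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        name = request.model_name or self._default
        value = self._models[name](request.text)
        return ScoringResponse(score=value,
                               request_duration=time.perf_counter() - start)


# Toy model standing in for a trained classifier.
handler = ScoringHandler(
    {"ecigs": lambda text: 0.9 if "vapor" in text else 0.1},
    default_model="ecigs",
)
response = handler.score(ScoringRequest(text="vapor shop"))
print(response.score)
```

Keeping model_name optional lets a caller fall back to the default model while still allowing a specific one to be requested, which is one way the experimentation mentioned in the talk could be supported.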

Page 21:

Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation

Page 22:

- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
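How the service reports to Graphite is not described in the talk. As one possibility, Graphite's carbon daemon accepts a plaintext protocol of "path value timestamp" lines, by default on port 2003. The sketch below formats and sends such a line; the metric name, host, and helper functions are assumptions for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(path, value, host="localhost", port=2003):
    """Send a single metric to a carbon listener (host and port are assumptions)."""
    line = format_metric(path, value).encode("ascii")
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line)


# Hypothetical metric name for the scoring service's latency.
print(format_metric("scoring.ecigs.request_duration", 0.012, 1400000000), end="")
```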

Page 23:

Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Thanks! @mlmanapat