Python as Part of a Production Machine Learning Stack by Michael Manapat PyData SV 2014
DESCRIPTION
Over the course of three years, we've built Stripe from scratch and scaled it to process billions of dollars of transaction volume a year by making it easy and painless for merchants to get set up and start accepting payments. While the vast majority of transactions facilitated by Stripe are honest, we do need to protect our merchants from rogue individuals and groups seeking to "test" or "cash" stolen credit cards. To combat this sort of activity, Stripe uses Python (together with Scala and Ruby) as part of its production machine learning pipeline to detect and block fraud in real time. In this talk, I'll go through the scikit-based modeling process for a sample data set that is derived from production data to illustrate how we train and validate our models. We'll also walk through how we deploy the models and monitor them in our production environment and how Python has allowed us to do this at scale.
TRANSCRIPT
Python as part of a production machine learning stack Michael Manapat @mlmanapat Stripe
Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
Stripe is a technology company focusing on making payments easy
- Short application
Tokenization
[Diagram: Customer browser -> Stripe via Stripe.js, returning a Token; Merchant server -> Stripe via API call, returning a Result]
API Call
    import stripe

    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        # final argument garbled in the source: [email protected]"
    )
Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago
Fraud / business violations
- Terms of service violations
  E-cigarettes, drugs, weapons, etc.
  How do we find these automatically?
Merchant sign up flow
Application submission -> Website scraped -> Text scored -> Application reviewed
Merchant sign up flow
Application submission -> Website scraped -> Text scored -> Application reviewed
Machine learning pipeline and service
Building a classifier: e-cigarettes
    import pandas

    # Load the labeled website text (one row per merchant site)
    data = pandas.read_pickle('ecigs')
    data.head()

                                                    text violator
    0  " please verify your age i am 21 years or older ...     True
    1  coming soon toggle me drag me with your mouse ...    False
    2  drink moscow mules cart 0 log in or create an ...    False
    3  vapors electronic cigarette buy now insuper st...     True
    4  t-shirts shorts hawaii about us silver coll...    False

    [5 rows x 2 columns]
Features for text classification
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()
    features = cv.fit_transform(data['text'])
Sparse matrix of word counts from input text (omitting feature selection)
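
To make "sparse matrix of word counts" concrete, here is a minimal standard-library sketch of the idea. It is not scikit-learn's implementation: the whitespace tokenizer, the helper names `fit_vocabulary` and `transform`, and the two sample snippets are all assumptions made for illustration.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct whitespace-separated token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Return one sparse row per document as {column_index: count}; unknown tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
        rows.append({vocab[tok]: n for tok, n in counts.items()})
    return rows


# Made-up snippets standing in for scraped website text
docs = ["buy vapors buy now", "hawaii shirts now"]
vocab = fit_vocabulary(docs)
rows = transform(docs, vocab)
```

Each row stores only the non-zero counts, which is why the representation stays small even when the vocabulary has many thousands of columns.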
Features for text classification
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'], test_size=0.2)
- Avoid leakage
- Other cross-validation methods
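
One common way to avoid leakage is to split the raw text first and learn the vocabulary from the training portion only, so nothing about the held-out documents reaches the features. The sketch below shows that ordering with the standard library; the helper `split_then_fit`, the seed, and the sample texts are hypothetical, and the slides do not say which leakage safeguards Stripe used.

```python
import random


def split_then_fit(texts, labels, test_size=0.2, seed=0):
    """Hold out a test slice first, then build the vocabulary from training text only."""
    idx = list(range(len(texts)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # The vocabulary never sees the held-out documents
    vocab = sorted({tok for i in train_idx for tok in texts[i].split()})
    train = [(texts[i], labels[i]) for i in train_idx]
    test = [(texts[i], labels[i]) for i in test_idx]
    return train, test, vocab


# Made-up examples standing in for labeled website text
texts = ["vapor shop", "hawaii shirts", "vapor juice", "moscow mules", "silver rings"]
labels = [True, False, True, False, False]
train, test, vocab = split_then_fit(texts, labels)
```

With scikit-learn the same ordering means calling `train_test_split` on the raw text, then `fit_transform` on the training text and `transform` on the test text.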
Training
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_train, y_train)
Serializer reads from model.intercept_ and model.coef_
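
Because a logistic regression is fully described by its intercept and coefficients, those two arrays are enough to reproduce scores outside scikit-learn. The following is a sketch under stated assumptions: the JSON layout, the function names `export_model` and `score`, and the parameter values are invented for illustration, and the talk does not describe the actual serialization format.

```python
import json
import math


def export_model(intercept, coef, vocab):
    """Serialize the fitted parameters to JSON (this format is an assumption)."""
    weights = {tok: coef[i] for tok, i in vocab.items() if coef[i] != 0.0}
    return json.dumps({"intercept": intercept, "weights": weights}, sort_keys=True)


def score(serialized, text):
    """Compute sigmoid(intercept + sum of per-token weights) without scikit-learn."""
    params = json.loads(serialized)
    z = params["intercept"]
    for tok in text.lower().split():
        # Adding the weight once per occurrence equals count * weight
        z += params["weights"].get(tok, 0.0)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters; for a fitted binary model they would come from
# model.intercept_[0], model.coef_[0], and cv.vocabulary_.
vocab = {"hawaii": 0, "shirts": 1, "vapors": 2}
blob = export_model(-1.0, [-0.5, 0.0, 2.0], vocab)
p_vapors = score(blob, "vapors vapors now")
p_hawaii = score(blob, "hawaii shirts")
```

Dropping zero weights keeps the exported file small, and any language that can parse JSON and compute a sigmoid can then score text the same way.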
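Validation
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve

    probs = model.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    plt.plot(fpr, tpr)  # true positive rate against false positive rate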
ROC: Receiver operating characteristic
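
For readers who want to see what the curve is made of, this is a small hand-rolled sketch of the computation: for each threshold, count how many violators and non-violators score at or above it. The labels and scores are made up, the helpers `roc_points` and `auc` are hypothetical names, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, from the highest score down."""
    pos = sum(1 for y in y_true if y)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and not y)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels and predicted probabilities
y_true = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_points(y_true, scores)
area = auc(points)
```

Each point on the curve corresponds to one possible blocking threshold, so the plot shows the trade-off between catching violators and flagging legitimate merchants.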
Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
Luigi
    import luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        def requires(self):
            return GetSnapshots()
Luigi runs tasks on Hadoop cluster
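
The value of declaring `requires()` is that run order falls out of the dependency graph. The toy resolver below mimics that idea with the standard library only. It is not Luigi's scheduler: the `Task` base class and `build` function are stand-ins, and `SanitizeText` and `TrainModel` are hypothetical task names taken from the pipeline steps listed earlier.

```python
class Task:
    """Tiny stand-in for luigi.Task: subclasses override requires() and run()."""

    def requires(self):
        return []

    def run(self, log):
        log.append(type(self).__name__)


class GetSnapshots(Task):
    pass


class SanitizeText(Task):  # hypothetical task name
    def requires(self):
        return [GetSnapshots()]


class GenFeatures(Task):
    def requires(self):
        return [SanitizeText()]


class TrainModel(Task):  # hypothetical task name
    def requires(self):
        return [GenFeatures()]


def build(task, log, done=None):
    """Run dependencies depth-first, then the task itself, skipping finished task types."""
    done = set() if done is None else done
    name = type(task).__name__
    if name in done:
        return
    for dep in task.requires():
        build(dep, log, done)
    task.run(log)
    done.add(name)


log = []
build(TrainModel(), log)
```

Real Luigi decides whether a task is finished by checking that the targets returned from its `output()` method exist, which lets a failed pipeline resume from the last completed step.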
Scoring as a service
Application submission -> Website scraped -> Text scored -> Application reviewed
Thrift RPC -> Scoring Service
Scoring as a service
    struct ScoringRequest {
        1: string text
        2: optional string model_name
    }

    struct ScoringResponse {
        1: double score
        // Experiments?
        2: double request_duration
    }
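
A handler behind such an interface might look like the sketch below, which mirrors the two structs as Python dataclasses and times each request. It leaves out the Thrift transport entirely. The `ScoringHandler` class, the model registry, the default-model fallback, and the use of seconds for `request_duration` are all assumptions, since the slides show only the struct definitions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct


@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds here; the unit is an assumption


class ScoringHandler:
    """Hypothetical handler: looks up a model by name and times the call."""

    def __init__(self, models: Dict[str, Callable[[str], float]], default: str):
        self.models = models
        self.default = default

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        model = self.models[request.model_name or self.default]
        value = model(request.text)
        return ScoringResponse(score=value, request_duration=time.perf_counter() - start)


# Toy model standing in for a deserialized classifier
handler = ScoringHandler(
    models={"ecigs": lambda text: 0.9 if "vapor" in text else 0.1},
    default="ecigs",
)
response = handler.score(ScoringRequest(text="vapor shop"))
```

Selecting the model by name in the request is one way a single service could host several classifiers, and returning the duration with every response gives callers and monitoring the same latency figure.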
Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation
- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
Thanks! @mlmanapat