
Machine Learning Startups

My Background

Lessons Learned

Turn hard problems into easy ones

ML in practice requires carefully formulating research problems

...and being creative about bootstrapping training data

Lessons Learned

Many ways to capture dependencies

Training data and features > models

Lessons Learned

A model is not a product

Nobody cares about your ideas

Flightcaster

Predicting the real-time state of the global air traffic network

The Prediction Problem

Flight F departing at time T

Likelihood that F departs at T, T+n1, T+n2

Featurizing

Carrier, FAA, weather data

Nightly reset gives a natural cadence for feature vectors

Every aircraft has a unique tail #

Fuzzy k-way join on tail #, time, location (sketch below)

Isolate incorrect joins by keeping feature vecs independent

positions in past - already delayed at prediction time?

weather and status - FAA groundings at airports on path?

featurizing time - how delayed and how many mins from departure?
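A minimal Clojure sketch of the fuzzy tail-number join above, assuming hypothetical flight and position records with millisecond timestamps; the record shapes and tolerance window are illustrative, not Flightcaster's actual code:

```clojure
;; Hypothetical sketch of a fuzzy join on tail number + time.
(defn fuzzy-join
  "Attach position reports to flight records when the tail numbers match
   and the timestamps fall within tolerance-min minutes of each other."
  [flights positions tolerance-min]
  (let [by-tail (group-by :tail positions)]
    (for [flight flights
          pos    (get by-tail (:tail flight) [])
          :when  (<= (Math/abs (- (:ts flight) (:ts pos)))
                     (* tolerance-min 60 1000))]
      (assoc flight :position pos))))

;; Example: one matching and one non-matching position report.
(fuzzy-join [{:tail "N12345" :ts 1000000}]
            [{:tail "N12345" :ts 1060000 :airport "ORD"}
             {:tail "N99999" :ts 1000000 :airport "JFK"}]
            5)
;; => ({:tail "N12345" :ts 1000000 :position {:tail "N12345" :ts 1060000 :airport "ORD"}})
```

Keeping each joined feature vector independent, as noted above, means a bad join poisons only one vector rather than the whole pipeline.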

Models

Trees could pick up dependencies that a linear model couldn't

...but the advantage became trivially incremental once more sophisticated ways of featurizing dependencies were added

Tools and Deployment

Clojure on Hadoop for featurizing and model training

Wrap complexity in simple API

FP awesome for data pipelines

Write models to JSON (see the sketch below)

Product team used Rails

Read JSON and make predictions

Predictions stored in production DB for eval
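A hedged sketch of the "write models to JSON" handoff: the model here is a simple map of feature names to logistic-regression weights, so any other stack (e.g. the Rails product team) can load it and score. The schema and feature names are assumptions, not the actual format; requires org.clojure/data.json on the classpath.

```clojure
;; Hypothetical sketch: serialize model weights to JSON, read them back, score.
(require '[clojure.data.json :as json])

(def model {"intercept" -1.2
            "weights"   {"already-delayed"   2.1
                         "faa-ground-stop"   1.7
                         "mins-to-departure" -0.01}})

(spit "delay-model.json" (json/write-str model))

(defn predict
  "Logistic score for a feature map under a model read back from JSON."
  [model features]
  (let [z (reduce-kv (fn [acc k w] (+ acc (* w (get features k 0.0))))
                     (get model "intercept")
                     (get model "weights"))]
    (/ 1.0 (+ 1.0 (Math/exp (- z))))))

(predict (json/read-str (slurp "delay-model.json"))
         {"already-delayed" 1.0 "mins-to-departure" 45.0})
;; => ~0.61
```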

Pain Points

Log-based debugging paradigm sucks

Don't want to catch ETL and feature-engineering issues only in the Hadoop setting

At the same time, can't catch them at tiny scale, because they need real data at material scale

dirty data -- manual entry

Early days of Clojure / Hadoop

Deploying JSON models rather than services

Lessons learned

Model selection mattered less than featurizing

Many ways to capture dependencies

Intuitions of domain experts are useful but often misleading

Use domain experts to identify data sources

Then build good tools and take scientific approach to exploring the feature space

Computational graph built with higher-order functions in order to log structured data

Inspired fast debugging with plumbing.graph at Prismatic
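As a small illustration of the computational-graph idea, here is a sketch using Prismatic's open-source plumbing library (prismatic/plumbing on the classpath); the feature keys are invented, the point is that each step declares its inputs, so any intermediate piece of structured data can be logged or inspected on its own:

```clojure
;; Sketch of a featurization pipeline as a declarative graph.
(require '[clojure.string :as str]
         '[plumbing.core :refer [fnk]]
         '[plumbing.graph :as graph])

(def feature-graph
  {:tokens   (fnk [text]   (str/split text #"\s+"))
   :n-tokens (fnk [tokens] (count tokens))
   :bigrams  (fnk [tokens] (map vector tokens (rest tokens)))})

;; Compiling the graph yields an ordinary function of the input map; every
;; intermediate key is present in the result, which makes isolated debugging
;; (single thread, multi thread, multi process, multi machine) much easier.
(def featurize (graph/compile feature-graph))

(into {} (featurize {:text "flight delayed at ORD"}))
;; => {:tokens [...], :n-tokens 4, :bigrams (["flight" "delayed"] ...)}
```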

Isolate issues: single thread, multi thread, multi process and multi machine

Production was OK, not great

Better to put ML behind services

Full stack product team calls backend team APIs

A model is not a product

Humans don’t understand probability distributions

Even if discretized or turned into classification

Solve a human need directly -- turn into recommendations, etc

Prismatic

Personalized Ranking of People, Topics, and Content

The Personalized Ranking Problem

Given an index of content, display the content that maximizes the likelihood of user engagement

Intention: maximize long-term engagement

Proxy: maximize session interactions

Content

Focused crawling of Twitter, FB and web

Maximum coverage algorithms

Spam content and de-duping

Featurizing

Content and interaction features

Feature crosses and hacks for dependencies (sketch below)

Bootstrapping weight hacks -- can’t train on overly sparse interactions

Scores for interests (topics, people, publishers, …)
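A minimal sketch of the feature-cross idea referenced above, assuming features are a map from string names to values; the feature names and pairs are invented for illustration:

```clojure
;; Sketch: add explicit conjunction ("cross") features so a linear/logistic
;; model can capture simple dependencies between base features.
(defn cross-features
  "For each [a b] pair, add a feature named \"a&b\" whose value is the
   product of the two base feature values."
  [features pairs]
  (reduce (fn [fs [a b]]
            (assoc fs (str a "&" b)
                   (* (get features a 0.0) (get features b 0.0))))
          features
          pairs))

(cross-features {"topic:startups" 1.0 "follows-author" 1.0 "doc-age-hours" 3.0}
                [["topic:startups" "follows-author"]])
;; => original map plus {"topic:startups&follows-author" 1.0}
```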

Models: Personalized Ranking

Logistic regression -- newsfeed ranking has to be ultra fast in production (~100 ms); sketch below

Learning to rank -- inversions

Universal features, user specific weight-vectors

Snapshot every session
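A hedged sketch of scoring with universal features and a user-specific weight vector, as described above; the feature names and weights are invented, and a production ranker would score sparse vectors far more efficiently:

```clojure
;; Sketch: logistic scoring with a per-user weight map over shared features.
(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn score
  "Predicted engagement probability for one doc under one user's weights."
  [user-weights doc-features]
  (sigmoid (reduce-kv (fn [acc k v] (+ acc (* v (get user-weights k 0.0))))
                      0.0
                      doc-features)))

(defn rank-feed
  "Order candidate docs for a user by predicted engagement, highest first."
  [user-weights docs]
  (sort-by #(- (score user-weights (:features %))) docs))

(rank-feed {"topic:clojure" 1.4 "is-viral" -0.3}
           [{:id 1 :features {"topic:clojure" 1.0}}
            {:id 2 :features {"is-viral" 1.0}}])
;; => doc 1 ranks above doc 2
```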

Models: Classification

How do you train a large set of topic classifiers?

Latent topic models don’t work

But how would we get labeled data to train a classifier for each topic?

Enter distant supervision

Create mechanism to bootstrap training data with noisy labels

Requires lots of heuristics and clever hacks

Snarf docs with twitter queries, etc

Create pos and negs using filters and distance measures

Lots of techniques to featurize text for filters and training
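A minimal sketch of the distant-supervision bootstrap described above: snarf candidate documents, then turn heuristic filters into noisy positive/negative labels. The cue lists and rules here are invented, not Prismatic's actual heuristics:

```clojure
;; Sketch: bootstrap noisy topic-classifier training data from heuristics.
(require '[clojure.string :as str])

(def positive-cues #{"clojure" "lisp" "jvm"})
(def negative-cues #{"recipe" "horoscope"})

(defn noisy-label
  "Return :pos, :neg, or nil (discard) for a candidate document."
  [doc]
  (let [words (set (str/split (str/lower-case (:text doc)) #"\W+"))]
    (cond
      (some positive-cues words) :pos
      (some negative-cues words) :neg
      :else nil)))

(defn bootstrap-training-set [docs]
  (keep (fn [d] (when-let [y (noisy-label d)] (assoc d :label y))) docs))

(bootstrap-training-set
 [{:id 1 :text "Rich Hickey on Clojure and the JVM"}
  {:id 2 :text "A quick weeknight pasta recipe"}
  {:id 3 :text "Weather tomorrow"}])
;; => doc 1 labeled :pos, doc 2 labeled :neg, doc 3 discarded
```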

Tools and Deployment

Clojure! Plumbing on GitHub

Clj backend and cljs frontend

Graph, schema, ML libs

Pain Points

Presentation biases

People click what they're shown

Biases clicks on top stories

Self-reinforcing viral propagation engine

Data Issues

Dupes are easy, but spambots and botnets keep getting more sophisticated, especially on Twitter

Bootstrapping distant supervision is hard but OK

Bootstrapping ranking with sparse interactions is super hard

Social vs interest based personalization

What’s interesting vs what’s viral?

How do you define what’s interesting?

How much is a share worth compared with dwell time?

Researchers are biased toward their own preferences

Lessons learned

Went overboard with Clojure NIH (not-invented-here)

Environment changes fast -- missed Spark, etc.

Automated classifier training data and retraining with zero intervention

Can optimize interactions a lot (> 50%)

When data is too sparse, optimize product before optimizing models

Heuristic IR may be good enough for a while

Investment in learning to rank is massive

Goal

10 companies, 3 years, $65M

So far

2 companies, 6 months, $1MM

Unsexy low beta

Prop models, prop data

Cyber MGA

Indirect losses
● stock price
● credit rating
● sales

Market

➔ 2.5B today
➔ 35% growth
➔ 50B in 10 years

Catalysts

➔ SEC and EU regs
➔ High-profile breaches
➔ Large indirect losses

➔ Positives: recorded breaches

➔ Negatives: random sample of companies (not attacked)

➔ Features: security features

● DNS records, certs, service vulnerabilities, …

First iteration - Supervised Learning

#FAIL

➔ Incorrect assumptions: that breached companies have worse security, and that the negative samples were never attacked

First iteration - Supervised Learning

Likelihood of breach

Absence of historical data and nonstationarity create a challenging environment

➔ Rich current data isn’t available historically and decays in predictive power over time

➔ Could static data be a more robust and stable predictor of risk?

Relationship with catastrophes

Insurance models for earthquakes, floods, hurricanes
Sparse events (cannot estimate probabilities from frequencies)
Events are correlated (how true is this for cyber?)

Can we draw from ideas in cat risk to model cyber risk?

Relationship with catastrophes

Cat:
➔ Stochastic simulation using physical models
➔ Impacts change in magnitude but not type

Cyber:
➔ Behavior of incentivised cyber-attackers is hard to model
➔ Impacts change over time

Broader Approach

[Diagram: dynamic behavior (time series), industry baseline, infrastructure security, social engineering, frequency of breaches, size of loss, assets (+ lifecycle), uncertainty load]

Premium Decomposition

Simplifying assumption: we can start incrementally and assume loss magnitude always hits the policy limit

Likelihood and uncertainty depend on the breach sample
➔ Estimate uncertainty from confidence
➔ Estimate likelihood from risk features
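A loudly hypothetical sketch of how a premium could decompose under the simplifying assumption above (loss always hits the limit), with an uncertainty load that shrinks as the breach sample grows. None of the names, formulas, or numbers come from the deck; they only illustrate the decomposition:

```clojure
;; Hypothetical premium decomposition sketch; all terms are assumptions.
(defn premium
  "Expected loss = breach likelihood x policy limit (loss assumed to hit the
   limit), plus an uncertainty load that falls off with sample size, grossed
   up by an expense ratio."
  [{:keys [breach-likelihood policy-limit sample-size expense-ratio]}]
  (let [expected-loss    (* breach-likelihood policy-limit)
        uncertainty-load (* expected-loss (/ 1.0 (Math/sqrt (max sample-size 1))))]
    (* (+ expected-loss uncertainty-load)
       (+ 1.0 expense-ratio))))

(premium {:breach-likelihood 0.02
          :policy-limit      1000000
          :sample-size       400
          :expense-ratio     0.3})
;; => (20000 + 1000) * 1.3 = 27300.0
```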

Indirect Losses

Quantifying indirect losses is complicated
➔ Normalizing market and industry effects
➔ Effect of news and corporate events?
➔ Over what time period?
➔ How do we define a statistically significant loss?

Investigation Tools

Roadmap

Version | Freq. estimation | Loss model | Pricing support
V1 | Industry-based freq. | Stock loss | Uncertainty from variance
V2 | Net. security model | -- | --
V3 | Behavior of company | -- | Better uncertainty quant.
V4 | -- | Sales losses | --
V5 | -- | Credit rating | --
V6 | Social engineering | -- | --
V7 | -- | -- | Pricing model

Future Challenges

Accumulation risk

➔ Correlated breaches
➔ Autonomous vehicles
➔ Supply chain
➔ Physical damage

Bloomberg for Back Office

The world's first AI-enabled compliance solution

Market

➔ Banks spend ~100B on compliance

➔ ~20B on analytics alone

➔ growing at 20% annually

Catalysts

➔ 9/11 and 2008 crisis

➔ 20X explosion in fines

➔ Exec departures

Computer Vision

Image due diligence

Image distance for ID check

➔ Detect faces in the image using pre-trained models

➔ Transform the image using real-time pose estimation

➔ Represent the face on a hypersphere using a neural network

➔ Apply any classification technique to the resulting features

Image due diligence

➔ Check whether the photos on several IDs belong to the same person

➔ Perform image due diligence against criminal and other databases
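A hedged sketch of the "image distance" check above: once faces are embedded as vectors (by whatever pre-trained model), the ID comparison reduces to a distance threshold. The embedding values and threshold are invented; producing the embeddings (detection, alignment, CNN) is assumed to have already happened:

```clojure
;; Sketch: decide whether two ID photos show the same person by comparing
;; pre-computed face embeddings with a cosine-similarity threshold.
(defn dot  [a b] (reduce + (map * a b)))
(defn norm [a]   (Math/sqrt (dot a a)))

(defn cosine-similarity [a b]
  (/ (dot a b) (* (norm a) (norm b))))

(defn same-person?
  [embedding-a embedding-b threshold]
  (>= (cosine-similarity embedding-a embedding-b) threshold))

(same-person? [0.12 0.80 0.55] [0.10 0.78 0.60] 0.95)
;; => true (these toy vectors are nearly parallel)
```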

NLP: Detecting Adverse News

IR approach: name + keyword in same sentence

● Low false negs
● High false pos

John Smith

● Judge John Smith sentenced James Doe for money laundering.
● Amy Smith is accused of the murder of her brother John Smith.
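A minimal sketch of the IR baseline above (person name + risk keyword in the same sentence), which reproduces exactly the false-positive behavior in the John Smith examples; the keyword list is illustrative:

```clojure
;; Sketch of the name + keyword co-occurrence baseline. It also fires on
;; judges and victims, hence the high false-positive rate.
(require '[clojure.string :as str])

(def risk-keywords #{"fraud" "laundering" "sentenced" "accused"})

(defn adverse-sentence?
  "True when the person's name and any risk keyword appear in the sentence."
  [person sentence]
  (let [lower (str/lower-case sentence)]
    (and (str/includes? lower (str/lower-case person))
         (some #(str/includes? lower %) risk-keywords))))

(map #(adverse-sentence? "John Smith" %)
     ["Judge John Smith sentenced James Doe for money laundering."
      "Amy Smith is accused of the murder of her brother John Smith."
      "John Smith opened a bakery in Leeds."])
;; => (true true false)  ; the first two are false positives
```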

Raptor NLP: detecting adverse news

Classification approaches:

● General entity-centric “sentiment” classifier
○ High coverage
○ Not easy to interpret and understand what is going on

● Multiple specific relationship extractors (X sentenced for Y, X accused of Y, …)
○ Lower coverage
○ Easy to debug and understand

Problem formulation

Training data: generate noisy training data using heuristics

Positive examples: Look for mentions of people with bad news

Negative examples: the tricky and hard part. Many heuristics:

● Use lists of judges and attorneys, search for their mentions
● Simple syntactic rules: “X said”, ...

Distant Supervision: Training data

Heuristics fall into three categories:

a) Poor: doesn't work

b) Low coverage: only catches a few samples

c) Good: big impact on performance

Distant Supervision: Training data

Different sources have different rates of true vs false positives (think bbc.com vs court proceedings reports).

Use this info with some other heuristic to gain a lot of negative samples.

One heuristic might even be a previous version of the classifier.

Distant Supervision: One nice heuristic

We have to work in multiple languages, which limits the use of features coming from tools like dependency parsers.

Currently exploring heuristics based on parse trees and machine translation.

Distant Supervision: Languages

Need a model that captures entity centric features and word order

Logistic with classic text features (raw bigrams, entity-centric features, dependency parse features, …)

○ Lot of time spent building features
○ Easier to understand, interpret and debug than neural nets

Deep learning: RNN/CNNs

○ Saves time on feature engineering
○ Hard to debug, understand and interpret
○ Currently, slightly better performance than features + logistic

Distant Supervision: Modeling Approach

Modeling Approach: Recurrent Networks

Pre-trained word embeddings

a) Help to achieve better performance
b) Can be easily obtained for any language
c) Can be shared across multiple tasks

Distant Supervision: Modeling Approach

Modeling Approach: Convolutional Networks

CNN vs RNN setups for NLP

a) CNNs are coming into NLP from CV
b) CNNs faster than RNNs and can have similar performance
c) In our case, currently a tie

Distant Supervision: Modeling Approach

Open problems with false negatives:

● Information spanning multiple sentences:
○ Coreference resolution (John is mayor of Boston. He was sentenced for …)
○ Discourse analysis (relations between sentences)

● Analysis of formatted text (tables, bullet points, …)

Key Takeaways and open problems

Key Takeaways:

● Improving training data helps a lot more than tweaking the model
● Avoid the academic trap of testing many neural net architectures

Key Takeaways and open problems

Risk Ranking and Networks

Google for Risk

➔ Google wins in ranking because it has the most user click data

➔ We win in risk because we have analyst annotation data

Task: Search for CDD/EDD sources and rank the results based on the risk they represent

Goal: Do not miss anything important AND filter as many false positives as possible

Google for risk

[Diagram: query terms (fraud, charged, forgery) → Raptor → ranked results]

Problems:
● accurate identification of the person (name collisions)
● identify the right context the person is mentioned in

Additional requirements:
● interpretable results on all levels: rank, risk, NLP
● utilize user feedback: implicit vs explicit

Google for risk - approaches

Risk Model Validation and Interpretability

Prediction vs. Ranking:

Prediction
● Want scores and filtering
● Still interesting to order results
● Loss is error

Learning to Rank
● Want optimal ordering of results
● Scores not interesting
● Loss is number of inversions
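A small sketch of the inversion loss mentioned above: given relevance labels read off in ranked order, count the pairs that appear in the wrong order. The labels and examples are illustrative:

```clojure
;; Sketch: count inversions in a ranking, i.e. pairs where a less relevant
;; item sits above a more relevant one. Input is relevance grades read in
;; ranked order (higher = more relevant).
(defn inversions [labels-in-ranked-order]
  (let [v (vec labels-in-ranked-order)
        n (count v)]
    (count (for [i (range n)
                 j (range (inc i) n)
                 :when (< (v i) (v j))]
             [i j]))))

(inversions [3 1 2])   ;; => 1  (items at positions 2 and 3 are swapped)
(inversions [3 2 1])   ;; => 0  (perfect ordering)
```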

Google for risk - approaches

Risk Networks

Task: Identify risks from the person’s social network

Evaluate risk in the network on different levels:
● node
● edge
● path
● subgraph

Facebook for risk

Facebook for risk: Streamline investigation with the risk network

➔ Links between all people and business entities

➔ PageRank for risk (sketch below)

➔ See the riskiest paths through the network

➔ Drill down into high-risk customer-customer and customer-entity relationships
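A compact sketch of the "PageRank for risk" idea on a toy adjacency map, using plain power iteration. A risk-weighted variant would personalize the teleport vector with per-node CDD/EDD scores; this toy version deliberately does not, and the graph is invented:

```clojure
;; Sketch: PageRank by power iteration over {node -> seq of neighbours}.
(defn pagerank
  [adj damping iters]
  (let [nodes (vec (keys adj))
        n     (count nodes)
        base  (/ (- 1.0 damping) n)]
    (loop [ranks (zipmap nodes (repeat (/ 1.0 n)))
           k     iters]
      (if (zero? k)
        ranks
        (recur (into {}
                     (for [v nodes]
                       [v (+ base
                             (* damping
                                (reduce + 0.0
                                        (for [u nodes
                                              :when (some #{v} (adj u))]
                                          (/ (ranks u) (count (adj u)))))))]))
               (dec k))))))

(pagerank {:person-a  [:company-x]
           :person-b  [:company-x :person-a]
           :company-x [:person-a]}
          0.85 20)
;; => :company-x and :person-a accumulate the most rank
```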

Building the multigraph (1)

Nodes = entities (people and orgs)
Edges = relationships

1. Start with extracting relationships from structured databases:
○ Wikipedia
○ Company Registers
○ Panama Papers, etc.

2. De-duplicate nodes across different datasets
○ another ML problem

Facebook for risk

[Example risk network around David Cameron: related people (Ian Donald Cameron, Mary Fleur Mount, Nancy Gwen, Arthur Elwen) and entities (Blairmore Holdings Inc, Cameron Optometry Limited, Univel Limited, Accelerated Mailing & Marketing Limited, Mannings Heath Hotel Limited), built from Wikipedia, Open Corporates, and the Panama Papers. Query: David Cameron; risk: Blairmore Holdings Inc]

Building the multigraph (2)

Extract additional relationships from text => more NLP
○ named entity extraction for nodes
○ relation extraction for edges

Facebook for risk

Ian Cameron was a director of Blairmore Holdings Inc, an investment fund run from the Bahamas but named after the family’s ancestral home in Aberdeenshire.

[Annotation: subject = Ian Cameron, relation = “was a director of”, object = Blairmore Holdings Inc]

Roadmap

1. Start with CDD/EDD risk scores at the node level
2. Propagate risk across edges to derive edge weights
3. Social Network Analysis:

○ random walk with restarts: PageRank, HITS, Personalized PageRank

4. Subgraph risk ranking: use SNA approaches to featurize graph for ranking

Facebook for risk

Link prediction task: inferring new relationships from the network and behavior

Approach
1. add behavior data to network
2. extract features from network
○ node based: Jaccard's coef., preferential attachment … (sketch below)
○ path based: Katz, PageRank, Personalized PageRank …
○ feature matrices over pairs of nodes: Path Ranking Algorithm
3. combine with semantic features for each node
4. treat as a binary classification problem
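A brief sketch of two of the node-based features named above, Jaccard's coefficient and preferential attachment, computed from neighbour sets; the toy graph is invented:

```clojure
;; Sketch: node-based link-prediction features over neighbour sets.
(require '[clojure.set :as set])

(defn jaccard
  "Overlap of two nodes' neighbour sets."
  [neighbours-a neighbours-b]
  (let [union (set/union neighbours-a neighbours-b)]
    (if (empty? union)
      0.0
      (/ (count (set/intersection neighbours-a neighbours-b))
         (double (count union))))))

(defn preferential-attachment
  "Product of the two nodes' degrees."
  [neighbours-a neighbours-b]
  (* (count neighbours-a) (count neighbours-b)))

(let [a #{:blairmore :univel}
      b #{:blairmore :cameron-optometry}]
  {:jaccard          (jaccard a b)                   ;; => ~0.333
   :pref-attachment  (preferential-attachment a b)}) ;; => 4
```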

Facebook for risk

CEAI Topics

Computational Finance

Computer Vision

Computational Bio & Medicine

Proactive Full-stack MGAs
