Machine Learning Startups

Post on 08-Jul-2020

Machine Learning Startups

My Background

Lessons Learned

Turn hard problems into easy ones

ML in practice requires carefully formulating research problems

...and being creative about bootstrapping training data

Lessons Learned

Many ways to capture dependencies

Training data and features > models

Lessons Learned

A model is not a product

Nobody cares about your ideas

Flightcaster

Predicting the real-time state of the global air traffic network

The Prediction Problem

Flight F departing at time T

Likelihood that F departs at T, T+n1, T+n2
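One way to read this target is as a probability distribution over discrete delay buckets. The sketch below is illustrative only: the bucket edges `N1` and `N2` and the label names are assumptions, since the slides do not give values for n1 and n2.

```python
from collections import Counter

# Hypothetical bucket edges in minutes; the talk does not specify n1 and n2.
N1, N2 = 15, 60
LABELS = ("on_time", "delayed_n1", "delayed_n2")

def delay_bucket(delay_minutes):
    """Map an observed departure delay (minutes after T) to a class label."""
    if delay_minutes < N1:
        return "on_time"
    if delay_minutes < N2:
        return "delayed_n1"
    return "delayed_n2"

def empirical_distribution(delays):
    """Turn a list of observed delays into a probability per bucket."""
    counts = Counter(delay_bucket(d) for d in delays)
    total = sum(counts.values())
    return {label: counts[label] / total for label in LABELS}

# Four historical departures of a hypothetical flight F.
dist = empirical_distribution([0, 5, 20, 90])
```

A trained model would output such a distribution per flight from features rather than from raw historical counts.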

Featurizing

Carrier, FAA, weather data

Nightly reset natural cadence for feature vecs

Every aircraft has a unique tail #

Fuzzy k-way join on tail #, time, location

Isolate incorrect joins by keeping feature vecs independent
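A fuzzy join on tail number, time, and location could look roughly like the following. This is a two-way sketch, not the k-way join from the talk; the record fields (`tail`, `airport`, `time`) and the 30-minute tolerance are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def fuzzy_join(left, right, max_gap=timedelta(minutes=30)):
    """Pair each left record with the closest-in-time right record that has
    the same tail number and airport, if one lies within max_gap."""
    pairs = []
    for left_rec in left:
        candidates = [
            r for r in right
            if r["tail"] == left_rec["tail"]
            and r["airport"] == left_rec["airport"]
            and abs(r["time"] - left_rec["time"]) <= max_gap
        ]
        if candidates:
            best = min(candidates, key=lambda r: abs(r["time"] - left_rec["time"]))
            pairs.append((left_rec, best))
    return pairs

# Hypothetical records: one carrier schedule row and three FAA reports.
carrier = [{"tail": "N123AB", "airport": "SFO", "time": datetime(2010, 5, 1, 8, 0)}]
faa = [
    {"tail": "N123AB", "airport": "SFO", "time": datetime(2010, 5, 1, 8, 12)},
    {"tail": "N123AB", "airport": "SFO", "time": datetime(2010, 5, 1, 11, 0)},
    {"tail": "N999ZZ", "airport": "SFO", "time": datetime(2010, 5, 1, 8, 1)},
]
matches = fuzzy_join(carrier, faa)
```

Joining each source pair separately and keeping the resulting feature vectors apart is one way to keep a bad join from contaminating features derived from the other sources.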

positions in past - already delayed at prediction time?

weather and status - FAA groundings at airports on path?

featurizing time - how delayed and how many mins from departure?

Models

trees could pick up dependencies that linear model couldn’t

but perf gains became marginal once we added more sophisticated ways of featurizing dependencies

Tools and Deployment

Clojure on hadoop for featurizing and model training

Wrap complexity in simple API

FP awesome for data pipelines

Write models to json

Product team used Rails

Read json and make predictions

Predictions stored in production DB for eval
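The hand-off described here, where the training side writes a model to JSON and the product side reads it to score requests, might be sketched as follows. Python stands in for both sides (the talk used Clojure and Rails), and the logistic model form, the JSON layout, and the feature names are all assumptions.

```python
import json
import math

def dump_model(weights, bias):
    """Training side: serialize a (hypothetical) logistic model to JSON."""
    return json.dumps({"type": "logistic", "weights": weights, "bias": bias})

def predict(model_json, features):
    """Product side: read the JSON and compute a delay probability."""
    model = json.loads(model_json)
    z = model["bias"] + sum(
        model["weights"].get(name, 0.0) * value
        for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature names and weights, for illustration only.
model_json = dump_model({"minutes_late_inbound": 0.04, "faa_ground_stop": 1.5}, -2.0)
p_delay = predict(model_json, {"minutes_late_inbound": 30, "faa_ground_stop": 1})
```

The drawback noted later in the talk applies to this design: the scoring logic has to be reimplemented wherever the JSON is consumed, which a prediction service would avoid.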

Pain Points

Log-based-debugging paradigm sucks

Don’t want to catch ETL and feature eng issues in hadoop setting

At the same time, cannot catch them at tiny scale because that needs real data at material scale

dirty data -- manual entry

early days of clojure / hadoop

deploying json models rather than services

Lessons learned

Model selection mattered less than featurizing

Many ways to capture dependencies

Intuitions of domain expert useful but also often misleading

Use domain experts to identify data sources

Then build good tools and take scientific approach to exploring the feature space

Computational graph with HOF in order to log structured data

Inspired fast debugging with plumbing.graph at Prismatic

Isolate issues: single thread, multi thread, multi process and multi machine
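The idea of a computational graph whose nodes are wrapped by a higher-order function for structured logging can be illustrated in a few lines. This is a toy sketch in Python, not the plumbing.graph API: the function names are invented here, and nodes run in declaration order rather than being topologically sorted.

```python
import inspect

def logged(name, fn, log):
    """Higher-order function: wrap fn so every call appends a structured record."""
    def wrapper(**kwargs):
        result = fn(**kwargs)
        log.append({"node": name, "inputs": kwargs, "output": result})
        return result
    return wrapper

def run_graph(graph, inputs, log):
    """Evaluate nodes in declaration order. Each node's parameter names refer
    to graph inputs or to nodes computed earlier."""
    values = dict(inputs)
    for name, fn in graph.items():
        params = inspect.signature(fn).parameters
        args = {p: values[p] for p in params}
        values[name] = logged(name, fn, log)(**args)
    return values

# Hypothetical featurizing graph: times are minutes since midnight.
graph = {
    "delay_min": lambda scheduled, actual: actual - scheduled,
    "is_delayed": lambda delay_min: delay_min > 15,
}
log = []
values = run_graph(graph, {"scheduled": 600, "actual": 640}, log)
```

Because the same graph definition can be run in a single thread before it is run across threads, processes, or machines, the structured log makes it easier to see at which level a problem first appears.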

Production was OK not great

Better to put ML behind services

Full stack product team calls backend team APIs

A model is not a product

Humans don’t understand probability distributions

Even if discretized or turned into classification

Solve a human need directly -- turn into recommendations, etc

Prismatic

Personalized Ranking of People, Topics, and Content

The Personalized Ranking Problem

Given an index of content, display the content that maximizes the likelihood of user engagement

Intention: max LT engagement

Proxy: max session interactions

Content

Focused crawling of Twitter, FB and web

Maximum coverage algorithms

Spam content and de-duping

Featurizing

Content and interaction features

Feature crosses and hacks for dependencies

Bootstrapping weight hacks -- can’t train on overly sparse interactions

Scores for interests (topics, people, publishers, …)

Models: Personalized Ranking

Logistic -- newsfeed ranking has to be ultra fast in prod (100 ms)

Learning to rank -- inversions

Universal features, user specific weight-vectors

Snapshot every session
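"Universal features, user specific weight-vectors" can be pictured as one shared feature space with a separate weight vector per user, scored with a logistic function. The sketch below is a guess at the general shape only; the feature names and weights are hypothetical.

```python
import math

def score(user_weights, item_features):
    """Logistic score of one item for one user: a sparse dot product, cheap
    enough to run over many candidates within a tight latency budget."""
    z = sum(user_weights.get(f, 0.0) * v for f, v in item_features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank(user_weights, items):
    """Order candidate items by descending score for this user."""
    return sorted(items, key=lambda item: score(user_weights, item["features"]),
                  reverse=True)

# Hypothetical weight vector for one user and two candidate stories.
user_weights = {"topic:clojure": 2.0, "publisher:spam": -3.0}
items = [
    {"id": "b", "features": {"publisher:spam": 1.0}},
    {"id": "a", "features": {"topic:clojure": 1.0}},
]
ranked_ids = [item["id"] for item in rank(user_weights, items)]
```

Snapshotting `user_weights` at the end of each session would then give a per-session history of how a user's model evolved.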

Models: Classification

How do you train a large set of topic classifiers?

Latent topic models don’t work

But how would we get labeled data to train a classifier for each topic?

Enter distant supervision

Create mechanism to bootstrap training data with noisy labels

Requires lots of heuristics and clever hacks

Snarf docs with twitter queries, etc

Create pos and negs using filters and distance measures

Lots of techniques to featurize text for filters and training
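A stripped-down version of this bootstrapping loop: documents fetched by a topic query become candidate positives, documents fetched for other topics become candidate negatives, and a similarity filter discards the ambiguous ones. Jaccard overlap with a seed-term list is used here as a stand-in filter; the thresholds, field names, and seed terms are assumptions, not the heuristics actually used.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def noisy_labels(docs, topic, seed_terms, pos_min=0.2, neg_max=0.0):
    """Assign noisy 1/0 labels; documents that fit neither rule are dropped."""
    labeled = []
    for doc in docs:
        sim = jaccard(doc["text"].lower().split(), seed_terms)
        if doc["query"] == topic and sim >= pos_min:
            labeled.append((doc["text"], 1))   # fetched for topic and looks on-topic
        elif doc["query"] != topic and sim <= neg_max:
            labeled.append((doc["text"], 0))   # fetched elsewhere and shares no seed term
    return labeled

# Hypothetical crawl results keyed by the query that fetched them.
docs = [
    {"query": "clojure", "text": "clojure macros and lisp"},
    {"query": "clojure", "text": "win a free phone now"},
    {"query": "cooking", "text": "slow roasted tomato soup"},
]
training = noisy_labels(docs, "clojure", ["clojure", "lisp", "macros"])
```

The second document is fetched by the topic query but fails the similarity filter, so it is left out rather than labeled either way.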

Tools and Deployment

Clojure! Plumbing on github

Clj backend and cljs frontend

Graph, schema, ml libs

Pain Points

Presentation biases

People click what they're shown

Biases clicks on top stories

Self-reinforcing viral propagation engine

Data Issues

Dupes easy, but spambots and nets keep getting more sophisticated esp on twitter

Bootstrapping distant supervision is hard but OK

Bootstrapping ranking with sparse interactions is super hard

Social vs interest based personalization

What’s interesting vs what’s viral?

How do you define what’s interesting?

How much is a share worth compared with dwell time?

Researchers are biased toward their own prefs

Lessons learned

Overboard with clojure NIH

Environment changes fast -- missed spark etc

Automate classifier training data and retrain with zero intervention

Can optimize interactions a lot (> 50%)

When data is too sparse, optimize product before optimizing models

Heuristic IR may be good enough for a while

Investment in learning to rank is massive

Goal

10 Companies, 3 Years, $65M

So far

2 Companies, 6 Months, $1MM

Unsexy low beta

Prop models

Prop data

Cyber MGA

Indirect losses

● stock price

● credit rating

● sales

Market

➔ 2.5B today

➔ 35% growth

➔ 50B in 10 years
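The three market figures are consistent with each other if the 35% is read as an annual compound growth rate, which is an assumption since the slide does not state the period. A quick check:

```python
def compound(value, annual_rate, years):
    """Value after compounding annual_rate for the given number of years."""
    return value * (1.0 + annual_rate) ** years

# 2.5B growing at 35% per year (assumed annual) for 10 years.
projected = compound(2.5, 0.35, 10)
```

The result comes out just above 50, in line with the "50B in 10 years" figure.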

Catalysts

➔ SEC and EU regs

➔ High profile breaches

➔ Large indirect losses

First iteration - Supervised Learning

➔ Positives: recorded breaches

➔ Negatives: random sample of companies (not attacked)

➔ Features: security features

● DNS records, certs, service vulnerabilities, …

First iteration - Supervised Learning

#FAIL

➔ Incorrect assumptions: that breached companies have worse security, and that negative samples were not attacked

Likelihood of breach

Absence of historical data and nonstationarity create a challenging environment

➔ Rich current data isn’t available historically and decays in predictive power over time

➔ Could static data be a more robust and stable predictor of risk?

Relationship with catastrophes

Insurance models for earthquakes, floods, hurricanes

Sparse events (cannot estimate probs from freqs)

Events are correlated (how true is this for cyber?)

Can we draw from ideas in cat risk to model cyber risk?

Relationship with catastrophes

Cat:

➔ Stochastic simulation using physical models

➔ Impacts change in magnitude but not type

Cyber:

➔ Behavior of incentivised cyber-attackers hard to model

➔ Impacts change over time

Broader Approach

[Diagram labels: TS; Dynamic behavior; Industry baseline; Infrastructure security; Social Engineering; Freq. of breaches; Size of loss; Assets (+Lifecycle); Uncertainty Load]

Premium Decomposition

Premium Decomposition

Simplifying assumption: we can start incrementally and loss magnitude will always hit limit

Likelihood and uncertainty depend on breach sample

➔ Estimate uncertainty from confidence

➔ Estimate likelihood from risk features
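The slides do not show the decomposition formula itself. Under the stated simplifying assumption that every loss hits the policy limit, one plausible reading is an expected loss (breach likelihood times limit) plus an uncertainty load. The additive form and all numbers below are assumptions for illustration, not the pricing model from the talk.

```python
def premium(breach_likelihood, policy_limit, uncertainty_load):
    """Assumed decomposition: expected loss plus a load for uncertainty.
    Loss magnitude is taken to equal the policy limit, per the slide's
    simplifying assumption."""
    expected_loss = breach_likelihood * policy_limit
    return expected_loss + uncertainty_load

# Hypothetical policy: 2% annual breach likelihood, 1,000,000 limit,
# and a 5,000 load reflecting low confidence in the likelihood estimate.
quote = premium(0.02, 1_000_000, 5_000)
```

In this reading the likelihood would come from the risk features and the load would grow as confidence in that estimate shrinks.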

Indirect Losses

Quantifying indirect losses is complicated

➔ Normalizing market and industry effects

➔ Effect of news and corporate events?

➔ Over what time period?

➔ How do we define a statistically significant loss?

Investigation Tools

Roadmap

Version | Freq. estimation | Loss model | Pricing support
V1 | Industry based freq. | Stock loss | Uncertainty from variance
V2 | Net. security model | -- | --
V3 | Behavior of company | -- | Better uncertainty quant.
V4 | -- | Sales losses | --
V5 | -- | Credit Rating | --
V6 | Social engineering | -- | --
V7 | -- | -- | Pricing model

Future Challenges

Accumulation risk

➔ Correlated breaches

➔ Autonomous vehicles

➔ Supply chain

➔ Physical damage

Bloomberg for Back Office

The world’s first AI enabled compliance solution

Market

➔ Banks spend ~100B on compliance

➔ ~20B on analytics alone

➔ growing at 20% annually

Catalysts

➔ 9/11 and 2008 crisis

➔ 20X explosion in fines

➔ Exec departures

Computer Vision

Image due diligence

Image distance for ID check

➔ Detect faces in the image using pre-trained models

➔ Transform the image using real-time pose estimation

➔ Represent the face on a hypersphere using a neural network

➔ Apply any classification technique to the extracted features
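The last two steps can be sketched without any vision library: once a network has mapped each face to an embedding vector, normalizing to unit length places it on a hypersphere, and a distance threshold is the simplest possible "classification technique". The embeddings and the threshold below are made up for illustration; a real system would obtain embeddings from a trained face model.

```python
import math

def l2_normalize(vec):
    """Project an embedding onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def embedding_distance(a, b):
    """Euclidean distance between two embeddings after normalization.
    Ranges from 0 (same direction) to 2 (opposite directions)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(a, b, threshold=1.0):
    """Hypothetical decision rule: threshold chosen arbitrarily here."""
    return embedding_distance(a, b) < threshold

# Toy 2-D embeddings standing in for face-network outputs.
passport_photo = [1.0, 0.0]
licence_photo = [2.0, 0.1]
stranger_photo = [0.0, 1.0]
```

With these toy vectors, `same_person(passport_photo, licence_photo)` holds while `same_person(passport_photo, stranger_photo)` does not.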

Image due diligence

➔ Check whether the photos on several IDs belong to the same person

➔ Perform image due diligence against criminal databases and other databases

NLP: Detecting Adverse News

Raptor NLP: detecting adverse news

IR approach: name + keyword in same sentence

● Low false negs

● High false pos

John Smith

● Judge John Smith sentenced James Doe for money laundering.

● Amy Smith is accused of murder of her brother John Smith.
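The name-plus-keyword rule is easy to write down, which also makes its failure mode easy to see. In the sketch below the keyword list and the sentence splitter are simplifying assumptions; both example sentences from the slide get flagged for John Smith even though he is the judge in one and the victim in the other.

```python
import re

# Hypothetical adverse-keyword list, for illustration only.
ADVERSE_KEYWORDS = {"sentenced", "accused", "laundering", "murder"}

def naive_adverse_hits(name, text):
    """Return every sentence that mentions the name and any adverse keyword."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if name.lower() in sentence.lower() and words & ADVERSE_KEYWORDS:
            hits.append(sentence)
    return hits

text = (
    "Judge John Smith sentenced James Doe for money laundering. "
    "Amy Smith is accused of murder of her brother John Smith."
)
hits = naive_adverse_hits("John Smith", text)
```

Both sentences are returned, so recall is high, but neither is actually adverse news about John Smith.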

Problem formulation

Classification approaches:

● General entity centric “sentiment” classifier
○ High coverage
○ Not easy to interpret and understand what is going on

● Multiple specific relationship extractors (X sentenced for Y, X accused of Y, …)
○ Lower coverage
○ Easy to debug and understand

Distant Supervision: Training data

Training data: generate noisy training data using heuristics

Positive examples: Look for mentions of people with bad news

Negative examples: Tricky and hard part. Many heuristics:

● Use list of judges and attorneys, search for their mentions

● Simple syntactic rules: “X said”, ...
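Heuristics of this kind can be written as small labeling functions that return a noisy label or abstain. The sketch below encodes the two negative heuristics named on the slide plus one simple positive pattern; the official list, the verb list, and the exact patterns are assumptions made for illustration.

```python
import re

# Hypothetical list of known judges and attorneys (lowercased names).
KNOWN_OFFICIALS = {"john smith"}
ADVERSE_VERBS = "sentenced|accused|charged|convicted"

def heuristic_label(name, sentence):
    """Return 1 (adverse), 0 (not adverse), or None (no heuristic fired)."""
    s = sentence.lower()
    n = re.escape(name.lower())
    if name.lower() in KNOWN_OFFICIALS:
        return 0  # mentions of judges/attorneys are usually not adverse for them
    if re.search(rf"\b{n}\s+said\b", s):
        return 0  # "X said": X is being quoted, not reported on
    if re.search(rf"\b{n}\s+(?:was|is)\s+(?:{ADVERSE_VERBS})\b", s):
        return 1  # "X was sentenced/accused/...": bad news about X
    return None

examples = [
    ("John Smith", "Judge John Smith sentenced James Doe for money laundering."),
    ("James Doe", "James Doe was sentenced for fraud."),
    ("Amy Smith", "Amy Smith said the case was closed."),
    ("Amy Smith", "Amy Smith met a friend."),
]
labels = [heuristic_label(name, sentence) for name, sentence in examples]
```

Examples where every heuristic abstains are simply left out of the noisy training set.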

Distant Supervision: Training data

Heuristics fall into three categories:

a) Poor: doesn’t work

b) Low coverage: only catches a few samples

c) Good: big impact on performance

Distant Supervision: One nice heuristic

Different sources have different rates of true vs false positives (think bbc.com vs court proceedings reports).

Use this info with some other heuristic to gain lots of negative samples.

One heuristic might even be a previous version of the classifier.

Distant Supervision: Languages

We have to work in multiple languages, which limits use of features coming from tools like dependency parsers.

Currently exploring heuristics based on parse trees and machine translation.

Distant Supervision: Modeling Approach

Need a model that captures entity centric features and word order

Logistic with classic text features (raw bigrams, entity centric features, dependency parse features, …)
○ Lot of time spent in building features
○ Easier to understand, interpret and debug than neural nets

Deep learning: RNN/CNNs
○ Saves time on feature engineering
○ Hard to debug, understand and interpret
○ Currently, slightly better performance than features + logistic

Modeling Approach: Recurrent Networks

Distant Supervision: Modeling Approach

Pre-trained word embeddings

a) Help to achieve better performance
b) Can be easily obtained for any language
c) Can be shared across multiple tasks

Modeling Approach: Convolutional Networks

Distant Supervision: Modeling Approach

CNN vs RNN setups for NLP

a) CNNs are coming into NLP from CV
b) CNNs faster than RNNs and can have similar performance
c) In our case currently a tie

Key Takeaways and open problems

Open problems with false negatives:

● Information spanning multiple sentences:
○ Coreference resolution (John is mayor of Boston. He was sentenced for …)
○ Discourse analysis (relations between sentences)

● Analysis of formatted text (tables, bullet points, …)

Key Takeaways and open problems

Key Takeaways:

● Improving training data helps a lot more than tweaking the model
● Avoid the academic trap of testing many neural net architectures

Risk Ranking and Networks

Google for Risk

➔ Google wins in ranking because it has the most user click data

➔ We win in risk because we have analyst annotation data

Google for risk

Task: Search for CDD/EDD sources and rank the results based on the risk they represent

Goal: Do not miss anything important AND filter as many false positives as possible

[Diagram: Query (fraud, charged, forgery) → Raptor → Ranked results]

Google for risk - approaches

Problems:
● accurate identification of the person (name collisions)
● identify the right context the person is mentioned in

Additional requirements:
● interpretable results on all levels: rank, risk, NLP
● utilize user feedback: implicit vs explicit

Risk Model Validation and Interpretability

Prediction vs. Ranking:

Prediction
● Want scores and filtering
● Still interesting to order results
● Loss is error

Learning To Rank
● Want optimal ordering of results
● Scores not interesting
● Loss is number of inversions
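A minimal sketch of the two losses contrasted above. The labels, scores, and function names are illustrative assumptions, not taken from the talk; mean squared error stands in for "error", and a pairwise count stands in for "number of inversions".

```python
from itertools import combinations

def mean_squared_error(scores, labels):
    """Prediction view: penalize the distance between each score and its label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

def count_inversions(scores, labels):
    """Ranking view: count pairs whose order by score contradicts the labels."""
    inversions = 0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # equal labels carry no ordering information
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        # Inversion: the riskier item does not get the strictly higher score.
        if scores[hi] <= scores[lo]:
            inversions += 1
    return inversions

labels = [1, 1, 0, 0]                  # hypothetical annotations (1 = risky)
model_a = [0.9, 0.4, 0.6, 0.1]         # scores close to the labels, one pair swapped
model_b = [0.30, 0.29, 0.28, 0.27]     # poorly calibrated scores, perfect ordering

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(name, mean_squared_error(scores, labels), count_inversions(scores, labels))
```

In this toy data model_a has the lower error but one inversion, while model_b has the higher error and no inversions, which is why the choice of loss depends on whether the scores or the ordering matter.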

Google for risk - approaches


Risk Networks

Task: Identify risks from the person’s social network

Evaluate risk in network on different levels:
● node
● edge
● path
● subgraph

Facebook for risk


Facebook for risk: Streamline investigation with the risk network

➔ Links between all people and business entities

➔ Pagerank for risk

➔ See the riskiest paths through the network

➔ Drill down into high risk customer-customer and customer-entity relationships


Building the multigraph (1)

Nodes = entities (people and orgs)
Edges = relationships

1. Start with extracting relationships from structured databases:
   ○ Wikipedia
   ○ Company Registers
   ○ Panama Papers, etc.

2. De-duplicate nodes across different datasets
   ○ another ML problem
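The two steps above can be sketched as a small multigraph that merges records from several sources. Everything here is an assumption for illustration: the entity names, relation labels, and source names are hypothetical, and the `normalize` key is a naive stand-in for the de-duplication step, which the slide describes as an ML problem in its own right.

```python
from collections import defaultdict

def normalize(name):
    """Toy de-duplication key: lowercase, drop punctuation, collapse spaces."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())

class MultiGraph:
    """Directed multigraph: several edges may link the same pair of nodes."""

    def __init__(self):
        self.names = {}                 # node key -> first display name seen
        self.edges = defaultdict(list)  # (key_a, key_b) -> [(relation, source), ...]

    def add_edge(self, a, b, relation, source):
        key_a, key_b = normalize(a), normalize(b)
        self.names.setdefault(key_a, a)
        self.names.setdefault(key_b, b)
        self.edges[(key_a, key_b)].append((relation, source))

# Hypothetical records, one tuple per extracted relationship.
records = [
    ("Alice Example", "Example Holdings Inc.", "director_of", "company_register"),
    ("ALICE EXAMPLE", "Example Holdings Inc", "officer_of", "leaked_documents"),
    ("Alice Example", "Bob Example", "parent_of", "encyclopedia"),
]

graph = MultiGraph()
for a, b, relation, source in records:
    graph.add_edge(a, b, relation, source)

print(len(graph.names), "nodes")
print(graph.edges[("alice example", "example holdings inc")])
```

Keeping the source on every edge preserves provenance, so a result can later be explained by pointing at the dataset it came from.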

Facebook for risk


Facebook for risk

[Figure: risk network for the query "David Cameron", built from three sources: Wikipedia, Open Corporates, Panama Papers]

People: David Cameron, Ian Donald Cameron, Mary Fleur Mount, Nancy Gwen, Arthur Elwen

Companies: Blairmore Holdings Inc, Cameron Optometry Limited, Univel Limited, Accelerated Mailing & Marketing Limited, Mannings Heath Hotel Limited

Query: David Cameron
Risk: Blairmore Holdings Inc


Building the multigraph (2)

Extract additional relationships from text => more NLP
○ named entity extraction for nodes
○ relation extraction for edges

Facebook for risk

Ian Cameron was a director of Blairmore Holdings Inc, an investment fund run from the Bahamas but named after the family’s ancestral home in Aberdeenshire.

subject: Ian Cameron · relation: was a director of · object: Blairmore Holdings Inc
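To make the subject/relation/object idea concrete, here is a toy pattern-based extractor run on the sentence above. It is a deliberately naive stand-in, not the approach used in the talk: a single regular expression replaces both named entity extraction and relation extraction, and the pattern and function name are assumptions.

```python
import re

# One or more capitalized tokens, e.g. "Blairmore Holdings Inc".
ENTITY = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

# Matches "<Entity> was a/the <role> of <Entity>".
PATTERN = re.compile(
    rf"(?P<subject>{ENTITY}) was (?:a|the) (?P<role>\w+) of (?P<object>{ENTITY})"
)

def extract_relations(text):
    """Return (subject, relation, object) triples found by the toy pattern."""
    return [
        (m.group("subject"), m.group("role") + "_of", m.group("object"))
        for m in PATTERN.finditer(text)
    ]

sentence = (
    "Ian Cameron was a director of Blairmore Holdings Inc, an investment fund "
    "run from the Bahamas but named after the family's ancestral home in Aberdeenshire."
)

triples = extract_relations(sentence)
print(triples)
```

Each triple maps directly onto the multigraph: subject and object become nodes, and the relation becomes a labeled edge. A pattern like this breaks on almost any rephrasing, which is the reason the slide calls for more NLP.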


Roadmap

1. Start with CDD/EDD risk scores at node level
2. Propagate risk across edges to derive edge weights
3. Social Network Analysis:
   ○ random walk with restarts: PageRank, HITS, Personalized PageRank
4. Subgraph risk ranking: use SNA approaches to featurize graph for ranking
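As an illustration of step 3, the sketch below runs Personalized PageRank by power iteration: a random walk that restarts at the queried person. The graph, the edge weights, and the parameter values are hypothetical; how risk scores turn into edge weights (step 2) is not specified in the slides, so the weights are simply taken as given here.

```python
def personalized_pagerank(edges, seeds, restart=0.15, iterations=100):
    """Random walk with restarts, computed by power iteration.

    edges: dict mapping node -> {neighbor: weight}
    seeds: dict mapping node -> restart weight (normalized internally)
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v in nbrs} | set(seeds)
    total_seed = sum(seeds.values())
    restart_dist = {n: seeds.get(n, 0.0) / total_seed for n in nodes}
    rank = dict(restart_dist)
    for _ in range(iterations):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for u in nodes:
            nbrs = edges.get(u, {})
            out_weight = sum(nbrs.values())
            if out_weight == 0:
                # Dangling node: return its mass to the seed distribution.
                for n in nodes:
                    nxt[n] += (1 - restart) * rank[u] * restart_dist[n]
            else:
                for v, w in nbrs.items():
                    nxt[v] += (1 - restart) * rank[u] * w / out_weight
        rank = nxt
    return rank

# Hypothetical weighted network around one queried customer.
edges = {
    "customer": {"company_a": 0.9, "relative": 0.1},
    "relative": {"company_b": 1.0},
}
rank = personalized_pagerank(edges, seeds={"customer": 1.0})

for node, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

The resulting scores rank entities by how reachable they are from the query along heavily weighted edges, which is one way to produce the node-level features that step 4 feeds into a ranker.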

Facebook for risk


Link prediction

Task: Inferring new relationships from network and behavior

Approach
1. add behavior data to network
2. extract features from network
   ○ node based: Jaccard’s coef., Preferential attachment …
   ○ path based: Katz, PageRank, Personalized PageRank …
   ○ feature matrices over pair of nodes: Path Ranking Algorithm
3. combine with semantic features for each node
4. treat as a binary classification problem
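A minimal sketch of step 2 for a single candidate pair, covering two node-based features and one path-based feature. The toy graph, the `beta` and `max_len` values, and the truncation of the Katz index to short walks are assumptions made for illustration.

```python
def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of the pair."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def preferential_attachment(adj, u, v):
    """Product of the two node degrees."""
    return len(adj[u]) * len(adj[v])

def katz(adj, u, v, beta=0.1, max_len=4):
    """Truncated Katz index: sum of beta**l times the number of walks of length l."""
    walks = {u: 1}  # number of walks of the current length from u to each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, count in walks.items():
            for nbr in adj[node]:
                nxt[nbr] = nxt.get(nbr, 0) + count
        walks = nxt
        score += beta ** length * walks.get(v, 0)
    return score

# Hypothetical undirected network stored as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# Feature vector for the candidate link (a, d), which is not yet in the graph.
features = [jaccard(adj, "a", "d"), preferential_attachment(adj, "a", "d"), katz(adj, "a", "d")]
print(features)
```

In steps 3 and 4 a vector like this would be concatenated with semantic features of the two nodes and passed to any binary classifier, with existing edges as positive examples and sampled non-edges as negatives.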

Facebook for risk


CEAI Topics

Computational Finance

Computer Vision

Computational Bio & Medicine


Proactive Full-stack MGAs
