lessons learned

Post on 13-Aug-2015

Lessons Learned: Machine Learning and Technical Debt

Matthew Kirk

@mjkirk

Who uses data?

Responsive Enterprise

A Golden Opportunity

The Danger

The High Interest Debt of Machine Learning

What we’re covering

• Boundary Erosion

• Data Dependencies

• Spaghetti Code

• The Real World

`whoami`

• O’Reilly author: Thoughtful Machine Learning. Use AUTHD to get a discount on oreilly.com.

• Former Financial Quant

• Independent Consultant

• @mjkirk

Boundary Erosion

• Entanglement

• Visibility Debt

Entanglement

Entanglement: Solution

• Isolate models as much as possible

• Regularization
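As a rough illustration of the regularization bullet, here is a minimal sketch of single-feature ridge regression with no intercept. The function name `ridge_weight` and the toy data are assumptions made for this example, not something from the talk. The penalty term shrinks the learned weight, which limits how hard any one input can swing the output.

```ruby
# Single-feature ridge regression without an intercept (illustrative only):
#   w = sum(x * y) / (sum(x * x) + lambda)
def ridge_weight(xs, ys, lambda_)
  sxy = xs.zip(ys).sum { |x, y| x * y }
  sxx = xs.sum { |x| x * x }
  sxy / (sxx + lambda_).to_f
end

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_weight(xs, ys, 0.0)   # plain least squares
regularized   = ridge_weight(xs, ys, 14.0)  # the penalty shrinks the weight
```

With a penalty of zero this reduces to ordinary least squares; larger penalties pull the weight toward zero.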

Visibility Debt

Solutions

• Keeping an API Log

• Monitoring of tool use

• No sharing of usernames :)
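One way to keep an API log is to wrap the model so every prediction records who asked for it. The sketch below is hypothetical: `LoggedModel`, `caller_id`, and the service names are invented for illustration, and a real system would write to durable storage rather than an in-memory array.

```ruby
require 'time'

# Hypothetical wrapper that records every consumer of a model.
class LoggedModel
  attr_reader :access_log

  def initialize(model)
    @model = model       # anything that responds to #call
    @access_log = []
  end

  # Log the caller and timestamp, then delegate to the wrapped model.
  def predict(input, caller_id:)
    @access_log << { caller: caller_id, at: Time.now.utc.iso8601, input: input }
    @model.call(input)
  end

  # Number of calls per consumer, useful for spotting undeclared users.
  def consumers
    @access_log.group_by { |entry| entry[:caller] }.transform_values(&:size)
  end
end

model = LoggedModel.new(->(x) { x * 2 })
model.predict(3, caller_id: 'billing-service')
model.predict(5, caller_id: 'billing-service')
model.predict(1, caller_id: 'search-team')
```

Requiring a distinct `caller_id` per consumer is also why sharing usernames defeats the log.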

Data Dependencies

• Unstable

• Underutilized

Unstable Data

Solution

• Versioning

• Keep a specific version of a dataset, for instance a timestamped version of language data.
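A minimal sketch of dataset versioning, assuming a version label built from a snapshot timestamp plus a short content hash. The helper name `dataset_version` and the label format are assumptions for this example.

```ruby
require 'digest'

# Hypothetical helper: label a dataset snapshot by time and content.
def dataset_version(contents, taken_at)
  stamp  = taken_at.getutc.strftime('%Y%m%dT%H%M%SZ')
  digest = Digest::SHA256.hexdigest(contents)[0, 12] # short content fingerprint
  "#{stamp}-#{digest}"
end

snapshot = "word,count\nhello,3\nworld,5\n"
version  = dataset_version(snapshot, Time.utc(2015, 8, 13, 12, 0, 0))
```

A model can then record the version label it was trained on, so an upstream change to the data shows up as a new label instead of a silent shift.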

Underutilized

Solution

• Feature engineering: PCA, ICA, Random Feature Selection, VIMP, etc.
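Of the techniques listed, random feature selection is the simplest to sketch. The example below assumes feature names in an array and a fixed seed for reproducibility; `random_feature_subset` and the feature names are made up for illustration.

```ruby
# Pick k distinct feature names at random, reproducibly for a given seed.
def random_feature_subset(features, k, seed:)
  features.sample(k, random: Random.new(seed))
end

features = %w[age income clicks visits region]
subset   = random_feature_subset(features, 3, seed: 42)
```

Training on several such subsets and comparing results is one way to find features that contribute little and can be trimmed.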

Spaghetti Code

• Glue Code

• Pipeline Jungle

• Experimental Paths

• Configuration Debt

Glue Code

R, Matlab, Python, Java. All to use that one implementation.

Solution

• Write your own implementation of the algorithm….

Pipeline Jungle

Conway’s Law

The Clymb’s Database V1.0

PS: No Monitoring on any of this.

Clymb DB V2.0

Solution

• Map systems and reduce

• Reduce organizational disconnects by attending stand ups and being a part of the engineering team

Experimental Paths

Solution: Tombstones!

• def run_this_once_in_prod!; Tombstone.new('2014-01-02'); end

• When you think something is dead put a Tombstone on it

• https://www.youtube.com/watch?v=29UXzfQWOhQ
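The slide shows only the call site, so the `Tombstone` class below is a hypothetical stand-in, not the implementation used in the talk. It records each time supposedly dead code runs; a tombstone with no sightings after a reasonable period suggests the code is safe to delete.

```ruby
# Hypothetical Tombstone: records every execution of suspected dead code.
class Tombstone
  @sightings = []

  class << self
    attr_reader :sightings
  end

  # planted_on is the date the tombstone was added to the code.
  def initialize(planted_on)
    self.class.sightings << {
      planted_on: planted_on,
      seen_at: Time.now,
      where: caller.first # backtrace entry for whoever created the tombstone
    }
  end
end

def run_this_once_in_prod!
  Tombstone.new('2014-01-02')
  :done
end

run_this_once_in_prod!
```

In production the sightings would go to a log or metrics system rather than an in-memory array.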

Configuration Debt

Solution

• Find optimal configurations regularly

• Revisit initial configuration with new datapoints.
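Revisiting a configuration can be as simple as rerunning the search that chose it once new data arrives. The sketch below assumes the configuration is a single classification threshold scored by accuracy; the data points and candidate values are invented.

```ruby
# Fraction of points where (score >= threshold) matches the label.
def accuracy(points, threshold)
  hits = points.count { |score, label| (score >= threshold) == label }
  hits.to_f / points.size
end

# Candidate threshold with the highest accuracy on the given points.
def best_threshold(points, candidates)
  candidates.max_by { |t| accuracy(points, t) }
end

candidates = [0.3, 0.5, 0.7]
initial = [[0.2, false], [0.4, false], [0.6, true], [0.8, true]]
newer   = initial + [[0.6, false], [0.65, false], [0.75, true]]

first_choice  = best_threshold(initial, candidates) # chosen at launch
second_choice = best_threshold(newer, candidates)   # chosen after new data
```

On this toy data the best threshold moves from 0.5 to 0.7 once the newer points are included, which is the kind of drift a one-time configuration would miss.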

External World Changes

• Fixed Thresholds

• Correlation changes

Fixed Thresholds

• Laws change: the drinking age used to be 19 in many states.

Solution

• Rebuild the model, or include prediction error in the cost your model minimizes.

• Minimize: Cost = Actual - Predicted
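A minimal sketch of tracking the gap between actual and predicted values to decide when a rebuild is due. It assumes mean absolute error as the cost (the slide gives only the raw difference) and a hypothetical tolerance of 1.0.

```ruby
# Mean absolute difference between actual and predicted values.
def mean_absolute_error(actual, predicted)
  actual.zip(predicted).sum { |a, pr| (a - pr).abs } / actual.size.to_f
end

# Flag a rebuild when the error exceeds the chosen tolerance.
def needs_rebuild?(actual, predicted, tolerance)
  mean_absolute_error(actual, predicted) > tolerance
end

predicted     = [10.5, 11.5, 11.0]
actual_before = [10.0, 12.0, 11.0] # world still matches the model
actual_after  = [14.0, 16.0, 15.0] # world has shifted, model has not

stale_before = needs_rebuild?(actual_before, predicted, 1.0)
stale_after  = needs_rebuild?(actual_after, predicted, 1.0)
```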

Correlations Change

Solution

• Be careful when trying to find causal evidence. Ask what happens if the model doesn’t work.

• Iterate often
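One way to notice a changing correlation is to compute it over separate time windows and compare. The sketch below implements the Pearson correlation coefficient directly; the two windows of data are invented to show a relationship that flips sign.

```ruby
# Pearson correlation coefficient of two equal-length numeric arrays.
def pearson(xs, ys)
  n  = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

old_window = pearson([1, 2, 3, 4], [2, 4, 6, 8]) # strongly positive
new_window = pearson([1, 2, 3, 4], [8, 6, 4, 2]) # strongly negative
```

A feature whose correlation with the target swings like this between windows is a candidate for removal or closer investigation.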

Questions?

The Blissful Land of Opportunity

Lessons Learned in One Slide

Danger: Solution

• Entanglement: Regularize or isolate models

• Visibility Debt: Keep an access log of who uses what

• Unstable Data: Version datasets

• Underutilized Data: Trim by finding better features

• Glue Code: Write your own implementations

• Pipeline Jungle: Find minimum cut in systems

• Experimental Paths: Use Tombstones

• Configuration Debt: Reconfigure with new datasets

• Fixed Thresholds: Include accuracy as part of model

• Correlation Changes: Trim non-causal data from models

Links and Contact

• @mjkirk

• matt@matthewkirk.com

• Machine Learning: The High-Interest Credit Card of Technical Debt: https://bit.ly/1zs9TXi

• Is that code dead?: http://bit.ly/1sg0B1L

Photo Sources

• Cost of gigabyte: http://royal.pingdom.com/2011/12/19/would-you-pay-7260-for-a-3-tb-drive-charting-hdd-and-ssd-prices-over-time/

• Golden Opportunity: https://flic.kr/p/7xvfZr

• Problems are Opportunities: https://flic.kr/p/ifFos

• Master Charge: https://flic.kr/p/noQUh1

• Erosion: https://flic.kr/p/9agH2q

• Coupler: https://flic.kr/p/ppm9HG

• Fruit Loops: https://flic.kr/p/5rkLhP

• Somewhere in Quản Bạ, Hà Giang: https://flic.kr/p/q4K9Bo

• Data Dependencies: https://flic.kr/p/dVq7vg

• Unstable!: https://flic.kr/p/s7RLj

• Underutilized Piano: https://flic.kr/p/2sZVP

• Spaghetti: https://flic.kr/p/tuwkp

• Glue: https://flic.kr/p/6L13SK

• Pipelines at google: https://flic.kr/p/pvLQG2
