lessons learned
Post on 13-Aug-2015
Lessons Learned: Machine Learning and Technical Debt
Matthew Kirk
@mjkirk
Who uses data?
Responsive Enterprise
A Golden Opportunity
The Danger
The High Interest Debt of Machine Learning
What we’re covering
• Boundary Erosion
• Data Dependencies
• Spaghetti Code
• The Real World
`whoami`
• O’Reilly Author - Thoughtful Machine Learning. Use AUTHD to get a discount on OReilly.com.
• Former Financial Quant
• Independent Consultant
• @mjkirk
Boundary Erosion
• Entanglement
• Visibility Debt
Entanglement
Entanglement: Solution
• Isolate models as much as possible
• Regularization
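A minimal sketch of the regularization bullet, assuming a single feature, no intercept, and an L2 (ridge) penalty; the slide does not say which form of regularization is meant, and `ridge_weight` and `penalty` are hypothetical names. The idea shown: a larger penalty shrinks the weight, which limits how far one input can swing the model's output.

```ruby
# Closed-form ridge fit for one feature with no intercept:
#   w = sum(x * y) / (sum(x * x) + penalty)
# penalty = 0 is ordinary least squares; a larger penalty shrinks w toward 0.
def ridge_weight(xs, ys, penalty)
  sxy = xs.zip(ys).sum { |x, y| x * y }
  sxx = xs.sum { |x| x * x }
  sxy / (sxx + penalty).to_f
end

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0] # exactly y = 2x

unregularized = ridge_weight(xs, ys, 0.0)
regularized   = ridge_weight(xs, ys, 10.0)

puts "no penalty: #{unregularized}, penalty 10: #{regularized}"
```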
Visibility Debt
Solutions
• Keeping an API Log
• Monitoring of tool use
• No sharing of usernames :)
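One way the API log could look, as a sketch only: a wrapper that makes every caller identify itself before it gets a prediction. `LoggedModel`, `predict`, and the service names are hypothetical, and the lambda stands in for a real model.

```ruby
require 'time'

# Hypothetical wrapper: every prediction call names its caller,
# and the wrapper records who called and when.
class LoggedModel
  attr_reader :access_log

  def initialize(model)
    @model = model # any object that responds to #call
    @access_log = []
  end

  def predict(caller_id, input)
    @access_log << { caller: caller_id, at: Time.now.utc.iso8601 }
    @model.call(input)
  end

  # Calls per consumer, to see who silently depends on this model.
  def consumers
    @access_log.each_with_object(Hash.new(0)) do |entry, counts|
      counts[entry[:caller]] += 1
    end
  end
end

doubler = ->(x) { x * 2 } # stand-in for a real model
model = LoggedModel.new(doubler)
model.predict('billing-service', 21)
model.predict('email-service', 1)
model.predict('billing-service', 2)

puts model.consumers.inspect
```

Because the caller id is a required argument, shared or anonymous access shows up as a single suspiciously busy consumer in the log.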
Data Dependencies
• Unstable
• Underutilized
Unstable Data
Solution
• Versioning
• Keep a specific version of a dataset. For instance a timestamped version of language data.
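A sketch of timestamped dataset versions, assuming the dataset is a single file: copy it to a snapshot whose name carries a UTC timestamp, and keep a checksum to store next to the trained model. `snapshot_dataset`, the file names, and the directory layout are hypothetical.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: copy a dataset to a timestamped snapshot and return
# the snapshot path plus a checksum to record alongside the model.
def snapshot_dataset(source_path, snapshot_dir, now: Time.now.utc)
  FileUtils.mkdir_p(snapshot_dir)
  stamp = now.strftime('%Y%m%dT%H%M%SZ')
  target = File.join(snapshot_dir, "#{stamp}-#{File.basename(source_path)}")
  FileUtils.cp(source_path, target)
  { path: target, sha256: Digest::SHA256.file(target).hexdigest }
end

# Demo in a throwaway directory so nothing is left behind.
version = Dir.mktmpdir do |dir|
  source = File.join(dir, 'language_data.csv')
  File.write(source, "word,count\nhello,3\n")
  snapshot_dataset(source, File.join(dir, 'snapshots'),
                   now: Time.utc(2015, 8, 13, 12, 0, 0))
end

puts File.basename(version[:path])
```

Training against the snapshot path rather than the live file means the upstream data can change without changing the model's inputs.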
Underutilized
Solution
• Feature engineering: PCA, ICA, Random Feature Selection, VIMP, etc.
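Of the techniques listed, random feature selection is the simplest to sketch. The example below only shows the draw and the projection; in practice each subset would be scored by retraining and comparing accuracy. The feature names, rows, and both helper methods are hypothetical.

```ruby
# Pick k feature names at random; the seed makes the draw repeatable.
def random_feature_subset(feature_names, k, seed: 42)
  feature_names.sample(k, random: Random.new(seed))
end

# Keep only the chosen columns of each row (rows are hashes).
def project(rows, kept)
  rows.map { |row| row.select { |name, _value| kept.include?(name) } }
end

features = %i[age income clicks visits region_code]
rows = [
  { age: 31, income: 52_000, clicks: 4, visits: 9, region_code: 7 },
  { age: 45, income: 61_000, clicks: 1, visits: 2, region_code: 3 }
]

kept = random_feature_subset(features, 2)
trimmed = project(rows, kept)

puts "kept features: #{kept.inspect}"
```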
Spaghetti Code
• Glue Code
• Pipeline Jungle
• Experimental Paths
• Configuration Debt
Glue Code
R, Matlab, Python, Java. All to use that one implementation.
Solution
• Write your own implementation of the algorithm….
Pipeline Jungle
Conway’s Law
The Clymb’s Database V1.0
PS: No Monitoring on any of this.
Clymb DB V2.0
Solution
• Map the systems, then reduce the connections between them
• Reduce organizational disconnects by attending stand-ups and being part of the engineering team
Experimental Paths
Solution: Tombstones
• def run_this_once_in_prod!; Tombstone.new('2014-01-02'); end
• When you think something is dead put a Tombstone on it
• https://www.youtube.com/watch?v=29UXzfQWOhQ
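The slide only shows the call site, so the `Tombstone` class below is a guess at what sits behind it, not the original implementation. In this sketch, constructing a tombstone records the date it was planted and where it was hit; the `SIGHTINGS` store is hypothetical and would be a log or metrics call in production.

```ruby
require 'date'

# Hypothetical Tombstone: constructing one records that supposedly dead
# code was reached, along with the date the tombstone was planted.
class Tombstone
  SIGHTINGS = []

  def initialize(planted_on)
    site = caller_locations(1, 1).first # where Tombstone.new was called
    SIGHTINGS << {
      planted_on: Date.parse(planted_on),
      seen_at: Time.now,
      where: "#{site.path}:#{site.lineno}"
    }
  end
end

def run_this_once_in_prod!
  Tombstone.new('2014-01-02')
end

run_this_once_in_prod!

Tombstone::SIGHTINGS.each do |sighting|
  puts "tombstone from #{sighting[:planted_on]} hit at #{sighting[:where]}"
end
```

If a tombstone has no sightings long after its planting date, the code path is probably dead and can be deleted.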
Configuration Debt
Solution
• Find optimal configurations regularly
• Revisit initial configuration with new datapoints.
External World Changes
• Fixed Thresholds
• Correlation changes
Fixed Thresholds
• Laws change: the drinking age used to be 19 in many states.
Solution
• Rebuild, or include accuracy as part of the cost your model minimizes.
• Min Cost = Actual - Predicted
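A sketch of learning the threshold from data instead of hard-coding it. The cost here is an assumption: it counts the examples where the prediction disagrees with the actual outcome, which is one simple reading of "Actual - Predicted". The scores, labels, and candidate thresholds are made up for illustration.

```ruby
# Cost of a threshold: number of examples where the prediction
# (score >= threshold) disagrees with the actual label.
def cost(threshold, examples)
  examples.count { |score, actual| (score >= threshold) != actual }
end

# Re-learn the threshold from recent data rather than fixing it forever.
def best_threshold(candidates, examples)
  candidates.min_by { |threshold| cost(threshold, examples) }
end

# [score, actual] pairs; hypothetical recent observations.
recent = [[0.2, false], [0.4, false], [0.55, true], [0.7, true], [0.9, true]]
candidates = [0.3, 0.5, 0.8]

threshold = best_threshold(candidates, recent)
puts "chosen threshold: #{threshold}"
```

Rerunning `best_threshold` on fresh data is what lets the cutoff move when the outside world does.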
Correlations Change
Solution
• Be careful when looking for causal evidence. Ask: what if the model doesn't work?
• Iterate often
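Iterating often is easier with a check that says when a correlation has moved. A sketch, assuming Pearson correlation compared across two time windows; the 0.5 tolerance and the window data are arbitrary illustrations.

```ruby
# Pearson correlation of two equal-length numeric arrays.
def pearson(xs, ys)
  n = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# True when the correlation moved by more than the tolerance between windows.
def correlation_drifted?(old_pair, new_pair, tolerance: 0.5)
  (pearson(*old_pair) - pearson(*new_pair)).abs > tolerance
end

old_window = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]] # move together
new_window = [[1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]] # now move opposite

drifted = correlation_drifted?(old_window, new_window)
puts "correlation drifted: #{drifted}"
```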
Questions?
The Blissful Land of Opportunity
Lessons Learned in One Slide
Danger                 Solutions
Entanglement           Regularize or isolate models
Visibility Debt        Keep an access log of who uses what
Unstable Data          Version datasets
Underutilized Data     Trim by finding better features
Glue Code              Write your own implementations
Pipeline Jungle        Find minimum cut in systems
Experimental Paths     Use Tombstones
Configuration Debt     Reconfigure with new datasets
Fixed Thresholds       Include accuracy as part of model
Correlation Changes    Trim non-causal data from models
Links and Contact
• @mjkirk
• matt@matthewkirk.com
• Machine Learning: The High-Interest Credit Card of Technical Debt: https://bit.ly/1zs9TXi
• Is that code dead?: http://bit.ly/1sg0B1L
Photo Sources
• Cost of gigabyte: http://royal.pingdom.com/2011/12/19/would-you-pay-7260-for-a-3-tb-drive-charting-hdd-and-ssd-prices-over-time/
• Golden Opportunity: https://flic.kr/p/7xvfZr
• Problems are Opportunities: https://flic.kr/p/ifFos
• Master Charge: https://flic.kr/p/noQUh1
• Erosion: https://flic.kr/p/9agH2q
• Coupler: https://flic.kr/p/ppm9HG
• Fruit Loops: https://flic.kr/p/5rkLhP
• Somewhere in Quản Bạ, Hà Giang: https://flic.kr/p/q4K9Bo
• Data Dependencies: https://flic.kr/p/dVq7vg
• Unstable!: https://flic.kr/p/s7RLj
• Underutilized Piano: https://flic.kr/p/2sZVP
• Spaghetti: https://flic.kr/p/tuwkp
• Glue: https://flic.kr/p/6L13SK
• Pipelines at google: https://flic.kr/p/pvLQG2