No-Bullshit Data Science

Szilárd Pafka, PhD, Chief Scientist, Epoch

Domino Data Science Popup, San Francisco, Feb 2017

Upload: domino-data-lab

Post on 20-Mar-2017


Page 3: No-Bullshit Data Science

Disclaimer:

I am not representing my employer (Epoch) in this talk

I can neither confirm nor deny whether Epoch uses any of the methods, tools, results, etc. mentioned in this talk

Page 8: No-Bullshit Data Science

Example #1

Page 25: No-Bullshit Data Science

Aggregation: 100M rows, 1M groups (time [s])

Join: 100M rows x 1M rows (time [s])
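The two operations timed on this slide can be sketched at a much smaller scale in plain Python. This is only an illustration of what "aggregation" and "join" mean here, not the benchmark code behind the slide; the function names, the synthetic data, and the scaled-down sizes (`n_rows`, `n_groups`) are all assumptions.

```python
import random
import time
from collections import defaultdict


def make_rows(n_rows, n_groups, seed=0):
    """Synthetic (group_id, value) rows; sizes are illustrative assumptions."""
    rng = random.Random(seed)
    return [(rng.randrange(n_groups), rng.random()) for _ in range(n_rows)]


def aggregate_sum(rows):
    """Group-by aggregation: sum of values per group id."""
    totals = defaultdict(float)
    for group_id, value in rows:
        totals[group_id] += value
    return dict(totals)


def hash_join(rows, lookup):
    """Inner join of rows with a {group_id: attribute} table via hash lookup."""
    return [(g, v, lookup[g]) for g, v in rows if g in lookup]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Scaled down from the slide's 100M rows / 1M groups (assumption).
    n_rows, n_groups = 200_000, 2_000
    rows = make_rows(n_rows, n_groups)
    lookup = {g: g * 2 for g in range(n_groups)}
    _, t_agg = timed(aggregate_sum, rows)
    _, t_join = timed(hash_join, rows, lookup)
    print(f"aggregation: {t_agg:.3f} s, join: {t_join:.3f} s")
```

A single timed call like this is noisy; repeating each measurement and reporting the minimum or median would be a reasonable refinement.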

Page 26: No-Bullshit Data Science

(largest data analyzed)

Page 30: No-Bullshit Data Science

Gradient Boosting Machines

Plot axes: training time [s] vs. data size [M]; annotation: 10x

Page 31: No-Bullshit Data Science
Page 32: No-Bullshit Data Science

linear tops off

(data size)

(accuracy)

Page 33: No-Bullshit Data Science

linear tops off

more data & better algo

(data size)

(accuracy)

Page 34: No-Bullshit Data Science

linear tops off

more data & better algo

random forest on 1% of data beats linear on all data

(data size)

(accuracy)

Page 41: No-Bullshit Data Science

Example #2

Page 43: No-Bullshit Data Science

http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf

http://lowrank.net/nikos/pubs/empirical.pdf

Page 47: No-Bullshit Data Science

- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib
- a few others

Page 50: No-Bullshit Data Science

EC2

Page 51: No-Bullshit Data Science

n = 10K, 100K, 1M, 10M, 100M

Training time
RAM usage
AUC
CPU % by core
read data, pre-process, score test data
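Two of the quantities listed here, training time and AUC, can be measured with a small harness like the one below. It is a minimal sketch under stated assumptions, not the benchmark code used for the talk: `train_fn`, `predict_fn`, and `make_data` are hypothetical callables standing in for whichever tool is being measured, and RAM usage and per-core CPU % are not covered.

```python
import time


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Rank-based (Mann-Whitney) formulation; tied scores get average ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def benchmark(train_fn, predict_fn, make_data, sizes):
    """For each training size n, record training time [s] and test AUC.

    train_fn, predict_fn and make_data are hypothetical callables supplied
    by the caller; make_data(n) must return (train, test_x, test_y).
    """
    results = []
    for n in sizes:
        train, test_x, test_y = make_data(n)
        start = time.perf_counter()
        model = train_fn(train)
        elapsed = time.perf_counter() - start
        results.append({
            "n": n,
            "train_time_s": elapsed,
            "auc": auc(test_y, predict_fn(model, test_x)),
        })
    return results
```

Only the call to `train_fn` is inside the timed region, so reading data, pre-processing, and scoring the test set are kept out of the training-time figure.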

Page 56: No-Bullshit Data Science

10x

Page 65: No-Bullshit Data Science

Best linear: 71.1

Page 68: No-Bullshit Data Science

learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
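To make the roles of `learn_rate` and `n_trees` concrete, here is a toy gradient boosting sketch for squared error on 1-D inputs. It is not the implementation of any library listed in the talk, and it is a simplification: every tree is a single-split stump, so `max_depth` is not modeled, and the function names (`fit_stump`, `gbm_fit`, `gbm_predict`) are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (squared error).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def gbm_fit(xs, ys, learn_rate, n_trees):
    """Fit n_trees stumps, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Each tree's contribution is shrunk by learn_rate.
        preds = [p + learn_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return {"base": base, "learn_rate": learn_rate, "stumps": stumps}


def gbm_predict(model, x):
    """Base value plus the shrunken sum of all stump outputs."""
    total = sum(left if x <= t else right for t, left, right in model["stumps"])
    return model["base"] + model["learn_rate"] * total
```

In this sketch a smaller `learn_rate` means each tree corrects only a fraction of the remaining error, so more trees are needed to reach the same training fit. That is consistent with the second setting on the slide pairing a lower `learn_rate` with a larger `n_trees`.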

Page 71: No-Bullshit Data Science

...

Page 78: No-Bullshit Data Science

Summary
