
No-Bullshit Data Science

Szilárd Pafka, PhD
Chief Scientist, Epoch

Domino Data Science Popup
San Francisco, Feb 2017

Disclaimer:

I am not representing my employer (Epoch) in this talk

I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk

Example #1

[Benchmark charts: aggregation of 100M rows into 1M groups, and join of 100M rows x 1M rows; y-axis: time [s]; markers show the largest data size analyzed by each tool]
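The shape of that benchmark (group-by aggregation and join timings) can be sketched in Python with pandas. This is a hypothetical, scaled-down re-creation (1M rows, 100K groups instead of 100M rows, 1M groups), not the original benchmark code:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, n_groups = 1_000_000, 100_000  # scaled down from 100M rows / 1M groups

df = pd.DataFrame({
    "key": rng.integers(0, n_groups, n_rows),
    "value": rng.random(n_rows),
})
lookup = pd.DataFrame({"key": np.arange(n_groups), "weight": rng.random(n_groups)})

t0 = time.perf_counter()
agg = df.groupby("key", sort=False)["value"].mean()  # aggregation benchmark
t_agg = time.perf_counter() - t0

t0 = time.perf_counter()
joined = df.merge(lookup, on="key")                  # join benchmark
t_join = time.perf_counter() - t0

print(f"aggregate: {t_agg:.2f}s  join: {t_join:.2f}s")
```

Timing the same two operations across tools (data.table, dplyr, pandas, Spark, databases, etc.) at increasing data sizes produces the kind of comparison charted above.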

[Chart: training time [s] vs data size [M]; "10x" annotation]

Gradient Boosting Machines

[Chart: accuracy vs data size]

- linear tops off
- more data & better algo
- random forest on 1% of data beats linear on all data
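The comparison behind that slide sequence can be sketched with scikit-learn: fit a linear model on all the training data and a random forest on 1% of it, then compare test AUC. This is a hypothetical illustration on synthetic data, not the talk's actual experiment, and which model wins depends on how nonlinear the data is; the point here is the comparison harness, not the outcome:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic binary classification data with some label noise
X, y = make_classification(n_samples=100_000, n_features=20, n_informative=10,
                           flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# linear model trained on ALL of the training data
lin = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_lin = roc_auc_score(y_te, lin.predict_proba(X_te)[:, 1])

# random forest trained on 1% of the training data
n_small = len(X_tr) // 100
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr[:n_small], y_tr[:n_small])
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print(f"linear (100% of data) AUC: {auc_lin:.3f}")
print(f"random forest (1% of data) AUC: {auc_rf:.3f}")
```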

Example #2

http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf

http://lowrank.net/nikos/pubs/empirical.pdf


- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib
- a few others

EC2

n = 10K, 100K, 1M, 10M, 100M

Measured: training time, RAM usage, AUC, CPU % by core
(plus: read data, pre-process, score test data)
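A single cell of such a benchmark grid (one tool, one data size) boils down to timing the fit and score steps and computing AUC on held-out data. A minimal sketch, assuming a scikit-learn model as a stand-in for whichever tool is being measured (RAM and per-core CPU would need OS-level tooling and are omitted here):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# toy stand-in for one (tool, data size) cell of the benchmark grid
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_tr, y_tr, X_te, y_te = X[:8000], y[:8000], X[8000:], y[8000:]

t0 = time.perf_counter()
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # training time
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
scores = model.predict_proba(X_te)[:, 1]                   # scoring time
score_time = time.perf_counter() - t0

auc = roc_auc_score(y_te, scores)                          # accuracy metric
print(f"train: {train_time:.3f}s  score: {score_time:.3f}s  AUC: {auc:.3f}")
```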


Best linear AUC: 71.1

learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
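Those two hyperparameter settings can be expressed in scikit-learn's GBM naming as a sketch; this is a hypothetical mapping on tiny synthetic data (fitting only the cheaper setting to keep it fast), not the benchmark's actual runs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# the two GBM settings from the slide, in scikit-learn parameter names
configs = [
    dict(learning_rate=0.1, max_depth=6, n_estimators=300),
    dict(learning_rate=0.01, max_depth=16, n_estimators=1000),
]

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

# fit only the cheaper setting here; the deeper/slower one works the same way
gbm = GradientBoostingClassifier(**configs[0], random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
print(f"GBM {configs[0]} AUC: {auc:.3f}")
```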

...

Summary
