Bulldozers II · kti.mff.cuni.cz/~bartak/ui_seminar/talks/2013LS... · 2013. 4. 4.
Bulldozers II
Tomasek, Hajic, Havranek, Taufer
Building database
● CSV -> SQL
  ○ Table trainRaw
    ■ 1 : 1 parsed csv data input
Building database
● table trainRaw
  ○ Columns need to be transformed into a form suited to efficient use
  ○ For each column:
    ■ Distinct analysis
    ■ String data into separate table
    ■ Replace with int indexed value
    ■ Create relation constraint
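The per-column steps above amount to dictionary-encoding each string column. A minimal sketch in Python, assuming in-memory values instead of the SQL tables the slides describe; `encode_column` and the sample values are hypothetical, not taken from the project.

```python
def encode_column(values):
    """Dictionary-encode a column of strings.

    Returns (lookup, codes): lookup maps each distinct string to an int
    index, codes is the column with strings replaced by those indices.
    """
    lookup = {}
    codes = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(lookup)  # next free index, in order of first appearance
        codes.append(lookup[v])
    return lookup, codes


# Hypothetical sample column, not taken from the challenge data.
lookup, codes = encode_column(["Small", "Large", "Small", "Medium"])
print(lookup)  # {'Small': 0, 'Large': 1, 'Medium': 2}
print(codes)   # [0, 1, 0, 2]
```

In a database, `lookup` would live in its own table and the encoded column would reference it through a foreign-key constraint.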
Structuring database
● trainRaw -> train table
  ○ fast access to data
  ○ useless data in detached tables
Using database
● SQL stored & compiled procedures and functions for data analysis
  ○ getBestEnum
  ○ GetMedian
  ○ GetAvgVar
First solution - Decision tree
● solution for generic categorization problem
● category
  ○ = price interval
● tree nodes
  ○ switches for Enums
Decision tree - example
Our data
● 39 different enums
● avg value range = 10
● choosing best enum for node
  ○ Variance vs. Count
  ○ counted in sql
    ■ first iteration takes ~10 min
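The slides do not give the exact "Variance vs. Count" criterion. One plausible reading, sketched below in Python as an assumption, is to pick the enum column whose split leaves the smallest count-weighted price variance. The function names and sample rows are hypothetical, and the original computation ran in SQL.

```python
from collections import defaultdict
from statistics import pvariance


def weighted_variance(rows, column, target="price"):
    """Count-weighted average of the target's variance within each enum value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[target])
    total = sum(len(g) for g in groups.values())
    return sum(len(g) * pvariance(g) for g in groups.values()) / total


def best_enum(rows, columns, target="price"):
    """Pick the column whose split leaves the least price variance."""
    return min(columns, key=lambda c: weighted_variance(rows, c, target))


# Hypothetical rows: "size" separates prices cleanly, "color" does not.
rows = [
    {"size": 0, "color": 0, "price": 10.0},
    {"size": 0, "color": 1, "price": 12.0},
    {"size": 1, "color": 0, "price": 50.0},
    {"size": 1, "color": 1, "price": 52.0},
]
print(best_enum(rows, ["size", "color"]))  # size
```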
Categories
● how big?
  ○ no categories
  ○ fixed size
    ■ according to the fit function, $100 is OK
  ○ variable size
    ■ 100 bulldozers per category
  ○ use a genetic algorithm to find the best size :)
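The two sizing options can be contrasted with a short Python sketch. It assumes that "fixed size" means equal-width price intervals and "variable size" means intervals holding an equal number of machines; both helper names are hypothetical.

```python
def fixed_width_category(price, width=100):
    """Index of the fixed-width price interval containing `price`."""
    return int(price // width)


def equal_count_boundaries(prices, per_category=100):
    """Upper price boundary of each category holding `per_category` items.

    The last category may hold fewer items when the data does not divide evenly.
    """
    ordered = sorted(prices)
    return [
        ordered[min(i + per_category, len(ordered)) - 1]
        for i in range(0, len(ordered), per_category)
    ]


print(fixed_width_category(9550))                              # 95
print(equal_count_boundaries([5, 1, 4, 2, 3], per_category=2))  # [2, 4, 5]
```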
Decision tree results
● depth 4
● runtime 2:34:02
● result 0.5431
To do
● don't use irrelevant enums
● sql optimizations
● multi-core processing
● find a very powerful machine
Statistics
● Some columns give no usable information
● About half of the columns are machine-type specific
Second solution - genetic
● Population member
  ○ Expression tree
    ■ Evaluates price
  ○ Nodes
    ■ [Price] -> Price
    ■ Constant, Arithmetic, Sql Aggregation, Switch
● Fit function
  ○ Challenge official: RMSLE
● Reproduction
  ○ switching subtrees between father and mother
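Reproduction by switching subtrees could be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the `Node` layout and `crossover` helper are hypothetical, and the sketch does not enforce the depth limits discussed later.

```python
import copy
import random


class Node:
    """Expression-tree node: a label plus child nodes (hypothetical layout)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def nodes(self):
        """Return this node and all descendants in pre-order."""
        found = [self]
        for child in self.children:
            found.extend(child.nodes())
        return found

    def __repr__(self):
        if not self.children:
            return str(self.label)
        return f"{self.label}({', '.join(map(repr, self.children))})"


def crossover(father, mother, rng=random):
    """Return two children made by swapping one random subtree between copies."""
    a, b = copy.deepcopy(father), copy.deepcopy(mother)
    x, y = rng.choice(a.nodes()), rng.choice(b.nodes())
    # Swapping the contents of the two chosen nodes swaps the subtrees
    # rooted there without needing parent pointers.
    x.label, y.label = y.label, x.label
    x.children, y.children = y.children, x.children
    return a, b


father = Node("+", [Node(1), Node(2)])
mother = Node("*", [Node(3), Node("-", [Node(4), Node(5)])])
child_a, child_b = crossover(father, mother, random.Random(0))
print(child_a, child_b)
```

The parents are deep-copied first, so they stay in the population unchanged.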
Second solution - genetic
● Mutation
  ○ Specific per node type
    ■ Only a few types can mutate
  ○ Randomly added members
    ■ Avoids local extremes
Genetics - use & experience
● Original input parameters
  ○ Population size
    ■ Very large is not needed
    ■ For performance
    ■ Actual value 50 members
  ○ Max depth
    ■ For performance and convergence
    ■ 10 seems to be enough, actual value 12
  ○ Train data sample
    ■ Size
      ● Performance & miscellany, actual 25%
    ■ Select every generation / Keep same
Genetics - use & experience
● Added parameters
  ○ Min depth
    ■ Avoidance of
      ● One-node trees
      ● Train data specific expressions
  ○ Action probabilities
    ■ Reproduction
      ● Creates variety, actual value 0.6
    ■ Clone
      ● Not important, actual value 0.3
    ■ Mutation
      ● A high value is important, actual value 0.7
      ● Helps with convergence
      ● Needs improving in several node types
Genetics - use & experience
● Node type implementation & specific tuning
  ○ Abstract node
    ■ Mutation is called recursively on children
  ○ Constant
    ■ Finite universe
      ● { k / 100 | k in N U {0} & k < 101 } U { pi, e }
    ■ Mutation
      ● + d where d is from {-0.01, 0, 0.01}
  ○ Arithmetic
    ■ +-*/ only binary
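The Constant node's universe and mutation step can be written out directly. The Python sketch below makes two assumptions the slides do not state: mutated values are clamped to [0, 1], and the special constants pi and e are left unchanged. `mutate_constant` is a hypothetical name.

```python
import math
import random

# The finite universe from the slide: k/100 for k = 0..100, plus pi and e.
GRID = [k / 100 for k in range(101)]
UNIVERSE = GRID + [math.pi, math.e]


def mutate_constant(value, rng=random):
    """Shift a grid constant by -0.01, 0 or +0.01.

    Assumptions (not stated in the slides): results are clamped to [0, 1],
    and the special constants pi and e are left unchanged.
    """
    if value in (math.pi, math.e):
        return value
    # Work on the integer index k to avoid accumulating float error.
    k = round(value * 100) + rng.choice([-1, 0, 1])
    return min(max(k, 0), 100) / 100


print(len(UNIVERSE))  # 103
print(mutate_constant(0.5, random.Random(1)) in GRID)  # True
```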
Genetics - use & experience
● Node type implementation & specific tuning
  ○ SqlAgg
    ■ Defined by agg. function and selected data columns
    ■ Returns the aggregated price of the database table rows that have the same values in the selected columns
    ■ Mutation changes agg. function
      ● Changing the selected columns may be needed as well
  ○ Switch
    ■ Defined by one data column
    ■ k children
      ● k is loaded only once, from the column's distinct values
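A SqlAgg node as described above might evaluate roughly like this. The sketch uses Python's built-in `sqlite3` as a stand-in for whatever database engine the project used; the `sql_agg` helper, the `model` and `state` columns, and the sample rows are hypothetical.

```python
import sqlite3

ALLOWED_AGGS = {"AVG", "MIN", "MAX"}


def sql_agg(conn, agg, columns, row):
    """Aggregate price over train rows matching `row` on the given columns."""
    if agg not in ALLOWED_AGGS:
        raise ValueError(f"unsupported aggregate: {agg}")
    # Column names are interpolated into the SQL text, so they must come from
    # a trusted list; the values are passed as bound parameters.
    where = " AND ".join(f"{c} = ?" for c in columns)
    query = f"SELECT {agg}(price) FROM train WHERE {where}"
    return conn.execute(query, [row[c] for c in columns]).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (model INTEGER, state INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 2, 300.0), (2, 1, 50.0)],
)
print(sql_agg(conn, "AVG", ["model"], {"model": 1, "state": 2}))  # 200.0
```

Mutating such a node then means swapping `agg` for another allowed function, or, as the slide suggests, also changing `columns`.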
Genetics - use & experience
● Evolution process implementation
  ○ For every generation
    ■ Selection
    ■ Reproduction & cloning
    ■ Mutation
  ○ Evaluating train data sample by each member
  ○ Fit calculation
  ○ Best one serialisation
  ○ GC.Collect()
● Genetic process is very slow
  ○ Threadpool implemented
    ■ by member
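Evaluating the population with one thread-pool task per member could look like the sketch below. It is written in Python with `concurrent.futures` rather than the authors' own runtime, and the toy fitness function is a stand-in for evaluating a member on the train data sample.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, max_workers=4):
    """Score every member in parallel, one task per member.

    Returns (score, member) pairs sorted best-first (lower score is better).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(fitness, population))
    return sorted(zip(scores, population), key=lambda pair: pair[0])


# Toy stand-ins: members are numbers, fitness is distance from 3.
ranked = evaluate_population([0, 5, 2, 9], lambda m: abs(m - 3))
print(ranked[0])  # (1, 2)
```

In Python, threads mainly help when each evaluation waits on the database; CPU-bound evaluation would need processes instead.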
Genetics - results
● First whole-night run on full train data
  ○ 294 generations
  ○ Best result fit 0.49 (381 / 454)
    ■ Challenge leader has 0.22
    ■ Median benchmark has 0.74
  ○ Lesson
    ■ Min depth constraint
      ● Very simple data-specific nodes broke population development
    ■ Smaller train data sample
      ● More data-specific results and better performance
    ■ More mutation
    ■ More sql query parallelism
    ■ Sql results caching
Genetics - example
Neural networks
What have we tried?
- single MLP
- 10 classes of equal magnitude
- 18 / 51 features
- network structure 18 - 10 - 10
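Reading "18 - 10 - 10" as 18 input features, 10 hidden units, and 10 output classes, a forward pass through such a network can be sketched in plain Python. The sigmoid activations, the weight initialisation, and all helper names are assumptions; the slides do not specify them.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights (one row per output unit) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Propagate one feature vector through every layer in turn."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


rng = random.Random(0)
layers = [init_layer(18, 10, rng), init_layer(10, 10, rng)]  # 18 - 10 - 10
scores = forward([0.5] * 18, layers)
# The predicted price class is the output unit with the highest score.
predicted_class = max(range(10), key=lambda i: scores[i])
```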
Neural networks
What have we tried? (cont.)
- backpropagation learning algorithm
- different minimization techniques
  - gradient descent
  - conjugate gradient ((C) Andrew Ng)
- different values of regularization
- different training set sizes
How did it go?
Actual results
Best experiment:
- trained on 10000 samples
- 50 iterations of conjugate gradient
- classification accuracy on all training data: 0.224
RMSLE on validation data: 0.773 (mean benchmark: 0.74745)
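For reference, RMSLE (root mean squared logarithmic error) is the square root of the mean squared difference between log(prediction + 1) and log(actual + 1). A small Python helper, with a hypothetical name, makes the definition concrete:

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error of two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


print(rmsle([10000, 20000], [10000, 20000]))  # 0.0
```

Because the error is taken on logarithms, the metric penalises relative rather than absolute price differences.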
What went wrong?
Non-numerical features
- overall: 8 comparable features
- 43 non-numerical features
Missing features
- not all features are available for all samples
- sometimes less than half
- => inaccurate guesswork
Time constraints
- not trained on all data (100k samples ~ 1 night)
What do we do about it?
Non-comparable features
- set up indicator variables
Missing features
- better guesswork (mean of class)
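Both remedies are simple to sketch. The Python below assumes missing values are represented as `None` and that "mean of class" means the mean of a feature over training samples with the same class label; the helper names and sample values are hypothetical.

```python
def one_hot(value, categories):
    """Indicator variables: one 0/1 entry per known category."""
    return [1.0 if value == c else 0.0 for c in categories]


def class_means(samples, labels):
    """Mean of the non-missing values of one feature, per class label."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if x is not None:
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def impute(samples, labels, means):
    """Replace missing values (None) with the mean of the sample's class."""
    return [means[y] if x is None else x for x, y in zip(samples, labels)]


print(one_hot("b", ["a", "b", "c"]))  # [0.0, 1.0, 0.0]

feature = [1.0, None, 3.0, 10.0, None]
labels = [0, 0, 0, 1, 1]
means = class_means(feature, labels)
print(impute(feature, labels, means))  # [1.0, 2.0, 3.0, 10.0, 10.0]
```

Class labels are only known for training data, so prediction-time samples would need a different fallback, such as the overall mean.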
Further work
- multiple MLPs + agreement algorithm
- vary amount of classes
- vary class size (equal magnitude vs. equal width)
- more detailed (class -> price) conversion
- different cost function
  - factor in cost of misclassification
- different learning algorithm?