KAGGLE COMPETITION: ALLSTATE CLAIMS SEVERITY

Presenter: Yuwei(Ruby) Jiao

16-11-29 CS 846 Software Engineering for Big Data 1

Outline
• Background
• Problem
• Goals
• Approach
• Expectation
• Future Work

Background
• Kaggle:
  •  a platform for predictive modeling and analytics competitions
  •  companies and researchers post their data
  •  statisticians and data miners from all over the world compete to produce the best models
• Allstate:
  •  the second largest personal lines insurer in the United States
  •  is currently developing automated methods of predicting the cost, and hence severity, of claims

Problem
• Data
  •  train.csv (188318 rows x 132 columns)
  •  test.csv (125546 rows x 131 columns)
• Attributes
  •  ID: 1
  •  Categorical: 116
  •  Continuous: 14
  •  Loss: 1
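The attribute counts above add up to the 132 columns of train.csv (1 + 116 + 14 + 1). A minimal sketch of how that breakdown could be checked is shown here. It assumes the columns are named `id`, `cat1`..`cat116`, `cont1`..`cont14` and `loss`, which the slides do not state, and it runs on a small synthetic frame rather than the real file. The helper names `group_columns` and `make_toy_frame` are hypothetical.

```python
import numpy as np
import pandas as pd


def group_columns(df):
    """Group column names by role, assuming id / cat* / cont* / loss naming."""
    groups = {"id": [], "categorical": [], "continuous": [], "loss": []}
    for name in df.columns:
        if name == "id":
            groups["id"].append(name)
        elif name == "loss":
            groups["loss"].append(name)
        elif name.startswith("cat"):
            groups["categorical"].append(name)
        elif name.startswith("cont"):
            groups["continuous"].append(name)
    return groups


def make_toy_frame(n_rows=5, n_cat=116, n_cont=14, seed=0):
    """Build a synthetic stand-in with the same column layout as train.csv."""
    rng = np.random.default_rng(seed)
    data = {"id": np.arange(1, n_rows + 1)}
    for i in range(1, n_cat + 1):
        data[f"cat{i}"] = rng.choice(["A", "B"], size=n_rows)
    for i in range(1, n_cont + 1):
        data[f"cont{i}"] = rng.random(n_rows)
    data["loss"] = rng.random(n_rows) * 1000
    return pd.DataFrame(data)


# With the real data this would be: toy = pd.read_csv("train.csv")
toy = make_toy_frame()
counts = {role: len(cols) for role, cols in group_columns(toy).items()}
print(toy.shape, counts)
```

The test set has no loss column, which accounts for its 131 columns.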


Goals
• Explore raw data
  •  Data Statistics
  •  Data Visualization
  •  Data Transformation
  •  Data Interaction
  •  Data Preparation
• Evaluation, prediction and analysis
  •  Explore different machine learning models and algorithms

Approach
• Language:
  •  Python 3.0
• Library:

Approach --- Data Statistics

•  Skew
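The skew noted on this slide can be computed per column with pandas. The sketch below is an illustration on synthetic data, not the presenter's code: the column names `cont1` and `loss` and the distributions used to fill them are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: one roughly symmetric feature and one right-skewed target.
toy = pd.DataFrame({
    "cont1": rng.normal(size=2000),
    "loss": rng.lognormal(mean=7.0, sigma=1.0, size=2000),
})

# DataFrame.skew() returns the sample skewness of each numeric column.
skewness = toy.skew()
print(skewness)
```

A value near zero indicates a roughly symmetric column, while a large positive value indicates a long right tail.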

Approach --- Data Visualization


Approach --- Data Transformation
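The content of this slide did not survive extraction, so the transformation actually used is unknown. As an assumption, a common choice for a right-skewed target is a log transform, sketched here with `np.log1p` and its inverse `np.expm1` on a synthetic `loss` series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed target standing in for the loss column (assumption).
loss = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=2000), name="loss")

# log1p(x) = log(1 + x) compresses the long right tail.
log_loss = np.log1p(loss)

# expm1 inverts log1p, so predictions made on the log scale can be mapped back.
restored = np.expm1(log_loss)

print("skew before:", loss.skew())
print("skew after: ", log_loss.skew())
```

A model trained on the transformed target predicts on the log scale, so its outputs need the inverse transform before they are compared with the original losses.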


Approach --- Data Interaction


Approach --- Data Preparation
• Divide the dataset into train and validation sets
• Convert categorical attributes to binary vectors with one-hot encoding
  •  Determining the state has a low and constant cost
  •  Changing the state has a constant cost
  •  Easy to design and modify
  •  Easy to detect illegal states
  •  Takes advantage of an FPGA's abundant flip-flops
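The two preparation steps above can be sketched with pandas alone. This is a minimal illustration on a made-up ten-row frame: the column names, the 80/20 split ratio and the random seed are assumptions, not values from the slides.

```python
import pandas as pd

# Tiny synthetic stand-in for train.csv (all values are made up).
toy = pd.DataFrame({
    "id": range(1, 11),
    "cat1": ["A", "B", "A", "C", "B", "A", "C", "A", "B", "A"],
    "cont1": [0.1, 0.5, 0.3, 0.9, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0],
    "loss": [1200.0, 800.0, 950.0, 4000.0, 700.0,
             1100.0, 3500.0, 1000.0, 900.0, 1300.0],
})

# One-hot encode before splitting so both sets share the same columns.
encoded = pd.get_dummies(toy, columns=["cat1"], dtype=int)

# Hold out 20% of the rows for validation (ratio is an assumption).
valid = encoded.sample(frac=0.2, random_state=0)
train = encoded.drop(valid.index)

# Separate features from the target; the id carries no predictive signal.
X_train, y_train = train.drop(columns=["id", "loss"]), train["loss"]
X_valid, y_valid = valid.drop(columns=["id", "loss"]), valid["loss"]

print(X_train.columns.tolist())
print(len(train), len(valid))
```

Each category value becomes its own 0/1 column, so every row has exactly one 1 among the `cat1_*` columns.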

Expectation
• Make prediction:
  •  XGBoost
• Current ranking:
  •  50%
• Expected ranking:
  •  30%?

Future Work
• Feature engineering
  •  Use domain knowledge of the data to create features
  •  Make machine learning algorithms work
• Evaluation, prediction and analysis
  •  Linear Regression (linear algo)
  •  LASSO Linear Regression (linear algo)
  •  KNN (non-linear algo)
  •  SVM (non-linear algo)
  •  Random Forest (bagging)
  •  AdaBoost (boosting)
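One way the models listed above could be compared is with cross-validation in scikit-learn, sketched below. Everything specific here is an assumption: the data is synthetic (`make_regression`), the hyperparameters are arbitrary, and mean absolute error is used as the error measure because the slides do not name one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic regression problem standing in for the prepared claims data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# The six model families from the list, with illustrative settings.
models = {
    "Linear Regression": LinearRegression(),
    "LASSO": Lasso(alpha=0.1),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVM": SVR(kernel="rbf", C=10.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated mean absolute error for each model (lower is better).
mae = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()

for name, value in sorted(mae.items(), key=lambda item: item[1]):
    print(f"{name:18s} MAE = {value:.2f}")
```

Because the synthetic target is linear in the features, the linear models score best here; the ordering on the real data could be quite different.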


Thank you!

Q + A?