Feature Engineering Studio September 23, 2013

Upload: edgar-cole

Post on 12-Jan-2016

Page 1: Feature Engineering Studio September 23, 2013. Welcome to Mucking Around Day

Feature Engineering Studio

September 23, 2013

Welcome to Mucking Around Day

Sort into pairs

• Partner with the person next to you

• One group of 3 is allowed

Sort into pairs

• Do we have a group of 3?

• One of the 3 will work with me

Sort into pairs

• Go over your reports together

– A maximum of 5 minutes apiece

5 minutes for first person

5 minutes for second person

Re-assemble into one big group

Who here found something really cool while mucking around?

• Show us, tell us

Who here found a histogram with a normal distribution?

• Show us, tell us

Who here found a histogram with a hypermode?

• Show us, tell us

Who here found a histogram with a flat distribution?

• Show us, tell us

Who here found a histogram with a skewed distribution?

• Show us, tell us

Who here found a histogram with a bimodal distribution?

• Show us, tell us

Who here found a histogram with something else interesting?

• Show us, tell us

Who here found something surprising with their min, max, average, stdev?

Categorical variables

• Who here found something curious, weird, or interesting in the distribution of their categorical variables?

Who here hasn’t spoken yet? (and analyzed data)

• Tell us something interesting you found in your data

Who here played with pivot tables?

• What did you learn?

My turn to play with pivot tables

• Who wants to volunteer their data?

• (I might request a 2nd or 3rd data set, depending on how the 1st one goes)

Who here played with vlookup?

• What did you learn?

My turn to play with vlookup

• Using the same volunteered data set(s)

Other cool things you can create with a few simple formulas (plus demos!)

Identifying specific cases of interest

Did event of interest ever occur for student?
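
A minimal sketch of this feature outside Excel, assuming a hypothetical action log with one row per action. The student IDs, action names, and the choice of "help" as the event of interest are made up for illustration. Every row gets a 1 if its student has the event anywhere in their log.

```python
# Hypothetical action log: (student, action), in time order.
log = [
    ("s1", "attempt"), ("s1", "help"), ("s1", "attempt"),
    ("s2", "attempt"), ("s2", "attempt"),
]

EVENT = "help"  # assumed event of interest

# Students for whom the event occurred at least once.
students_with_event = {student for student, action in log if action == EVENT}

# Per-row feature: 1 if this row's student ever had the event, else 0.
ever_occurred = [int(student in students_with_event) for student, _ in log]
print(ever_occurred)
```

Note that this version looks at the student's whole log, including actions after the current row, so it is a student-level feature rather than a "so far" feature.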

Counts-so-far (and total value for student)
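
As a rough sketch, a running count and a per-student total could be computed like this. The log layout and the event name are assumptions, and the running count here includes the current action; excluding it is an equally reasonable choice.

```python
from collections import Counter

# Hypothetical action log: (student, action), in time order.
log = [
    ("s1", "attempt"), ("s1", "help"), ("s1", "help"),
    ("s2", "help"), ("s2", "attempt"),
]

EVENT = "help"  # assumed event of interest

# Total number of events per student over the whole log.
total = Counter(student for student, action in log if action == EVENT)

# Running count per student, updated row by row (current row included).
running = Counter()
count_so_far = []
for student, action in log:
    if action == EVENT:
        running[student] += 1
    count_so_far.append(running[student])

# The student's total, repeated on each of that student's rows.
total_for_student = [total[student] for student, _ in log]
print(count_so_far, total_for_student)
```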

Counts-last-N-actions
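
One possible way to sketch a last-N count is a fixed-length window per student. The window size `N = 3`, the log layout, and the event name are all illustrative assumptions; the window here includes the current action.

```python
from collections import defaultdict, deque

# Hypothetical action log: (student, action), in time order.
log = [
    ("s1", "help"), ("s1", "attempt"), ("s1", "help"),
    ("s1", "attempt"), ("s1", "attempt"), ("s2", "help"),
]

EVENT = "help"  # assumed event of interest
N = 3           # assumed window size

# One sliding window per student; deque(maxlen=N) drops the oldest entry itself.
recent = defaultdict(lambda: deque(maxlen=N))

count_last_n = []
for student, action in log:
    window = recent[student]
    window.append(action == EVENT)
    count_last_n.append(sum(window))  # events among the student's last N actions

print(count_last_n)
```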

First attempts
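
A first-attempt flag can be sketched by remembering which (student, step) pairs have already been seen. The step names and the 0/1 correctness column below are hypothetical.

```python
# Hypothetical log: (student, step, correct), in time order.
log = [
    ("s1", "step1", 0), ("s1", "step1", 1),
    ("s1", "step2", 1), ("s2", "step1", 1),
]

seen = set()
first_attempt = []
for student, step, _ in log:
    key = (student, step)
    first_attempt.append(int(key not in seen))  # 1 only the first time
    seen.add(key)

# Correctness restricted to first attempts.
first_correct = [row[2] for row, is_first in zip(log, first_attempt) if is_first]
print(first_attempt, first_correct)
```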

Ratios between events of interest
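
A hedged sketch of a ratio feature, here help requests per attempt for each student. The two event names are assumptions, and returning `None` for a zero denominator is one design choice among several (a sentinel value or a smoothed ratio would also work).

```python
from collections import Counter

# Hypothetical action log: (student, action).
log = [
    ("s1", "help"), ("s1", "attempt"), ("s1", "attempt"),
    ("s2", "help"), ("s3", "attempt"),
]

counts = Counter(log)  # counts each (student, action) pair
students = sorted({student for student, _ in log})

def ratio(student, numerator="help", denominator="attempt"):
    """Ratio of two event counts for one student; None if undefined."""
    denom = counts[(student, denominator)]
    if denom == 0:
        return None
    return counts[(student, numerator)] / denom

help_per_attempt = {student: ratio(student) for student in students}
print(help_per_attempt)
```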

How many students had 3 (or 4, 5, 2,…) of an event
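
This question amounts to a count of counts. A small sketch, assuming the same kind of hypothetical log and keeping students who never had the event as a zero bucket:

```python
from collections import Counter

# Hypothetical action log: (student, action).
log = [
    ("s1", "help"), ("s1", "help"), ("s1", "help"),
    ("s2", "help"), ("s3", "attempt"),
    ("s4", "help"), ("s4", "help"), ("s4", "help"),
]

EVENT = "help"  # assumed event of interest

students = {student for student, _ in log}
per_student = Counter(student for student, action in log if action == EVENT)

# Map "number of events" -> "number of students with exactly that many".
students_by_count = Counter(per_student[student] for student in students)
print(sorted(students_by_count.items()))
```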

Times-so-far
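
Reading this slide as cumulative time per student (an interpretation, since the slide gives only the title), a sketch with a hypothetical seconds-per-action column might look like:

```python
from collections import defaultdict

# Hypothetical log: (student, seconds spent on the action), in time order.
log = [("s1", 4.0), ("s1", 10.0), ("s2", 3.0), ("s1", 2.5)]

elapsed = defaultdict(float)
time_so_far = []
for student, seconds in log:
    elapsed[student] += seconds           # running total for this student
    time_so_far.append(elapsed[student])

print(time_so_far)
```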

Cutoff-based features
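
A cutoff-based feature turns a continuous value into a flag. In the sketch below the 3-second and 30-second thresholds are invented for illustration; suitable cutoffs would have to come from looking at the data.

```python
# Hypothetical log: (student, seconds spent on the action).
log = [("s1", 1.2), ("s1", 8.0), ("s2", 2.9), ("s2", 45.0)]

FAST_CUTOFF = 3.0   # assumed threshold, in seconds
SLOW_CUTOFF = 30.0  # assumed threshold, in seconds

too_fast = [int(seconds < FAST_CUTOFF) for _, seconds in log]
too_slow = [int(seconds > SLOW_CUTOFF) for _, seconds in log]
print(too_fast, too_slow)
```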

Unitized actions (such as unitized time)
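
One common reading of unitized time is the time for an action expressed in standard deviations from the mean time for that same step across students. The sketch below assumes that reading, uses the population standard deviation, and maps steps with no variation to 0.0; each of these is an assumption rather than something the slide specifies.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log: (student, step, seconds).
log = [
    ("s1", "step1", 2.0), ("s2", "step1", 4.0), ("s3", "step1", 6.0),
    ("s1", "step2", 5.0), ("s2", "step2", 5.0),
]

# Collect all times observed for each step.
times_by_step = defaultdict(list)
for _, step, seconds in log:
    times_by_step[step].append(seconds)

stats = {step: (mean(v), pstdev(v)) for step, v in times_by_step.items()}

def unitize(step, seconds):
    """Time in SD units relative to the step's mean; 0.0 if SD is zero."""
    mu, sd = stats[step]
    return 0.0 if sd == 0 else (seconds - mu) / sd

unitized_time = [unitize(step, seconds) for _, step, seconds in log]
print([round(u, 3) for u in unitized_time])
```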

Last 3 or 5 unitized

Comparing earlier behaviors to later behaviors through caching

Counts-if

Percentages of action type
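
A sketch of per-student action-type percentages, with hypothetical action names:

```python
from collections import Counter, defaultdict

# Hypothetical action log: (student, action).
log = [
    ("s1", "attempt"), ("s1", "help"), ("s1", "attempt"), ("s1", "attempt"),
    ("s2", "help"), ("s2", "help"),
]

# Count actions of each type per student.
by_student = defaultdict(Counter)
for student, action in log:
    by_student[student][action] += 1

# Convert counts to percentages of that student's actions.
pct_action = {
    student: {a: 100.0 * n / sum(c.values()) for a, n in c.items()}
    for student, c in by_student.items()
}
print(pct_action)
```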

Percentages of time spent per action/location/KC/etc.
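
The time-based version weights each row by its duration instead of counting rows. A sketch grouped by knowledge component (KC), with invented KC names and times:

```python
from collections import defaultdict

# Hypothetical log: (student, KC, seconds).
log = [
    ("s1", "fractions", 30.0), ("s1", "decimals", 10.0),
    ("s1", "fractions", 20.0), ("s2", "decimals", 8.0),
]

time_by_kc = defaultdict(float)    # (student, KC) -> seconds
total_time = defaultdict(float)    # student -> seconds
for student, kc, seconds in log:
    time_by_kc[(student, kc)] += seconds
    total_time[student] += seconds

# Share of each student's time spent on each KC; skip students with no time.
pct_time = {
    (student, kc): 100.0 * t / total_time[student]
    for (student, kc), t in time_by_kc.items()
    if total_time[student] > 0
}
print({k: round(v, 1) for k, v in pct_time.items()})
```

Swapping the KC column for an action type or a location gives the other variants named on the slide.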

Questions? Comments?

Other cool ideas?

Assignment 3

• Feature Engineering 1

“Bring Me a Rock”

• Get your data set

• Open it in Excel

• Create as many features as you feel inspired to create

– Features should be created with the goal of predicting your ground truth variable

– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)

• For each feature, write a 1-3 sentence “just so story” for why it might work

• Test how good each feature is

Testing Feature Goodness

• For this assignment, there are a bunch of ways to test feature goodness

• Single-feature prediction models in a data mining or stats package, giving correlation or kappa (special session this Wednesday)

• Compute correlation in Excel (want to see?)

– You can do this with binary variables too, although it’s not really optimal

• Compute t-test in Excel (want to see?)

• Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
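
For readers working outside Excel or RapidMiner, two of these checks can be sketched in plain Python. The feature values, the ground truth labels, and the 1.5 threshold used as a stand-in single-feature "model" are all made up; the formulas are the standard Pearson correlation and Cohen's kappa.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cohen_kappa(pred, truth):
    """Agreement beyond chance; undefined when expected agreement is 1."""
    n = len(pred)
    labels = set(pred) | set(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

feature = [1.0, 2.0, 3.0, 4.0]  # hypothetical feature values
truth = [0, 0, 1, 1]            # hypothetical binary ground truth

r = pearson(feature, truth)

# Stand-in for a single-feature model: predict 1 above an assumed threshold.
predicted = [int(x > 1.5) for x in feature]
kappa = cohen_kappa(predicted, truth)

print(round(r, 3), kappa)
```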

Were you right?

• Which of your “just so stories” seem to be correct?

• Did any of your features correlate in the opposite direction from what you expected?

Assignment 3

• Write a brief report for me

• Email me an Excel sheet with your features

• You don’t need to prepare a presentation

• But be ready to discuss your features in class

Next Classes

• 9/25 Special Session

– Using RapidMiner to Produce Prediction Models

– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)

– Statistical significance tests using linear regression don’t count…

• 9/30 Advanced Feature Distillation in Excel

– Assignment 3 due

– Online Equation Solver Tutorials should be in your INBOX

Upcoming Classes

• 10/2 Special session on prediction models

– Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is

• 10/7 Advanced Feature Distillation in Google Refine

• 10/9 Special session? TBD.