Kaggle presentation at SF Data Mining Meetup - Trulia, June 23, 2015

Post on 13-Aug-2015

Category: Data & Analytics

TRANSCRIPT

Kaggle The home of data science

GE Flight Quest 2 Optimize flight routes based on weather & traffic

$250,000 122 teams

Hewlett Foundation: Automated Essay Scoring Develop an automated scoring algorithm for student-written essays

$100,000 155 teams

Allstate Purchase Prediction Challenge Predict a purchased policy based on transaction history

$50,000 1,570 teams

Merck Molecular Activity Challenge Help develop safe and effective medicines by predicting molecular activity

$40,000 236 teams

Higgs Boson Machine Learning Challenge Use the ATLAS experiment to identify the Higgs boson

$13,000 1,302 teams

Training Data

Age   Income    Default
58    $95,824   True
73    $20,708   False
59    $82,152   False
66    $25,334   True

Test Data

Age   Income    Default
73    $53,445   ?
61    $36,679   ?
47    $90,422   ?
44    $79,040   ?

The Kaggle Approach
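The training and test tables above can be read as a small supervised-learning setup: competitors fit a model on the labeled training rows, submit predictions for the test rows, and the host scores them against labels it has held back. The sketch below is only an illustration of that flow. The nearest-neighbor baseline and the names `predict_1nn` and `error_rate` are assumptions made for this example; they are not taken from the slides or from Kaggle's actual scoring code.

```python
# Training rows from the slide: (age, income, default).
TRAIN = [
    (58, 95824, True),
    (73, 20708, False),
    (59, 82152, False),
    (66, 25334, True),
]

# Test rows from the slide: the Default column is withheld by the host.
TEST = [(73, 53445), (61, 36679), (47, 90422), (44, 79040)]


def predict_1nn(train, row):
    """Return the label of the closest training row (1-nearest neighbor).

    Age and income are rescaled by their training ranges so that income
    does not dominate the distance. This is a toy baseline (assumption),
    not a competitive model.
    """
    ages = [a for a, _, _ in train]
    incomes = [i for _, i, _ in train]
    age_span = (max(ages) - min(ages)) or 1
    income_span = (max(incomes) - min(incomes)) or 1

    def distance(t):
        return ((t[0] - row[0]) / age_span) ** 2 + ((t[1] - row[1]) / income_span) ** 2

    return min(train, key=distance)[2]


def error_rate(predictions, hidden_labels):
    """Host-side scoring (hypothetical): fraction of rows predicted wrongly."""
    wrong = sum(p != y for p, y in zip(predictions, hidden_labels))
    return wrong / len(hidden_labels)


# A "submission" is one predicted Default value per test row.
submission = [predict_1nn(TRAIN, row) for row in TEST]
```

Only the host can call `error_rate` with the real test labels, which is what keeps the leaderboard comparison fair across teams.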

Mapping Dark Matter

Competition Progress
[Chart: accuracy (lower is better) at Week 1, Week 3, Week 5, Week 7 and End; marked levels .0150 and .0170]

Martin O’Leary, PhD student in Glaciology, Cambridge U

“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”

“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”

Mapping Dark Matter

Competition Progress
[Chart: accuracy (lower is better) at Week 1, Week 3, Week 5, Week 7 and End; marked levels .0150 and .0170]

Names labeled on the chart:

Martin O’Leary, PhD student in Glaciology, Cambridge U

Marius Cobzarenco, Grad student in computer vision, UC London

Ali Haissaine & Eu Jin Loc, Signature Verification, Qatar U & Grad Student @ Deloitte

deepZot (David Kirkby & Daniel Margala), Particle Physicist & Cosmologist

Other

EXAMPLE ESSAY QUESTION —

We all understand the benefits of laughter. For example, someone once said, “Laughter is the shortest distance between two people.” Many other people believe that laughter is an important part of any relationship. Tell a true story in which laughter was one element or part.

We can work with difficult data —

The winning model correctly predicted seizures 82% of the time. Until that point, researchers had struggled to develop an algorithm that did better than chance

Mayo Clinic: Seizure detection from EEG readings

We’ve worked with many of the world’s largest companies

Industries: Healthcare & Pharma, Consumer Internet, Finance, Industrial, Consumer Marketing, Oil & Gas

Clients shown: $50b+ Beverage Co., Global Bank, Top Credit Card Issuer, Top 5 E&P, Top 20 E&P

Community of over 320K data scientists who submit over 100K machine learning models per month

[Chart: Monthly Submissions to Kaggle Competitions, May-10 to May-15; y-axis from 0 to 160,000]

Feature engineering matters most

Good software engineering practices and robust statistical methods are key

80% of data science is grunt work and only 20% involves deep thinking

A good pipeline makes data scientists more productive and their work higher quality and more enjoyable

Our workflow environment will be the central repository for all data science work in a company

Anthony Goldbloom a@kaggle.com 650 283 9781
