data science: the main course @ kcdc 2016

Post on 06-Apr-2017

DATA SCIENCE: THE MAIN COURSE

I Can Science Data, and So Can You!

Arthur Doler @arthurdoler arthurdoler@gmail.com

HOW MANY APPETIZERS HAVE YOU EATEN?

Sources: Mediawiki, Publicdomainpictures.net

SO WE’RE SKIPPING RIGHT TO THE MAIN COURSE

YOU HAVE THE DATA

YOU HAVE THE POWER

Sources: Mattel, he-manreviewed.net

WHAT’S FOR DINNER

Picking your problem

Using Knitr/R Markdown

Building a linear predictor

Making a predictive, repeatable document

WHAT’S NOT FOR DINNER

Learning R

Exhaustive discussion of statistics

Exhaustive discussion of regression modeling

Ways to run R in production

STEP 0: KNOW YOUR RECIPE FOR REPEATABILITY

Learn to Knit you some R

knitr ≈ Sweave + cacheSweave + pgfSweave + weaver + animation::saveLatex + R2HTML::RweaveHTML + highlight::HighlightWeaveLatex + 0.2 * brew + 0.1 * SweaveListingUtils + more

Source: Reddit

[Diagram: blocks of R code interleaved with blocks of markup]

WHAT?! WHY IS THIS A GOOD IDEA?

Do you love me?

Y / N

LET’S GO FIND THAT RECIPE!

Source: Reddit

STEP 1: SHOP FOR YOUR INGREDIENTS

Finding the question to ask

WHAT ARE YOU TRYING TO DO?

Finding or proving a correlation

Looking for outliers

Building a predictive model

LET’S BUILD A LINEAR PREDICTIVE MODEL

Source: Wikipedia

WHAT ARE YOUR VARIABLES?

• Material Category
• Material ID
• Time-to-Incapacitation
• 1000 / Time-to-Incapacitation
• Carbon Monoxide
• Hydrogen Cyanide
• Hydrogen Sulfide
• Hydrochloric Acid
• Hydrobromic Acid
• Nitrogen Dioxide
• Sulfur Dioxide

WHAT DO YOU CARE ABOUT?

FORMULATE YOUR QUESTION

LET’S HEAD TO THE STORE!

Source: Reddit

STEP 2: GET YOUR MISE EN PLACE

Dividing your data

WHERE IS THE VALUE IN A PREDICTIVE MODEL?

WE BUILD OUR MODEL WITH A TRAINING SET

PARTITIONING YOUR DATA PREVENTS OVERTRAINING

²⁄³ Training, ¹⁄³ Test

½ Training, ¼ Test, ¼ Validation
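The two-way split above can be sketched in a few lines. The talk works in R; this is a Python sketch for illustration only, and the function name `split_rows`, the seed, and the stand-in rows are assumptions, not the speaker's code.

```python
import random

def split_rows(rows, train_fraction=2 / 3, seed=42):
    """Shuffle a copy of rows and split it into (training, test) lists.

    A fixed seed keeps the split repeatable from one run to the next.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for 30 observations (hypothetical; not the talk's data).
rows = list(range(30))
training, test = split_rows(rows)
print(len(training), "training rows,", len(test), "test rows")
```

Shuffling before cutting matters when the rows arrive sorted (for example by material category), because otherwise the test set could contain categories the model never saw.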

LET’S MEASURE EVERYTHING OUT!

Source: Reddit

STEP 3: COOK UP YOUR PREDICTOR

Training your model

ONE WARNING FIRST

DO YOU NEED TO UNDERSTAND YOUR PREDICTOR?

LET’S GO COOK UP THE MODEL!

Source: Reddit
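As a rough picture of what training a linear model involves, here is a minimal Python sketch (the talk itself uses R). It fits a one-variable least-squares line on a training set and scores it on a held-out test set. The helper names `fit_line` and `rmse` and all of the numbers are made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on (xs, ys)."""
    intercept, slope = model
    squared = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up numbers for illustration only; not the talk's data.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
test_x = [7.0, 8.0, 9.0]
test_y = [15.2, 16.8, 19.1]

model = fit_line(train_x, train_y)
print("intercept, slope:", model)
print("training RMSE:", rmse(model, train_x, train_y))
print("test RMSE:", rmse(model, test_x, test_y))
```

A test error far above the training error is one sign of overtraining, which is why the model never sees the test rows while it is being fitted.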

WHY DID 1000/TIME_TO_INCAPACITATION WORK BETTER?
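The slide poses this as a question without recording the answer. One plausible reason, offered here as an assumption rather than the speaker's explanation: if time-to-incapacitation falls off roughly in inverse proportion to gas concentration, then the reciprocal of the time is roughly linear in concentration, which is the shape a linear model can fit. The Python sketch below (the talk uses R) shows the effect on synthetic numbers constructed so that time = 20000 / concentration; none of the values come from the talk's dataset.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# Synthetic data, built to follow time = 20000 / concentration exactly.
concentration = [500.0, 1000.0, 2000.0, 4000.0, 8000.0]
time_to_incap = [20000.0 / c for c in concentration]
reciprocal = [1000.0 / t for t in time_to_incap]

r2_raw = r_squared(concentration, time_to_incap)
r2_reciprocal = r_squared(concentration, reciprocal)
print("R^2, raw time:", r2_raw)
print("R^2, 1000 / time:", r2_reciprocal)
```

On this constructed data a straight line fits the transformed response almost perfectly and the raw response much less well. Real measurements would be noisier, so the gap would be smaller.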

STEP 3A: TRIM THE FAT

Eliminating Outliers

LET’S GO CUT!

Source: Reddit
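The transcript does not say which rule the talk uses to pick outliers. As one common option, the Python sketch below (the talk uses R) applies Tukey's fences: it drops values more than 1.5 interquartile ranges outside the quartiles. The function name `trim_outliers`, the multiplier, and the sample values are assumptions for illustration.

```python
import statistics

def trim_outliers(values, k=1.5):
    """Keep only values inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Made-up measurements with one obviously extreme value.
measurements = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 12.5, 4.3]
kept = trim_outliers(measurements)
print("kept", len(kept), "of", len(measurements), "values")
```

Whatever rule is used, it is worth recording it in the report itself so that the trimming is as repeatable as the rest of the analysis.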

STEP 4: GARNISH WITH GRAPHICS

Adding visualizations to your report

plot / ggplot2

Source: Wikimedia

LET’S FINISH UP THAT REPORT!

Source: Reddit

1. Know Your Recipe for Repeatability
2. Shop for Your Column Ingredients
3. Get Your Data Divided
4. Cook Up Your Predictor
   1. Trim the Outlier Fat
5. Garnish with Graphics

QUESTIONS?

Source: Reddit
