data science, what even

Post on 14-Apr-2017

Category:

Technology


Data Science?! What even...

David Coallier (@davidcoallier)

Data Scientist, Engine Yard

And I cook... A lot.

(n-1) items

Adapting.

Feedback.

Indifference.

Young mathematically inclined minds

We knew everything.

First Bad Assumption.

So we asked “experts”.

Bad Ingredients

Bad Data

Tasted like sh*t

From Our Results: We had questions.

Found Expertise: Not Online.

Data Scientific Method

Find a Question: Your Hypothesis

Current Data: What do you have?

Features & Tests: Try it.

Analyse Results: Won’t be pretty.

Conversation: Framed. By. Data.

But....

Good Discussions: Imply good data scientists

[Venn diagram: three circles labelled Hacking Skills, Maths & Stats and Expertise. The overlaps are labelled Machine Learning, Research and Danger Zone!!!, with Data Science at the centre.]

Business: Don’t need an MBA

In other words.

1. Hacking
2. Maths & Stats
3. Expertise

Apply Method: Data Scientific

1. Question
2. Current Data
3. Features/Tests
4. Analyse
5. Converse

Find a Question: Let’s imagine Github

Upgrade Repos: Affect users as little as possible

import csv
with open('repo1.csv', newline='') as f:
    content = list(csv.reader(f))  # one list of fields per row

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k \geq 0$$
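One way the Poisson formula could serve the upgrade question is sketched below. The hourly commit rates are invented for illustration, and `poisson_pmf`, `mean_commits_by_hour` and `quiet_probability` are hypothetical names, not anything from the talk. The idea is simply that the hour with the highest chance of zero commits is the least disruptive time to upgrade a repository.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical average commits per hour for one repository, keyed by hour of day.
# These numbers are invented for illustration only.
mean_commits_by_hour = {2: 0.1, 9: 3.5, 14: 4.0, 23: 0.4}

# Probability that an hour passes with zero commits: f(0; lam) = e^(-lam).
quiet_probability = {
    hour: poisson_pmf(0, lam) for hour, lam in mean_commits_by_hour.items()
}

# The quietest hour is the one most likely to see no commits at all.
best_hour = max(quiet_probability, key=quiet_probability.get)
print(best_hour, round(quiet_probability[best_hour], 3))
```

This assumes commits arrive independently at a steady rate within each hour, which real repositories only approximate.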

Converse: Present Findings

Iterate: Commits aren’t key.

KPIs are key: Indicators from experience

Questions: Super Important.

Just test it...

We are Human: Emotional Connection

What next? Second Hypothesis.

Focus on Data: Relevant to your KPIs.

Data gives you the what

Humans give you the why

Turn Information

Into

Actionable Insight

Create Discussions: Introspection Engines

Seeing, Feeling it: The brain sees.

Not regressions

Not p-values

Not slopes

Not F-statistics

Not coefficients

Another Example: Fraud Engine

Features: Fraud Engine

Clusters: User Types

Machine Learning: Historical Analysis

Decision: Report as Fraudulent
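The features, clusters, history and decision steps can be sketched roughly as follows. Everything specific here is an assumption: the two features, the sample data, the starting centroids and the distance threshold are invented, and plain k-means stands in for whatever the real engine used.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recentre."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids

# Hypothetical features per user: (orders per day, average order value).
history = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 210), (11, 190)]

# Historical analysis: learn the "user types" from past behaviour.
user_types = kmeans(history, centroids=[(0, 0), (15, 250)])

def is_suspicious(user, centroids, threshold=50.0):
    """Flag a user whose features sit far from every known user type."""
    return min(math.dist(user, c) for c in centroids) > threshold

print(is_suspicious((50, 900), user_types))
```

The features are left unscaled to keep the sketch short. In practice the larger-valued feature would dominate the distance unless the columns are normalised first.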

Fact-Based Decision Failing

Fact-Based Decision Making

Measure → Analysis → Knowledge → Action

Failed: Noetic Intelligence

Measure → Analysis → Knowledge → Action

Offering: Missing Feature

Toolbox: What do we use?

R: Modeling, Testing, Prototyping

RStudio: The IDE

lubridate and zoo

Dealing with Dates...

yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
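The same headache, sketched in Python with only the standard library. `FORMATS` and `parse_timestamp` are hypothetical names. Two assumptions are baked in: the time zone arrives as a numeric offset such as `+0000`, and values without a time zone are already in UTC. The `yy/mm` layout is left out because it carries no day.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order. The order is an assumption: a string such
# as "03/04/05" is ambiguous, and whichever layout comes first wins.
FORMATS = [
    "%Y-%m-%d %H:%M:%S %z",  # 2013-03-20 12:54:54 +0000
    "%y/%m/%d",              # 13/03/20
    "%m/%d/%y",              # 03/20/13
    "%y-%m-%d",              # 13-03-20
]

def parse_timestamp(raw: str) -> datetime:
    """Return a timezone-aware UTC datetime for any supported input."""
    raw = raw.strip()
    try:
        # Unix epoch seconds, e.g. "1363784094.513425"
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Assumption: naive values are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognised date format: {raw!r}")

for raw in ["13/03/20", "03/20/13", "13-03-20", "1363784094.513425"]:
    print(raw, "->", parse_timestamp(raw).isoformat())
```

Trying formats in a fixed order hides the ambiguity rather than resolving it, so the order should match whatever the data source is known to emit.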

reshape2: Reshape your Data

ggplot2: Visualise your Data

RCurl, RJSONIO: Find more Data

Hmisc: Miscellaneous useful functions

forecast: Can you guess?

garch: Generalized Autoregressive Conditional Heteroskedasticity

quantmod: Statistical Financial Trading

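library(quantmod)
getSymbols('AAPL')  # download AAPL price history into the workspace
barChart(AAPL)      # draw a bar chart of the series
addMACD()           # overlay the MACD indicator on the current chart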

xts: Extensible Time Series

igraph: Study Networks

maptools: Read & View Maps

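library(maps)  # map() is provided by the maps package
map('state', region = row.names(USArrests), fill = TRUE,
    # scale to 1..16 so every index falls inside the 16-colour palette
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 15) + 1])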

Python: Scientific Computing

SciPy: http://www.scipy.org

scipy.stats

scipy.stats: Descriptive Statistics

from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]

print(describe(s))
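For a feel of what `describe` reports, here is a rough standard-library equivalent for the same list. It covers the count, range, mean and sample variance only and skips skewness and kurtosis. The `summary` dictionary is a hypothetical stand-in, not SciPy's return type.

```python
import statistics

s = [1, 2, 1, 3, 4, 5]

summary = {
    "nobs": len(s),
    "minmax": (min(s), max(s)),
    "mean": statistics.mean(s),
    "variance": statistics.variance(s),  # sample variance, divides by n - 1
}
print(summary)
```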

scipy.stats: Probability Distributions

Example: Poisson Distribution

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!} \quad \text{for } k \geq 0$$

from scipy.stats import poisson
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)  # f(k; 2) for each k in the list

print(p.mean())
print(p.sum())
...

NumPy: http://www.numpy.org/

NumPy: Linear Algebra

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

import numpy as np
x = np.array([[1, 0], [0, 1]])
val, vec = np.linalg.eig(x)  # eig returns (eigenvalues, eigenvectors)
np.linalg.eigvals(x)         # eigenvalues only

>>> np.linalg.eig(x)
(array([ 1.,  1.]), array([[ 1.,  0.], [ 0.,  1.]]))
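The result for the identity matrix can be checked by hand from the characteristic polynomial. This is the standard textbook derivation, added here as a worked check rather than taken from the slides:

```latex
\det(I - \lambda I)
  = \det\begin{pmatrix} 1 - \lambda & 0 \\ 0 & 1 - \lambda \end{pmatrix}
  = (1 - \lambda)^2 = 0
  \quad\Longrightarrow\quad \lambda_1 = \lambda_2 = 1
```

Since $I v = v$ for every vector $v$, any non-zero vector is an eigenvector, and the routine returns the standard basis vectors as the columns of the second array.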

Matplotlib: Python Plotting

statsmodels: Advanced Statistics Modeling

NLTK: Natural Language Tool Kit

scikit-learn: Machine Learning

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

clf.predict([[2., 2.]])  # array([1])
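To see why `[2., 2.]` lands in class 1, here is a one-split "decision stump" in plain Python. It is a simplified stand-in, not scikit-learn's algorithm: it counts misclassified rows where the library minimises an impurity measure, and `fit_stump` and `predict_stump` are hypothetical names. It also assumes at least one feature takes two distinct values.

```python
def fit_stump(X, y):
    """Find the (feature, threshold) split that misclassifies the fewest rows."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        # Candidate thresholds sit halfway between neighbouring values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [label for row, label in zip(X, y) if row[feature] <= threshold]
            right = [label for row, label in zip(X, y) if row[feature] > threshold]
            # Each side predicts its most common label.
            left_label = max(set(left), key=left.count)
            right_label = max(set(right), key=right.count)
            errors = (sum(label != left_label for label in left)
                      + sum(label != right_label for label in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    """Apply the learned split to a single row."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

stump = fit_stump([[0, 0], [1, 1]], [0, 1])
print(stump, predict_stump(stump, [2.0, 2.0]))
```

With the two training rows above, the only useful split falls at 0.5, so anything larger than that on the chosen feature is labelled 1.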

PyBrain... Machine Learning

PyMC: Bayesian Inference

Pattern: Web Mining for Python

NetworkX: Study Networks

MILK: Machine Learning

Pandas: easy-to-use data structures

from pandas import DataFrame
x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])

print(x[x['age'] > 20].count())  # number of rows with age over 20
print(x[x['age'] > 20].mean())   # mean age of those rows

Python vs R? Different Purposes

Storage

Oppose “big” Data

Hadoop

Had - oops

Riak: Key-Value Buckets

CouchDB: Document Database

Redis: In-Memory Database

Cube: Time-series Database

PgSQL: Quite Extensively

Visualisation

Right Now: The rule of 3

Engineer: Report One

Mid-Level Mgr: Report Two

Board Level: Report Three

The Future: Discoverable Insight

d3.js: Data-Driven Documents

The Future: Discoverable Insight

Dashing: Elegant Dashboards

Edward Tufte: Go read his books.

Dogfooding: Data Scientific Method

Original Question: What is Data Science?

Back to you: For questioning
