The Practice of Data Science

Lisa Sokol, Ph.D.

© 2016 IBM Corporation




Changing the Way We Do Analytics

▪ Look at all the data

▪ Let data lead the way

▪ Leverage data as it is captured


Basic Process

Diagram: a pipeline of steps spanning three phases (INPUT, ANALYSIS, OUTPUT):

− Understand problem and domain

− Ingest data

− Explore and understand data

− Transform: clean

− Transform: shape

− Create and build model

− Evaluate

− Deliver and deploy model

− Communicate results


Work Task Example

Given that a person is arrested: who gets released on bond, and how fast?


Understand the Domain

▪ Analytics requires an understanding of both the data and the judicial process

▪ Need to learn how a judge decides whether or not to allow bond

− Subject-matter experts (SMEs) indicate that judicial bond decisions are based on

• “Threat to community” = qualitative assessment (current charges + past charges + timeline)

• Ties to community


The Hunt for Data -- The Rap Sheet

▪ Charges

− Criminal code: thousands of numbers

− Time/date of arrest

− Sentenced or not (sometimes) / released or not

▪ Personal information

− Dirty and incomplete

▪ Arresting organization


Acquiring Rap Sheet Data

− Access required all sorts of agreements

− Different jurisdictions have different content and form

− Task requirement: metadata mapping and integration

• Consistent crime codes (NCIC)
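
The mapping task can be pictured with a minimal Python sketch. It is an illustration only, not the project's actual integration code: the jurisdiction names, field names, and local charge codes are hypothetical, and only the NCIC codes 101 (Treason) and 105 (Sedition) appear elsewhere in this deck.

```python
# Hypothetical per-jurisdiction field names mapped onto one common schema.
FIELD_MAP = {
    "county_a": {"chg_cd": "local_code", "arr_dt": "arrest_date"},
    "county_b": {"ChargeCode": "local_code", "ArrestDate": "arrest_date"},
}

# Hypothetical local charge codes mapped to NCIC codes.
NCIC_MAP = {
    ("county_a", "TR-1"): 101,  # Treason
    ("county_b", "9001"): 101,  # Treason
    ("county_b", "9005"): 105,  # Sedition
}


def normalize(jurisdiction, record):
    """Rename fields to the common schema and attach the NCIC code, if known."""
    fields = FIELD_MAP[jurisdiction]
    out = {fields[k]: v for k, v in record.items() if k in fields}
    out["jurisdiction"] = jurisdiction
    # None marks a local code that still needs a manual mapping decision.
    out["ncic_code"] = NCIC_MAP.get((jurisdiction, out.get("local_code")))
    return out


a = normalize("county_a", {"chg_cd": "TR-1", "arr_dt": "2015-03-07"})
b = normalize("county_b", {"ChargeCode": "9001", "ArrestDate": "2015-04-01"})
print(a)
print(b)
```

Once both records carry the same field names and the same NCIC code, charges from different jurisdictions can be counted and scored together.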


Explore and Understand the Data

▪ Analyze variables

− Values, max, min, number of variables, coverage (% missing data), distribution shapes, etc.

− Outliers

− Anomalous values

− Example: number of days from arrest to adjudication
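
A rough sketch of this kind of variable profiling, using only the Python standard library. The sample values are made up, the 1.5 × IQR fence for outliers is a common convention rather than something the deck specifies, and treating negative day counts as anomalous is an assumption.

```python
import statistics


def profile(values):
    """Summarize one variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier fences
    return {
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "pct_missing": 100 * (len(values) - len(present)) / len(values),
        "outliers": [v for v in present if v < low or v > high],
        # Assumption: adjudication cannot precede arrest, so negatives are bad data.
        "anomalies": [v for v in present if v < 0],
    }


# Made-up sample: days from arrest to adjudication.
days_to_adjudication = [30, 45, 60, 52, None, 41, 38, 900, -3, 47]
summary = profile(days_to_adjudication)
print(summary)
```

Outliers such as the 900-day case may be real and informative, while anomalies such as the negative value usually point to a data-entry or integration problem.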


Data Transformation

▪ To clean or not to clean

− Strategic decision

− Decision criteria

▪ Identity normalization

− Alias challenge

− Alias challenge as it relates to data science

• Model creation

• Model score


Data Transformation: Enriching the data by adding context to data

Context: The cumulative history derived from data observations about entities

▪ Example: safety of firefighters

− Current environmental temperature

Or

− Current environmental temperature and temperature history for that person

Or

− Current environmental temperature, temperature history for that person, and how long it will take to exit the building


Data Transformation: Threat

NCIC Charge Code    NCIC Charge Category    NCIC Charge
101                 Sovereignty             Treason
105                 Sovereignty             Sedition

▪ There are several thousand codes

− Which codes are considered a “threat”?

− How do codes compare in “threat”?

− How do you combine codes?

− How do you factor in the temporal aspect of crime?


Scoring: Threat to Community

▪ Two main components: scoring of individual charges and scoring of crime history

▪ Scoring of each charge is derived from two parameters

− A loss-of-memory parameter, which determines how fast the severity of the charge declines over time (this parameter might be zero)

− A lack-of-forgiveness parameter, which determines what proportion of the original severity level remains forever

▪ Scoring of crime history

− Scores of each charge/conviction are accumulated (the model determines how)

Diagram: for each crime, look up the scoring parameters and the time the crime was committed, and evaluate the individual crime scores. Then submit all of those scores to the cumulative history scoring function. The result is the offender's threat to the community.
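
One way the two parameters could combine is sketched below. The deck does not give the formula, so the exponential decay toward a permanent floor, the plain sum used to accumulate the history, and every parameter value and charge label here are assumptions chosen only to illustrate the idea.

```python
import math


def charge_score(severity, years_ago, memory_loss, forgiveness_floor):
    """Score one charge: severity decays over time toward a permanent floor.

    memory_loss: assumed decay rate per year (0 means the charge never fades).
    forgiveness_floor: fraction of the original severity that remains forever.
    """
    decaying = (1 - forgiveness_floor) * math.exp(-memory_loss * years_ago)
    return severity * (forgiveness_floor + decaying)


def threat_to_community(charges, params):
    """Accumulate per-charge scores; a plain sum stands in for the model's rule."""
    return sum(
        charge_score(params[code]["severity"], years_ago,
                     params[code]["memory_loss"], params[code]["floor"])
        for code, years_ago in charges
    )


# Hypothetical parameters keyed by hypothetical charge labels.
PARAMS = {
    "A": {"severity": 10.0, "memory_loss": 0.5, "floor": 0.2},
    "B": {"severity": 4.0, "memory_loss": 0.0, "floor": 0.0},
}
history = [("A", 0.0), ("A", 6.0), ("B", 3.0)]  # (charge, years since offense)
score = threat_to_community(history, PARAMS)
print(round(score, 3))
```

With a loss-of-memory rate of zero the charge keeps its full severity, and with a nonzero floor an old charge never drops below that fraction of its original severity.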


Data Transformation

▪ Target variable: time to release

− The time-to-release variable is obtained by subtracting the booking time stamp from the release time stamp.

▪ Counting

− Total number of a type of crime

− Total number of a specific threat-to-community grouping

▪ Distance variables

− Compare ZIP codes of the booking location and the arrestee’s home to determine if the arrestee is “local” to the booking locality

▪ Date stamp variables

− Did the booking date stamp occur on a weekend?

− Did the booking date stamp occur on a holiday, during a holiday, or just before a holiday?

▪ Time of day

− Early in shift? Late in shift?

▪ Net: about 1600 variables created
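
A few of these derived variables can be sketched with the Python standard library. The holiday calendar, the ZIP codes, and the field names are hypothetical, and the exact-match ZIP comparison is only one possible reading of “local”.

```python
from datetime import date, datetime, timedelta

# Hypothetical holiday calendar; a real one would depend on the jurisdiction.
HOLIDAYS = {date(2015, 7, 4), date(2015, 12, 25)}


def derive(booked, released, booking_zip, home_zip):
    """Build a handful of derived variables for one booking record."""
    day = booked.date()
    return {
        # Target variable: release time stamp minus booking time stamp, in hours.
        "hours_to_release": (released - booked).total_seconds() / 3600,
        # Assumption: "local" means the two ZIP codes match exactly.
        "is_local": booking_zip == home_zip,
        "on_weekend": day.weekday() >= 5,  # Saturday is 5, Sunday is 6
        "on_holiday": day in HOLIDAYS,
        "day_before_holiday": day + timedelta(days=1) in HOLIDAYS,
        "booking_hour": booked.hour,  # raw input for early/late-in-shift flags
    }


row = derive(datetime(2015, 7, 3, 22, 30), datetime(2015, 7, 5, 10, 30),
             "20001", "20001")
print(row)
```

Each flag is cheap to compute, which is how a modest set of raw fields can grow into a very large number of candidate variables.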


Modeling Process is Iterative

▪ Divide data set into 3 segments

▪ Predictive modeling algorithm: train model

▪ Evaluate and tweak model

▪ Score and assess model
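
A minimal sketch of a three-way split follows. The 60/20/20 proportions, the segment names, and the fixed random seed are assumptions; the deck says only that the data set is divided into three segments. One common reading is that one segment trains the model, one is used to evaluate and tweak it, and one is held back to score and assess the final model.

```python
import random


def three_way_split(rows, train=0.6, validate=0.2, seed=0):
    """Shuffle rows and cut them into train / validation / test segments."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train_set, validation_set, test_set = three_way_split(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the last segment untouched until the end gives an assessment that is not biased by the tweaking done in earlier iterations.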


Picking a Model

▪ Target variable characteristics (binary, continuous, etc.) typically dictate model selection

▪ Model selection

− Assessment via accuracy and error

− Different models can select different variables as predictors


Predictive Modeling Environment


Model – Decision Tree


Scoring - How good is the Model?

▪ Mission dictates model accuracy requirements

▪ Many different measurements of goodness

− Model confidence

− Two types of error

• Number of people who were predicted to be released AND were not

• Number of people who were not predicted to be released AND were

− A number of other scoring mechanisms
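
The two error types can be counted directly from predictions and outcomes. The sketch below uses made-up values, with True meaning “released”; the dictionary keys are illustrative names, not terms from the deck.

```python
def error_counts(predicted, actual):
    """Count both error types for a released / not-released prediction."""
    pairs = list(zip(predicted, actual))
    return {
        # Predicted to be released AND were not.
        "predicted_released_but_held": sum(p and not a for p, a in pairs),
        # Not predicted to be released AND were.
        "predicted_held_but_released": sum(a and not p for p, a in pairs),
        "accuracy": sum(p == a for p, a in pairs) / len(pairs),
    }


# Made-up predictions and outcomes for six arrestees.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
counts = error_counts(predicted, actual)
print(counts)
```

Which of the two errors matters more depends on the mission, so the counts are usually reported separately rather than folded into a single accuracy figure.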


Disappointment

▪ Horrible accuracy and error

▪ Re-think assumptions

▪ Aha moment


Deployment

▪ Models (or rules) get deployed to the mission environment

− Can deploy more than one model

▪ Model should exploit new data as it arrives

▪ Predictive power of models must be monitored over time

− Develop thresholds that define the limits of allowable model variance; if the model exceeds a threshold, it must be re-calibrated

− Need to establish a monitoring mechanism
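
A monitoring check of this kind might look like the sketch below. Accuracy as the monitored quantity, the 0.05 allowable drop, and the weekly figures are all assumptions made for illustration.

```python
def needs_recalibration(baseline_accuracy, recent_accuracies, max_drop=0.05):
    """Flag the model once average recent accuracy falls too far below baseline."""
    recent = sum(recent_accuracies) / len(recent_accuracies)
    return (baseline_accuracy - recent) > max_drop


# Made-up accuracy figures measured on newly arriving data.
weekly = [0.81, 0.79, 0.74, 0.70]
flag = needs_recalibration(0.82, weekly)
print(flag)
```

Running the check on a schedule, against outcomes that arrive after deployment, is one simple form the monitoring mechanism could take.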