
Upload: harvey-stevenson

Post on 26-Dec-2015


Page 1: Data Analysis for Forensic Scientists: FOS 822, Spring 2015 Everything you always wanted to know but were afraid to ask

Data Analysis for Forensic Scientists:


FOS 822, Spring 2015

Everything you always wanted to know but were afraid to ask

Page 2: “Legal” Science

• Daubert is a benchmark!
  • Daubert (1993): Judges are the “gatekeepers” of scientific evidence.
  • Must determine if the science is reliable
    • Has empirical testing been done?
      • Falsifiability
    • Has the science been subject to peer review?
    • Are there known error rates?
    • Is there general acceptance?
  • Federal Government and 26(-ish) States are “Daubert States”

Page 3: Measurement and Randomness

• Any time an observation is made, one is making a “measurement”
  • As all scientists know, almost no two measurements of the same quantity under the same conditions will ever agree exactly
    1. Experimental error is inherent in every measurement
      • Refers to variation in observations between repetitions of the same experiment.
      • It is unavoidable and many sources contribute
    2. Error in a statistical context is a technical term (BHH)

Page 4: Measurement and Randomness

• Experimental error is a form of randomness
  • Randomness: inherent unpredictability in a process
    • The outcomes of the process follow a probability distribution
• Statistical tools are used to both:
  • Describe the randomness
  • Make inferences taking into account the randomness

Page 5: Probability

• Frequency: ratio of the number of observations of interest (n_i) to the total number of observations (N)
• Probability (frequentist): frequency of observation i in the limit of a very large number of observations
  • This definition is falsifiable (i.e. testable)
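The frequentist definition can be illustrated with a small simulation. This is a minimal Python sketch, not part of the original slides: the fair six-sided die, the seed, and the function name `relative_frequency` are all assumptions made for illustration. It computes n_i / N for increasing N so the frequency can be compared with 1/6.

```python
import random

def relative_frequency(outcomes, target):
    """Return n_i / N: the share of observations equal to `target`."""
    return sum(1 for o in outcomes if o == target) / len(outcomes)

# Assumed example process: rolls of a fair six-sided die.
rng = random.Random(822)  # arbitrary fixed seed so the run is repeatable

for n_obs in (10, 100, 10_000, 100_000):
    rolls = [rng.randint(1, 6) for _ in range(n_obs)]
    # Frequency of the outcome "6"; it should settle near 1/6 as N grows.
    print(n_obs, round(relative_frequency(rolls, 6), 4))
```

For small N the frequency can sit far from 1/6; the definition only refers to the large-N limit.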

Page 6: Probability

• Belief: A “Bayesian’s” interpretation of probability.
  • The probability of an observation (outcome, event) is a “measure of the state of knowledge” (Jaynes).
    • Bayesian probabilities reflect degree of belief and can be assigned to any statement
    • Beliefs (probabilities) can be updated in light of new evidence (data) via Bayes’ theorem.
  • This definition is not always falsifiable
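Updating a belief with Bayes' theorem can be sketched for a single yes/no hypothesis H. This is an illustrative Python sketch, not the course's code: the function name `bayes_update`, the prior of 0.5, and the likelihoods 0.9 and 0.2 are assumed values chosen only to show the mechanics.

```python
def bayes_update(prior, p_data_given_h, p_data_given_not_h):
    """Posterior P(H | data) from Bayes' theorem for a yes/no hypothesis H."""
    # Total probability of the data under both alternatives.
    evidence = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / evidence

# Assumed numbers: start undecided, then see three independent pieces of
# evidence, each more probable if H is true (0.9) than if it is false (0.2).
belief = 0.5
for _ in range(3):
    belief = bayes_update(belief, 0.9, 0.2)  # yesterday's posterior is today's prior
    print(round(belief, 4))
```

Each pass multiplies the odds on H by the same likelihood ratio (0.9 / 0.2), which is why the belief climbs steadily toward 1.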

Page 7: Probability

• Algorithmic Probability: How do you rigorously assign a probability to an observation on which you have no data?
  • Solomonoff, Kolmogorov, Levin, and Chaitin came up with a way.
• The basic idea:
  • Patterns which result from "computation" are relatively likely
  • Patterns that cannot be produced by any computational process are relatively unlikely.

Page 8: Probability

• Formally, a computable process produces an observation:
  • A program is executed on a theoretical computer (universal Turing machine) and produces the observation as output.
• Algorithmic probability, P(observation):
  • Probability that the output of a Turing machine is the observation when programs made of "fair coin flips" are run
    • A binary program of randomly drawn 1s and 0s, each with a probability of ½.

Page 9: Probability

• The definition of algorithmic probability can be made mathematically rigorous, but does not yield exactly computable results.
• Approximation schemes can be made but so far have been difficult to put into practice.
  • The length of the shortest program producing the observation is called the observation’s Kolmogorov complexity.

Page 10: What is Statistics?

• Study of relationships in data
• Descriptive Statistics: techniques to summarize data
  • E.g. mean, median, mode, range, standard deviation, stem and leaf plots, histograms, box and whiskers plots, etc.
• Inferential Statistics: techniques to draw conclusions from a given data set taking into account inherent randomness
  • E.g. confidence intervals, hypothesis testing, Bayes’ theorem, forecasting, etc.

Page 11: Why do we use statistical tools?

• For the Sciences, we ask:
  • Are the differences in measurements characterizing two (or more) objects real or just due to (the characteristic) randomness?
• Furthermore, for the Forensic Sciences we ask:
  • Do two pieces of evidence originate from a common source?
    • For this, we must at least answer the above.

Page 12: Population and Sample

• Almost all of statistics is based on a sample drawn from a population.
  • Population: The totality of observations that might occur as a result of repeatedly performing an experiment
    • Why not measure the whole population?
      • Usually impossible
      • Likely wasteful
  • Population should be relevant.
    • Part logic
    • Part guess
    • Part philosophy…

Page 13: Population and Sample

• Sampling:
  • Sample: a few observations that are made from a population
  • Draw members out of a population with some given probability
    • Random sample: if all observations have an equal chance of being made and no observation affects any other
  • Want a random sample to be representative of the population
    • Biased sample if not the case*

Page 14: Data and Sampling

• Sample Representations:

[Figure: three panels. One shows a population with a representative sample; two show a population with a biased sample.]

Page 15: Data and Sampling

• Types of sampling:
  • (Simple) Random Sampling
    • Every data item is selected independently of every other.
    • Every member of a population has an equal chance of being selected
  • Systematic Sampling
    • Pick every kth data item to be in the sample
    • Easier to conduct but risks getting a biased sample
  • Stratified Sampling
    • Partition population into disjoint groups containing specific attributes of a particular category (strata)
    • Random sample from the groups
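The three sampling schemes can be compared side by side on a toy population. This is a hedged Python sketch using only the standard library: the 30-member population, its `group` attribute, the seed, and the sample sizes are all hypothetical and exist only to make the mechanics concrete.

```python
import random

rng = random.Random(2015)  # arbitrary fixed seed for repeatability

# Hypothetical population: 30 items, each tagged with one of three strata.
population = [{"id": i, "group": "ABC"[i % 3]} for i in range(30)]

# (Simple) random sampling: every member has the same chance of selection.
simple = rng.sample(population, 6)

# Systematic sampling: pick every k-th item after a random starting point.
k = 5
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: partition into disjoint groups, then sample each group.
stratified = []
for g in "ABC":
    stratum = [member for member in population if member["group"] == g]
    stratified.extend(rng.sample(stratum, 2))

print([m["id"] for m in simple])
print([m["id"] for m in systematic])
print([(m["id"], m["group"]) for m in stratified])
```

Note that the systematic sample is fully determined by its starting point, which is where the risk of bias comes from if the population has a periodic pattern.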

Page 16: Data and Sampling

• Types of sampling (cont’d):
  • The Bootstrap
    • Sample from a population, preferably as large as possible.
      • Say sample size is n
    • Bootstrap sample: Sample with replacement out of the original sample to build a new sample of size n
      • Do this hundreds or thousands of times
    • Use statistics from each bootstrap sample to get an idea of population variation
    • Computationally intensive but often works well compared to traditional methods
    • Free of many traditional assumptions
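The bootstrap recipe above can be sketched in a few lines. This is a minimal Python sketch under stated assumptions: the statistic is the mean, the number of resamples (2000) and the seed are arbitrary, the function name `bootstrap_means` is hypothetical, and the data are the ethylbenzene values for gasoline sample ID 1 from the data frame slide, rounded to four decimals. The percentile interval at the end is one common way to summarize the resampled statistics, not necessarily the one used in the course.

```python
import random
import statistics

def bootstrap_means(sample, n_boot=2000, seed=0):
    """Resample `sample` with replacement n_boot times; return each resample's mean."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]

# Ethylbenzene peak areas for gasoline ID 1 (rounded from the data frame slide).
sample = [0.3739, 0.3821, 0.3910, 0.3593, 0.3796, 0.3825, 0.3932]

means = sorted(bootstrap_means(sample))
# Assumed summary: a simple 95% percentile interval for the mean.
lo = means[int(0.025 * len(means))]
hi = means[int(0.975 * len(means)) - 1]
print(round(statistics.fmean(sample), 4), round(lo, 4), round(hi, 4))
```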

Page 17: Parameters and Statistics

• Parameter: any function of the population
• Statistic: any function of a sample from the population
  • Statistics are used to estimate population parameters
    • Statistics can be biased or unbiased
    • Sample average is an unbiased estimator for population mean
  • We may construct distributions for statistics
    • Populations have distributions for observations
    • Samples have distributions for observations and statistics

Page 18: Univariate vs. Multivariate Statistics

• Univariate Statistics: Statistical tools used to analyze one random variable
  • Random variable could be a raw observation or a statistic
  • Common tools are: (univariate) hypothesis testing, ANOVA, linear regression
• Multivariate Statistics: Statistical tools used to analyze many random variables
  • Random variables can also be raw observations (often encountered in chemometrics) or statistics (currently popular in marketing, finance, surface metrology)

Page 19: Why Use Multivariate Statistics?

• Don’t, if you can see clear differences/similarities in your data and can clearly articulate how in court!
• If you can’t differentiate, or want to study/search for differences within a well-defined population
  • AND univariate methods don’t do the trick:
  • A linear (or non-linear) combination of many experimental variables (multivariate) may do the trick!

Page 20: Some Important Terms

• Vector: A list of numbers or attributes characterizing an observation or experiment
• Vectors can be pictures!
  • Represent normalized intensities of mixture components as arrows:

Page 21: Multivariate Feature Vectors

• Random variables: All measurements have an associated “randomness” component
  • Randomness: patternless, unstructured, typical, total ignorance (Chaitin, Claude)
• For an experiment/observation, put many measurements together into a list
  • Collect the random variables into a list called a random vector
    • Also called: observation vectors, feature vectors

Page 22: Pick Features

Page 23: Example Feature Vector

• GC-MS instrument output for a gasoline:

Page 24: Brief Description of Data Sets

• Gasoline (gas) GC-MS for 20 casework samples
  • Collected by Mark Gil, NYPD Crime Lab at the time
  • 15 normalized peak areas characterize each chromatogram
  • 3 to 7 replicates per sample
  • 92 chromatograms
• ¼ in Screwdriver striation marks for 9 screwdrivers (tool)
  • Collected by Nicholas Petraco, Petraco Forensic Consulting and NYPD Crime Lab

Page 25: Brief Description of Data Sets

• ¼ in Screwdriver striation mark data (cont’d)
  • 140D binary vectors characterize each striation pattern
  • 6 to 9 replicate patterns per screwdriver
  • 75 patterns total
  • Also ~750 simulated patterns (toolsim)

Page 26: Brief Description of Data Sets

• Neel and Wells Consecutive Matching Striae (CMS) study(s)
  • AFTE J 39(3):176-198 2007 (Part I)
  • Enumerated CMS runs for a large set of known match (KM) and known non-match (KNM) comparisons
    • 4188 comparisons
    • Various toolmark sources (cf. p. 179, second column)

Page 27: Brief Description of Data Sets

• LAM 2011 study
  • Mohammed et al., “The dynamic character of disguise behaviour for text-based, mixed, and stylized signatures”
    • J Forensic Sci 56(1), S136-S141 (2011)
  • Variation of dynamic signature parameters with signing style and conditions
    • Params: duration, size, velocity, jerk, and pen pressure
    • Style: text-based, stylized, and mixed
    • Conditions: genuine, disguised, and auto-simulation
  • Ninety writers: 10 genuine sigs, five disguised sigs, five auto-sim sigs.
  • 1800 signatures total collected using a digitizing tablet

Page 28: Brief Description of Data Sets

• Glass data (glass) from 6 glass types
  • Collected by Forensic Science Service, UK
  • “Famous” (infamous…) data set
  • 9 variables: RI, Na, Mg, Al, Si, K, Ca, Ba, Fe
  • 9-76 replicates per sample…
  • 242 glass specimens total
• James Curran’s collected data sets in his R package dafs. See CRAN.
  • Lots of stuff. Explore it.

Page 29: Brief Description of Data Sets

• Dust data (dust) from 10 locations
  • Collected by Nicholas Petraco, Petraco Forensic Consulting and NYPD Crime Lab
  • 342 variables to characterize each sample
    • 1/0 = present/absent in sample
  • 3 replicates per sample
  • 30 dust specimens total

Page 30: Brief Description of Data Sets

• Typewritten letters in different fonts data (lett)
  • Benchmark machine learning test set
  • 17 variables: All easy to obtain
  • Lots of replicates per letter
  • 20,000 examples total
• Handwritten digits (0-9) from US zip codes (zip)
  • Benchmark set from USPS
  • 256 variables: Digitized/normalized grey levels
  • Lots of replicates per number
  • 9298 examples total

Page 31: First Thing: Look at your Data

• Data frame (data matrix), with n observation vectors (rows) and p variables (columns):

$$
X = \begin{pmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,p} \\
x_{2,1} & x_{2,2} &         &        & \vdots  \\
x_{3,1} &         & \ddots  &        &         \\
\vdots  &         &         & \ddots &         \\
x_{n,1} & \cdots  &         &        & x_{n,p}
\end{pmatrix}
$$

Part of data frame for composition of gasoline:

      ID  Ethylbenzene  m.p.Xylene  o.Xylene   Propylbenzene  …
  1   1   0.3738972     1.5189473   0.509374   0.17058569
  2   1   0.3821145     1.4975333   0.4869311  0.15266414
  3   1   0.3910006     1.5735967   0.5140853  0.17097528
  4   1   0.3592879     1.0931521   0.469633   0.14356119
  5   1   0.379583      1.4976838   0.5004265  0.16732263
  6   1   0.3824838     1.5347461   0.5003289  0.15989651
  7   1   0.3932254     1.5370547   0.5191838  0.1693152
  8   2   0.1697284     0.7243938   0.2739452  0.07111785
  9   2   0.1730064     0.7494535   0.2791126  0.07370284
 10   2   0.1587106     0.684664    0.2484791  0.06270977
 11   2   0.1668295     0.6983527   0.2586032  0.06568599
 12   2   0.1655228     0.7036451   0.2689125  0.07029099
 13   2   0.1645599     0.6938837   0.2546212  0.0656616
 14   2   0.1544826     0.6472038   0.2439379  0.06100052
 15   3   0.1575096     0.5890765   0.2220248  0.0556264
 16   3   0.1610904     0.6069997   0.2319318  0.05969297
 17   3   0.1535362     0.5693021   0.2268586  0.06187798
 18   3   0.1532304     0.5735848   0.2147113  0.0535368
 19   3   0.1664476     0.6248774   0.2396432  0.06084041

Page 32: First Thing: Look at your Data

• Explore the Glass dataset of the mlbench package
  • Scatter plots: plot any two variables against each other

Page 33: First Thing: Look at your Data

• Pairs plots: do many scatter plots at once

Page 34: First Thing: Look at your Data

• Histograms: “bin” a variable and plot frequencies
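Binning is simple enough to write out by hand. The following is an illustrative Python sketch, not the plotting code behind the slide: `histogram_counts` is a hypothetical helper, equal-width bins are an assumption, and the data are the m.p.Xylene values from the gasoline data frame slide, rounded to four decimals. The text bars stand in for a real plot.

```python
def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# m.p.Xylene peak areas (rounded) for the 19 rows shown on the data frame slide.
mp_xylene = [1.5189, 1.4975, 1.5736, 1.0932, 1.4977, 1.5347, 1.5371,
             0.7244, 0.7495, 0.6847, 0.6984, 0.7036, 0.6939, 0.6472,
             0.5891, 0.6070, 0.5693, 0.5736, 0.6249]

edges, counts = histogram_counts(mp_xylene, 5)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:.3f} to {right:.3f}: {'#' * c}")
```

The choice of bin count changes the picture considerably, so it is worth trying several values.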

Page 35: First Thing: Look at your Data

• Histograms conditioned on other variables: use lattice package
  • RIs conditioned on glass group membership

Page 36: First Thing: Look at your Data

• Probability density plots: also needs lattice

Page 37: First Thing: Look at your Data

• Empirical probability distribution plots: also called empirical cumulative distribution function (ECDF) plots
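The quantity drawn in an empirical cumulative distribution plot is the fraction of observations at or below each value. A minimal Python sketch, assuming a hypothetical helper named `ecdf` and using rounded ethylbenzene values for gasoline sample ID 1 from the data frame slide:

```python
def ecdf(values):
    """Return (x, fraction of observations <= x) pairs, sorted by x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Ethylbenzene peak areas for gasoline ID 1 (rounded from the data frame slide).
ethylbenzene = [0.3739, 0.3821, 0.3910, 0.3593, 0.3796, 0.3825, 0.3932]

for x, p in ecdf(ethylbenzene):
    print(f"{x:.4f}  {p:.3f}")
```

Plotting the pairs as a step function gives the ECDF; unlike a histogram, it involves no binning choice.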

Page 38: First Thing: Look at your Data

• Box and Whiskers plots:

[Figure: box and whiskers plot of RI (axis from 1.5188 to 1.5192), annotated with the 1st quartile (25th percentile), the median (50th percentile), the 3rd quartile (75th percentile), the range, and possible outliers on either side.]
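The quantities annotated on a box and whiskers plot can be computed directly. This Python sketch is illustrative only: the slide does not say how "possible outliers" are flagged, so the 1.5 × IQR fence used here is an assumed, commonly used convention, and `box_summary` is a hypothetical helper name. The data are the rounded m.p.Xylene values for gasoline sample ID 1 from the data frame slide.

```python
import statistics

def box_summary(values, fence=1.5):
    """Quartiles plus points outside the (assumed) 1.5 * IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    outliers = [v for v in values if v < low or v > high]
    return q1, median, q3, outliers

# m.p.Xylene peak areas for gasoline ID 1 (rounded from the data frame slide).
mp_xylene_id1 = [1.5189, 1.4975, 1.5736, 1.0932, 1.4977, 1.5347, 1.5371]

q1, median, q3, outliers = box_summary(mp_xylene_id1)
print(q1, median, q3, outliers)
```

Different software uses slightly different quartile definitions, so the box edges may not match another package's plot exactly.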

Page 39: Visualizing Data

• Note the relationship:

Page 40: First Thing: Look at your Data

• Box and Whiskers plots:

[Figure: two panels. Box-Whiskers plots for actual variable values; Box-Whiskers plots for scaled variable values.]

Page 41: “Variability”

• Variance: a measure of variability of an experimental quantity
  • “Spread” of measurements about the average, $s^2$:

[Figure: bell curve with ~68% of observations within ±1s, ~95% within ±2s, and ~99% within ±3s.]

Page 42: Measures of Data Spread

• Sample variance:

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
$$

where $x_i$ is data point i, $\bar{x}$ is the sample mean, and there are n data points.

  • (Almost) the average of squared deviations from the sample mean.
• Standard deviation is $s = \sqrt{s^2}$
  • The sample mean and standard dev. are the most common measures of central tendency and spread
  • Sample mean and standard dev. have the same units
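The sample variance and standard deviation can be computed straight from their definitions. A minimal Python sketch, with the helper name `sample_variance` being an assumption; the data are rounded ethylbenzene values for gasoline sample ID 1 from the data frame slide, and the standard library's `statistics.variance` (which also divides by n - 1) is printed as a cross-check.

```python
import math
import statistics

def sample_variance(xs):
    """(Almost) the average squared deviation from the mean: divide by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

# Ethylbenzene peak areas for gasoline ID 1 (rounded from the data frame slide).
ethylbenzene = [0.3739, 0.3821, 0.3910, 0.3593, 0.3796, 0.3825, 0.3932]

s2 = sample_variance(ethylbenzene)
s = math.sqrt(s2)  # standard deviation, in the same units as the data
print(s2, s)
print(statistics.variance(ethylbenzene), statistics.stdev(ethylbenzene))
```

Dividing by n - 1 rather than n is what the "(almost)" refers to.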

Page 43: “Variability”

• Covariance: variability of two measured quantities with each other:

$$
s_{i,j} = \frac{1}{n-1} \sum_{k=1}^{n} \left( x_{k,i} - \bar{x}_i \right) \left( x_{k,j} - \bar{x}_j \right)
$$

where $x_{k,i}$ is obs. #k of var. #i and $\bar{x}_i$ is the avg. of var. #i.

  • As one quantity increases the other increases: $s_{i,j}$ positive
  • As one quantity increases the other decreases: $s_{i,j}$ negative
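The sign behaviour of the covariance can be checked numerically. This Python sketch assumes the usual sample covariance with an n - 1 divisor (matching the sample variance); `sample_covariance` is a hypothetical helper, and the two variables are rounded ethylbenzene and m.p.Xylene values for gasoline sample ID 3 from the data frame slide.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long lists, using an n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Gasoline ID 3 (rounded from the data frame slide).
ethylbenzene = [0.1575, 0.1611, 0.1535, 0.1532, 0.1664]
mp_xylene = [0.5891, 0.6070, 0.5693, 0.5736, 0.6249]

# The two peak areas rise and fall together, so this is positive.
print(sample_covariance(ethylbenzene, mp_xylene))

# Flipping the sign of one variable makes them move oppositely: negative.
print(sample_covariance(ethylbenzene, [-y for y in mp_xylene]))
```

The covariance of a variable with itself is just its variance, which is a handy sanity check.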

Page 44: Basic Data Pre-processing

• Mean Centering: Subtract the mean of each column in X
  • Puts the origin at the data’s “center of mass”
• Variance Scaling: Divide each column by its standard deviation
  • Weights each column (variable) to have the same importance
  • All variables will have the same variance = 1
• Autoscaling: Mean center and variance scale X
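Autoscaling can be sketched column by column. This is a standard-library Python illustration, not the course's basic_preprocessing.R exercise: `autoscale` is a hypothetical helper, the sample standard deviation (n - 1 divisor) is assumed, and X holds the ethylbenzene and m.p.Xylene columns for rows 1, 2, 8, 9, 15 and 16 of the gasoline data frame slide.

```python
import statistics

def autoscale(rows):
    """Mean center and variance scale each column of a list-of-rows matrix."""
    scaled_columns = []
    for column in zip(*rows):            # iterate over variables (columns)
        mean = statistics.fmean(column)  # mean centering moves the origin here
        sd = statistics.stdev(column)    # variance scaling divides by this
        scaled_columns.append([(v - mean) / sd for v in column])
    return [list(row) for row in zip(*scaled_columns)]  # back to rows

# Columns: Ethylbenzene, m.p.Xylene (values from the data frame slide).
X = [
    [0.3738972, 1.5189473],
    [0.3821145, 1.4975333],
    [0.1697284, 0.7243938],
    [0.1730064, 0.7494535],
    [0.1575096, 0.5890765],
    [0.1610904, 0.6069997],
]

X_auto = autoscale(X)
for column in zip(*X_auto):
    # Each autoscaled column should have mean ~0 and variance ~1.
    print(round(statistics.fmean(column), 6), round(statistics.variance(column), 6))
```

After autoscaling, a variable with small raw peak areas carries the same weight as one with large peak areas.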

Page 45: Basic Data Pre-processing

• Some Gasoline Data:

[Figure: two panels. Raw Peak Areas; Mean Centered.]

Page 46: Basic Data Pre-processing

• Some Gasoline Data:

[Figure: two panels. Raw Peak Areas, with axes labeled Var = 0.014 and Var = 0.002; Variance Scaled, with both axes labeled Var = 1.]

Page 47: Basic Data Pre-processing

• Some Gasoline Data:

[Figure: Autoscaled panel, with both axes labeled Var = 1.]

*Exercise: basic_preprocessing.R