Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Upload: wesley-washington

Post on 13-Jan-2016


Page 1: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Introduction to Biostatistics and Bioinformatics

Exploring Data and Descriptive Statistics

Page 2: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Learning Objectives

Python matplotlib library to visualize data:
• Scatter plot
• Histogram
• Kernel density estimate
• Box plots

Descriptive statistics:
• Mean and median
• Standard deviation and inter quartile range
• Central limit theorem

Page 3: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

An Example Data Set

0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094

Page 4: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Scatter Plot

[Figure: scatter plot of the 20 example values; x-axis: Order of Measurement, y-axis: Measurement]

Page 5: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Histogram

[Figure: scatter plot (x-axis: Order of Measurement, y-axis: Measurement) next to three histograms (x-axis: Measurement, y-axis: Number of Measurements) with bin size = 0.1, bin size = 0.05 and bin size = 0.025]
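To make the effect of the bin size concrete, the counts behind such histograms can be tabulated directly. The following is a minimal sketch using only the Python standard library; the helper name `bin_counts` and the convention that bin k covers [k*bin_size, (k+1)*bin_size) are assumptions made for illustration and are not taken from the lecture scripts.

```python
import math
from collections import Counter

# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]


def bin_counts(values, bin_size):
    """Count values per bin; bin k covers [k*bin_size, (k+1)*bin_size).

    Hypothetical helper: the bin-edge convention is an assumption.
    """
    counts = Counter(math.floor(v / bin_size) for v in values)
    return dict(sorted(counts.items()))


for bin_size in (0.1, 0.05, 0.025):
    print("bin size", bin_size)
    for k, c in bin_counts(data, bin_size).items():
        # One '#' per measurement that falls into the bin.
        print("  [%+.3f, %+.3f): %s" % (k * bin_size, (k + 1) * bin_size, "#" * c))
```

With only 20 points, the smaller bin sizes resolve more detail but leave many bins holding a single measurement or none, which is the trade-off the three panels illustrate.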

Page 6: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Cumulative Distributions

[Figure: scatter plot (x-axis: Order of Measurement, y-axis: Measurement) next to the cumulative distribution (x-axis: Measurement, y-axis: Cumulative Frequency)]
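A cumulative distribution needs no binning: sort the measurements and step the cumulative frequency up at each one. The sketch below is one way to do this with the standard library; the function name `empirical_cdf` and the convention that the i-th smallest value gets frequency i/n are assumptions for illustration.

```python
# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]


def empirical_cdf(values):
    """Return (x, cumulative frequency) pairs for the sorted values.

    Assumed convention: the i-th smallest value (1-based) gets frequency i/n.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


cdf = empirical_cdf(data)
for x, f in cdf:
    print("%+.3f  %.2f" % (x, f))
```

Exercise 1 (d) uses a slightly different convention, spacing the frequencies evenly from 0 to 1 with `np.linspace`; for plotting purposes the two give nearly the same curve.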

Page 7: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Kernel Density Estimate

[Figure: scatter plot (x-axis: Order of Measurement, y-axis: Measurement) next to the kernel density estimate (x-axis: Measurement, y-axis: Number of Measurements)]
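A kernel density estimate replaces each measurement with a small smooth bump and adds the bumps up. The following is a rough sketch with Gaussian kernels written out by hand; the helper name `make_kde` and the hand-picked bandwidth of 0.05 are assumptions for illustration, not values from the lecture.

```python
import math

# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]


def make_kde(values, bandwidth):
    """Return a function x -> density estimate built from Gaussian kernels.

    Hypothetical helper; `bandwidth` is the standard deviation of each kernel.
    """
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                          for v in values)

    return density


kde = make_kde(data, bandwidth=0.05)  # bandwidth chosen by hand (assumption)

# Evaluate on a grid and check numerically that the estimate integrates to ~1.
step = 0.001
grid = [-1.0 + i * step for i in range(2001)]
area = sum(kde(x) for x in grid) * step
print("area under the estimate: %.4f" % area)
```

In practice `scipy.stats.gaussian_kde`, used in Exercise 1 (e), does the same job and selects a bandwidth automatically (Scott's rule by default), so the hand-written version is mainly useful for seeing what the estimate is made of.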

Page 8: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Original Distribution

[Figure: four panels for the example data: scatter plot (x-axis: Order of Measurement, y-axis: Measurement), Original Distribution, Kernel Density Estimate and Histogram with bin size = 0.05 (x-axis: Measurement; y-axis: Number of Measurements or Frequency)]

Page 9: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

More Data

[Figure: the same four panels for a larger data set: scatter plot (x-axis: Order of Measurement, y-axis: Measurement), Original Distribution, Kernel Density Estimate and Histogram with bin size = 0.05 (x-axis: Measurement; y-axis: Number of Measurements or Frequency)]

Page 10: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 1

Download ibb2015_7_exercise1.py

(a) Draw 20 points from a normal distribution with mean=0 and standard deviation=0.1.

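import numpy as np

points = 20  # number of data points
# Scale standard-normal draws by 0.1 to get standard deviation 0.1 (mean 0).
y = 0.1 * np.random.normal(size=points)
print(y)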

[-0.09946073 -0.19612617 0.03442682 0.02622746 -0.28418124 -0.04245968 0.05922837 0.01199874 0.13454915 -0.07482707 -0.11688758 0.01714036 0.03280043 0.01356022 0.09128649 -0.18923468 0.14536047 -0.07764629 -0.0349553 0.04300367]

Page 11: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 1

(b) Make scatter plot of the 20 points.

import matplotlib.pyplot as plt

x = range(1, points + 1)
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.scatter(x, y, color='red', lw=0, s=40)
ax1.set_xlim([0, points + 1])
ax1.set_ylim([-1, 1])
fig.savefig('ibb2015_7_exercise1_scatter_points' + str(points) + '.png',
            dpi=300, bbox_inches='tight')
plt.close(fig)

Page 12: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 1

(c) Plot histograms.

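# 20, 40 and 80 bins over the range [-1, 1] give bin sizes 0.1, 0.05 and 0.025.
for bins in [20, 40, 80]:
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    # density=True normalizes the histogram to unit area.
    ax1.hist(y, bins=bins, histtype='step', color='black',
             range=[-1, 1], lw=2, density=True)
    ax1.set_xlim([-1, 1])
    fig.savefig('ibb2015_7_exercise1_bin' + str(bins) + '_points' + str(points) + '.png',
                dpi=300, bbox_inches='tight')
    plt.close(fig)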

Page 13: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 1

(d) Plot cumulative distribution.

y_cumulative = np.linspace(0, 1, points)  # cumulative frequency from 0 to 1
x_cumulative = np.sort(y)                 # measurements in increasing order
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.plot(x_cumulative, y_cumulative, color='black', lw=2)
ax1.set_xlim([-1, 1])
ax1.set_ylim([0, 1])
fig.savefig('ibb2015_7_exercise1_cumulative_points' + str(points) + '.png',
            dpi=300, bbox_inches='tight')
plt.close(fig)

Page 14: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 1

(e) Plot kernel density estimate.

import scipy.stats as stats

kde_points = 1000
kde_x = np.linspace(-1, 1, kde_points)  # grid on which the estimate is evaluated
fig, ax1 = plt.subplots(1, figsize=(6, 6))
kde_y = stats.gaussian_kde(y)           # callable density estimate
ax1.plot(kde_x, kde_y(kde_x), color='black', lw=2)
ax1.set_xlim([-1, 1])
fig.savefig('ibb2015_7_exercise1_kde_points' + str(points) + '.png',
            dpi=300, bbox_inches='tight')
plt.close(fig)

Page 15: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Comparing Measurements

Page 16: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Comparing Measurements – Cumulative distributions

Page 17: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Systematic Shifts

Page 18: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 2

Download ibb2015_7_exercise2.py

(a) Generate 5 data sets with 20 data points each from normal distributions with means = 0, 0, 0.1, 0.5 and 0.3 and standard deviation=0.1.

y = []
for j in range(5):
    y.append(0.1 * np.random.normal(size=20))
# Shift the last three data sets to means 0.1, 0.5 and 0.3.
y[2] += 0.1
y[3] += 0.5
y[4] += 0.3
print(y)

Page 19: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 2

(b) Make scatter plots for the 5 data sets.

sixcolors = ['#D4C6DF', '#8968AC', '#3D6570', '#91732B', '#963725', '#4D0132']

fig, ax1 = plt.subplots(1, figsize=(6, 6))
for j in range(5):
    # Spread the 20 points of data set j around x = j+1 (from j+0.8 to j+1.2).
    ax1.scatter(np.linspace(j + 1 - 0.2, j + 1 + 0.2, 20), y[j],
                color=sixcolors[6 - (j + 1)], lw=0, alpha=1)
ax1.set_xlim([0, 6])
ax1.set_ylim([-1, 1])
fig.savefig('ibb2015_7_exercise2_scatter_sample' + str(20),
            dpi=300, bbox_inches='tight')
plt.close(fig)

Page 20: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Correlation Between Two Variables

Page 21: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Correlation Between Two Variables

Page 22: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Correlation Between Two Variables

Page 23: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Correlation Between Two Variables

Page 24: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Correlation Between Two Variables

Page 25: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Data Visualization

http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html

Page 26: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Process of Statistical Analysis

Population

Random Sample

Sample Statistics

Describe

Make Inferences

Page 27: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Distributions

[Figure: four example distributions (Complex, Normal, Skewed, Long tails), each shown for n=3, n=10 and n=100]

Page 28: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Mean

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
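As a worked instance of the definition of the mean, the sketch below applies it to the 20 values of the example data set and cross-checks the result against the standard library. Only the data come from the slides; the variable names are illustrative.

```python
import statistics

# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]

n = len(data)
mean = sum(data) / n  # (1/n) times the sum of all x_i
print("n = %d, mean = %.5f" % (n, mean))

# Cross-check against the standard library implementation.
print("statistics.mean: %.5f" % statistics.mean(data))
```

With NumPy, which the exercises use, the same quantity is `np.mean(data)`.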

Page 29: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Mean - Sample Size

[Figure: mean versus sample size (0 to 100) for the Normal Distribution; the mean axis runs from -0.2 to 0.2]

Page 30: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Mean – Sample Size

[Figure: mean versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]

Page 31: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Mode, Maximum and Minimum

Sample: $x_1, x_2, \ldots, x_n$

Maximum: $\max(x_1, x_2, \ldots, x_n)$

Minimum: $\min(x_1, x_2, \ldots, x_n)$

Mode: the most common value
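These three quantities map directly onto built-in Python functions. The short sketch below uses a small made-up sample of integer counts (not data from the lecture), chosen so that one value repeats and the mode is well defined.

```python
import statistics

# Made-up sample for illustration only; the value 5 occurs three times.
sample = [3, 5, 2, 5, 4, 5, 2, 1]

print("maximum:", max(sample))
print("minimum:", min(sample))
print("mode:   ", statistics.mode(sample))
```

For continuous measurements such as the example data set, every value is typically unique, so the mode is usually read off as the peak of a histogram or kernel density estimate rather than computed this way.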

Page 32: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Median, Quartiles and Percentiles

Sample: $x_1, x_2, \ldots, x_n$

Quartiles:
$x_i < Q_1$ for 25% of the sample
$x_i < Q_2$ for 50% of the sample (median)
$x_i < Q_3$ for 75% of the sample

Percentiles:
$x_i < P_m$ for m% of the sample
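Quartiles and percentiles can be read off the sorted sample. The sketch below is one possible implementation that interpolates linearly between neighbouring sorted values; the helper name `percentile` and the interpolation rule are assumptions (several conventions exist, and this one matches the default of `np.percentile`).

```python
import math
import statistics

# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]


def percentile(values, m):
    """Return the m-th percentile (0 <= m <= 100) by linear interpolation.

    Hypothetical helper; the interpolation rule is an assumption.
    """
    xs = sorted(values)
    pos = (len(xs) - 1) * m / 100.0  # fractional index into the sorted sample
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


q1, q2, q3 = (percentile(data, m) for m in (25, 50, 75))
print("Q1 = %.5f, Q2 (median) = %.5f, Q3 = %.5f" % (q1, q2, q3))
print("P90 = %.5f" % percentile(data, 90))
```

The second quartile coincides with the median, which for an even number of points is the average of the two middle values.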

Page 33: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Median and Mean – Sample Size

[Figure: mean and median (gray) versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]

Page 34: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Variance

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
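The variance can be computed in two passes: first the mean, then the average squared deviation from it. The sketch below does this for the example data set, dividing by n (an assumption about the intended normalization; the alternative divides by n-1), and cross-checks against `statistics.pvariance`.

```python
import math
import statistics

# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]

n = len(data)
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n  # average squared deviation
std_dev = math.sqrt(variance)
print("variance = %.6f, standard deviation = %.4f" % (variance, std_dev))

# Cross-check: pvariance also divides by n.
print("statistics.pvariance: %.6f" % statistics.pvariance(data))
```

`np.var` and `np.std` also divide by n by default; `statistics.variance` and `np.var(data, ddof=1)` divide by n-1 instead, which matters mostly for small samples.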

Page 35: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Variance – Sample Size

[Figure: variance versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]

Page 36: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Inter Quartile Range (IQR)

Sample: $x_1, x_2, \ldots, x_n$

Quartiles:
$x_i < Q_1$ for 25% of the sample
$x_i < Q_2$ for 50% of the sample (median)
$x_i < Q_3$ for 75% of the sample

Inter Quartile Range: $IQR = Q_3 - Q_1$
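The inter quartile range is the distance between the third and first quartiles. A minimal sketch for the example data set, using `statistics.quantiles` from the standard library (Python 3.8 or later); the choice of `method="inclusive"` is an assumption, and other quartile conventions give slightly different numbers.

```python
import statistics

# The 20 measurements from the example data set.
data = [0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
        0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
        0.165, -0.088, -0.137, 0.094]

# method="inclusive" treats the sample minimum and maximum as the
# 0th and 100th percentiles and interpolates linearly in between.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print("Q1 = %.5f, Q3 = %.5f, IQR = %.5f" % (q1, q3, iqr))
```

Because it depends only on the middle half of the sorted sample, the IQR is unaffected by how extreme the smallest and largest values are.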

Page 37: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Inter Quartile Range and Standard Deviation

[Figure: standard deviation and IQR/1.349 (gray) versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]
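The factor 1.349 puts the IQR on the same scale as the standard deviation: for a normal distribution the quartiles lie about 0.6745 standard deviations either side of the mean, so IQR is roughly 1.349 times sigma. The sketch below illustrates this with simulated data and then adds one extreme value to show how differently the two measures react. The helper name `robust_sigma`, the seed, the sample size and the outlier value are all assumptions for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def robust_sigma(values):
    """IQR / 1.349: an estimate of sigma that agrees with it for normal data."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 1.349


# 10,000 draws from a normal distribution with mean 0 and sigma 0.1.
sample = [random.gauss(0.0, 0.1) for _ in range(10000)]
std_clean = statistics.pstdev(sample)
robust_clean = robust_sigma(sample)

# Add a single made-up extreme value to mimic a long tail.
contaminated = sample + [100.0]
std_outlier = statistics.pstdev(contaminated)
robust_outlier = robust_sigma(contaminated)

print("clean:        std = %.4f, IQR/1.349 = %.4f" % (std_clean, robust_clean))
print("with outlier: std = %.4f, IQR/1.349 = %.4f" % (std_outlier, robust_outlier))
```

The standard deviation is dominated by the single extreme value while IQR/1.349 barely moves, which is the behaviour to look for when comparing the two for long-tailed distributions.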

Page 38: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Central Limit Theorem

For many underlying distributions, the sum of a large number of values converges to a normal distribution if:

• The values are drawn independently;
• The values are all drawn from the same distribution; and
• The distribution has a finite mean and variance.
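The theorem can be checked numerically. The sketch below sums independent draws from a strongly skewed distribution (an exponential with mean 1 and variance 1, chosen here as an assumption) and compares the fraction of sums within one and two standard deviations of their mean with the values expected for a normal distribution, about 0.683 and 0.954. The seed, `n` and `repeats` are illustrative choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

n = 50          # number of values in each sum (assumed)
repeats = 5000  # number of sums generated (assumed)

# Each sum adds n independent draws from the same exponential distribution.
sums = [sum(random.expovariate(1.0) for _ in range(n)) for _ in range(repeats)]

mu = statistics.mean(sums)
sigma = statistics.pstdev(sums)
within_1 = sum(abs(s - mu) < sigma for s in sums) / repeats
within_2 = sum(abs(s - mu) < 2 * sigma for s in sums) / repeats

print("mean of sums = %.2f (theory: %d)" % (mu, n))
print("std of sums  = %.2f (theory: %.2f)" % (sigma, n ** 0.5))
print("within 1 sigma: %.3f (normal: 0.683)" % within_1)
print("within 2 sigma: %.3f (normal: 0.954)" % within_2)
```

A distribution without a finite variance, such as the Cauchy distribution, would violate the third condition, and its sums would not settle into a normal shape however many values were added.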

Page 39: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Uncertainty in Determining the Mean

[Figure: distribution of the sample mean for the Complex, Normal, Skewed and Long tails distributions at sample sizes n=3, n=10 and n=100; the last panel uses n=10, n=100 and n=1000]

Page 40: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Standard Error of the Mean

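Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

Standard Error of the Mean: $\mathrm{s.e.m.} = \frac{\sigma}{\sqrt{n}}$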
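The standard error of the mean can be estimated from a single sample with the formula, or measured directly as the spread of the means of many independent samples. The sketch below does both for a normal distribution; the sample size, sigma, seed and number of repeats are assumptions for illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

n = 25          # sample size (assumed)
sigma = 0.1     # standard deviation of the parent normal distribution
repeats = 4000  # number of independent samples for the direct check

# s.e.m. from the formula, using one sample's own standard deviation.
sample = [random.gauss(0.0, sigma) for _ in range(n)]
sem_formula = statistics.pstdev(sample) / math.sqrt(n)

# Direct check: standard deviation of the means of many samples.
means = [statistics.mean([random.gauss(0.0, sigma) for _ in range(n)])
         for _ in range(repeats)]
sem_simulated = statistics.pstdev(means)

print("s.e.m. from one sample:     %.4f" % sem_formula)
print("spread of the sample means: %.4f" % sem_simulated)
print("sigma / sqrt(n):            %.4f" % (sigma / math.sqrt(n)))
```

The single-sample estimate fluctuates from sample to sample because it relies on that sample's standard deviation, whereas the simulated spread settles close to sigma divided by the square root of n.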

Page 41: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 3

Download ibb2015_7_exercise3.py

(a) Generate skewed data sets.

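sample_size = 10
# Candidate points (x, y), uniform over the rectangle [-1, 1] x [0, 1].
x_test = np.random.uniform(-1.0, 1.0, size=30 * sample_size)
y_test = np.random.uniform(0.0, 1.0, size=30 * sample_size)
# skew() comes from the exercise script and is not shown here.
y_test2 = skew(x_test, -0.1, 0.2, 10)
y_test2 /= max(y_test2)             # scale so the maximum is 1
x_test2 = x_test[y_test < y_test2]  # keep x where (x, y) lies under the curve
x_sample = x_test2[:sample_size]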

1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x,y) is within the distribution.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values x obtained in this way will be distributed according to the original distribution.
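The numbered steps describe rejection sampling. The sketch below follows them one point at a time with the standard library. Because the exercise's `skew()` function is not shown, a hypothetical stand-in `shape()` is used: a straight line rising from 0 at x=-1 to 1 at x=1, whose normalized distribution has mean 1/3. The function names and the seed are assumptions.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible


def shape(x):
    """Hypothetical skewed shape on [-1, 1], scaled so its maximum is 1.

    Stands in for the exercise's skew() function, which is not shown here.
    """
    return (x + 1.0) / 2.0


def rejection_sample(sample_size):
    """Draw sample_size values distributed according to shape()."""
    accepted = []
    while len(accepted) < sample_size:   # step 4: repeat until enough are kept
        x = random.uniform(-1.0, 1.0)    # steps 1-2: a random point (x, y)
        y = random.uniform(0.0, 1.0)
        if y < shape(x):                 # step 3: keep x if (x, y) is under the curve
            accepted.append(x)
    return accepted


x_sample = rejection_sample(5000)
mean = sum(x_sample) / len(x_sample)
print("sample mean = %.3f (theory for this shape: 0.333)" % mean)
```

The exercise code reaches the same result in vectorized form: it generates 30 times more candidates than needed in one call and keeps the first `sample_size` accepted values, which assumes that enough candidates survive the cut.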

Page 42: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 3

(b) Calculate the mean of samples drawn from the skewed data set and the standard error of the mean, and plot the distribution of averages.

average = []
for repeat in range(1000):
    …
    average.append(np.mean(x_sample))

# The standard deviation of the sample means estimates the standard error of the mean.
sem = np.std(average)
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.set_title('Sample size = ' + str(sample_size) + ', SEM = ' + str(sem))
ax1.hist(average, bins=100, histtype='step', color='red',
         range=[-0.5, 0.5], density=True, lw=2)
ax1.set_xlim([-0.5, 0.5])

Page 43: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Box Plot

M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119
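A box plot is a drawing of a handful of summary numbers: the box spans the first to third quartile, a line marks the median, and whiskers reach out to the most extreme points that are not flagged as outliers. The sketch below computes these numbers under the common 1.5 x IQR whisker rule, which is also the matplotlib default (`whis=1.5`). The helper name `box_plot_stats`, the quartile convention and the sample are assumptions for illustration.

```python
import statistics


def box_plot_stats(values, whis=1.5):
    """Return the numbers a box plot draws, using the whis x IQR whisker rule.

    Hypothetical helper; quartile and whisker conventions vary between tools.
    """
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in xs if x < lo_fence or x > hi_fence],
    }


# Made-up sample with one extreme value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for name, value in stats.items():
    print("%-12s %s" % (name, value))
```

The extreme value ends up in the outlier list and leaves the box and whiskers untouched, which is why box plots summarize skewed or long-tailed samples more faithfully than a mean with error bars.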

Page 44: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Box Plots

[Figure: box plots for the Complex, Normal, Skewed and Long tails distributions at sample sizes n=5, n=10 and n=100]

Page 45: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Box Plots with All the Data Points

[Figure: box plots with the individual data points overlaid for the Complex, Normal, Skewed and Long tails distributions at sample sizes n=5, n=10 and n=100]

Page 46: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Box Plots, Scatter Plots and Bar Graphs: Normal Distribution

[Figure: box plots, scatter plots and bar graphs for the normal distribution; some panels use error bars showing the standard deviation, others error bars showing the standard error]

Page 47: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution

[Figure: box plots, scatter plots and bar graphs for the skewed distribution; some panels use error bars showing the standard deviation, others error bars showing the standard error]

Page 48: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Exercise 4

Download ibb2015_7_exercise4.py and plot box plots for a skewed data set.

fig, ax1 = plt.subplots(1, figsize=(6, 6))
# Overlay the individual data points around x = 1.
ax1.scatter(np.linspace(1 - 0.1, 1 + 0.1, sample_size), x_sample,
            facecolors='none', edgecolor=thiscolor, lw=1)

bp = ax1.boxplot(x_samples, notch=False, sym='')
plt.setp(bp['boxes'], color=thiscolor, lw=2)
plt.setp(bp['whiskers'], color=thiscolor, lw=2)
plt.setp(bp['medians'], color='black', lw=2)
plt.setp(bp['caps'], color=thiscolor, lw=2)
plt.setp(bp['fliers'], color=thiscolor, marker='o', lw=0)

fig.savefig(…)

Page 49: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Descriptive Statistics - Summary

• Example distributions:
  • Normal distribution
  • Skewed distribution
  • Distribution with long tails
  • Complex distribution with several peaks

• Mean, median, quartiles, percentiles

• Variance, Standard deviation, Inter Quartile Range (IQR), error bars

• Box plots, bar graphs, and scatter plots

Page 50: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Descriptive Statistics – Recommended Reading

http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html

Page 51: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Homework

Plot the ratio of the standard error of the mean and the standard deviation as a function of sample size (use sample sizes of 3, 10, 30, 100, 300, 1000) for the skewed distribution in Exercise 3. Modify ibb2015_7_exercise3.py to generate this plot and email both the script and the plot.

Page 52: Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Next Lecture: Sequence Alignment Concepts