Previous Lecture: Exploring Data

Upload: stesha

Post on 04-Jan-2016


Page 1: Previous Lecture :  Exploring Data

Previous Lecture: Exploring Data

Page 2: Previous Lecture :  Exploring Data

Introduction to Biostatistics and Bioinformatics

Descriptive Statistics

This Lecture

Page 3: Previous Lecture :  Exploring Data

Process of Statistical Analysis

Population

Random Sample

Sample Statistics

Describe

Make Inferences

Page 4: Previous Lecture :  Exploring Data

Distributions

[Figure: four example distributions; panels: Complex, Normal, Skewed, Long tails.]

Page 5: Previous Lecture :  Exploring Data

Randomly Sample from any Distribution

1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x, y) is within the distribution, that is, if y lies under the distribution curve at x.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values of x obtained in this way will be distributed according to the original distribution.
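The steps above can be sketched in Python as follows. This is only an illustration: the two-peak `density` function, its range, and the bound `y_max` are assumptions made up for the example, not taken from the slides.

```python
import math
import random


def density(x):
    """Hypothetical unnormalised two-peak curve on [-3, 3] (an assumption)."""
    return math.exp(-(x + 1.5) ** 2) + 0.5 * math.exp(-((x - 1.5) ** 2) / 0.5)


def rejection_sample(pdf, x_min, x_max, y_max, size, rng):
    """Draw `size` values distributed according to `pdf` on [x_min, x_max].

    y_max must be at least as large as the maximum of pdf on that range.
    """
    samples = []
    while len(samples) < size:
        # Steps 1-2: a pair of random numbers within the range, assigned to x and y.
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(0.0, y_max)
        # Step 3: keep x only if the point (x, y) lies under the curve.
        if y <= pdf(x):
            samples.append(x)
    # Step 4 is the loop itself; step 5 is the property of the returned values.
    return samples


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rejection_sample(density, -3.0, 3.0, 1.1, 1000, rng)
print(len(sample), min(sample), max(sample))
```

The taller left peak of the assumed curve should receive most of the accepted values, which is a quick sanity check that the sample follows the curve.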

Page 6: Previous Lecture :  Exploring Data

Mean

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
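As a small illustration of the mean, the sketch below computes it for growing subsets of a sample. The data are hypothetical draws from a normal distribution with true mean 0, chosen only for the example.

```python
import random


def mean(values):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(values) / len(values)


rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical data: 10,000 draws from a normal distribution (mean 0, s.d. 1).
draws = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# The sample mean tends to settle near the true mean as the sample size grows.
for n in (10, 100, 1000, 10_000):
    print(n, round(mean(draws[:n]), 3))
```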

Page 7: Previous Lecture :  Exploring Data

Mean

[Figure: mean versus sample size; panels: Complex, Normal, Skewed, Long tails.]

Page 8: Previous Lecture :  Exploring Data

Median, Quartiles and Percentiles

Sample: $x_1, x_2, \ldots, x_n$

Quartiles:
$x_i < Q_1$ for 25% of the sample
$x_i < Q_2$ for 50% of the sample (median)
$x_i < Q_3$ for 75% of the sample

Percentiles:
$x_i < P_m$ for m% of the sample

Inter Quartile Range:
$\mathrm{IQR} = Q_3 - Q_1$
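A minimal sketch of these quantities using Python's standard `statistics` module. The sample values are hypothetical (one deliberate outlier), and the `"inclusive"` interpolation method is one of several conventions; other tools may give slightly different quartiles for small samples.

```python
import statistics

# Hypothetical sample with one large outlier (12.0).
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 12.0]

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# n=100 returns the 99 percentile cut points; index 89 is the 90th percentile.
p90 = statistics.quantiles(sample, n=100, method="inclusive")[89]

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("90th percentile:", p90)
print("mean:", statistics.fmean(sample))
```

The single outlier pulls the mean above the median, while the quartiles barely notice it.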

Page 9: Previous Lecture :  Exploring Data

Median and Mean

[Figure: median (gray) and mean versus sample size; panels: Complex, Normal, Skewed, Long tails.]

Page 10: Previous Lecture :  Exploring Data

Quartiles and Mean

[Figure: quartiles and mean versus sample size; Q3: purple, Q1: gray; panels: Complex, Normal, Skewed, Long tails.]

Page 11: Previous Lecture :  Exploring Data

Central Limit Theorem

For many distributions, the sum of a large number of values drawn from the distribution converges to a normal distribution if:

• The values are drawn independently;
• The values are drawn from the same distribution; and
• The distribution has a finite mean and variance.
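The theorem can be illustrated with a small simulation. The sketch below assumes an exponential distribution as the skewed source and uses sample skewness (about 0 for a normal distribution) as a rough measure of how normal the sums look; both choices are assumptions for the example.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable


def sum_of_draws(n_terms):
    """Sum of n_terms independent draws from one exponential distribution (mean 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n_terms))


def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


skew_by_n = {}
for n_terms in (1, 10, 100):
    sums = [sum_of_draws(n_terms) for _ in range(5000)]
    skew_by_n[n_terms] = skewness(sums)
    print(n_terms, round(skew_by_n[n_terms], 2))
```

The skewness of the sums should shrink towards 0 as more values are added, even though each individual value comes from a strongly skewed distribution.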

Page 12: Previous Lecture :  Exploring Data

Variance

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
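A short sketch of the variance computed directly from its definition, dividing by n. The measurements are hypothetical. Note that many tools default to dividing by n - 1 instead; in Python's `statistics` module that is `variance`, while `pvariance` divides by n.

```python
import math
import statistics

sample = [4.1, 5.0, 3.8, 4.6, 5.3, 4.4]  # hypothetical measurements

n = len(sample)
mu = sum(sample) / n                                  # mean
variance = sum((x - mu) ** 2 for x in sample) / n     # mean squared deviation
std_dev = math.sqrt(variance)                         # standard deviation

print("mean:", round(mu, 4))
print("variance (divide by n):", round(variance, 4))
print("standard deviation:", round(std_dev, 4))
print("variance (divide by n - 1):", round(statistics.variance(sample), 4))
```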

Page 13: Previous Lecture :  Exploring Data

Variance

[Figure: variance versus sample size; panels: Complex, Normal, Skewed, Long tails.]

Page 14: Previous Lecture :  Exploring Data

Inter Quartile Range and Standard Deviation

[Figure: standard deviation and IQR/1.349 (gray) versus sample size; panels: Complex, Normal, Skewed, Long tails.]
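The factor 1.349 is the width of the interquartile range of a normal distribution in units of its standard deviation, so IQR/1.349 estimates the standard deviation when the data are normal. The sketch below checks this on simulated data and then adds a few outliers; the sample sizes, the true standard deviation of 2, and the outlier value are assumptions chosen for the example.

```python
import random
import statistics


def iqr_based_sd(values):
    """Estimate of the standard deviation from the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 1.349


rng = random.Random(3)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0.0, 2.0) for _ in range(10_000)]

sd = statistics.pstdev(normal_sample)
robust_sd = iqr_based_sd(normal_sample)
print("normal data:   s.d. =", round(sd, 3), " IQR/1.349 =", round(robust_sd, 3))

# Contaminate the sample with 1% extreme values (hypothetical outliers).
with_outliers = normal_sample + [50.0] * 100
sd_out = statistics.pstdev(with_outliers)
robust_out = iqr_based_sd(with_outliers)
print("with outliers: s.d. =", round(sd_out, 3), " IQR/1.349 =", round(robust_out, 3))
```

On the clean data the two estimates should agree closely; with outliers the standard deviation inflates while IQR/1.349 changes little.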

Page 15: Previous Lecture :  Exploring Data

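Uncertainty in Determining the Mean

[Figure: distribution of the sample average for different sample sizes; panels: Complex, Normal, Skewed, Long tails; sample sizes n=3, n=10, n=100 in the first three panels and n=10, n=100, n=1000 in the fourth.]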

Page 16: Previous Lecture :  Exploring Data

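Standard Error of the Mean

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

Standard Error of the Mean: $\mathrm{s.e.m.} = \frac{\sigma}{\sqrt{n}}$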
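A simulation sketch of what the standard error of the mean describes: the spread of sample means over many repeated samples. The normal source distribution, the sample size of 25, and the 4000 repetitions are assumptions for the example.

```python
import math
import random
import statistics

rng = random.Random(4)  # fixed seed so the run is repeatable
sigma = 1.0             # true standard deviation of the assumed source distribution
n = 25                  # size of each sample

# Repeat the experiment many times and record the mean of each sample.
means = []
for _ in range(4000):
    one_sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    means.append(statistics.fmean(one_sample))

observed = statistics.pstdev(means)   # spread of the sample means
predicted = sigma / math.sqrt(n)      # s.e.m. = sigma / sqrt(n)
print("observed spread of means:", round(observed, 4))
print("sigma / sqrt(n):         ", round(predicted, 4))

# In practice only one sample is available, so sigma is estimated from it.
single = [rng.gauss(0.0, sigma) for _ in range(n)]
sem_estimate = statistics.pstdev(single) / math.sqrt(n)
print("s.e.m. estimated from one sample:", round(sem_estimate, 4))
```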

Page 17: Previous Lecture :  Exploring Data

Error bars

M. Krzywinski & N. Altman, Error Bars, Nature Methods 10 (2013) 921

In 2012, error bars appeared in Nature Methods in about two-thirds of the figure panels in which they could be expected (scatter and bar plots). The type of error bars was nearly evenly split between s.d. and s.e.m. bars (45% versus 49%, respectively). In 5% of cases the error bar type was not specified in the legend. Only one figure used bars based on the 95% CI.

None of the error bar types is intuitive. An alternative is to select a value of CI% for which the bars touch at a desired P value (e.g., 83% CI bars touch at P = 0.05).
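The 83% figure quoted above can be reproduced with a short calculation, sketched here under simplifying assumptions that are not stated on the slide: two independent samples, equal standard errors, and a large-sample normal approximation. Two bars of half-width z·SE just touch when the means differ by 2·z·SE, and the two-sample test statistic for that difference is 2·z·SE / (√2·SE) = √2·z.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution (mean 0, s.d. 1)


def touching_ci_level(p_value):
    """CI level at which two equal-width error bars just touch at the given P value.

    Assumes independent samples, equal standard errors, normal approximation.
    """
    z_test = std_normal.inv_cdf(1.0 - p_value / 2.0)  # two-sided critical value
    z_bar = z_test / math.sqrt(2.0)                   # bar half-width in units of SE
    return 2.0 * std_normal.cdf(z_bar) - 1.0


level = touching_ci_level(0.05)
print("CI level for bars touching at P = 0.05:", round(100 * level, 1), "%")
```

The same calculation shows why 95% CI bars that just touch correspond to a P value well below 0.05.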

Page 18: Previous Lecture :  Exploring Data

Box Plot

M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119

Page 19: Previous Lecture :  Exploring Data

Box Plots

[Figure: box plots for sample sizes n=5, n=10, and n=100; panels: Complex, Normal, Skewed, Long tails.]

Page 20: Previous Lecture :  Exploring Data

Box Plots with All the Data Points

[Figure: box plots overlaid with all data points for sample sizes n=5, n=10, and n=100; panels: Complex, Normal, Skewed, Long tails.]

Page 21: Previous Lecture :  Exploring Data

Box Plots, Scatter Plots and Bar Graphs: Normal Distribution

[Figure: box plots, scatter plots, and bar graphs of the same samples; two panels with error bars showing the standard deviation and two panels with error bars showing the standard error.]

Page 22: Previous Lecture :  Exploring Data

Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution

[Figure: box plots, scatter plots, and bar graphs of the same samples; two panels with error bars showing the standard deviation and two panels with error bars showing the standard error.]

Page 23: Previous Lecture :  Exploring Data

Box Plots, Scatter Plots and Bar Graphs: Distribution with Fat Tail

[Figure: box plots, scatter plots, and bar graphs of the same samples; two panels with error bars showing the standard deviation and two panels with error bars showing the standard error.]

Page 24: Previous Lecture :  Exploring Data

Application: Analytical Measurements

[Figure: Measured Concentration versus Theoretical Concentration.]

Page 25: Previous Lecture :  Exploring Data

A Few Characteristics of Analytical Measurements

Accuracy: Closeness of agreement between a test result and an accepted reference value.

Precision: Closeness of agreement between independent test results.

Robustness: Test precision given small, deliberate changes in test conditions (preanalytic delays, variations in storage temperature).

Lower limit of detection: The lowest amount of analyte that is statistically distinguishable from background or a negative control.

Limits of quantification: The lowest and highest concentrations of analyte that can be quantitatively determined with suitable precision and accuracy.

Linearity: The ability of the test to return values that are directly proportional to the concentration of the analyte in the sample.

Page 26: Previous Lecture :  Exploring Data

Measuring Blanks

Page 27: Previous Lecture :  Exploring Data

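Coefficient of Variation

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

Coefficient of Variation (CV): $\mathrm{CV} = \frac{\sigma}{\mu}$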
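A minimal sketch of the CV as the standard deviation divided by the mean. The replicate measurements at a low and a high concentration are hypothetical values invented for the example.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation relative to the mean (dimensionless)."""
    return statistics.pstdev(values) / statistics.fmean(values)


# Hypothetical replicate measurements of the same analyte.
low = [0.9, 1.4, 0.7, 1.2, 0.8]             # near the bottom of the range
high = [98.0, 101.5, 99.2, 100.8, 100.5]    # well inside the range

cv_low = coefficient_of_variation(low)
cv_high = coefficient_of_variation(high)
print("CV at low concentration: ", round(100 * cv_low, 1), "%")
print("CV at high concentration:", round(100 * cv_high, 1), "%")
```

Because the CV is relative to the mean, it allows the precision at very different concentrations to be compared on one scale.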

Page 28: Previous Lecture :  Exploring Data

Lower Limit of Detection

The lowest amount of analyte that is statistically distinguishable from background or a negative control.

Two methods to determine lower limit of detection:

1. Find the lowest concentration of the analyte at which the CV is less than a chosen threshold, for example 20%.

2. Determine the level of the blank by taking the 95th percentile of the blank measurements, then add a constant times the standard deviation of measurements at the lowest concentration.

K. Linnet and M. Kondratovich, Partly Nonparametric Approach for Determining the Limit of Detection, Clinical Chemistry 50 (2004) 732–740.
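Both methods can be sketched in a few lines. Everything numeric here is an assumption for illustration: the blank and replicate values are invented, the percentile convention is Python's default, and the constant `k=1.645` is only a placeholder because the slide does not give its value. The first function returns the lowest concentration whose CV is below the threshold and does not check that all higher concentrations also pass.

```python
import statistics


def percentile(values, q):
    """q-th percentile (integer 1..99) using statistics.quantiles' default method."""
    return statistics.quantiles(values, n=100)[q - 1]


def lod_from_cv(replicates_by_conc, max_cv=0.20):
    """Method 1: lowest concentration whose replicate CV is below max_cv, else None."""
    for conc in sorted(replicates_by_conc):
        reps = replicates_by_conc[conc]
        cv = statistics.stdev(reps) / statistics.fmean(reps)
        if cv < max_cv:
            return conc
    return None


def lod_from_blanks(blanks, lowest_conc_reps, k=1.645):
    """Method 2: 95th percentile of blanks plus k times the s.d. at the lowest concentration.

    k is a placeholder value, not taken from the slide.
    """
    return percentile(blanks, 95) + k * statistics.stdev(lowest_conc_reps)


# Hypothetical data.
blanks = [0.02, 0.05, 0.01, 0.04, 0.03, 0.06, 0.02, 0.05, 0.03, 0.04,
          0.07, 0.01, 0.03, 0.05, 0.02, 0.04, 0.06, 0.03, 0.02, 0.08]
replicates = {
    0.5: [0.2, 0.9, 0.4, 0.7],
    1.0: [0.8, 1.3, 0.9, 1.1],
    2.0: [1.9, 2.2, 2.0, 2.1],
    5.0: [4.9, 5.2, 5.0, 5.1],
}

lod_cv = lod_from_cv(replicates)
lod_blank = lod_from_blanks(blanks, replicates[0.5])
print("Method 1 (CV < 20%):", lod_cv)
print("Method 2 (blank percentile + k * s.d.):", round(lod_blank, 3))
```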

Page 29: Previous Lecture :  Exploring Data

Limit of Detection and Linearity

[Figure: two panels, each showing Measured Concentration versus Theoretical Concentration.]

Page 30: Previous Lecture :  Exploring Data

Precision and Accuracy

[Figure: two panels, each showing Measured Concentration versus Theoretical Concentration.]

Page 31: Previous Lecture :  Exploring Data

Descriptive Statistics - Summary

• Example distributions:
  • Normal distribution
  • Skewed distribution
  • Distribution with long tails
  • Complex distribution with several peaks

• Mean, median, quartiles, percentiles

• Variance, standard deviation, Inter Quartile Range (IQR), error bars

• Box plots, bar graphs, and scatter plots

• Application: Analytical measurements:
  • Accuracy and precision
  • Limit of detection and quantitation
  • Linearity
  • Robustness

Page 32: Previous Lecture :  Exploring Data

Descriptive Statistics – Recommended Reading

http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html

Page 33: Previous Lecture :  Exploring Data

Descriptive Statistics – Recommended Reading

http://greenteapress.com/thinkstats/

Page 34: Previous Lecture :  Exploring Data

Next Lecture: Data types and representations in Molecular Biology

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>[email protected] MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGGAAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152#.,')2/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHIHIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG

##gff-version 3#!gff-spec-version 1.20##species_http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7425NC_015867.2 RefSeq cDNA_match 66086 66146 . - . ID=aln0;Target=XM_008204328.1 1 61 +; for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1NC_015867.2 RefSeq cDNA_match 65959 66007 . - . ID=aln0;Target=XM_008204328.1 62 110 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1NC_015867.2 RefSeq cDNA_match 65799 65825 . - . ID=aln0;Target=XM_008204328.1 111 137 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1

Formats of the three example records above, in order: FASTA, FASTQ, GFF3

Page 35: Previous Lecture :  Exploring Data

Next Tutorial: Python Programming

Saturday 9/13 at 3 PM in TRB 120