Outline - compbio.ucdenver.edu/77112015/kechris...
9/9/15
Introduction to Concepts in Statistics
9/10/15
Katerina Kechris
Department of Biostatistics and Informatics
Computational Bioscience Program
Outline
1. Tests of significance
2. Exercises
3. Non-parametric tests
4. Multiple testing
5. Power & sample size
Tests of significance (Hypothesis Testing)
• To evaluate whether your observations are due to chance (null hypothesis) or due to real effects (alternative hypothesis).
• Null hypothesis not necessarily formulated in the same manner as the scientific hypothesis.
Example:
• Scientific hypothesis: Drug treatment improves blood pressure. (effect)
• Null hypothesis: There is no effect of drug treatment on blood pressure. (variations are due to chance)
Concept: Expected vs observed
• Determine the expected value you would observe by chance and evaluate how extreme your observed value is.
Example
• Observed difference: -0.08 (mean of observations)
• Expected difference: 0 (specified by null hypothesis)
Example
Subject  Gene  Tissue 0  Tissue 1  yij = log(yij1 / yij0)
1        1     y110      y111      y11
2        1     y210      y211      y21
3        1     y310      y311      y31
4        1     y410      y411      y41
5        1     y510      y511      y51
6        1     y610      y611      y61
7        1     y710      y711      y71
8        1     y810      y811      y81
9        1     y910      y911      y91
Subject  yij = log(yij1 / yij0)
1        -1.1
2        0.46
3        -0.34
4        0.29
5        0.82
6        -1.09
7        0.50
8        -1.44
9        1.19
Example
Observed difference: -0.08 (mean of observations)
Expected difference: 0 (specified by null hypothesis)
Concept: Standard error
• How extreme is the observation -0.08 from 0?
• We need to scale this distance in units of standard errors (SE).
Concept: Standard Error (SE) vs Standard Deviation (SD)
• SD ($\sigma$) is the measure of spread in the population.
• SE ($\sigma_{\bar{X}}$) is the measure of spread in the sample mean.
Concept: Test statistic
• The t-statistic tells us how many SEs the observation is from the expected value.
    t-statistic = $\frac{\text{observed} - \text{expected}}{\text{SE}}$
• The t-statistic is an example of a test statistic.
• Test statistics measure the difference between the data and what is expected under the null hypothesis.
Concept: Significance Level (p-value)
Example:
• mean = -0.08, SD = .95, n = 9, t-statistic = $\frac{-0.08 - 0}{0.95 / \sqrt{9}}$ = -0.25
• Is -.25 extreme? That is, what is the chance that we observe a mean value that is -.25 SEs from the expected value?
• Observed significance level (or p-value) is the chance of obtaining a test-statistic as extreme as or more extreme than the observed one.
• Computed on the basis of the null hypothesis.
• A small p-value is evidence against the null hypothesis and indicates that something besides chance (a real effect) is operating to make the difference.
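The example's summary numbers can be reproduced from the nine log-ratios in the table above. This is a minimal sketch using only the Python standard library; the function name `one_sample_t` is an illustrative choice, not something from the slides, and the SE is taken as SD divided by the square root of n.

```python
import math
import statistics

# Log-ratios for the nine subjects, copied from the example table.
log_ratios = [-1.1, 0.46, -0.34, 0.29, 0.82, -1.09, 0.50, -1.44, 1.19]

def one_sample_t(values, expected=0.0):
    """Return (mean, sd, se, t) for a one-sample t-statistic."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
    se = sd / math.sqrt(n)          # standard error of the sample mean
    t = (mean - expected) / se      # distance from the expected value, in SE units
    return mean, sd, se, t

mean, sd, se, t_stat = one_sample_t(log_ratios)
print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}, t = {t_stat:.2f}")
```

The `expected` argument is the value specified by the null hypothesis, which is 0 in this example.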
Concept: Computing t-test p-value
Example: t-distribution with 8 degrees of freedom (df)
• Degrees of freedom (df) = # obs - 1 (one df is used to estimate the sample mean)
• Chance we observe a mean value that is -0.25 SEs from the expected value?
• Calculate the area under the curve to the left of (& including) -0.25 (standard tables & software do this).
• This is a one-sided test.
• p-value (area under the curve ≤ -0.25) = .40
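The area under the curve can also be approximated directly. The sketch below integrates the t-distribution density numerically with Simpson's rule so that it needs only the standard library; in practice a statistics package would do this step. The helper names `t_pdf` and `t_cdf` are illustrative assumptions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] and the symmetry of the density."""
    a = abs(x)
    h = a / steps                   # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3            # area under the curve between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

t_stat = -0.25   # t-statistic from the example
df = 9 - 1       # number of observations minus 1
p_left = t_cdf(t_stat, df)
print(f"one-sided p-value = {p_left:.2f}")
```

Because the density is symmetric about 0, the left-tail area is 0.5 minus the area between 0 and 0.25.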
Concept: One-sided vs Two-sided Test
• One-sided (or one-tailed) vs. two-sided (or two-tailed)
• Depends on precise form of alternative hypothesis.
• Alternative hypothesis 1: Drug treatment improves blood pressure. (one-sided - right)
• Alternative hypothesis 2: Drug treatment affects blood pressure. (two-sided)
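How the choice of alternative changes the p-value can be illustrated with the left-tail area of .40 from the t-test example. This is a sketch that assumes a continuous null distribution symmetric about 0 (as the t-distribution is); the function name `tail_p_values` is hypothetical.

```python
def tail_p_values(lower_tail):
    """Given P(T <= t_obs) under the null, return the three candidate p-values.

    Assumes a continuous null distribution that is symmetric about 0.
    """
    upper_tail = 1.0 - lower_tail
    return {
        "one-sided (left)": lower_tail,
        "one-sided (right)": upper_tail,
        "two-sided": min(1.0, 2.0 * min(lower_tail, upper_tail)),
    }

# 0.40 is the lower-tail area reported for t = -0.25 with 8 df.
for name, p in tail_p_values(0.40).items():
    print(f"{name}: p = {p:.2f}")
```

The two-sided value doubles the smaller tail, so it counts values that are extreme in either direction.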
Summary
1. Set up the null hypothesis
2. Pick a test-statistic
3. Compute the observed significance level
What test to use?
• Parametric tests: assume data are distributed according to a known family of probability distributions (e.g., normal).
  – If there are deviations from the distribution of interest (e.g., Gaussian), still appropriate (robust to outliers) if sample size is large (>30).
• Non-parametric tests make no assumptions about the population distribution (distribution-free tests).
  – Rank-based & permutation tests
  – May be important when there are gross violations of distributional assumptions
Concept: Rank Tests
• Use ranks of data points.
• Wilcoxon Rank Sum Test (Mann-Whitney Test): alternative for two-sample t-test
• Evaluate if two random samples are from same distribution or if shifted in location.
Example: Rank Test
Suppose we have two groups and expression values (log2) for m replicate samples:
$x_1, x_2, \ldots, x_m$ for group 1 and $y_1, y_2, \ldots, y_m$ for group 2
Is the distribution of the expression values for these groups significantly different (are they shifted?)
[Figure: expression values for Group 1 and Group 2 plotted along a common axis]
Example: Rank Test
Sample data
Combine groups and determine the overall ranks:
Example: Rank Test
• Are the ranks for group 1 (or group 2) sufficiently large?
• Use test-statistic sum of the ranks for group 1 (SR* = 28).
• What is the null hypothesis? Is this value extreme?
• $\binom{10}{5}$ = 252 possible sets of ranks for each group. All equally likely.
Example: Rank Test
• Calculate SR for each possible sample:
Example: Rank Test
• Recall the definition of a p-value!
    $P(SR \ge SR^*) = \frac{\#(SR \ge SR^*)}{252}$
• If the sample size is large, can use a Normal approximation.
[Figure: null distribution of SR; x-axis SR from 15 to 40 with SR* marked, y-axis probability in units of 1/252]
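The exact null distribution of the rank sum can be enumerated directly. The sketch below assumes 5 samples per group and no tied values, which is inferred from the 252 possible rank sets and the 15 to 40 range of SR rather than stated on the slides; the function names are illustrative.

```python
from itertools import combinations
from collections import Counter

def rank_sum_null(n_total, m):
    """Count how often each rank sum occurs over all ways to pick m of n_total ranks."""
    ranks = range(1, n_total + 1)
    return Counter(sum(subset) for subset in combinations(ranks, m))

def rank_sum_p_value(sr_obs, n_total, m):
    """Exact upper-tail p-value P(SR >= sr_obs) when all rank subsets are equally likely."""
    null = rank_sum_null(n_total, m)
    n_subsets = sum(null.values())
    n_extreme = sum(count for sr, count in null.items() if sr >= sr_obs)
    return n_extreme, n_subsets, n_extreme / n_subsets

n_extreme, n_subsets, p = rank_sum_p_value(sr_obs=28, n_total=10, m=5)
print(f"{n_extreme} of {n_subsets} rank sums are >= 28, p = {p:.3f}")
```

Under these assumptions, half of the 252 possible rank sums are at least 28, so SR* = 28 is not extreme.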
Concept: Permutation Tests
• To evaluate significance of observed test-statistic.
• Evaluate all possible values of test-statistic on permuted data sets where the labels have been rearranged on the observed data.
• The null distribution is generated from the permutations (do not need to assume a distribution)
Example: Permutation Tests
• Calculate t-statistic on previous example
Two-sample t-statistic t* = 0.22
• Do not assume t-distribution, use permutations
Example: Permutation Tests
• Permuted data set 1
Two-sample t-statistic t1 = -1.37
• Permuted data set 2
Two-sample t-statistic t2 = 2.25
• ... repeat many times ...
Example: Permutation Tests
• If sample small enough, all permutations (p) can be evaluated. Otherwise, sample randomly (e.g., 10000 times) from all possible permutations.
• Recall the definition of a p-value!
    $P(t_p \ge t^*) = \frac{\#(t_p \ge t^*)}{\#\text{ of permutations}}$
• Possible p-values for both examples are discrete.
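A full-enumeration permutation test might look like the sketch below. The expression values are hypothetical (the slides' sample data did not survive extraction), and the pooled-variance form of the two-sample t-statistic is an assumption, since the slides do not say which form was used.

```python
import math
import statistics
from itertools import combinations

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

def permutation_p_value(x, y):
    """Exact one-sided p-value P(t_p >= t*) over all relabelings of the pooled data."""
    pooled = list(x) + list(y)
    t_obs = two_sample_t(x, y)
    n_perm = 0
    n_extreme = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_perm += 1
        if two_sample_t(gx, gy) >= t_obs - 1e-12:   # tolerance for floating-point ties
            n_extreme += 1
    return t_obs, n_extreme, n_perm, n_extreme / n_perm

# Hypothetical log2 expression values; NOT the data from the slides.
group1 = [7.1, 8.4, 6.9, 9.2, 7.8]
group2 = [6.5, 8.9, 7.0, 7.4, 8.1]
t_obs, n_extreme, n_perm, p = permutation_p_value(group1, group2)
print(f"t* = {t_obs:.2f}, {n_extreme}/{n_perm} permutations with t_p >= t*, p = {p:.3f}")
```

With 5 samples per group there are only 252 relabelings, so the p-value is a multiple of 1/252. For larger samples, the loop over `combinations` would be replaced by random relabelings.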
Multiple Testing
• Suppose we are testing ~20,000 genes for differential expression.
• What is the null hypothesis for each gene? Suppose that the null hypothesis is true for each gene.
• If we apply a p-value (or significance level) cutoff of 0.01, how many times do we expect to incorrectly reject the null hypothesis (i.e., observe a p-value ≤ .01)?
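The question above can be answered with one multiplication, and checked with a small simulation. This sketch assumes that p-values under a true null are uniform on [0, 1], which holds for continuous test statistics; the simulation and its seed are illustrative additions.

```python
import random

m = 20_000      # number of genes tested
alpha = 0.01    # p-value cutoff

# Under a true null, P(p <= alpha) = alpha, so the expected count is m * alpha.
expected_false_rejections = m * alpha
print(f"expected false rejections: {expected_false_rejections:.0f}")

# Simulated check with a fixed seed (an illustration, not part of the slides).
rng = random.Random(0)
simulated = sum(rng.random() <= alpha for _ in range(m))
print(f"false rejections in one simulated experiment: {simulated}")
```

So even when no gene is differentially expressed, about 200 genes would pass an uncorrected cutoff of 0.01.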
Statistical Inference Decision Matrix
power - probability of rejecting the null hypothesis when it is false. It is the probability of predicting a real effect.
Different error rates:
• per-comparison error rate (PCER) is the expected number of rejected true null hypotheses divided by the total number of hypotheses
• family-wise error rate (FWER) is the probability of rejecting ≥ 1 true null hypothesis
• false discovery rate (FDR) is the expected proportion of false predictions among all the predictions (null hypothesis rejections)
Types of Control
• Many different multiple testing corrections based on what type of error rate is controlled (FWER, FDR, etc.).
• Bonferroni procedure controls the FWER.
  – If m hypotheses are being tested, divide your significance level by m (e.g., .05/25000).
• This procedure is very conservative (i.e., real effects may be missed).
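The Bonferroni rule amounts to a single division. A minimal sketch, with hypothetical function names, using the .05/25000 example from the slide:

```python
def bonferroni_threshold(alpha, m):
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / m

def bonferroni_reject(p_values, alpha):
    """Return a list of booleans: True where the hypothesis is rejected."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p <= cutoff for p in p_values]

# The slide's example: 25,000 tests at an overall level of 0.05.
print(bonferroni_threshold(0.05, 25_000))
```

With 25,000 tests a gene needs a p-value of about 0.000002 or smaller to be called significant, which is why the procedure can miss real effects.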
Types of Control
• In gene expression studies, controlling FDR may be more appropriate.
• With FDR control at 5%, if 100 genes are significant, this set is expected to contain about 95% truly differentially expressed genes.
• Compared with FWER control, the power is increased, but the likelihood of type I errors also increases.
• Conceptually the FDR cutoff is not a p-value cutoff!
Example: FDR
Benjamini and Hochberg (1995) procedure controls the FDR.
Suppose we obtain p-values for all genes:
1. Sort all the p-values $p_1$ to $p_m$ from smallest to largest
2. Find the largest k so that $p_k \le q \cdot (k / m)$
3. Reject all hypotheses with p-values up to the cutoff value $c = p_k$
Example: q = 0.1 and m = 10 (tests)
Example: FDR
Example: q = 0.1 and m = 10 (tests)
With cutoff value c = 0.029: 3 rejected hypotheses
Expected that q = 10% of tests rejected are false discoveries (null hypothesis true)
Bonferroni correction at significance level 0.1: 1 rejected hypothesis [c = 0.10 / 10 = 0.010]
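The Benjamini and Hochberg steps can be sketched in a few lines. The ten p-values below are hypothetical: the slide's table was not recoverable, so they were chosen only to reproduce the reported summary (cutoff 0.029 with 3 rejections, and 1 rejection under Bonferroni).

```python
def benjamini_hochberg(p_values, q):
    """Return (k, cutoff): number of rejections and the p-value cutoff c = p_(k)."""
    m = len(p_values)
    ordered = sorted(p_values)          # step 1: sort from smallest to largest
    k = 0
    for rank, p in enumerate(ordered, start=1):
        if p <= q * rank / m:           # step 2: keep the largest rank that passes
            k = rank
    cutoff = ordered[k - 1] if k > 0 else None
    return k, cutoff                    # step 3: reject the k smallest p-values

# Hypothetical p-values, chosen only so the result matches the slide's summary
# (cutoff 0.029, 3 rejections); they are NOT the slide's actual values.
p_values = [0.005, 0.015, 0.029, 0.05, 0.08, 0.20, 0.35, 0.50, 0.70, 0.90]

k, cutoff = benjamini_hochberg(p_values, q=0.1)
n_bonferroni = sum(p <= 0.1 / len(p_values) for p in p_values)
print(f"BH: {k} rejected, cutoff c = {cutoff}")
print(f"Bonferroni: {n_bonferroni} rejected")
```

Note that the loop keeps the largest passing rank rather than stopping at the first failure, which matches the "largest k" wording of step 2.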
Power and Sample Size
• Power analysis/sample size estimation depends on the significance level and effect size.
• The greater the effect size, the greater the power.
• There are many software packages to calculate sample size.
T-test Sample Size
To calculate the estimated sample size you need:
• Effect size
  – difference between means, standard deviation
• Significance level
  – Type I error probability (e.g., .05)
• Power of test
  – 1 minus Type II error probability (e.g., 80%)
• Type of t-test
  – one, two or paired-samples
• Alternative hypothesis
  – one-sided (greater or less) or two-sided
Comments
• A pre-determined sample size can be substituted for power in the list above to estimate the power of your test for that given sample size.
• A power level of ≥ 0.80 is considered good power.
• The effect size may be supported by previous work or from the literature.
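To make the inputs concrete, here is a rough sketch of a sample size and power calculation for a two-sided, two-sample comparison of means. It uses the normal approximation rather than the exact t-based calculation that software packages perform, so it may slightly understate the required n; the function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = z.inv_cdf(power)            # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

def approx_power(n, delta, sd, alpha=0.05):
    """Approximate power for n per group (ignores the negligible opposite tail)."""
    z = NormalDist()
    se = sd * math.sqrt(2 / n)           # SE of the difference between two means
    return z.cdf(abs(delta) / se - z.inv_cdf(1 - alpha / 2))

# Hypothetical inputs: a difference between means equal to one standard deviation.
n = n_per_group(delta=1.0, sd=1.0)
print(n, round(approx_power(n, delta=1.0, sd=1.0), 2))
```

Calling `approx_power` with a fixed `n` corresponds to the first comment above: the sample size is supplied and the power is the unknown.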
Outline
Take (bio)statistics course(s)!
BIOS 6606 (Statistics for the Basic Sciences)
BIOS 6611/12 (Biostatistical Methods)
BIOS 6631/31 (Statistical Theory)
BIOS 7731 (Mathematical Statistics, Kechris)
BIOS 7659 (Statistical Methods in Genomics, Kechris) – next fall?