Power and Sample-size Estimation for Microbiome Studies
Dr Chimusa | Department of Pathology | University of Cape Town | AGe, 2017
Outline
• Overview of Tutorial
• Power Calculation from R micropower tool
• Power Calculation from R HMP tool
AGe Dr. Chimusa
Overview Tutorial
• Microbiome studies often compare groups of microbial communities with different environmental exposures, or to which different interventions have been applied.
Example:
• A study may evaluate the difference between the respiratory tract microbial communities of human subjects exposed to different antibiotic treatments.
The fundamental measure in such a study is the count of community members (species or operational taxonomic units, OTUs), typically obtained by sequencing a marker gene such as the 16S ribosomal RNA gene for bacteria.
[Figure with Methods and Analysis panels. Source: Clarke TH, Gomez A, Singh H, Nelson KE, Brinkac LM. Integrating the Microbiome as a Resource in the Forensics Toolkit. Forensic Science International: Genetics. 2017 Jun 27; 30: 141-147.]
Pairwise distance metrics facilitate standardized comparison of community membership between individual study subjects by addressing the problems of differential membership and mutual absence.
Why are sample-size and power calculations needed?
• The design of microbiome studies demands consideration of statistical power: an adequate number of subjects must be recruited to ensure that the effect expected from the exposure or intervention of interest can be detected.
What is power?
• For each simulated effect, power is estimated as the proportion of bootstrap samples (distance matrices) whose test P-value falls below the pre-specified threshold for type I error (0.05).
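The definition above can be illustrated with a small simulation. This is a minimal sketch, not the micropower procedure: it swaps in a permutation test on the difference of two group means (instead of PERMANOVA on distance matrices), and the function names, Gaussian data and default settings are all assumptions made for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_perm, rng):
    """Two-sided permutation p-value for a difference in group means."""
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign subjects to groups at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def estimate_power(effect, n_per_group, n_sim=200, n_perm=199, alpha=0.05, seed=1):
    """Proportion of simulated studies whose p-value is below alpha."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sim):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        if permutation_p_value(a, b, n_perm, rng) < alpha:
            significant += 1
    return significant / n_sim


power_no_effect = estimate_power(0.0, 10)     # should stay near alpha
power_large_effect = estimate_power(2.0, 10)  # should approach 1
```

With no effect, the estimate approximates the type I error rate; increasing the effect or the number of subjects per group raises it towards 1.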
Work is done, relax on beach?
Power Calculation from R micropower tool
# Installing micropower
1. Log in to CHPC and go to your working directory /mnt/lustre/users/USERNAME/, then go into AGe_NGS (if you do not have that folder, run mkdir AGe_NGS and then cd AGe_NGS).
2. Copy the Microbiomes folder from the group directory:
> cp -r /mnt/lustre/groups/CBBI0818/AGe_NGS/Microbiomes/ ./
3. Go in with cd Microbiomes and then cd power.
4. Download micropower:
> git clone https://github.com/brendankelly/micropower.git
Then cd micropower and cd R.
5. To run R on the server, load the R module:
> module add chpc/R/3.3.3-gcc6.3.0
6. Open R by typing R in the terminal.
7. Install micropower with either githubinstall or devtools:
> library(githubinstall)
> githubinstall("micropower")
> library(devtools)
> install_github("brendankelly/micropower")
Overview Micropower
The variation in community composition between microbiome samples can be measured by pairwise distances based on either presence-absence or quantitative species abundance data.
PERMANOVA, a permutation-based extension of multivariate analysis of variance to a matrix of pairwise distances, partitions within-group and between-group distances to permit assessment of the effect of an exposure or intervention (grouping factor) upon the sampled microbiome.
Within-group distance and exposure/intervention effect size must be accurately modeled to estimate statistical power for a microbiome study that will be analyzed with pairwise distances and PERMANOVA.
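How PERMANOVA partitions a distance matrix can be sketched in a few lines. The example below is an illustration under stated assumptions, not the micropower or vegan implementation: it computes the pseudo-F statistic from within-group and total sums of squared distances and obtains a P-value by permuting the group labels. The function names and the toy one-dimensional data are hypothetical.

```python
import random


def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity inside each group.
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_g = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += ss_g / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permanova(dist, groups, n_perm=999, seed=0):
    """Return the observed pseudo-F and its permutation P-value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: two well-separated groups of five subjects on a line.
values = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
groups = ["control"] * 5 + ["treated"] * 5
dist = [[abs(x - y) for y in values] for x in values]
f_obs, p_value = permanova(dist, groups)
```

Because the statistic depends only on pairwise distances, the same code applies to any distance, including Jaccard distances between OTU profiles.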
Kelly, B.J., Gross, R., Bittinger, K., Sherrill-Mix, S., Lewis, J.D., Collman, R.G., Bushman, F.D. and Li, H., 2015. Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA. Bioinformatics, 31(15), pp.2461-2468.
A. Simulate a set of OTU tables that incorporates a range of between-group effects in addition to the desired within-group distance distribution, using the simPower command:
> library(micropower)
> T <- sapply(simPower(c(16,16,16),100,10,0.8,seq(0,0.3,length.out=100)),FUN=function(x) {calcOmega2(calcWJstudy(x))})
Arguments of simPower:
• Vector representing subjects per exposure/intervention group
• Number of simulated OTUs
• Number of sequence counts per OTU bin
• Proportion of sequence counts to retain after subsampling
• Effect: proportion of unique community membership in the affected group of subjects
calcWJstudy: Generates a square matrix of pairwise weighted Jaccard distances from an OTU table (for unweighted distances, replace W by U: calcUJstudy).
calcOmega2: Estimates the proportion of distance accounted for by the grouping factor, corrected for the mean-squared error.
simPower returns: a list of two-dimensional-matrix OTU tables, with row and column names to suit downstream analysis.
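To make the distance step concrete, here is a minimal sketch of weighted and unweighted Jaccard distances between samples. It assumes the common definition of the weighted Jaccard distance, 1 - sum(min) / sum(max) over OTU counts, and it stores one count vector per subject; the function names are hypothetical and this is not the calcWJstudy source code.

```python
def weighted_jaccard(x, y):
    """1 - sum(min) / sum(max) over OTU counts; 0 if both samples are empty."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return 0.0 if den == 0 else 1.0 - num / den


def unweighted_jaccard(x, y):
    """Jaccard distance on presence/absence of each OTU."""
    px = [1 if a > 0 else 0 for a in x]
    py = [1 if b > 0 else 0 for b in y]
    return weighted_jaccard(px, py)


def distance_matrix(samples, metric):
    """Square matrix of pairwise distances between samples."""
    n = len(samples)
    return [[metric(samples[i], samples[j]) for j in range(n)] for i in range(n)]


# Three subjects, three OTUs (counts are made up for illustration).
samples = [[10, 0, 5], [10, 0, 5], [0, 8, 5]]
wj = distance_matrix(samples, weighted_jaccard)
uj = distance_matrix(samples, unweighted_jaccard)
```

Identical samples are at distance 0, and mutual absence of an OTU contributes nothing to either distance.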
B. Bootstrap the simulated distance matrices to estimate PERMANOVA power, using the bootPower command:
> P <- bootPower(T, boot_number=100, subject_group_vector=c(3,4,5), alpha=0.05)
Return: A data frame relating PERMANOVA power to effect size quantified by the coefficient of determination (R^2) and omega-squared.
> write.table(P, file="boot.txt", row.names=FALSE)
T: a list of square distance matrices, with names.
boot_number: number of bootstrap samples to perform on each distance matrix in the list.
subject_group_vector: number of subjects in each group to sample, as a vector.
alpha: threshold for PERMANOVA type I error.
Note: bootPower needs distance matrices, so here take T as
> T <- lapply(simPower(c(16,16,16),100,1,0.8,seq(0,0.3,length.out=100)),calcWJstudy)
Power Calculation from R HMP tool
Overview HMP
# Install HMP
> install.packages("HMP", repos="http://cran.r-project.org", dependencies=TRUE)
> library(HMP)
Data sets to mimic:
• Saliva (Throat and Tongue) data set formed by the ranked-abundance distribution (RAD) vectors of 24 subjects. Each RAD vector contains 21 elements: the 20 most abundant taxa at the genus level, plus one additional taxon containing the sum of the remaining, less abundant taxa in the sample. The format is a matrix of 24 rows by 21 columns, with each row being a separate subject and each column being a different taxon.
Note: The incorporation of the additional taxon (taxon 21) in the analysis allows for estimating the RAD proportional-mean of taxa with respect to all the taxa within the sample.
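The construction of such a RAD matrix can be sketched as follows. This is an illustrative assumption about the procedure, not the code used to build the HMP data sets: taxa are ranked by their total count across subjects, the top k are kept, and the rest are pooled into one extra column. The function name and the toy counts are hypothetical.

```python
def rad_vectors(counts, top_k=20):
    """Keep the top_k most abundant taxa and pool the rest into one column.

    counts: one list of taxon counts per subject, same taxon order in each.
    """
    n_taxa = len(counts[0])
    totals = [sum(row[t] for row in counts) for t in range(n_taxa)]
    # Rank taxa from most to least abundant across all subjects.
    ranked = sorted(range(n_taxa), key=lambda t: totals[t], reverse=True)
    keep, rest = ranked[:top_k], ranked[top_k:]
    return [[row[t] for t in keep] + [sum(row[t] for t in rest)] for row in counts]


# Two subjects, four taxa; keep the two most abundant plus a pooled column.
toy_counts = [[5, 1, 9, 2], [4, 0, 7, 3]]
toy_rad = rad_vectors(toy_counts, top_k=2)
```

Each row keeps its total read count, which is what allows proportions to be computed with respect to all taxa in the sample.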
A. This Monte-Carlo simulation procedure provides the power and size of the several-sample Dirichlet-multinomial parameter test comparison, using the likelihood-ratio-test statistics.
> data(saliva); data(throat); data(tonsils)
### Get a list of Dirichlet-multinomial parameters for the data
> fit.saliva <- DM.MoM(saliva); fit.throat <- DM.MoM(throat); fit.tonsils <- DM.MoM(tonsils)
### Set up the number of Monte-Carlo experiments
### We use 1 for speed, should be at least 1,000
> numMC <- 1
### Generate the number of reads per sample
### The first number is the number of reads and the second is the number of subjects
> nrsGrp1 <- rep(12000, 9); nrsGrp2 <- rep(12000, 11); nrsGrp3 <- rep(12000, 12)
> group.Nrs <- list(nrsGrp1, nrsGrp2, nrsGrp3)
### Computing size of the test statistics (Type I error)
> alphap <- fit.saliva$gamma
> pval1 <- MC.Xdc.statistics(group.Nrs, numMC, alphap, "hnull")
> pval1
### Computing power of the test statistics (1 - Type II error)
> alphap <- rbind(fit.saliva$gamma, fit.throat$gamma, fit.tonsils$gamma)
> pval2 <- MC.Xdc.statistics(group.Nrs, numMC, alphap)
> pval2
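The data-generation step behind such a Monte-Carlo experiment can be sketched as below, assuming that each simulated sample is a Dirichlet-multinomial draw with a given number of reads. This is not the HMP implementation and does not compute the likelihood-ratio test; the function names and the gamma vectors are hypothetical (in the tutorial they would come from DM.MoM fits).

```python
import random


def sample_dirichlet_multinomial(gamma, n_reads, rng):
    """One sample of taxon counts with n_reads reads and DM parameters gamma."""
    # Dirichlet draw: normalise independent Gamma(gamma_k, 1) variates.
    g = [rng.gammavariate(a, 1.0) for a in gamma]
    total = sum(g)
    probs = [v / total for v in g]
    # Multinomial draw: assign each read to a taxon with those probabilities.
    counts = [0] * len(gamma)
    for taxon in rng.choices(range(len(gamma)), weights=probs, k=n_reads):
        counts[taxon] += 1
    return counts


def simulate_groups(group_nrs, gammas, seed=0):
    """One simulated data set: a list of count matrices, one per group."""
    rng = random.Random(seed)
    return [[sample_dirichlet_multinomial(gamma, n, rng) for n in nrs]
            for nrs, gamma in zip(group_nrs, gammas)]


# Same layout as group.Nrs above: 9, 11 and 12 subjects with 12000 reads each.
group_nrs = [[12000] * 9, [12000] * 11, [12000] * 12]
# Hypothetical DM parameters for three taxa in each group.
gammas = [[5.0, 3.0, 2.0], [2.0, 3.0, 5.0], [1.0, 1.0, 8.0]]
simulated = simulate_groups(group_nrs, gammas)
```

Repeating this draw numMC times and recording how often the test rejects gives the size (same gamma in every group) or the power (different gammas).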
B. This Monte-Carlo simulation procedure provides the power and size of the several-sample RAD-probability mean test comparison with a known reference vector of proportions, using the generalized Wald-type statistics.
> data(saliva); data(throat); data(tonsils)
### Get a list of Dirichlet-multinomial parameters for the data
> fit.saliva <- DM.MoM(saliva); fit.throat <- DM.MoM(throat); fit.tonsils <- DM.MoM(tonsils)
### Set up the number of Monte-Carlo experiments
### We use 1 for speed, should be at least 1,000
> numMC <- 1
### Generate the number of reads per sample
### The first number is the number of reads and the second is the number of subjects
> nrsGrp1 <- rep(12000, 9); nrsGrp2 <- rep(12000, 11); nrsGrp3 <- rep(12000, 12)
> group.Nrs <- list(nrsGrp1, nrsGrp2, nrsGrp3)
### Computing size of the test statistics (Type I error)
> alphap <- fit.saliva$gamma
> pval1 <- MC.Xdc.statistics(group.Nrs, numMC, alphap, "hnull")
> pval1
### Computing power of the test statistics (1 - Type II error)
> alphap <- rbind(fit.saliva$gamma, fit.throat$gamma, fit.tonsils$gamma)
> pval2 <- MC.Xdc.statistics(group.Nrs, numMC, alphap)
> pval2
Work is done, relax on beach?
Microbial genome-wide association studies
Outline
● GWAS Analysis
● GWAS output analysis
● GWAS further analysis
1. A phylogenetic tree-based approach to genome-wide association studies in microbes
2. Simulated quantitative phenotype from abundance of two microbial species in the gut of the host.
1. A phylogenetic tree-based approach to genome-wide association studies in microbes
Install devtools, if necessary:
> install.packages("devtools", dep=TRUE)
> library(devtools)
Install treeWAS from github:
> install_github("caitiecollins/treeWAS/pkg", build_vignettes = TRUE)
> library(treeWAS)
✔ Identify genetic variables that are statistically associated with a phenotypic trait, while correcting for the confounding effects of population structure and recombination.
✔ Applicable to both bacterial and viral genetic data and to both binary and continuous phenotypes.
http://www.biorxiv.org/content/early/2017/05/22/140798
Data input requirements for treeWAS:
1. A genetic dataset: a matrix containing binary genetic data (whether this encodes SNPs, gene presence/absence, etc. is up to you). Individuals should be in the rows, and genetic variables in the columns. Both rows and columns must be appropriately labelled.
2. A phenotypic variable: a vector containing either a binary or continuous variable encoding the phenotype for each individual. Each element should have a name that corresponds to the row labels of the genetic data matrix (order does not matter).
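These two requirements can be checked before any analysis is run. The sketch below is a hypothetical helper, not part of treeWAS: it represents the genetic matrix as a dictionary of labelled rows and the phenotype as a dictionary of labelled values, then reports label mismatches and non-binary entries.

```python
def check_inputs(genotypes, phenotype):
    """Return a list of problems found in the two inputs (empty if none).

    genotypes: dict individual -> dict genetic variable -> 0 or 1
    phenotype: dict individual -> binary or continuous value
    """
    problems = []
    # Labels must match between the two inputs; their order is irrelevant.
    unmatched = sorted(set(genotypes) ^ set(phenotype))
    if unmatched:
        problems.append("labels not shared by both inputs: " + ", ".join(unmatched))
    columns = None
    for individual, row in sorted(genotypes.items()):
        if columns is None:
            columns = set(row)
        elif set(row) != columns:
            problems.append("%s has a different set of genetic variables" % individual)
        if any(value not in (0, 1) for value in row.values()):
            problems.append("%s has non-binary genetic data" % individual)
    return problems


good_snps = {"ind1": {"snp1": 0, "snp2": 1}, "ind2": {"snp1": 1, "snp2": 1}}
good_phen = {"ind2": 3.2, "ind1": 1.7}
bad_snps = {"ind1": {"snp1": 0, "snp2": 2}, "ind3": {"snp1": 1, "snp2": 1}}

ok_report = check_inputs(good_snps, good_phen)
bad_report = check_inputs(bad_snps, good_phen)
```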
Wiki page of the tool: https://github.com/caitiecollins/treeWAS/wiki/2.-Data-&-Data-Cleaning
Data to use in this tutorial
1. Data is simulated: the simulation maintains both the population stratification and genetic composition of the dataset under analysis, but without recreating the "true" associations beyond those expected to arise from these confounding factors. The null dataset is simulated using the phylogenetic tree of the real dataset, as well as the original homoplasy distribution, including the number of substitutions per site due to both mutation and recombination.
Go to RStudio and open the script treeGWAS_analysis.R.
Work is done, relax on beach?
2. Simulated quantitative phenotype from abundance of two microbial species in the gut of the host.
Overview of Tutorial
• The goal of this practical is to manipulate quantitative GWAS data and start exploring how machine learning algorithms can be used to analyze these data. We will be working with the genotypes of 89 individuals from the 1000 Genomes Project (http://www.internationalgenome.org/data) (Han Chinese and Japanese ancestry), and a simulated quantitative phenotype.
• This phenotype can be imagined to represent the relative abundance of two microbial species in the gut of the host.
• Link to data: /mnt/lustre/groups/CBBI0818/Age_NGS/Microbiomes/GWAS-GUT/
Data Analysis
GWAS with plink: One of the most well-known pieces of software for analyzing GWAS data is [PLINK](http://zzz.bwh.harvard.edu/plink/), developed by [Shaun Purcell](http://zzz.bwh.harvard.edu/) at Harvard, MGH and the Broad Institute.
• Uncompress the data using: tar zxvf simulated-gwas.tar.gz
• Link to data: /mnt/lustre/groups/CBBI0818/Age_NGS/Microbiomes/GWAS-GUT/
File formats:
* `.ped`: The samples data. Contains as many lines as samples in the data and `6 + 2 x num_snps` columns. The first 6 columns contain the following information: family identifier (`FID`), individual identifier (`IID`), paternal identifier (`PAT`), maternal identifier (`MAT`), sex (`SEX`; male=1, female=2, other=unknown) and phenotype (`PHENOTYPE`). The following columns contain all bi-allelic SNP information. Each SNP is coded on 2 columns, one for each of the two alleles of the individual's genotype. The alleles can be encoded `A, T, C, G` or `1, 2` (corresponding to one or the other allele).
* `.map`: The markers data. Contains as many lines as SNPs, and 4 columns per SNP: chromosome, SNP identifier, genetic distance in morgans, and base-pair position.
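The two layouts can be made concrete with a small parser. This is a minimal sketch that assumes whitespace-separated fields and a numeric phenotype, as in this tutorial; the function names and the example lines are made up for illustration.

```python
def parse_map_line(line):
    """One .map line: chromosome, SNP identifier, genetic distance, position."""
    chrom, snp_id, morgans, bp = line.split()
    return {"chr": chrom, "snp": snp_id, "morgans": float(morgans), "bp": int(bp)}


def parse_ped_line(line):
    """One .ped line: six fixed columns, then two allele columns per SNP."""
    fields = line.split()
    fid, iid, pat, mat, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per SNP")
    # Pair up consecutive allele columns: one (allele1, allele2) tuple per SNP.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"FID": fid, "IID": iid, "PAT": pat, "MAT": mat, "SEX": sex,
            "PHENOTYPE": float(phenotype), "genotypes": genotypes}


# Hypothetical example lines (three SNPs, the last genotype missing).
sample = parse_ped_line("FAM1 IND1 0 0 1 2.35 A A G T 0 0")
marker = parse_map_line("1 rs123 0 12345")
```

The number of genotype pairs on every `.ped` line should equal the number of lines in the `.map` file.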
Start by checking the files are intact and plink works, and get some basic statistics on your data.
> plink --noweb --file simulated
From this command, PLINK understands it is going to find the genotype data in `simulated.ped` and SNP descriptions under `simulated.map`.
1. Quality control
Apply quality control filters:
• SNPs with minor allele frequency (MAF) lower than 1% will be removed. We focus on common variants for several reasons: the "common disease, common variant" hypothesis; the fact that rare variants are more likely to be technical artifacts; and, last but not least, because we have limited statistical power to detect the effect of rare SNPs.
• SNPs with missing data for more than 10% of individuals will be removed.
• SNPs that are not in Hardy-Weinberg equilibrium (HWE) (test p-value smaller than 1e-6) will be removed: departure from HWE is likely to be due to a genotyping error.
> plink --file simulated --maf 0.01 --hwe 1e-6 --geno 0.1 --make-bed --out simulated
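What the three filters compute for a single SNP can be sketched as follows. This is an illustration under stated assumptions rather than PLINK's implementation: missing genotypes are coded as ("0", "0"), and the HWE test is a 1-degree-of-freedom chi-square approximation, whereas PLINK uses an exact test. The function name and the toy genotypes are hypothetical.

```python
import math


def snp_qc(genotypes, maf_min=0.01, geno_max=0.1, hwe_min=1e-6):
    """QC summary for one bi-allelic SNP given (allele1, allele2) tuples."""
    called = [g for g in genotypes if "0" not in g]
    missing_rate = 1.0 - len(called) / len(genotypes)
    if not called:
        return {"missing": missing_rate, "maf": 0.0, "hwe_p": 1.0, "keep": False}
    alleles = [a for g in called for a in g]
    counts = {a: alleles.count(a) for a in set(alleles)}
    major = max(counts, key=counts.get)
    p = counts[major] / len(alleles)  # major allele frequency
    q = 1.0 - p                       # minor allele frequency (MAF)
    n = len(called)
    obs_het = sum(1 for g in called if g[0] != g[1])
    obs_major = sum(1 for g in called if g == (major, major))
    observed = [obs_major, obs_het, n - obs_major - obs_het]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    keep = q >= maf_min and missing_rate <= geno_max and hwe_p >= hwe_min
    return {"missing": missing_rate, "maf": q, "hwe_p": hwe_p, "keep": keep}


in_hwe = snp_qc([("A", "A")] * 25 + [("A", "G")] * 50 + [("G", "G")] * 25)
all_het = snp_qc([("A", "G")] * 100)
rare = snp_qc([("A", "A")] * 199 + [("A", "G")])
sparse = snp_qc([("A", "G")] * 40 + [("A", "A")] * 20 + [("G", "G")] * 20 + [("0", "0")] * 20)
```

A SNP is kept only if it passes all three thresholds, mirroring the combination of `--maf`, `--geno` and `--hwe` in the command above.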
• How many SNPs passed quality control?
• Answer: 66 536 out of 83 534 (you can get this from the output of the PLINK command, either on screen or in the PLINK log file).
2. GWAS analysis
Let us now use PLINK to test for statistical association between each SNP and the phenotype.
> plink --noweb --bfile simulated --assoc --out GWAS
This creates a file called `GWAS.qassoc` (the `q` stands for "quantitative").
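For a quantitative phenotype, the per-SNP association test amounts to a simple linear regression of the phenotype on the allele dosage. The sketch below illustrates that calculation and is not PLINK's code: it returns the regression coefficient, its standard error, R2 and the t statistic, and omits the P-value because the Python standard library has no t-distribution. The function name and the toy data are hypothetical.

```python
import math


def quantitative_assoc(dosages, phenotypes):
    """Regress the phenotype on allele dosage (0, 1 or 2) for one SNP."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    if sxx == 0 or syy == 0:
        return None  # monomorphic SNP or constant phenotype: nothing to test
    beta = sxy / sxx
    residual_ss = syy - beta * sxy
    se = math.sqrt(residual_ss / (n - 2) / sxx)
    return {"NMISS": n, "BETA": beta, "SE": se,
            "R2": sxy ** 2 / (sxx * syy), "T": beta / se}


# Six individuals: the phenotype rises by about 1 unit per copy of the allele.
result = quantitative_assoc([0, 0, 1, 1, 2, 2], [1.0, 1.2, 2.1, 1.9, 3.0, 3.2])
```

The keys follow the column names of the `.qassoc` output; the P column would be obtained from T with n - 2 degrees of freedom.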
3. GWAS output analysis
You can have a look at the contents of this file from a sh terminal:
> more GWAS.qassoc
> awk '$9 < 0.00005' GWAS.qassoc
We are going to use Python 2.7 to analyze the output of PLINK.
On CHPC:
> module load chpc/python/2.7.12
Or, on hex:
> /opt/exp_soft/python-2.7.3/bin/python
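The same filtering can be done in Python. This is a minimal sketch that assumes the usual `.qassoc` column order (CHR, SNP, BP, NMISS, BETA, SE, R2, T, P), so the P-value is the ninth field; the function name and the example lines are made up for illustration.

```python
def significant_snps(lines, threshold=0.00005):
    """Return (CHR, SNP, BP, P) for rows whose P-value is below threshold."""
    hits = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header, and rows without a P-value.
        if len(fields) < 9 or fields[0] == "CHR" or fields[8] == "NA":
            continue
        p_value = float(fields[8])
        if p_value < threshold:
            hits.append((fields[0], fields[1], int(fields[2]), p_value))
    return hits


# Hypothetical lines in .qassoc layout.
example = [
    " CHR   SNP      BP  NMISS   BETA     SE      R2      T        P",
    "   1   rs1    1000     89  0.012  0.110  0.0001  0.109   0.9134",
    "   1   rs2    2500     89  0.950  0.150  0.3156  6.333  1.1e-08",
    "   2   rs3     700     89     NA     NA      NA     NA       NA",
]
top_hits = significant_snps(example)
```

To run it on the real output, pass an open file instead of the list, for example `significant_snps(open("GWAS.qassoc"))`.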
> Rscript qqplot.R
Or start R:
> R
Then, in the R terminal:
> source("qqplot.R")
Or, in RStudio, open the script qqplot.R and run it line by line.
> cp GWAS.qassoc GWAS2.qassoc
Let us remove the rows with missing values ("NA"):
> sed -e '/NA/d' GWAS.qassoc > GWAS2.qassoc
Open GWAS2.qassoc with nano and remove the header (press Ctrl+K on the line to be removed), then save with Ctrl+X and type y.
A Manhattan plot can be obtained as follows:
> python Mahanatha.py GWAS3.qassoc --cols=0,2,8 --colors=kbc --image=MWAS.png
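The three columns passed to the script (0, 2 and 8, i.e. CHR, BP and P when counting from zero) are what a Manhattan plot needs. The sketch below shows one way to turn them into plot coordinates; it is an assumption about the general technique, not the content of the plotting script, and the function name and toy rows are hypothetical.

```python
import math


def manhattan_points(rows):
    """Convert (chromosome, position, p-value) rows into plot coordinates.

    Returns (chromosome, x, y) with x a cumulative genome position and
    y = -log10(p), so that strong associations appear as tall points.
    """
    by_chrom = {}
    for chrom, bp, p_value in rows:
        by_chrom.setdefault(chrom, []).append((bp, p_value))
    points, offset = [], 0
    for chrom in sorted(by_chrom):
        for bp, p_value in sorted(by_chrom[chrom]):
            points.append((chrom, offset + bp, -math.log10(p_value)))
        # Shift the next chromosome to the right of this one.
        offset += max(bp for bp, _ in by_chrom[chrom])
    return points


toy_rows = [(2, 50, 0.5), (1, 500, 1e-6), (1, 100, 0.01)]
toy_points = manhattan_points(toy_rows)
```

Plotting x against y, with one colour per chromosome, gives the familiar skyline in which peaks mark candidate loci.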
4. GWAS further analysis
Let us use the pipeline.py script.
Work is done, relax on beach?
Thank you!
Dr. Chimusa | [email protected]