
Power and Sample-size Estimation for Microbiome Studies

Dr Chimusa | Department of Pathology | University of Cape Town | AGe, 2017

Outline

• Overview of Tutorial
• Power Calculation from R micropower tool
• Power Calculation from R HMP tool

AGe Dr. Chimusa

Overview Tutorial

• Microbiome studies often compare groups of microbial communities with different environmental exposures, or to which different interventions have been applied.

Example:

• A study may evaluate the difference between the respiratory tract microbial communities of human subjects with exposure to different antibiotic treatments.

The fundamental measure in such a study is the count of community members (species or operational taxonomic units, OTUs), typically obtained by sequencing a marker gene such as the 16S ribosomal RNA gene for bacteria.

Overview Tutorial

Figure: Methods and Analysis (from Clarke et al., 2017).

Clarke TH, Gomez A, Singh H, Nelson KE, Brinkac LM. Integrating the Microbiome as a Resource in the Forensics Toolkit. Forensic Science International: Genetics. 2017 Jun 27; 30: 141-147.

Overview Tutorial



Pairwise distance metrics facilitate standardized comparison of community membership between individual study subjects by addressing the problems of differential membership and mutual absence.

Overview Tutorial


Why are sample-size and power calculations needed?

• The design of microbiome studies demands consideration of statistical power: an adequate number of subjects must be recruited to ensure that the effect expected from the exposure or intervention of interest can be detected.

Overview Tutorial

What is power?

• For each simulated effect, power can be calculated as the proportion of bootstrap test statistics (computed on bootstrapped distance matrices) for which the P-value is less than the pre-specified threshold for type I error (0.05).
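The definition above can be sketched in a few lines of Python. This is only an illustration: the function name estimate_power and the p-values are hypothetical, and in practice the p-values would come from the bootstrap PERMANOVA tests.

```python
# Python 3
def estimate_power(p_values, alpha=0.05):
    """Fraction of bootstrap p-values below the type I error threshold."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p-values from 8 bootstrap tests at a single effect size
boot_p = [0.001, 0.020, 0.049, 0.050, 0.200, 0.003, 0.700, 0.010]
power = estimate_power(boot_p)
print(power)
```

Note that the comparison is strict, so a p-value exactly equal to the threshold does not count as a detection.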

Work is done, relax on beach?


Power Calculation from R micropower tool

# Installing micropower
1. Log in to CHPC, go to your working directory /mnt/lustre/users/USERNAME/ and enter AGe_NGS (if you do not have that folder, create it: mkdir AGe_NGS, then cd AGe_NGS).
2. Copy the Microbiomes folder from the group directory:
> cp -r /mnt/lustre/groups/CBBI0818/AGe_NGS/Microbiomes/ ./
3. Enter it with cd Microbiomes and then cd power.
4. Download micropower:
> git clone https://github.com/brendankelly/micropower.git
Then cd micropower and cd R.
5. To run R on the server, load the R module:
> module add chpc/R/3.3.3-gcc6.3.0
6. Open R by typing R in the terminal.
7. Install micropower:
> library(githubinstall)
> githubinstall("micropower")
> library(devtools)
> install_github("brendankelly/micropower")


Overview Micropower

The variation in community composition between microbiome samples can be measured by pairwise distance, based on either presence-absence or quantitative species abundance data.
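As a small illustration of the two kinds of distance, the sketch below computes an unweighted (presence-absence) and a weighted (abundance-based) Jaccard distance between two samples. The formulas are one common definition and are assumed here, not taken from the micropower source; the function names and the counts are hypothetical.

```python
# Python 3
def unweighted_jaccard(a, b):
    """1 - (OTUs present in both samples) / (OTUs present in either sample)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def weighted_jaccard(a, b):
    """1 - sum of per-OTU minima / sum of per-OTU maxima."""
    denom = sum(max(x, y) for x, y in zip(a, b))
    if denom == 0:
        return 0.0
    return 1.0 - sum(min(x, y) for x, y in zip(a, b)) / denom


# Hypothetical OTU counts for two samples (4 OTUs; the last is absent in both)
sample_1 = [10, 0, 5, 0]
sample_2 = [5, 5, 0, 0]
d_unweighted = unweighted_jaccard(sample_1, sample_2)
d_weighted = weighted_jaccard(sample_1, sample_2)
print(d_unweighted, d_weighted)
```

The OTU that is absent from both samples contributes to neither distance, which is how these metrics handle mutual absence.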

PERMANOVA, a permutation-based extension of multivariate analysis of variance to a matrix of pairwise distances, partitions within-group and between-group distances to permit assessment of the effect of an exposure or intervention (grouping factor) upon the sampled microbiome.
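The partitioning described above can be sketched as follows. This is a minimal pure-Python illustration of a one-factor PERMANOVA (pseudo-F statistic with a label-permutation p-value), not the implementation used by micropower; the function names, the toy distance matrix, and the number of permutations are all assumptions made for the example.

```python
# Python 3
import random


def pseudo_f(dist, groups):
    """Return (pseudo-F, R^2) for a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares, one term per group
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    f_stat = (ss_among / (a - 1)) / (ss_within / (n - a))
    return f_stat, ss_among / ss_total


def permanova(dist, groups, n_perm=999, seed=0):
    """Permutation p-value: share of label shuffles with F >= observed F."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)


# Toy example: 6 subjects placed on a line, two well-separated groups
points = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
dist = [[abs(p - q) for q in points] for p in points]
groups = ["A", "A", "A", "B", "B", "B"]
f_obs, r2, p_value = permanova(dist, groups)
print(f_obs, r2, p_value)
```

With only three subjects per group there are few distinct label arrangements, so the permutation p-value cannot become very small even for a large effect; this is one reason group size matters for power.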

Within-group distance and exposure/intervention effect size must be accurately modeled to estimate statistical power for a microbiome study that will be analyzed with pairwise distances and PERMANOVA.

Kelly, B.J., Gross, R., Bittinger, K., Sherrill-Mix, S., Lewis, J.D., Collman, R.G., Bushman, F.D. and Li, H., 2015. Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA. Bioinformatics, 31(15), pp.2461-2468.


Overview Micropower

A. Simulate a set of OTU tables that incorporates a range of between-group effects in addition to the desired within-group distance distribution, using the simPower command.

> library(micropower)
> T <- sapply(simPower(c(16,16,16),100,10,0.8,seq(0,0.3,length.out=100)),FUN=function(x) {calcOmega2(calcWJstudy(x))})

simPower arguments:

• Vector representing subjects per exposure/intervention group
• Number of simulated OTUs
• Number of sequence counts per OTU bin
• Proportion of sequence counts to retain after subsampling
• Effect: proportion of unique community membership in the affected group of subjects

Return: list of two-dimensional-matrix OTU tables, with row and column names to suit downstream analysis.

calcWJstudy: generates a square matrix of pairwise weighted Jaccard distances from an OTU table (for unweighted distances, replace W by U).

calcOmega2: estimates the proportion of distance accounted for by the grouping factor, corrected for the mean-squared error.
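To make the "corrected for the mean-squared error" idea concrete, the sketch below applies the standard ANOVA omega-squared formula to PERMANOVA sums of squares. It is assumed, not verified against the package source, that this matches what calcOmega2 estimates; the function name and the numbers are hypothetical.

```python
# Python 3
def omega_squared(ss_among, ss_within, n_subjects, n_groups):
    """Omega-squared: (SS_A - (a - 1) * MS_W) / (SS_T + MS_W)."""
    ms_within = ss_within / (n_subjects - n_groups)
    ss_total = ss_among + ss_within
    return (ss_among - (n_groups - 1) * ms_within) / (ss_total + ms_within)


# Hypothetical sums of squares for 6 subjects in 2 groups
ss_among, ss_within = 1.5, 0.04
r2 = ss_among / (ss_among + ss_within)          # uncorrected effect size
omega2 = omega_squared(ss_among, ss_within, 6, 2)
print(r2, omega2)
```

Omega-squared is smaller than R^2 for the same data, because the part of the between-group sum of squares expected from within-group noise alone is subtracted.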


Overview Micropower

A. Simulate a set of OTU tables that incorporates a range of between-group effects in addition to the desired within-group distance distribution, using the simPower command.

> P<-bootPower(T, boot_number=100,subject_group_vector=c(3,4,5),alpha=0.05)

Return: A data frame relating PERMANOVA power to effect size quantified by the coefficient of determination (R^2) and omega-squared.

> write.table(P,file="boot.txt",row.names = FALSE)

T: a list of square distance matrices, with names (from the previous slide).

boot_number: number of bootstrap samples to perform on each distance matrix in the list.

subject_group_vector: number of subjects in each group to sample, as a vector.

alpha: threshold for PERMANOVA type I error.

Take T as lapply(simPower(c(16,16,16),100,1,0.8,seq(0,0.3,length.out=100)),calcWJstudy)


Power Calculation from R HMP tool

Overview HMP

# Install HMP
> install.packages("HMP", repos="http://cran.r-project.org", dependencies=TRUE)
> library(HMP)

Data sets to mimic:

• Saliva (Throat and Tongue) data set formed by the ranked-abundance distribution (RAD) vectors of 24 subjects. Each RAD vector contains 21 elements: the 20 most abundant taxa at the genus level, plus one additional taxon holding the sum of the remaining, less abundant taxa in the sample. The format is a matrix of 24 rows by 21 columns, with each row being a separate subject and each column a different taxon.

Note: The incorporation of the additional taxon (taxon 21) in the analysis allows for estimating the RAD proportional-mean of taxa with respect to all the taxa within the sample.
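The pooling of rare taxa into one additional column can be sketched as below. This is an assumption-laden illustration: taxa are ranked by their total count across subjects, the helper name pool_rare_taxa is hypothetical, and the toy table uses top_k=2 instead of 20 to stay readable.

```python
# Python 3
def pool_rare_taxa(table, top_k=20):
    """Keep the top_k most abundant taxa (columns) and pool the rest.

    table: list of rows, one row of counts per subject.
    Returns rows with top_k + 1 columns; the last column is the pooled remainder.
    """
    n_taxa = len(table[0])
    totals = [sum(row[j] for row in table) for j in range(n_taxa)]
    order = sorted(range(n_taxa), key=lambda j: totals[j], reverse=True)
    keep, rest = order[:top_k], order[top_k:]
    return [[row[j] for j in keep] + [sum(row[j] for j in rest)] for row in table]


# Hypothetical counts: 2 subjects, 4 genera
counts = [
    [50, 30, 15, 5],
    [10, 60, 20, 10],
]
pooled = pool_rare_taxa(counts, top_k=2)
print(pooled)
```

Because the remainder is kept as its own column, each row still sums to the subject's total read count, so proportions are taken with respect to all taxa in the sample.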


Overview HMP


A. This Monte-Carlo simulation procedure provides the power and size of the several-sample Dirichlet-multinomial parameter test comparison, using likelihood-ratio test statistics.

> data(saliva); data(throat); data(tonsils)
### Get a list of Dirichlet-multinomial parameters for the data
> fit.saliva <- DM.MoM(saliva); fit.throat <- DM.MoM(throat); fit.tonsils <- DM.MoM(tonsils)
### Set up the number of Monte-Carlo experiments
### We use 1 for speed, should be at least 1,000
> numMC <- 1
### Generate the number of reads per sample
### The first number is the number of reads and the second is the number of subjects
> nrsGrp1 <- rep(12000, 9); nrsGrp2 <- rep(12000, 11); nrsGrp3 <- rep(12000, 12)
> group.Nrs <- list(nrsGrp1, nrsGrp2, nrsGrp3)
### Computing size of the test statistics (Type I error)
> alphap <- fit.saliva$gamma


Overview HMP


> pval1 <- MC.Xdc.statistics(group.Nrs, numMC, alphap, "hnull")
> pval1
### Computing power of the test statistics (power = 1 - Type II error)
> alphap <- rbind(fit.saliva$gamma, fit.throat$gamma, fit.tonsils$gamma)
> pval2 <- MC.Xdc.statistics(group.Nrs, numMC, alphap)
> pval2
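The data-generation step behind such a Monte-Carlo experiment can be sketched in Python. This is not the HMP implementation and it does not compute the test statistic; it only draws Dirichlet-multinomial counts for three groups under the null hypothesis (same parameter vector for every group). The gamma values and function name are hypothetical; the group sizes and read depth mirror the R example.

```python
# Python 3
import random


def sample_dirichlet_multinomial(gamma, n_reads, rng):
    """Draw one sample: taxa proportions ~ Dirichlet(gamma), then n_reads reads."""
    # A Dirichlet draw is a vector of gamma variates divided by their sum
    g = [rng.gammavariate(a, 1.0) for a in gamma]
    total = sum(g)
    probs = [x / total for x in g]
    counts = [0] * len(gamma)
    for taxon in rng.choices(range(len(gamma)), weights=probs, k=n_reads):
        counts[taxon] += 1
    return counts


rng = random.Random(1)
gamma_null = [5.0, 3.0, 2.0]                            # hypothetical DM parameters
group_reads = [[12000] * 9, [12000] * 11, [12000] * 12]  # reads per subject, per group

# One Monte-Carlo data set under the null: every group shares gamma_null
simulated = [[sample_dirichlet_multinomial(gamma_null, n, rng) for n in reads]
             for reads in group_reads]
print([len(group) for group in simulated])
```

Repeating this many times, testing each simulated data set, and recording how often the null is rejected gives the size of the test; drawing each group from a different parameter vector instead gives its power.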


Overview HMP


B. This Monte-Carlo simulation procedure provides the power and size of the several-sample RAD-probability mean test comparison with a known reference vector of proportions, using generalized Wald-type statistics.

> data(saliva); data(throat); data(tonsils)
### Get a list of Dirichlet-multinomial parameters for the data
> fit.saliva <- DM.MoM(saliva); fit.throat <- DM.MoM(throat); fit.tonsils <- DM.MoM(tonsils)
### Set up the number of Monte-Carlo experiments
### We use 1 for speed, should be at least 1,000
> numMC <- 1
### Generate the number of reads per sample
### The first number is the number of reads and the second is the number of subjects
> nrsGrp1 <- rep(12000, 9); nrsGrp2 <- rep(12000, 11); nrsGrp3 <- rep(12000, 12)
> group.Nrs <- list(nrsGrp1, nrsGrp2, nrsGrp3)
### Computing size of the test statistics (Type I error)
> alphap <- fit.saliva$gamma


Overview HMP



> pval1 <- MC.Xdc.statistics(group.Nrs, numMC, alphap, "hnull")
> pval1
### Computing power of the test statistics (power = 1 - Type II error)
> alphap <- rbind(fit.saliva$gamma, fit.throat$gamma, fit.tonsils$gamma)
> pval2 <- MC.Xdc.statistics(group.Nrs, numMC, alphap)
> pval2

Microbial genome-wide association studies

Outline

● GWAS Analysis
● GWAS output analysis
● GWAS further analysis


1. A phylogenetic tree-based approach to genome-wide association studies in microbes

2. Simulated quantitative phenotype from abundance of two microbial species in the gut of the host.



Install devtools, if necessary:

> install.packages("devtools", dep=TRUE)
> library(devtools)

Install treeWAS from github:

> install_github("caitiecollins/treeWAS/pkg", build_vignettes = TRUE)
> library(treeWAS)

✔ Identify genetic variables that are statistically associated with a phenotypic trait, while correcting for the confounding effects of population structure and recombination.

✔ Applicable to both bacterial and viral genetic data and to both binary and continuous phenotypes.

http://www.biorxiv.org/content/early/2017/05/22/140798



Data input requirements for treeWAS:

1. A genetic dataset: a matrix containing binary genetic data (whether this encodes SNPs, gene presence/absence, etc. is up to you). Individuals should be in the rows, and genetic variables in the columns. Both rows and columns must be appropriately labelled.

2. A phenotypic variable: a vector containing either a binary or continuous variable encoding the phenotype for each individual. Each element should have a name that corresponds to the row labels of the genetic data matrix (order does not matter).
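The two requirements can be checked before running an analysis. The sketch below is a hypothetical Python helper, not part of treeWAS: it represents the labelled matrix as a dictionary of rows, verifies that the genetic data are binary and that the names match, and returns the phenotype aligned to the row order.

```python
# Python 3
def check_inputs(snps, phen):
    """Validate inputs and return the phenotype values in row order.

    snps: dict mapping individual name -> dict mapping locus name -> 0 or 1
    phen: dict mapping individual name -> phenotype value
    """
    if set(snps) != set(phen):
        raise ValueError("individual names differ between genotypes and phenotype")
    loci = None
    for name, row in snps.items():
        if loci is None:
            loci = set(row)
        elif set(row) != loci:
            raise ValueError("individual %s has a different set of loci" % name)
        if any(value not in (0, 1) for value in row.values()):
            raise ValueError("non-binary genotype for individual %s" % name)
    return [phen[name] for name in snps]


# Hypothetical data: 3 individuals, 2 loci; the phenotype is in a different order
snps = {
    "ind1": {"snp1": 0, "snp2": 1},
    "ind2": {"snp1": 1, "snp2": 1},
    "ind3": {"snp1": 0, "snp2": 0},
}
phen = {"ind3": 2.1, "ind1": 0.4, "ind2": 1.3}
aligned = check_inputs(snps, phen)
print(aligned)
```

Matching by name rather than by position is what makes the order of the phenotype vector irrelevant.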

Wiki page of the tool: https://github.com/caitiecollins/treeWAS/wiki/2.-Data-&-Data-Cleaning



Data to use in this tutorial

1. Data is simulated: the simulation maintains both the population stratification and the genetic composition of the dataset under analysis, but without recreating the "true" associations beyond those expected to arise from these confounding factors. The null dataset is simulated using the phylogenetic tree of the real dataset, as well as the original homoplasy distribution, including the number of substitutions per site due to both mutation and recombination.


Go to Rstudio and open the script treeGWAS_analysis.R

Overview of Tutorial

• The goal of this practical is to manipulate quantitative GWAS data and start exploring how machine learning algorithms can be used to analyze this data. We will be working with the genotypes of 89 individuals from the 1000 Genomes Project (http://www.internationalgenome.org/data) (Han Chinese and Japanese ancestry), and a simulated quantitative phenotype.


• The above phenotype can be imagined to represent the relative abundance of two microbial species in the gut of the host.

• Link to data: /mnt/lustre/groups/CBBI0818/Age_NGS/Microbiomes/GWAS-GUT/

2. Simulated quantitative phenotype from abundance of two microbial species in the gut of the host.

Data Analysis

GWAS with plink: One of the most well-known pieces of software for analyzing GWAS data is [PLINK](http://zzz.bwh.harvard.edu/plink/), developed by [Shaun Purcell](http://zzz.bwh.harvard.edu/) at Harvard, MGH and the Broad Institute.


• Uncompress the data using: tar zxvf simulated-gwas.tar.gz


File formats:

* `.ped`: The samples data. Contains as many lines as samples in the data and `6 + 2 x num_snps` columns. The first 6 columns contain the following information: family identifier (`FID`), individual identifier (`IID`), paternal identifier (`PAT`), maternal identifier (`MAT`), sex (`SEX`; male=1, female=2, unknown=other) and phenotype (`PHENOTYPE`). The following columns contain all bi-allelic SNP information. Each SNP is coded on 2 columns, each corresponding to one strand of DNA. The SNP can be encoded `A, T, C, G` or `1, 2` (corresponding to one or the other allele).
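The `.ped` layout can be made concrete with a short Python sketch that splits one line into the 6 leading fields and the per-SNP allele pairs. The function name and the example line are hypothetical; the `0 0` pair in the example uses PLINK's default missing-genotype code.

```python
# Python 3
def parse_ped_line(line):
    """Split one .ped line into sample information and a list of allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 + 2 x num_snps columns")
    fid, iid, pat, mat, sex, phenotype = fields[:6]
    alleles = fields[6:]
    # Each SNP occupies two consecutive columns
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"FID": fid, "IID": iid, "PAT": pat, "MAT": mat,
            "SEX": sex, "PHENOTYPE": phenotype, "genotypes": genotypes}


# Hypothetical sample with 3 SNPs; the last genotype is missing
record = parse_ped_line("FAM1 IND1 0 0 1 2.35 A A G T 0 0")
print(record["IID"], record["genotypes"])
```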


* `.map`: The markers data. Contains as many lines as SNPs, and 4 columns per SNP: chromosome, SNP identifier, genetic distance in morgans, and base-pair position.
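A matching sketch for the `.map` file reads the 4 columns of one marker line. The function name, dictionary keys, and the example marker are hypothetical.

```python
# Python 3
def parse_map_line(line):
    """Parse one .map line: chromosome, SNP id, genetic distance, bp position."""
    chrom, snp_id, genetic_distance, bp_position = line.split()
    return {
        "chrom": chrom,
        "snp": snp_id,
        "genetic_distance": float(genetic_distance),  # in morgans
        "bp": int(bp_position),
    }


# Hypothetical marker line
marker = parse_map_line("1 snp_0001 0 752566")
print(marker)
```

The i-th line of the `.map` file describes the i-th pair of allele columns in the `.ped` file, so the two files must list SNPs in the same order.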

Data Analysis




Start by checking the files are intact and plink works, and get some basic statistics on your data.

> plink --noweb --file simulated

From this command, PLINK understands it is going to find the genotype data in `simulated.ped` and SNP descriptions under `simulated.map`.

1. Quality control


Apply quality control filters:

• SNPs with minor allele frequency (MAF) lower than 1% will be removed. We focus on common variants for several reasons: the "common disease, common variant" hypothesis; the fact that rare variants are more likely to be technical artifacts; and, last but not least, because we have limited statistical power to detect the effect of rare SNPs.

• SNPs with missing data for more than 10% of individuals will be removed.

• SNPs that are not in Hardy-Weinberg equilibrium (HWE) (p-value smaller than 1e-6) will be removed: departure from HWE is likely to be due to a genotyping error.

> plink --file simulated --maf 0.01 --hwe 1e-6 --geno 0.1 --make-bed --out simulated
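The three filters can be illustrated for a single SNP in Python. This sketch is not PLINK's code: genotypes are assumed to be coded as allele counts (0, 1, 2, with None for missing), the function name is hypothetical, and the HWE p-value uses a 1-degree-of-freedom chi-square approximation rather than an exact test.

```python
# Python 3
import math


def snp_qc(genotypes, maf_min=0.01, geno_max=0.1, hwe_min=1e-6):
    """Return (maf, missing_rate, hwe_p, keep) for one SNP."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1.0 - len(called) / len(genotypes)
    n = len(called)
    if n == 0:
        return 0.0, missing_rate, 1.0, False
    n_aa, n_ab, n_bb = called.count(0), called.count(1), called.count(2)
    p = (2 * n_bb + n_ab) / (2.0 * n)          # frequency of the counted allele
    maf = min(p, 1.0 - p)
    # Chi-square goodness of fit to Hardy-Weinberg proportions
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))   # chi-square survival function, 1 df
    keep = maf >= maf_min and missing_rate <= geno_max and hwe_p >= hwe_min
    return maf, missing_rate, hwe_p, keep


# Hypothetical SNP in perfect HWE (allele frequency 0.3, 100 individuals)
good_snp = [0] * 49 + [1] * 42 + [2] * 9
# Hypothetical SNP where every individual is heterozygous (strong HWE departure)
bad_snp = [1] * 100

good_result = snp_qc(good_snp)
bad_result = snp_qc(bad_snp)
print(good_result[3], bad_result[3])
```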


• How many SNPs passed quality control?
• Answer: 66 536 out of 83 534 (you can get this from the output of the PLINK command, either on screen or in the PLINK log file).

2. GWAS analysis

Let us now use PLINK to test for statistical association between each SNP and the phenotype.

> plink --noweb --bfile simulated --assoc --out GWAS

This creates a file called `GWAS.qassoc` (the `q` stands for "quantitative").
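For a quantitative phenotype, a single-SNP association test amounts to a simple linear regression of the phenotype on the allele count. The sketch below illustrates that calculation for one SNP; it is not PLINK's code, the function name and data are hypothetical, and the p-value is left out because it requires the t distribution.

```python
# Python 3
import math


def linear_assoc(dosages, phenotypes):
    """Regress phenotype on allele count; return (beta, se, r2, t)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                        # effect per copy of the counted allele
    r2 = sxy ** 2 / (sxx * syy)             # share of phenotypic variance explained
    residual_ss = syy - beta * sxy
    se = math.sqrt(residual_ss / (n - 2) / sxx)
    return beta, se, r2, beta / se


# Hypothetical data: 6 individuals, allele counts and a quantitative phenotype
dosages = [0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
beta, se, r2, t_stat = linear_assoc(dosages, phenotypes)
print(beta, se, r2, t_stat)
```

These quantities correspond to the BETA, SE, R2 and T columns of a `.qassoc` file.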

Data Analysis


You can have a look at the contents of this file at the sh terminal using:
> more GWAS.qassoc
> awk '$9 < 0.00005' GWAS.qassoc
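The same filter can be written in Python, which makes the handling of the header and of "NA" rows explicit. The lines below are hypothetical stand-ins for the contents of `GWAS.qassoc`, and the function name is an assumption.

```python
# Python 3
def significant_snps(lines, threshold=5e-5):
    """Return (SNP, P) pairs with P below the threshold, skipping NA rows."""
    header = lines[0].split()
    snp_col, p_col = header.index("SNP"), header.index("P")
    hits = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= p_col or fields[p_col] == "NA":
            continue
        p_value = float(fields[p_col])
        if p_value < threshold:
            hits.append((fields[snp_col], p_value))
    return hits


# Hypothetical .qassoc content (whitespace-delimited, P is the 9th column)
qassoc_lines = [
    " CHR   SNP    BP  NMISS  BETA   SE    R2     T      P",
    "   1  snpA  1000     89  0.51  0.10  0.23  5.10  2.0e-06",
    "   1  snpB  2000     89  0.02  0.11  0.00  0.18  0.86",
    "   2  snpC  3000     89    NA    NA    NA    NA    NA",
]
hits = significant_snps(qassoc_lines)
print(hits)
```

To run it on the real file, the list of lines could be read with open("GWAS.qassoc").read().splitlines().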

3. GWAS output analysis

We are going to use Python 2.7 to analyze the output of PLINK.

On CHPC:
> module load chpc/python/2.7.12

Or, on hex:

/opt/exp_soft/python-2.7.3/bin/python

Data Analysis


> Rscript qqplot.R

Or start R:
> R
Then in the R terminal:
> source("qqplot.R")

Or in RStudio, open the script qqplot.R and run it line by line.
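The contents of qqplot.R are not shown here, so the following is only a sketch of how the points of a typical GWAS QQ plot are computed: observed p-values are sorted and paired with the quantiles expected under the null (uniform) distribution, both on the -log10 scale. The function name, the plotting positions (i + 0.5)/n, and the p-values are assumptions.

```python
# Python 3
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, most significant first."""
    observed = sorted(p_values)
    n = len(observed)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]


# Hypothetical p-values for 4 SNPs
points = qq_points([0.5, 0.01, 0.2, 0.9])
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points that rise above the diagonal (observed larger than expected) across the whole range suggest inflation, for example from population stratification, while a departure only in the tail is what true associations look like.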

> cp GWAS.qassoc GWAS2.qassoc

Let us remove missing values ("NA"):

> sed -e '/NA/d' GWAS.qassoc > GWAS2.qassoc

Open "GWAS2.qassoc" with nano to remove the header (press ctrl+k on the line to be removed), then save with ctrl+x and type y.

A Manhattan plot can be obtained as follows:
> python Mahanatha.py GWAS3.qassoc --cols=0,2,8 --colors=kbc --image=MWAS.png


4. GWAS further analysis

Let us use the pipeline.py script.

Thank you!

Dr. Chimusa | [email protected]
