Kaler, Avjinder Page 1
Basic Tutorial of Association Mapping
By
Avjinder Singh Kaler ([email protected] )
University of Arkansas, Fayetteville, AR
This basic tutorial walks through association mapping. It shows how to format the data
files and then use them in STRUCTURE, TASSEL, and GAPIT.
Software Needed:
Download and install the following software.
1. TASSEL: performs association analysis using two models, GLM and MLM.
http://www.maizegenetics.net/#!tassel/c17q9
2. GAPIT: an R package that performs association analysis.
http://www.maizegenetics.net/#!gapit/cmkv
3. STRUCTURE: used to investigate population structure.
http://pritchardlab.stanford.edu/structure.html
4. TextPad: a text editor used to format the files.
https://www.textpad.com/
STEPS IN ASSOCIATION MAPPING
1. First step: eliminate monomorphic markers and markers with too much missing data
(the acceptable amount depends on the quality of genotyping). Keep only markers with
a minor allele frequency (MAF) ≥ 0.05.
Do this in Excel, using Excel functions to calculate the missing-data rate and the MAF
of each marker.
[Screenshots: Excel calculations for missing data and MAF]
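The same filter can also be sketched in R instead of Excel. This is a minimal illustration, not the author's Excel procedure: the toy genotype matrix, the "NN" code for a missing call, and the 20% missing-data cutoff are all assumptions chosen for the example; only the MAF ≥ 0.05 threshold comes from the step above.

```r
# Hypothetical toy data: rows are markers, columns are individuals.
# Genotypes are two-letter calls; "NN" marks a missing call (an assumption).
geno <- rbind(
  m1 = c("AA", "AA", "CC", "AC", "AA", "CC"),
  m2 = c("GG", "GG", "GG", "GG", "GG", "GG"),  # monomorphic marker
  m3 = c("TT", "NN", "NN", "NN", "CC", "TT"),  # half of the calls are missing
  m4 = c("AA", "AA", "AA", "AA", "AA", "AG")
)

# Fraction of individuals with a missing call for one marker
missing_rate <- function(calls, missing = "NN") {
  mean(calls == missing)
}

# Minor allele frequency for one (biallelic) marker, ignoring missing calls
minor_allele_freq <- function(calls, missing = "NN") {
  observed <- calls[calls != missing]
  alleles <- unlist(strsplit(observed, ""))
  if (length(alleles) == 0) return(NA_real_)
  freqs <- table(alleles) / length(alleles)
  if (length(freqs) < 2) return(0)  # only one allele observed: monomorphic
  as.numeric(min(freqs))
}

stats <- data.frame(
  missing = apply(geno, 1, missing_rate),
  maf = apply(geno, 1, minor_allele_freq),
  row.names = rownames(geno)
)

# Assumed cutoffs: at most 20% missing data, MAF of at least 0.05
keep <- stats$missing <= 0.20 & stats$maf >= 0.05
filtered <- geno[keep, , drop = FALSE]
print(stats)
print(rownames(filtered))
```

In this toy set, m2 is dropped as monomorphic and m3 is dropped for missing data; adjust both cutoffs to suit the genotyping quality of the real data.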
After removing these markers, prepare two Excel files, Genotype and Phenotype, laid
out as shown below. These are used in TextPad to make the files for TASSEL and
GAPIT.
Genotype file
Phenotype File
Prepare a population structure file that will be used in STRUCTURE to estimate
the population structure (Q matrix).
o Use the genotype Excel file above to create a file like the one below.
o If the genotypes are di-nucleotide calls, split each into two single alleles
(AA -> A and A; AC -> A and C). If they are already single letters, duplicate each one.
o Convert the letters to integers (1, 2, 3, 4, ...) and use -1 for missing data.
o After conversion, the file should look like the one below.
Population Structure excel file
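The conversion described above can be sketched in R as follows. This is an illustration under stated assumptions, not the author's spreadsheet method: the toy data, the particular letter-to-integer mapping (A=1, C=2, G=3, T=4), the "_a"/"_b" column suffixes, and the helper name to_structure are all hypothetical. Any consistent integer mapping should serve the same purpose.

```r
# Hypothetical toy genotypes: rows are individuals, columns are markers.
geno <- data.frame(
  m1 = c("AA", "AC", "CC"),
  m2 = c("GG", "NN", "GT"),
  row.names = c("ind1", "ind2", "ind3"),
  stringsAsFactors = FALSE
)

# Assumed mapping; letters not listed here (e.g. "N") become missing
allele_codes <- c(A = 1L, C = 2L, G = 3L, T = 4L)

# Split each call into two alleles and recode them as integers (-1 = missing)
to_structure <- function(geno, codes = allele_codes, missing = -1L) {
  per_marker <- lapply(names(geno), function(marker) {
    calls <- geno[[marker]]
    # Single-letter calls are duplicated so every call has two alleles
    if (all(nchar(calls) == 1)) calls <- paste0(calls, calls)
    first <- codes[substr(calls, 1, 1)]
    second <- codes[substr(calls, 2, 2)]
    out <- cbind(first, second)
    out[is.na(out)] <- missing
    colnames(out) <- paste0(marker, c("_a", "_b"))
    out
  })
  result <- do.call(cbind, per_marker)
  rownames(result) <- rownames(geno)
  result
}

structure_input <- to_structure(geno)
print(structure_input)
```

Each marker becomes two adjacent integer columns, one per allele, which matches the "AA -> A and A" layout described in the bullets.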
2. Second step: format the files as each program requires, using TextPad
Open TextPad, paste in the genotype file data, and save it as a .hmp file.
Paste the phenotype file data into a new file and save it as a .txt file.
Create a new folder named myGAPIT on the C: drive and copy the two TextPad files above into it.
Paste the Population Structure Excel file data into TextPad and save it as a .txt file.
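If the tables are already in R, the same plain-text files can be written without TextPad. The sketch below assumes tab-delimited text is what the downstream programs expect; the toy phenotype table, its column names, and the temporary output folder are hypothetical stand-ins for the real data and the myGAPIT folder.

```r
# Hypothetical phenotype table: first column is the individual name
pheno <- data.frame(
  Taxa = c("ind1", "ind2", "ind3"),
  Height = c(101.5, 98.2, NA),
  stringsAsFactors = FALSE
)

# A temporary folder stands in for the myGAPIT folder in this example
out_dir <- file.path(tempdir(), "myGAPIT")
dir.create(out_dir, showWarnings = FALSE)
pheno_path <- file.path(out_dir, "Phenotype_file_name.txt")

# Tab-delimited, no quotes, no row names, missing values written as NA
write.table(pheno, pheno_path, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "NA")

# Read the file back to confirm it round-trips
check <- read.table(pheno_path, header = TRUE)
print(check)
```

The genotype and population structure tables can be written the same way, changing only the data frame and the file name.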
3. Third step: estimate the population structure (Q matrix) using STRUCTURE
In STRUCTURE, choose File -> New Project. Enter the project information (name the
project), select the directory (the folder where the population structure data file is
saved), choose the data file, and then click Next. Enter the information about the
input data set: number of individuals, number of loci, and missing data value (-1).
Use the genotype Excel file to find these numbers. Then click Next. Select the row of
marker names option and click Next. Select the individual name option and click Proceed.
Use the STRUCTURE tutorial below to estimate the best K for the population structure:
http://pbgworks.org/sites/pbgworks.org/files/Tutorial%20of%20STRUCTURE%20software.pdf
After estimating the best K, select the Q-matrix in the right-hand window, copy it
(Ctrl + C), and paste it into Excel to prepare a Q-matrix Excel file.
Format the Q-matrix Excel file like the one below.
Paste the Q-matrix Excel file data into TextPad and save it as a .txt file. This file
can be used in TASSEL.
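As a rough sketch of the Q-matrix formatting step in R: the code below checks that each individual's membership coefficients sum to about 1 and writes a tab-delimited text file. The toy values are invented, and the two header lines (`<Covariate>` and `<Trait>`) are an assumption modelled on TASSEL's example population structure files; the layout shown in the tutorial's Q-matrix figure should take precedence if it differs.

```r
# Hypothetical Q-matrix for K = 3, as copied out of STRUCTURE
q <- data.frame(
  Taxa = c("ind1", "ind2", "ind3"),
  Q1 = c(0.90, 0.20, 0.05),
  Q2 = c(0.08, 0.70, 0.15),
  Q3 = c(0.02, 0.10, 0.80),
  stringsAsFactors = FALSE
)

# Membership coefficients should sum to (about) 1 for every individual
row_sums <- rowSums(q[, -1])
stopifnot(all(abs(row_sums - 1) < 0.01))

# Assumed header layout, based on TASSEL example files
header <- c("<Covariate>",
            paste(c("<Trait>", names(q)[-1]), collapse = "\t"))
body <- do.call(paste, c(q, sep = "\t"))

q_path <- file.path(tempdir(), "Q_matrix.txt")
writeLines(c(header, body), q_path)
cat(readLines(q_path), sep = "\n")
```

The row-sum check is a cheap way to catch copy-and-paste mistakes, such as a shifted column, before the file reaches TASSEL.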
Copy all three files, the Genotype file (.hmp), Phenotype file (.txt), and Q-matrix file
(.txt), into one folder.
4. Perform association analysis in TASSEL
These three files, the Genotype file (.hmp), Phenotype file (.txt), and Q-matrix file
(.txt), are used in TASSEL.
Open TASSEL, then choose File and load the files from the folder: select the
"I will make my best guess" option and then select all of the files.
General Linear Model (GLM)
o Use three files: the Genotype file (.hmp), Phenotype file (.txt), and
population structure file (.txt).
o Select all three files together and join them using the union join option in
the Data tab.
o Select the joined file and click GLM in the Analysis tab.
o Click the Result tab and then Table to save the results in Excel format, and check the
P-values and R-squared values.
o From the Result tab, click Manhattan Plot to see the associated markers on a
graph.
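To make the GLM step more concrete, here is a minimal R sketch of a single-marker scan: each marker is tested by comparing a linear model with population structure only against a model with population structure plus the marker. This is a conceptual illustration, not TASSEL's implementation. The simulated data, the additive 0/1/2 genotype coding, the single covariate q1, and the helper name glm_scan are all assumptions.

```r
set.seed(1)
n <- 120

# Simulated genotypes coded as minor-allele counts (0, 1, 2); an assumption
geno <- matrix(rbinom(n * 5, 2, 0.4), nrow = n,
               dimnames = list(NULL, paste0("m", 1:5)))
q1 <- runif(n)                                   # stand-in for one Q-matrix column
trait <- 2 * geno[, "m3"] + 1.5 * q1 + rnorm(n)  # m3 is the true association

# For each marker: F-test of (trait ~ q + marker) against (trait ~ q)
glm_scan <- function(trait, geno, q) {
  reduced <- lm(trait ~ q)
  total_ss <- sum((trait - mean(trait))^2)
  t(sapply(colnames(geno), function(marker) {
    full <- lm(trait ~ q + geno[, marker])
    test <- anova(reduced, full)
    c(p_value = test[["Pr(>F)"]][2],
      # Share of total trait variation explained by adding the marker
      r_squared = (deviance(reduced) - deviance(full)) / total_ss)
  }))
}

glm_results <- glm_scan(trait, geno, q1)
print(glm_results)
```

The p_value and r_squared columns correspond to the two quantities the steps above say to check in the TASSEL result table.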
Mixed Linear Model (MLM)
o MLM uses all three files, the Genotype file (.hmp), Phenotype file (.txt), and
population structure file (.txt), plus a kinship matrix file.
o Create the kinship file by highlighting the Genotype file and then clicking
Kinship in the Analysis tab.
o Highlight the joined file and the kinship matrix file, and then click MLM
in the Analysis tab.
o Click the Result tab and then Table to save the results in Excel format, and check the
P-values and R-squared values.
o From the Result tab, click Manhattan Plot to see the associated markers on a
graph.
o Filter the data set to reduce false positives using the Filter tab: use Sites for
genotype filtration and Traits for phenotype filtration.
o See the TASSEL tutorial below for GLM and MLM analysis in detail:
http://pbgworks.org/sites/pbgworks.org/files/associationmappingtutorialusingtassel-111119213639-phpapp01.pdf
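The kinship matrix used by MLM captures pairwise relatedness among individuals. The sketch below computes one common marker-based estimator (a VanRaden-style centered matrix) in R. It is offered only to show what such a matrix looks like: it is not necessarily the estimator TASSEL uses, and the simulated 0/1/2 genotypes, the absence of missing data, and the function name kinship are assumptions.

```r
set.seed(2)

# Simulated genotypes: 8 individuals x 50 markers, coded 0/1/2, no missing data
geno <- matrix(rbinom(8 * 50, 2, 0.3), nrow = 8,
               dimnames = list(paste0("ind", 1:8), NULL))

# VanRaden-style kinship: center each marker by twice its allele frequency,
# then scale the cross-product by 2 * sum(p * (1 - p))
kinship <- function(geno) {
  p <- colMeans(geno) / 2
  polymorphic <- p > 0 & p < 1   # monomorphic markers carry no information
  z <- sweep(geno[, polymorphic, drop = FALSE], 2, 2 * p[polymorphic])
  tcrossprod(z) / (2 * sum(p[polymorphic] * (1 - p[polymorphic])))
}

K <- kinship(geno)
print(round(K, 2))
```

The result is a symmetric individuals-by-individuals matrix; larger off-diagonal values indicate pairs that share more alleles than expected from the allele frequencies.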
5. Perform association analysis in GAPIT
GAPIT (Genome Association and Prediction Integrated Tool) is an R package that
performs association analysis and genomic selection. It implements several models,
including the unified mixed model, EMMA, the compressed mixed linear model, and
P3D/EMMAx.
Create a folder named myGAPIT on the C: drive (if it does not already exist) and copy
the Genotype file (.hmp) and Phenotype file (.txt) into it.
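Open RStudio and run the following R code:
#Install packages (do this section only for a new installation of R)
#-------------------------------------------------------------------------------
# biocLite.R has been retired by Bioconductor; BiocManager is the current installer
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("multtest")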
#----------------------------------------------------------------------------
install.packages("gplots")
install.packages("LDheatmap")
install.packages("genetics")
#Step 0: Import libraries and GAPIT functions (run this section each time you start R)
#######################################################################################
library('MASS') # required for ginv
library(multtest)
library("gplots")
library("LDheatmap")
library("genetics")
library("compiler") #this library is already installed in R
#--------------------------------------------------------------------------------------
source("http://zzlab.net/GAPIT/gapit_functions.txt")
#---------------------------------------------------------------------------------------
source("http://zzlab.net/GAPIT/emma.txt")
# Set the working directory to the myGAPIT folder created on the C: drive
setwd("C:/myGAPIT")
#Tutorial 1: Basic scenario of compressed MLM by Zhang et al. (Nature Genetics, 2010)
#----------------------------------------------------------------------------------------
#Step 1: Set data directory and import files
# Replace the file names below with the names of the phenotype and genotype files saved in the myGAPIT folder
myY <- read.table("Phenotype_file_name.txt", header = TRUE)
myG <- read.delim("Genotype_file_name.hmp.txt", header = FALSE)
#Step 2: Run GAPIT
myGAPIT <- GAPIT(
  Y = myY,
  G = myG,
  PCA.total = 2
)
#Tutorial 2: Using ECMLM by Li et al. (BMC Biology, 2014)
#----------------------------------------------------------------------------------------
#Step 1: Set data directory and import files
myY <- read.table("Phenotype_file_name.txt", header = TRUE)
myG <- read.delim("Genotype_file_name.hmp.txt", header = FALSE)
#Step 2: Run GAPIT
myGAPIT <- GAPIT(Y=myY, G=myG, PCA.total=3, kinship.cluster=c("average", "complete", "ward"),
kinship.group=c("Mean", "Max"), group.from=200, group.to=1000000, group.by=10)
Results are saved in the myGAPIT folder. Use the GWAS results file (which opens in
Excel), the Manhattan plot, the QQ plot, and the PCA plot to identify the associated
markers, and check the P-values and R-squared values.
Compare the association mapping results from GAPIT and TASSEL.
Read GWAS papers to help interpret the results.
Other software for association analysis includes JMP Genomics and R (various
packages: http://cran.r-project.org/web/views/Genetics.html ).
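When checking P-values across many markers, some correction for multiple testing is usually applied before calling a marker associated. The R sketch below shows one way to do this with base R's p.adjust. The results table is invented for illustration, and its column names (SNP, P.value) and the 0.05 cutoff are assumptions rather than GAPIT's or TASSEL's actual output layout.

```r
# Hypothetical GWAS results: one row per marker
results <- data.frame(
  SNP = paste0("m", 1:6),
  P.value = c(0.20, 1e-6, 0.03, 0.004, 0.60, 0.75),
  stringsAsFactors = FALSE
)

# The y-axis of a Manhattan plot is -log10(p)
results$neg_log10_p <- -log10(results$P.value)

# Two standard corrections available in base R
results$bonferroni <- p.adjust(results$P.value, method = "bonferroni")
results$fdr <- p.adjust(results$P.value, method = "BH")

# Markers that remain significant at an assumed 5% false discovery rate
significant <- results[results$fdr < 0.05, ]
print(results)
print(significant$SNP)
```

Note that m3 has a raw P-value below 0.05 but does not survive either correction, which is the kind of false positive the filtering and comparison steps are meant to guard against.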
References:
1. http://pbgworks.org/sites/pbgworks.org/files/associationmappingtutorialusingtassel-111119213639-phpapp01.pdf
2. http://pbgworks.org/sites/pbgworks.org/files/Tutorial%20of%20STRUCTURE%20software.pdf
3. http://www.maizegenetics.net/#!gapit/cmkv
4. http://www.ndsu.edu/pubweb/~mcclean/plsc731/homework/papers/zhu%20et%20al%20%20status%20and%20prospects%20of%20association%20mapping%20in%20plants.pdf
Helpful Sites:
1. http://www.ndsu.edu/pubweb/~mcclean/plsc731/topic.htm
2. http://www.extension.org/plant_breeding_genomics
3. http://passel.unl.edu/communities/pbtn?idsubcollectionmodule=1130274157&idindependentpage=156
Important Books
1. Applied Statistical Genetics with R – For Population-based Association Studies
(Foulkes, 2009).
2. Genome-Wide Association Studies and Genomic Prediction (Gondro et al., 2013).
3. The Fundamentals of Modern Statistical Genetics (Laird and Lange, 2011).
Acknowledgment: Dr. Ainong Shi and Dr. Richard E. Mason, for their class.