Basic Tutorial of Association Mapping by Avjinder Kaler


Upload: avjinder-avi-kaler

Post on 14-Apr-2017


Basic Tutorial of Association Mapping

By

Avjinder Singh Kaler ([email protected] )

University of Arkansas, Fayetteville, AR

This basic tutorial shows how to perform association mapping. It explains how to format the data files and how to use them in STRUCTURE, TASSEL, and GAPIT.

Software Needed:

Download and install the following software:

1. TASSEL: performs association analysis using two models, GLM and MLM.
http://www.maizegenetics.net/#!tassel/c17q9

2. GAPIT: an R package that performs association analysis.
http://www.maizegenetics.net/#!gapit/cmkv

3. STRUCTURE: used to investigate population structure.
http://pritchardlab.stanford.edu/structure.html

4. TextPad: used to format the files.
https://www.textpad.com/


STEPS IN ASSOCIATION MAPPING

1. First step: eliminate monomorphic markers and markers with too much missing data (the acceptable amount depends on the quality of genotyping). Keep only markers with a minor allele frequency (MAF) ≥ 0.05.

Perform this step in Excel, using Excel functions to calculate the missing-data rate and the MAF of each marker.

[Figures: Missing data; MAF]

After removing these markers, prepare two Excel files, Genotype and Phenotype, formatted as shown below. These files are used in TextPad to make the files for TASSEL and GAPIT.

[Figure: Genotype file]

[Figure: Phenotype file]
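The marker filtering in step 1 can also be sketched in R instead of Excel. This is a minimal illustration, not the author's procedure: the toy genotype matrix, the "NN" missing-data code, and the 20% missing-data cutoff are assumptions made for the example; only the MAF ≥ 0.05 threshold comes from the tutorial.

```r
# Toy genotype matrix (hypothetical data): rows = markers, columns = individuals.
# "NN" is an assumed code for a missing call.
geno <- rbind(
  m1 = c("AA", "AA", "AA", "AA", "AA", "AA"),  # monomorphic
  m2 = c("AA", "AC", "CC", "AA", "CC", "AC"),  # polymorphic, complete
  m3 = c("GG", "NN", "NN", "NN", "GT", "TT"),  # 50% missing
  m4 = c("TT", "TT", "TT", "TT", "TT", "TC")   # MAF = 1/12
)

# Fraction of calls that are missing for one marker
missing_rate <- function(calls, missing_code = "NN") {
  mean(calls == missing_code)
}

# Frequency of the second most common allele among non-missing calls
minor_allele_freq <- function(calls, missing_code = "NN") {
  calls <- calls[calls != missing_code]
  alleles <- unlist(strsplit(calls, ""))
  if (length(alleles) == 0) return(NA_real_)
  freqs <- as.numeric(table(alleles)) / length(alleles)
  if (length(freqs) < 2) return(0)  # monomorphic marker
  sort(freqs, decreasing = TRUE)[2]
}

miss <- apply(geno, 1, missing_rate)
maf  <- apply(geno, 1, minor_allele_freq)

max_missing <- 0.20  # assumed cutoff; choose it according to genotyping quality
min_maf     <- 0.05  # threshold given in the tutorial

keep <- miss <= max_missing & maf >= min_maf
filtered <- geno[keep, , drop = FALSE]
print(data.frame(missing = miss, MAF = round(maf, 3), keep = keep))
```

With these toy data, m1 is dropped as monomorphic and m3 for missing data, while m2 and m4 are kept.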


Prepare a population structure file that will be used in STRUCTURE to estimate the population structure (Q matrix).

o Use the genotype Excel file above to create a file like the one below.
o Split each genotype into two alleles: if genotypes are coded with two nucleotides, separate them (AA -> A and A, AC -> A and C); if they are coded with one nucleotide, duplicate it.
o Convert the letters to integers (1, 2, 3, 4, ...), using -1 for missing data.
o After conversion, the file will look like the one below.

[Figure: Population structure Excel file]
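The conversion described above can be sketched in R as follows. The allele-to-integer mapping (A=1, C=2, G=3, T=4), the toy genotypes, and the layout with two rows per individual are assumptions made for illustration; the layout in the author's original figure is not recoverable, so check the result against the format your STRUCTURE project expects.

```r
# Assumed allele coding; any consistent integer coding would do.
allele_codes <- c(A = 1L, C = 2L, G = 3L, T = 4L)

# Convert one genotype call to two integer alleles.
# Two-letter calls are split (AC -> A, C); one-letter calls are duplicated (A -> A, A).
# Anything else (for example "NN") is treated as missing and coded -1.
call_to_alleles <- function(call, missing_value = -1L) {
  parts <- strsplit(call, "")[[1]]
  if (length(parts) == 1) parts <- rep(parts, 2)
  if (length(parts) != 2 || !all(parts %in% names(allele_codes))) {
    return(rep(missing_value, 2))
  }
  unname(allele_codes[parts])
}

# Hypothetical genotypes: rows = individuals, columns = markers.
geno <- matrix(
  c("AA", "AC", "NN",   # marker m1
    "G",  "T",  "G"),   # marker m2 (single-letter coding)
  nrow = 3,
  dimnames = list(c("ind1", "ind2", "ind3"), c("m1", "m2"))
)

# Build a table with two rows per individual (one row per allele copy).
rows <- lapply(rownames(geno), function(id) {
  alleles <- sapply(geno[id, ], call_to_alleles)  # 2 x n_markers integer matrix
  data.frame(id = id, alleles, stringsAsFactors = FALSE)
})
structure_table <- do.call(rbind, rows)
print(structure_table)
```

The resulting table can be pasted into TextPad or written out with write.table() as a tab-delimited .txt file.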

2. Second step: format the files as needed for the software using TextPad.

Open TextPad, paste in the genotype file data, and save it as a .hmp file.

Paste the phenotype file data into a new file and save it as a .txt file.

Create a new folder on the C drive named myGAPIT and copy the TextPad files above into it.

Paste the contents of the population structure Excel file into TextPad and save it as a .txt file.
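As an alternative to copying through TextPad, tab-delimited text files can be written directly from R. The sketch below assumes the standard HapMap layout (eleven descriptive columns followed by one column per individual) for the genotype file and a taxa column followed by trait columns for the phenotype file; the marker names, positions, trait values, and file names are made up, and the files are written to a temporary folder rather than C:/myGAPIT.

```r
# Hypothetical genotype table in HapMap layout (11 descriptive columns + samples).
hapmap <- data.frame(
  `rs#`       = c("m2", "m4"),
  alleles     = c("A/C", "T/C"),
  chrom       = c(1, 1),
  pos         = c(1500, 98000),
  strand      = "+",
  `assembly#` = NA, center = NA, protLSID = NA,
  assayLSID   = NA, panelLSID = NA, QCcode = NA,
  ind1 = c("AA", "TT"),
  ind2 = c("AC", "TT"),
  ind3 = c("CC", "TC"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical phenotype table: first column = individual name, then traits.
pheno <- data.frame(Taxa = c("ind1", "ind2", "ind3"),
                    Yield = c(4.2, 5.1, 3.8))

out_dir    <- tempdir()  # replace with the myGAPIT folder in practice
geno_path  <- file.path(out_dir, "Genotype_file_name.hmp.txt")
pheno_path <- file.path(out_dir, "Phenotype_file_name.txt")

# Tab-delimited, no quotes, no row names
write.table(hapmap, geno_path, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(pheno, pheno_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the files back the same way the GAPIT code in this tutorial does
myG <- read.delim(geno_path, header = FALSE)
myY <- read.table(pheno_path, header = TRUE)
print(dim(myG))  # header row + 2 markers, 11 + 3 columns
```

Reading the genotype file with header = FALSE keeps the header line as the first data row, which is how the GAPIT code later in this tutorial reads it.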

3. Third step: estimate the population structure (Q matrix) using STRUCTURE.

Open STRUCTURE and choose File -> New Project. Enter the project information (name the project) -> select the directory (the folder where the population structure data file is saved) -> choose the data file, and then click Next. Enter the information about the input data set: number of individuals -> number of loci -> missing data value (-1) (refer to the genotype Excel file for these values), and then click Next. Select the row of marker names and click Next. Select the individual names and click Proceed.

Use the STRUCTURE tutorial below to estimate the best K (number of populations):

http://pbgworks.org/sites/pbgworks.org/files/Tutorial%20of%20STRUCTURE%20software.pdf

After estimating the best K, select the Q-matrix in the right-hand window, copy it (Ctrl + C), and paste it into Excel to prepare a Q-matrix Excel file.

Format the Q-matrix Excel file like the one below.

[Figure: Q-matrix Excel file]

Copy the contents of the Q-matrix Excel file into TextPad and save it as a .txt file. This file can be used in TASSEL.
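Before using the Q-matrix as covariates, it can help to check it in R. The sketch below uses made-up membership values for K = 3. It verifies that each row sums to about 1, assigns each individual to its most likely cluster, and drops the last Q column, a common practice because the columns sum to 1 and are therefore linearly dependent. The header lines TASSEL expects in covariate files are not shown here; consult the TASSEL manual for them.

```r
# Hypothetical Q-matrix pasted from STRUCTURE (K = 3)
q <- data.frame(
  Taxa = c("ind1", "ind2", "ind3"),
  Q1   = c(0.90, 0.10, 0.30),
  Q2   = c(0.05, 0.80, 0.30),
  Q3   = c(0.05, 0.10, 0.40),
  stringsAsFactors = FALSE
)
q_cols <- c("Q1", "Q2", "Q3")

# Membership coefficients of each individual should sum to about 1
row_sums <- rowSums(q[, q_cols])
stopifnot(all(abs(row_sums - 1) < 0.01))

# Most likely cluster for each individual
q$cluster <- q_cols[max.col(as.matrix(q[, q_cols]), ties.method = "first")]

# Keep K - 1 columns as covariates to avoid linear dependence
q_covariates <- q[, c("Taxa", head(q_cols, -1))]
print(q_covariates)
```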


Copy all three files, the Genotype file (.hmp), Phenotype file (.txt), and Q-matrix file (.txt), into one folder.

4. Perform association analysis in TASSEL

The three files, Genotype file (.hmp), Phenotype file (.txt), and Q-matrix file (.txt), are used in TASSEL.

Open TASSEL, then choose File -> load all files from the folder by clicking the option "I will make my best guess" -> select all files.

General Linear Model (GLM)

o Use the three files: Genotype file (.hmp), Phenotype file (.txt), and population structure (Q-matrix) file (.txt).
o Select all three files together and join them using the Union Join option in the Data tab.
o Select the joined file and click GLM in the Analysis tab.
o Click the Result tab and then Table to save the results in Excel format, and check the P-values and R-squared values.
o From the Result tab, click Manhattan Plot to see the associated markers on a graph.
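To make the P-value and R-squared columns of the GLM output more concrete, here is a minimal R sketch of a single-marker test with a population structure covariate. It is an illustration of the idea using base R's lm(), not TASSEL's own computation; the data are simulated, and coding the marker as an allele count (0, 1, 2) is an assumption.

```r
set.seed(1)
n <- 60

# Simulated (hypothetical) data
q1     <- runif(n)                          # one Q-matrix column (K = 2, last column dropped)
marker <- sample(0:2, n, replace = TRUE)    # allele count at the tested marker
trait  <- 10 + 2 * q1 + 1.5 * marker + rnorm(n)

# Reduced model: trait explained by population structure only
reduced <- lm(trait ~ q1)
# Full model: population structure plus the marker
full <- lm(trait ~ q1 + marker)

# F-test for the marker after accounting for population structure
p_value <- anova(reduced, full)[2, "Pr(>F)"]

# Share of trait variance explained by the marker beyond the covariate
marker_r2 <- summary(full)$r.squared - summary(reduced)$r.squared

cat("P-value:", signif(p_value, 3), " marker R-squared:", round(marker_r2, 3), "\n")
```

A GWAS repeats this kind of test for every marker, which is why the results table has one P-value and one R-squared value per marker.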

Mixed Linear Model (MLM)

o MLM uses all three files, the Genotype file (.hmp), Phenotype file (.txt), and population structure (Q-matrix) file (.txt), plus a kinship matrix file.
o Create the kinship matrix by highlighting the Genotype file and then clicking Kinship in the Analysis tab.
o Highlight the joined file and the kinship matrix file, and then click MLM in the Analysis tab.
o Click the Result tab and then Table to save the results in Excel format, and check the P-values and R-squared values.
o From the Result tab, click Manhattan Plot to see the associated markers on a graph.
o Filter the data set to reduce false positives using the Filter tab: use the Sites tab for genotype filtration and the Traits tab for phenotype filtration.
o See the TASSEL tutorial below for GLM and MLM analysis in detail:

http://pbgworks.org/sites/pbgworks.org/files/associationmappingtutorialusingtassel-111119213639-phpapp01.pdf
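Once the results table has been saved, false positives are often controlled further by adjusting the P-values for the number of markers tested. The tutorial does not prescribe a significance threshold, so the sketch below is only one possible approach: it applies the Bonferroni and Benjamini-Hochberg (FDR) corrections from base R's p.adjust() to made-up P-values, with 0.05 as an assumed cutoff.

```r
# Hypothetical P-values for five markers
p_values <- c(m1 = 1e-7, m2 = 0.003, m3 = 0.02, m4 = 0.20, m5 = 0.65)

# Adjust for multiple testing
p_bonf <- p.adjust(p_values, method = "bonferroni")  # conservative
p_fdr  <- p.adjust(p_values, method = "BH")          # false discovery rate

alpha <- 0.05  # assumed significance level
sig_bonf <- names(p_values)[p_bonf < alpha]
sig_fdr  <- names(p_values)[p_fdr < alpha]

# -log10(P) is the quantity shown on the y-axis of a Manhattan plot
neg_log10_p <- -log10(p_values)

print(data.frame(P = p_values, Bonferroni = p_bonf, FDR = p_fdr,
                 neg_log10_P = round(neg_log10_p, 2)))
```

In this toy example the Bonferroni correction retains m1 and m2, while the less conservative FDR correction also retains m3.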

5. Perform association analysis in GAPIT

GAPIT (Genome Association and Prediction Integrated Tool) is an R package that performs association analysis and genomic selection. It implements several models, such as the unified mixed model, EMMA, the compressed mixed linear model, and P3D/EMMAx.

Create a folder on the C drive named myGAPIT and copy the Genotype file (.hmp) and Phenotype file (.txt) into it.

Open RStudio and run the following R code:

#Install packages (do this section only for a new installation of R)
#-------------------------------------------------------------------------------
# biocLite() has been retired by Bioconductor; BiocManager is its replacement
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("multtest")
#----------------------------------------------------------------------------
install.packages("gplots")
install.packages("LDheatmap")
install.packages("genetics")

#Step 0: Import libraries and GAPIT functions (run this section each time you start R)
#######################################################################################
library("MASS") # required for ginv
library("multtest")
library("gplots")
library("LDheatmap")
library("genetics")
library("compiler") # this library is already installed with R
#--------------------------------------------------------------------------------------
source("http://zzlab.net/GAPIT/gapit_functions.txt")
#---------------------------------------------------------------------------------------
source("http://zzlab.net/GAPIT/emma.txt")

# Set the working directory to the myGAPIT folder created on the C drive
setwd("C:/myGAPIT")

#Tutorial 1: Basic scenario of compressed MLM by Zhang et al. (Nature Genetics, 2010)
#----------------------------------------------------------------------------------------
#Step 1: Import files from the working directory
# Replace the file names below with the names of the files saved in the myGAPIT folder
myY <- read.table("Phenotype_file_name.txt", header = TRUE)
# header = FALSE keeps the HapMap header line as the first row
myG <- read.delim("Genotype_file_name.hmp.txt", header = FALSE)

#Step 2: Run GAPIT
myGAPIT <- GAPIT(
  Y = myY,
  G = myG,
  PCA.total = 2
)

#Tutorial 2: Using ECMLM by Li et al. (BMC Biology, 2014)
#----------------------------------------------------------------------------------------
#Step 1: Import files from the working directory
myY <- read.table("Phenotype_file_name.txt", header = TRUE)
myG <- read.delim("Genotype_file_name.hmp.txt", header = FALSE)

#Step 2: Run GAPIT
myGAPIT <- GAPIT(
  Y = myY,
  G = myG,
  PCA.total = 3,
  kinship.cluster = c("average", "complete", "ward"),
  kinship.group = c("Mean", "Max"),
  group.from = 200,
  group.to = 1000000,
  group.by = 10
)

Results are saved in the myGAPIT folder. Use the GWAS results file (opened in Excel), the Manhattan plot, the QQ-plot, and the PCA plot to identify the associated markers, and check the P-values and R-squared values.

Compare the association mapping results from GAPIT and TASSEL.

Read GWAS papers to help explain the results.

Other software for association analysis includes JMP Genomics and various R packages (http://cran.r-project.org/web/views/Genetics.html).
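When reading a QQ-plot from a GWAS, the observed P-values are compared with the values expected if no marker were associated with the trait. The sketch below shows how such a plot is built in base R and computes the genomic inflation factor (lambda), a commonly used summary that is not part of this tutorial's output. The P-values are simulated under no association, so lambda should be close to 1; in practice they would come from the GWAS results file.

```r
set.seed(42)

# Simulated (hypothetical) P-values for 1000 markers with no true association
p_values <- runif(1000)

# Observed versus expected -log10(P), both sorted from most to least significant
observed <- -log10(sort(p_values))
expected <- -log10(ppoints(length(p_values)))

# Genomic inflation factor: median observed chi-square statistic (1 df)
# divided by its expected median under the null hypothesis
chisq_obs <- qchisq(p_values, df = 1, lower.tail = FALSE)
lambda <- median(chisq_obs) / qchisq(0.5, df = 1)

plot(expected, observed,
     xlab = "Expected -log10(P)", ylab = "Observed -log10(P)",
     main = "QQ-plot of GWAS P-values")
abline(0, 1)  # points follow this line when P-values behave as expected

cat("Genomic inflation factor (lambda):", round(lambda, 3), "\n")
```

Points that rise well above the diagonal across the whole range, or a lambda well above 1, usually suggest that population structure or kinship has not been fully accounted for.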


References:

1. http://pbgworks.org/sites/pbgworks.org/files/associationmappingtutorialusingtassel-111119213639-phpapp01.pdf
2. http://pbgworks.org/sites/pbgworks.org/files/Tutorial%20of%20STRUCTURE%20software.pdf
3. http://www.maizegenetics.net/#!gapit/cmkv
4. http://www.ndsu.edu/pubweb/~mcclean/plsc731/homework/papers/zhu%20et%20al%20%20status%20and%20prospects%20of%20association%20mapping%20in%20plants.pdf

Helpful Sites:

1. http://www.ndsu.edu/pubweb/~mcclean/plsc731/topic.htm
2. http://www.extension.org/plant_breeding_genomics
3. http://passel.unl.edu/communities/pbtn?idsubcollectionmodule=1130274157&idindependentpage=156

Important Books:

1. Applied Statistical Genetics with R: For Population-based Association Studies (Foulkes, 2009).
2. Genome-Wide Association Studies and Genomic Prediction (Gondro et al., 2013).
3. The Fundamentals of Modern Statistical Genetics (Laird and Lange, 2011).

Acknowledgments: Dr. Ainong Shi and Dr. Richard E. Mason, for their class.