
Variant Analysis Tutorial, Release 8.1

Golden Helix, Inc.

Feb 14, 2019

Contents

1. Overview
• Exactly When is Rare Variant Analysis Appropriate?
• Download Annotation Data Sources

2. Collapsing Rare Variants
• Detect the Presence of a Variant in a Gene
• Count Number of Variants (Per Gene)

3. Variant Frequency Binning and CMC Method
• Create Frequency Bins
• CMC Method
• CMC Method with Regression
• CMC Method with Regression Using Transcripts

4. KBAC Method
• KBAC with Permutation Testing
• KBAC with Regression
• KBAC for Lower Sample Risk

5. SKAT-O and Generalized SKAT
• Burden Testing using the SKAT-O Feature
• SKAT and Generalized SKAT
• The SKAT-O Test

6. Compare Results with GWAS Approach
• Attempt an Association Test on Individual Rare Variants
• Compare with Rare Variant Methods
• Side Note: What About the Direction of Effect?
• Conclusion
• References

Bibliography


Updated: February 11, 2019

Level: Advanced

Packages: DNA-Seq, Power Seat

This tutorial covers a complex case/control variant analysis workflow. The steps include variant collapsing and association testing using sequence data and a simulated phenotype.

Requirements

To complete this tutorial you will need to download and unzip the following file, which includes a starter project.

Download

Variant_Analysis_Tutorial.zip (http://doc.goldenhelix.com/SVS/tutorials/variant_analysis/Variant_Analysis_Tutorial.zip)

Files included in the above ZIP file:

• Variant Analysis Tutorial - Starter project containing variant data.

1. Overview

The technology for high-throughput production of gene and protein sequence data is rapidly improving, and the applications of sequence technology are also developing rapidly. In general, DNA sequence analysis is tasked with studying the effects of rare and low-frequency variants, as well as potentially common variants, on the phenotype of interest. The bioinformatics of sequence analysis ranges from instrument-specific processing of raw data to the final aggregation of multiple samples into data mining and analysis tools. The software of sequence analysis can be categorized into the three stages of the data’s lifecycle: primary, secondary, and tertiary analysis.

Primary analysis can be defined as the machine-specific steps needed to call individual bases and compute quality scores for those calls. Because current sequencing technologies are generally based on the “shotgun” approach (http://en.wikipedia.org/wiki/Shotgun_sequencing) of chopping all the DNA up into smaller molecules and then generating what are referred to as “reads” of these small nucleotide sequences, it’s left up to secondary analysis to reassemble these reads to get a representation of the underlying long-range sequence. Secondary analysis is the process of assembling the individual reads or aligning them to a reference genome and detecting variants. While more customizable, and sometimes considered part of tertiary analysis, variant calling lends itself to being pipelined in the same manner as alignment. Out of the secondary analysis step of variant calling, you now have a more manageable set of differences between the sequenced samples and the reference, but there is still an enormous amount of data to make sense of. This is the realm of tertiary analysis.

Association techniques used in GWAS studies do not have the power to detect the significance of rare variants individually or provide tools for measuring their compound effect, referred to as rare variant burden. To do this, it is necessary to collapse several variants into a single covariate based on regions such as genes. This tutorial will cover methods based on detecting the presence of or counting the variants in a gene, the Combined Multivariate and Collapsing (CMC) method, the Kernel-Based Adaptive Collapsing (KBAC) method, and the Optimized Sequence Kernel Association Test (SKAT-O).

Exactly When is Rare Variant Analysis Appropriate?

The quick answer to this question is, “Rare Variant Analysis is normally appropriate when next-generation sequencing (NGS) is used, and normally not needed when microarrays are used.”

That is because NGS can produce many variant calls where the minor allele frequency (MAF), or the variant allele frequency (VAF), is only 1%, or sometimes far less than 1%, and for which only one, two, or three samples may contain the variant. Microarrays, by contrast, normally produce data that may be properly analyzed through GWAS methods. GWAS studies are best done where:

• The minor allele frequency (MAF) (also sometimes called variant allele frequency or VAF) is at least 1% (.01), and preferably 5% (.05), and

• The expected number of cases with at least one minor allele present (MAF or VAF times number of cases) and the expected number of controls with at least one minor allele present (MAF or VAF times number of controls) should both be at least 5, and preferably at least 10.

Otherwise, you should use rare variant analysis techniques.
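The two criteria above can be turned into a quick screening check. The following is a minimal sketch, not part of SVS: the function name `gwas_suitable` and its parameters are hypothetical, and the expected carrier counts are computed as the text describes them (MAF times the number of cases or controls).

```python
def gwas_suitable(maf, n_cases, n_controls, min_maf=0.01, min_expected=5):
    """Rule-of-thumb check for single-marker GWAS testing.

    maf          -- minor (or variant) allele frequency, as a fraction
    min_maf      -- lower MAF bound from the text (0.01; 0.05 is preferred)
    min_expected -- lower bound on expected carriers (5; 10 is preferred)
    """
    expected_cases = maf * n_cases        # expected cases carrying the minor allele
    expected_controls = maf * n_controls  # expected controls carrying the minor allele
    return (maf >= min_maf
            and expected_cases >= min_expected
            and expected_controls >= min_expected)


# A common variant in a moderately sized study passes the check.
common_ok = gwas_suitable(0.05, n_cases=200, n_controls=200)

# A 1% variant with only 47 cases does not: fewer than one carrier is expected.
rare_small = gwas_suitable(0.01, n_cases=47, n_controls=100)
```

Variants that fail this check are the ones for which the collapsing methods in this tutorial are intended.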



Since NGS can find common variants as well as rare variants, you may have NGS data for which you would like to use normal GWAS techniques; just make sure your data fits the specifications above. In order to do this, you may wish to filter out the rare variants from your data while leaving in the common variants.

Note: The fact that the data used in this tutorial must be analyzed using rare variant techniques is illustrated by the (attempted) GWAS test performed later in this tutorial for demonstration purposes. (See Attempt an Association Test on Individual Rare Variants.)

Download Annotation Data Sources

Later in the tutorial, an annotation track will be used to create a variant frequency bin spreadsheet. You will need to download this track before proceeding.

Note: If you have completed the Intro to NGS Tutorial (http://doc.goldenhelix.com/SVS/tutorials/intro_to_ngs/index.html), you should have already downloaded this particular track (1kG Phase3 - Variant Frequencies 5a with Genotype Counts, GHI).

• Open the previously downloaded project.

• From the SVS Project Navigator, choose Tools > Manage Data Sources.

• Click Public Annotations.

• Navigate to 1kG Phase3 - Variant Frequencies 5a with Genotype Counts, GHI and check the box to the left of this track. See Figure 1-1.

• Click Download.

After the download finishes, you may close the Data Source Library and Downloads windows and proceed with the tutorial.

Note: For this tutorial, you will also need the annotation track RefSeq Genes 105 Interim v1, NCBI. However, this track is available locally by default, so you should not need to download it.



Figure 1-1. Data Source Library


2. Collapsing Rare Variants

The Phenotypes + 1kG Exomes - Sheet 1 spreadsheet consists of exomes taken from the 1000 Genomes Phase 1 project, joined with a simulated phenotype. For expediency in this tutorial, we are using only data from Chromosomes 5 and 6.

Two approaches for collapsing the rare variants are discussed below. The first detects the presence of at least one variant in a gene or region, while the second counts the number of variants in a gene or region.
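To make the difference between the two approaches concrete, here is a minimal sketch of both collapsing schemes on a toy genotype table. It is an illustration only, not the SVS implementation: the data layout, the function name `collapse_by_gene`, and the gene and variant identifiers are hypothetical, and counting each variant site once regardless of zygosity is an assumption.

```python
from collections import defaultdict


def collapse_by_gene(genotypes, variant_gene):
    """Collapse per-variant genotypes into per-gene presence flags and counts.

    genotypes    -- dict: sample -> dict: variant id -> alt-allele count
                    (0 = Ref_Ref, 1 = Ref_Alt, 2 = Alt_Alt, None = missing)
    variant_gene -- dict: variant id -> gene name (hypothetical annotation)
    Returns (presence, counts), each a dict: sample -> dict: gene -> value.
    """
    presence, counts = {}, {}
    for sample, calls in genotypes.items():
        gene_counts = defaultdict(int)
        for variant_id, alt_count in calls.items():
            gene = variant_gene.get(variant_id)
            if gene is None:
                continue  # variant lies outside every annotated gene
            gene_counts[gene] += 0  # make sure the gene gets a column
            if alt_count:  # Ref_Alt or Alt_Alt; missing calls are ignored
                gene_counts[gene] += 1
        counts[sample] = dict(gene_counts)
        presence[sample] = {g: int(c > 0) for g, c in gene_counts.items()}
    return presence, counts


variant_gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_B"}
genotypes = {
    "S1": {"v1": 1, "v2": 2, "v3": 0},
    "S2": {"v1": 0, "v2": 0, "v3": None},
}
presence, counts = collapse_by_gene(genotypes, variant_gene)
```

Sample S1 carries two variants in GENE_A, so its count column holds 2 while its presence column holds 1. That extra information is what the count-based test can exploit.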

Detect the Presence of a Variant in a Gene

We will use the Count Variants per Gene tool to detect the presence or absence of variants for each subject in gene regions of the spreadsheet. The resulting spreadsheet contains binary columns (presence/absence) for each gene. A Fisher’s exact test can then be performed with the simulated phenotype.

• From Phenotypes + 1kG Exomes - Sheet 1, choose DNA-Seq > Collapsing Methods > Count Variants per Gene.

• The RefSeq Genes 105 Interim v1, NCBI track and the Reference field should be selected by default.

• Choose Both Ref_Alt and Alt_Alt variants after Choose the types of variants to count:.

• Leave Binary Presence/Absence of Variants (Per Gene).

• The dialog should look like Figure 2-1. Click OK.

Figure 2-1. Count Variants per Gene dialog



A new spreadsheet called Variant present in gene - Both Alt_Alt and Ref_Alt variants is created. The marker map entry for each column name contains the start and end map positions for the gene region represented by the column. The marker map also includes the gene name and other data for the gene represented by that column.

Join the phenotype information with this spreadsheet so that a Fisher’s Exact Test can be performed.

Note: This test is performed in [Cohen2004].

• Open the Phenotypes + 1kG Exomes - Sheet 1 spreadsheet and choose Select > Column > Inactivate All Columns. This will create spreadsheet Phenotypes + 1kG Exomes - Sheet 2.

• Reactivate columns 1 and 2 (Case? and Population) by left-clicking once on each column’s header.

• Choose File > Join or Merge Spreadsheets.

• Choose Variant present in gene - Both Alt_Alt and Ref_Alt variants from the list and click OK.

• Change New dataset name: to C/C + Variant present in gene - Both Alt_Alt and Ref_Alt variants.


Figure 2-2. Join or Merge Spreadsheets dialog

Next, run Fisher’s Exact Test from the merged spreadsheet.

• Click once on the Case? column header to change the color of the column to magenta to signify its dependent status.

• Choose Numeric > Fisher’s Exact Test for Binary Predictors.



The resulting spreadsheet contains the statistical test results.
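For intuition about what is computed for each gene column, the sketch below implements a two-sided Fisher's exact test on a single 2x2 table of variant presence versus case status, using only the Python standard library. It is a textbook formulation written for illustration, not the SVS code, and the function name is hypothetical.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table

                      case   control
        variant        a        b
        no variant     c        d

    The p-value sums the hypergeometric probabilities of every table with
    the same margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 cases fall in the "variant" row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small tolerance keeps ties from being lost to rounding error.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


# Toy gene: 3 of 4 cases and 1 of 4 controls carry a variant.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
neg_log10_p = -log10(p_value)  # the quantity plotted in the steps that follow
```

For this toy table the p-value is 34/70, about 0.486. With so few samples even a 3-to-1 imbalance is unremarkable.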

• From Fisher’s Exact Test Results, right-click the -log10 P-value column header and choose Plot Variable in GenomeBrowse.

• In the chromosome chooser dropdown near the upper-left corner of the plot window, enter 5 - 6. Then click the right-arrow icon just to the right of the genomic coordinate search bar near the top of the plot window.

• In the Plot Tree, right-click on the second -log10 P-Value graph item and select Edit Title.... Enter in the new name: Fisher’s Exact -log10 P on Presence of Var.

The plot should look like Figure 2-3.

Figure 2-3. P-value plot from Fisher’s Exact Test results

Count Number of Variants (Per Gene)

Repeat this analytic method, instead choosing Count of Number of Variants (Per Gene) in the Count Variants per Gene dialog. The resulting spreadsheet will contain integer columns (rather than binary) with the count of variants found in each gene. Thus a Numeric Association test can be performed (rather than a Fisher’s Exact Test). This test will assess the burden of multiple damaging variants in each individual, rather than the simple presence of one or more damaging variants. This test is similar to the test (Cohort Allelic Sums Test or CAST) discussed in [Morgenthaler2007].

• From Phenotypes + 1kG Exomes - Sheet 1, choose DNA-Seq > Collapsing Methods > Count Variants per Gene.


• After Choose the types of variants to count: select Both Ref_Alt and Alt_Alt variants.

• After Select the type of output to generate: choose Count of Number of Variants (Per Gene).


Figure 2-4. Count Variants per Gene dialog

Join the phenotype information with this spreadsheet.

• Open Phenotypes + 1kG Exomes - Sheet 2 and choose File > Join or Merge Spreadsheets.

• Choose the Variant counts per gene - Both Alt_Alt and Ref_Alt variants from the list and click OK.

• Change New dataset name: to Phenotype + Variant counts per gene - Both Alt_Alt and Ref_Alt variants.

• Click OK.

Run a numeric association test from the merged spreadsheet.

• Left-click once on the Case? column header to change the color of the column to magenta to signify its dependent status.

• Choose Numeric > Numeric Association Tests.

• Check Correlation/Trend test, leave the other options at their default values, and click Run.
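As a rough illustration of the kind of statistic a correlation/trend test produces, the sketch below correlates per-gene variant counts with a 0/1 case indicator and converts the squared correlation into a one-degree-of-freedom chi-squared p-value. Using (n - 1) r² as the statistic is an assumption made for this sketch (some formulations use n r²), and the function name is hypothetical; the exact formula used by SVS is not stated in this tutorial.

```python
from math import erfc, sqrt


def correlation_trend_test(x, y):
    """Return (chi2, p) for the correlation between numeric lists x and y.

    chi2 = (n - 1) * r^2 is referred to a chi-squared distribution with one
    degree of freedom, whose upper tail equals erfc(sqrt(chi2 / 2)).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 1.0  # no variation in one variable, so nothing to test
    chi2 = (n - 1) * sxy * sxy / (sxx * syy)
    return chi2, erfc(sqrt(chi2 / 2))


# Toy gene: variant counts for six samples and their case (1) / control (0) status.
variant_counts = [0, 0, 1, 1, 2, 2]
case_status = [0, 0, 0, 1, 1, 1]
chi2, p_value = correlation_trend_test(variant_counts, case_status)
```

Because the predictor is a count rather than a 0/1 flag, samples carrying several variants pull the correlation harder than samples carrying one.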

Add the results to the previous plot.

• Open Plot of Column -log10 P-Value from Fisher’s Exact Test Results

• Click on the -log10 P-value node in the Plot Tree.

• Choose the Add tab in the Controls dialog box and select the Add Plot Item(s) button. This will open the Data Source Library.

• Click the Project button in the upper left corner to select data from an existing spreadsheet.

• Choose Association Tests and check Corr/Trend -log10 P from the list. See Figure 2-5.

• Click Plot & Close.

Next, rename the graph item.

• Right-click on the graph item Corr/Trend -log10 P and select Edit Title....

• Enter in the new name: Corr/Trend -log10 P on Number of Variants.



Figure 2-5. Add Data Sources Window



• Then on the Style tab of the Controls dialog click the blue rectangle to change the color to green to differentiate the value points from those of the first plot.

The plot should now contain results from two analyses and look like Figure 2-6.

Figure 2-6. P-value comparison plot

Notice that, in general, a higher significance is obtained from testing for association on variant counts rather than from testing for association on the mere presence of variants.


3. Variant Frequency Binning and CMC Method

The CMC method [Li2008] bins variants according to parameters such as minor allele frequency, then collapses variants from each bin within defined regions such as genes. In your own study, you may want to use a different method to define your own variant bins. This tutorial demonstrates the process of creating bins based on minor allele frequency as determined by the 1000 Genomes Project Phase 1 data.

Create Frequency Bins

• Open Phenotypes + 1kG Exomes - Sheet 1 and choose DNA-Seq > Variant Binning by Frequency Source.

• Choose Select Track and highlight 1kG Phase3 - Variant Frequencies 5a with Genotype Counts, GHI. Click Select and Next>.

• Leave the single threshold value of 0.01.


Figure 3-1. Variant Binning by Frequency Track dialog

The resulting spreadsheet contains at least four columns:

• An integer Allele Frequencies Bin column indicating into which bin the marker fell.

• The Allele Frequencies value from the selected annotation track (in this case showing us the alternate allele frequency of the SNP in the One-thousand Genomes Project (1kG)).

• A list of alleles present.

• The observed reference/alternate alleles from the annotation track.

• Additionally, other fields from the track may be included. For this track, 32 other information fields for each marker are provided.


If the marker from the spreadsheet was not present in the selected probe track, the Allele Frequencies value is listed as missing and the marker is assigned to the 0 bin, as the variant must be so rare that it wasn’t even identified in the 1kG reference dataset used to build the probe track.
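The binning rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the SVS implementation: the function name `frequency_bin` is hypothetical, and placing a frequency exactly equal to a threshold in the higher bin is an assumption, since the tutorial does not say which side the boundary belongs to.

```python
from bisect import bisect_right


def frequency_bin(allele_freq, thresholds=(0.01,)):
    """Assign a variant to an integer frequency bin.

    allele_freq -- frequency from the annotation track, or None when the
                   variant is absent from the track
    thresholds  -- bin boundaries; the single default of 0.01 gives bin 0
                   (rare or unseen) and bin 1 (everything else)
    """
    if allele_freq is None:
        return 0  # not in the reference track: treated as very rare
    return bisect_right(sorted(thresholds), allele_freq)


bins = [frequency_bin(f) for f in (None, 0.005, 0.01, 0.2)]
```

Supplying more thresholds, for example `(0.01, 0.05)`, would yield three bins numbered 0 through 2.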

CMC Method

The CMC method uses the variant bins to perform a multivariate test collapsed over specified regions. For this test, those regions will be gene regions. This method uses the combined effect of multiple variants in a gene to determine the association with the phenotype. For this test, we will use the CMC Hotelling T^2 algorithm.

• In the Phenotypes + 1kG Exomes - Sheet 1 spreadsheet, left-click once on the Case? column name header to change the color of the column to magenta to signify its dependent status. Spreadsheet Phenotypes + 1kG Exomes - Sheet 3 will be created.
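The collapsing step that feeds the multivariate test can be pictured as follows: within one gene, each frequency bin is reduced to a single indicator per sample, and the resulting vector of indicators is what gets compared between cases and controls. The sketch below shows only that collapsing step, with hypothetical names and toy data. It does not implement the Hotelling T^2 statistic itself, and SVS may treat individual bins differently.

```python
def cmc_collapse(sample_calls, gene_variants, variant_bin):
    """Return one 0/1 indicator per frequency bin for a sample and a gene.

    sample_calls  -- dict: variant id -> alt-allele count for one sample
    gene_variants -- variant ids that fall inside the gene region
    variant_bin   -- dict: variant id -> integer frequency bin
    """
    bins = sorted({variant_bin[v] for v in gene_variants})
    indicators = {b: 0 for b in bins}
    for v in gene_variants:
        if sample_calls.get(v):  # sample carries at least one alt allele
            indicators[variant_bin[v]] = 1
    return [indicators[b] for b in bins]


gene_variants = ["v1", "v2", "v3"]
variant_bin = {"v1": 0, "v2": 0, "v3": 1}  # two rare variants, one common

rare_carrier = cmc_collapse({"v1": 0, "v2": 1, "v3": 0}, gene_variants, variant_bin)
common_carrier = cmc_collapse({"v1": 0, "v2": 0, "v3": 2}, gene_variants, variant_bin)
non_carrier = cmc_collapse({}, gene_variants, variant_bin)
```

Each sample therefore contributes a short vector such as `[1, 0]`. A multivariate two-sample test on these vectors is what lets several bins contribute to a single per-gene p-value.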

• Choose DNA-Seq > Collapsing Methods > CMC with Hotelling T Squared Test.

• The RefSeq Genes 105 Interim v1, NCBI track should be selected by default.

• Under Variant Bins for CMC click Select Sheet.

• Select the Variant Frequency Bins from 1kG Phase3 - Variant Frequencies 5a with Genotype Counts, GHI spreadsheet created above and click OK.

• Click Select Column for Variant Bin column....

• Choose the Allele Frequencies Bin column and click OK.

• The dialog should look like Figure 3-2. Click OK to run CMC with the Hotelling algorithm.

Figure 3-2. CMC with Hotelling T Squared Test dialog

The resulting spreadsheet, CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI, uses row numbers (which correspond to indexes into the marker map) for row labels and, for each gene region, reports the chromosome, start and end positions, gene name, transcript name(s), strand, the CMC p-value, the -log10 P-Value and various statistics and multiple-testing-corrected results.


Now, plot the -log10 p-values to investigate possible associations between rare variants and the phenotype.

• Right click on the -log10 P-Value column header (Column 8) and choose Plot Variable in GenomeBrowse.

• In the chromosome chooser dropdown near the upper-left corner of the plot window, enter 5 - 6. Then click the right-arrow icon just to the right of the genomic coordinate search bar near the top of the plot window.

To prevent confusion later, rename the graph item to reflect the CMC test performed.

• In the Plot Tree, right-click on the second -log10 P-Value graph item and select Edit Title.... Rename this graph item to CMC Hotelling T^2 -log10 P. See Figure 3-3.

Figure 3-3. CMC -log10 P-values from the Hotelling’s T Squared algorithm

CMC Method with Regression

The CMC method also offers the regression algorithm. The CMC regression algorithm not only offers correction for covariates (integer, real or binary), but also allows having a quantitative dependent variable.

Next, use CMC regression simply to cross-check the results. If a gene is significant under one CMC method, one would expect it to be significant using the other.

• From Phenotypes + 1kG Exomes - Sheet 3 choose DNA-Seq > Collapsing Methods > CMC with Regression.



• Select the Variant Frequency Bins from 1kG Phase3 - Variant Frequencies 5a with Genotype Counts, GHI spreadsheet created earlier and click OK.




• To save time, after Significance Testing, uncheck Also compute permuted p-values.

• The dialog should look like Figure 3-4. Click OK to run CMC with regression.

Figure 3-4. CMC with Regression dialog

The resulting spreadsheet, CMC Regression with RefSeq Genes 105 Interim v1, NCBI, also uses row numbers (indexes into the marker map) for row labels and also, for each gene region, reports the chromosome, start and end positions, gene name, transcript name(s), strand, CMC p-value, -log10 P-Value and various statistics and multiple-testing-corrected results. (For the test we just performed, these statistics include the beta and standard error for regressing on each bin, because we kept the default selection of Output bin betas and their standard errors in the dialog.)

Add these results to the previously created plot.

• From Plot of Column -log10 P-Value from CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI, click on -log10 P-Value in the Plot Tree. Choose the Add tab and select Add Plot Item(s) to open the Add Data Sources dialog.

• Click the Project button in the upper left corner to add data from an existing spreadsheet.



• Choose CMC Regression with RefSeq Genes 105 Interim v1, NCBI.

• Check -log10 P-Value from the list and click Plot & Close.

• Rename this graph item (the second -log10 P-Value) by right-clicking on it and selecting Edit Title.... Rename it to CMC w Regression -log10 P and change the color of the data points to green under the Style tab of the Controls dialog. The plot should look like Figure 3-5.

Figure 3-5. CMC P-values from Hotelling in blue and from regression in green

Notice that (according to this simulated data) the regression p-values (green) do not seem to be as significant as the Hotelling T^2 p-values (blue), even while both algorithms do agree on which genes are more significant than other genes.

CMC Method with Regression Using Transcripts

Normally, CMC (and also KBAC) will consolidate all transcripts that have the same gene name into one region, starting with the minimum start position of any transcript and ending with the maximum stop position of any transcript.

However, CMC and KBAC can also perform an individual test for every transcript. Additionally, CMC and KBAC allow you to enlarge the region on both ends by a specified number of base pairs to include further variants.
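The default consolidation rule and the optional region enlargement can be expressed compactly. The following is a minimal sketch with hypothetical names and coordinates. It assumes transcripts are given as (gene, chromosome, start, stop) tuples and that an enlarged region is clamped at position 0.

```python
def gene_regions(transcripts, padding=0):
    """Merge transcripts sharing a gene name into one region per gene.

    transcripts -- iterable of (gene, chrom, start, stop) tuples
    padding     -- number of base pairs added to both ends of each region
    Returns a dict: (gene, chrom) -> (region_start, region_stop).
    """
    regions = {}
    for gene, chrom, start, stop in transcripts:
        key = (gene, chrom)
        if key in regions:
            cur_start, cur_stop = regions[key]
            # Keep the minimum start and the maximum stop seen so far.
            regions[key] = (min(cur_start, start), max(cur_stop, stop))
        else:
            regions[key] = (start, stop)
    return {key: (max(0, s - padding), e + padding)
            for key, (s, e) in regions.items()}


transcripts = [
    ("GENE_A", "5", 1000, 5000),
    ("GENE_A", "5", 1500, 7000),
    ("GENE_B", "6", 200, 900),
]
per_gene = gene_regions(transcripts)
per_gene_padded = gene_regions(transcripts, padding=1000)
```

Testing per transcript instead would simply skip the merge and treat each tuple as its own region.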

Performing an individual test for every transcript will allow you to catalog associations by transcript and may result in different levels of significance between transcripts that cover the same gene. However, this option is not normally recommended because it will result in a more severe multiple testing penalty than will consolidating transcripts into genes.
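The size of that penalty is easy to see with a Bonferroni correction, used here only as a familiar example of a multiple-testing adjustment. The region counts below are made-up numbers for illustration, not figures from this dataset.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


n_genes = 1000        # hypothetical number of gene regions tested
n_transcripts = 2500  # hypothetical number of transcript regions tested

gene_threshold = bonferroni_threshold(0.05, n_genes)
transcript_threshold = bonferroni_threshold(0.05, n_transcripts)

# The same raw p-value survives correction per gene but not per transcript.
raw_p = 3e-5
significant_per_gene = bonferroni_adjust(raw_p, n_genes) < 0.05
significant_per_transcript = bonferroni_adjust(raw_p, n_transcripts) < 0.05
```

More regions mean a stricter per-test threshold, so a signal that passes at the gene level can fail at the transcript level even when the underlying data are identical.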

You may want to enlarge the gene or transcript region if you suspect the further variants that are included may affect gene regulation or splicing. However, enlarging the region may include variants from neighboring regions.


For comparison, run CMC using transcripts to define regions and also expand the region size. We will compare the results to those already obtained using genes as regions (without expansion).

• From Phenotypes + 1kG Exomes - Sheet 3, choose DNA-Seq > Collapsing Methods > CMC with Regression.


• To the right of Perform one test per..., select transcript.

• Check Include nearby markers (distance in bp): and leave the default of 1000.


• Select the Variant Frequency Bins from 1kG Phase3 - Variant Frequencies 5a with Genotype Counts, GHI spreadsheet created earlier and click OK.



• To save time, after Significance Testing, uncheck Also compute permuted p-values.

• The dialog should look like Figure 3-6. Click OK to run CMC with regression.

Figure 3-6. CMC with Regression on transcripts dialog



The resulting spreadsheet, also called CMC Regression with RefSeq Genes 105 Interim v1, NCBI, uses row numbers (indexes into the marker map) for row labels, and, for each transcript, reports the chromosome, the actual start and end positions used, the gene name, the transcript name, the strand, the p-value, the -log10 P-Value and the other remaining statistics for the transcript, including the beta and standard error for regressing on each bin.

Now add these results to the previously created plot.

• In the plot that has -log10 P-values for the other two CMC tests (Plot of Column -log10 P-Value from CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI), click on -log10 P-Value in the Plot Tree.

• Choose the Add tab and select Add Plot Item(s) to open the Add Data Sources dialog.

• Click the Project button in the upper left corner to add data from an existing spreadsheet.

• Choose the second CMC Regression with RefSeq Genes 105 Interim v1, NCBI.

• Check -log10 P-Value from the list and click Plot & Close.

• Rename this graph item (now the second -log10 P-Value) by right-clicking on it and selecting Edit Title.... Rename it CMC Reg on Transcripts -log10 P and change the color of the data points to orange under the Style tab of the Controls dialog.

Notice that the regression results using transcripts (orange) are very similar to the regression results using genes (green). To see this better, put the transcript results “on the bottom” in the graph.

• Press and hold the mouse button on the CMC Reg on Transcripts -log10 P node in the Graph Control Interface, then drag the node to just under the CMC Hotelling T^2 -log10 P node and drop it, so that the CMC Reg on Transcripts -log10 P node becomes the last node under -log10 P-Value.

The plot should look like Figure 3-7.

• To see the similarities between CMC Reg on Transcripts -log10 P and CMC w Regression -log10 P even more clearly, uncheck CMC Hotelling T^2 -log10 P and then alternately check and uncheck CMC Reg on Transcripts -log10 P.


Figure 3-7. CMC P-values: Hotelling, blue; regression, green; regression with transcripts, orange


4. KBAC Method

The KBAC method [Liu2010] collapses the variant data within a region by categorizing it into multi-marker genotypes over specified regions. No binning procedure is necessary to use the KBAC method.

The KBAC method uses the counts of these multi-marker genotypes to perform a special multivariate case/control test to determine their association with the (case/control) phenotype. This test gives multi-marker genotypes with higher sample risks higher weights so as to potentially separate causal from non-causal multi-marker genotypes.

This weighting procedure results in the KBAC method being much more suitable as a one-sided test than as a two-sided test.
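To illustrate how multi-marker genotypes are counted and weighted, here is a simplified sketch of a KBAC-style one-sided statistic. It is based on a reading of the general approach in [Liu2010] and is not the SVS implementation. The function name is hypothetical, the weight uses a hypergeometric cumulative probability (one of several possible kernels), and the p-value, which would come from permuting case/control labels, is not computed here.

```python
from collections import Counter
from math import comb


def kbac_one_sided(genotype_rows, is_case):
    """Simplified KBAC-style statistic for one region.

    genotype_rows -- one tuple per sample of alt-allele counts at each
                     variant in the region (the multi-marker genotype)
    is_case       -- one bool per sample; needs at least one case and one control
    """
    n = len(genotype_rows)
    n_cases = sum(is_case)
    n_controls = n - n_cases
    totals, case_counts = Counter(), Counter()
    for row, case in zip(genotype_rows, is_case):
        if any(row):  # the all-reference genotype is not scored
            totals[row] += 1
            if case:
                case_counts[row] += 1
    denom = comb(n, n_cases)
    stat = 0.0
    for pattern, n_i in totals.items():
        a_i = case_counts[pattern]  # carriers of this genotype among cases
        # Hypergeometric P(X <= a_i): close to 1 when the genotype is
        # concentrated in cases, i.e. when its sample risk a_i / n_i is high.
        weight = sum(comb(n_i, r) * comb(n - n_i, n_cases - r)
                     for r in range(a_i + 1)) / denom
        stat += (a_i / n_cases - (n_i - a_i) / n_controls) * weight
    return stat


rows = [(1, 0), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]
cases = [True, True, True, False, False, False]
stat = kbac_one_sided(rows, cases)
stat_flipped = kbac_one_sided(rows, [not c for c in cases])
```

In this toy example the statistic is positive for the original labels and also for the flipped labels, because genotypes enriched in controls are down-weighted rather than counted against the region. That asymmetry is the reason the method behaves as a one-sided test.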

KBAC with Permutation Testing

• From Phenotypes + 1kG Exomes - Sheet 3 choose DNA-Seq > Collapsing Methods > KBAC with Permutation Testing.

• To save time, check Adaptive permutation testing (threshold alpha): and leave the default .01 value.

• Check Two-sided statistics under Outputs, so that One-sided statistics (recommended) and Two-sided statistics are both checked.

• Under Regions, the RefSeq Genes 105 Interim v1, NCBI track should already be selected by default.

• The dialog should look like Figure 4-1. Click OK to run KBAC with permutation testing.

As you may have guessed by looking at the secondary (lower) progress bar, there is a substantial amount of time saved by using the adaptive permutation procedure. With this procedure, a gene that has a more highly significant p-value is tested with the maximum number of permutations, while genes in a less significant p-value range are tested with only enough permutations to estimate the p-value well enough to infer that it is not highly significant.

Only a few of the gene regions in this example require anywhere near all of the 1000 (maximum) permutations that were requested in the dialog, and a majority of regions require fewer than 100 or even 50 permutations.
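The idea behind adaptive permutation testing can be sketched as follows. The stopping rule used here (stop once a rough lower confidence bound on the running p-value estimate exceeds the threshold alpha) is an assumption chosen for illustration. The tutorial does not describe the exact rule SVS applies, and all names and the batch size are hypothetical.

```python
import random
from math import sqrt


def adaptive_permutation_pvalue(observed, stat_fn, labels, alpha=0.01,
                                max_perms=1000, batch=10, z=1.96, seed=0):
    """Permutation p-value with early stopping for clearly non-significant tests.

    observed -- statistic computed on the real labels
    stat_fn  -- function mapping a list of labels to a statistic
    Returns (p_value, permutations_used).
    """
    rng = random.Random(seed)
    labels = list(labels)
    exceed = done = 0
    while done < max_perms:
        for _ in range(min(batch, max_perms - done)):
            rng.shuffle(labels)
            if stat_fn(labels) >= observed:
                exceed += 1
            done += 1
        p_hat = exceed / done
        # Normal-approximation lower bound on the p-value estimate.
        lower = p_hat - z * sqrt(p_hat * (1 - p_hat) / done)
        if lower > alpha:
            break  # confidently above alpha: no need for more permutations
    return (exceed + 1) / (done + 1), done


values = list(range(20))
labels = [1] * 5 + [0] * 15


def case_sum(perm_labels):
    # Toy statistic: sum of the values belonging to samples labelled as cases.
    return sum(v for v, lab in zip(values, perm_labels) if lab)


# An unreachable observed statistic is never matched, so every permutation runs.
p_extreme, used_extreme = adaptive_permutation_pvalue(1000, case_sum, labels)

# A trivially low observed statistic is matched every time, so testing stops early.
p_null, used_null = adaptive_permutation_pvalue(0, case_sum, labels)
```

The two toy calls show both ends of the behaviour described above: the strongly significant test uses the full permutation budget, while the clearly non-significant one stops after the first batch.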

Note: If you have a dataset with more than 400 cases and 400 controls, then using the Marginal binomial kernel type or the Asymptotic normal kernel type, or using the KBAC Monte-Carlo approximation as the permutation mode, can save some computation time, up to 50% when Asymptotic normal and KBAC Monte-Carlo approximation are combined. However, the data we are using only has 47 cases, which is not enough to warrant using these approximations.

The resulting spreadsheet, KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI, uses row numbers (indexes into the marker map) for row labels, and, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the p-value, the -log10 P-Value and various statistics and multiple-testing-corrected results for both the one-sided KBAC and two-sided KBAC tests.

Plot the -log10 p-values to investigate possible associations between rare variants and the phenotype.



Figure 4-1. KBAC with Permutation Testing dialog



• In this spreadsheet (KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI), right click on the -log10 P-Value (One-Sided) column header (Column 8) and choose Plot Variable in GenomeBrowse.

• In the chromosome chooser dropdown near the upper-left corner of the plot window of the resulting plot, enter 5 - 6. Then click the right-arrow icon just to the right of the genomic coordinate search bar near the top of the plot window.

• In the Plot Tree, click on the first -log10 P-Value (One-Sided) node. Choose the Add tab under Controls and select Add Plot Item(s) to open the Add Data Sources dialog.

• Click the Project button and select the KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI spreadsheet.

• In the list of plot items, go to -log10 P-Value (Two-Sided) and check it. Click Plot & Close.

• Select the -log10 P-Value (Two-Sided) plot item and go to the Style tab of the Controls dialog. Change the color to green by clicking on the blue box and selecting green from the color options.

The plot should look something like Figure 4-2.

Figure 4-2. -log10 P-values from KBAC with Permutation Testing: one-sided, blue; two-sided, green

Note: Your results may differ slightly since these results are dependent upon permutation testing.

Note that for the more significant values in this test, one-sided and two-sided tests come up with the same answers. To see this better, drag the -log10 P-Value (Two-Sided) node in the Plot Tree and drop it just below the -log10 P-Value (One-Sided) node. To see this better yet, alternately check and uncheck the -log10 P-Value (Two-Sided) node (now that you have this node on the bottom).


These same answers occur because the one-sided KBAC test is weighted toward results where the multi-marker genotypes have higher sample risks and away from results where the multi-marker genotypes have lower sample risk. This pushes the distribution almost completely into the area of positive test values, so that the two-sided KBAC test, which is simply the square of the one-sided KBAC test, will come up with the same p-values for all tests except those having the least significance.

This is why checking One-sided statistics (recommended) is normally recommended. The intrinsic one-sidedness of the KBAC test is discussed further after the next section.

KBAC with Regression

KBAC also offers a regression-based algorithm to allow correction for covariates.

Unlike the situation with the CMC method and CMC with Hotelling vs. CMC with regression, the basic statistic underlying KBAC with regression is completely equivalent to the basic statistic underlying KBAC with permutation testing when there is no correction for covariates. The only p-value differences that will arise when there are no covariates come from the fact of using permutation testing to obtain p-values.

• From Phenotypes + 1kG Exomes - Sheet 3 choose DNA-Seq > Collapsing Methods > KBAC with Regression.

• Check Adaptive permutation testing (threshold alpha): and leave the default .01 value.

• Check Two-sided statistics under Outputs, so that One-sided statistics (recommended) and Two-sided statistics are both checked.


• The dialog should look like Figure 4-3. Click OK to run KBAC with regression.

This may take somewhat more time because the regression mechanism that is in place for correcting for covariates is being used, even though we are not actually correcting for covariates.

The resulting spreadsheet, KBAC Regression with RefSeq Genes 105 Interim v1, NCBI, also uses row numbers (indexes into the marker map) for row labels and also, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the p-value, the -log10 P-Value and various statistics and multiple-testing-corrected results for both the one-sided KBAC and two-sided KBAC tests.

Now add these results to the KBAC with Permutation Testing plot for comparison.

• Go to this plot (called Plot of Column -log10 P-Value (One-Sided) from KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI).

• Right-click on the second -log10 P-Value (One-Sided) plot node and click Edit Title.... Rename this node to -log10 P-Value (OS Perm Testing).

• To declutter the plot, uncheck the -log10 P-Value (Two-Sided) node.

• Click on the (remaining) -log10 P-Value (One-Sided) node in the Plot Tree. Choose the Add tab and select the Add Plot Item(s) button.

• Click the Project button and select the KBAC Regression with RefSeq Genes 105 Interim v1, NCBI spreadsheet. Check -log10 P-Value (One-Sided) on the list. Click Plot & Close.

• Right-click on the new (second) -log10 P-Value (One-Sided) item and rename it to -log10 P-Value (OS Regression). Then change the color to orange on the Style tab of the Controls dialog.


Notice that the results are basically the same, with minor fluctuations due to permutation testing. In a few cases, the results are exactly the same.



Figure 4-3. KBAC with Regression dialog



Figure 4-4. KBAC with Permutation Testing in blue, KBAC with Regression in orange



Note: If you were to add the -log10 P-Value (Two-Sided) from the KBAC with Regression spreadsheet, you would find its results agreeing with the regression one-sided results for the more significant p-values, just as they did between one-sided and two-sided KBAC with Permutation Testing.

Now, cross-check the KBAC results with the CMC results, specifically with the CMC with Hotelling’s T² test.

• Click on the (remaining) -log10 P-Value (One-Sided) node in the Plot Tree. Choose the Add tab and select Add Plot Item(s).

• Click the Project button and select the CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI spreadsheet, then check -log10 P-Value from the list and click Plot & Close.

• Right-click on the new -log10 P-Value item, rename it to -log10 P-Value (CMC Hotelling), and use the Style tab to change this plot item’s color to black.

The plot should look similar to Figure 4-5.

Figure 4-5. KBAC (one-sided) with Permutation Testing, blue; KBAC (one-sided) with Regression, orange; CMC with Hotelling’s, black

Note that there are two positions where the CMC with Hotelling’s method has the most significant results, one in Chromosome 5 and one in Chromosome 6. Let’s zoom in and examine the significant result in Chromosome 5.

• Zoom in on this gene by selecting the zoom tool on the zoom toolbar, then pressing your left mouse button to the left of this gene, dragging the cursor across to the right of this gene while holding the button down, and finally releasing the mouse button to complete the zoom. Repeat this process until the dot representing this gene has become a horizontal oval taking up almost the whole left half of the plot area. (Left half, since we will also want to view the next two results that will be to the right.)

Note: If you need to re-do any zoom, use the zoom toolbar to go back to any previous zoom level.

When finished, the plot window should look something like Figure 4-6.

Figure 4-6. Zoom showing the KIF3A and SEPT8 genes

As you see, each p-value in the plot corresponds to an entire gene: the oval representing any p-value and gene will stretch out over the whole genomic extent of that gene.

We will be examining the results for these genes, which are KIF3A and SEPT8.

First, note that for KIF3A, the -log10 P for one-sided KBAC with Permutation Testing and for one-sided KBAC with Regression are both about 3.0, which corresponds to a p-value of .001 (1 divided by 1000).

• You may need to uncheck -log10 P-Value (OS Perm Testing) to see the result for -log10 P-Value (OS Regression) more clearly (or at all).

• Check -log10 P-Value (Two-Sided) to reveal another KBAC result for gene KIF3A for which -log10 P is approximately 3.0.

Of course, p = .001 is the best significance result that permutation testing with 1000 iterations can achieve, and the fact that this is achieved for various KBAC tests on gene KIF3A is in agreement with the very significant p-value assigned to this gene by the CMC with Hotelling test.
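The ceiling on permutation-based significance is simple arithmetic, sketched below. The function names are illustrative only, and the smallest resolvable p-value is taken as 1 divided by the number of permutations, as in the text.

```python
import math

def min_permutation_p(n_permutations):
    """Smallest p-value a run of n_permutations can resolve, taken here as 1 / N."""
    return 1.0 / n_permutations

def neg_log10(p):
    """Convert a p-value to the -log10 scale used in the plots."""
    return -math.log10(p)

for n in (1000, 10000):
    p = min_permutation_p(n)
    print(n, p, round(neg_log10(p), 3))
```

A gene whose plotted -log10 P sits at this ceiling may in fact be more significant than the plot can show; only more permutations can tell.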

Now, let us examine the results for gene SEPT8. We see that both the one-sided KBAC permutation testing result and the one-sided KBAC regression result show very little significance, even while the CMC result and the two-sided KBAC result do show some significance. Question: Why are the one-sided results showing so much less significance for this gene?

Remember that the KBAC method is intrinsically a one-sided test, meaning that it is meant to discover if the gene variants tested confer higher sample risk. Therefore, a -log10 P-Value close to zero for the one-sided test we have been doing may mean that there could be an association in the direction of the gene variant conferring lower sample risk rather than higher sample risk.

We will now check to see if there is a significant association with any gene variants in the sense of conferring lower sample risk rather than higher sample risk.

KBAC for Lower Sample Risk

To find out, you need to invert the case/control status and test again.

• Open Phenotypes + 1kG Exomes - Sheet 3 and choose Edit > Edit This Spreadsheet.

• Right-click on the header of the Case? column.

• Choose Transform > Invert Boolean Values.

• Click OK to overwrite the column.

• Choose File > Save.

• After Dataset Name: enter Inverted Phenotypes + 1kG Exomes.

• Click OK.
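Why inverting the phenotype turns a test for higher risk into a test for lower risk can be seen in a small sketch. The sample data and the carrier-rate difference used here are hypothetical stand-ins, not the KBAC statistic itself.

```python
# Hypothetical samples: True marks a case, False a control.
case_status = [True, True, False, False, False, True]
# Whether each sample carries a rare variant in the gene of interest.
carrier = [True, True, False, False, True, False]

def carrier_rate_difference(status, carriers):
    """Carrier rate among samples flagged True minus the rate among those flagged False."""
    flagged = [c for s, c in zip(status, carriers) if s]
    unflagged = [c for s, c in zip(status, carriers) if not s]
    return sum(flagged) / len(flagged) - sum(unflagged) / len(unflagged)

# Analogue of Transform > Invert Boolean Values on the Case? column.
inverted_status = [not s for s in case_status]

original_diff = carrier_rate_difference(case_status, carrier)
inverted_diff = carrier_rate_difference(inverted_status, carrier)
print(original_diff, inverted_diff)  # equal magnitude, opposite sign
```

A one-sided test that only rewards a positive difference will therefore respond to the inverted phenotype exactly where the original analysis saw a protective pattern.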

Now, re-perform KBAC with Regression.

• From the new spreadsheet, left-click once on the not(Case?) column to signify its dependent status.

• Choose DNA-Seq > Collapsing Methods > KBAC with Regression.

• This time, in Permutations under Parameters for KBAC (Liu & Leal, 2010) Logistic Regression, change the Number of permutations to use: to 10000.

• Check Adaptive permutation testing (threshold alpha):, and change the (threshold alpha) in the blank on the right to .001.


• Leave the rest of the defaults and click OK.

This may take almost twice as long, but it will be sensitive to p-values down to .0001 rather than just down to .001. To compensate, the alpha threshold was adjusted down, so as to spend less time on any but the most significant gene regions.
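The trade-off between permutation count and threshold alpha can be sketched as follows. This is only an illustration of the general idea of adaptive permutation testing: the stopping rule (abandon a region once its running p-value estimate exceeds ten times alpha at a checkpoint), the checkpoint interval, and the standard-normal stand-in for the permuted statistic are all assumptions, not the rule SVS implements.

```python
import random

def adaptive_permutation_p(observed, draw_null, max_perms=10000, alpha=0.001,
                           check_every=100):
    """Permutation p-value that stops early for clearly non-significant regions."""
    hits = 0
    for done in range(1, max_perms + 1):
        if draw_null() >= observed:
            hits += 1
        # Assumed stopping rule: give up once the running estimate is far above alpha.
        if done % check_every == 0 and hits / done > 10 * alpha:
            return hits / done, done
    # Report at least 1 / max_perms, the smallest p-value the run can resolve.
    return max(hits, 1) / max_perms, max_perms

rng = random.Random(0)

def null_draw():
    """Stand-in for one permuted test statistic (hypothetical standard-normal null)."""
    return rng.gauss(0.0, 1.0)

p_weak, used_weak = adaptive_permutation_p(0.5, null_draw)
p_strong, used_strong = adaptive_permutation_p(6.0, null_draw)
print(p_weak, used_weak)      # stops at the first checkpoint
print(p_strong, used_strong)  # uses the full permutation budget
```

Under a rule of this kind, a lower alpha lets more regions exit early, which is how the run time stays manageable despite the tenfold increase in the permutation budget.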

Now add these results to the previously created plot to compare.

• Using the Plot of Column -log10 P-Value (One-Sided) from KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI plot, click on the -log10 P-Value (One-Sided) node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.

• Choose the latter occurrence of KBAC Regression with RefSeq Genes 105 Interim v1, NCBI (to get the result spreadsheet just created) and check -log10 P-Value (One-Sided) from the list. Then click Plot & Close.

• Right-click on the new (second) -log10 P-Value (One-Sided) item and rename it to -log10 P-Value (KBAC Lower Risk).

• Change the color of the data points to red on the Style tab of the Controls dialog.




Figure 4-7. KIF3A and SEPT8 genes showing KBAC (two-sided), green; KBAC (one-sided) with Permutation Testing, blue; KBAC (one-sided) with Regression, orange; CMC with Hotelling’s, black; KBAC testing for lower risk, red

While we have seen that the KIF3A gene is definitely significant in the direction of higher risk, we see that it is found to be not significant at all in the direction of lower risk, which is self-consistent.

Meanwhile, let’s look over at the SEPT8 gene. We see that according to one-sided KBAC permutation testing, it has no significance at all in the direction of higher risk, even while it has some significance according to Hotelling CMC. Let’s declutter this portion of the plot.

• Uncheck the -log10 P-Value (CMC Hotelling), -log10 P-Value (OS Perm Testing), and -log10 P-Value (Two-Sided) plot nodes.

Now, it’s abundantly clear that, by contrast to gene KIF3A, gene SEPT8 has a certain level of significance in the direction of lower risk and basically no significance in the direction of higher risk, which is also self-consistent (see Figure 4-8). Both of these results are also consistent with a CMC Hotelling test picking up significance for both genes. (See Figures 4-7 and 4-8.)

• Enter 5 - 6 in the chromosome chooser dropdown and click the right-arrow icon to get the bigger picture.

You will then see that while, for our test data, not that many genes are very significant for lower risk, it is still true that those genes that are the most significant for lower risk are not at all the same as those genes that are most significant for higher risk.

• Use the zoom toolbar to return to the view of Figure 4-8.



Figure 4-8. KBAC test for higher risk, orange; KBAC test for lower risk, red


5. SKAT-O and Generalized SKAT

Generalized Sequence Kernel Association Test (Generalized SKAT) [Lee2012] and [Liu2009] is a test which combines the test statistics of the following, according to a user-specified ratio 𝜌:

• Burden testing, which collapses the variant data within a region by summing the minor allele counts for each marker in the region, and testing this against the phenotype. By contrast to Count Number of Variants (Per Gene), the counts are usually weighted by a function of each marker’s minor allele frequency (MAF), so as to establish a contrast between rare and common variants.

• Sequence Kernel Association Test (SKAT), a test which collapses the variant data within a region by summing the squares of score statistics for testing individual markers. Just as for Burden testing, weights based on each marker’s MAF are usually used to establish a contrast between rare and common variants. In this case, if weighting is used, the individual squares of score statistics are weighted before they are summed.
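As an illustration of MAF-based weighting, the sketch below uses the Beta(1, 25) density, a common choice in the SKAT literature. Whether this matches the weighting configured in SVS is an assumption, and the MAF values and genotype counts are hypothetical.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density at maf: 25 * (1 - maf) ** 24, which up-weights rare variants."""
    return 25.0 * (1.0 - maf) ** 24

def weighted_burden(genotypes, mafs):
    """Weighted sum of minor-allele counts (0, 1 or 2) for one sample across a region."""
    return sum(beta_1_25_weight(m) * g for g, m in zip(genotypes, mafs))

mafs = [0.001, 0.01, 0.20]   # hypothetical rare, low-frequency and common markers
weights = [beta_1_25_weight(m) for m in mafs]
sample = [1, 0, 2]           # hypothetical minor-allele counts for one sample
score = weighted_burden(sample, mafs)

print([round(w, 3) for w in weights])
print(round(score, 3))
```

With weights of this shape, a single copy of the rare allele contributes far more to the score than two copies of the common allele, which is the contrast between rare and common variants described above.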

A ratio 𝜌 of 1 corresponds to a pure Burden test, and a ratio 𝜌 of 0 corresponds to a pure (original) SKAT test.
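The way 𝜌 blends the two statistics can be sketched with the combination Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden from the Generalized SKAT literature. The data, the unit weights, and the function names below are hypothetical; the sketch computes the statistic only and does not attempt the p-value calculation.

```python
def score_statistics(genotype_columns, phenotype):
    """Per-marker score S_j = sum_i g_ij * (y_i - mean(y))."""
    mean_y = sum(phenotype) / len(phenotype)
    return [sum(g * (y - mean_y) for g, y in zip(col, phenotype))
            for col in genotype_columns]

def generalized_skat(scores, weights, rho):
    """Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden."""
    # SKAT part: sum of squared weighted scores.
    q_skat = sum((w * s) ** 2 for w, s in zip(weights, scores))
    # Burden part: square of the summed weighted scores.
    q_burden = sum(w * s for w, s in zip(weights, scores)) ** 2
    return (1.0 - rho) * q_skat + rho * q_burden

# Hypothetical data: 6 samples (3 cases, then 3 controls), 2 markers, unit weights.
phenotype = [1, 1, 1, 0, 0, 0]
markers = [[1, 1, 0, 0, 0, 0],
           [1, 0, 1, 0, 0, 0]]
scores = score_statistics(markers, phenotype)
q = {rho: generalized_skat(scores, [1.0, 1.0], rho) for rho in (0.0, 0.25, 1.0)}
print(scores, q)
```

Setting rho to 0 or 1 recovers the original SKAT statistic or the Burden statistic respectively, and any value in between interpolates linearly between them.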

Optimized SKAT (SKAT-O) [LeeWuLin2012] is a procedure which optimizes Generalized SKAT over a grid of N values of 𝜌 between zero and 1, inclusive, in such a way as to count as only one test for multiple testing purposes instead of as N tests. (In Golden Helix SVS, seven grid points are used (N = 7), so this avoids having to multiply the number of tests by 7 to get a proper multiple testing correction.)

Burden Testing using the SKAT-O Feature

First, we will run a Burden test (Generalized SKAT with 𝜌 = 1).

• From Phenotypes + 1kG Exomes - Sheet 3 choose DNA-Seq > Collapsing Methods > SKAT-O.

• Check Generalized SKAT with rho = and select a rho value of 1.

• Uncheck SKAT-O with standard grid of rho values.


• Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically).

• The dialog should now look like Figure 5-1. Click OK.

• Rename the resulting spreadsheet to Gen. SKAT test with Rho = 1.

This result spreadsheet uses row numbers (indexes into the marker map) for row labels, and, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the SKAT p-value, the -log10 P-Value for SKAT and various other statistics and multiple-testing-corrected results.

Now, plot the -log10 p-values to investigate what associations there are between rare variants and the phenotype according to Burden testing.



Figure 5-1. SKAT-O dialog for Burden Testing



• In the resulting spreadsheet (renamed above), right-click on the -log10 SKAT P-Value column header (Column 8) and choose Plot Variable in GenomeBrowse.

• In the chromosome chooser dropdown near the upper-left corner of the plot window of the resulting plot, enter 5 - 6. Then click the right-arrow icon just to the right of the genomic coordinate search bar near the top of the plot window.

• In the Plot Tree, right-click on the second -log10 SKAT P-Value node and rename it to -log10 P-Value (Burden).

• In the Plot Tree, click on the (remaining) -log10 SKAT P-Value node. Choose the Add tab under Controls and select Add Plot Item(s) to open the Add Data Sources dialog.

• Click the Project button and select the CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI spreadsheet.

• In the list of plot items, go to -log10 P-Value and check it. Click Plot & Close.

• Right-click on the -log10 P-Value plot item and rename it to -log10 P-Value (CMC Hotelling). Then change the color to orange on the Style tab of the Controls dialog.


Figure 5-2. -log10 P-values from Burden Testing, blue; -log10 P-Values from CMC Hotelling, orange

We see that Burden testing is finding the same associations as does the CMC Hotelling test. Genes that are significant for Burden testing are also significant for the CMC Hotelling test.



SKAT and Generalized SKAT

Next, we will run an (original) SKAT test (Generalized SKAT with 𝜌 = 0), using the same DNA-Seq > Collapsing Methods > SKAT-O dialog as for the Burden test.


• Check Generalized SKAT with rho = and select a rho value of 0.


• Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically). Then click OK.

• Rename the resulting spreadsheet to Gen. SKAT test with Rho = 0.

Now, we will run a Generalized SKAT test with a 𝜌 value somewhere in the middle between zero and one. Based on the (somewhat logarithmic) grid of values that Golden Helix SVS SKAT-O uses, the values of which are 0, 0.01, 0.04, 0.09, 0.25, 0.5, and 1, we will choose 𝜌 = .25.
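The shape of this grid is easier to see numerically. The short sketch below only restates the seven values quoted above; the observation that the first five are squares of 0, 0.1, 0.2, 0.3 and 0.5 is arithmetic, not a statement about how SVS derives them.

```python
import math

# The grid quoted above.
rho_grid = [0, 0.01, 0.04, 0.09, 0.25, 0.5, 1]
# The first five grid points are the squares of these values.
roots = [0, 0.1, 0.2, 0.3, 0.5]

squares_match = all(math.isclose(r * r, rho) for r, rho in zip(roots, rho_grid))

# Gaps between neighbouring grid points widen towards 1.
gaps = [round(b - a, 2) for a, b in zip(rho_grid, rho_grid[1:])]

print(squares_match)
print(gaps)
```

The grid is therefore dense near 𝜌 = 0 (the original SKAT end) and sparse near 𝜌 = 1 (the Burden end), with .25 sitting between the two regimes.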


• Check Generalized SKAT with rho = and select a rho value of .25.


• Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically). Then click OK.

• Rename the resulting spreadsheet to Gen. SKAT test with Rho = .25.

Both of these result spreadsheets, now called Gen. SKAT test with Rho = 0 and Gen. SKAT test with Rho = .25, have the same row labels, column headers, and reported fields as the Burden test result spreadsheet, now called Gen. SKAT test with Rho = 1.

Let’s compare these results with Burden tests and CMC Hotelling tests. First, add the Generalized SKAT results for 𝜌 = .25.

• Using the Plot of Column -log10 SKAT P-Value from Gen. SKAT test with Rho = 1 plot, click on the -log10 SKAT P-Value node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.

• Choose Gen. SKAT test with Rho = .25 and check -log10 SKAT P-Value from the list. Then click Plot & Close.

• Right-click on the new -log10 SKAT P-Value item and rename it to -log10 P-Value (Rho = .25).

• Change the color of the data points to turquoise on the Style tab of the Controls dialog.

Now, repeat this procedure for the (original) SKAT results (𝜌 = 0).

• Click again on the -log10 SKAT P-Value node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.

• Choose Gen. SKAT test with Rho = 0 and check -log10 SKAT P-Value from the list. Then click Plot & Close.

• Right-click on the new -log10 SKAT P-Value item and rename it to -log10 P-Value (Original SKAT).

• Change the color of the data points to green on the Style tab of the Controls dialog.


We see that, in general, a significant result for any of the tests in this plot implies a significant result for all the other tests in this plot. This is as would be expected.

Let us look specifically at the three Generalized SKAT tests that use different 𝜌 values.



Figure 5-3. -log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from original SKAT, green; -log10 P-Values from CMC Hotelling, orange



Because the amount of significance should partly depend on whether the direction of effect is different for different markers within the same gene region, we expect to see different tests being the most significant for different genes.

• Declutter the plot by unchecking -log10 P-Value (CMC Hotelling) in the Plot Tree. The plot should look similar to Figure 5-4.

Figure 5-4. -log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from original SKAT, green

Here, we can see this in action. The Burden test (dark blue) has the best result for some genes, the original SKAT test (green) has the best result for other genes, and, for a few genes, the best result is from the Generalized SKAT test using 𝜌 = .25.

The SKAT-O Test

Finally, we will run the SKAT-O test. To save time, we will only run this for Chromosome 6.

• Open the Phenotypes + 1kG Exomes - Sheet 3 spreadsheet and choose Select > Activate by Chromosomes. Uncheck 5 and leave 6 checked. Click OK. This will create spreadsheet Phenotypes + 1kG Exomes - Sheet 4.

• Choose DNA-Seq > Collapsing Methods > SKAT-O.

• (Leave Generalized SKAT with rho = unchecked and leave SKAT-O with standard grid of rho values checked.)

• Under Algorithm for Estimating Distributions, check Somewhat-small-sample corrected (kurtosis estimated analytically).



• The dialog should now look like Figure 5-5. Click OK.

Figure 5-5. Dialog for the SKAT-O Test Itself

• (Leave the name of the resulting spreadsheet as is.)

The resulting spreadsheet, SKAT-O Testing with RefSeq Genes 105 Interim v1, NCBI (Somewhat-small-sample corrected), uses row numbers (indexes into the marker map) for row labels, and, for each gene region, reports the chromosome, the start and end positions, the gene name, the transcript name(s), the strand, the SKAT-O p-value, the -log10 SKAT-O P-Value and various other statistics and multiple-testing-corrected results.

Now, let us compare SKAT-O with the other results.

• Using the Plot of Column -log10 SKAT P-Value from Gen. SKAT test with Rho = 1 plot, click on the -log10 SKAT P-Value node in the Plot Tree. On the Add tab of the Controls dialog, choose the Add Plot Item(s) button. Click the Project button on the left side of the Add Data Sources window.

• Choose SKAT-O Testing with RefSeq Genes 105 Interim v1, NCBI (Somewhat-small-sample corrected) and check -log10 SKAT-O P-Value from the list. Then click Plot & Close.

• Change the color of the data points to red on the Style tab of the Controls dialog.

The SKAT-O results (which are only for Chromosome 6) will show in the right half of the plot.



• Go to the chromosome chooser dropdown near the upper-left corner of the plot window and click its down-arrow. Choose 6. The plot should look similar to Figure 5-6.

Figure 5-6. -log10 P-values from Burden Testing, blue; -log10 P-Values from Generalized SKAT with Rho = .25, turquoise; -log10 P-Values from Original SKAT, green; -log10 P-Values from SKAT-O, red

We see that the SKAT-O (red) p-values don’t always reach the significance of the best Generalized SKAT p-values, but they do almost always track the best Generalized SKAT p-value, whichever one that is, with each SKAT-O result having a -log10 P that is about 0.6 less than that of the best Generalized SKAT -log10 P for that gene.

Being consistent with the best Generalized SKAT p-value is what was intended for the SKAT-O test.
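To put the offset of about 0.6 in context, it can be compared with what a plain Bonferroni correction over the seven grid points would cost on the -log10 scale. In the sketch below the best -log10 P of 5.0 is a hypothetical value, and the 0.6 offset is the approximate figure read from the plot, not a computed quantity.

```python
import math

n_grid = 7                                # grid points used by SKAT-O in the text
bonferroni_penalty = math.log10(n_grid)   # -log10 units lost by multiplying p by 7
skat_o_offset = 0.6                       # approximate offset observed in the plot

def bonferroni_over_grid(best_neg_log10_p, n=n_grid):
    """-log10 of min(1, n * p) for the best single-rho p-value."""
    p = 10.0 ** (-best_neg_log10_p)
    return -math.log10(min(1.0, n * p))

best = 5.0                                    # hypothetical best Generalized SKAT -log10 P
via_bonferroni = bonferroni_over_grid(best)   # best minus log10(7)
via_skat_o = best - skat_o_offset             # best minus the observed offset

print(round(bonferroni_penalty, 3), round(via_bonferroni, 3), round(via_skat_o, 3))
```

On these numbers, the SKAT-O result gives up less significance than picking the best 𝜌 and then correcting for seven tests would, which is consistent with the purpose of the procedure.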


6. Compare Results with GWAS Approach

To demonstrate the utility of the collapsing tests, we can compare the results to a GWAS test that assesses association with each individual SNP.

Attempt an Association Test on Individual Rare Variants

• Go back to the Phenotypes + 1kG Exomes - Sheet 3 spreadsheet. Choose Genotype > Genotype Association Tests.

• Choose an Additive model, and choose Correlation/Trend test as the Test Statistic or Method.

• Select Output -log10(P) under Additional Outputs.

• The dialog should look like Figure 6-1. Click Run.

An association results spreadsheet is created. This spreadsheet contains marker-mapped results which include the marker name as a row label, Corr/Trend P, Corr/Trend -log10 P, Corr/Trend R, and other statistics and multiple-testing-corrected results.

• From the resulting Association Tests (Additive Model) spreadsheet, right-click on the Corr/Trend -log10 P column header and choose Plot Variable in GenomeBrowse.

• In the chromosome chooser dropdown near the upper-left corner of the plot window, enter 5 - 6. Then click the right-arrow icon just to the right of the genomic coordinate search bar near the top of the plot window. The result should look like Figure 6-2.

Notice how many of the results collect into just a few discrete values. Let us take a look at one of these discrete values.

• On the spreadsheet Association Tests (Additive Model), right-click on the Corr/Trend -log10 P column headerand choose Sort Descending.

Note that 891 SNPs all have exactly the same -log10 P-Value of 3.698670, the first one listed being 5:143219-SNV.

• On the spreadsheet Phenotypes + 1kG Exomes - Sheet 3, go to the column header of 5:143219-SNV (which is just at Column 13), then right-click and choose Sort Descending.

You will notice that only one sample carries a variant genotype (A_G instead of A_A). Since this sample is a case, a relatively significant p-value results. Similarly, the other discrete values shared by many individual results are also based on only one or a very few variants.
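Why so many SNPs share one p-value can be shown with a small sketch. It assumes the trend test statistic has the form N times r squared, referred to a chi-square distribution with one degree of freedom; that form, the cohort of 10 cases and 90 controls, and the genotype vectors are all assumptions for illustration, so the numbers will not reproduce 3.698670.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def trend_test_neg_log10_p(genotype, phenotype):
    """-log10 p for N * r^2 referred to chi-square with 1 df (assumed form)."""
    r = pearson_r(genotype, phenotype)
    chi2 = len(genotype) * r * r
    p = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square with 1 df
    return -math.log10(p)

# Hypothetical cohort: 10 cases followed by 90 controls.
phenotype = [1] * 10 + [0] * 90
snp_a = [1] + [0] * 99            # single heterozygote, in the first case
snp_b = [0, 0, 0, 1] + [0] * 96   # single heterozygote, in a different case

score_a = trend_test_neg_log10_p(snp_a, phenotype)
score_b = trend_test_neg_log10_p(snp_b, phenotype)
print(round(score_a, 4), round(score_b, 4))
```

Every SNP whose only variant genotype falls in a case produces the same genotype-phenotype pattern up to a reordering of samples, so all such SNPs land on the same value.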

Of course, the phrase “only one or a very few variants” fits the definition of “rare” variants.

Compare with Rare Variant Methods

So, let’s compare the association test results with those from several methods that were actually designed for rare variant analysis.



Figure 6-1. Genotype Association Tests dialog



Figure 6-2. -log10 P-values from GWAS for Chromosomes 5 and 6



• Go to the Plot of Column Corr/Trend -log10 P from Association Tests (Additive Model) plot and click on the first Corr/Trend -log10 P in the Plot Tree. Choose the Add tab and select Add Plot Item(s). Click the Project button on the left side of the Add Data Sources window.

• Choose the CMC with Hotelling T Squared Test with RefSeq Genes 105 Interim v1, NCBI spreadsheet and check -log10 P-Value from the list. Click Plot & Close.

• Rename the resulting plot node to -log10 P-Value (CMC Hotelling) and use the Style tab to change the color to orange.

Follow the same process for adding one-sided KBAC Regression results to the plot.

• Click on the first Corr/Trend -log10 P in the Plot Tree. Choose the Add tab and select Add Plot Item(s).Click the Project button on the left side of the Add Data Sources window.

• Choose the first occurrence of KBAC Regression with RefSeq Genes 105 Interim v1, NCBI (to get the result from the KBAC testing for higher risk), then check -log10 P-Value (One-Sided) and click Plot & Close.

• Rename this plot node to -log10 P-Value (OS KBAC Higher Risk) and use the Style tab to change the color to gray.

And follow a similar process to add the SKAT-O results to the Chromosome 6 section of the plot.

• Click on the first Corr/Trend -log10 P in the Plot Tree. Choose the Add tab and select Add Plot Item(s).Click the Project button on the left side of the Add Data Sources window.

• Choose SKAT-O Testing with RefSeq Genes 105 Interim v1, NCBI (Somewhat-small-sample corrected), then check -log10 SKAT-O P-Value and click Plot & Close.

• Click on this new plot node, then use the Style tab to change the color to red.

The plot should now look approximately like Figure 6-3.

You will note that there are only a few places where a rare variant method yields more significant results than the repetitive association test result of -log10 P = 3.698670 (that we mentioned above), and not even that many places where a rare variant method yields results more significant than the other repetitive association test result of -log10 P = (about) 1.8.

Let us look at gene SLC2A12 as an example of individual vs. collective significance.

• Enter SLC2A12 into the Genomic Location Bar near the top of the GenomeBrowse window. A drop-down will appear (see Figure 6-4). In this drop-down, click on the first SLC2A12 entry. The result should look about like Figure 6-5.

We see that, based on the GWAS test, one of the SNPs in this region has a very significant association, two other SNPs in this same region have more or less significant associations, while all the others in this same region have either not much or very little significance. But we see that when taken together and tested using any one of these three collapsing methods, this group of variants shows high significance.

(While the KBAC method may appear to be showing quite a bit less significance than the other methods, it is, as we have mentioned before, showing the maximum significance that it possibly could using 1000 permutations.)

One factor that helps the rare variant methods find more significance is the fact that in any given region, some individual SNPs are monomorphic for the wild type, and thus will not show up in the GWAS results, even while these same monomorphic SNPs do factor into the rare variant analysis algorithms.

If you account for the Bonferroni correction, which is based on a smaller number for the collapsing methods than for the GWAS association test, the difference between the individual and the collective effects will be more pronounced.
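The effect of the Bonferroni correction on the two kinds of analysis can be sketched numerically. The counts of gene regions and SNPs below are hypothetical round numbers, not the counts in the tutorial data, and the family-wise alpha of 0.05 is likewise an assumption.

```python
import math

def bonferroni_neg_log10_threshold(n_tests, alpha=0.05):
    """-log10 of the per-test p-value cutoff alpha / n_tests."""
    return -math.log10(alpha / n_tests)

n_genes = 2000     # hypothetical number of gene regions tested by a collapsing method
n_snps = 100000    # hypothetical number of individual SNPs tested by the GWAS approach

gene_cutoff = bonferroni_neg_log10_threshold(n_genes)
snp_cutoff = bonferroni_neg_log10_threshold(n_snps)

print(round(gene_cutoff, 3), round(snp_cutoff, 3))
```

Because far fewer gene regions than SNPs are tested, a gene-level -log10 P has a lower bar to clear than a SNP-level one, which widens the gap already visible in the uncorrected plot.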



Figure 6-3. -log10 P-values from GWAS in blue, from CMC Hotelling in orange, from KBAC testing for higher risk in gray, and from SKAT-O testing (in Chromosome 6) in red



Figure 6-4. Drop-down for selecting gene SLC2A12



Figure 6-5. Close-up (for gene SLC2A12) of results from GWAS in blue, from CMC Hotelling in orange, from KBAC testing for higher risk in gray, and from SKAT-O testing in red



Side Note: What About the Direction of Effect?

Since we have said that some tests are sensitive to the direction of effect, let’s take a look.

Direction of Effect for KBAC

First, the two one-sided KBAC tests.

• Go to the Plot of Column -log10 P-Value (One-Sided) from KBAC with Permutation Testing with RefSeq Genes 105 Interim v1, NCBI, and click the Plot button in the upper-left corner of the plot window.

• Click the Project button on the left side of the Add Data Sources window.

• Choose Association Tests (Additive Model) and check Corr/Trend R. (Not Corr/Trend P, but instead Corr/Trend R.)

• Click Plot & Close.

This will create, above the existing plot, a whole new plot consisting of the Corr/Trend R value from the individual-variant association test. This R value indicates the direction of effect that each individual SNP is manifesting with our phenotype. See Figure 6-6. (Note: You may wish to stretch the lower boundary of the plot window to see the gene names.)
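How the sign of an R value encodes direction of effect can be sketched with a plain Pearson correlation between genotype counts and case status. The samples, SNP names, and genotype vectors are hypothetical, and treating Corr/Trend R as this correlation is an assumption about the SVS output.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def direction(r):
    """Translate the sign of r into the direction-of-effect wording used in the text."""
    if r > 0:
        return "higher risk"
    if r < 0:
        return "lower risk"
    return "no effect"

phenotype = [1, 1, 1, 1, 0, 0, 0, 0]       # hypothetical: 4 cases, then 4 controls
gene_snps = {
    "snp_1": [1, 1, 0, 0, 0, 0, 0, 0],     # carriers are cases
    "snp_2": [0, 0, 0, 0, 1, 0, 0, 0],     # carrier is a control
    "snp_3": [1, 0, 1, 0, 0, 0, 0, 0],     # carriers are cases
}

directions = {name: direction(pearson_r(g, phenotype)) for name, g in gene_snps.items()}
print(directions)
```

Reading the upper plot this way, points above zero mark SNPs whose minor allele is enriched in cases, and points below zero mark SNPs whose minor allele is enriched in controls.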

We see that for the gene on the left (which is KIF3A), while a number of markers have a subtle effect in the direction of lower risk, the overwhelming effect direction is from the four markers that have a more pronounced effect in the direction of higher risk. This is reflected in the two KBAC results, with the result for higher risk being highly significant and the result for lower risk being zero.

Meanwhile, for the other genes on the right (both SEPT8 and the shorter gene (CCNI2) which overlaps SEPT8), the only effect that any of the markers has is a subtle effect in the direction of lower risk. This is also reflected in the KBAC results, with the results for lower risk being more significant than the results for higher risk, with the higher-risk result for SEPT8 being essentially zero.

Direction of Effect for Generalized SKAT

Now, let’s take a look at the Generalized SKAT results with different values of 𝜌, along with the SKAT-O results. You will remember that Burden testing (𝜌 = 1) is said to be more sensitive to marker effects when they are all in the same direction, while original SKAT (𝜌 = 0) is said to be sensitive to marker effects even if some markers have effects in the opposite direction from the effects of other markers.
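The difference in sensitivity follows from where the squaring happens, as the following sketch shows. The per-marker score values are hypothetical and unweighted, and only the raw statistics are computed, not their p-values.

```python
def q_burden(scores):
    """Burden-style statistic: sum the per-marker scores, then square."""
    return sum(scores) ** 2

def q_skat(scores):
    """SKAT-style statistic: square each per-marker score, then sum."""
    return sum(s * s for s in scores)

same_direction = [2.0, 2.0, 2.0]     # hypothetical scores, all toward higher risk
mixed_direction = [2.0, -2.0, 2.0]   # one marker pulls toward lower risk

for label, scores in (("same", same_direction), ("mixed", mixed_direction)):
    print(label, q_burden(scores), q_skat(scores))
```

Opposite-signed scores cancel inside the Burden sum before squaring, while the SKAT sum is unaffected by sign. The raw values of the two statistics are not directly comparable with each other, since each has its own null distribution; the point is how each one changes between the two scenarios.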

• Go to the Plot of Column -log10 SKAT P-Value from Gen. SKAT test with Rho = 1 and start with the same procedure. Click the Plot button in the upper-left corner of the plot window.

• Click the Project button on the left side of the Add Data Sources window.

• Choose Association Tests (Additive Model) and check Corr/Trend R.

• Click Plot & Close. The plot should look like Figure 6-7.

• Now, as we have done in an earlier plot, enter SLC2A12 into the Genomic Location Bar near the top of the GenomeBrowse window, and click on the first SLC2A12 entry in the dropdown. The result should be like Figure 6-8.

We see that while a couple of SNPs have a subtle effect in the direction of lower risk, most SNPs have effects of varying but significant degree, all in the same direction, that of higher risk. We see that Burden testing (𝜌 = 1), like Generalized SKAT with 𝜌 = .25, comes out somewhat more significant here.

• Now, enter ELOVL4 into the Genomic Location Bar near the top of the GenomeBrowse window, and click on the first ELOVL4 entry in the dropdown. The result should be like Figure 6-9.



Figure 6-6. Close-up for Genes KIF3A and SEPT8, with Upper Plot: Corr/Trend R for Individual SNPs; Lower Plot: KBAC test for higher risk, orange; KBAC test for lower risk, red



Figure 6-7. Upper Plot: Corr/Trend R for Individual SNPs; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red



Figure 6-8. Close-up for Gene SLC2A12, with Upper Plot: Corr/Trend R for Individual SNPs; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red



Figure 6-9. Close-up for Gene ELOVL4, with Upper Plot: Corr/Trend R for Individual SNPs; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red

Here, there are two SNPs with a slight effect in the direction of lower risk, but one SNP with a greater effect in the direction of higher risk. We see that these are different enough effects and effect directions for the original SKAT test (𝜌 = 0) to find more significance for this gene than any other SKAT-related test.

• Finally, go to gene WDR46 in the same manner as above. The result should be like Figure 6-10.

Here, we see an actual instance of a combination of SNP effects where collapsing using a Generalized SKAT test with 𝜌 = .25 works better than collapsing using either a Burden test (𝜌 = 1) or an original SKAT test (𝜌 = 0).

Note that in all cases mentioned above, the SKAT-O result consistently tracks the best Generalized SKAT result, with each result having a SKAT-O -log10 P that is about 0.6 less than that of the best Generalized SKAT -log10 P for that gene.

Conclusion

There are unlimited possible workflows for analysis of sequence data. This tutorial demonstrates a few of the tools available for variant filtering, collapsing, and analysis. Other filtering tools are available in SVS, and we are in the process of developing more. If there is any useful information contained within the plot-viewer annotation data sources, we can create a script to filter variants based on that data. There are numerous annotation sources available in SVS for filtering of variants, and users may develop their own annotations as well. If you have suggestions for improvements to the program, or if you have any questions about how to use any of these functions, please contact us.

Figure 6-10. Close-up for Gene WDR46, with Upper Plot: Corr/Trend R for Individual SNPs; Lower Plot: -log10 P-values from Burden Testing, blue; from Generalized SKAT with Rho = .25, turquoise; from Original SKAT, green; and from SKAT-O, red

References


Bibliography

[Cohen2004] Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH (2004) ‘Multiple rare alleles contribute to low plasma levels of HDL cholesterol’ Science 305(5684):869-872

[Lee2012] Lee S, et al (2012) ‘Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies’ Am J Hum Genet 91:224-237

[LeeWuLin2012] Lee S, Wu MC, Lin X (2012) ‘Optimal tests for rare variant effects in sequencing association studies’ Biostatistics 13(4):762-775

[Liu2009] Liu H, Tang Y, Zhang HH (2009) ‘A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables’ Computational Statistics and Data Analysis 53:853-856

[Morgenthaler2007] Morgenthaler S, Thilly WG (2007) ‘A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST)’ Mutat Res 615(1-2):28-56

[Li2008] Li B, Leal S (2008) ‘Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data’ Am J Hum Genet 83:311-321

[Liu2010] Liu D, Leal S (2010) ‘A Novel Adaptive Method for the Analysis of Next-Generation Sequencing Data to Detect Complex Trait Associations with Rare Variants Due to Gene Main Effects and Interactions’ PLoS Genet 6(10):e1001156. doi:10.1371/journal.pgen.1001156
