NOVEL ANALYSIS OF MULTI-SPECIES TYPE 2 DIABETES
FROM GENE EXPRESSION DATA
CATHERINE ZHENG
BEc, GradDipEd, PostGradDipSc
A THESIS
submitted in total fulfillment of the requirements for the degree
MASTER OF SCIENCE (HONOURS)
School of Computing, Engineering and Mathematics
UNIVERSITY OF WESTERN SYDNEY Sydney, Australia
2012
Statement of Authentication
The work presented in this thesis is, to the best of my knowledge and belief, original except as
acknowledged in the text. I hereby declare that I have not submitted this material, either in full or
in part, for a degree at this or any other institution.
Signature
Table of Contents
List of Figures
List of Tables
Acknowledgements
Abstract
Chapter 1 - Introduction
1.1 Background on Type 2 Diabetes
1.2 A Brief Introduction to Gene Expression and Microarrays
1.3 Highlights of Earlier Research
1.4 Research Case
1.4.1 Research Questions
1.4.2 Goals
Chapter 2 - Research Methods
2.1 Experimental Design
2.2 Data Pre-processing
2.3 Analysis of Differential Gene Expression
2.4 Cluster Analysis
2.5 Functional Annotation and Pathway Analysis
2.5.1 Gene Set Tests
2.5.2 Hypergeometric Test for Gene Set Enrichment Analysis
Chapter 3 - Analysis of Differential Expression
3.1 Data Pre-processing and Quality Assessment
3.1.1 Affymetrix Mouse Microarrays
3.1.2 Agilent Human Microarrays
3.2 Analysis of Differential Expression
3.2.1 Mouse Microarrays
3.2.2 Human Microarrays
Chapter 4 - Pathway Analysis
4.1 Gene Set Tests
4.1.1 Competitive Gene Set Test
4.1.1.1 Gene Ontology (GO) Terms
4.1.1.2 KEGG Pathways
4.1.2 Self-Contained Gene Set Test
4.1.3 Comparison of Three Gene Set Tests for Insulin Related GO Terms
4.1.4 Comparison of Three Gene Set Tests for Glucose Related GO Terms
4.1.5 Comparison of Three Gene Set Tests for the FOXO Gene Set
4.2 Hypergeometric Test for Gene Set Enrichment Analysis
4.2.1 The longitudinal mouse study involving the comparison of a high-fat diet to the
control
4.2.2 The cross-sectional human study comparing expression in tissue samples for a control
group of healthy patients and obese patients
Chapter 5 - Cluster Analysis
5.1 Hierarchical Clustering for Mouse Data Sets
5.2 Hierarchical Clustering for the Human Data Set
5.3 Hierarchical Clustering for the Combined Mouse Data Sets
5.4 Hierarchical Clustering for the Integrated Mouse and Human Data Sets
Chapter 6 - Conclusions
6.1 Addressing the Research Questions
6.2 Discussion
6.3 Comments on the Experimental Design
6.4 Further Work
References
List of Figures
Figure 3-1: Density plot of the longitudinal mouse study involving the comparison of a high-fat
diet to control in two tissues.
Figure 3-2: Boxplots of the longitudinal mouse study involving the comparison of a high-fat diet
to control in two tissues.
Figure 3-3: Density plot of the mouse cell line study of exposure to various substances including
insulin.
Figure 3-4: Boxplots of the mouse cell line study of exposure to various substances including
insulin.
Figure 3-5: Density plot and boxplots of the Rlog2 values in the human data set.
Figure 3-6: Boxplots of the M values in the human data set after loess normalization.
Figure 3-7: Volcano plot of top DE genes between 42 days of a high-fat diet and the control in
the mouse muscle tissue group.
Figure 3-8: Volcano plot of top ranked genes (non DE) between 42 days of a high-fat diet and
the control in the mouse adipose tissue group.
Figure 3-9: Boxplots of the top ranked genes in the contrast of insulin resistant versus lean
control.
Figure 3-10: Boxplots of the top ranked genes in the contrast of diabetic versus lean control.
Figure 3-11: Boxplots for 2 probes corresponding to gene SLC2A4 in the human data.
Figure 4-1: Boxplots of gene expression levels (M values) of 62 common probes in both
"Pancreatic cancer" and "Colorectal cancer" KEGG pathways on the Agilent human array
Figure 4-2: Boxplots of gene expression levels (M values) of 42 common probes in both
"Pancreatic cancer" and "Endometrial cancer" KEGG pathways on the Agilent human array
Figure 4-3: Scatterplot matrix of down-regulated unadjusted p-values from three methods
Figure 4-4: Scatterplot matrix of up-regulated unadjusted p-values from three methods
Figure 4-5: Scatterplot matrix of down-regulated unadjusted p-values from three methods in the
contrast between diabetic and lean control
Figure 5-1: Hierarchical clustering of the mouse high-fat diet data based on Euclidean distance.
Figure 5-2: Hierarchical clustering of the mouse high-fat diet data based on Pearson correlation
Figure 5-3: Hierarchical clustering of the mouse cell line data based on Euclidean distance.
Figure 5-4: Hierarchical clustering of the mouse cell line data based on Pearson correlation.
Figure 5-5: Hierarchical clustering of the human data based on Euclidean distance.
Figure 5-6: Hierarchical clustering of the human data based on Pearson correlation.
Figure 5-7: Hierarchical clustering of the combined mouse data based on Euclidean distance.
Figure 5-8: Hierarchical clustering of the combined mouse data based on the top DE genes
Figure 5-9: Hierarchical clustering of arrays using the combined longitudinal mouse study and
the human model.
Figure 5-10: Hierarchical clustering of the top differential expressed (DE) genes using the
combined longitudinal mouse study and the human model.
Figure 5-11: Hierarchical clustering of both the top DE genes and arrays using the combined
longitudinal mouse study and the human model.
Figure 5-12: Hierarchical clustering of the arrays based on a selected cluster of DE genes.
List of Tables
Table 2-1: Details of the longitudinal mouse study
Table 2-2: Details of the cross-sectional human study
Table 2-3: Details of the mouse cell line study
Table 3-1: Differentially expressed genes found in the muscle tissue group
Table 3-2: Top 10 DE genes between 42 days of a high-fat diet and control
Table 3-3: Top 10 DE genes between 14 days of a high-fat diet and control
Table 3-4: Top 10 genes with large fold changes between 42 days of a high-fat diet and the
control in the mouse adipose tissue group
Table 3-5: Top 10 differentially expressed probes in the mouse cell line data
Table 3-6: Differentially expressed genes found in the human data
Table 3-7: Top 10 ranked genes for the contrast of insulin resistant versus lean control in the
human data
Table 3-8: Top 10 ranked genes for the contrast of diabetic versus lean control in the human data
Table 4-1: Top 30 significantly up-regulated GO terms in the contrast between insulin resistant
patients and lean control
Table 4-2: Top 30 significantly down-regulated GO terms in the contrast between insulin
resistant patients and lean control
Table 4-3: Top 30 significantly mixed-regulated GO terms in the contrast between insulin
resistant patients and lean control
Table 4-4: Top 30 significantly up-regulated GO terms in the contrast between diabetic patients
and lean control
Table 4-5: Top 30 significantly down-regulated GO terms in the contrast between diabetic
patients and lean control
Table 4-6: Top 30 significantly mixed-regulated GO terms in the contrast between diabetic
patients and lean control
Table 4-7: Top 10 significantly up-regulated GO terms in the contrast between insulin sensitive
patients and lean control
Table 4-8: Top 10 significantly down-regulated GO terms in the contrast between insulin
sensitive patients and lean control
Table 4-9: Top 10 significantly mixed-regulated GO terms in the contrast between insulin
sensitive patients and lean control
Table 4-10: 55 significantly up-regulated KEGG pathways in the contrast between insulin
resistant and lean control
Table 4-11: 19 significantly down-regulated KEGG pathways in the contrast between insulin
resistant and lean control
Table 4-12: 9 significantly mixed-regulated KEGG pathways in the contrast between insulin
resistant and lean control
Table 4-13: 56 significantly up-regulated KEGG pathways in the contrast between diabetic and
lean control
Table 4-14: 35 significantly up-regulated KEGG pathways in both insulin resistant and diabetic
groups
Table 4-15: 17 significantly down-regulated KEGG pathways in the contrast between diabetic
and lean control
Table 4-16: 13 significantly down-regulated KEGG pathways in both insulin resistant and
diabetic groups
Table 4-17: 20 significantly mixed-regulated KEGG pathways in the contrast between diabetic
and lean control
Table 4-18: 5 significantly mixed-regulated KEGG pathways in both insulin resistant and
diabetic groups
Table 4-19: Top 30 significantly up-regulated KEGG pathways in the contrast between insulin
sensitive and lean control
Table 4-20: 8 significantly down-regulated KEGG pathways in the contrast between insulin
sensitive and lean control
Table 4-21: 32 significantly mixed-regulated KEGG pathways in the contrast between insulin
sensitive and lean control
Table 4-22: Rotation Gene Set Test - Top 8 mixed-regulated KEGG pathways in the contrast
between diabetic and lean control
Table 4-23: Rotation Gene Set Test - Top 20 up-regulated in the contrast between insulin
sensitive and lean control
Table 4-24: Comparison of down-regulated insulin related GO terms in the contrast between
insulin resistance and lean control
Table 4-25: Comparison of up-regulated insulin related GO terms in the contrast between insulin
resistance and lean control
Table 4-26: Comparison of down-regulated insulin related GO terms in the contrast between
diabetic and lean control
Table 4-27: Comparison of up-regulated insulin related GO terms in the contrast between
diabetic and lean control
Table 4-28: Comparison of up-regulated insulin related GO terms in the contrast between insulin
sensitive and lean control
Table 4-29: Comparison of down-regulated glucose related GO terms in the contrast between
insulin resistance and lean control
Table 4-30: Comparison of down-regulated glucose related GO terms in the contrast between
diabetic and lean control
Table 4-31: Comparison of up-regulated glucose related GO terms in the contrast between
diabetic and lean control
Table 4-32: Summary of three gene set tests for the FOXO gene set in three contrasts
Table 4-33: Top 30 Over-represented GO (BP) terms after 42 days of high-fat diet in the
longitudinal mouse study
Table 4-34: 16 over-represented KEGG pathways after 42 days of high-fat diet in the
longitudinal mouse study
Table 4-35: Top 12 Over-represented GO terms in the contrast between insulin resistant patients
and the control
Table 4-36: Top 10 Over-represented KEGG Pathways in the contrast between insulin resistant
patients and the control
Table 4-37: Top 15 Over-represented GO terms in the contrast between diabetic patients and the
control
Table 4-38: 12 Over-represented GO terms in the contrast between insulin sensitive patients and
the control
Table 4-39: 4 Over-represented KEGG pathways in the contrast between insulin sensitive
patients and the control
Table 6-1: Summary of the number of significant GO terms in the contrast between insulin
resistant patients and controls
Table 6-2: Summary of the number of significant GO terms in the contrast between diabetic
patients and controls
Table 6-3: Summary of the number of significant GO terms in the contrast between insulin
sensitive patients and controls
Acknowledgements
Firstly, I would like to say
Abstract
Purpose
The incidence of type 2 diabetes is reaching epidemic levels. Today type 2 diabetes is
the most common form of diabetes, accounting for 85 to 90 percent of diabetes cases. The
James Lab at Garvan Institute for Medical Research are interested in gene expression in insulin
resistance and diabetes. They have provided three gene expression data sets: a longitudinal
mouse study involving the comparison of a high-fat diet to a standard diet with gene expression
in two tissues, a mouse cell line study and a cross-sectional human study. The main goals of
this research are to identify differentially expressed genes in both the mouse and human data,
to compare genomic expression patterns across species (human and mouse), and to focus on
pathway analysis for detecting differential expression in predefined gene sets based on Gene
Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
Methods
The three data sets are normalized in order to remove experimental effects arising from the
microarray technology. Linear models are then fitted to the normalized data using the limma
package to identify differentially expressed genes. Each gene has its own expression
profile and genes with similar profiles can be grouped together. We aim to use the
data sets together to cluster samples based on gene profiles. In reality, biological processes are
complicated, with many molecules working together. The goal of annotating the genome is to link
all information associated with gene products in order to learn how pathways function in the
biological system. In situations where long lists of genes are found to be differentially
expressed, we consider focusing on the analysis of gene sets because it is more sensible to
investigate gene sets that are functionally related based on prior biological knowledge or
experiments. We explore the potentially interesting gene sets using the Gene Ontology (GO)
database and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Differentially
expressed genes detected in the mouse data are mapped to their corresponding gene sets based
on the Gene Ontology terms and KEGG pathways. Competitive and self-contained gene set
tests (the mean-rank gene set test and the rotation gene set test) are performed for each
comparison in the human data.
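To illustrate the idea behind a competitive mean-rank gene set test, the following is a minimal sketch in Python. It is a generic Wilcoxon rank-sum test with a normal approximation and no tie correction in the variance, which is an assumption made for illustration and not necessarily the exact procedure applied in this thesis. The function names and the toy statistics are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mean_rank_test_up(statistics, in_set):
    """One-sided p-value that genes in the set rank higher than the rest.

    statistics: one test statistic per gene (e.g. a moderated t-statistic).
    in_set: booleans marking which genes belong to the gene set.
    """
    n = len(statistics)
    m = sum(in_set)
    ranks = average_ranks(statistics)
    rank_sum = sum(r for r, flag in zip(ranks, in_set) if flag)
    expected = m * (n + 1) / 2            # mean of the rank sum under the null
    variance = m * (n - m) * (n + 1) / 12  # variance under the null, ignoring ties
    z = (rank_sum - expected) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

# Toy example: the two set genes have the largest statistics.
stats = [0.1, 0.2, 0.3, 0.4, 2.5, 3.0]
p_up = mean_rank_test_up(stats, [False, False, False, False, True, True])
```

The test is competitive in the sense that the genes in the set are compared against all remaining genes on the array, rather than against a fixed null hypothesis of no differential expression within the set.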
The correlation adjusted mean-rank gene set test is included in testing insulin or glucose
related GO terms and KEGG pathways. To test if any GO terms (Biological Process) or KEGG
pathways are over-represented in a list of differentially expressed genes in the mouse or human
data sets, we carry out the hypergeometric test.
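As a rough illustration of the hypergeometric test for over-representation, the sketch below computes the probability of observing at least a given overlap between a gene set and a list of differentially expressed genes when the list is drawn at random from the gene universe. The function name, argument names and the toy numbers are assumptions made for this example only.

```python
from math import comb

def hypergeom_enrichment_pvalue(universe_size, set_size, de_size, overlap):
    """Upper-tail probability P(X >= overlap) for a hypergeometric X.

    universe_size: number of genes on the array (the background).
    set_size: number of background genes annotated to the gene set.
    de_size: number of differentially expressed genes.
    overlap: number of differentially expressed genes in the gene set.
    """
    max_overlap = min(set_size, de_size)
    total = comb(universe_size, de_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, de_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Toy example: 20 genes in total, 5 in the gene set, 4 DE genes, 2 of them in the set.
p_value = hypergeom_enrichment_pvalue(universe_size=20, set_size=5, de_size=4, overlap=2)
```

A small p-value suggests that the gene set contains more differentially expressed genes than would be expected by chance. In practice the p-values for many gene sets would also need adjusting for multiple testing.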
Results
We identify a large number of differentially expressed genes in the muscle tissue from
the longitudinal mouse study. The cross-species gene set tests have revealed significant GO
terms and KEGG pathways in each condition of obese patients relative to healthy controls. We
compare the results produced by the mean-rank gene set test and the rotation gene set test.
Significant insulin or glucose related gene sets are found using three gene set testing methods
and the results are compared. The FOXO gene set is found to be significantly up-regulated in
two contrasts in the human data.
Chapter 1 - Introduction
In this chapter we give a brief overview of the onset of type 2 diabetes globally and
of the related health issues currently facing Australia. The research case and the
research questions are explained, and the goals are set.
Highlights of research completed in recent years are discussed to outline the
achievements and the gaps in the analysis of gene expression data in areas related to type
2 diabetes.
1.1 Background on Type 2 Diabetes
Type 2 diabetes is a chronic disease and the pathogenesis of this disease involves
metabolic abnormalities in both insulin action and insulin secretion (Weyer et al. 1999). Insulin is
produced by the pancreas and plays a significant role in converting glucose into energy in the
metabolic system.
A brief explanation of type 2 diabetes and the relationship between type 2 diabetes and
insulin resistance given by the James Lab at Garvan Institute for Medical Research is as
follows:
"In Type 2 diabetes there is a relative deficiency of insulin - that is the body still produces
insulin but is unable to produce insulin in sufficient quantities to hold blood sugar levels within
normal limits. Increasingly in Type 2 diabetes this is due to insulin resistance - the inability of
the body's tissue to respond to insulin in a normal way."
(http://www.jameslab.com.au/WhatIsDiabetes.shtml)
The incidence of type 2 diabetes is reaching epidemic levels. Today type 2 diabetes is
the most common form of diabetes, accounting for 85 to 90 percent of diabetes cases. Further,
a greater number of younger people are getting type 2 diabetes whereas previously it mainly
affected older adults. Diabetes is Australia
remain undiagnosed. By 2031 it is estimated that 3.3 million Australians will have type 2
diabetes (Vos et al. 2004). The burden of type 2 diabetes is increasing and it is expected to
become the leading cause of disease burden by 2023 (AIHW 2010).
1.2 A Brief Introduction to Gene Expression and Microarrays
Deoxyribonucleic Acid (DNA) carries the genetic instructions for producing proteins in
living organisms. Proteins are essential parts of organisms providing function and regulation in
cells. Genes are segments of DNA responsible for making proteins. The process of converting
from genes to proteins can be described as the central dogma of molecular biology: genes are
first transcribed into messenger ribonucleic acid (mRNA) and mRNA is translated into a chain of
amino acids which, after further processing, form a protein. Genes are considered to be
expressed within a cell or organism when they produce RNA, some of which is translated into
proteins. It is crucial to quantify the amount of protein produced when genes are expressed,
but directly measuring protein levels is difficult. Therefore, the levels of mRNA are used
instead to determine the levels of gene expression.
To be able to measure the levels of mRNA for tens of thousands of genes in a single
experiment, microarray technology was introduced and developed into a number of platforms. A
microarray is a solid surface, such as a silicon or glass chip, on which probe sequences from
different genes are fixed. Microarray technology allows scientists to measure the gene
expression levels of a large number of genes under different diseases or various experimental
conditions.
There are two main categories of microarrays based on the type of data being produced:
one-channel and two-channel microarrays. Affymetrix arrays are the representative one-channel
platform. A number of manufacturers, such as Agilent, specialise in two-channel arrays. The
microarrays used in this study are Affymetrix one-channel arrays and Agilent two-channel
arrays. The Affymetrix one-channel platform hybridizes one sample per chip. For two-channel
arrays, two samples are applied to each array. Our two-channel arrays follow a reference
design: a common reference is used on all arrays and each sample is compared to that
reference.
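In a reference design, the comparison of each sample to the common reference is commonly summarised as a log2 ratio of the two channel intensities, often called the M value. The following minimal sketch shows that calculation. The function name and the intensities are hypothetical, and which channel carries the sample is an assumption that depends on the actual labelling of the arrays.

```python
import math

def m_value(sample_intensity, reference_intensity):
    """Log2 ratio of sample to reference intensity for one probe.

    Positive values mean higher expression in the sample than in the
    common reference; negative values mean lower expression.
    """
    return math.log2(sample_intensity / reference_intensity)

# Toy example: the sample channel is four times brighter than the reference.
m = m_value(800.0, 200.0)
```

Because every array shares the same reference, M values from different arrays can be compared with one another once they have been normalized.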
1.3 Highlights of Earlier Research
Type 2 diabetes results from a combination of genetic and environmental factors.
Although there is a genetic predisposition, the risk is greatly increased when associated with
lifestyle factors such as high blood pressure, overweight or obesity, insufficient physical activity
and unhealthy diet (http://www.diabetesaustralia.com.au). There is currently no cure for type 2
diabetes. Therefore, understanding the relationship between gene expression and insulin
resistance and type 2 diabetes should lead to better understanding of the disease and
potentially early diagnosis.
A research group in Europe carried out experiments on male mice (i.e., C57BL/6J mice)
that were randomly assigned to a low-fat palm oil diet or a high-fat palm oil diet for 3 or 28 days
(de Wilde et al. 2008). A series of analyses were performed including microarray gene
expression and protein analysis. Two methods were investigated for the analysis of gene
expression: the first method is based on overrepresentation of Gene Ontology (GO) terms
whereas the second method is gene set enrichment analysis (Subramanian et al. 2005). The
sources for the gene set enrichment analysis were GenMAPP, Kyoto Encyclopedia of Genes
and Genomes (KEGG) and SKmanual. It was found that short-term high-fat feeding led to
altered expression levels of genes involved in a variety of biological processes including
morphogenesis, energy metabolism, lipogenesis, and immune function (de Wilde et al. 2008).
Another recent microarray analysis conducted in Japan compared a male rat model of
spontaneous type 2 diabetes resembling obese patients with type 2 diabetes (i.e., OLETF rats)
to a non-diabetic control group of male rats (Hayashi et al. 2010). Gene expression analysis in
diabetes-related tissues (i.e., liver, adipose and skeletal muscle tissue) was carried out and the
results show that blood gene expression profiling is a useful source of markers to predict type 2
diabetes (Hayashi et al. 2010). Hierarchical clustering analysis of differentially expressed genes
in diabetes-related tissues was performed and overrepresented Gene Ontology terms
(Biological Process) that mapped to these differentially expressed (DE) genes were reported to
support their conclusions (Hayashi et al. 2010). No databases or pathways other than the GO
terms were used. No human samples were collected because of ethical issues, so no
cross-species comparison could be made. The James Lab managed to obtain tissue
samples from obese insulin sensitive patients, obese patients with insulin resistance and
obese patients with type 2 diabetes. We are interested in cross-species comparison, i.e.,
investigating how genes and gene sets of interest found in the mouse data behave in the human data.
1.4 Research Case
The James Lab at the Garvan Institute of Medical Research is interested in gene
expression in insulin resistance and type 2 diabetes. They have collected several gene
expression data sets, and we have access to three of them. Two of the data sets come from
Affymetrix mouse microarrays; the third comes from Agilent arrays used in a human study. A brief
description of the gene expression microarrays analysed in this thesis follows. Detailed
descriptions of the three data sets are given in Chapter 2 Research Methods.
• A longitudinal mouse study involving the comparison of a high-fat diet to control with
gene expression in two tissues, adipose and muscle, measured by microarray.
• A cross-sectional human study comparing expression in tissue samples for a control
group of non-obese healthy patients, obese insulin sensitive patients and patients
with 2 stages of the disease, i.e., obese patients with insulin resistance and type 2
diabetes.
• A 3T3L1 mouse cell line study of exposure to various substances including insulin.
1.4.1 Research Questions
The research questions we are going to address in this study are as follows:
1. Which genes are differentially expressed in each condition in the human data, relative to
healthy controls?
2. Which genes are differentially expressed in each treatment group in the mouse data
involving the comparison of a high-fat diet to controls?
3. Which genes are differentially expressed in each treatment group in the mouse cell line
study?
4. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese patients with insulin resistance
in the human data relative to healthy controls?
5. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese patients with type 2 diabetes in
the human data relative to healthy controls?
6. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese insulin sensitive patients in the
human data relative to healthy controls?
7. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in the mouse data with a high-fat diet?
8. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese patients with insulin resistance relative to healthy controls?
9. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese patients with type 2 diabetes relative to healthy controls?
10. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese insulin sensitive patients relative to healthy controls?
11. Are insulin/glucose related GO terms differentially expressed in each condition in the
human data, relative to healthy controls?
12. Is a set of FOXO genes differentially expressed?
1.4.2 Goals
The aim of this research is to investigate existing and novel approaches to the analysis
of these data sets. We have set three main goals for this research project:
1. To identify genes that show significant changes in gene expression levels for each data
set.
2. To compare genomic expression patterns across species, human and mouse, and to
integrate results from these studies.
3. To focus on pathway analysis aiming at detecting differential expression in predefined
gene sets based on two popular databases, Gene Ontology (GO) and Kyoto
Encyclopedia of Genes and Genomes (KEGG). We are particularly interested in how
differentially expressed (DE) gene sets detected in the mouse data involving a high-fat
diet behave in the human data. Competitive and self-contained gene set tests are
applied to determine significant gene sets, and their results are compared. This may help us
better understand the biological processes involved in the development of insulin
resistance and type 2 diabetes in humans.
It has been found that the whole chromosome sequence segments of mouse and human
are remarkably similar (Copeland, Jenkins & O'Brien 2002). Therefore, it is sensible to use the
mouse as a model organism to investigate functions of genes in the human genome. We expect
that a certain number of genes and gene sets interact in similar ways in several biological
systems for both species.
Chapter 2 - Research Methods
Microarrays are used to measure gene expression levels under different diseases or
experimental conditions. The two microarray platforms used in this study are Affymetrix short
oligonucleotide arrays and Agilent long oligonucleotide arrays. Microarray technology
enables researchers to measure the expression levels of large numbers of genes
simultaneously and efficiently in multiple biological samples. High-density gene
expression arrays require a pre-processing step before the statistical
investigation can be carried out.
The research methods used in our statistical analysis are discussed in the five sections that
follow. The open source statistical programming language R and Bioconductor packages are
used throughout the study.
2.1 Experimental Design
The experimental design was completed by the James Lab at Garvan Institute of
Medical Research. The data sets consist of three designs, namely:
• A longitudinal mouse study involving the comparison of a high-fat diet to control with
expression in two tissues, adipose and muscle, measured by microarray (Table 2-1).
Table 2-1: Details of the longitudinal mouse study
Tissue Type   Type of Diet   Days   Symbol   Replicates
Adipose       Standard Lab   -      Achow    4
Adipose       High-fat       5      Ahi5     4
Adipose       High-fat       14     Ahi14    4
Adipose       High-fat       42     Ahi42    3
Muscle        Standard Lab   -      Mchow    4
Muscle        High-fat       5      Mhi5     4
Muscle        High-fat       14     Mhi14    3
Muscle        High-fat       42     Mhi42    4
One group of mice were given a standard lab diet which consists of 8% calories from fat,
21% calories from protein and 71% calories from carbohydrate. The other three groups
of mice were given a high-fat diet for 5, 14 and 42 days respectively. The high-fat diet
consists of 45% calories from fat, 20% calories from protein and 35% calories from
carbohydrate. The gene expression levels were measured in both adipose and muscle
tissues using Affymetrix short oligonucleotide arrays for all groups of mice. The
The various substances include the following compounds:
1. Chronic Insulin
2.3 Analysis of Differential Gene Expression
In order to identify genes that are differentially expressed (DE) between two groups,
we fit a linear model to each gene to estimate the differences and then
test the null hypothesis that the population mean intensity is the same
in both groups. The limma package (linear models for microarray data)
is designed for analysing complex microarray experiments and can be applied to data
from both single-channel and two-colour microarray platforms (Smyth 2004).
To illustrate the idea of the linear models in the limma package, we use the following matrix
notation:
For gene $g$, $E[Y_g] = X\alpha_g$, where $Y_g$ is the logged expression vector for gene $g$, $X$ is the
design matrix and $\alpha_g$ is a vector of coefficients (Smyth 2004).
For the cross-sectional human study and the mouse cell line study, a design matrix is
defined for each data set before fitting a linear model for each gene on the array. For the
longitudinal mouse study involving the comparison of a high-fat diet to a standard diet with
expression levels in two tissues (adipose and muscle), the data set is first divided into two
separate subsets according to the type of tissue before a design matrix is specified for each
tissue type. Then a contrast matrix is created for each design matrix specifying the comparisons
of interest between each time point after which a high-fat diet was given. Based on the contrast
matrix, we are able to compare the initial coefficients in many ways, as required.
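The role of the design and contrast matrices can be illustrated with a minimal sketch. It is written in plain Python rather than the R/limma code used in this study, and the expression values, the group labels (`chow`, `hi5`, `hi14`, `hi42`) and the helper names `solve` and `fit` are hypothetical, chosen only to mirror the replicate numbers of the muscle subset in Table 2-1.

```python
# Hypothetical log2 expression values for ONE gene in the muscle subset
# (4 chow, 4 hi5, 3 hi14 and 4 hi42 arrays, as in Table 2-1).
groups = ["chow", "hi5", "hi14", "hi42"]
samples = ["chow"] * 4 + ["hi5"] * 4 + ["hi14"] * 3 + ["hi42"] * 4
y = [8.0, 8.2, 7.9, 8.1,
     8.3, 8.1, 8.4, 8.2,
     8.6, 8.5, 8.7,
     9.0, 9.2, 8.9, 9.1]

# Group-means design matrix X: one indicator column per group.
X = [[1.0 if s == g else 0.0 for g in groups] for s in samples]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(X, y):
    """Least-squares coefficients from the normal equations X'X a = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


alpha = fit(X, y)  # with this design, alpha holds the four group means

# Contrasts: each one compares a high-fat time point with the chow control.
contrasts = {
    "hi5_vs_chow": [-1, 1, 0, 0],
    "hi14_vs_chow": [-1, 0, 1, 0],
    "hi42_vs_chow": [-1, 0, 0, 1],
}
beta = {name: sum(c * a for c, a in zip(coef, alpha))
        for name, coef in contrasts.items()}
```

With the group-means parametrisation the fitted coefficients are simply the group means, so each contrast is an estimated log fold change relative to the control. Other comparisons (for example 42 days versus 5 days) only require a different row of contrast weights, not a new model fit.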
One of the common issues in microarray experiments with small sample sizes is that
some genes with small but consistent fold changes can appear to be differentially expressed
with large t-statistics, because their sample variances happen to be very small. To guard
against this, we modify the denominator of the
standard t-statistic. The limma package does so by using an empirical Bayes
method to moderate the standard errors of the estimated log-fold changes, an approach known as
shrinkage estimation. This results in more stable inference and improved power, especially for
experiments with small numbers of arrays (Smyth 2004).
The empirical Bayes method assumes a scaled inverse chi-square prior for $\sigma_g^2$ with
prior estimate $s_0^2$ and degrees of freedom $d_0$. The posterior values for the residual variances are
given by
$$\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}$$
where $s_g^2$ is the sample residual variance and $d_g$ is the residual degrees of freedom for the $g$th gene.
The moderated t-statistic for the $k$th contrast for gene $g$ is given by
$$\tilde{t}_{gk} = \frac{\hat{\beta}_{gk}}{u_{gk}\,\tilde{s}_g}$$
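The effect of the shrinkage can be seen in a small numerical sketch. It assumes that the prior values $s_0^2$ and $d_0$ are already known (limma estimates them from all genes on the array, which is not reproduced here) and that $u_{gk}$ denotes the unscaled standard deviation of the contrast. All numbers and function names are hypothetical.

```python
import math


def posterior_variance(s2_g, d_g, s2_0, d_0):
    """Shrink the gene-wise variance s2_g towards the prior value s2_0."""
    return (d_0 * s2_0 + d_g * s2_g) / (d_0 + d_g)


def ordinary_t(beta_hat, u, s2_g):
    """Ordinary t-statistic using the gene's own sample variance."""
    return beta_hat / (u * math.sqrt(s2_g))


def moderated_t(beta_hat, u, s2_g, d_g, s2_0, d_0):
    """Moderated t-statistic using the posterior (shrunken) variance."""
    s_tilde = math.sqrt(posterior_variance(s2_g, d_g, s2_0, d_0))
    return beta_hat / (u * s_tilde)


# Hypothetical gene: small fold change, very small sample variance.
beta_hat = 0.2
u = math.sqrt(1 / 4 + 1 / 4)   # comparing two groups of 4 arrays
s2_g, d_g = 0.0004, 6          # gene-wise variance and residual df
s2_0, d_0 = 0.04, 4            # assumed prior values, not estimated here

t_ord = ordinary_t(beta_hat, u, s2_g)
t_mod = moderated_t(beta_hat, u, s2_g, d_g, s2_0, d_0)
```

Here the ordinary t-statistic is large only because the sample variance is tiny. Once the variance is pulled towards the prior value, the moderated statistic is much smaller, which is the behaviour the empirical Bayes moderation is intended to produce.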
2.4 Cluster Analysis
Apart from finding differentially expressed genes, we are also interested in the
similarities between genes as well as between samples. Each gene has its own expression
profile and genes with similar profiles tend to cluster together. We will perform hierarchical
clustering using Euclidean distance as the distance measure and complete linkage as the
clustering algorithm to produce a dendrogram for each data set. It is also of interest to compare
the dendrograms while using correlation as the similarity measure and complete linkage as the
clustering algorithm. Correlation-based measures are in general invariant to location and scale
transformations and tend to group together genes whose expression patterns are linearly
related (Gentleman et al. 2005).
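The difference between the two measures can be illustrated with a short sketch in plain Python (not the R code used in this study). The three expression profiles are hypothetical, and one minus the Pearson correlation is used here as one common choice of correlation-based distance, which is an assumption rather than a detail taken from the analysis.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """One minus the Pearson correlation (0 = perfectly linearly related)."""
    return 1.0 - pearson(x, y)


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [12.0, 14.0, 16.0, 18.0]  # gene_a scaled by 2 and shifted by 10
gene_c = [4.0, 3.0, 2.0, 1.0]      # same range as gene_a, opposite trend

d_euc_ab = euclidean(gene_a, gene_b)
d_euc_ac = euclidean(gene_a, gene_c)
d_cor_ab = correlation_distance(gene_a, gene_b)
d_cor_ac = correlation_distance(gene_a, gene_c)
```

Under Euclidean distance `gene_a` is closer to `gene_c`, because the two profiles share the same range of values. Under the correlation-based distance `gene_a` and `gene_b` are indistinguishable, because one is a linear transformation of the other, so the two dendrograms can group the same genes quite differently.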
We also intend to use the data sets together to cluster either genes or
samples based on gene profiles. Since the two mouse data sets used the same Affymetrix
Mouse Genome 430_2 arrays, we can integrate the data sets based on their probe identifiers
before clustering the samples. Standardization is carried out on both data sets after integration.
Including the human (Agilent) data requires mapping gene identifiers between
array types.
Chapter 5 shows all the analysis completed for hierarchical clustering.
2.5 Functional Annotation and Pathway Analysis
In gene expression microarray analysis, statistical significance may not necessarily
translate into biological relevance. After fitting linear models and correcting for multiple testing
error rates, some genes that are biologically relevant may not appear to be statistically
significant due to the fact that the relevant biological differences are modest relative to the noise
inherent in the microarray technology (Subramanian et al. 2005). In reality, biological processes
are complicated with many molecules working together. The goal of annotating the genome is to
link all information associated with gene products in order to learn how pathways function in the
biological system. In situations where long lists of genes are found to be differentially
expressed, we will consider pathway-level approaches by going beyond the analysis of
individual genes because it is more sensible to focus on groups of genes that are functionally
related based on prior biological knowledge or experiments. Comprehensive sets of results are
presented in Chapter 4 Pathway Analysis.
2.5.1 Gene Set Tests
To define gene sets according to prior biological experimentation, we use two popular
databases for gene annotation and pathways analysis: Gene Ontology (GO) and Kyoto
Encyclopedia of Genes and Genomes (KEGG).
permuting genes. Hence it does not consider any correlation among genes in a gene set to be
an important factor, which may underestimate the variability of the data resulting in small p-
values. CAMERA was recently developed taking inter-gene correlation into consideration by
incorporating the variance inflation factor into the test procedures (Wu & Smyth 2012). The
variance inflation factor is based on the mean correlation estimated directly from residuals from
the linear model for genes in the test set and the procedure is equivalent to computing the
average of all possible pairwise correlations between genes in the set (Wu & Smyth 2012).
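To make the idea concrete, the sketch below computes the mean pairwise correlation of a small set of residual vectors and the corresponding variance inflation factor, using the form $\mathrm{VIF} = 1 + (m - 1)\bar{\rho}$ for a set of $m$ genes given by Wu & Smyth (2012). The residuals are hypothetical, and the code is a direct pairwise calculation in plain Python, not the estimation procedure implemented in the limma package.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two residual vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_correlation(residuals):
    """Average correlation over all pairs of genes in the set."""
    pairs = list(combinations(residuals, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def variance_inflation_factor(residuals):
    """VIF = 1 + (m - 1) * mean pairwise correlation, m genes in the set."""
    m = len(residuals)
    return 1.0 + (m - 1) * mean_pairwise_correlation(residuals)


# Hypothetical residuals (rows = genes in the set, columns = arrays).
uncorrelated = [[1.0, -1.0, 1.0, -1.0],
                [1.0, 1.0, -1.0, -1.0]]
fully_correlated = [[1.0, -1.0, 0.0, 2.0, -2.0]] * 3

vif_none = variance_inflation_factor(uncorrelated)
vif_full = variance_inflation_factor(fully_correlated)
```

When the genes in a set are uncorrelated the factor equals 1 and nothing changes. When they are perfectly correlated the factor equals the set size, reflecting that the set then carries the information of a single gene.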
The second approach is a self-contained gene set test - the rotation gene set test. It
tests if any of the genes in the set are differentially expressed without regard to other genes on
the array. The rotation gene set test replaces permutation with random rotations of residuals to
resolve issues associated with permuting genes, i.e., not allowing for correlations between
genes (Wu et al. 2010). The number of rotations can be set to a very large value, so it avoids
the problem with small number of replicates which may lead to unreliable estimates of p-values
(Wu et al. 2010). In the rotation gene set test, we test three alternative hypotheses: up, down
and mixed. The null hypothesis is that a contrast of the coefficients $\beta_g$ is equal to zero, i.e.,
$\beta_g = 0$ for all genes in the gene set of interest. The
as GLUT4, is the insulin-responsive glucose transporter in muscle and adipose tissue that
plays an important role in postprandial glucose disposal (Stenbit et al. 1997). Altered SLC2A4
activity is suggested to be one of the factors responsible for decreased glucose uptake in
muscle and adipose tissue in obesity and diabetes (Stenbit et al. 1997). FOXO transcription
factors are evolutionarily conserved mediators of insulin and growth factor signalling. They are
at the interface of crucial cellular processes, orchestrating programs of gene expression that
regulate apoptosis, cell-cycle progression, and oxidative stress resistance (Carter & Brunet
2007). We will test the gene sets associated with the above important genes in order to detect
any differential expression at the group level in the human data.
"Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for understanding
high-level functions and utilities of the biological system, such as the cell, the organism and the
ecosystem, from genomic and molecular-level information." (http://www.kegg.jp/kegg/)
We will repeat the mapping and testing procedures for KEGG pathways discussed in the
previous paragraphs.
2.5.2 Hypergeometric Test for Gene Set Enrichment Analysis
Another commonly used approach to finding functional groupings within a set of
differentially expressed genes is to test for over-represented gene sets in a list of significant
genes. This approach addresses research questions 7 to 10. We intend to use the
non-parametric hypergeometric test to investigate whether there is an association between genes being
differentially expressed and having a particular function. To demonstrate the concept of the
hypergeometric distribution, we use the following notation: let $N$ be the total number of genes (i.e., the
gene universe); let $M$ be the number of genes from a particular GO category; let $n$ be the
number of differentially expressed genes; let the random variable $X$ be the number of genes from a
particular GO category which appear in the list of $n$ differentially expressed genes. The
probability mass function of $X$ is given by
$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$
To compute the final probability of finding $x$ or more genes we need to sum up all the
probabilities from $x$ to $\min(n, M)$. If a gene set associated with a particular GO term contains
more differentially expressed genes than would be expected by chance, this gene set is over-
represented and will give us insight into the functional characteristics of the gene list (Falcon &
Gentleman 2007). To determine if any GO terms or KEGG pathways are over-represented in a
list of differentially expressed genes, we follow these steps below (Falcon & Gentleman 2007):
1. Carry out nonspecific filtering and define the gene universe.
We use the inter-quartile range to estimate the variation across samples. Probes with an
inter-quartile range of less than 0.5 are considered less informative, so they are removed.
Probes with no corresponding Entrez Gene identifiers or annotation in the GO categories
(Biological Process) are also removed. When two or more probes map to the same Entrez Gene
ID, only the probe with the largest inter-quartile range is chosen. We then define those genes
that passed the nonspecific filtering process as the gene universe.
2. Determine a subset of interesting genes.
We use the results from the analysis of differential expression generated previously using the
limma package. For the longitudinal mouse study involving the comparison of a high-fat diet to
control, differentially expressed genes from the muscle tissue group are selected based on the
adjusted p values using a cutoff of 0.01 due to the enormous number of DE probes identified.
For the cross-sectional human study, the top 50 probes are used because of the very few DE
genes found (i.e., no more than 12) in any of the contrasts.
3. Test for over-representation in the collection of gene sets.
The GOstats package provides tools to perform the hypergeometric test for over-represented GO
(BP) terms and display a summary of the test results. Given the hierarchical structure of GO
terms, Falcon and Gentleman (2007) developed a conditional hypergeometric test that uses the
relationships among GO terms to decorrelate the results. For KEGG pathways, a non-conditional
hypergeometric test is performed.
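The unconditional test in step 3 can be sketched directly from the hypergeometric distribution defined above. The example is plain Python and does not use the GOstats package. The function names and the counts (a universe of 20 genes, a category of 5 genes, 4 DE genes of which 2 fall in the category) are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, N, M, n):
    """P(X = x): x category genes among n DE genes drawn from N genes,
    of which M belong to the category."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


def overrepresentation_p_value(x, N, M, n):
    """P(X >= x): sum the probabilities from x up to min(n, M)."""
    upper = min(n, M)
    return sum(hypergeom_pmf(k, N, M, n) for k in range(x, upper + 1))


# Hypothetical counts: universe of 20 genes, 5 in the GO category,
# 4 DE genes, 2 of which belong to the category.
N, M, n, x = 20, 5, 4, 2
p_value = overrepresentation_p_value(x, N, M, n)
```

A small p-value indicates that the category contains more DE genes than expected by chance. In this toy example the p-value is about 0.25, so the category would not be reported as over-represented.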
Chapter 3 - Analysis of Differential Expression
In this chapter, we aim to address the following research questions:
• Which genes are differentially expressed in each condition in the human data,
relative to healthy controls?
• Which genes are differentially expressed in each treatment group in the mouse
data involving the comparison of a high-fat diet to controls?
• Which genes are differentially expressed in each treatment group in the mouse
cell line study?
3.1 Data Pre-processing and Quality Assessment
The three data sets used in this research project are pre-existing. The two Affymetrix
mouse microarray data sets are high-density short oligonucleotide arrays and normalization was
done before we received the data. Agilent long oligonucleotide arrays were used for the
human cross-sectional study and a different normalization procedure is required. The aim of the
normalization process is to remove effects arising from the microarray technology and to ensure
that the distributions of the intensities across arrays are similar.
3.1.1 Affymetrix Mouse Microarrays
In this case these two mouse arrays were already normalized. Expression values were
first transformed to the log scale. Density plots and boxplots were produced for each of the
mouse data sets to check the normalization process. For the longitudinal mouse study involving
the comparison of a high-fat diet to control in two tissues, adipose and muscle, the distributions
of the log expression values for each type of tissue (see Figure 3-1) appeared to be quite similar
with some variation between tissues. Parallel boxplots generally provide a good summary of the
distributions of intensities across all arrays. It was clearly evident in the boxplots that the log
expression values of the fifteen arrays in each tissue group were similarly distributed (see
Figure 3-2). The medians of the log expression values from the muscle tissue group were
found to be lower than the medians from the adipose tissue group based on the boxplots. The
upper and lower quartiles showed some similar patterns with lower values in the muscle tissue
group (see Figure 3-2). This indicates that we should analyse muscle and adipose tissue groups
separately.
[Figure 3-1 plot: density curves of the log expression values; x-axis: log expression; y-axis: density; legend: Adipose, Muscle]
Figure 3-1: Density plot of the longitudinal mouse study involving the comparison of a high-fat
diet to control in two tissues.
[Figure 3-2 plot: parallel boxplots of the log expression values for the 15 adipose arrays (Achow, Ahi5, Ahi14, Ahi42) and the 15 muscle arrays (Mchow, Mhi5, Mhi14, Mhi42); legend: Adipose, Muscle]
Figure 3-2: Boxplots of the longitudinal mouse study involving the comparison of a high-fat diet
to control in two tissues.
Figure 3-3 shows the density plot of the mouse cell line study and the distributions of the
log expression values were very similar across all the arrays. Median values as well as the
upper and lower quartiles in the boxplots appeared to be almost the same across all the
treatment groups which indicated very similar spread (see Figure 3-4). No obvious outliers were
found in either of the mouse data sets.
[Figure 3-3 plot: density curves of the log expression values; x-axis: log expression; y-axis: density]
Figure 3-3: Density plot of the mouse cell line study of exposure to various substances including
insulin.
[Figure 3-4 plot: parallel boxplots of the log expression values for the 22 arrays in the groups 3T3-L1, Chronic Insulin, Chronic Insulin.MnTBP, TNF, TNF.MnTBP, Dexamethasone, Dexamethasone.MnTBP and Glucose Oxidase]
Figure 3-4: Boxplots of the mouse cell line study of exposure to various substances including
insulin.
3.1.2 Agilent Human Microarrays
The Agilent microarrays of the cross-sectional human study, which compares expression in
tissue samples for a control group of healthy patients and obese patients with 3 stages of the
disease (i.e., insulin sensitive, insulin resistant and diabetic), are two-colour arrays. They require
pre-processing, including background correction and normalization, to be
performed using the limma package. Two-colour microarrays use red (R) and green (G)
channels labelled with Cy5 and Cy3 dyes respectively. In this case the green channel was used
as a common reference. We believe the green channel contains a mixture of material, but since
it is a common reference it has no impact on the analysis. The pre-processing procedure aims
to remove any non-biological effects either between the two channels or between arrays (i.e.,
patients). According to the density plot and boxplots in Figure 3-5, there appeared to be quite a
lot of variation between most of the arrays on the red channel. The distributions of the $\log_2 R$
values (red intensities) were different within the control group and within each of the diseased
groups.
[Figure 3-5 plots: "Density Plot - Red Channel" (x-axis: log2 R; y-axis: Density; legend: Lean Control, Obese Insulin Resistant, Diabetic, Obese Insulin sensitive) and "Boxplots - Red Channel" (one box per array; y-axis: log2 R)]
Figure 3-5: Density plot and boxplots of the $\log_2 R$ values in the human data set.
In analysing the two-colour microarrays, the difference between the red (R) and green (G)
intensities as well as the average of the red and green intensities are of interest. A popular
way of comparing the red and green intensities is to use M and A values, which are defined as
follows:
M denotes the log fold change for each gene: $M = \log_2 R - \log_2 G = \log_2(R/G)$
A denotes the average log intensity for each gene: $A = \frac{1}{2}(\log_2 R + \log_2 G) = \frac{1}{2}\log_2(RG)$
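As a small worked example of these definitions, the sketch below computes M and A for three spots. The intensities are hypothetical and are assumed to be already background-corrected. The function name `ma_values` is illustrative only and does not correspond to a limma function.

```python
import math


def ma_values(red, green):
    """Return the M (log ratio) and A (average log intensity) values
    for paired red and green channel intensities."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


# Hypothetical intensities for three spots on one array.
red = [1024.0, 256.0, 512.0]
green = [256.0, 256.0, 2048.0]

m, a = ma_values(red, green)
# Spot 1: red is 4 times green, so M = 2; spot 2: equal channels, M = 0;
# spot 3: red is a quarter of green, so M = -2.
```

Because the green channel is the common reference in this study, an M value of zero means that the sample is expressed at the same level as the reference for that probe.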
Figure 3-6 shows the boxplots of the M values for each array after loess normalization
within the arrays. The distributions of the M values across all arrays were similar and the
median of the M value for each array appeared to be very close to M=0 (see Figure 3-6).
We can now move on to fitting linear models in order to identify differential gene
expression for each of the data sets.
[Figure 3-6 plot: boxplots of the M values (y-axis: M Values) for 28 arrays, seven each labelled Lean Control, Obese Insulin Resistant, Diabetic and Obese Insulin sensitive]
Figure 3-6: Boxplots of the M values in the human data set after loess normalization.
3.2 Analysis of Differential Expression
For each of the microarray data sets, gene-by-gene statistical tests were performed to
test whether expression differs significantly between each of the treatment groups
and the baseline control, i.e., to test for treatment or group effects.
3.2.1 Mouse Microarrays
The original data of the longitudinal mouse study involving a high-fat diet was divided
into two subsets based on the tissue type, adipose and muscle. The subsets were analysed
separately to detect differentially expressed (DE) genes in each case.
Table 3-1 shows the number of DE probes found for each of the contrasts in the
muscle tissue group when the false discovery rate (Benjamini & Hochberg 1995) is controlled
at thresholds of 0.01 and 0.1. For example, 1146 probes were differentially
expressed at the 0.01 threshold after 42 days of a high-fat diet compared to the control group,
which was on a standard diet.
Table 3-1: Differentially expressed genes found in the muscle tissue group
Contrast                        No. of DE probes using FDR < 0.01   No. of DE probes using FDR < 0.1
Muscle.5 days versus Control    3                                   3745
Muscle.14 days versus Control   20                                  810
Muscle.42 days versus Control   1146                                5621
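The counts in Table 3-1 depend on p-values adjusted to control the false discovery rate. The following sketch shows the Benjamini & Hochberg (1995) step-up adjustment in plain Python. The five raw p-values are hypothetical, and the function `bh_adjust` is an illustrative re-implementation, not the routine used to produce the table.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five probes.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = bh_adjust(raw)

n_de_strict = sum(a < 0.01 for a in adj)   # probes called DE at FDR < 0.01
n_de_lenient = sum(a < 0.1 for a in adj)   # probes called DE at FDR < 0.1
```

As in Table 3-1, relaxing the threshold from 0.01 to 0.1 can only increase the number of probes declared differentially expressed.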
The top 10 differentially expressed gene symbols and their corresponding $\log_2$ fold
change, average $\log_2$ expression levels, moderated t-statistics (Smyth 2004) and adjusted p
values for two selected comparisons are listed in the following tables. The two selected
comparisons are: treatment group after 14 days of a high-fat diet versus the control group and
after 42 days of a high-fat diet versus the control group.
Table 3-2: Top 10 DE genes between 42 days of a high-fat diet and control
Gene Symbol   logFC    AveExpr   t         FDR       Log odds
Hsdl2         0.7790   10.6069   12.2374   <0.0001   12.3506
Acaa2         1.1012   11.0703   11.5972   <0.0001   11.6576
Serinc1       1.2655   9.5465    9.9676    0.0002    9.6861
Acadl         0.5247   13.2915   9.4685    0.0002    9.0168
Stom          0.7368   8.1708    9.3793    0.0002    8.8937
Serinc3       1.0322   11.7445   9.3742    0.0002    8.8867
Adam10        0.6432   8.0022    9.3486    0.0002    8.8511
Il6st         0.6878   10.7077   8.8126    0.0005    8.0861
Cdc42         0.4965   12.2880   8.5096    0.0006    7.6355
Twsg1         0.8477   9.0585    8.4937    0.0006    7.6115
Table 3-3: Top 10 DE genes between 14 days of a high-fat diet and control
Gene Symbol   logFC   AveExpr   t       FDR     Log odds
Prg4          1.365   9.843     9.245   0.002   8.121
Thbs4         1.010   11.469    8.676   0.002   7.391
Lox           1.102   8.552     8.126   0.004   6.639
Rcn3          0.586   8.033     7.727   0.005   6.062
Aspn          1.239   8.125     7.544   0.005   5.791
Comp          0.874   9.068     7.431   0.005   5.620
Fmod          1.298   12.074    7.395   0.005   5.565
Thbs1         1.300   7.469     7.382   0.005   5.545
Tnmd          1.517   9.651     7.312   0.005   5.437
Lox           0.835   9.805     7.283   0.005   5.392
The differentially expressed genes found in the mouse data for the comparison between
42 days of a high-fat diet and a standard diet are of great interest, and we investigate
these DE genes and their associated GO terms or KEGG pathways further in Chapter 4. To
illustrate the results for the DE genes, a volcano plot was produced, which visualizes genes with
either statistical significance or a large effect size or both. The top 10 differentially
expressed gene symbols are highlighted in the volcano plot in Figure 3-7. Data points with
higher values along the y-axis represent genes that are highly significant, whereas points close to
the left- or right-hand side of the plot represent genes with greater fold changes in the down
and up directions respectively. The horizontal line in red is drawn to separate the differentially
expressed probes from the non-DE ones. Data points above this horizontal line represent the
1146 probes we found earlier. Data points lying outside the two vertical lines represent genes with
fold changes greater than 2 or smaller than 1/2. We can see in Figure 3-7 that the genes
Acaa2, Serinc1 and Serinc3 appeared to satisfy both criteria, i.e., being highly significant with
large fold changes.
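The two criteria can be checked directly against the values reported in Table 3-2. In the sketch below the logFC and FDR columns are copied from that table. The two FDR entries reported as "<0.0001" are entered as 0.0001 purely so that they can be compared numerically. The FDR cutoff of 0.01 is the one that gives the 1146 probes, and the function name is hypothetical.

```python
import math

# (logFC, FDR) for the top 10 probes, copied from Table 3-2. Entries
# reported as "<0.0001" are entered as 0.0001 for comparison purposes.
top10 = {
    "Hsdl2": (0.7790, 0.0001),
    "Acaa2": (1.1012, 0.0001),
    "Serinc1": (1.2655, 0.0002),
    "Acadl": (0.5247, 0.0002),
    "Stom": (0.7368, 0.0002),
    "Serinc3": (1.0322, 0.0002),
    "Adam10": (0.6432, 0.0002),
    "Il6st": (0.6878, 0.0005),
    "Cdc42": (0.4965, 0.0006),
    "Twsg1": (0.8477, 0.0006),
}


def meets_both_criteria(table, fdr_cutoff=0.01, fold_cutoff=2.0):
    """Genes that are significant AND change by more than fold_cutoff
    in either direction (|logFC| > log2(fold_cutoff))."""
    lfc_cutoff = math.log2(fold_cutoff)
    return [gene for gene, (lfc, fdr) in table.items()
            if fdr < fdr_cutoff and abs(lfc) > lfc_cutoff]


selected = meets_both_criteria(top10)
```

A fold change of 2 corresponds to a log fold change of 1, so only the three genes with logFC above 1 are selected, in agreement with the genes named above.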
Figure 3-7: Volcano plot of top DE genes between 42 days of a high-fat diet and the control in
the mouse muscle tissue group.
[Figure 3-7 plot: "Muscle Tissue - Hi Fat Diet 42 days vs Control"; x-axis: Log Fold Change; y-axis: Log Odds; labelled genes: Hsdl2, Acaa2, Serinc1, Acadl, Stom, Serinc3, Adam10, Il6st, Cdc42, Twsg1]
In the adipose tissue group, no differentially expressed genes were found in any of the
contrasts between three time points and the control after adjusting for multiple testing using a
threshold of 0.1 for the false discovery rate. Figure 3-8 shows the volcano plot of top ranked
non-DE genes between 42 days of a high-fat diet and the control in the mouse adipose tissue
group. We found that 48 non-differentially expressed probe sets turned out to have very large
fold changes, i.e., fold changes greater than 4 or less than 1/4. This is also indicated clearly in
the volcano plot (see Figure 3-8). Two vertical lines in green represent log fold changes of 2 and
[Figure 3-8 plot: "Adipose Tissue - Hi Fat Diet 42 days vs Control"; x-axis: Log Fold Change; y-axis: Log Odds; top ranked genes labelled]
Figure 3-8: Volcano plot of top ranked genes (non DE) between 42 days of a high-fat diet and
the control in the mouse adipose tissue group.
Table 3-4: Top 10 genes with large fold changes between 42 days of a high-fat diet and the
control in the mouse adipose tissue group
Gene Symbol   logFC   AveExpr   t       FDR     Log odds
Crisp1        7.764   9.995     3.377   0.372   -1.918
Akr1b7        6.952   10.961    3.649   0.372   -1.495
Serpina1f     5.882   9.391     2.904   0.403   -2.663
Ptgs2         5.685   8.889     3.053   0.390   -2.428
Spink8        5.399   10.497    2.713   0.415   -2.962
Defb42        4.675   9.850     2.626   0.423   -3.096
Ptgs2         3.899   7.844     2.679   0.420   -3.015
Ceacam10      3.688   8.044     1.855   0.524   -4.229
Pcp4          3.552   8.597     2.124   0.478   -3.851
Cldn8         3.279   7.464     2.332   0.452   -3.545
For the 3T3L1 mouse cell line study of exposure to various substances including insulin,
a linear model was fitted for each gene and the F-statistic was used to determine whether the gene
was differentially expressed in any of the contrasts between the treatment groups and the
control. Results for the top 10 probes are listed in Table 3-5.
Table 3-5: Top 10 differentially expressed probes in the mouse cell line data
Gene Symbol Chronic Insulin Chronic Insulin.MnTBP TNF TNF.MnTBP Dexamethasone Dexamethasone.MnTBP Glucose Oxidase AveExpr F FDR
Cpm -0.104 0.037 0.019 0.162 3.746 3.814 0.229 4.583 1588.499 <0.0001
Ptgds -0.194 -0.070 0.285 0.221 3.594 3.614 0.024 5.788 1521.455 <0.0001
Fam107a -0.175 -0.102 0.009 0.088 3.369 3.307 0.070 5.648 1037.686 <0.0001
Fkbp5 -0.058 -0.084 0.152 0.134 1.892 1.969 0.024 7.657 853.471 <0.0001
Fam107a -0.226 -0.064 -0.141 -0.004 2.994 3.061 -0.133 4.874 743.152 <0.0001
Clca1 -0.183 -0.361 1.115 0.653 2.026 1.823 0.050 5.670 704.902 <0.0001
Ptgds -0.141 -0.054 0.187 0.167 2.365 2.415 -0.068 7.109 671.262 <0.0001
Fam107a -0.377 -0.226 -0.060 0.041 3.972 3.803 0.004 5.720 638.933 <0.0001
Aldh1a1 -0.134 -0.169 1.055 0.679 3.068 2.638 -0.013 5.545 608.139 <0.0001
Chi3l1 -0.270 -0.145 2.047 1.836 0.233 0.248 0.024 5.861 567.040 <0.0001
3.2.2 Human Microarrays
For the cross-sectional human study, which compares expression in tissue samples from a
control group of healthy patients, obese insulin sensitive patients, and patients with insulin
resistance or type 2 diabetes, no contrast yielded more than 12 DE genes when a false
discovery rate of 0.1 was used to adjust for multiple comparisons (see Table 3-6).
Table 3-6: Differentially expressed genes found in the human data
Contrast No. of DE genes using FDR < 0.1
Obese Insulin Resistant versus Lean Control 0
Obese Diabetic versus Lean Control 12
Obese Insulin Sensitive versus Lean Control 2
Lists of the top ranked genes from two selected contrasts are shown in Tables 3-7 and
3-8. The two selected contrasts are patients with insulin resistance versus the lean control
group and patients with diabetes versus the lean control group. Boxplots for the top
ranked genes were constructed to show the distributions of the expression levels in the lean
control and each of the diseased groups (see Figures 3-9 and 3-10).
Table 3-7: Top 10 ranked genes for the contrast of insulin resistant versus lean control in the
human data
Gene Symbol logFC t P-Value FDR Log odds
HEXIM1 0.735 5.816 <0.001 0.174 3.050
A_32_P8627 -0.434 -5.476 <0.001 0.174 2.439
CCDC36 -1.309 -5.393 <0.001 0.174 2.286
MASK-BP3 -0.647 -5.132 <0.001 0.174 1.803
EIF4EBP1 -0.644 -5.109 <0.001 0.174 1.761
ENST00000342158 0.397 5.058 <0.001 0.174 1.664
DIP2C -0.486 -5.000 <0.001 0.174 1.557
WIRE -0.421 -4.981 <0.001 0.174 1.521
FXYD4 -0.867 -4.966 <0.001 0.174 1.493
C19orf36 -1.208 -4.943 <0.001 0.174 1.449
[Boxplots of M values across arrays for genes HEXIM1 and CCDC36, by group: Lean Control, Obese Insulin Sensitive, Obese Insulin Resistant, Diabetic.]
Figure 3-9: Boxplots of the top ranked genes in the contrast of insulin resistant versus lean
control.
Table 3-8: Top 10 ranked genes for the contrast of diabetic versus lean control in the human
data
Gene Symbol logFC t P-Value FDR Log odds
LUZP5 0.413 5.817 <0.001 0.064 3.596
FXYD4 -1.014 -5.807 <0.001 0.064 3.576
LHPP 0.679 5.712 <0.001 0.064 3.387
DIP2C -0.548 -5.640 <0.001 0.064 3.243
PDZD2 0.834 5.543 <0.001 0.064 3.049
DBNDD1 0.738 5.524 <0.001 0.064 3.010
PON2 0.581 5.398 <0.001 0.068 2.754
TOMM40 -0.418 -5.319 <0.001 0.068 2.594
THC2321224 -0.716 -5.299 <0.001 0.068 2.553
A_32_P23096 1.406 5.293 <0.001 0.068 2.539
[Boxplots of M values across arrays for genes LUZP5 and FXYD4, by group: Lean Control, Obese Insulin Sensitive, Obese Insulin Resistant, Diabetic.]
Figure 3-10: Boxplots of the top ranked genes in the contrast of diabetic versus lean control.
The James Lab is interested in a particular gene, SLC2A4 (also known as GLUT4), which
is a member of the solute carrier family 2 (facilitated glucose transporter) and encodes a protein
that functions as an insulin-regulated facilitative glucose transporter. In our analysis of
differential expression for the mouse muscle tissue group, SLC2A4 was found to be one of the
top DE genes. We mapped the gene symbol to its corresponding probes in the human genome
and constructed boxplots as shown in Figure 3-11.
[Boxplots of M values for probes A_23_P107350 and A_32_P151263, by group: Lean Control, Obese Insulin Sensitive, Obese Insulin Resistant, Diabetic.]
Figure 3-11: Boxplots for 2 probes corresponding to gene SLC2A4 in the human data.
For each of the microarray data sets, gene-by-gene statistical tests were performed to
test whether expression differed significantly between each of the treatment groups
and the baseline control, i.e., whether there was a treatment or group effect.
Chapter 4 - Pathway Analysis
In this chapter, we present and interpret the results from the gene set tests and the over-
representation analysis focusing on integrating two data sets: the longitudinal mouse study
involving the comparison of a high-fat diet to the control in two tissues and the cross-sectional
human study comparing expression in tissue samples for a control group of healthy patients,
obese insulin sensitive patients and obese patients with two stages of the disease, i.e., patients
with insulin resistance and type 2 diabetes. We intend to provide an insight into the following
research questions:
• Are those significant gene sets (GO terms or KEGG pathways) found in the mouse
data involving a high-fat diet differentially expressed in obese patients with insulin
resistance in the human data relative to healthy controls?
• Are those significant gene sets (GO terms or KEGG pathways) found in the mouse
data involving a high-fat diet differentially expressed in obese patients with type 2
diabetes in the human data relative to healthy controls?
• Are those significant gene sets (GO terms or KEGG pathways) found in the mouse
data involving a high-fat diet differentially expressed in obese insulin sensitive
patients in the human data relative to healthy controls?
• Are there any GO terms or KEGG pathways that are over-represented in the list of
top ranked genes in the mouse data with a high-fat diet?
• Are there any GO terms or KEGG pathways that are over-represented in the list of
top ranked genes in obese patients with insulin resistance relative to healthy
controls?
• Are there any GO terms or KEGG pathways that are over-represented in the list of
top ranked genes in obese patients with type 2 diabetes relative to healthy controls?
• Are there any GO terms or KEGG pathways that are over-represented in the list of
top ranked genes in obese insulin sensitive patients relative to healthy controls?
• Are insulin/glucose related GO terms differentially expressed in each condition in the
human data, relative to healthy controls?
• Is a set of FOXO genes differentially expressed?
4.1 Gene Set Tests
In the process of functional annotation, a list of differentially expressed (DE) genes is
used to explore more about the underlying biological processes associated with these DE
genes. In this chapter all FDRs reported in the tables are adjusted p-values. In the longitudinal
mouse study, we chose the differentially expressed genes detected from the comparison
between 42 days of a high-fat diet and the control in the muscle tissue group because no DE
genes were detected in any of the contrasts in the adipose tissue group. Due to the large
number of DE probes found in this contrast (i.e., 5621 DE probes using FDR<0.1), the threshold
used for controlling the false discovery rate (Benjamini & Hochberg 1995) is set at 0.01, which
indicates the expected proportion of false positive results (i.e., incorrectly rejected null
hypotheses) among all the rejected null hypotheses is controlled at no more than 1%. This
stricter threshold focuses our attention on the most strongly DE probes. The resulting 1146 DE
probes were mapped to their corresponding Gene Ontology (GO) terms as well as KEGG
pathways. The resulting GO terms are associated with the mouse genome, so some of them
may not be included in the human genome. Only GO terms that are linked to the human genome
were retained for further gene set tests. There is no such issue with KEGG pathways.
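The effect of tightening the FDR threshold from 0.1 to 0.01 can be illustrated with a minimal sketch of the Benjamini and Hochberg (1995) adjustment. This is a plain Python stand-in written for illustration, not the code used in the analysis. The function names and the p-values are hypothetical.

```python
# Illustrative sketch of Benjamini-Hochberg adjusted p-values.
# The p-values below are made up for demonstration only.

def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(p_values, fdr):
    """Number of tests with an adjusted p-value below the FDR threshold."""
    return sum(q < fdr for q in bh_adjust(p_values))


p = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
     0.0278, 0.0298, 0.0344, 0.0459, 0.3240]
print(count_significant(p, fdr=0.1), count_significant(p, fdr=0.01))
```

With these hypothetical values, far fewer tests pass the stricter threshold, which mirrors the reduction from 5621 to 1146 DE probes described above.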
All probes on the Agilent chip are mapped to a total number of 12336 GO terms and 229 KEGG
pathways. It was found that 3171 GO terms were associated with the selected 1146 probes and
3082 GO terms were linked to the
4.1.1 Competitive Gene Set Test
We first look at one of the competitive gene set tests - the mean-rank gene set test in the
limma package. The hypothesis tested in this case is whether the selected set of genes tends to
be more highly ranked compared to randomly selected genes that are not in the selected gene
set in terms of the moderated t-statistic (Smyth 2004). We performed a gene set test for each of
the comparisons in the human data set. Because there is a large number of pre-defined
gene sets, many gene set tests had to be carried out. Throughout Section 4.1.1, we therefore
controlled the false discovery rate (FDR) at level 0.05 to correct for multiple testing. The results
of the gene set test in each of the three contrasts in the human data are as follows.
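The idea behind a mean-rank test can be sketched with a Wilcoxon rank-sum statistic computed on the gene-level statistics. The code below is a stand-alone illustration in Python, not the limma implementation. It uses a normal approximation without tie or continuity corrections, tests the "up" direction only, and the statistics in the example are hypothetical. A mixed-direction version would rank the absolute values of the statistics instead.

```python
import math

# Illustrative sketch of a one-sided mean-rank (Wilcoxon rank-sum) gene set test.
# Function names and example data are hypothetical.

def average_ranks(values):
    """Rank values from 1 (smallest) to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mean_rank_test_up(stats, in_set):
    """Approximate p-value that genes in the set rank higher than the rest."""
    n = len(stats)
    n1 = sum(in_set)
    n2 = n - n1
    ranks = average_ranks(stats)
    rank_sum = sum(r for r, member in zip(ranks, in_set) if member)
    u = rank_sum - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability


# Hypothetical moderated t-statistics; the first four genes form the gene set
t_stats = [3.1, 2.8, 2.5, 2.2, 0.4, -0.1, 0.3, -0.6, 0.9, -1.2, 0.1, -0.8]
members = [True] * 4 + [False] * 8
print(mean_rank_test_up(t_stats, members))
```

Because the test compares the genes in the set with the genes outside it, a small p-value says that the set is unusually highly ranked relative to the rest of the array, not that every gene in the set is individually DE.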
4.1.1.1 Gene Ontology (GO) Terms
Insulin Resistant Versus Lean Control
The GO identifier, name of the GO term and the number of probes associated with that
GO term on the human arrays (i.e., Agilent chip hgug4112a) were reported. Gene sets in 420
GO categories were found to be significantly up-regulated with 42 of them containing more than
500 probes. We excluded these 42 GO terms as they are considered to be too general. A list of
the top 30 up-regulated gene sets containing no more than 500 probes is shown in Table 4-1.
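The filtering step described above (keep significant gene sets, drop those with more than 500 probes, then list the top 30) can be sketched as follows. The record type, the function name and the example entries are hypothetical and serve only to show the order of the operations.

```python
from typing import NamedTuple

# Illustrative sketch: remove overly general gene sets and list the top-ranked remainder.
# All identifiers and values below are hypothetical.

class GeneSetResult(NamedTuple):
    set_id: str
    fdr: float
    n_probes: int


def top_gene_sets(results, max_probes=500, fdr_cutoff=0.05, top_n=30):
    """Keep significant sets with at most max_probes probes, ordered by FDR."""
    kept = [r for r in results if r.fdr < fdr_cutoff and r.n_probes <= max_probes]
    return sorted(kept, key=lambda r: r.fdr)[:top_n]


results = [
    GeneSetResult("SET_A", 0.0002, 281),
    GeneSetResult("SET_B", 0.0001, 1292),  # significant but too general
    GeneSetResult("SET_C", 0.0300, 109),
    GeneSetResult("SET_D", 0.2000, 45),    # not significant
]
print([r.set_id for r in top_gene_sets(results)])
```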
Table 4-1: Top 30 significantly up-regulated GO terms in the contrast between insulin resistant
patients and lean control
GO ID Up-regulated FDR Mixed-regulated FDR GO Term No. of Probes
GO:0006974 <0.0001 0.3219 response to DNA damage stimulus 281
GO:0006511 <0.0001 0.3124 ubiquitin-dependent protein catabolic process 265
GO:0000502 <0.0001 0.0110 proteasome complex 109
GO:0030521 <0.0001 0.0225 androgen receptor signaling pathway 109
GO:0005813 <0.0001 1 Centrosome 403
GO:0050681 <0.0001 0.2065 androgen receptor binding 102
GO:0048538 <0.0001 0.0001 thymus development 82
GO:0060736 <0.0001 <0.0001 prostate gland growth 32
GO:0034747 <0.0001 0.0459 Axin-APC-beta-catenin-GSK3B complex 39
GO:0031274 <0.0001 0.0028 positive regulation of pseudopodium assembly 39
GO:0032839 <0.0001 <0.0001 dendrite cytoplasm 32
GO:0000776 <0.0001 0.3902 Kinetochore 115
GO:0051082 <0.0001 0.7611 unfolded protein binding 244
GO:0043234 <0.0001 1 protein complex 386
GO:0006457 <0.0001 0.7800 protein folding 330
GO:0004842 <0.0001 0.8928 ubiquitin-protein ligase activity 388
GO:0043130 <0.0001 0.0281 ubiquitin binding 48
GO:0000785 <0.0001 0.0956 chromatin 206
GO:0016605 <0.0001 0.5400 PML body 138
GO:0016607 <0.0001 1 nuclear speck 244
GO:0060070 <0.0001 0.1784 canonical Wnt receptor signaling pathway 134
GO:0016567 <0.0001 0.6667 protein ubiquitination 286
GO:0002902 <0.0001 <0.0001 regulation of B cell apoptosis 13
GO:0051717 <0.0001 <0.0001 inositol-1,3,4,5-tetrakisphosphate 3-phosphatase activity 13
GO:0051800 <0.0001 <0.0001 phosphatidylinositol-3,4-bisphosphate 3-phosphatase activity 13
GO:0008219 <0.0001 0.0191 cell death 302
GO:0004402 <0.0001 0.3943 histone acetyltransferase activity 74
GO:0034742 <0.0001 0.0373 APC-Axin-1-beta-catenin complex 24
GO:0006917 <0.0001 1 induction of apoptosis 473
GO:0008234 <0.0001 0.8217 cysteine-type peptidase activity 152
Gene sets in 121 GO categories were found to be significantly down-regulated with 6 of
them containing more than 500 probes. The top 30 significant gene sets containing no more than 500 probes are shown in Table 4-2.
Table 4-2: Top 30 significantly down-regulated GO terms in the contrast between insulin
resistant patients and lean control
GO ID Down-regulated FDR Mixed-regulated FDR GO Term No. of Probes
GO:0006415 <0.0001 1 translational termination 233
GO:0003735 <0.0001 1 structural constituent of ribosome 322
GO:0006414 <0.0001 1 translational elongation 254
GO:0005840 <0.0001 1 ribosome 304
GO:0022627 <0.0001 1 cytosolic small ribosomal subunit 81
GO:0045725 <0.0001 0.1032 positive regulation of glycogen biosynthetic process 43
GO:0050896 <0.0001 1 response to stimulus 404
GO:0005759 <0.0001 0.2187 mitochondrial matrix 350
GO:0060754 <0.0001 0.0005 positive regulation of mast cell chemotaxis 17
GO:0005172 <0.0001 0.0010 vascular endothelial growth factor receptor binding 13
GO:0008083 <0.0001 0.7831 growth factor activity 360
GO:0010595 <0.0001 0.8292 positive regulation of endothelial cell migration 79
GO:0003707 <0.0001 0.2289 steroid hormone receptor activity 115
GO:0050927 <0.0001 0.0197 positive regulation of positive chemotaxis 23
GO:0048598 <0.0001 0.0208 embryonic morphogenesis 15
GO:0005743 <0.0001 0.7514 mitochondrial inner membrane 432
GO:0005499 <0.0001 0.0043 vitamin D binding 15
GO:0070644 0.0002 0.0438 vitamin D response element binding 15
GO:0018119 0.0002 0.0029 peptidyl-cysteine S-nitrosylation 13
GO:0007281 0.0003 0.0261 germ cell development 56
GO:0004517 0.0003 0.5909 nitric-oxide synthase activity 23
GO:0050731 0.0003 0.5118 positive regulation of peptidyl-tyrosine phosphorylation 194
GO:0060068 0.0003 0.0681 vagina development 16
GO:0046898 0.0004 0.0088 response to cycloheximide 13
GO:0006809 0.0004 0.7901 nitric oxide biosynthetic process 40
GO:0051450 0.0006 0.0337 myoblast proliferation 14
GO:0045840 0.0006 0.6417 positive regulation of mitosis 83
GO:0017134 0.0007 0.0135 fibroblast growth factor binding 28
GO:0048009 0.0007 0.7543 insulin-like growth factor receptor signaling pathway 40
GO:0030976 0.0007 0.0086 thiamine pyrophosphate binding 8
Gene sets in 128 GO categories were differentially expressed regardless of the direction
and only one of them contains more than 500 probes (i.e., GO:0055114 oxidation-reduction
process). The top 30 significant gene sets containing no more than 500 probes are shown in Table 4-3.
Table 4-3: Top 30 significantly mixed-regulated GO terms in the contrast between insulin
resistant patients and lean control
GO ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR GO Term No. of Probes
GO:0046716 1 0.0005 <0.0001 muscle cell homeostasis 45
GO:0032839 1 <0.0001 <0.0001 dendrite cytoplasm 32
GO:0060736 1 <0.0001 <0.0001 prostate gland growth 32
GO:0006096 1 0.0289 <0.0001 Glycolysis 83
GO:0070102 1 0.5782 <0.0001 interleukin-6-mediated signaling pathway 26
GO:0033032 1 0.0002 <0.0001 regulation of myeloid cell apoptosis 15
GO:0002902 1 <0.0001 <0.0001 regulation of B cell apoptosis 13
GO:0051717 1 <0.0001 <0.0001 inositol-1,3,4,5-tetrakisphosphate 3-phosphatase activity 13
GO:0051800 1 <0.0001 <0.0001 phosphatidylinositol-3,4-bisphosphate 3-phosphatase activity 13
GO:0060087 1 <0.0001 <0.0001 relaxation of vascular smooth muscle 13
GO:0008289 1 0.1756 <0.0001 lipid binding 247
GO:0006749 1 0.6982 0.0001 glutathione metabolic process 52
GO:0060088 1 0.0037 0.0001 auditory receptor cell stereocilium organization 16
GO:0048538 1 <0.0001 0.0001 thymus development 82
GO:0016314 1 <0.0001 0.0001 phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase activity 16
GO:0060292 1 0.0001 0.0001 long term synaptic depression 19
GO:0019226 1 0.0015 0.0002 transmission of nerve impulse 22
GO:0004438 1 <0.0001 0.0002 phosphatidylinositol-3-phosphatase activity 15
GO:0070830 1 <0.0001 0.0003 tight junction assembly 57
GO:0060742 1 0.0006 0.0004 epithelial cell differentiation involved in prostate gland development 14
GO:0031253 1 0.0004 0.0004 cell projection membrane 17
GO:0006094 1 0.0310 0.0005 Gluconeogenesis 89
GO:0060754 <0.0001 1 0.0005 positive regulation of mast cell chemotaxis 17
GO:0001659 1 0.3797 0.0006 temperature homeostasis 29
GO:0050930 0.2750 1 0.0007 induction of positive chemotaxis 36
GO:0042056 0.7766 1 0.0009 chemoattractant activity 46
GO:0060716 1 0.2502 0.0010 labyrinthine layer blood vessel development 40
GO:0005172 <0.0001 1 0.0010 vascular endothelial growth factor receptor binding 13
GO:0044262 1 0.9387 0.0010 cellular carbohydrate metabolic process 15
GO:0000302 1 0.0402 0.0013 response to reactive oxygen species 37
Diabetic versus Lean Control
Gene set tests were carried out in the contrast between patients with diabetes and the
control group, and we found that 488 GO terms appeared to be significantly up-regulated with
60 of them having a size of over 500 probes. The top 30 GO terms (No. of probes
Table 4-5: Top 30 significantly down-regulated GO terms in the contrast between diabetic
patients and lean control
GO ID Down-regulated FDR Mixed-regulated FDR GO Term No. of Probes
GO:0006415 <0.0001 0.0031 translational termination 233
GO:0006414 <0.0001 0.0012 translational elongation 255
GO:0006413 <0.0001 0.1148 translational initiation 321
GO:0003735 <0.0001 0.0199 structural constituent of ribosome 327
GO:0005840 <0.0001 0.0363 Ribosome 381
GO:0000184 <0.0001 0.0672 nuclear-transcribed mRNA catabolic process, nonsense-mediated decay 294
GO:0022627 <0.0001 0.0018 cytosolic small ribosomal subunit 81
GO:0010595 <0.0001 1 positive regulation of endothelial cell migration 109
GO:0044429 <0.0001 0.1846 mitochondrial part 30
GO:0001974 <0.0001 1 blood vessel remodeling 77
GO:0006364 <0.0001 1 rRNA processing 154
GO:0060754 <0.0001 <0.0001 positive regulation of mast cell chemotaxis 18
GO:0005172 <0.0001 0.0003 vascular endothelial growth factor receptor binding 14
GO:0004517 <0.0001 0.3393 nitric-oxide synthase activity 23
GO:0050930 <0.0001 0.0098 induction of positive chemotaxis 26
GO:0010181 <0.0001 0.2205 FMN binding 36
GO:0045725 <0.0001 1 positive regulation of glycogen biosynthetic process 48
GO:0070644 <0.0001 0.0022 vitamin D response element binding 15
GO:0006809 0.0002 0.8311 nitric oxide biosynthetic process 45
GO:0040015 0.0002 0.0775 negative regulation of multicellular organism growth 21
GO:0042274 0.0002 0.4221 ribosomal small subunit biogenesis 20
GO:0043526 0.0006 0.2669 neuroprotection 49
GO:0040007 0.0007 0.0899 growth 66
GO:0005896 0.0007 0.0070 interleukin-6 receptor complex 13
GO:0015288 0.0007 0.0388 porin activity 12
GO:0050896 0.0007 1 response to stimulus 351
GO:0046898 0.0008 0.1991 response to cycloheximide 14
GO:0008083 0.0009 1 growth factor activity 361
GO:0009409 0.0012 0.2097 response to cold 111
GO:0031017 0.0013 1 exocrine pancreas development 34
A total of 214 GO terms with a size of no more than 500 probes are significantly up-regulated in
both contrasts, i.e., insulin resistant patients versus the control and diabetic patients versus the
control.
Gene sets defined by 120 GO terms were significantly down-regulated with 9 of
them containing more than 500 probes. Genes in 182 GO terms appeared to be differentially
expressed regardless of the direction with 11 of them having a size of over 500 probes. The top
30 GO terms (No. of probes
A total of 51 GO terms with a size of no more than 500 probes are significantly down-regulated in
both contrasts, i.e., insulin resistant patients versus the control and diabetic patients versus the
control.
We found that 38 GO terms (with a size of no more than 500 probes) are significantly
mixed-regulated in both contrasts, i.e., insulin resistant patients versus the control and diabetic
patients versus the control.
Insulin Sensitive versus Lean Control
Gene sets in 428 GO terms were significantly up-regulated, with 69 of them having a size
of more than 500 probes. A further 97 GO terms were significantly down-regulated, with 5 of them
containing more than 500 probes, and 192 GO terms were significantly differentially expressed
regardless of the direction, with 15 of them containing more than 500 probes. The top 10 up-, down- and mixed-regulated GO
terms (No. of probes
Table 4-8: Top 10 significantly down-regulated GO terms in the contrast between insulin
sensitive patients and lean control
GO ID Down-regulated FDR Mixed-regulated FDR GO Term No. of Probes
GO:0006415 <0.0001 1 translational termination 233
GO:0003735 <0.0001 1 structural constituent of ribosome 327
GO:0005840 <0.0001 1 ribosome 381
GO:0006413 <0.0001 1 translational initiation 321
GO:0006414 <0.0001 1 translational elongation 255
GO:0000184 <0.0001 1 nuclear-transcribed mRNA catabolic process, nonsense-mediated decay 294
GO:0001974 <0.0001 0.1374 blood vessel remodeling 77
GO:0004517 <0.0001 0.0010 nitric-oxide synthase activity 23
GO:0022627 <0.0001 1 cytosolic small ribosomal subunit 81
GO:0005743 <0.0001 0.9517 mitochondrial inner membrane 478
Table 4-9: Top 10 significantly mixed-regulated GO terms in the contrast between insulin
sensitive patients and lean control
GO ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR GO Term No. of Probes
GO:0032839 1 <0.0001 <0.0001 dendrite cytoplasm 34
GO:0034097 0.9246 1 <0.0001 response to cytokine stimulus 228
GO:0070301 0.8613 1 <0.0001 cellular response to hydrogen peroxide 70
GO:0070102 0.8918 1 <0.0001 interleukin-6-mediated signaling pathway 27
GO:0046965 0.8167 1 <0.0001 retinoid X receptor binding 37
GO:0034088 1 <0.0001 <0.0001 maintenance of mitotic sister chromatid cohesion 26
GO:0033160 0.0153 1 <0.0001 positive regulation of protein import into nucleus, translocation 62
GO:0043330 1 <0.0001 <0.0001 response to exogenous dsRNA 42
GO:0046716 1 0.0392 <0.0001 muscle cell homeostasis 74
GO:0031000 1 0.2905 <0.0001 response to caffeine 40
4.1.1.2 KEGG Pathways
Insulin Resistant versus Lean Control
The KEGG pathway identifier, the name of the KEGG pathway and the number of probes
associated with that KEGG pathway on the human arrays (i.e., Agilent chip hgug4112a) are
reported. Gene sets in 55 KEGG pathways were significantly up-regulated
using the criterion of FDR<0.05, with 5 of them containing more than 500 probes. The results
are shown in Table 4-10.
Table 4-10: 55 significantly up-regulated KEGG pathways in the contrast between insulin
resistant and lean control
KEGG ID Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
5210 <0.0001 0.4686 Colorectal cancer 316
4310 <0.0001 0.9155 Wnt signaling pathway 421
4141 <0.0001 0.6048 Protein processing in endoplasmic reticulum 327
4722 <0.0001 1 Neurotrophin signaling pathway 461
4520 <0.0001 1 Adherens junction 329
5213 <0.0001 0.2997 Endometrial cancer 256
5215 <0.0001 0.6886 Prostate cancer 433
4144 <0.0001 1 Endocytosis 554
5160 <0.0001 0.0883 Hepatitis C 356
4062 <0.0001 1 Chemokine signaling pathway 503
5200 <0.0001 1 Pathways in cancer 1292
4916 <0.0001 0.1744 Melanogenesis 273
4510 <0.0001 1 Focal adhesion 710
4810 <0.0001 0.3344 Regulation of actin cytoskeleton 536
4320 <0.0001 0.5015 Dorso-ventral axis formation 85
3013 <0.0001 1 RNA transport 284
4530 <0.0001 0.0445 Tight junction 314
5212 <0.0001 0.6623 Pancreatic cancer 342
4110 <0.0001 1 Cell cycle 424
4720 <0.0001 1 Long-term potentiation 197
4120 <0.0001 0.3038 Ubiquitin mediated proteolysis 248
3050 0.0001 0.0856 Proteasome 71
4360 0.0001 1 Axon guidance 337
4350 0.0001 1 TGF-beta signaling pathway 275
4010 0.0003 1 MAPK signaling pathway 694
5217 0.0004 0.4642 Basal cell carcinoma 128
5214 0.0006 0.4562 Glioma 299
5211 0.0007 1 Renal cell carcinoma 283
5221 0.0010 0.5826 Acute myeloid leukemia 224
5020 0.0011 0.0228 Prion diseases 138
5100 0.0017 1 Bacterial invasion of epithelial cells 237
5223 0.0017 1 Non-small cell lung cancer 237
4070 0.0022 0.4788 Phosphatidylinositol signaling system 153
4114 0.0024 0.7178 Oocyte meiosis 254
3015 0.0025 1 mRNA surveillance pathway 143
4210 0.0049 1 Apoptosis 299
10 0.0054 0.0004 Glycolysis / Gluconeogenesis 113
4540 0.0060 1 Gap junction 224
4910 0.0077 0.5290 Insulin signaling pathway 321
4666 0.0082 1 Fc gamma R-mediated phagocytosis 251
4130 0.0116 0.7942 SNARE interactions in vesicular transport 61
4974 0.0157 0.1744 Protein digestion and absorption 149
4330 0.0160 1 Notch signaling pathway 121
740 0.0162 0.9853 Riboflavin metabolism 18
5220 0.0228 1 Chronic myeloid leukemia 377
4660 0.0282 1 T cell receptor signaling pathway 384
562 0.0283 0.5143 Inositol phosphate metabolism 103
4670 0.0341 1 Leukocyte transendothelial migration 360
5216 0.0354 1 Thyroid cancer 168
4012 0.0371 1 ErbB signaling pathway 326
5218 0.0371 0.0907 Melanoma 315
5014 0.0402 <0.0001 Amyotrophic lateral sclerosis (ALS) 173
4970 0.0405 1 Salivary secretion 148
4962 0.0440 0.3650 Vasopressin-regulated water reabsorption 78
4662 0.0470 1 B cell receptor signaling pathway 232
Gene sets associated with 19 KEGG pathways were found to be significantly down-
regulated with one pathway containing over 500 probes. The results are given in Table 4-11.
Table 4-11: 19 significantly down-regulated KEGG pathways in the contrast between insulin
resistant and lean control
KEGG ID Down-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
3010 <0.0001 1 Ribosome 232
190 <0.0001 0.4686 Oxidative phosphorylation 173
983 <0.0001 0.1528 Drug metabolism - other enzymes 83
5012 0.0003 0.3955 Parkinson's disease 194
1100 0.0004 0.0174 Metabolic pathways 1804
330 0.0005 0.1528 Arginine and proline metabolism 112
860 0.0007 1 Porphyrin and chlorophyll metabolism 81
20 0.0010 0.7942 Citrate cycle (TCA cycle) 52
500 0.0017 0.0856 Starch and sucrose metabolism 65
4080 0.0021 1 Neuroactive ligand-receptor interaction 401
4260 0.0035 1 Cardiac muscle contraction 136
280 0.0043 1 Valine, leucine and isoleucine degradation 70
5323 0.0061 1 Rheumatoid arthritis 292
480 0.0084 0.1533 Glutathione metabolism 93
4640 0.0096 1 Hematopoietic cell lineage 220
5016 0.0105 0.3735 Huntington's disease 382
4610 0.0224 1 Complement and coagulation cascades 175
5410 0.0352 1 Hypertrophic cardiomyopathy (HCM) 238
4740 0.0418 1 Olfactory transduction 163
Gene sets in 9 KEGG pathways (see Table 4-12) were differentially expressed
regardless of the direction, with one pathway containing more than 500 probes. Among these 9
pathways, 4 were neither significantly down-regulated nor significantly up-regulated, but only mixed-regulated.
Table 4-12: 9 significantly mixed-regulated KEGG pathways in the contrast between insulin
resistant and lean control
KEGG ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
5014 1 0.0402 <0.0001 Amyotrophic lateral sclerosis (ALS) 173
360 1 0.7942 <0.0001 Phenylalanine metabolism 35
10 1 0.0054 0.0004 Glycolysis / Gluconeogenesis 113
1100 0.0004 1 0.0174 Metabolic pathways 1804
5020 1 0.0011 0.0228 Prion diseases 138
4146 1 1 0.0352 Peroxisome 127
620 0.7594 1 0.0408 Pyruvate metabolism 68
3022 1 0.2583 0.0408 Basal transcription factors 58
4530 1 <0.0001 0.0445 Tight junction 314
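The count of pathways that are mixed-regulated only can be checked directly from the FDR columns of Table 4-12. The sketch below is illustrative: the function name is hypothetical, the 0.05 cutoff follows the FDR level used in this section, and entries reported as "<0.0001" are encoded as 0.0001, which is an upper bound and not the exact value.

```python
# Rows of Table 4-12: (KEGG ID, down-regulated FDR, up-regulated FDR, mixed-regulated FDR).
# Assumption: "<0.0001" is encoded as 0.0001 (an upper bound).
TABLE_4_12 = [
    ("5014", 1.0, 0.0402, 0.0001),
    ("360", 1.0, 0.7942, 0.0001),
    ("10", 1.0, 0.0054, 0.0004),
    ("1100", 0.0004, 1.0, 0.0174),
    ("5020", 1.0, 0.0011, 0.0228),
    ("4146", 1.0, 1.0, 0.0352),
    ("620", 0.7594, 1.0, 0.0408),
    ("3022", 1.0, 0.2583, 0.0408),
    ("4530", 1.0, 0.0001, 0.0445),
]


def mixed_only(rows, cutoff=0.05):
    """KEGG IDs significant in the mixed test but in neither directional test."""
    return [kegg_id for kegg_id, down, up, mixed in rows
            if mixed < cutoff and down >= cutoff and up >= cutoff]


print(mixed_only(TABLE_4_12))
```

The pathways returned are Phenylalanine metabolism (360), Peroxisome (4146), Pyruvate metabolism (620) and Basal transcription factors (3022).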
Diabetic versus Lean Control
Gene sets defined by 56 KEGG pathways were significantly up-regulated with 6 of them
having a size of over 500 probes. The full list is shown in Table 4-13. We compared these
KEGG identifiers to those up-regulated pathways in the contrast of insulin resistant versus lean
control and found that 35 KEGG pathways were significantly up-regulated in both contrasts (see
Table 4-14).
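The comparison of KEGG identifiers between two contrasts amounts to a set intersection. The sketch below is illustrative only: it uses a small excerpt of identifiers from Tables 4-10 and 4-13 instead of the full lists, and the variable and function names are hypothetical.

```python
# Illustrative sketch: pathways up-regulated in both contrasts via set operations.
# The KEGG IDs are an excerpt from Tables 4-10 and 4-13, not the complete lists.

up_insulin_resistant = {5210, 4310, 4141, 4722, 4520, 4144}
up_diabetic = {4650, 4144, 5416, 5210, 4520}


def shared_pathways(first, second):
    """Sorted KEGG IDs that are significant in both contrasts."""
    return sorted(set(first) & set(second))


shared = shared_pathways(up_insulin_resistant, up_diabetic)
only_diabetic = sorted(up_diabetic - up_insulin_resistant)
print(shared, only_diabetic)
```

In this excerpt, Endocytosis (4144), Adherens junction (4520) and Colorectal cancer (5210) are shared, whereas Natural killer cell mediated cytotoxicity (4650) and Viral myocarditis (5416) appear only in the diabetic contrast.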
Table 4-13: 56 significantly up-regulated KEGG pathways in the contrast between diabetic and
lean control
KEGG ID Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
4650 <0.0001 0.0006 Natural killer cell mediated cytotoxicity 430
4062 <0.0001 0.0710 Chemokine signaling pathway 503
4144 <0.0001 <0.0001 Endocytosis 554
5416 <0.0001 0.0064 Viral myocarditis 272
5200 <0.0001 0.0123 Pathways in cancer 1292
4670 <0.0001 0.5988 Leukocyte transendothelial migration 360
4145 <0.0001 0.6685 Phagosome 392
4360 <0.0001 0.4389 Axon guidance 337
4510 <0.0001 0.0655 Focal adhesion 710
5210 <0.0001 1 Colorectal cancer 316
4520 <0.0001 0.7885 Adherens junction 329
4514 <0.0001 0.0015 Cell adhesion molecules (CAMs) 321
4666 <0.0001 0.5954 Fc gamma R-mediated phagocytosis 251
5100 <0.0001 1 Bacterial invasion of epithelial cells 237
4512 <0.0001 0.0036 ECM-receptor interaction 243
4810 <0.0001 0.6444 Regulation of actin cytoskeleton 536
4660 <0.0001 1 T cell receptor signaling pathway 384
5340 <0.0001 0.0011 Primary immunodeficiency 84
5213 <0.0001 0.8988 Endometrial cancer 256
4612 <0.0001 0.1366 Antigen processing and presentation 220
5212 <0.0001 0.0510 Pancreatic cancer 342
4210 <0.0001 0.8637 Apoptosis 299
5145 <0.0001 0.6444 Toxoplasmosis 456
4060 <0.0001 0.0032 Cytokine-cytokine receptor interaction 583
4110 0.0001 1 Cell cycle 424
5144 0.0001 0.3926 Malaria 237
4722 0.0002 1 Neurotrophin signaling pathway 461
5160 0.0002 0.1672 Hepatitis C 356
4974 0.0002 0.0042 Protein digestion and absorption 149
4115 0.0005 0.4147 p53 signaling pathway 293
5217 0.0006 0.5611 Basal cell carcinoma 128
5322 0.0006 0.8988 Systemic lupus erythematosus 237
3050 0.0006 0.3600 Proteasome 71
10 0.0008 <0.0001 Glycolysis / Gluconeogenesis 113
5020 0.0008 0.0055 Prion diseases 138
4380 0.0008 0.9978 Osteoclast differentiation 388
4664 0.0008 1 Fc epsilon RI signaling pathway 254
4916 0.0017 1 Melanogenesis 273
5223 0.0021 1 Non-small cell lung cancer 237
5146 0.0031 0.0003 Amoebiasis 346
4120 0.0042 0.7879 Ubiquitin mediated proteolysis 248
4310 0.0042 1 Wnt signaling pathway 421
4662 0.0050 1 B cell receptor signaling pathway 232
5412 0.0075 0.4524 Arrhythmogenic right ventricular cardiomyopathy (ARVC) 186
5216 0.0083 0.1016 Thyroid cancer 168
5222 0.0111 1 Small cell lung cancer 329
5219 0.0111 0.0783 Bladder cancer 259
4320 0.0128 1 Dorso-ventral axis formation 85
4114 0.0165 1 Oocyte meiosis 254
4530 0.0249 0.1705 Tight junction 314
5215 0.0264 1 Prostate cancer 433
4141 0.0277 1 Protein processing in endoplasmic reticulum 327
5142 0.0319 0.8988 Chagas disease (American trypanosomiasis) 390
4912 0.0447 0.6708 GnRH signaling pathway 287
5010 0.0458 0.2737 Alzheimer's disease 391
5014 0.0458 0.0017 Amyotrophic lateral sclerosis (ALS) 173
Table 4-14: 35 significantly up-regulated KEGG pathways in both insulin resistant and diabetic
groups
KEGG ID KEGG Pathway No. of Probes
4062 Chemokine signaling pathway 503
4144 Endocytosis 554
4360 Axon guidance 337
4510 Focal adhesion 710
4520 Adherens junction 329
4530 Tight junction 314
4660 T cell receptor signaling pathway 384
4666 Fc gamma R-mediated phagocytosis 251
4670 Leukocyte transendothelial migration 360
4722 Neurotrophin signaling pathway 461
4810 Regulation of actin cytoskeleton 536
5100 Bacterial invasion of epithelial cells 237
5200 Pathways in cancer 1292
5212 Pancreatic cancer 342
4310 Wnt signaling pathway 421
4916 Melanogenesis 273
5210 Colorectal cancer 316
5213 Endometrial cancer 256
5215 Prostate cancer 433
5216 Thyroid cancer 168
5217 Basal cell carcinoma 128
4210 Apoptosis 299
4662 B cell receptor signaling pathway 232
5160 Hepatitis C 356
10 Glycolysis / Gluconeogenesis 113
4110 Cell cycle 424
4114 Oocyte meiosis 254
5014 Amyotrophic lateral sclerosis (ALS) 173
4120 Ubiquitin mediated proteolysis 248
4141 Protein processing in endoplasmic reticulum 327
5020 Prion diseases 138
3050 Proteasome 71
4320 Dorso-ventral axis formation 85
5223 Non-small cell lung cancer 237
4974 Protein digestion and absorption 149
In this case, 17 KEGG pathways were significantly down-regulated, with one pathway
containing more than 500 probes (see Table 4-15). We found that 13 of these KEGG pathways
were significantly down-regulated in both contrasts, i.e., insulin resistant and diabetic
patients versus the lean control (see Table 4-16).
Table 4-15: 17 significantly down-regulated KEGG pathways in the contrast between diabetic
and lean control
KEGG ID Down-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
3010 <0.0001 0.0012 Ribosome 232
860 0.0001 1 Porphyrin and chlorophyll metabolism 81
1100 0.0002 0.1988 Metabolic pathways 1804
190 0.0002 1 Oxidative phosphorylation 173
330 0.0003 0.0021 Arginine and proline metabolism 112
983 0.0005 1 Drug metabolism - other enzymes 83
4260 0.0036 1 Cardiac muscle contraction 136
20 0.0036 1 Citrate cycle (TCA cycle) 52
280 0.0042 1 Valine, leucine and isoleucine degradation 70
4740 0.0050 1 Olfactory transduction 163
4020 0.0054 0.5954 Calcium signaling pathway 356
4080 0.0059 1 Neuroactive ligand-receptor interaction 401
410 0.0123 1 beta-Alanine metabolism 36
5016 0.0177 1 Huntington's disease 382
240 0.0319 1 Pyrimidine metabolism 155
5012 0.0319 1 Parkinson's disease 194
3020 0.0451 0.3600 RNA polymerase 49
Table 4-16: 13 significantly down-regulated KEGG pathways in both insulin resistant and
diabetic groups
KEGG ID KEGG Pathway No. of Probes
280 Valine, leucine and isoleucine degradation 70
1100 Metabolic pathways 1804
20 Citrate cycle (TCA cycle) 52
983 Drug metabolism - other enzymes 83
4740 Olfactory transduction 163
330 Arginine and proline metabolism 112
4080 Neuroactive ligand-receptor interaction 401
4260 Cardiac muscle contraction 136
5016 Huntington's disease 382
190 Oxidative phosphorylation 173
5012 Parkinson's disease 194
860 Porphyrin and chlorophyll metabolism 81
3010 Ribosome 232
We found that 20 KEGG pathways (see Table 4-17) were differentially expressed
regardless of the direction, 3 of them containing more than 500 probes. Five KEGG pathways
were significantly mixed-regulated in both insulin resistant and diabetic patients (see Table 4-18).
Table 4-17: 20 significantly mixed-regulated KEGG pathways in the contrast between diabetic
and lean control
KEGG ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
10 1 0.0008 <0.0001 Glycolysis / Gluconeogenesis 113
4144 1 <0.0001 <0.0001 Endocytosis 554
360 1 0.3090 <0.0001 Phenylalanine metabolism 35
5146 1 0.0031 0.0003 Amoebiasis 346
620 0.6134 1 0.0006 Pyruvate metabolism 68
4650 1 <0.0001 0.0006 Natural killer cell mediated cytotoxicity 430
5340 1 <0.0001 0.0011 Primary immunodeficiency 84
3010 <0.0001 1 0.0012 Ribosome 232
4514 1 <0.0001 0.0015 Cell adhesion molecules (CAMs) 321
5014 1 0.0458 0.0017 Amyotrophic lateral sclerosis (ALS) 173
330 0.0003 1 0.0021 Arginine and proline metabolism 112
4060 1 <0.0001 0.0032 Cytokine-cytokine receptor interaction 583
4512 1 <0.0001 0.0036 ECM-receptor interaction 243
4974 1 0.0002 0.0042 Protein digestion and absorption 149
5020 1 0.0008 0.0055 Prion diseases 138
5416 1 <0.0001 0.0064 Viral myocarditis 272
5200 1 <0.0001 0.0123 Pathways in cancer 1292
4150 0.1985 1 0.0146 mTOR signaling pathway 165
4910 1 0.9529 0.0213 Insulin signaling pathway 321
350 1 0.4492 0.0334 Tyrosine metabolism 72
Table 4-18: 5 significantly mixed-regulated KEGG pathways in both insulin resistant and
diabetic groups
KEGG ID KEGG Pathway No. of Probes
10 Glycolysis / Gluconeogenesis 113
620 Pyruvate metabolism 68
5014 Amyotrophic lateral sclerosis (ALS) 173
5020 Prion diseases 138
360 Phenylalanine metabolism 35
Insulin Sensitive versus Lean Control
Gene sets defined by 70 KEGG pathways were significantly up-regulated (FDR<0.05),
5 of them containing more than 500 probes. The top 30 KEGG pathways are given in
Table 4-19. A number of KEGG pathways related to different types of cancer were significantly
up-regulated in the insulin sensitive patients relative to healthy controls.
Table 4-19: Top 30 significantly up-regulated KEGG pathways in the contrast between insulin
sensitive and lean control
KEGG ID Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
5210 <0.0001 0.0779 Colorectal cancer 316
5213 <0.0001 0.0011 Endometrial cancer 256
5200 <0.0001 <0.0001 Pathways in cancer 1292
4650 <0.0001 0.5834 Natural killer cell mediated cytotoxicity 430
5223 <0.0001 0.0003 Non-small cell lung cancer 237
5212 <0.0001 0.0001 Pancreatic cancer 342
4144 <0.0001 0.4683 Endocytosis 554
4062 <0.0001 1 Chemokine signaling pathway 503
5160 <0.0001 0.0076 Hepatitis C 356
5145 <0.0001 0.6521 Toxoplasmosis 456
5221 <0.0001 0.0660 Acute myeloid leukemia 224
5215 <0.0001 <0.0001 Prostate cancer 433
4722 <0.0001 1 Neurotrophin signaling pathway 461
4320 <0.0001 0.1302 Dorso-ventral axis formation 85
4110 <0.0001 0.8557 Cell cycle 424
5416 <0.0001 0.9113 Viral myocarditis 272
4010 <0.0001 0.0478 MAPK signaling pathway 694
4916 <0.0001 0.1945 Melanogenesis 273
5214 <0.0001 <0.0001 Glioma 299
5218 <0.0001 <0.0001 Melanoma 315
4810 <0.0001 0.0167 Regulation of actin cytoskeleton 536
4360 <0.0001 0.8120 Axon guidance 337
5220 <0.0001 0.9733 Chronic myeloid leukemia 377
4670 <0.0001 0.9809 Leukocyte transendothelial migration 360
5219 <0.0001 <0.0001 Bladder cancer 259
4520 <0.0001 1 Adherens junction 329
5216 <0.0001 0.3898 Thyroid cancer 168
4210 <0.0001 0.0540 Apoptosis 299
4666 <0.0001 0.3209 Fc gamma R-mediated phagocytosis 251
562 <0.0001 0.3360 Inositol phosphate metabolism 103
The down- and mixed-regulated KEGG pathways are listed in Tables 4-20 and 4-21.
We found that 10 cancer pathways were differentially expressed regardless of the direction (see
Table 4-21); they were also significantly up-regulated, except for two pathways, i.e., Small
cell lung cancer (KEGG ID 5222) and Renal cell carcinoma (KEGG ID 5211).
Table 4-20: 8 significantly down-regulated KEGG pathways in the contrast between insulin
sensitive and lean control
KEGG ID Down-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
3010 <0.0001 1 Ribosome 232
190 <0.0001 0.6578 Oxidative phosphorylation 173
5012 0.0003 0.6039 Parkinson's disease 194
4260 0.0010 0.8469 Cardiac muscle contraction 136
4080 0.0055 1 Neuroactive ligand-receptor interaction 401
4740 0.0129 1 Olfactory transduction 163
330 0.0278 0.0363 Arginine and proline metabolism 112
5410 0.0487 0.1102 Hypertrophic cardiomyopathy (HCM) 238
Table 4-21: 32 significantly mixed-regulated KEGG pathways in the contrast between insulin
sensitive and lean control
KEGG ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes
5218 1 <0.0001 <0.0001 Melanoma 315
4115 1 0.0546 <0.0001 p53 signaling pathway 293
5215 1 <0.0001 <0.0001 Prostate cancer 433
5214 1 <0.0001 <0.0001 Glioma 299
350 1 0.0010 <0.0001 Tyrosine metabolism 72
5219 1 <0.0001 <0.0001 Bladder cancer 259
5200 1 <0.0001 <0.0001 Pathways in cancer 1292
5020 1 0.0005 <0.0001 Prion diseases 138
5211 1 <0.0001 <0.0001 Renal cell carcinoma 283
4150 1 0.4020 <0.0001 mTOR signaling pathway 165
5212 1 <0.0001 <0.0001 Pancreatic cancer 342
4370 1 0.1095 0.0003 VEGF signaling pathway 256
5223 1 <0.0001 0.0003 Non-small cell lung cancer 237
4720 1 0.0105 0.0004 Long-term potentiation 197
4914 1 0.1095 0.0005 Progesterone-mediated oocyte maturation 241
4012 1 0.0004 0.0006 ErbB signaling pathway 326
4060 1 0.0006 0.0009 Cytokine-cytokine receptor interaction 583
5014 1 0.0042 0.0011 Amyotrophic lateral sclerosis (ALS) 173
360 1 0.1826 0.0011 Phenylalanine metabolism 35
5213 1 <0.0001 0.0011 Endometrial cancer 256
5010 1 0.1706 0.0012 Alzheimer's disease 391
5144 1 0.0003 0.0047 Malaria 237
5160 1 <0.0001 0.0076 Hepatitis C 356
4960 1 0.6578 0.0142 Aldosterone-regulated sodium reabsorption 112
4810 1 <0.0001 0.0167 Regulation of actin cytoskeleton 536
3320 1 0.1125 0.0184 PPAR signaling pathway 158
4610 0.3898 1 0.0238 Complement and coagulation cascades 175
4912 1 0.0020 0.0271 GnRH signaling pathway 287
4540 1 <0.0001 0.0299 Gap junction 224
330 0.0278 1 0.0363 Arginine and proline metabolism 112
5222 1 0.0004 0.0452 Small cell lung cancer 329
4010 1 <0.0001 0.0478 MAPK signaling pathway 694
4.1.2 Self-Contained Gene Set Test
The self-contained gene set test investigated in this project is the rotation gene set
test in the limma package. The rotation gene set test (Wu et al. 2010) is considered a
self-contained gene set test because only the information contained in the gene set of interest is
used to test whether any of the genes in the set are differentially expressed.
Gene Ontology (GO) Terms
No GO terms were found to be statistically significant in the contrast between insulin
resistant patients and lean control when controlling the false discovery rate at the level of 0.1.
The same holds for the contrast between diabetic patients and lean control, as well as for the
contrast between insulin sensitive patients and lean control.
KEGG Pathways
No KEGG pathways were found to be significant in the contrast between insulin resistant
patients and lean control when controlling the false discovery rate at the level of 0.1. In the
contrast between diabetic patients and lean control, gene sets defined by 5 KEGG pathways
appeared to be differentially expressed regardless of the direction when a threshold of 0.1 was
used to control the false discovery rate. All 5 of these KEGG pathways were also significantly
mixed-regulated in the mean-rank gene set test. The top 8 mixed-regulated KEGG pathways are
shown in Table 4-22. No significantly up- or down-regulated KEGG pathways were found in the
contrast between diabetic patients and lean control based on a threshold of 0.1 when controlling
the false discovery rate.
In the contrast between insulin sensitive patients and lean control, gene sets
associated with 18 KEGG pathways were significantly up-regulated when controlling the false
discovery rate at the level of 0.1 (see Table 4-23).
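The false discovery rate thresholds used above can be illustrated with a short sketch. The text does not state which adjustment procedure was applied; the Benjamini-Hochberg procedure is assumed here only because it is a common default. The helper name `benjamini_hochberg` and the p-values are hypothetical and are not taken from the thesis or from limma.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four gene sets (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)

# Gene sets declared significant at an FDR level of 0.1.
significant = [i for i, q in enumerate(fdr) if q < 0.1]
```

A gene set is reported as significant when its adjusted value falls below the chosen level (0.1 in this subsection).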
Table 4-22: Rotation Gene Set Test - Top 8 mixed-regulated KEGG pathways in the contrast
between diabetic and lean control
KEGG ID Mixed-regulated FDR Up-regulated FDR KEGG Pathway No. of Probes
360 0.0376 0.3052 Phenylalanine metabolism 35
10 0.0940 0.3052 Glycolysis / Gluconeogenesis 113
5146 0.0940 0.3354 Amoebiasis 346
350 0.0940 0.3513 Tyrosine metabolism 72
620 0.0977 0.7738 Pyruvate metabolism 68
4144 0.1170 0.3052 Endocytosis 554
5014 0.1170 0.3360 Amyotrophic lateral sclerosis (ALS) 173
4910 0.1170 0.6640 Insulin signaling pathway 321
Table 4-23: Rotation Gene Set Test - Top 20 up-regulated KEGG pathways in the contrast
between insulin sensitive and lean control
KEGG ID Mixed-regulated FDR Up-regulated FDR KEGG Pathway No. of Probes
5213 0.2464 0.0188 Endometrial cancer 256
4070 0.2464 0.0188 Phosphatidylinositol signaling system 153
4320 0.2464 0.0301 Dorso-ventral axis formation 85
5223 0.2464 0.0301 Non-small cell lung cancer 237
562 0.2464 0.0301 Inositol phosphate metabolism 103
603 0.2464 0.0313 Glycosphingolipid biosynthesis - globo series 25
5214 0.2464 0.0322 Glioma 299
5210 0.2464 0.0376 Colorectal cancer 316
5218 0.2464 0.0418 Melanoma 315
4360 0.2747 0.0478 Axon guidance 337
5221 0.2464 0.0478 Acute myeloid leukemia 224
4962 0.3107 0.0533 Vasopressin-regulated water reabsorption 78
4916 0.2464 0.0636 Melanogenesis 273
4010 0.2464 0.0929 MAPK signaling pathway 694
5212 0.2464 0.0929 Pancreatic cancer 342
5216 0.2464 0.0929 Thyroid cancer 168
5160 0.2464 0.0929 Hepatitis C 356
5200 0.2464 0.0940 Pathways in cancer 1292
4144 0.2464 0.1009 Endocytosis 554
4114 0.2747 0.1056 Oocyte meiosis 254
We note that a few cancer related KEGG pathways were again significant when
the rotation gene set test was applied to the comparison of insulin sensitive patients versus
healthy controls. Probes common to some of the cancer related pathways of interest were
identified, and boxplots were produced to compare their gene expression levels across samples. For
example, there are 62 common probes in both "Pancreatic cancer" and "Colorectal cancer"
pathways, and the median value of these 62 probes in each sample was computed (see the
boxplot in Figure 4-1). The medians, upper quartiles and lower quartiles of the 3 groups
of obese patients appear to be higher than those of the controls. Another example is given
in Figure 4-2.
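The per-sample summary behind these boxplots can be sketched as follows. This is a minimal illustration, assuming that each pathway is available as a list of probe identifiers and that M values are stored per probe; the function name `common_probe_medians`, the probe names and the M values are all hypothetical.

```python
from statistics import median


def common_probe_medians(pathway_a, pathway_b, expression):
    """Median M value per sample over the probes shared by two pathways.

    pathway_a, pathway_b: iterables of probe identifiers.
    expression: dict mapping probe identifier -> list of M values,
                one value per sample, in a fixed sample order.
    """
    common = sorted(set(pathway_a) & set(pathway_b))
    if not common:
        raise ValueError("the two pathways share no probes")
    n_samples = len(expression[common[0]])
    medians = [
        median(expression[probe][s] for probe in common)
        for s in range(n_samples)
    ]
    return common, medians


# Toy data (not from the thesis): five probes measured on three samples.
pancreatic = ["p1", "p2", "p3", "p4"]
colorectal = ["p2", "p3", "p4", "p5"]
m_values = {
    "p1": [0.5, 0.1, -0.2],
    "p2": [-0.3, -0.1, 0.0],
    "p3": [-0.2, -0.4, 0.1],
    "p4": [-0.1, -0.2, 0.3],
    "p5": [0.9, 0.9, 0.9],
}
shared, per_sample = common_probe_medians(pancreatic, colorectal, m_values)
```

The resulting per-sample medians would then be grouped by patient category before drawing one box per group.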
[Figure 4-1: boxplots by group (Lean Control, Obese Insulin Sensitive, Obese Insulin Resistant, Diabetic); vertical axis: M values]
Figure 4-1: Boxplots of gene expression levels (M values) of 62 common probes in both
"Pancreatic cancer" and "Colorectal cancer" KEGG pathways on the Agilent human array
[Figure 4-2: boxplots by group (Lean Control, Obese Insulin Sensitive, Obese Insulin Resistant, Diabetic); vertical axis: M values]
Figure 4-2: Boxplots of gene expression levels (M values) of 42 common probes in both
"Pancreatic cancer" and "Endometrial cancer" KEGG pathways on the Agilent human array
To draw basic biological inferences from the results generated
by the gene set tests, I reviewed some recent publications concerning various GO terms or KEGG
pathways and their roles in the development of type 2 diabetes. The discussion is given in
Section 6.2.
4.1.3 Comparison of Three Gene Set Tests for Insulin Related GO Terms
All GO terms containing the word
Insulin Resistance versus Lean Control
Gene sets associated with 12 GO categories were significantly down-regulated using the
mean-rank gene set test (FDR<0.1). None of these GO categories were formally significant
when using either the rotation gene set test (Roast) or the correlation adjusted mean-rank gene
set test (Camera). However, the ranks of the GO categories in terms of the statistical
significance were very similar for the three different methods (see Table 4-24).
Table 4-24: Comparison of down-regulated insulin related GO terms in the contrast between
insulin resistance and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR No. of Probes
GO:0016942 insulin-like growth factor binding protein complex <0.0001 0.1233 0.1790 28
GO:0048009 insulin-like growth factor receptor signaling pathway 0.0002 0.1233 0.1790 40
GO:0005158 insulin receptor binding 0.0009 0.1233 0.1790 87
GO:0005159 insulin-like growth factor receptor binding 0.0010 0.1692 0.1790 45
GO:0005520 insulin-like growth factor binding 0.0010 0.2417 0.3259 56
GO:0043559 insulin binding 0.0020 0.1692 0.2223 18
GO:0043560 insulin receptor substrate binding 0.0117 0.2562 0.3462 44
GO:0005010 insulin-like growth factor receptor activity 0.0121 0.2417 0.3259 18
GO:0031994 insulin-like growth factor I binding 0.0543 0.3002 0.3898 41
GO:0061179 negative regulation of insulin secretion involved in cellular response to glucose 0.0739 0.1233 0.1790 2
GO:0032869 cellular response to insulin stimulus 0.0753 0.4152 0.4576 163
GO:0046627 negative regulation of insulin receptor signaling pathway 0.0832 0.4012 0.4535 64
To evaluate the potential relationship between the results from each pair of the three
methods, unadjusted p-values were used to produce scatterplot matrices. An extremely strong
positive correlation is evident (see Figure 4-3) between the unadjusted p-values generated from
Roast and Camera. Figure 4-3 also shows a moderate positive correlation between the
unadjusted p-values from the mean-rank gene set test and Roast, and between the mean-rank
gene set test and Camera.
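The pairwise comparison that the scatterplot matrix visualises can also be summarised numerically. The sketch below computes the Pearson correlation for each pair of methods; it is an illustration only, the thesis reports the relationships graphically, and the p-values used here are invented rather than taken from the analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical unadjusted p-values for the same five gene sets.
p_values = {
    "MeanRank": [0.001, 0.02, 0.30, 0.55, 0.90],
    "Roast": [0.04, 0.10, 0.35, 0.50, 0.80],
    "Camera": [0.06, 0.12, 0.40, 0.52, 0.85],
}

# One correlation per unordered pair of methods.
methods = list(p_values)
pairwise = {
    (a, b): pearson(p_values[a], p_values[b])
    for i, a in enumerate(methods)
    for b in methods[i + 1:]
}
```

A rank-based coefficient such as Spearman's could be substituted if the relationship is monotone but not linear.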
[Figure 4-3: scatterplot matrix with panels MeanRank, Roast and Camera; all axes run from 0.0 to 1.0. Panel title: Down-regulated P values - Contrast Between Insulin Resistance and Lean Control]
Figure 4-3: Scatterplot matrix of down-regulated unadjusted p-values from three methods
Two GO categories were significantly up-regulated (see Table 4-25) using the mean-
rank gene set test (FDR<0.1). Both Roast and Camera generated large adjusted p-values and
the scatterplot matrix in Figure 4-4 shows some similar positive linear trends.
Table 4-25: Comparison of up-regulated insulin related GO terms in the contrast between insulin
resistance and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR No. of Probes
GO:0043569 negative regulation of insulin-like growth factor receptor signaling pathway 0.0149 0.2030 0.2746 9
GO:0008286 insulin receptor signaling pathway 0.0149 0.9940 0.9882 346
[Figure 4-4: scatterplot matrix with panels MeanRank, Roast and Camera; all axes run from 0.0 to 1.0. Panel title: Up-regulated P values - Contrast Between Insulin Resistance and Lean Control]
Figure 4-4: Scatterplot matrix of up-regulated unadjusted p-values from three methods
Diabetic versus Lean Control
Gene sets defined by 9 insulin related GO terms appeared to be significantly down-
regulated (see Table 4-26) using the mean-rank gene set test (FDR<0.1). Both Roast and
Camera seem to be very conservative in detecting differentially expressed (DE) gene sets, and
their resulting unadjusted p-values are highly positively correlated even though their null
hypotheses are quite different (see Figure 4-5).
Table 4-26: Comparison of down-regulated insulin related GO terms in the contrast between
diabetic and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR No. of Probes
GO:0048009 insulin-like growth factor receptor signaling pathway 0.0018 0.2127 0.3456 40
GO:0005010 insulin-like growth factor receptor activity 0.0072 0.2958 0.3492 18
GO:0043559 insulin binding 0.0124 0.3142 0.3492 18
GO:0005158 insulin receptor binding 0.0132 0.3190 0.3492 87
GO:0005520 insulin-like growth factor binding 0.0135 0.4036 0.4270 56
GO:0005159 insulin-like growth factor receptor binding 0.0259 0.3190 0.3898 45
GO:0016942 insulin-like growth factor binding protein complex 0.0286 0.4082 0.4270 28
GO:0050796 regulation of insulin secretion 0.0671 0.2958 0.3492 172
[Figure 4-5: scatterplot matrix with panels MeanRank, Roast and Camera; all axes run from 0.0 to 1.0. Panel title: Down-regulated P values - Contrast Between Obese Diabetic and Lean Control]
Figure 4-5: Scatterplot matrix of down-regulated unadjusted p-values from three methods in the
contrast between diabetic and lean control
Using the mean-rank gene set test, gene sets defined by 2 GO terms were significantly
up-regulated in diabetic relative to the lean control. Roast also detected one gene set to be up-
regulated (FDR<0.1) as listed in Table 4-27.
Table 4-27: Comparison of up-regulated insulin related GO terms in the contrast between
diabetic and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR No. of Probes
GO:0046676 negative regulation of insulin secretion <0.0001 0.0435 0.1449 49
GO:0043569 negative regulation of insulin-like growth factor receptor signaling pathway 0.0291 0.2030 0.2931 9
Insulin Sensitive versus Lean Control
Gene sets related to 2 GO terms were significantly up-regulated (see Table 4-28) using
the mean-rank gene set test in the contrast between insulin sensitive patients and the lean
control. No GO terms were found to be significantly down-regulated using any of the three
methods.
Table 4-28: Comparison of up-regulated insulin related GO terms in the contrast between insulin
sensitive and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR No. of Probes
GO:0008286 insulin receptor signaling pathway <0.0001 0.1740 0.9182 346
GO:0032869 cellular response to insulin stimulus 0.0571 0.5075 0.9431 163
4.1.4 Comparison of Three Gene Set Tests for Glucose Related GO Terms
GO terms containing
Diabetic versus Lean Control
Using the mean-rank gene set test, gene sets defined by 4 GO terms were down-
regulated (see Table 4-30) whereas genes associated with 4 other GO terms were up-regulated
(see Table 4-31). Again, Roast and Camera found no DE gene sets after adjusting for multiple
testing.
Table 4-30: Comparison of down-regulated glucose related GO terms in the contrast between
diabetic and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR
GO:0015758 glucose transport 0.0091 0.1643 0.2912
GO:0005536 glucose binding 0.0153 0.1643 0.2765
GO:0001678 cellular glucose homeostasis 0.0486 0.1643 0.2765
GO:0046323 glucose import 0.0867 0.1500 0.2765
Table 4-31: Comparison of up-regulated glucose related GO terms in the contrast between
diabetic and lean control
GO ID GO Term MeanRank FDR Roast FDR Camera FDR
GO:0006006 glucose metabolic process 0.0307 0.5167 0.9885
GO:0042593 glucose homeostasis 0.0307 0.7500 0.9885
GO:0003980 UDP-glucose:glycoprotein glucosyltransferase activity 0.0727 0.1500 0.7729
GO:0009749 response to glucose stimulus 0.0727 0.6800 0.9885
Insulin Sensitive versus Lean Control
No glucose related GO terms were found to be down- or up-regulated in the contrast
between insulin sensitive patients and the lean control by any of the three gene set tests.
4.1.5 Comparison of Three Gene Set Tests for the FOXO Gene Set
Four FOXO genes are identified in mammals and three of them are found in humans,
i.e., FOXO1, 3 and 4. Their inter-gene correlation is 0.2727. Using both the mean-rank gene set
test and the rotation gene set test, the FOXO gene set was found to be significantly up-
regulated in two contrasts (i.e., insulin resistant patients versus controls and insulin sensitive
patients versus controls) using a threshold for the false discovery rate control of 0.05. When
testing a single gene set, the unadjusted p values seem to be similar for the mean-rank and the
rotation gene set test. However, Camera returned no significant results in any of the contrasts.
A summary of the results is shown in Table 4-32.
Table 4-32: Summary of three gene set tests for the FOXO gene set in three contrasts
Contrast Direction MeanRank P-value Roast P-value Camera P-value
Insulin Resistant versus Lean Control Down-regulated 0.9965 0.9610 0.9394
Insulin Resistant versus Lean Control Up-regulated 0.0034 0.0400 0.0605
Insulin Resistant versus Lean Control Mixed-regulated 0.2738 0.2200 0.1211
Diabetic versus Lean Control Down-regulated 0.8761 0.7789 0.7507
Diabetic versus Lean Control Up-regulated 0.1238 0.2211 0.2492
Diabetic versus Lean Control Mixed-regulated 0.3877 0.3331 0.4985
Insulin Sensitive versus Lean Control Down-regulated 0.9864 0.9522 0.8993
Insulin Sensitive versus Lean Control Up-regulated 0.0135 0.0478 0.1006
Insulin Sensitive versus Lean Control Mixed-regulated 0.0787 0.0869 0.2012
4.2 Hypergeometric Test for Gene Set Enrichment Analysis
4.2.1 The longitudinal mouse study involving the comparison of a high-fat diet to the
control
In the longitudinal mouse study, we first need to define the gene universe. Non-specific
filtering was carried out to remove probe sets with little variation across samples, i.e., an
inter-quartile range of less than 0.5. Probes with no Gene Ontology (Biological Process)
annotation were excluded. The same 1146 DE probes from the comparison between
42 days of high-fat diet and the control in the muscle tissue group were mapped to
genes of interest using the Entrez gene identifier. GO terms have parent-child hierarchies,
so a conditional hypergeometric test is required to decorrelate the results (Falcon & Gentleman
2007). Because of the way the conditional hypergeometric test operates, it is difficult to adjust
the resultant p-values directly (Falcon & Gentleman 2007). Hence, a p-value cutoff of 0.01 is
used for conditional hypergeometric tests in this section. Using the conditional test, 206 GO
(Biological Process) terms were over-represented (p-value <0.01) in the list of DE genes.
The top 30 over-represented GO terms are given in Table 4-33.
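The non-specific filtering step described above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function names and the toy expression values are hypothetical, and the quartiles are computed by linear interpolation between order statistics, which is an assumption about how the inter-quartile range was defined.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range using linear interpolation between order statistics."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(expression, threshold=0.5):
    """Keep probe sets whose IQR across samples is at least `threshold`.

    expression: dict mapping probe set identifier -> list of expression
                values, one per sample.
    """
    return {
        probe: values
        for probe, values in expression.items()
        if iqr(values) >= threshold
    }


# Toy expression values (illustration only).
toy = {
    "probe_a": [1.0, 2.0, 3.0, 4.0, 5.0],   # IQR = 2.0, retained
    "probe_b": [7.0, 7.1, 7.2, 7.3, 7.4],   # IQR is about 0.2, removed
}
kept = filter_by_iqr(toy)
```

Only the retained probe sets that also carry a GO annotation would then form the gene universe for the hypergeometric test.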
Table 4-33: Top 30 Over-represented GO (BP) terms after 42 days of high-fat diet in the
longitudinal mouse study
GO ID (BP) P value Odds Ratio Exp Count Count Size Term
GO:0008104 <0.0001 2.2 57.0 107 620 protein localization
GO:0080090 <0.0001 1.7 172.0 243 1870 regulation of primary metabolic process
GO:0044238 <0.0001 1.7 125.8 179 1546 primary metabolic process
GO:0048519 <0.0001 1.6 123.9 176 1347 negative regulation of biological process
GO:0044249 <0.0001 1.5 191.8 252 2086 cellular biosynthetic process
GO:0015031 <0.0001 2.5 22.6 48 252 protein transport
GO:0023051 <0.0001 1.8 70.3 111 764 regulation of signaling
GO:0009059 <0.0001 1.5 155.5 211 1691 macromolecule biosynthetic process
GO:0034641 <0.0001 1.5 207.3 268 2254 cellular nitrogen compound metabolic process
GO:0010467 <0.0001 1.5 145.7 199 1608 gene expression
GO:0016043 <0.0001 1.5 146.3 198 1591 cellular component organization
GO:0051234 <0.0001 1.5 134.9 185 1485 establishment of localization
GO:0070727 <0.0001 2.2 28.1 54 306 cellular macromolecule localization
GO:0001932 <0.0001 2.1 29.0 54 315 regulation of protein phosphorylation
GO:0019220 <0.0001 2.0 33.5 60 364 regulation of phosphate metabolic process
GO:0032268 <0.0001 2.0 34.3 61 377 regulation of cellular protein metabolic process
GO:0051169 <0.0001 2.8 13.2 31 144 nuclear transport
GO:0016070 <0.0001 1.5 110.5 154 1202 RNA metabolic process
GO:0050657 <0.0001 5.4 3.7 14 40 nucleic acid transport
GO:0051236 <0.0001 5.4 3.7 14 40 establishment of RNA localization
GO:0060255 <0.0001 1.6 104.5 146 1180 regulation of macromolecule metabolic process
GO:0010608 <0.0001 2.7 12.5 29 136 posttranscriptional regulation of gene expression
GO:0043549 <0.0001 2.2 21.3 42 232 regulation of kinase activity
GO:0070647 <0.0001 2.2 20.1 39 219 protein modification by small protein conjugation or removal
GO:0008380 <0.0001 2.7 11.3 26 124 RNA splicing
GO:0009889 <0.0001 1.5 91.6 127 1020 regulation of biosynthetic process
GO:0060548 <0.0001 2.0 27.0 48 294 negative regulation of cell death
GO:0048583 <0.0001 1.5 78.7 111 856 regulation of response to stimulus
GO:0034440 <0.0001 4.0 4.5 14 49 lipid oxidation
GO:0045740 <0.0001 8.0 1.7 8 18 positive regulation of DNA replication
Using the non-conditional hypergeometric test, 16 KEGG pathways were found to be over-
represented (using p-value <0.01) in the list of DE genes (see Table 4-34).
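For readers unfamiliar with the test, the non-conditional hypergeometric calculation can be sketched as follows. This is a plain over-representation test written for illustration; it is not the conditional procedure of Falcon & Gentleman, and the function names and the small numbers in the example are hypothetical. The "Exp Count" column corresponds to the expected overlap under the null hypothesis, and "Count" to the observed overlap.

```python
from math import comb


def hypergeom_upper_tail(universe, set_size, selected, overlap):
    """P(X >= overlap) for X ~ Hypergeometric.

    universe: number of genes in the gene universe
    set_size: number of universe genes annotated to the pathway
    selected: number of DE (selected) genes drawn from the universe
    overlap:  number of selected genes that fall in the pathway
    """
    max_overlap = min(set_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def expected_count(universe, set_size, selected):
    """Expected overlap under the null hypothesis of no enrichment."""
    return selected * set_size / universe


# Hypothetical example: 10 genes in the universe, 4 in the pathway,
# 3 selected genes, 2 of which belong to the pathway.
p_value = hypergeom_upper_tail(universe=10, set_size=4, selected=3, overlap=2)
exp = expected_count(universe=10, set_size=4, selected=3)
```

A pathway is reported as over-represented when this upper-tail probability falls below the chosen cutoff (0.01 in this section).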
Table 4-34: 16 over-represented KEGG pathways after 42 days of high-fat diet in the
longitudinal mouse study
KEGG ID P value Odds Ratio Exp Count Count Size KEGG Pathway
3015 <0.0001 4.3230 4.8927 15 43 mRNA surveillance pathway
4722 <0.0001 3.0508 8.7614 21 77 Neurotrophin signaling pathway
3018 0.0001 4.8113 3.6411 12 32 RNA degradation
5100 0.0003 3.8625 4.5514 13 40 Bacterial invasion of epithelial cells
4141 0.0004 2.4729 11.7198 24 103 Protein processing in endoplasmic reticulum
5211 0.0006 3.3045 5.4617 14 48 Renal cell carcinoma
4510 0.0018 2.0511 15.2472 27 134 Focal adhesion
4720 0.0018 3.1950 4.7790 12 42 Long-term potentiation
62 0.0020 15.7554 0.6827 4 6 Fatty acid elongation in mitochondria
4920 0.0028 2.9929 5.0065 12 44 Adipocytokine signaling pathway
4910 0.0043 2.1819 10.1268 19 89 Insulin signaling pathway
4210 0.0050 2.7332 5.3479 12 47 Apoptosis
5212 0.0050 2.7332 5.3479 12 47 Pancreatic cancer
4912 0.0051 2.4859 6.7133 14 59 GnRH signaling pathway
4114 0.0060 2.4309 6.8271 14 60 Oocyte meiosis
4010 0.0062 1.7758 19.0021 30 167 MAPK signaling pathway
4.2.2 The cross-sectional human study comparing expression in tissue samples for a
control group of healthy patients and obese patients
In the cross-sectional human study, a similar non-specific gene filtering process was
performed in order to determine the gene universe. When choosing the list of interesting genes,
we focused on the top 50 probes from each of the 3 contrasts separately because very few DE
probes were detected.
Using Top Ranked Probes from Insulin Resistant Versus Lean Control
Using the top ranked 50 probes from the comparison between the insulin resistant
patients and the control group, the conditional hypergeometric test (Falcon & Gentleman 2007)
was performed for their corresponding GO terms. Twelve GO terms were over-represented in the
list of top ranked genes using the criterion of a p-value less than 0.01 (see Table 4-35). One KEGG
pathway was over-represented (p-value <0.01). The top 10 over-represented KEGG
pathways can be found in Table 4-36.
Table 4-35: Top 12 Over-represented GO terms in the contrast between insulin resistant
patients and the control
GO ID (BP) P value Odds Ratio Exp Count Count Size Term
GO:0046627 0.0003 98.0588 0.0284 2 10 negative regulation of insulin receptor signaling pathway
GO:0032869 0.0022 13.5680 0.2668 3 94 cellular response to insulin stimulus
GO:0010980 0.0057 370.8333 0.0057 1 2 positive regulation of vitamin D 24-hydroxylase activity
GO:0035408 0.0057 370.8333 0.0057 1 2 histone H3-T6 phosphorylation
GO:0042369 0.0057 370.8333 0.0057 1 2 vitamin D catabolic process
GO:0051345 0.0058 9.5160 0.3746 3 132 positive regulation of hydrolase activity
GO:0009636 0.0067 18.5826 0.1249 2 44 response to toxin
GO:0009103 0.0085 185.3889 0.0085 1 3 lipopolysaccharide biosynthetic process
GO:0010966 0.0085 185.3889 0.0085 1 3 regulation of phosphate transport
GO:0046325 0.0085 185.3889 0.0085 1 3 negative regulation of glucose import
GO:0055062 0.0085 185.3889 0.0085 1 3 phosphate ion homeostasis
GO:0007202 0.0096 15.2826 0.1504 2 53 activation of phospholipase C activity
Table 4-36: Top 10 Over-represented KEGG Pathways in the contrast between insulin resistant
patients and the control
KEGG ID P value Odds Ratio Exp Count Count Size KEGG Pathway
4960 0.0007 99.0800 0.0431 2 27 Aldosterone-regulated sodium reabsorption
4010 0.0118 21.1416 0.1836 2 115 MAPK signaling pathway
5200 0.0265 13.4624 0.2793 2 175 Pathways in cancer
5130 0.0331 41.3667 0.0335 1 21 Pathogenic Escherichia coli infection
5110 0.0347 39.3810 0.0351 1 22 Vibrio cholerae infection
5143 0.0362 37.5758 0.0367 1 23 African trypanosomiasis
5223 0.0409 33.0267 0.0415 1 26 Non-small cell lung cancer
5214 0.0440 30.5556 0.0447 1 28 Glioma
590 0.0471 28.4253 0.0479 1 30 Arachidonic acid metabolism
4370 0.0486 27.4667 0.0495 1 31 VEGF signaling pathway
Using Top Ranked Probes from Diabetic versus Lean Control
Using the top ranked 50 probes from the comparison between the diabetic patients and
the lean control group, 2 GO terms were over-represented in the list of top ranked genes (using
p value <0.01). Only one KEGG pathway with a size of 13 genes,
Table 4-37: Top 15 Over-represented GO terms in the contrast between diabetic patients and
the control
GO ID (BP) P value Odds Ratio Exp Count Count Size Term
GO:0031099 0.0086 16.2326 0.1412 2 45 regeneration
GO:0048642 0.0094 166.8000 0.0094 1 3 negative regulation of skeletal muscle tissue development
GO:0006488 0.0125 111.1833 0.0125 1 4 dolichol-linked oligosaccharide biosynthetic process
GO:0009225 0.0156 83.3750 0.0157 1 5 nucleotide-sugar metabolic process
GO:0046622 0.0156 83.3750 0.0157 1 5 positive regulation of organ growth
GO:0006044 0.0187 66.6900 0.0188 1 6 N-acetylglucosamine metabolic process
GO:0048635 0.0187 66.6900 0.0188 1 6 negative regulation of muscle organ development
GO:0030260 0.0218 55.5667 0.0220 1 7 entry into host cell
GO:0045736 0.0218 55.5667 0.0220 1 7 negative regulation of cyclin-dependent protein kinase activity
GO:0051828 0.0218 55.5667 0.0220 1 7 entry into other organism involved in symbiotic interaction
GO:0052126 0.0218 55.5667 0.0220 1 7 movement in host environment
GO:0006040 0.0279 41.6625 0.0282 1 9 amino sugar metabolic process
GO:0019059 0.0279 41.6625 0.0282 1 9 initiation of viral infection
GO:0031103 0.0309 37.0278 0.0314 1 10 axon regeneration
GO:0046627 0.0309 37.0278 0.0314 1 10 negative regulation of insulin receptor signaling pathway
Using Top Ranked Probes from Insulin Sensitive versus Lean Control
Using the top ranked 50 probes from the contrast of insulin sensitive patients versus the
control group, 12 GO terms were over-represented (using p value <0.01). Table 4-38 shows the
results.
Table 4-38: 12 Over-represented GO terms in the contrast between insulin sensitive patients
and the control
GO ID (BP) P value Odds Ratio Exp Count Count Size Term
GO:0051098 0.0006 12.1297 0.4109 4 131 regulation of binding
GO:0007190 0.0049 21.8487 0.1066 2 34 activation of adenylate cyclase activity
GO:0031281 0.0052 21.1834 0.1098 2 35 positive regulation of cyclase activity
GO:0051349 0.0052 21.1834 0.1098 2 35 positive regulation of lyase activity
GO:0032656 0.0063 333.6500 0.0063 1 2 regulation of interleukin-13 production
GO:0045404 0.0063 333.6500 0.0063 1 2 positive regulation of interleukin-4 biosynthetic process
GO:0051895 0.0063 333.6500 0.0063 1 2 negative regulation of focal adhesion assembly
GO:0030100 0.0089 15.8612 0.1443 2 46 regulation of endocytosis
GO:0043011 0.0094 166.8000 0.0094 1 3 myeloid dendritic cell differentiation
GO:0046855 0.0094 166.8000 0.0094 1 3 inositol phosphate dephosphorylation
GO:0046856 0.0094 166.8000 0.0094 1 3 phosphatidylinositol dephosphorylation
GO:0050765 0.0094 166.8000 0.0094 1 3 negative regulation of phagocytosis
Four KEGG pathways were over-represented (using p value <0.05) in this contrast (see Table
4-39).
Table 4-39: 4 over-represented KEGG pathways in the contrast between insulin sensitive
patients and the control
KEGG ID P value Odds Ratio Exp Count Count Size Term
4141 0.0182 12.0246 0.2153 2 60 Protein processing in endoplasmic reticulum
592 0.0319 38.9219 0.0323 1 9 alpha-Linolenic acid metabolism
1040 0.0388 31.1125 0.0395 1 11 Biosynthesis of unsaturated fatty acids
4977 0.0492 23.9038 0.0502 1 14 Vitamin digestion and absorption
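The P value, Odds Ratio, Exp Count, Count and Size columns in these tables are the usual outputs of a hypergeometric over-representation test. The Python sketch below is an illustration only, not the R code used in the thesis. The function name is hypothetical, and the two inputs `n_selected = 21` and `universe = 6695` are assumptions back-calculated from the table columns (they are not stated in this section). With those assumed values, the sketch reproduces the Exp Count (0.4109) and Odds Ratio (12.1297) reported for GO:0051098 in Table 4-38.

```python
from math import comb


def over_representation(count, set_size, n_selected, universe):
    """Hypergeometric over-representation test for one gene set.

    count      : selected genes annotated to the set (the Count column)
    set_size   : genes in the universe annotated to the set (the Size column)
    n_selected : number of selected (top-ranked) genes  (ASSUMPTION)
    universe   : number of genes in the gene universe   (ASSUMPTION)
    """
    upper = min(set_size, n_selected)
    # P(X >= count) when n_selected genes are drawn without replacement.
    favourable = sum(
        comb(set_size, k) * comb(universe - set_size, n_selected - k)
        for k in range(count, upper + 1)
    )
    p_value = favourable / comb(universe, n_selected)
    # Expected number of selected genes in the set under the null.
    expected = n_selected * set_size / universe
    # Sample odds ratio of the 2x2 table (selected or not, in set or not).
    a = count
    b = n_selected - count
    c = set_size - count
    d = universe - n_selected - c
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return p_value, expected, odds_ratio


# Row GO:0051098 of Table 4-38 (Count 4, Size 131), with the assumed
# n_selected and universe described above.
p_value, expected, odds_ratio = over_representation(4, 131, 21, 6695)
print(round(expected, 4), round(odds_ratio, 4))
```

The p-value from this sketch need not match the table to the last digit, since the exact test variant used in the thesis is not described here.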
Chapter 5 - Cluster Analysis
In this chapter, we first use hierarchical clustering to explore the structure of any
underlying groups in each of the three data sets separately. Both Euclidean distance and
Pearson correlation were used as dissimilarity measures before applying hierarchical clustering.
Secondly, the two mouse data sets were integrated, as they both used the same type of Affymetrix
mouse arrays, and hierarchical clustering was performed to look for patterns that differ from
the dendrograms produced from the individual data sets alone. Lastly, two studies were
integrated after standardisation: the longitudinal mouse study, which compares a high-fat diet
with the control in two tissues, and the cross-sectional human study, which compares expression
in tissue samples from a control group of healthy patients, obese insulin sensitive patients and
patients at two stages of the disease. Hierarchical clustering was then applied to the
samples to look for any potential group structure shared between the mouse model and the
human model.
5.1 Hierarchical Clustering for Mouse Data Sets
In general, no gene filtering was performed before hierarchical clustering of the individual
data sets, i.e., all genes were used to calculate the distance matrices and produce the
dendrograms, because selecting genes on the basis of differential expression can bias the
clustering. For the longitudinal mouse study comparing a high-fat diet with the control group in
two tissues, adipose and muscle, samples from the two tissues were clearly separated in
both dendrograms (see Figures 5-1 and 5-2). The structure of the cluster trees was
reasonably similar whether the Euclidean distance or the correlation distance was used as the
measure of dissimilarity. Overall, the control mice (i.e., those on a standard
low-fat diet) were grouped together within each tissue type. In the adipose tissue group, two
samples from the 42-day high-fat diet group (Ahi42.1 and Ahi42.4) were clustered
together with the control samples in both dendrograms.
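The two dissimilarity measures and the complete-linkage rule used for the dendrograms can be sketched as follows. This is a small, self-contained Python illustration with made-up toy values, not the thesis's R code (the plot labels indicate that R's dist and as.dist(1 - cor(...)) were used with complete linkage). The function names and the sample names A1, A2, M1 and M2 are hypothetical.

```python
from math import sqrt


def euclidean_distance(u, v):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_distance(u, v):
    """One minus the Pearson correlation of two expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return 1.0 - cov / (norm_u * norm_v)


def complete_linkage(samples, distance):
    """Agglomerative clustering; returns the merges as (left, right, height)."""
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                height = max(
                    distance(samples[a], samples[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy data (hypothetical): two "adipose-like" and two "muscle-like" samples.
toy_samples = {
    "A1": [1.0, 2.0, 3.0],
    "A2": [1.0, 2.0, 4.0],
    "M1": [8.0, 9.0, 1.0],
    "M2": [9.0, 9.0, 1.0],
}
euclidean_merges = complete_linkage(toy_samples, euclidean_distance)
correlation_merges = complete_linkage(toy_samples, correlation_distance)
```

In this toy example both measures pair A1 with A2 and M1 with M2 before joining the two pairs, which mirrors the clear tissue separation described above.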
Figure 5-1: Hierarchical clustering of the mouse high-fat diet data based on Euclidean distance.
[Dendrogram: "Mouse High Fat Diet Cluster Dendrogram", complete linkage, Euclidean distance, dist(t(X.HF)); axis: Height.]
Figure 5-2: Hierarchical clustering of the mouse high-fat diet data based on Pearson correlation.
[Dendrogram: "Mouse High Fat Diet Cluster Dendrogram", complete linkage, correlation matrix; axis: Height.]
For the mouse cell line study, little difference can be found in the structure of the two
dendrograms produced based on either the Euclidean distance or the Pearson correlation
distance (see Figures 5-3 and 5-4).
Figure 5-3: Hierarchical clustering of the mouse cell line data based on Euclidean distance.
[Dendrogram: "Mouse Cell Line Cluster Dendrogram", complete linkage, Euclidean distance, dist(t(X.CL)); axis: Height.]
Figure 5-4: Hierarchical clustering of the mouse cell line data based on Pearson correlation.
[Dendrogram: "Mouse Cell Line Cluster Dendrogram (Correlation Matrix)", complete linkage, correlation matrix, as.dist(1 - cor(X.CL)); axis: Height.]
5.2 Hierarchical Clustering for the Human Data Set
In the human data, the normalized expression values were used to create the distance
matrix before hierarchical clustering was applied. Compared with the dendrograms from the two
mouse data sets, neither human dendrogram showed a clear structure of meaningful groups.
For example, in Figure 5-5, patients with type 2 diabetes were clustered
together with insulin sensitive and non-obese healthy patients (i.e., lean controls) when the
Euclidean distance was used as the distance measure; another healthy patient was grouped
with an insulin resistant patient. Similar examples can be seen in the dendrogram
based on the correlation matrix (see Figure 5-6). In general, healthy patients from the control
group were rarely grouped together, and in some cases they fell in the same branches as
obese insulin resistant and diabetic patients.
Figure 5-5: Hierarchical clustering of the human data based on Euclidean distance.
[Dendrogram: "Human Model Cluster Dendrogram (Euclidean Distance)", complete linkage, Euclidean distance, dist(t(X)); axis: Height.]
Figure 5-6: Hierarchical clustering of the human data based on Pearson correlation.
[Dendrogram: "Human Model Cluster Dendrogram (Correlation Matrix)", complete linkage, correlation matrix, as.dist(1 - cor(X)); axis: Height.]
5.3 Hierarchical Clustering for the Combined Mouse Data Sets
Since both mouse data sets used Affymetrix mouse 430_2 arrays, we were
able to combine them according to their probe identifiers. Each data set was standardized
before combining: within each mouse data set, arrays (i.e., samples or mice) were standardized
(or normalized) by dividing each expression measure by the
median value of that array. The dendrogram in Figure 5-7 shows
two distinct groups with no overlap, i.e., samples in the mouse cell line data were
completely separated from those in the longitudinal mouse study involving high-fat diet feeding.
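The standardization and merging step described above can be sketched in Python as follows. This is a minimal illustration, not the thesis's R code: the function names, the dictionary layout, the sample names and the probe identifiers p1 to p3 are all hypothetical, and the expression values are made up.

```python
from statistics import median


def median_standardise(array):
    """Divide every expression value on one array by that array's median."""
    array_median = median(array.values())
    return {probe: value / array_median for probe, value in array.items()}


def combine_by_probe(*datasets):
    """Standardise each array, then keep only probes present on every array.

    Each data set is a dict: sample name -> {probe identifier -> expression}.
    """
    all_arrays = {}
    for dataset in datasets:
        for sample, array in dataset.items():
            all_arrays[sample] = median_standardise(array)
    common = set.intersection(*(set(a) for a in all_arrays.values()))
    return {
        sample: {probe: array[probe] for probe in sorted(common)}
        for sample, array in all_arrays.items()
    }


# Toy values (hypothetical) for one array from each mouse data set.
cell_line = {"CL.a": {"p1": 2.0, "p2": 4.0, "p3": 8.0}}
high_fat = {"HF.1": {"p1": 1.0, "p2": 3.0, "p3": 5.0}}
combined = combine_by_probe(cell_line, high_fat)
```

After this step every array has a median of 1, so the arrays are on a comparable scale before distances are computed. As the dendrogram shows, this does not by itself remove differences between studies.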
Figure 5-7: Hierarchical clustering of the combined mouse data based on Euclidean distance.
[Dendrogram: "Cluster Dendrogram after normalisation", complete linkage, Euclidean distance, dist(t(n.merged)); axis: Height.]
The top 100 differentially expressed (DE) genes from each mouse data set were identified
using the moderated F-statistic, which combines the moderated t-statistics for all the contrasts
into an overall test of significance for each gene (Smyth 2004). We used these top DE genes to
select a subset of the combined data. The same standardization process was carried out, and
hierarchical clustering was then applied to this subset. Again, none of the samples from the
mouse cell line study were grouped with any of the samples from the longitudinal
mouse study (see Figure 5-8).
Figure 5-8: Hierarchical clustering of the combined mouse data based on the top DE genes.
[Dendrogram: "Dendrogram - Top 100 DE genes from Mouse Cell Line and High Fat Diet", complete linkage, Euclidean distance, dist(t(subset2)); axis: Height.]
5.4 Hierarchical Clustering for the Integrated Mouse and Human Data Sets
Based on the differential expression analysis in Chapter 3, we obtained the top
differentially expressed (DE) genes from both the longitudinal mouse study involving
high-fat feeding and the human study, and used them to integrate the two data sets.
Since no genes in the adipose group were detected as differentially
expressed, we focused on the top DE genes found in the muscle tissue group.
The top 50 DE probes in the muscle group of the longitudinal mouse study were selected based
on the moderated F-statistic (Smyth 2004). Very few DE genes were found on the human Agilent
arrays for any of the three contrasts, so we also chose the top 50 probes ranked by the
moderated F-statistic. All the selected top-ranked probes were mapped to their corresponding
gene symbols. Some genes represented on the Agilent human arrays may not be represented
on the mouse arrays and vice versa. We therefore kept only those genes represented on
both the human and mouse arrays, so that expression values could be selected from both
data sets. Where several probes mapped to the same gene symbol, the
average of the probe-level expression values was used as the expression measure for that gene
symbol. Each data set was standardized using the z-transformation: for each array, the array
mean was subtracted and the result was divided by the standard deviation of
the array across all genes. Applying hierarchical clustering with the Euclidean
distance to the arrays produced the dendrogram in Figure 5-9. The hierarchical
clustering split the combined data set into two groups, i.e., the branches containing the
mouse samples were separated from those containing the human samples. There is little
evidence of common structure between the two data sets.
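The integration steps above (averaging probes that share a gene symbol, keeping genes present in both species, and z-transforming each array) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the thesis's R code: all function names, probe identifiers and expression values are hypothetical, gene symbols are assumed to have been matched between species already, and the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def average_by_symbol(probe_values, probe_to_symbol):
    """Average probe-level values that map to the same gene symbol."""
    grouped = {}
    for probe, value in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is not None:  # probes without a symbol are dropped
            grouped.setdefault(symbol, []).append(value)
    return {symbol: mean(values) for symbol, values in grouped.items()}


def shared_genes(mouse_genes, human_genes):
    """Gene symbols available in both data sets."""
    return sorted(set(mouse_genes) & set(human_genes))


def z_transform(gene_values):
    """Centre and scale one array across its genes."""
    values = list(gene_values.values())
    centre, spread = mean(values), stdev(values)  # ASSUMPTION: n - 1 denominator
    return {gene: (value - centre) / spread for gene, value in gene_values.items()}


# Toy example (hypothetical probes and values) for a single array.
probe_values = {"probe_1": 2.0, "probe_2": 4.0, "probe_3": 5.0}
probe_to_symbol = {"probe_1": "ITGB1", "probe_2": "ITGB1", "probe_3": "LAMB2"}
gene_values = average_by_symbol(probe_values, probe_to_symbol)
common = shared_genes(gene_values, {"ITGB1": 0.7, "RFFL": 1.2})
scaled = z_transform({"g1": 1.0, "g2": 2.0, "g3": 3.0})
```

Because each array is centred and scaled separately, platform-wide differences in intensity are removed, but gene-specific differences between platforms and species remain. That is one possible reason the mouse and human arrays still separate in the dendrogram.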
Figure 5-9: Hierarchical clustering of arrays using the combined longitudinal mouse study and
the human model.
[Dendrogram: "Top DE genes combined using Mouse HF and Human model", complete linkage, Euclidean distance, dist(t(X.combined)); axis: Height.]
On the other hand, hierarchical clustering of top-ranked genes is also of interest. The
dendrogram in Figure 5-10 shows the structure of the potential subgroups of the top-ranked
genes in the combined mouse and human study.
Figure 5-10: Hierarchical clustering of the top differentially expressed (DE) genes using the
combined longitudinal mouse study and the human model.
[Dendrogram: "Clustering Top DE genes combined using Mouse HF and Human model", complete linkage, Euclidean distance, dist(X.combined); axis: Height.]
We can also display both pieces of information together in a heat map. Figure 5-11
shows a combined dendrogram clustering both the top genes and the arrays. The mouse samples
can be clearly separated from the human samples for all clusters of genes
except one, in which a mixture of red and yellow cells was observed.
This particular cluster contains the following 17 genes: ARIH1, CNOT4,
DBNDD1, DNM1L, FXYD4, HEXIM1, ITGB1, LAMB2, LRRC23, MLL3, PRG4, RCN3, RFFL,
SLC44A2, SYMPK, TPP2 and ZFAND5. We selected these 17 genes to cluster the
arrays (i.e., samples) in the combined data set (see Figure 5-12). A few interesting
clusters appear in Figure 5-12. For example, four samples from the human data (obese insulin
sensitive, diabetic and insulin resistant) were grouped with one sample from the longitudinal
mouse adipose tissue group (Ahi42.1). Four samples from the lean control group in the human
data were together in one cluster. Most of the other samples from the human data were
separated from the mouse samples.
Figure 5-11: Hierarchical clustering of both the top DE genes and arrays using the combined
longitudinal mouse study and the human model.
[Heat map with dendrograms; one axis is labelled with the arrays (samples) and the other with the gene symbols.]
Figure 5-12: Hierarchical clustering of the arrays based on a selected cluster of DE genes.
[Dendrogram: "Cluster Dendrogram - 17 DE genes combined using Mouse HF and Human model", complete linkage, Euclidean distance, dist(t(sub.combined)); axis: Height.]
Chapter 6 - Conclusions
6.1 Addressing the Research Questions
We first summarise the findings in order to address all the research questions listed in
Chapter 1. The findings are given in the following four sections: differentially expressed (DE)
genes, cross-species gene set tests, over-representation analysis and the selected gene set of
interest.
Differentially Expressed Genes
1. Which genes are differentially expressed in each condition in the human data, relative to
healthy controls?
We found 12 DE genes in the contrast between type 2 diabetic patients and healthy controls
(FDR<0.1) and 2 different DE probes in the contrast between insulin sensitive patients
and healthy controls. No DE genes were detected in patients with insulin resistance
relative to healthy controls. The false discovery rate (FDR) was controlled at 0.1
because very few DE genes were found in the human data.
2. Which genes are differentially expressed in each treatment group in the mouse data
involving the comparison of a high-fat diet to controls?
In the muscle tissue group, 1146 probes were differentially expressed (FDR<0.01) in the
contrast between mice on 42 days of a high-fat diet and controls. A threshold of 0.01 was
chosen to control the false discovery rate given the large number of DE probes found in this
case. However, only 3 and 20 DE probes (FDR<0.01) were found in the contrasts between
mice on 5 days of a high-fat diet versus controls and on 14 days of a high-fat diet versus
controls, respectively. This suggests that high-fat feeding over a longer period,
such as 6 weeks, has a much greater impact on gene expression in muscle tissue.
No differentially expressed genes were found in any of the contrasts between the three time
points and controls in the adipose tissue group, even when using FDR<0.1.
3. Which genes are differentially expressed in each treatment group in the mouse cell line
study?
Using the moderated F-statistic, 15933 probes were differentially expressed (FDR<0.01) in
at least one of the contrasts between the seven treatment groups and the control.
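The FDR thresholds quoted in the answers above apply to adjusted p-values. Assuming the Benjamini-Hochberg procedure was used (Benjamini & Hochberg 1995 appears in the reference list, but this section does not name the adjustment method), the calculation can be sketched in Python as follows. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy p-values (hypothetical) for four probes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
# Probes called DE at FDR < 0.05 and at the stricter FDR < 0.01.
de_at_005 = [i for i, p in enumerate(adjusted) if p < 0.05]
de_at_001 = [i for i, p in enumerate(adjusted) if p < 0.01]
```

A looser threshold such as 0.1 admits more probes at the cost of a higher expected proportion of false discoveries. This is the trade-off behind using FDR<0.1 for the human data and FDR<0.01 for the mouse data.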
Cross-species Gene Set Tests
4. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese patients with insulin resistance
in the human data relative to healthy controls?
We found 3082 GO terms in the human arrays after mapping the DE genes from the mouse
muscle tissue group to their corresponding GO terms in the human Agilent chip. Gene sets
defined by 420 GO terms were significantly up-regulated in obese patients with insulin
resistance relative to healthy controls using the mean-rank gene set test (FDR<0.05). Gene
sets associated with 121 GO terms were significantly down-regulated whereas 128 GO
terms were significantly mixed-regulated. A summary of the number of significant GO terms
with a size of no more than 500 probes is given in Table 6-1. However, the rotation gene set
test returned no DE gene sets (FDR<0.1).
Table 6-1: Summary of the number of significant GO terms in the contrast between insulin
resistant patients and controls
Mean-Rank Gene Set Test    No. of GO terms containing no more than 500 probes    Total No. of GO terms
Up-regulated               378                                                   420
Down-regulated             115                                                   121
Mixed-regulated            127                                                   128
The DE genes in the muscle tissue group were mapped to 188 KEGG pathways. 55 KEGG
pathways were significantly up-regulated using the mean-rank gene set test (FDR<0.05).
Gene sets in 19 KEGG pathways were significantly down-regulated and 4 KEGG pathways
were significantly mixed-regulated. Using the rotation gene set test, no KEGG pathways
were found formally differentially expressed.
5. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese patients with type 2 diabetes in
the human data relative to healthy controls?
In the contrast between type 2 diabetic patients and healthy controls, 488 GO terms were
significantly up-regulated (FDR<0.05) using the mean-rank gene set test. Gene sets defined
by 120 GO terms were significantly down-regulated and 182 GO terms were differentially
expressed regardless of the direction. Table 6-2 shows a summary of the number of
significant GO terms with a size of no more than 500 probes. The rotation gene set test led
to no formally significant gene sets.
Table 6-2: Summary of the number of significant GO terms in the contrast between diabetic
patients and controls
Mean-Rank Gene Set Test    No. of GO terms containing no more than 500 probes    Total No. of GO terms
Up-regulated               428                                                   488
Down-regulated             111                                                   120
Mixed-regulated            171                                                   182
Based on the results of the mean-rank gene set test, 56 KEGG pathways were significantly
up-regulated, 17 KEGG pathways were significantly down-regulated and 20 KEGG
pathways were significantly mixed-regulated. When the rotation gene set test was applied, 5
KEGG pathways were significantly mixed-regulated, all of which were mixed-regulated in the
mean-rank gene set test.
6. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese insulin sensitive patients in the
human data relative to healthy controls?
Using the mean-rank gene set test, 428 GO terms were significantly up-regulated
(FDR<0.05) in the contrast between insulin sensitive patients and healthy controls. Gene
sets associated with 97 GO terms were significantly down-regulated and 192 GO terms
were significantly mixed-regulated. The number of significant GO terms with a size of no
more than 500 probes can be found in Table 6-3. The rotation gene set test found no
significant gene sets.
Table 6-3: Summary of the number of significant GO terms in the contrast between insulin
sensitive patients and controls
Mean-Rank Gene Set Test    No. of GO terms containing no more than 500 probes    Total No. of GO terms
Up-regulated               359                                                   428
Down-regulated             92                                                    97
Mixed-regulated            177                                                   192
We found 70 significantly up-regulated KEGG pathways using the mean-rank gene set test.
Among the top 30 KEGG pathways, 12 were related to different types of cancer. The
rotation gene set test found 20 up-regulated KEGG pathways and 9 of them were cancer
related pathways. These 9 cancer pathways were all listed as significant using the mean-
rank gene set test.
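The idea behind a mean-rank gene set test can be sketched as a Wilcoxon rank-sum comparison: the ranks of the gene-level statistics inside the set are compared with the ranks of all other genes. The Python code below is a simplified illustration of that idea for the "up" direction, using a normal approximation without tie or correlation corrections. It is not the implementation used in the thesis, and the function names and toy statistics are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mean_rank_test_up(gene_stats, in_set):
    """One-sided test: are statistics in the set ranked higher than the rest?"""
    n = len(gene_stats)
    ranks = average_ranks(gene_stats)
    n_set = sum(1 for flag in in_set if flag)
    n_rest = n - n_set
    rank_sum = sum(rank for rank, flag in zip(ranks, in_set) if flag)
    u = rank_sum - n_set * (n_set + 1) / 2
    mean_u = n_set * n_rest / 2
    sd_u = sqrt(n_set * n_rest * (n + 1) / 12)
    z = (u - mean_u) / sd_u
    p_up = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
    return z, p_up


# Toy gene-level statistics (hypothetical); the set holds the three largest.
toy_stats = [float(i) for i in range(1, 11)]
toy_set = [False] * 7 + [True] * 3
z_up, p_up = mean_rank_test_up(toy_stats, toy_set)
```

A "down" test reverses the sign of the statistics, and a "mixed" test would typically rank absolute values instead. The exact options used in the thesis are not stated in this section.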
Over-representation Analysis
7. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in the mouse data with a high-fat diet?
It was found that 206 GO terms (Biological Process) and 16 KEGG pathways were over-
represented in the list of DE genes in the mouse data involving a high-fat diet. Among the 16
KEGG pathways, 2 were cancer related pathways, i.e., Pancreatic cancer and Renal cell
carcinoma.
8. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese patients with insulin resistance relative to healthy controls?
We found 12 GO terms (Biological Process) and one KEGG pathway (i.e.,
controls. The two GO terms are
unadjusted p-values whereas Camera appears to be the most conservative one, finding
almost no DE gene sets in this current situation.
6.2 Discussion
In our lists of gene set test results, we identify some significant GO terms or KEGG
pathways that have been previously discussed and confirmed by a number of research teams.
Wnt signaling pathway
Several key components of the Wnt signaling pathway have been implicated in
metabolic homeostasis and the development of type 2 diabetes (Ip, Chiang & Jin 2012). In
our results, both the Wnt signaling pathway (KEGG: 4310) and the canonical Wnt receptor
signaling pathway (GO:0060070) were significantly up-regulated in two contrasts (obese insulin
resistant patients versus healthy controls and diabetic patients versus healthy controls).
p53 signaling pathway
p53 activation is induced by a number of stress signals, including DNA damage,
oxidative stress and activated oncogenes. Minamino et al. (2009) found that p53 expression in
adipose tissue is crucially involved in the development of insulin resistance, which underlies
age-related cardiovascular and metabolic disorders. We found that the p53 signaling
pathway (KEGG: 4115) was significantly up-regulated in diabetic patients versus healthy
controls. It was also significantly mixed-regulated in the comparison of obese insulin sensitive
patients versus controls.
Adipocytokine signaling pathway
The adipocytokine signaling pathway, which is related to insulin resistance, has been reported
to be significantly up-regulated in type 2 diabetic patients (Manoel-Caetano et al. 2012). Our
results show that the adipocytokine signaling pathway (KEGG: 4920) is over-represented in the
list of DE genes in the longitudinal mouse study involving a high-fat diet.
Oxidative phosphorylation
According to a gene expression analysis by a research collaboration in
Japan, the oxidative phosphorylation (OXPHOS) pathway may predict the existence of diabetes
because it was down-regulated in the peripheral blood mononuclear cells of patients with type 2
diabetes (Takamura et al. 2007). The down-regulation of this pathway was also detected in
another study (Manoel-Caetano et al. 2012). Interestingly, we found the oxidative
phosphorylation pathway (KEGG: 190) significantly down-regulated in all three contrasts (i.e.,
obese insulin sensitive, insulin resistant, and diabetic patients versus healthy controls) in the
human data.
Citrate cycle (TCA cycle)
Some studies have demonstrated an important role for cyclic pathways of pyruvate
metabolism (the pyruvate/malate, pyruvate/citrate, and pyruvate/isocitrate cycles) in the control
of insulin secretion (Jensen et al. 2008). The citrate cycle (TCA cycle) pathway (KEGG: 20) was
reported as the most significant KEGG pathway by Manoel-Caetano et al. (2012). In our results,
the citrate cycle (TCA cycle) pathway was significantly down-regulated in insulin resistant
patients versus controls as well as in diabetic patients versus controls.
6.3 Comments on the Experimental Design
The longitudinal mouse study used inbred mice for both control and treatment groups, which
limits genetic variation between animals. Hence a large number of DE genes were detected even
with a small sample size. However, the details of the experimental design of the human study
are somewhat unclear, and we expect to observe more genetic variation in humans. Given the
limited number of samples, this greater variation could reduce the chances of finding true DE
genes in any of the disease conditions relative to healthy controls in the human data.
6.4 Further Work
Because of the nature of this experiment, biologists need to investigate these results
further in order to make valid biological inferences. Simulation studies are also needed to
compare the advantages and disadvantages of the three gene set testing methods: the
mean-rank gene set test, Roast and Camera.
In the gene set tests, we focused on significant Gene Ontology (GO) terms with no more
than 500 probes because of the parent-child hierarchical structure of GO. Other methods of
addressing the GO hierarchy problem could be explored further.
For KEGG pathways, the KEGG.db package is now considered to be deprecated and
future versions of Bioconductor may not have it available. Hence, other possible alternatives
could be explored such as the reactome.db package.
References
Australia's Health 2010, 12th Biennial Health Report, Australian Institute of Health and Welfare, Canberra. Benjamini, Y, Drai, D, Elmer, G, Kafkafi, N & Golani, I 2001, 'Controlling the false discovery rate in behavior genetics research', Behavioural Brain Research, vol. 125, no. 1, pp. 279-84. Benjamini, Y & Hochberg, Y 1995, 'Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing', Journal of the Royal Statistical Society. Series B (Methodological), vol. 57, no. 1, pp. 289-300. Carter, ME & Brunet, A 2007, 'FOXO transcription factors', Current biology : CB, vol. 17, no. 4, pp. R113-R4. Copeland, NG, Jenkins, NA & O'Brien, SJ 2002, 'Mmu 16: Comparative Genomic Highlights', Science, vol. 296, no. 5573, pp. 1617-8. de Wilde, J, Mohren, R, van den Berg, S, Boekschoten, M, Dijk, KW-V, de Groot, P, Muller, M, Mariman, E & Smit, E 2008, 'Short-term high fat-feeding results in morphological and metabolic adaptations in the skeletal muscle of C57BL/6J mice', Physiological Genomics, vol. 32, no. 3, pp. 360-9. Falcon, S & Gentleman, R 2007, 'Using GOstats to test gene lists for GO term association', Bioinformatics, vol. 23, no. 2, pp. 257-8. Gentleman, R, Ding, B, Dudoit, S & Ibrahim, J 2005, 'Distance Measures in DNA Microarray Data Analysis', in R Gentleman, V Carey, W Huber, R Irizarry & S Dudoit (eds), Bioinformatics
and Computational Biology Solutions Using R and Bioconductor, Springer New York, pp. 189-208. Goeman, JJ & Bühlmann, P 2007, 'Analyzing gene expression data in terms of gene sets: methodological issues', Bioinformatics, vol. 23, no. 8, pp. 980-7. Hayashi, Y, Kajimoto, K, Iida, S, Sato, Y, Mizufune, S, Kaji, N, Kamiya, H, Baba, Y & Harashima, H 2010, 'DNA microarray analysis of whole blood cells and insulin-sensitive tissues reveals the usefulness of blood RNA profiling as a source of markers for predicting type 2 diabetes', Biological & pharmaceutical bulletin, vol. 33, no. 6, pp. 1033-42. Houstis, N 2006, 'Reactive oxygen species have a causal role in multiple forms of insulin resistance', Nature, vol. 440, no. 7086, pp. 944-8.
98
Ip, W, Chiang, YT & Jin, T 2012, 'The involvement of the wnt signaling pathway and TCF7L2 in diabetes mellitus: The current understanding, dispute, and perspective', Cell Biosci, vol. 2, no. 1, p. 28.
Jensen, MV, Joseph, JW, Ronnebaum, SM, Burgess, SC, Sherry, AD & Newgard, CB 2008, 'Metabolic cycling in control of glucose-stimulated insulin secretion', Am J Physiol Endocrinol Metab, vol. 295, no. 6, pp. E1287-97.
Manoel-Caetano, FS, Xavier, DJ, Evangelista, AF, Takahashi, P, Collares, CV, Puthier, D, Foss-Freitas, MC, Foss, MC, Donadi, EA, Passos, GA & Sakamoto-Hojo, ET 2012, 'Gene expression profiles displayed by peripheral blood mononuclear cells from patients with type 2 diabetes mellitus focusing on biological processes implicated on the pathogenesis of the disease', Gene, vol. 511, no. 2, pp. 151-60.
Minamino, T, Orimo, M, Shimizu, I, Kunieda, T, Yokoyama, M, Ito, T, Nojima, A, Nabetani, A, Oike, Y, Matsubara, H, Ishikawa, F & Komuro, I 2009, 'A crucial role for adipose tissue p53 in the regulation of insulin resistance', Nat Med, vol. 15, no. 9, pp. 1082-7.
Smyth, GK 2004, 'Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments', Statistical applications in genetics and molecular biology, vol. 3, no. 1, pp. 1-25.
Smyth, GK & Speed, T 2003, 'Normalization of cDNA microarray data', Methods (San Diego, Calif.), vol. 31, no. 4, pp. 265-73.
Stenbit, AE, Tsao, T-S, Li, J, Burcelin, R, Geenen, DL, Factor, SM, Houseknecht, K, Katz, EB & Charron, MJ 1997, 'GLUT4 heterozygous knockout mice develop muscle insulin resistance and diabetes', Nat Med, vol. 3, no. 10, pp. 1096-101.
Subramanian, A, Tamayo, P, Mootha, VK, Mukherjee, S, Ebert, BL, Gillette, MA, Paulovich, A, Pomeroy, SL, Golub, TR, Lander, ES & Mesirov, JP 2005, 'Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles', Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 43, pp. 15545-50.
Takamura, T, Honda, M, Sakai, Y, Ando, H, Shimizu, A, Ota, T, Sakurai, M, Misu, H, Kurita, S, Matsuzawa-Nagata, N, Uchikata, M, Nakamura, S, Matoba, R, Tanino, M, Matsubara, K-i & Kaneko, S 2007, 'Gene expression profiles in peripheral blood mononuclear cells reflect the pathophysiology of type 2 diabetes', Biochemical and Biophysical Research Communications, vol. 361, no. 2, pp. 379-84.
Tumurkhuu, G, Koide, N, Dagvadorj, J, Hassan, F, Islam, S, Naiki, Y, Mori, I, Yoshida, T & Yokochi, T 2007, 'MnTBAP, a synthetic metalloporphyrin, inhibits production of tumor necrosis factor…
Weyer, C, Bogardus, C, Mott, DM & Pratley, RE 1999, 'The natural history of insulin secretory dysfunction and insulin resistance in the pathogenesis of type 2 diabetes mellitus', The Journal of clinical investigation, vol. 104, no. 6, pp. 787-94.
Wu, D, Lim, E, Vaillant, F, Asselin-Labat, ML, Visvader, JE & Smyth, GK 2010, 'ROAST: rotation gene set tests for complex microarray experiments', Bioinformatics, vol. 26, no. 17, pp. 2176-82.
Wu, D & Smyth, GK 2012, 'Camera: a competitive gene set test accounting for inter-gene correlation', Nucleic Acids Research, no. Journal Article.