
NOVEL ANALYSIS OF MULTI-SPECIES TYPE 2 DIABETES

FROM GENE EXPRESSION DATA

CATHERINE ZHENG

BEc, GradDipEd, PostGradDipSc

A THESIS

submitted in total fulfillment of the requirements for the degree

MASTER OF SCIENCE (HONOURS)

School of Computing, Engineering and Mathematics

UNIVERSITY OF WESTERN SYDNEY

Sydney, Australia

Statement of Authentication

The work presented in this thesis is, to the best of my knowledge and belief, original except as

acknowledged in the text. I hereby declare that I have not submitted this material, either in full or

in part, for a degree at this or any other institution.

Signature

Table of Contents

List of Figures

List of Tables

Acknowledgements

Abstract

Chapter 1 - Introduction

1.1 Background on Type 2 Diabetes

1.2 A Brief Introduction to Gene Expression and Microarrays

1.3 Highlights of Earlier Research

1.4 Research Case

1.4.1 Research Questions

1.4.2 Goals

Chapter 2 - Research Methods

2.1 Experimental Design

2.2 Data Pre-processing

2.3 Analysis of Differential Gene Expression

2.4 Cluster Analysis

2.5 Functional Annotation and Pathway Analysis

2.5.1 Gene Set Tests

2.5.2 Hypergeometric Test for Gene Set Enrichment Analysis

Chapter 3 - Analysis of Differential Expression

3.1 Data Pre-processing and Quality Assessment

3.1.1 Affymetrix Mouse Microarrays

3.1.2 Agilent Human Microarrays

3.2 Analysis of Differential Expression

3.2.1 Mouse Microarrays

3.2.2 Human Microarrays

Chapter 4 - Pathway Analysis

4.1 Gene Set Tests

4.1.1 Competitive Gene Set Test

4.1.1.1 Gene Ontology (GO) Terms

4.1.1.2 KEGG Pathways

4.1.2 Self-Contained Gene Set Test

4.1.3 Comparison of Three Gene Set Tests for Insulin Related GO Terms

4.1.4 Comparison of Three Gene Set Tests for Glucose Related GO Terms

4.1.5 Comparison of Three Gene Set Tests for the FOXO Gene Set

4.2 Hypergeometric Test for Gene Set Enrichment Analysis

4.2.1 The longitudinal mouse study involving the comparison of a high-fat diet to the control

4.2.2 The cross-sectional human study comparing expression in tissue samples for a control group of healthy patients and obese patients

Chapter 5 - Cluster Analysis

5.1 Hierarchical Clustering for Mouse Data Sets

5.2 Hierarchical Clustering for the Human Data Set

5.3 Hierarchical Clustering for the Combined Mouse Data Sets

5.4 Hierarchical Clustering for the Integrated Mouse and Human Data Sets

Chapter 6 - Conclusions

6.1 Addressing the Research Questions

6.2 Discussion

6.3 Comments on the Experimental Design

6.4 Further Work

References

List of Figures

Figure 3-1: Density plot of the longitudinal mouse study involving the comparison of a high-fat diet to control in two tissues.

Figure 3-2: Boxplots of the longitudinal mouse study involving the comparison of a high-fat diet to control in two tissues.

Figure 3-3: Density plot of the mouse cell line study of exposure to various substances including insulin.

Figure 3-4: Boxplots of the mouse cell line study of exposure to various substances including insulin.

Figure 3-5: Density plot and boxplots of the Rlog2 values in the human data set.

Figure 3-6: Boxplots of the M values in the human data set after loess normalization.

Figure 3-7: Volcano plot of top DE genes between 42 days of a high-fat diet and the control in the mouse muscle tissue group.

Figure 3-8: Volcano plot of top ranked genes (non DE) between 42 days of a high-fat diet and the control in the mouse adipose tissue group.

Figure 3-9: Boxplots of the top ranked genes in the contrast of insulin resistant versus lean control.

Figure 3-10: Boxplots of the top ranked genes in the contrast of diabetic versus lean control.

Figure 3-11: Boxplots for 2 probes corresponding to gene SLC2A4 in the human data.

Figure 4-1: Boxplots of gene expression levels (M values) of 62 common probes in both "Pancreatic cancer" and "Colorectal cancer" KEGG pathways on the Agilent human array

Figure 4-2: […] "Pancreatic cancer" and "Endometrial cancer" KEGG pathways on the Agilent human array

Figure 4-3: Scatterplot matrix of down-regulated unadjusted p-values from three methods

Figure 4-4: Scatterplot matrix of up-regulated unadjusted p-values from three methods

Figure 4-5: Scatterplot matrix of down-regulated unadjusted p-values from three methods in the contrast between diabetic and lean control

Figure 5-1: Hierarchical clustering of the mouse high-fat diet data based on Euclidean distance.

Figure 5-2: Hierarchical clustering of the mouse high-fat diet data based on Pearson correlation

Figure 5-3: Hierarchical clustering of the mouse cell line data based on Euclidean distance.

Figure 5-4: Hierarchical clustering of the mouse cell line data based on Pearson correlation.

Figure 5-5: Hierarchical clustering of the human data based on Euclidean distance.

Figure 5-6: Hierarchical clustering of the human data based on Pearson correlation.

Figure 5-7: Hierarchical clustering of the combined mouse data based on Euclidean distance.

Figure 5-8: Hierarchical clustering of the combined mouse data based on the top DE genes

Figure 5-9: Hierarchical clustering of arrays using the combined longitudinal mouse study and the human model.

Figure 5-10: Hierarchical clustering of the top differential expressed (DE) genes using the combined longitudinal mouse study and the human model.

Figure 5-11: Hierarchical clustering of both the top DE genes and arrays using the combined longitudinal mouse study and the human model.

Figure 5-12: Hierarchical clustering of the arrays based on a selected cluster of DE genes.

List of Tables

Table 2-1: Details of the longitudinal mouse study

Table 2-2: Details of the cross-sectional human study

Table 2-3: Details of the mouse cell line study

Table 3-1: Differentially expressed genes found in the muscle tissue group

Table 3-2: Top 10 DE genes between 42 days of a high-fat diet and control

Table 3-3: Top 10 DE genes between 14 days of a high-fat diet and control

Table 3-4: Top 10 genes with large fold changes between 42 days of a high-fat diet and the control in the mouse adipose tissue group

Table 3-5: Top 10 differentially expressed probes in the mouse cell line data

Table 3-6: Differentially expressed genes found in the human data

Table 3-7: Top 10 ranked genes for the contrast of insulin resistant versus lean control in the human data

Table 3-8: Top 10 ranked genes for the contrast of diabetic versus lean control in the human data

Table 4-1: Top 30 significantly up-regulated GO terms in the contrast between insulin resistant patients and lean control

Table 4-2: Top 30 significantly down-regulated GO terms in the contrast between insulin resistant patients and lean control

Table 4-3: Top 30 significantly mixed-regulated GO terms in the contrast between insulin resistant patients and lean control

Table 4-4: Top 30 significantly up-regulated GO terms in the contrast between diabetic patients and lean control

Table 4-5: Top 30 significantly down-regulated GO terms in the contrast between diabetic […]

Table 4-6: Top 30 significantly mixed-regulated GO terms in the contrast between diabetic […]

Table 4-7: Top 10 significantly up-regulated GO terms in the contrast between insulin sensitive […]

[…] sensitive patients and lean control

Table 4-10: 55 significantly up-regulated KEGG pathways in the contrast between insulin resistant and lean control

Table 4-11: 19 significantly down-regulated KEGG pathways in the contrast between insulin […]

Table 4-12: 9 significantly mixed-regulated KEGG pathways in the contrast between insulin […]

Table 4-13: 56 significantly up-regulated KEGG pathways in the contrast between diabetic and lean control

Table 4-14: 35 significantly up-regulated KEGG pathways in both insulin resistant and diabetic groups

Table 4-15: 17 significantly down-regulated KEGG pathways in the contrast between diabetic and lean control

Table 4-16: 13 significantly down-regulated KEGG pathways in both insulin resistant and diabetic groups

Table 4-17: 20 significantly mixed-regulated KEGG pathways in the contrast between diabetic and lean control

Table 4-18: 5 significantly mixed-regulated KEGG pathways in both insulin resistant and diabetic groups

Table 4-19: Top 30 significantly up-regulated KEGG pathways in the contrast between insulin sensitive and lean control

Table 4-22: Rotation Gene Set Test - Top 8 mixed-regulated KEGG pathways in the contrast between diabetic and lean control

Table 4-23: Rotation Gene Set Test - Top 20 up-regulated in the contrast between insulin […]

Table 4-24: Comparison of down-regulated insulin related GO terms in the contrast between insulin resistance and lean control

Table 4-25: Comparison of up-regulated insulin related GO terms in the contrast between insulin resistance and lean control

[…] diabetic and lean control

Table 4-27: Comparison of up-regulated insulin related GO terms in the contrast between […]

Table 4-29: Comparison of down-regulated glucose related GO terms in the contrast between insulin resistance and lean control

Table 4-31: Comparison of up-regulated glucose related GO terms in the contrast between […]

Table 4-32: Summary of three gene set tests for the FOXO gene set in three contrasts

Table 4-33: Top 30 Over-represented GO (BP) terms after 42 days of high-fat diet in the longitudinal mouse study

Table 4-34: 16 over-represented KEGG pathways after 42 days of high-fat diet in the longitudinal mouse study

Table 4-35: Top 12 Over-represented GO terms in the contrast between insulin resistant patients and the control

Table 4-36: Top 10 Over-represented KEGG Pathways in the contrast between insulin resistant patients and the control

Table 4-37: Top 15 Over-represented GO terms in the contrast between diabetic patients and the control

Table 4-38: 12 Over-represented GO terms in the contrast between insulin sensitive patients and the control

Table 4-39: 4 Over-represented KEGG pathways in the contrast between insulin sensitive patients and the control

Table 6-1: Summary of the number of significant GO terms in the contrast between insulin resistant patients and controls

Table 6-2: Summary of the number of significant GO terms in the contrast between diabetic patients and controls

[…] sensitive patients and controls

Acknowledgements

Firstly, I would like to say

Abstract

Purpose

The incidence of type 2 diabetes is reaching epidemic levels. Today type 2 diabetes is

the most common form of diabetes, accounting for 85 to 90 percent of diabetes cases. The

James Lab at Garvan Institute for Medical Research are interested in gene expression in insulin

resistance and diabetes. They have provided three gene expression data sets: a longitudinal

mouse study involving the comparison of a high-fat diet to a standard diet with gene expression

in two tissues, a mouse cell line study and a cross-sectional human study. The main goals of

this research are to identify differentially expressed genes in both the mouse and human data,

to compare genomic expression patterns across species (human and mouse), and to focus on

pathway analysis for detecting differential expression in predefined gene sets based on Gene

Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.

Methods

The three data sets are normalized in order to remove experimental effects arising from the

microarray technology. Linear models are then fitted to the normalized data using the limma

package to identify genes undergoing differential expression. Each gene has its own expression

profile and genes with similar profiles can be grouped together. We attempt to use the

data sets together to cluster samples based on gene profiles. In reality, biological processes are

complex, with many molecules working together. The goal of annotating the genome is to link

all information associated with gene products in order to learn how pathways function in the

biological system. In situations where long lists of genes are found to be differentially

expressed, we consider focusing on the analysis of gene sets because it is more sensible to

investigate gene sets that are functionally related based on prior biological knowledge or

experiments. We explore the potentially interesting gene sets using the Gene Ontology (GO)

database and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Differentially

expressed genes detected in the mouse data are mapped to their corresponding gene sets based

on the Gene Ontology terms and KEGG pathways. Competitive and self-contained gene set

tests (the mean-rank gene set test and the rotation gene set test) are performed for each

comparison in the human data.
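The idea behind a mean-rank gene set test can be sketched in a few lines. The example below is a simplified illustration only, assuming a Wilcoxon-style rank-sum comparison with a normal approximation and no tie or continuity correction; it is not the limma implementation used in the thesis. The function names and the toy statistics are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_rank_test_up(statistics, in_set):
    """Compare the ranks of genes in a set against the remaining genes.

    statistics: one per-gene statistic (e.g. a moderated t-statistic).
    in_set:     booleans of the same length, True for genes in the set.
    Returns (mean rank of the set, one-sided p-value for "set ranks higher").
    Assumption: normal approximation, no tie or continuity correction.
    """
    n1 = sum(1 for flag in in_set if flag)
    n2 = len(statistics) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both the gene set and its complement must be non-empty")
    ranks = average_ranks(statistics)
    rank_sum = sum(r for r, flag in zip(ranks, in_set) if flag)
    u = rank_sum - n1 * (n1 + 1) / 2          # Mann-Whitney U for the set
    mean_u = n1 * n2 / 2                      # expectation under the null
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_up = 0.5 * erfc(z / sqrt(2))            # upper-tail normal probability
    return rank_sum / n1, p_up


# Toy data: ten genes, the three with the largest statistics form the set.
toy_stats = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_set = [False] * 7 + [True] * 3
toy_mean_rank, toy_p = mean_rank_test_up(toy_stats, toy_set)
```

This is a competitive test in the sense that the genes in the set are compared with the genes outside it; a self-contained test such as the rotation gene set test instead asks whether the set itself shows any differential expression.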

The correlation adjusted mean-rank gene set test is included in testing insulin or glucose

related GO terms and KEGG pathways. To test if any GO terms (Biological Process) or KEGG

pathways are over-represented in a list of differentially expressed genes in the mouse or human

data sets, we carry out the hypergeometric test.
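As a rough illustration of the over-representation test mentioned above, the following sketch computes an upper-tail hypergeometric p-value: the probability of seeing at least the observed number of differentially expressed genes in a gene set if the genes were drawn at random from the array. The function name, argument names and toy counts are hypothetical and are not taken from the thesis.

```python
from math import comb


def hypergeom_enrichment_pvalue(universe_size, set_size, de_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe_size, set_size, de_size).

    universe_size: number of genes on the array (the background)
    set_size:      number of background genes annotated to the GO term or pathway
    de_size:       number of genes in the differentially expressed list
    overlap:       number of DE genes that are annotated to the term
    """
    if not (0 <= set_size <= universe_size and 0 <= de_size <= universe_size):
        raise ValueError("set_size and de_size must lie between 0 and universe_size")
    max_overlap = min(set_size, de_size)
    total = comb(universe_size, de_size)  # all ways of choosing the DE list
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, de_size - k)
        for k in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total


# Toy counts (hypothetical): 20 genes on the array, 5 in the gene set,
# 5 called differentially expressed, 3 of which fall in the gene set.
toy_p = hypergeom_enrichment_pvalue(20, 5, 5, 3)
```

A small p-value suggests the gene set contains more differentially expressed genes than expected by chance. In practice the p-values across many GO terms or pathways would also need a multiple-testing adjustment.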

Results

We identify a large number of differentially expressed genes in the muscle tissue from

the longitudinal mouse study. The cross-species gene set tests have revealed significant GO

terms and KEGG pathways in each condition of obese patients relative to healthy controls. We

compare the results produced by the mean-rank gene set test and the rotation gene set test.

Significant insulin or glucose related gene sets are found using three gene set testing methods

and the results are compared. The FOXO gene set is found to be significantly up-regulated in

two contrasts in the human data.

Chapter 1 - Introduction

In this chapter we will give a brief overview of the onset of type 2 diabetes globally and

also the health issues related to type 2 diabetes that we are currently facing in Australia. The

research case as well as a list of the research questions is explained and the goals are set.

Highlights of some research work completed in recent years are discussed to outline the

achievements and the gaps in the analysis of gene expression data in the areas related to type

2 diabetes.

1.1 Background on Type 2 Diabetes

Type 2 diabetes is a chronic disease and the pathogenesis of this disease involves

metabolic abnormalities in both insulin action and insulin secretion (Weyer et al. 1999). Insulin is

produced by the pancreas and plays a significant role in converting glucose into energy in the

metabolic system.

A brief explanation of type 2 diabetes and the relationship between type 2 diabetes and

insulin resistance given by the James Lab at Garvan Institute for Medical Research is as

follows:

“In Type 2 diabetes there is a relative deficiency of insulin - that is the body still produces

insulin but is unable to produce insulin in sufficient quantities to hold blood sugar levels within

normal limits. Increasingly in Type 2 diabetes this is due to insulin resistance - the inability of

the body's tissue to respond to insulin in a normal way.”

(http://www.jameslab.com.au/WhatIsDiabetes.shtml)

The incidence of type 2 diabetes is reaching epidemic levels. Today type 2 diabetes is

the most common form of diabetes, accounting for 85 to 90 percent of diabetes cases. Further,

a greater number of younger people are getting type 2 diabetes whereas previously it mainly

affected older adults. Diabetes is Australia […]

remain undiagnosed. By 2031 it is estimated that 3.3 million Australians will have type 2

diabetes (Vos et al. 2004). The burden of type 2 diabetes is increasing and it is expected to

become the leading cause of disease burden by 2023 (AIHW 2010).

1.2 A Brief Introduction to Gene Expression and Microarrays

Deoxyribonucleic acid (DNA) carries the genetic instructions for producing proteins in

living organisms. Proteins are essential parts of organisms, providing function and regulation in

cells. Genes are segments of DNA responsible for making proteins. The process of converting

genes to proteins is described by the central dogma of molecular biology: genes are

first transcribed into messenger ribonucleic acid (mRNA), and mRNA is translated into a chain of

amino acids which, after further processing, forms a protein. Genes are considered to be

expressed within a cell or organism when they produce RNA, some of which is translated into

proteins. It is important to quantify the amount of protein produced when genes are expressed,

but measuring protein levels directly is difficult. Therefore, mRNA levels are used

instead to determine the levels of gene expression.

To measure the levels of mRNA for tens of thousands of genes in a single

experiment, microarray technology was introduced and developed into a number of platforms. A

microarray is a solid surface, typically a silicon or glass chip, on which probe sequences from

different genes are fixed. Microarray technology allows scientists to measure the

expression levels of a large number of genes under different disease states or experimental

conditions.

There are two main categories of microarrays based on the type of data produced:

one-channel and two-channel microarrays. Affymetrix arrays are representative of one-channel

microarrays, and a number of manufacturers, such as Agilent, specialise in two-channel arrays.

The microarrays used in this study are Affymetrix one-channel arrays and Agilent

two-channel arrays. The Affymetrix one-channel platform hybridizes one sample per chip. For

two-channel arrays, two samples are applied to each array. Our two-channel arrays use a

reference design: a common reference is applied to all arrays, and each sample is compared to

the common reference.
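To make the reference design concrete, the sketch below computes the log-ratio (commonly called the M value) and the average log-intensity (the A value) for a single probe on a two-channel array. The function names and the example intensities are hypothetical. Treating the sample as the numerator and the common reference as the denominator is an assumption made here for illustration.

```python
from math import log2


def m_value(sample_intensity, reference_intensity):
    """Log2 ratio of the sample channel to the reference channel."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return log2(sample_intensity / reference_intensity)


def a_value(sample_intensity, reference_intensity):
    """Average log2 intensity of the two channels."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return 0.5 * (log2(sample_intensity) + log2(reference_intensity))


# Made-up intensities for one probe: sample channel 1024, reference channel 256.
toy_m = m_value(1024, 256)
toy_a = a_value(1024, 256)
```

Because every array shares the same reference, M values from different arrays can be compared with each other: a difference in M between two samples estimates their log2 fold change for that probe.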

1.3 Highlights of Earlier Research

Type 2 diabetes results from a combination of genetic and environmental factors.

Although there is a genetic predisposition, the risk is greatly increased when associated with

lifestyle factors such as high blood pressure, overweight or obesity, insufficient physical activity

and unhealthy diet (http://www.diabetesaustralia.com.au). There is currently no cure for type 2

diabetes. Therefore, understanding how gene expression relates to insulin

resistance and type 2 diabetes should lead to a better understanding of the disease and,

potentially, earlier diagnosis.

A research group in Europe carried out experiments on male mice (i.e., C57BL/6J mice)

that were randomly assigned to a low-fat palm oil diet or a high-fat palm oil diet for 3 or 28 days

(de Wilde et al. 2008). A series of analyses were performed including microarray gene

expression and protein analysis. Two methods were investigated for the analysis of gene

expression: the first method is based on overrepresentation of Gene Ontology (GO) terms

whereas the second method is gene set enrichment analysis (Subramanian et al. 2005). The

sources for the gene set enrichment analysis were GenMAPP, Kyoto Encyclopedia of Genes

and Genomes (KEGG) and SKmanual. It was found that short-term high-fat feeding led to

altered expression levels of genes involved in a variety of biological processes including

morphogenesis, energy metabolism, lipogenesis, and immune function (de Wilde et al. 2008).

Another recent microarray analysis conducted in Japan compared a male rat model of

spontaneous type 2 diabetes resembling obese patients with type 2 diabetes (i.e., OLETF rats)

to a non-diabetic control group of male rats (Hayashi et al. 2010). Gene expression analysis in

diabetes-related tissues (i.e., liver, adipose and skeletal muscle tissue) was carried out and the

results show that blood gene expression profiling is a useful source of markers to predict type 2

diabetes (Hayashi et al. 2010). Hierarchical clustering analysis of differentially expressed genes

in diabetes-related tissues was performed and overrepresented Gene Ontology terms

(Biological Process) that mapped to these differentially expressed (DE) genes were reported to

support their conclusions (Hayashi et al. 2010). No databases or pathways other than the GO

terms were used. No human samples were collected because of ethical issues, so no

cross-species comparison could be made. The James Lab managed to obtain tissue

samples from obese insulin sensitive patients, patients with insulin resistance and

patients with type 2 diabetes. We are interested in cross-species comparison, i.e., investigating how genes

and gene sets of interest found in the mouse data behave in the human data.

1.4 Research Case

The James Lab at Garvan Institute for Medical Research are interested in gene

expression in insulin resistance and type 2 diabetes. They have collected several gene

expression data sets and we have access to three of them. Two of the data sets come from

Affymetrix mouse microarrays; the third comes from Agilent arrays used in a human study. A brief

description of the gene expression microarrays analysed in this thesis is as follows. Detailed

descriptions of the three data sets are given in Chapter 2 - Research Methods.

§ A longitudinal mouse study involving the comparison of a high-fat diet to control with

gene expression in two tissues, adipose and muscle, measured by microarray.

§ A cross-sectional human study comparing expression in tissue samples for a control

group of non-obese healthy patients, obese insulin sensitive patients and patients

with 2 stages of the disease, i.e., obese patients with insulin resistance and type 2

diabetes.

§ A 3T3L1 mouse cell line study of exposure to various substances including insulin.

1.4.1 Research Questions

The research questions we are going to address in this study are as follows:

1. Which genes are differentially expressed in each condition in the human data, relative to

healthy controls?

2. Which genes are differentially expressed in each treatment group in the mouse data

involving the comparison of a high-fat diet to controls?

3. Which genes are differentially expressed in each treatment group in the mouse cell line

study?

4. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese patients with insulin resistance
in the human data relative to healthy controls?

5. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese patients with type 2 diabetes in
the human data relative to healthy controls?

6. Are those significant gene sets (GO terms or KEGG pathways) found in the mouse data
involving a high-fat diet differentially expressed in obese insulin sensitive patients in the
human data relative to healthy controls?

7. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in the mouse data with a high-fat diet?

8. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese patients with insulin resistance relative to healthy controls?

9. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese patients with type 2 diabetes relative to healthy controls?

10. Are there any GO terms or KEGG pathways that are over-represented in the list of top
ranked genes in obese insulin sensitive patients relative to healthy controls?

11. Are insulin/glucose related GO terms differentially expressed in each condition in the

human data, relative to healthy controls?

12. Is a set of FOXO genes differentially expressed?

1.4.2 Goals

The aim of this research is to investigate existing and novel approaches to the analysis

of these data sets. We have set three main goals for this research project:

1. To identify genes that show significant changes in gene expression levels for each data set.

2. To compare genomic expression patterns across species, human and mouse, and to

integrate results from these studies.

3. To focus on pathway analysis aiming at detecting differential expression in predefined

gene sets based on two popular databases, Gene Ontology (GO) and Kyoto

Encyclopedia of Genes and Genomes (KEGG). We are particularly interested in how
differentially expressed (DE) gene sets detected in the mouse data involving a high-fat
diet behave in the human data. Competitive and self-contained gene set tests are
applied to determine significant gene sets and their results are compared. This may help us
better understand the biological processes involved in the development of insulin
resistance and type 2 diabetes in humans.

It has been found that the whole chromosome sequence segments of mouse and human

are remarkably similar (Copeland, Jenkins & O'Brien 2002). Therefore, it is sensible to use the

mouse as a model organism to investigate functions of genes in the human genome. We expect

that a certain number of genes and gene sets interact in similar ways in several biological

systems for both species.

Chapter 2 - Research Methods

Microarrays are used to measure the gene expression levels under different diseases or

experimental conditions. The two microarray platforms used in this study are Affymetrix short

oligonucleotide arrays and Agilent long oligonucleotide arrays. Microarray technology has

enabled researchers to analyse gene expression levels of vast amounts of genes

simultaneously in an efficient manner in multiple biological samples. High-density gene

expression arrays require a pre-processing step to be completed before the statistical

investigation can be carried out.

The research methods used in our statistical analysis are discussed in the five sections that
follow. The open source statistical programming language R and Bioconductor packages are
used throughout the study.

2.1 Experimental Design

The experimental design was completed by the James Lab at Garvan Institute of

Medical Research. The data sets consist of three designs, namely:

• A longitudinal mouse study involving the comparison of a high-fat diet to control with

expression in two tissues, adipose and muscle, measured by microarray (Table 2-1).

Table 2-1: Details of the longitudinal mouse study

Tissue Type   Type of Diet   Days   Symbol   Replicates
Adipose       Standard Lab   -      Achow    4
Adipose       High-fat       5      Ahi5     4
Adipose       High-fat       14     Ahi14    4
Adipose       High-fat       42     Ahi42    3
Muscle        Standard Lab   -      Mchow    4
Muscle        High-fat       5      Mhi5     4
Muscle        High-fat       14     Mhi14    3
Muscle        High-fat       42     Mhi42    4

One group of mice were given a standard lab diet which consists of 8% calories from fat,

21% calories from protein and 71% calories from carbohydrate. The other three groups

of mice were given a high-fat diet for 5, 14 and 42 days respectively. The high-fat diet

consists of 45% calories from fat, 20% calories from protein and 35% calories from

carbohydrate. The gene expression levels were measured in both adipose and muscle

tissues using Affymetrix short oligonucleotide arrays for all groups of mice. The

The various substances include the following compounds:

1. Chronic Insulin

2.3 Analysis of Differential Gene Expression

In order to identify genes that undergo differential expression (DE) between two groups,
we fit a linear model to each gene to estimate the difference and then test the null
hypothesis that the population mean intensity is the same in both groups. The limma package
(linear models for microarray data) is designed for analysing complex microarray experiments
and can be applied to data from both single-channel and two-colour microarray platforms
(Smyth 2004).

To illustrate the idea of the linear models in the limma package, we use the following matrix

notation:

For gene $g$, $E[Y_g] = X\alpha_g$, where $Y_g$ is the logged expression vector for gene $g$, $X$ is the
design matrix and $\alpha_g$ is a vector of coefficients (Smyth 2004).

For the cross-sectional human study and the mouse cell line study, a design matrix is

defined for each data set before fitting a linear model for each gene on the array. For the

longitudinal mouse study involving the comparison of a high-fat diet to a standard diet with

expression levels in two tissues (adipose and muscle), the data set is first divided into two

separate subsets according to the type of tissue before a design matrix is specified for each

tissue type. Then a contrast matrix is created for each design matrix specifying the comparisons

of interest between each time point after which a high-fat diet was given. Based on the contrast

matrix, we are able to compare the initial coefficients in many ways, as required.
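As a minimal sketch of how a design matrix and a contrast matrix work together for one tissue, the following Python fragment fits a group-means model to a single gene and forms the three comparisons with the control. The analysis in this thesis was carried out with limma in R; the expression values, the group-means parameterisation and the helper name `fit_group_means` are illustrative assumptions only.

```python
# Hypothetical log2 expression values for one gene in one tissue (not thesis data).
groups = ["chow", "hi5", "hi14", "hi42"]
samples = [("chow", 8.1), ("chow", 7.9), ("hi5", 8.4), ("hi5", 8.6),
           ("hi14", 8.9), ("hi14", 9.1), ("hi42", 9.4), ("hi42", 9.6)]

# Design matrix X: one indicator column per group (group-means parameterisation).
X = [[1.0 if label == g else 0.0 for g in groups] for label, _ in samples]
y = [value for _, value in samples]


def fit_group_means(design, response):
    """Least-squares fit for an indicator design: each coefficient is a group mean."""
    coefficients = []
    for j in range(len(design[0])):
        members = [response[i] for i, row in enumerate(design) if row[j] == 1.0]
        coefficients.append(sum(members) / len(members))
    return coefficients


# Contrast matrix: each entry compares one high-fat time point with the control.
contrasts = {
    "hi5 - chow": [-1.0, 1.0, 0.0, 0.0],
    "hi14 - chow": [-1.0, 0.0, 1.0, 0.0],
    "hi42 - chow": [-1.0, 0.0, 0.0, 1.0],
}

alpha = fit_group_means(X, y)  # fitted coefficients, one per group
beta = {name: sum(w * a for w, a in zip(weights, alpha))
        for name, weights in contrasts.items()}

for name, value in beta.items():
    print(f"{name}: estimated log2 fold change = {value:.2f}")
```

Each contrast is simply a weighted sum of the fitted coefficients, which is why the same fit can be reused for any comparison of interest.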

One of the common issues in microarray experiments with small sample sizes is that
some genes with small fold changes have, by chance, very small sample variances and so
appear to be differentially expressed with large t-statistics. To prevent this from happening,
we need to modify the denominator of the standard t-statistic. The limma package achieves this
by using an empirical Bayes method to moderate the standard errors of the estimated log-fold
changes, an approach known as shrinkage estimation. This results in more stable inference and
improved power, especially for experiments with small numbers of arrays (Smyth 2004).

The empirical Bayes method assumes an inverse Chi-square prior for the $\sigma_g^2$ with
mean $s_0^2$ and degrees of freedom $d_0$. The posterior values for the residual variances are
given by

$$\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}$$

where $s_g^2$ is the sample residual variance and $d_g$ is the residual degrees of freedom for the $g$th gene.
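To make the shrinkage step concrete, the sketch below computes the posterior variance as the degrees-of-freedom weighted average of the prior value and the gene-wise sample variance, following Smyth (2004), and compares an ordinary t-ratio with the same ratio using the shrunken variance. It is written in Python for illustration only (the thesis analysis used limma in R), and all numerical values are hypothetical.

```python
import math


def posterior_variance(s2_g, d_g, s2_0, d_0):
    """Shrunken variance: (d_0 * s2_0 + d_g * s2_g) / (d_0 + d_g)."""
    return (d_0 * s2_0 + d_g * s2_g) / (d_0 + d_g)


# Hypothetical values, chosen only to show the effect of shrinkage.
s2_0, d_0 = 0.04, 4.0   # prior variance and prior degrees of freedom
d_g = 6.0               # residual degrees of freedom for the gene
log_fc = 0.15           # a small estimated log2 fold change
v = 0.5                 # unscaled variance of the contrast, e.g. 1/4 + 1/4

for s2_g in (0.001, 0.04, 0.25):
    s2_tilde = posterior_variance(s2_g, d_g, s2_0, d_0)
    t_ordinary = log_fc / math.sqrt(s2_g * v)
    t_moderated = log_fc / math.sqrt(s2_tilde * v)
    print(f"s2_g={s2_g:<6} shrunken={s2_tilde:.4f} "
          f"ordinary t={t_ordinary:6.2f} moderated t={t_moderated:5.2f}")
```

With these assumed numbers, the gene with a very small sample variance no longer produces an extreme statistic once its variance is pulled towards the prior value.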

The moderated t-statistic for the kth contrast for gene g is given by

2.4 Cluster Analysis

Apart from finding differentially expressed genes, we are also interested in the

similarities between genes as well as between samples. Each gene has its own expression

profile and genes with similar profiles tend to cluster together. We will perform hierarchical

clustering using Euclidean distance as the distance measure and complete linkage as the

clustering algorithm to produce a dendrogram for each data set. It is also of interest to compare

the dendrograms while using correlation as the similarity measure and complete linkage as the

clustering algorithm. Correlation-based measures are in general invariant to location and scale

transformations and tend to group together genes whose expression patterns are linearly

related (Gentleman et al. 2005).
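The difference between the two measures can be seen in a small sketch, written in Python for illustration (the thesis analysis used R). Two hypothetical expression profiles with the same shape but different location and scale are far apart in Euclidean distance, yet have a correlation distance of essentially zero. The profiles and function names are assumptions made for this example.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """One minus the Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


profile = [1.0, 2.0, 4.0, 3.0]               # hypothetical profile over four samples
rescaled = [2.0 * v + 5.0 for v in profile]  # same shape, shifted and scaled

print("Euclidean distance:  ", round(euclidean(profile, rescaled), 3))
print("Correlation distance:", round(correlation_distance(profile, rescaled), 3))
```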

We intend to try and use the data sets together to either cluster genes or cluster

samples based on gene profiles. Since the two mouse data sets used the same Affymetrix

Mouse Genome 430_2 arrays, we can integrate the data sets based on their probe identifiers

before clustering the samples. Standardization is carried out on both data sets after integration.

Including the human (Agilent) data requires mapping gene identifiers between
array types.

Chapter 5 shows all the analysis completed for hierarchical clustering.

2.5 Functional Annotation and Pathway Analysis

In gene expression microarray analysis, statistical significance may not necessarily
translate into biological relevance. After fitting linear models and correcting for multiple testing

error rates, some genes that are biologically relevant may not appear to be statistically

significant due to the fact that the relevant biological differences are modest relative to the noise

inherent in the microarray technology (Subramanian et al. 2005). In reality, biological processes

are complicated with many molecules working together. The goal of annotating the genome is to

link all information associated with gene products in order to learn how pathways function in the

biological system. In situations where long lists of genes are found to be differentially

expressed, we will consider pathway-level approaches by going beyond the analysis of

individual genes because it is more sensible to focus on groups of genes that are functionally

related based on prior biological knowledge or experiments. Comprehensive sets of results are

presented in Chapter 4 Pathway Analysis.

2.5.1 Gene Set Tests

To define gene sets according to prior biological experimentation, we use two popular

databases for gene annotation and pathways analysis: Gene Ontology (GO) and Kyoto

Encyclopedia of Genes and Genomes (KEGG).

permuting genes. Hence it does not consider any correlation among genes in a gene set to be

an important factor, which may underestimate the variability of the data resulting in small p-

values. CAMERA was recently developed taking inter-gene correlation into consideration by

incorporating the variance inflation factor into the test procedures (Wu & Smyth 2012). The

variance inflation factor is based on the mean correlation estimated directly from residuals from

the linear model for genes in the test set and the procedure is equivalent to computing the

average of all possible pairwise correlations between genes in the set (Wu & Smyth 2012).
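The role of the variance inflation factor can be illustrated with a short Python sketch. It uses the form VIF = 1 + (m - 1) * rho_bar given by Wu & Smyth (2012), where m is the number of genes in the set and rho_bar is the mean pairwise correlation. The residuals below are hypothetical, and the naive averaging of pairwise correlations is a simplification for illustration, not the estimator implemented in limma.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_correlation(residual_rows):
    """Average correlation over all pairs of genes (rows) in the set."""
    pairs = list(combinations(residual_rows, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def variance_inflation_factor(set_size, rho_bar):
    """VIF = 1 + (m - 1) * rho_bar."""
    return 1.0 + (set_size - 1) * rho_bar


# Hypothetical residuals for three genes (rows) over five arrays (columns).
residuals = [
    [0.2, -0.1, 0.3, -0.3, -0.1],
    [0.1, -0.2, 0.4, -0.2, -0.1],
    [-0.3, 0.2, 0.1, 0.1, -0.1],
]
rho_bar = mean_pairwise_correlation(residuals)
print("mean pairwise correlation:", round(rho_bar, 3))
print("VIF for this set:", round(variance_inflation_factor(len(residuals), rho_bar), 3))
# Even a small mean correlation inflates the variance considerably in a large set.
print("VIF for m = 100, rho_bar = 0.05:", round(variance_inflation_factor(100, 0.05), 2))
```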

The second approach is a self-contained gene set test - the rotation gene set test. It

tests if any of the genes in the set are differentially expressed without regard to other genes on

the array. The rotation gene set test replaces permutation with random rotations of residuals to

resolve issues associated with permuting genes, i.e., not allowing for correlations between

genes (Wu et al. 2010). The number of rotations can be set to a very large value, so it avoids

the problem with small number of replicates which may lead to unreliable estimates of p-values

(Wu et al. 2010). In the rotation gene set test, we test three alternative hypotheses: up, down

and mixed. The null hypothesis is that a contrast of the coefficients $\beta_g$ is equal to zero, i.e.,
$\beta_g = 0$ for all genes in the gene set of interest. The

as GLUT4, is the insulin-responsive glucose transporter in muscle and adipose tissue that

plays an important role in postprandial glucose disposal (Stenbit et al. 1997). Altered SLC2A4

activity is suggested to be one of the factors responsible for decreased glucose uptake in

muscle and adipose tissue in obesity and diabetes (Stenbit et al. 1997). FOXO transcription

factors are evolutionarily conserved mediators of insulin and growth factor signalling. They are

at the interface of crucial cellular processes, orchestrating programs of gene expression that

regulate apoptosis, cell-cycle progression, and oxidative stress resistance (Carter & Brunet

2007). We will test the gene sets associated with the above important genes in order to detect

any differential expression at the group level in the human data.

"Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for understanding
high-level functions and utilities of the biological system, such as the cell, the organism and the
ecosystem, from genomic and molecular-level information." (http://www.kegg.jp/kegg/)

We will repeat the mapping and testing procedures for KEGG pathways discussed in the

previous paragraphs.

2.5.2 Hypergeometric Test for Gene Set Enrichment Analysis

Another commonly used approach to finding functional groupings within a set of
differentially expressed genes is to test for over-represented gene sets in a list of significant
genes. This also addresses some of the research questions. We intend to use the non-parametric
hypergeometric test to investigate whether there is an association between genes being
differentially expressed and having a particular function. To demonstrate the concept of the
hypergeometric distribution, we use the following: let $N$ be the total number of genes (i.e., the
gene universe); let $M$ be the number of genes from a particular GO category; let $n$ be the
number of differentially expressed genes; let the random variable $X$ be the number of genes from a
particular GO category which appear in the list of $n$ differentially expressed genes. The
probability mass function of $X$ is given by

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$

To compute the final probability of finding $x$ or more genes we need to sum up all the
probabilities from $x$ to $\min(n, M)$. If a gene set associated with a particular GO term contains

more differentially expressed genes than would be expected by chance, this gene set is over-

represented and will give us insight into the functional characteristics of the gene list (Falcon &

Gentleman 2007). To determine if any GO terms or KEGG pathways are over-represented in a

list of differentially expressed genes, we follow these steps below (Falcon & Gentleman 2007):

1. Carry out nonspecific filtering and define the gene universe.

We use the inter-quartile range to estimate the variation across samples and probes with an

inter-quartile range of less than 0.5 are considered less informative, so they are removed.

Probes with no corresponding Entrez Gene identifiers or annotation in the GO categories

(Biological Process) are also removed. When two or more probes map to the same Entrez Gene

ID, only the probe with the largest inter-quartile range is chosen. We then define those genes

that passed the nonspecific filtering process as the gene universe.

2. Determine a subset of interesting genes.

We use the results from the analysis of differential expression generated previously using the

limma package. For the longitudinal mouse study involving the comparison of a high-fat diet to

control, differentially expressed genes from the muscle tissue group are selected based on the

adjusted p values using a cutoff of 0.01 due to the enormous number of DE probes identified.

For the cross-sectional human study, the top 50 probes are used because of the very few DE

genes found (i.e., no more than 12) in any of the contrasts.

3. Test for over-representation in the collection of gene sets.

The GOstats package provides tools to perform the hypergeometric test for over-represented GO
(BP) terms and display a summary of the test results. Given the hierarchical structure of GO
terms, Falcon and Gentleman (2007) developed a conditional hypergeometric test that uses the
relationships among GO terms to decorrelate the results. For KEGG pathways, a non-conditional
hypergeometric test is performed.
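The non-conditional test in step 3 can be sketched in a few lines of Python using only the standard library. The sketch computes the probability of observing x or more genes from a category among n selected genes, given a universe of N genes of which M belong to the category. It is an illustration of the calculation only, not the GOstats implementation (in particular, it does not perform the conditional test), and the counts used are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, N, M, n):
    """P(X = x): x category genes among n selected genes drawn from a universe of N."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


def hypergeom_upper_tail(x, N, M, n):
    """P(X >= x): sum the probabilities from x up to min(n, M)."""
    return sum(hypergeom_pmf(k, N, M, n) for k in range(x, min(n, M) + 1))


# Hypothetical counts: a universe of 5000 genes, 40 of them in the category,
# and 8 category genes observed among 200 differentially expressed genes.
p_value = hypergeom_upper_tail(8, N=5000, M=40, n=200)
print(f"P(X >= 8) = {p_value:.3g}")
```

Under these assumed counts only 1.6 category genes would be expected by chance (200 x 40 / 5000), so observing 8 gives a small tail probability.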

Chapter 3 - Analysis of Differential Expression

In this chapter, we aim to address the following research questions:

§ Which genes are differentially expressed in each condition in the human data,

relative to healthy controls?

§ Which genes are differentially expressed in each treatment group in the mouse

data involving the comparison of a high-fat diet to controls?

§ Which genes are differentially expressed in each treatment group in the mouse

cell line study?

3.1 Data Pre-processing and Quality Assessment

The three data sets used in this research project are pre-existing. The two Affymetrix

mouse microarray data sets are high-density short oligonucleotide arrays and normalization was

done prior to us receiving the data. Agilent long oligonucleotide arrays were used for the

human cross-sectional study and a different normalization procedure is required. The aim of the

normalization process is to remove effects arising from the microarray technology and to ensure

that the distributions of the intensities across arrays are similar.

3.1.1 Affymetrix Mouse Microarrays

In this case these two mouse arrays were already normalized. Expression values were

first transformed to the log scale. Density plots and boxplots were produced for each of the

mouse data sets to check the normalization process. For the longitudinal mouse study involving

the comparison of a high-fat diet to control in two tissues, adipose and muscle, the distributions

of the log expression values for each type of tissue (see Figure 3-1) appeared to be quite similar

with some variation between tissues. Parallel boxplots generally provide a good summary of the

distributions of intensities across all arrays. It was clearly evident in the boxplots that the log

expression values of the fifteen arrays in each tissue group were similarly distributed (see

Figure 3-2). The medians of the log expression values from the muscle tissue group were

found to be lower than the medians from the adipose tissue group based on the boxplots. The

upper and lower quartiles showed some similar patterns with lower values in the muscle tissue

group (see Figure 3-2). This indicates that we should analyse muscle and adipose tissue groups

separately.

Figure 3-1: Density plot of the longitudinal mouse study involving the comparison of a high-fat
diet to control in two tissues. [x-axis: log expression; legend: Adipose, Muscle]

Figure 3-2: Boxplots of the longitudinal mouse study involving the comparison of a high-fat diet
to control in two tissues. [groups: Adipose, Muscle]

Figure 3-3 shows the density plot of the mouse cell line study and the distributions of the

log expression values were very similar across all the arrays. Median values as well as the

upper and lower quartiles in the boxplots appeared to be almost the same across all the

treatment groups which indicated very similar spread (see Figure 3-4). No obvious outliers were

found in either of the mouse data sets.

Figure 3-3: Density plot of the mouse cell line study of exposure to various substances including
insulin. [x-axis: log expression; y-axis: density]

Figure 3-4: Boxplots of the mouse cell line study of exposure to various substances including
insulin. [array labels: X3T3.L1a, X3T3.L1b, X3T3.L1c, ChronicInsulinA, ChronicInsulinB,
ChronicInsulin.MnTBPa, ChronicInsulin.MnTBPb, TNF.MnTBPa, TNF.MnTBPb, TNF.MnTBPc,
DexamethasoneA, DexamethasoneB, DexamethasoneC, Dexamethasone.MnTBPa, Dexamethasone.MnTBPb,
Dexamethasone.MnTBPc, GlucoseOxidaseA, GlucoseOxidaseB, GlucoseOxidaseC]

3.1.2 Agilent Human Microarrays

The Agilent microarrays of the cross-sectional human study comparing expression in

tissue samples for a control group of healthy patients and obese patients with 3 stages of the

disease (i.e., insulin sensitive, insulin resistant and diabetic) are two-colour arrays and require

the pre-processing procedure including background correction and normalization to be

performed using the limma package. Two-colour microarrays use red (R) and green (G)

channels labelled with Cy5 and Cy3 dyes respectively. In this case the green channel was used

as a common reference. We believe the green channel contains a mixture of material, but since

it is a common reference it has no impact on the analysis. The pre-processing procedure aims

to remove any non-biological effects either between the two channels or between arrays (i.e.,

patients). According to the density plot and boxplots in Figure 3-5, there appeared to be quite a

lot of variation between most of the arrays on the red channel. The distributions of the $\log_2 R$

values (red intensities) were different within the control group and within each of the diseased

groups.

Figure 3-5: Density plot and boxplots of the $\log_2 R$ values in the human data set. [panels:
Density Plot - Red Channel (x-axis: log2 R; legend: Lean Control, Obese Insulin Resistant,
Diabetic, Obese Insulin sensitive); Boxplots - Red Channel]

In analysing the two-colour microarrays, the difference between the red (R) and green

intensities (G) as well as the average of the red and green intensities are of interest. A popular

way of comparing the red and green intensities is using M and A values which are defined as

follows:

M denotes the log fold change for each gene: $M = \log_2 R - \log_2 G = \log_2(R/G)$

A denotes the average log intensity for each gene: $A = \frac{1}{2}(\log_2 R + \log_2 G) = \frac{1}{2}\log_2(RG)$
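As a small worked example of these two quantities, the Python sketch below computes M (the difference of the log2 red and green intensities) and A (their average) for one spot. The intensities are hypothetical and the function name `ma_values` is introduced only for this illustration; the thesis analysis itself used the limma package in R.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    log_r, log_g = math.log2(red), math.log2(green)
    m = log_r - log_g          # log2 fold change, log2(R/G)
    a = 0.5 * (log_r + log_g)  # average log2 intensity, 0.5 * log2(R*G)
    return m, a


# Hypothetical intensities: red is four times as bright as green.
m, a = ma_values(red=1024.0, green=256.0)
print(f"M = {m}, A = {a}")
```

A spot with equal red and green intensities has M = 0, which is why the medians of the M values are expected to sit close to zero after normalization.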

Figure 3-6 shows the boxplots of the M values for each array after loess normalization

within the arrays. The distributions of the M values across all arrays were similar and the

median of the M value for each array appeared to be very close to M=0 (see Figure 3-6).

We can now move on to fitting linear models in order to identify differential gene
expression for each of the data sets.

Figure 3-6: Boxplots of the M values in the human data set after loess normalization.

3.2 Analysis of Differential Expression

For each of the microarray data sets, gene-by-gene statistical tests were performed to
test whether expression differs significantly between each of the treatment groups
and the baseline control, i.e., to test for treatment or group effects.

3.2.1 Mouse Microarrays

The original data of the longitudinal mouse study involving a high-fat diet was divided

into two subsets based on the tissue type, adipose and muscle. The subsets were analysed

separately to detect differentially expressed (DE) genes in each case.

Table 3-1 shows the number of DE probes found using thresholds of both 0.01 and 0.1
to control the false discovery rate (Benjamini & Hochberg 1995) for each of the contrasts in the

muscle tissue group. For example, it was revealed that 1146 probes were differentially

expressed after 42 days of a high-fat diet compared to the control group which was on a

standard diet.

Table 3-1: Differentially expressed genes found in the muscle tissue group

Contrast                       No. of DE probes using FDR < 0.01   No. of DE probes using FDR < 0.1
Muscle.5 days versus Control   3                                   3745
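The false discovery rate adjustment behind these counts can be sketched as follows. This is a plain Python illustration of the Benjamini & Hochberg (1995) step-up adjustment, using hypothetical p-values; the thesis analysis relied on the adjustment provided within limma in R, and the function name here is an assumption for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four probes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
for threshold in (0.01, 0.1):
    count = sum(value < threshold for value in adjusted)
    print(f"probes with adjusted p < {threshold}: {count}")
```

Counting the probes whose adjusted p-value falls below each threshold gives the kind of summary reported per contrast in Table 3-1.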

The top 10 differentially expressed gene symbols and their corresponding $\log_2$ fold
change, average $\log_2$ expression levels, moderated t-statistics (Smyth 2004) and adjusted p

values for two selected comparisons are listed in the following tables. The two selected

comparisons are: treatment group after 14 days of a high-fat diet versus the control group and

after 42 days of a high-fat diet versus the control group.

Table 3-2: Top 10 DE genes between 42 days of a high-fat diet and control

Gene Symbol   logFC    AveExpr   t         FDR       Log odds
Hsdl2         0.7790   10.6069   12.2374   <0.0001   12.3506
Acaa2         1.1012   11.0703   11.5972   <0.0001   11.6576
Serinc1       1.2655   9.5465    9.9676    0.0002    9.6861
Acadl         0.5247   13.2915   9.4685    0.0002    9.0168
Stom          0.7368   8.1708    9.3793    0.0002    8.8937
Serinc3       1.0322   11.7445   9.3742    0.0002    8.8867
Adam10        0.6432   8.0022    9.3486    0.0002    8.8511
Il6st         0.6878   10.7077   8.8126    0.0005    8.0861
Cdc42         0.4965   12.2880   8.5096    0.0006    7.6355
Twsg1         0.8477   9.0585    8.4937    0.0006    7.6115

Table 3-3: Top 10 DE genes between 14 days of a high-fat diet and control

Gene Symbol   logFC   AveExpr   t       FDR     Log odds
Prg4          1.365   9.843     9.245   0.002   8.121
Thbs4         1.010   11.469    8.676   0.002   7.391
Lox           1.102   8.552     8.126   0.004   6.639
Rcn3          0.586   8.033     7.727   0.005   6.062
Aspn          1.239   8.125     7.544   0.005   5.791
Comp          0.874   9.068     7.431   0.005   5.620
Fmod          1.298   12.074    7.395   0.005   5.565
Thbs1         1.300   7.469     7.382   0.005   5.545
Tnmd          1.517   9.651     7.312   0.005   5.437
Lox           0.835   9.805     7.283   0.005   5.392

The differentially expressed genes found in the mouse data for the comparison between
42 days of a high-fat diet and a standard diet are of great interest and we intend to investigate
these DE genes and their associated GO terms or KEGG pathways further in Chapter 4. To
illustrate the results for the DE genes, a volcano plot was produced which visualizes genes with
statistical significance, a large effect size, or both. The top 10 differentially
expressed gene symbols are highlighted in the volcano plot in Figure 3-7. Data points with
higher values along the y-axis represent genes that are highly significant, whereas points close to
the left- or right-hand side of the plot represent genes with large fold changes in the down
or up direction respectively. The horizontal line in red separates the differentially
expressed probes from the non-DE ones. Data points above this horizontal line represent the
1146 probes we found earlier. Data points lying outside the two vertical lines represent genes with
fold changes greater than 2 or smaller than 1/2. We can see in Figure 3-7 that the genes
Acaa2, Serinc1 and Serinc3 appear to satisfy both criteria, i.e., being highly significant with
large fold changes.
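The two criteria can be checked directly against the values in Table 3-2. The Python sketch below keeps the genes whose adjusted p-value is below a cutoff and whose fold change exceeds 2 in either direction, i.e., an absolute log2 fold change above 1. The FDR cutoff of 0.01 and the coding of "<0.0001" as 0.0001 are assumptions made for this illustration.

```python
import math

# (gene symbol, log2 fold change, FDR) copied from Table 3-2;
# entries reported as "<0.0001" are entered as 0.0001 for illustration.
top_genes = [
    ("Hsdl2", 0.7790, 0.0001), ("Acaa2", 1.1012, 0.0001),
    ("Serinc1", 1.2655, 0.0002), ("Acadl", 0.5247, 0.0002),
    ("Stom", 0.7368, 0.0002), ("Serinc3", 1.0322, 0.0002),
    ("Adam10", 0.6432, 0.0002), ("Il6st", 0.6878, 0.0005),
    ("Cdc42", 0.4965, 0.0006), ("Twsg1", 0.8477, 0.0006),
]


def significant_and_large(genes, fdr_cutoff=0.01, fold_change=2.0):
    """Genes passing both the FDR cutoff (assumed) and the fold change cutoff."""
    log_cutoff = math.log2(fold_change)
    return [name for name, log_fc, fdr in genes
            if fdr < fdr_cutoff and abs(log_fc) > log_cutoff]


print(significant_and_large(top_genes))
```

With the tabulated values, only Acaa2, Serinc1 and Serinc3 have a log2 fold change above 1, in agreement with the genes picked out from the volcano plot.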

Figure 3-7: Volcano plot of top DE genes between 42 days of a high-fat diet and the control in
the mouse muscle tissue group. [title: Muscle Tissue - Hi Fat Diet 42 days vs Control; x-axis:
Log Fold Change]

In the adipose tissue group, no differentially expressed genes were found in any of the
contrasts between the three time points and the control after adjusting for multiple testing using a
threshold of 0.1 for the false discovery rate. Figure 3-8 shows the volcano plot of top ranked
non-DE genes between 42 days of a high-fat diet and the control in the mouse adipose tissue
group. We found that 48 non-differentially expressed probe sets had very large
fold changes, i.e., fold changes greater than 4 or less than 1/4. This is also indicated clearly in
the volcano plot (see Figure 3-8). Two vertical lines in green represent log fold changes of 2 and -2.

Figure 3-8: Volcano plot of top ranked genes (non DE) between 42 days of a high-fat diet and
the control in the mouse adipose tissue group. [title: Adipose Tissue - Hi Fat Diet 42 days vs
Control; x-axis: Log Fold Change]

Table 3-4: Top 10 genes with large fold changes between 42 days of a high-fat diet and the
control in the mouse adipose tissue group

Gene Symbol   logFC   AveExpr   t       FDR     Log odds
Crisp1        7.764   9.995     3.377   0.372   -1.918
Akr1b7        6.952   10.961    3.649   0.372   -1.495
Serpina1f     5.882   9.391     2.904   0.403   -2.663
Ptgs2         5.685   8.889     3.053   0.390   -2.428
Spink8        5.399   10.497    2.713   0.415   -2.962
Defb42        4.675   9.850     2.626   0.423   -3.096
Ptgs2         3.899   7.844     2.679   0.420   -3.015
Ceacam10      3.688   8.044     1.855   0.524   -4.229
Pcp4          3.552   8.597     2.124   0.478   -3.851
Cldn8         3.279   7.464     2.332   0.452   -3.545

For the 3T3L1 mouse cell line study of exposure to various substances including insulin,

a linear model was fitted for each gene and the F-statistic was used to determine if any genes

are differentially expressed on any of the contrasts between the treatment groups and the

control. Results for the top 10 probes are listed in Table 3-5.

Table 3-5: Top 10 differentially expressed probes in the mouse cell line data

Gene       Chronic   Chronic Insulin.            TNF.                     Dexamethasone.   Glucose
Symbol     Insulin   MnTBP              TNF      MnTBP    Dexamethasone   MnTBP            Oxidase   AveExpr   F
Cpm        -0.104    0.037              0.019    0.162    3.746           3.814            0.229     4.583     1588.499   <0.0001
Ptgds      -0.194    -0.070             0.285    0.221    3.594           3.614            0.024     5.788     1521.455   <0.0001
Fam107a    -0.175    -0.102             0.009    0.088    3.369           3.307            0.070     5.648     1037.686   <0.0001
Fkbp5      -0.058    -0.084             0.152    0.134    1.892           1.969            0.024     7.657     853.471    <0.0001
Fam107a    -0.226    -0.064             -0.141   -0.004   2.994           3.061            -0.133    4.874     743.152    <0.0001
Clca1      -0.183    -0.361             1.115    0.653    2.026           1.823            0.050     5.670     704.902    <0.0001
Ptgds      -0.141    -0.054             0.187    0.167    2.365           2.415            -0.068    7.109     671.262    <0.0001
Fam107a    -0.377    -0.226             -0.060   0.041    3.972           3.803            0.004     5.720     638.933    <0.0001
Aldh1a1    -0.134    -0.169             1.055    0.679    3.068           2.638            -0.013    5.545     608.139    <0.0001
Chi3l1     -0.270    -0.145             2.047    1.836    0.233           0.248            0.024     5.861     567.040    <0.0001

3.2.2 Human Microarrays

For the cross-sectional human study comparing expression in tissue samples for a

control group of healthy patients, obese insulin sensitive patients, and patients with insulin

resistance and type 2 diabetes, the number of DE genes for each contrast was found to be no

more than 12 using a false discovery rate of 0.1 to adjust for multiple comparisons (see Table 3-6).

Table 3-6: Differentially expressed genes found in the human data

Contrast                                      No. of DE genes using FDR < 0.1
Obese Insulin Resistant versus Lean Control   0
Obese Diabetic versus Lean Control            12
Obese Insulin Sensitive versus Lean Control   2

Lists of the top ranked genes from two selected contrasts are shown in Tables 3-7 and
3-8. The two selected contrasts are: patients with insulin resistance versus the lean control
group and patients with diabetes versus the lean control group. Examples of boxplots for the top
ranked genes were constructed to show the distributions of the expression levels in the lean
control group and each of the diseased groups (see Figures 3-9 and 3-10).

Table 3-7: Top 10 ranked genes for the contrast of insulin resistant versus lean control in the

human data

Gene Symbol logFC t P-Value FDR Log odds

HEXIM1 0.735 5.816 <0.001 0.174 3.050

A_32_P8627 -0.434 -5.476 <0.001 0.174 2.439

CCDC36 -1.309 -5.393 <0.001 0.174 2.286

MASK-BP3 -0.647 -5.132 <0.001 0.174 1.803

EIF4EBP1 -0.644 -5.109 <0.001 0.174 1.761

ENST00000342158 0.397 5.058 <0.001 0.174 1.664

DIP2C -0.486 -5.000 <0.001 0.174 1.557

WIRE -0.421 -4.981 <0.001 0.174 1.521

FXYD4 -0.867 -4.966 <0.001 0.174 1.493

C19orf36 -1.208 -4.943 <0.001 0.174 1.449

Figure 3-9: Boxplots of the top ranked genes in the contrast of insulin resistant versus lean
control. [panels: Boxplot for Gene HEXIM1 Across Arrays; Boxplot for Gene CCDC36 Across Arrays;
groups: Lean Control, Obese Insulin sensitive, Obese Insulin Resistant, Diabetic]

Table 3-8: Top 10 ranked genes for the contrast of diabetic versus lean control in the human data

Gene Symbol logFC t P-Value FDR Log odds

LUZP5 0.413 5.817 <0.001 0.064 3.596

FXYD4 -1.014 -5.807 <0.001 0.064 3.576

LHPP 0.679 5.712 <0.001 0.064 3.387

DIP2C -0.548 -5.640 <0.001 0.064 3.243

PDZD2 0.834 5.543 <0.001 0.064 3.049

DBNDD1 0.738 5.524 <0.001 0.064 3.010

PON2 0.581 5.398 <0.001 0.068 2.754

TOMM40 -0.418 -5.319 <0.001 0.068 2.594

THC2321224 -0.716 -5.299 <0.001 0.068 2.553

A_32_P23096 1.406 5.293 <0.001 0.068 2.539

Figure 3-10: Boxplots of the top ranked genes in the contrast of diabetic versus lean control.
[panels: Boxplot for Gene LUZP5 Across Arrays; Boxplot for Gene FXYD4 Across Arrays]

The James Lab is interested in a particular gene, SLC2A4 (also known as GLUT4), which

is a member of the solute carrier family 2 (facilitated glucose transporter) and encodes a protein

that functions as an insulin-regulated facilitative glucose transporter. In our analysis of

differential expression for the mouse muscle tissue group, SLC2A4 was found to be one of the

top DE genes. We mapped the gene symbol to its corresponding probes in the human genome

and constructed boxplots as shown in Figure 3-11.

[Figure panels: boxplots for probes A_23_P107350 and A_32_P151263]

Figure 3-11: Boxplots for 2 probes corresponding to gene SLC2A4 in the human data.

For each of the microarray data sets, gene-by-gene statistical tests were performed to

test whether expression differed significantly between each of the treatment groups

and the baseline control, i.e., to test for treatment or group effects.

Chapter 4 - Pathway Analysis

In this chapter, we present and interpret the results from the gene set tests and the over-

representation analysis focusing on integrating two data sets: the longitudinal mouse study

involving the comparison of a high-fat diet to the control in two tissues and the cross-sectional

human study comparing expression in tissue samples for a control group of healthy patients,

obese insulin sensitive patients and obese patients with 2 stages of the disease, i.e., patients

with insulin resistance and type 2 diabetes. We intend to provide an insight into the following

research questions:

§ Are those significant gene sets (GO terms or KEGG pathways) found in the mouse

data involving a high-fat diet differentially expressed in obese patients with insulin

resistance in the human data relative to healthy controls?

§ Are those significant gene sets (GO terms or KEGG pathways) found in the mouse

data involving a high-fat diet differentially expressed in obese patients with type 2

diabetes in the human data relative to healthy controls?

§ Are those significant gene sets (GO terms or KEGG pathways) found in the mouse

data involving a high-fat diet differentially expressed in obese insulin sensitive

patients in the human data relative to healthy controls?

§ Are there any GO terms or KEGG pathways that are over-represented in the list of

top ranked genes in the mouse data with a high-fat diet?

§ Are there any GO terms or KEGG pathways that are over-represented in the list of

top ranked genes in obese patients with insulin resistance relative to healthy

controls?

§ Are there any GO terms or KEGG pathways that are over-represented in the list of

top ranked genes in obese patients with type 2 diabetes relative to healthy controls?

§ Are there any GO terms or KEGG pathways that are over-represented in the list of

top ranked genes in obese insulin sensitive patients relative to healthy controls?

§ Are insulin/glucose related GO terms differentially expressed in each condition in the

human data, relative to healthy controls?

§ Is a set of FOXO genes differentially expressed?

4.1 Gene Set Tests

In the process of functional annotation, a list of differentially expressed (DE) genes is

used to explore more about the underlying biological processes associated with these DE

genes. In this chapter all FDRs reported in the tables are adjusted p-values. In the longitudinal

mouse study, we chose the differentially expressed genes detected from the comparison

between 42 days of a high-fat diet and the control in the muscle tissue group because no DE

genes were detected in any of the contrasts in the adipose tissue group. Due to the large

number of DE probes found in this contrast (i.e., 5621 DE probes using FDR<0.1), the threshold

used for controlling the false discovery rate (Benjamini & Hochberg 1995) is set at 0.01, which

indicates the expected proportion of false positive results (i.e., incorrectly rejected null

hypotheses) among all the rejected null hypotheses is controlled to be less than 1%. In this

case, we focus our attention on the top-ranked DE probes. Hence, 1146 DE probes were

identified, then mapped to their corresponding Gene Ontology (GO) terms as well as KEGG

pathways. The resulting GO terms are associated with the mouse genome, so some of them

may not be included in the human genome. Only GO terms that linked to the human genome

were retained for further gene set tests. There is no such issue with KEGG pathways.
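The effect of tightening the FDR threshold from 0.1 to 0.01 can be illustrated with a minimal sketch of the Benjamini-Hochberg adjustment. The function names and the toy p-values below are hypothetical and are not taken from the thesis data; the analysis itself was carried out in R, so this Python version only illustrates the procedure under those assumptions.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_de(pvalues, fdr=0.01):
    """Indices of tests whose adjusted p-value falls below the FDR threshold."""
    return [i for i, q in enumerate(bh_adjust(pvalues)) if q < fdr]


# Hypothetical raw p-values for six probes (illustration only).
toy_p = [0.0001, 0.0004, 0.019, 0.03, 0.2, 0.5]
print(select_de(toy_p, fdr=0.1))   # looser threshold keeps more probes
print(select_de(toy_p, fdr=0.01))  # stricter threshold keeps fewer probes
```

A stricter threshold can only shrink the selected list, which is the behaviour relied on above when reducing the 5621 probes to the top-ranked subset.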

All probes on the Agilent chip are mapped to a total number of 12336 GO terms and 229 KEGG

pathways. It was found that 3171 GO terms were associated with the selected 1146 probes and

3082 GO terms were linked to the

4.1.1 Competitive Gene Set Test

We first look at one of the competitive gene set tests - the mean-rank gene set test in the

limma package. The hypothesis tested in this case is whether the selected set of genes tends to

be more highly ranked compared to randomly selected genes that are not in the selected gene

set in terms of the moderated t-statistic (Smyth 2004). We performed a gene set test for each of

the comparisons in the human data set. Because there were a large number of pre-defined

gene sets, we had to carry out many gene set tests. In Section 4.1.1, we controlled the false

discovery rate (FDR) at level 0.05 while correcting for multiple testing. The results of the gene

set test in each of the three contrasts in the human data are as follows.
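The idea behind a mean-rank test can be sketched as a Wilcoxon rank-sum comparison between the statistics of the probes in the set and those of all remaining probes. The sketch below is an illustration only and is not the limma implementation: it uses a plain normal approximation without tie or correlation corrections, the function name and toy statistics are hypothetical, and the "mixed" alternative is assumed to rank the absolute values of the statistics.

```python
import math


def mean_rank_test(stats, in_set, alternative="up"):
    """Approximate p-value that probes in the set rank higher (or lower) than the rest."""
    if alternative == "mixed":
        # Assumption: "mixed" ignores direction by ranking absolute statistics.
        return mean_rank_test([abs(s) for s in stats], in_set, "up")
    n = len(stats)
    order = sorted(range(n), key=lambda i: stats[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Give tied statistics the average of the ranks they span.
        end = pos
        while end + 1 < n and stats[order[end + 1]] == stats[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    n1 = sum(1 for flag in in_set if flag)
    n2 = n - n1
    rank_sum = sum(r for r, flag in zip(ranks, in_set) if flag)
    u = rank_sum - n1 * (n1 + 1) / 2          # Mann-Whitney U for the set
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n + 1) / 12)  # no tie correction in this sketch
    z = (u - mean_u) / sd_u
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    if alternative == "up":
        return upper_tail
    if alternative == "down":
        return 1.0 - upper_tail
    raise ValueError("alternative must be 'up', 'down' or 'mixed'")


# Hypothetical moderated t-statistics; the first three probes form the gene set.
toy_t = [5.8, 5.1, 4.9, 0.3, -0.2, 0.1, -0.5, 0.4, -1.0, 0.0]
toy_set = [True, True, True] + [False] * 7
p_up = mean_rank_test(toy_t, toy_set, "up")
p_down = mean_rank_test(toy_t, toy_set, "down")
p_mixed = mean_rank_test(toy_t, toy_set, "mixed")
```

Running the test once per direction for every gene set yields the up-, down- and mixed-regulated p-values, which are then adjusted for multiple testing across gene sets.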

4.1.1.1 Gene Ontology (GO) Terms

Insulin Resistant Versus Lean Control

The GO identifier, name of the GO term and the number of probes associated with that

GO term on the human arrays (i.e., Agilent chip hgug4112a) were reported. Gene sets in 420

GO categories were found to be significantly up-regulated with 42 of them containing more than

500 probes. We excluded these 42 GO terms as they are considered to be too general. A list of

the top 30 up-regulated gene sets containing no more than 500 probes is shown in Table 4-1.
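The selection rule described above (keep significant gene sets, drop those with more than 500 probes, then list the top 30 by FDR) can be sketched as follows. The record layout, function name and toy entries are hypothetical and serve only to illustrate the rule.

```python
def top_gene_sets(results, max_probes=500, alpha=0.05, top_n=30):
    """Keep significant gene sets of at most max_probes probes, ordered by FDR."""
    kept = [r for r in results
            if r["fdr"] < alpha and r["n_probes"] <= max_probes]
    kept.sort(key=lambda r: r["fdr"])
    return kept[:top_n]


# Hypothetical gene set results (identifiers and values are made up).
toy_results = [
    {"id": "SET_A", "fdr": 0.001, "n_probes": 120},
    {"id": "SET_B", "fdr": 0.0005, "n_probes": 812},  # significant but too general
    {"id": "SET_C", "fdr": 0.2, "n_probes": 40},      # not significant
    {"id": "SET_D", "fdr": 0.01, "n_probes": 500},    # exactly 500 probes is kept
]
selected_ids = [r["id"] for r in top_gene_sets(toy_results)]
print(selected_ids)
```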

Table 4-1: Top 30 significantly up-regulated GO terms in the contrast between insulin resistant

patients and lean control

GO ID Up-regulated FDR Mixed-regulated FDR GO Term No. of Probes

GO:0006974 <0.0001 0.3219 response to DNA damage stimulus 281

GO:0006511 <0.0001 0.3124 ubiquitin-dependent protein catabolic process 265

GO:0000502 <0.0001 0.0110 proteasome complex 109

GO:0030521 <0.0001 0.0225 androgen receptor signaling pathway 109

GO:0005813 <0.0001 1 Centrosome 403

GO:0050681 <0.0001 0.2065 androgen receptor binding 102

GO:0048538 <0.0001 0.0001 thymus development 82

GO:0060736 <0.0001 <0.0001 prostate gland growth 32

GO:0034747 <0.0001 0.0459 Axin-APC-beta-catenin-GSK3B complex 39

GO:0031274 <0.0001 0.0028 positive regulation of pseudopodium assembly 39

GO:0032839 <0.0001 <0.0001 dendrite cytoplasm 32

GO:0000776 <0.0001 0.3902 Kinetochore 115

GO:0051082 <0.0001 0.7611 unfolded protein binding 244

GO:0043234 <0.0001 1 protein complex 386

GO:0006457 <0.0001 0.7800 protein folding 330

GO:0004842 <0.0001 0.8928 ubiquitin-protein ligase activity 388

GO:0043130 <0.0001 0.0281 ubiquitin binding 48

GO:0000785 <0.0001 0.0956 chromatin 206

GO:0016605 <0.0001 0.5400 PML body 138

GO:0016607 <0.0001 1 nuclear speck 244

GO:0060070 <0.0001 0.1784 canonical Wnt receptor signaling pathway 134

GO:0016567 <0.0001 0.6667 protein ubiquitination 286

GO:0002902 <0.0001 <0.0001 regulation of B cell apoptosis 13

GO:0051717 <0.0001 <0.0001 inositol-1,3,4,5-tetrakisphosphate 3-phosphatase activity 13

GO:0051800 <0.0001 <0.0001 phosphatidylinositol-3,4-bisphosphate 3-phosphatase activity 13

GO:0008219 <0.0001 0.0191 cell death 302

GO:0004402 <0.0001 0.3943 histone acetyltransferase activity 74

GO:0034742 <0.0001 0.0373 APC-Axin-1-beta-catenin complex 24

GO:0006917 <0.0001 1 induction of apoptosis 473

GO:0008234 <0.0001 0.8217 cysteine-type peptidase activity 152

Gene sets in 121 GO categories were found to be significantly down-regulated with 6 of

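them containing more than 500 probes. The top 30 significant gene sets (No. of probes ≤ 500)

are shown in Table 4-2.

Table 4-2: Top 30 significantly down-regulated GO terms in the contrast between insulin

resistant patients and lean control

GO ID Down-regulated FDR Mixed-regulated FDR GO Term No. of Probes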

GO:0006415 <0.0001 1 translational termination 233

GO:0003735 <0.0001 1 structural constituent of ribosome 322

GO:0006414 <0.0001 1 translational elongation 254

GO:0005840 <0.0001 1 ribosome 304

GO:0022627 <0.0001 1 cytosolic small ribosomal subunit 81

GO:0045725 <0.0001 0.1032 positive regulation of glycogen biosynthetic process 43

GO:0050896 <0.0001 1 response to stimulus 404

GO:0005759 <0.0001 0.2187 mitochondrial matrix 350

GO:0060754 <0.0001 0.0005 positive regulation of mast cell chemotaxis 17

GO:0005172 <0.0001 0.0010 vascular endothelial growth factor receptor binding 13

GO:0008083 <0.0001 0.7831 growth factor activity 360

GO:0010595 <0.0001 0.8292 positive regulation of endothelial cell migration 79

GO:0003707 <0.0001 0.2289 steroid hormone receptor activity 115

GO:0050927 <0.0001 0.0197 positive regulation of positive chemotaxis 23

GO:0048598 <0.0001 0.0208 embryonic morphogenesis 15

GO:0005743 <0.0001 0.7514 mitochondrial inner membrane 432

GO:0005499 <0.0001 0.0043 vitamin D binding 15

GO:0070644 0.0002 0.0438 vitamin D response element binding 15

GO:0018119 0.0002 0.0029 peptidyl-cysteine S-nitrosylation 13

GO:0007281 0.0003 0.0261 germ cell development 56

GO:0004517 0.0003 0.5909 nitric-oxide synthase activity 23

GO:0050731 0.0003 0.5118 positive regulation of peptidyl-tyrosine phosphorylation 194

GO:0060068 0.0003 0.0681 vagina development 16

GO:0046898 0.0004 0.0088 response to cycloheximide 13

GO:0006809 0.0004 0.7901 nitric oxide biosynthetic process 40

GO:0051450 0.0006 0.0337 myoblast proliferation 14

GO:0045840 0.0006 0.6417 positive regulation of mitosis 83

GO:0017134 0.0007 0.0135 fibroblast growth factor binding 28

GO:0048009 0.0007 0.7543 insulin-like growth factor receptor signaling pathway 40

GO:0030976 0.0007 0.0086 thiamine pyrophosphate binding 8

Gene sets in 128 GO categories were differentially expressed regardless of the direction

and only one of them contains more than 500 probes (i.e., GO:0055114 oxidation-reduction

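process). The top 30 significant gene sets (No. of probes ≤ 500) are shown in Table 4-3.

Table 4-3: Top 30 significantly mixed-regulated GO terms in the contrast between insulin

resistant patients and lean control

GO ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR GO Term No. of Probes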

GO:0046716 1 0.0005 <0.0001 muscle cell homeostasis 45

GO:0032839 1 <0.0001 <0.0001 dendrite cytoplasm 32

GO:0060736 1 <0.0001 <0.0001 prostate gland growth 32

GO:0006096 1 0.0289 <0.0001 Glycolysis 83

GO:0070102 1 0.5782 <0.0001 interleukin-6-mediated signaling pathway 26

GO:0033032 1 0.0002 <0.0001 regulation of myeloid cell apoptosis 15

GO:0002902 1 <0.0001 <0.0001 regulation of B cell apoptosis 13

GO:0051717 1 <0.0001 <0.0001 inositol-1,3,4,5-tetrakisphosphate 3-phosphatase activity 13

GO:0051800 1 <0.0001 <0.0001 phosphatidylinositol-3,4-bisphosphate 3-phosphatase activity 13

GO:0060087 1 <0.0001 <0.0001 relaxation of vascular smooth muscle 13

GO:0008289 1 0.1756 <0.0001 lipid binding 247

GO:0006749 1 0.6982 0.0001 glutathione metabolic process 52

GO:0060088 1 0.0037 0.0001 auditory receptor cell stereocilium organization 16

GO:0048538 1 <0.0001 0.0001 thymus development 82

GO:0016314 1 <0.0001 0.0001 phosphatidylinositol-3,4,5-trisphosphate 3-phosphatase activity 16

GO:0060292 1 0.0001 0.0001 long term synaptic depression 19

GO:0019226 1 0.0015 0.0002 transmission of nerve impulse 22

GO:0004438 1 <0.0001 0.0002 phosphatidylinositol-3-phosphatase activity 15

GO:0070830 1 <0.0001 0.0003 tight junction assembly 57

GO:0060742 1 0.0006 0.0004 epithelial cell differentiation involved in prostate gland development 14

GO:0031253 1 0.0004 0.0004 cell projection membrane 17

GO:0006094 1 0.0310 0.0005 Gluconeogenesis 89

GO:0060754 <0.0001 1 0.0005 positive regulation of mast cell chemotaxis 17

GO:0001659 1 0.3797 0.0006 temperature homeostasis 29

GO:0050930 0.2750 1 0.0007 induction of positive chemotaxis 36

GO:0042056 0.7766 1 0.0009 chemoattractant activity 46

GO:0060716 1 0.2502 0.0010 labyrinthine layer blood vessel development 40

GO:0005172 <0.0001 1 0.0010 vascular endothelial growth factor receptor binding 13

GO:0044262 1 0.9387 0.0010 cellular carbohydrate metabolic process 15

GO:0000302 1 0.0402 0.0013 response to reactive oxygen species 37

Diabetic versus Lean Control

Gene set tests were carried out in the contrast between patients with diabetes and the

control group, and we found that 488 GO terms appeared to be significantly up-regulated with

60 of them having a size of over 500 probes. The top 30 GO terms (No. of probes

Table 4-5: Top 30 significantly down-regulated GO terms in the contrast between diabetic

patients and lean control

GO ID Down-regulated FDR Mixed-regulated FDR GO Term No. of Probes

GO:0006415 <0.0001 0.0031 translational termination 233

GO:0006414 <0.0001 0.0012 translational elongation 255

GO:0006413 <0.0001 0.1148 translational initiation 321

GO:0003735 <0.0001 0.0199 structural constituent of ribosome 327

GO:0005840 <0.0001 0.0363 Ribosome 381

GO:0000184 <0.0001 0.0672 nuclear-transcribed mRNA catabolic process, nonsense-mediated decay 294

GO:0022627 <0.0001 0.0018 cytosolic small ribosomal subunit 81

GO:0010595 <0.0001 1 positive regulation of endothelial cell migration 109

GO:0044429 <0.0001 0.1846 mitochondrial part 30

GO:0001974 <0.0001 1 blood vessel remodeling 77

GO:0006364 <0.0001 1 rRNA processing 154

GO:0060754 <0.0001 <0.0001 positive regulation of mast cell chemotaxis 18

GO:0005172 <0.0001 0.0003 vascular endothelial growth factor receptor binding 14

GO:0004517 <0.0001 0.3393 nitric-oxide synthase activity 23

GO:0050930 <0.0001 0.0098 induction of positive chemotaxis 26

GO:0010181 <0.0001 0.2205 FMN binding 36

GO:0045725 <0.0001 1 positive regulation of glycogen biosynthetic process 48

GO:0070644 <0.0001 0.0022 vitamin D response element binding 15

GO:0006809 0.0002 0.8311 nitric oxide biosynthetic process 45

GO:0040015 0.0002 0.0775 negative regulation of multicellular organism growth 21

GO:0042274 0.0002 0.4221 ribosomal small subunit biogenesis 20

GO:0043526 0.0006 0.2669 neuroprotection 49

GO:0040007 0.0007 0.0899 growth 66

GO:0005896 0.0007 0.0070 interleukin-6 receptor complex 13

GO:0015288 0.0007 0.0388 porin activity 12

GO:0050896 0.0007 1 response to stimulus 351

GO:0046898 0.0008 0.1991 response to cycloheximide 14

GO:0008083 0.0009 1 growth factor activity 361

GO:0009409 0.0012 0.2097 response to cold 111

GO:0031017 0.0013 1 exocrine pancreas development 34

214 GO terms with a size of no more than 500 probes are significantly up-regulated in

both contrasts, i.e., insulin resistant patients versus the control and diabetic patients versus the

control.

Gene sets defined by 120 GO terms were significantly down-regulated with 9 of

them containing more than 500 probes. Genes in 182 GO terms appeared to be differentially

expressed regardless of the direction with 11 of them having a size of over 500 probes. The top

30 GO terms (No. of probes

51 GO terms with a size of no more than 500 probes are significantly down-regulated in

both contrasts, i.e., insulin resistant patients versus the control and diabetic patients versus the

control.

We found that 38 GO terms (with a size of no more than 500 probes) are significantly

mixed-regulated in both contrasts, i.e., insulin resistant patients versus the control and diabetic

patients versus the control.

Insulin Sensitive versus Lean Control

Gene sets in 428 GO terms were significantly up-regulated with 69 of them having a size

of more than 500 probes. 97 GO terms were significantly down-regulated with 5 of them

containing more than 500 probes. 192 GO terms were significantly differentially expressed regardless of the direction

and 15 of them contain more than 500 probes. The top 10 up-, down- and mixed-regulated GO

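terms (No. of probes ≤ 500) are shown in Tables 4-7 to 4-9.

Table 4-8: Top 10 significantly down-regulated GO terms in the contrast between insulin

sensitive patients and lean control

GO ID Down-regulated FDR Mixed-regulated FDR GO Term No. of Probes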

GO:0006415 <0.0001 1 translational termination 233

GO:0003735 <0.0001 1 structural constituent of ribosome 327

GO:0005840 <0.0001 1 ribosome 381

GO:0006413 <0.0001 1 translational initiation 321

GO:0006414 <0.0001 1 translational elongation 255

GO:0000184 <0.0001 1 nuclear-transcribed mRNA catabolic process, nonsense-mediated decay 294

GO:0001974 <0.0001 0.1374 blood vessel remodeling 77

GO:0004517 <0.0001 0.0010 nitric-oxide synthase activity 23

GO:0022627 <0.0001 1 cytosolic small ribosomal subunit 81

GO:0005743 <0.0001 0.9517 mitochondrial inner membrane 478

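Table 4-9: Top 10 significantly mixed-regulated GO terms in the contrast between insulin

sensitive patients and lean control

GO ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR GO Term No. of Probes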

GO:0032839 1 <0.0001 <0.0001 dendrite cytoplasm 34

GO:0034097 0.9246 1 <0.0001 response to cytokine stimulus 228

GO:0070301 0.8613 1 <0.0001 cellular response to hydrogen peroxide 70

GO:0070102 0.8918 1 <0.0001 interleukin-6-mediated signaling pathway 27

GO:0046965 0.8167 1 <0.0001 retinoid X receptor binding 37

GO:0034088 1 <0.0001 <0.0001 maintenance of mitotic sister chromatid cohesion 26

GO:0033160 0.0153 1 <0.0001 positive regulation of protein import into nucleus, translocation 62

GO:0043330 1 <0.0001 <0.0001 response to exogenous dsRNA 42

GO:0046716 1 0.0392 <0.0001 muscle cell homeostasis 74

GO:0031000 1 0.2905 <0.0001 response to caffeine 40

4.1.1.2 KEGG Pathways

Insulin Resistant versus Lean Control

The KEGG pathway identifier, name of the KEGG pathway and the number of probes

associated with that KEGG pathway on the human arrays (i.e., Agilent chip hgug4112a) were

reported. It was revealed that gene sets in 55 KEGG pathways were significantly up-regulated

using the criterion of FDR<0.05 with 5 of them containing more than 500 probes. The results

are shown in Table 4-10.

Table 4-10: 55 significantly up-regulated KEGG pathways in the contrast between insulin

resistant and lean control

KEGG ID Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

5210 <0.0001 0.4686 Colorectal cancer 316

4310 <0.0001 0.9155 Wnt signaling pathway 421

4141 <0.0001 0.6048 Protein processing in endoplasmic reticulum 327

4722 <0.0001 1 Neurotrophin signaling pathway 461

4520 <0.0001 1 Adherens junction 329

5213 <0.0001 0.2997 Endometrial cancer 256

5215 <0.0001 0.6886 Prostate cancer 433

4144 <0.0001 1 Endocytosis 554

5160 <0.0001 0.0883 Hepatitis C 356

4062 <0.0001 1 Chemokine signaling pathway 503

5200 <0.0001 1 Pathways in cancer 1292

4916 <0.0001 0.1744 Melanogenesis 273

4510 <0.0001 1 Focal adhesion 710

4810 <0.0001 0.3344 Regulation of actin cytoskeleton 536

4320 <0.0001 0.5015 Dorso-ventral axis formation 85

3013 <0.0001 1 RNA transport 284

4530 <0.0001 0.0445 Tight junction 314

5212 <0.0001 0.6623 Pancreatic cancer 342

4110 <0.0001 1 Cell cycle 424

4720 <0.0001 1 Long-term potentiation 197

4120 <0.0001 0.3038 Ubiquitin mediated proteolysis 248

3050 0.0001 0.0856 Proteasome 71

4360 0.0001 1 Axon guidance 337

4350 0.0001 1 TGF-beta signaling pathway 275

4010 0.0003 1 MAPK signaling pathway 694

5217 0.0004 0.4642 Basal cell carcinoma 128

5214 0.0006 0.4562 Glioma 299

5211 0.0007 1 Renal cell carcinoma 283

5221 0.0010 0.5826 Acute myeloid leukemia 224

5020 0.0011 0.0228 Prion diseases 138

5100 0.0017 1 Bacterial invasion of epithelial cells 237

5223 0.0017 1 Non-small cell lung cancer 237

4070 0.0022 0.4788 Phosphatidylinositol signaling system 153

4114 0.0024 0.7178 Oocyte meiosis 254

3015 0.0025 1 mRNA surveillance pathway 143

4210 0.0049 1 Apoptosis 299

10 0.0054 0.0004 Glycolysis / Gluconeogenesis 113

4540 0.0060 1 Gap junction 224

4910 0.0077 0.5290 Insulin signaling pathway 321

4666 0.0082 1 Fc gamma R-mediated phagocytosis 251

4130 0.0116 0.7942 SNARE interactions in vesicular transport 61

4974 0.0157 0.1744 Protein digestion and absorption 149

4330 0.0160 1 Notch signaling pathway 121

740 0.0162 0.9853 Riboflavin metabolism 18

5220 0.0228 1 Chronic myeloid leukemia 377

4660 0.0282 1 T cell receptor signaling pathway 384

562 0.0283 0.5143 Inositol phosphate metabolism 103

4670 0.0341 1 Leukocyte transendothelial migration 360

5216 0.0354 1 Thyroid cancer 168

4012 0.0371 1 ErbB signaling pathway 326

5218 0.0371 0.0907 Melanoma 315

5014 0.0402 <0.0001 Amyotrophic lateral sclerosis (ALS) 173

4970 0.0405 1 Salivary secretion 148

4962 0.0440 0.3650 Vasopressin-regulated water reabsorption 78

4662 0.0470 1 B cell receptor signaling pathway 232

Gene sets associated with 19 KEGG pathways were found to be significantly down-

regulated with one pathway containing over 500 probes. The results are given in Table 4-11.

KEGG ID Down-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

3010 <0.0001 1 Ribosome 232

190 <0.0001 0.4686 Oxidative phosphorylation 173

983 <0.0001 0.1528 Drug metabolism - other enzymes 83

5012 0.0003 0.3955 Parkinson's disease 194

1100 0.0004 0.0174 Metabolic pathways 1804

330 0.0005 0.1528 Arginine and proline metabolism 112

860 0.0007 1 Porphyrin and chlorophyll metabolism 81

20 0.0010 0.7942 Citrate cycle (TCA cycle) 52

500 0.0017 0.0856 Starch and sucrose metabolism 65

4080 0.0021 1 Neuroactive ligand-receptor interaction 401

4260 0.0035 1 Cardiac muscle contraction 136

280 0.0043 1 Valine, leucine and isoleucine degradation 70

5323 0.0061 1 Rheumatoid arthritis 292

480 0.0084 0.1533 Glutathione metabolism 93

4640 0.0096 1 Hematopoietic cell lineage 220

5016 0.0105 0.3735 Huntington's disease 382

4610 0.0224 1 Complement and coagulation cascades 175

5410 0.0352 1 Hypertrophic cardiomyopathy (HCM) 238

4740 0.0418 1 Olfactory transduction 163

Gene sets in 9 KEGG pathways (see Table 4-12) were differentially expressed

regardless of the direction with one pathway containing more than 500 probes. Among these 9

pathways, 4 were neither down-regulated nor up-regulated but mixed-regulated.

KEGG ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

5014 1 0.0402 <0.0001 Amyotrophic lateral sclerosis (ALS) 173

360 1 0.7942 <0.0001 Phenylalanine metabolism 35

10 1 0.0054 0.0004 Glycolysis / Gluconeogenesis 113

1100 0.0004 1 0.0174 Metabolic pathways 1804

5020 1 0.0011 0.0228 Prion diseases 138

4146 1 1 0.0352 Peroxisome 127

620 0.7594 1 0.0408 Pyruvate metabolism 68

3022 1 0.2583 0.0408 Basal transcription factors 58

4530 1 <0.0001 0.0445 Tight junction 314
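The count of pathways that are mixed-regulated only can be checked directly from the three FDR columns of the table above. The sketch below transcribes those nine rows; entering "<0.0001" as 0.0001 is an assumption that does not affect a cut-off of 0.05, and the helper name `classify` is hypothetical.

```python
# (KEGG ID, down-regulated FDR, up-regulated FDR, mixed-regulated FDR),
# transcribed from the table above; "<0.0001" is entered as 0.0001 (assumption).
rows = [
    ("5014", 1, 0.0402, 0.0001),
    ("360", 1, 0.7942, 0.0001),
    ("10", 1, 0.0054, 0.0004),
    ("1100", 0.0004, 1, 0.0174),
    ("5020", 1, 0.0011, 0.0228),
    ("4146", 1, 1, 0.0352),
    ("620", 0.7594, 1, 0.0408),
    ("3022", 1, 0.2583, 0.0408),
    ("4530", 1, 0.0001, 0.0445),
]


def classify(down_fdr, up_fdr, mixed_fdr, alpha=0.05):
    """List the alternatives under which a pathway is significant."""
    labels = []
    if up_fdr < alpha:
        labels.append("up")
    if down_fdr < alpha:
        labels.append("down")
    if mixed_fdr < alpha:
        labels.append("mixed")
    return labels


mixed_only = [kegg_id for kegg_id, down, up, mixed in rows
              if classify(down, up, mixed) == ["mixed"]]
print(mixed_only)  # → ['360', '4146', '620', '3022']
```

The four identifiers correspond to Phenylalanine metabolism, Peroxisome, Pyruvate metabolism and Basal transcription factors.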

Diabetic versus Lean Control

Gene sets defined by 56 KEGG pathways were significantly up-regulated with 6 of them

having a size of over 500 probes. The full list is shown in Table 4-13. We compared these

KEGG identifiers to those up-regulated pathways in the contrast of insulin resistant versus lean

control and found that 35 KEGG pathways were significantly up-regulated in both contrasts (see

Table 4-14).

Table 4-13: 56 significantly up-regulated KEGG pathways in the contrast between diabetic and

lean control

KEGG ID Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

4650 <0.0001 0.0006 Natural killer cell mediated cytotoxicity 430

4062 <0.0001 0.0710 Chemokine signaling pathway 503

4144 <0.0001 <0.0001 Endocytosis 554

5416 <0.0001 0.0064 Viral myocarditis 272

5200 <0.0001 0.0123 Pathways in cancer 1292

4670 <0.0001 0.5988 Leukocyte transendothelial migration 360

4145 <0.0001 0.6685 Phagosome 392

4360 <0.0001 0.4389 Axon guidance 337

4510 <0.0001 0.0655 Focal adhesion 710

5210 <0.0001 1 Colorectal cancer 316

4520 <0.0001 0.7885 Adherens junction 329

4514 <0.0001 0.0015 Cell adhesion molecules (CAMs) 321

4666 <0.0001 0.5954 Fc gamma R-mediated phagocytosis 251

5100 <0.0001 1 Bacterial invasion of epithelial cells 237

4512 <0.0001 0.0036 ECM-receptor interaction 243

4660 <0.0001 1 T cell receptor signaling pathway 384

5340 <0.0001 0.0011 Primary immunodeficiency 84

4612 <0.0001 0.1366 Antigen processing and presentation 220

4210 <0.0001 0.8637 Apoptosis 299

5145 <0.0001 0.6444 Toxoplasmosis 456

4060 <0.0001 0.0032 Cytokine-cytokine receptor interaction 583

4110 0.0001 1 Cell cycle 424

5144 0.0001 0.3926 Malaria 237

4722 0.0002 1 Neurotrophin signaling pathway 461

5160 0.0002 0.1672 Hepatitis C 356

4974 0.0002 0.0042 Protein digestion and absorption 149

4115 0.0005 0.4147 p53 signaling pathway 293

5217 0.0006 0.5611 Basal cell carcinoma 128

5322 0.0006 0.8988 Systemic lupus erythematosus 237

3050 0.0006 0.3600 Proteasome 71

10 0.0008 <0.0001 Glycolysis / Gluconeogenesis 113

5020 0.0008 0.0055 Prion diseases 138

4380 0.0008 0.9978 Osteoclast differentiation 388

4664 0.0008 1 Fc epsilon RI signaling pathway 254

4916 0.0017 1 Melanogenesis 273

5223 0.0021 1 Non-small cell lung cancer 237

5146 0.0031 0.0003 Amoebiasis 346

4120 0.0042 0.7879 Ubiquitin mediated proteolysis 248

4310 0.0042 1 Wnt signaling pathway 421

4662 0.0050 1 B cell receptor signaling pathway 232

5412 0.0075 0.4524 Arrhythmogenic right ventricular cardiomyopathy (ARVC) 186

5216 0.0083 0.1016 Thyroid cancer 168

5222 0.0111 1 Small cell lung cancer 329

5219 0.0111 0.0783 Bladder cancer 259

4320 0.0128 1 Dorso-ventral axis formation 85

4114 0.0165 1 Oocyte meiosis 254

4530 0.0249 0.1705 Tight junction 314

5215 0.0264 1 Prostate cancer 433

4141 0.0277 1 Protein processing in endoplasmic reticulum 327

5142 0.0319 0.8988 Chagas disease (American trypanosomiasis) 390

4912 0.0447 0.6708 GnRH signaling pathway 287

5010 0.0458 0.2737 Alzheimer's disease 391

5014 0.0458 0.0017 Amyotrophic lateral sclerosis (ALS) 173

Table 4-14: 35 significantly up-regulated KEGG pathways in both insulin resistant and diabetic

groups

KEGG ID KEGG Pathway No. of Probes

4062 Chemokine signaling pathway 503

4144 Endocytosis 554

4360 Axon guidance 337

4510 Focal adhesion 710

4520 Adherens junction 329

4530 Tight junction 314

4660 T cell receptor signaling pathway 384

4666 Fc gamma R-mediated phagocytosis 251

4670 Leukocyte transendothelial migration 360

4722 Neurotrophin signaling pathway 461

4810 Regulation of actin cytoskeleton 536

5100 Bacterial invasion of epithelial cells 237

5200 Pathways in cancer 1292

5212 Pancreatic cancer 342

4310 Wnt signaling pathway 421

4916 Melanogenesis 273

5210 Colorectal cancer 316

5213 Endometrial cancer 256

5215 Prostate cancer 433

5216 Thyroid cancer 168

5217 Basal cell carcinoma 128

4210 Apoptosis 299

4662 B cell receptor signaling pathway 232

5160 Hepatitis C 356

10 Glycolysis / Gluconeogenesis 113

4110 Cell cycle 424

4114 Oocyte meiosis 254

5014 Amyotrophic lateral sclerosis (ALS) 173

4120 Ubiquitin mediated proteolysis 248

4141 Protein processing in endoplasmic reticulum 327

5020 Prion diseases 138

3050 Proteasome 71

4320 Dorso-ventral axis formation 85

5223 Non-small cell lung cancer 237

4974 Protein digestion and absorption 149

In this case 17 KEGG pathways were significantly down-regulated with one pathway

containing more than 500 probes (see Table 4-15). We found that 13 of these KEGG pathways

appeared to be significantly down-regulated in both contrasts, i.e., insulin resistant and diabetic

patients (see Table 4-16).

Table 4-15: 17 significantly down-regulated KEGG pathways in the contrast between diabetic

and lean control

KEGG ID Down-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

3010 <0.0001 0.0012 Ribosome 232

860 0.0001 1 Porphyrin and chlorophyll metabolism 81

1100 0.0002 0.1988 Metabolic pathways 1804

190 0.0002 1 Oxidative phosphorylation 173

983 0.0005 1 Drug metabolism - other enzymes 83

4260 0.0036 1 Cardiac muscle contraction 136

20 0.0036 1 Citrate cycle (TCA cycle) 52

280 0.0042 1 Valine, leucine and isoleucine degradation 70

4020 0.0054 0.5954 Calcium signaling pathway 356

410 0.0123 1 beta-Alanine metabolism 36

5016 0.0177 1 Huntington's disease 382

240 0.0319 1 Pyrimidine metabolism 155

5012 0.0319 1 Parkinson's disease 194

3020 0.0451 0.3600 RNA polymerase 49

Table 4-16: 13 significantly down-regulated KEGG pathways in both insulin resistant and

diabetic groups

KEGG ID KEGG Pathway No. of Probes

280 Valine, leucine and isoleucine degradation 70

1100 Metabolic pathways 1804

20 Citrate cycle (TCA cycle) 52

983 Drug metabolism - other enzymes 83

4740 Olfactory transduction 163

330 Arginine and proline metabolism 112

4080 Neuroactive ligand-receptor interaction 401

4260 Cardiac muscle contraction 136

5016 Huntington's disease 382

190 Oxidative phosphorylation 173

5012 Parkinson's disease 194

860 Porphyrin and chlorophyll metabolism 81

3010 Ribosome 232

We found that 20 KEGG pathways (see Table 4-17) were differentially expressed

regardless of the direction with 3 of them having a size of over 500 probes. 5 KEGG pathways

were significantly mixed-regulated in both insulin resistant and diabetic patients (see Table 4-18).

Table 4-17: 20 significantly mixed-regulated KEGG pathways in the contrast between diabetic

and lean control

KEGG ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

10 1 0.0008 <0.0001 Glycolysis / Gluconeogenesis 113

4144 1 <0.0001 <0.0001 Endocytosis 554

360 1 0.3090 <0.0001 Phenylalanine metabolism 35

5146 1 0.0031 0.0003 Amoebiasis 346

620 0.6134 1 0.0006 Pyruvate metabolism 68

4650 1 <0.0001 0.0006 Natural killer cell mediated cytotoxicity 430

5340 1 <0.0001 0.0011 Primary immunodeficiency 84

3010 <0.0001 1 0.0012 Ribosome 232

4514 1 <0.0001 0.0015 Cell adhesion molecules (CAMs) 321

5014 1 0.0458 0.0017 Amyotrophic lateral sclerosis (ALS) 173

330 0.0003 1 0.0021 Arginine and proline metabolism 112

4060 1 <0.0001 0.0032 Cytokine-cytokine receptor interaction 583

4512 1 <0.0001 0.0036 ECM-receptor interaction 243

4974 1 0.0002 0.0042 Protein digestion and absorption 149

5020 1 0.0008 0.0055 Prion diseases 138

5416 1 <0.0001 0.0064 Viral myocarditis 272

5200 1 <0.0001 0.0123 Pathways in cancer 1292

4150 0.1985 1 0.0146 mTOR signaling pathway 165

4910 1 0.9529 0.0213 Insulin signaling pathway 321

350 1 0.4492 0.0334 Tyrosine metabolism 72

Table 4-18: 5 significantly mixed-regulated KEGG pathways in both insulin resistant and

diabetic groups

KEGG ID KEGG Pathway No. of Probes

10 Glycolysis / Gluconeogenesis 113

620 Pyruvate metabolism 68

5014 Amyotrophic lateral sclerosis (ALS) 173

5020 Prion diseases 138

360 Phenylalanine metabolism 35
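The pathways shared by two contrasts are obtained by intersecting the two lists of significant KEGG identifiers. As a minimal sketch, the identifiers below are transcribed from the mixed-regulated tables for the insulin resistant and diabetic contrasts; the variable names are hypothetical.

```python
# KEGG IDs significantly mixed-regulated in insulin resistant versus lean control.
ir_mixed = ["5014", "360", "10", "1100", "5020", "4146", "620", "3022", "4530"]

# KEGG IDs significantly mixed-regulated in diabetic versus lean control (Table 4-17).
t2d_mixed = ["10", "4144", "360", "5146", "620", "4650", "5340", "3010",
             "4514", "5014", "330", "4060", "4512", "4974", "5020", "5416",
             "5200", "4150", "4910", "350"]

# Keep the order of the first list while testing membership in the second.
t2d_lookup = set(t2d_mixed)
shared = [kegg_id for kegg_id in ir_mixed if kegg_id in t2d_lookup]
print(len(shared), sorted(shared, key=int))
```

The same intersection applied to the up- and down-regulated lists gives the shared pathways reported for those directions.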

Insulin Sensitive versus Lean Control

Gene sets defined by 70 KEGG pathways were significantly up-regulated (FDR<0.05)

with 5 of them having a size of over 500 probes. The top 30 KEGG pathways are given in

Table 4-19. A number of KEGG pathways related to different types of cancer were significantly

up-regulated in the insulin sensitive patients relative to healthy controls.

Table 4-19: Top 30 significantly up-regulated KEGG pathways in the contrast between insulin

sensitive and lean control

KEGG ID Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

5210 <0.0001 0.0779 Colorectal cancer 316

5200 <0.0001 <0.0001 Pathways in cancer 1292

4650 <0.0001 0.5834 Natural killer cell mediated cytotoxicity 430

5223 <0.0001 0.0003 Non-small cell lung cancer 237

4144 <0.0001 0.4683 Endocytosis 554

4062 <0.0001 1 Chemokine signaling pathway 503

5160 <0.0001 0.0076 Hepatitis C 356

5145 <0.0001 0.6521 Toxoplasmosis 456

5221 <0.0001 0.0660 Acute myeloid leukemia 224

5215 <0.0001 <0.0001 Prostate cancer 433

4722 <0.0001 1 Neurotrophin signaling pathway 461

4320 <0.0001 0.1302 Dorso-ventral axis formation 85

4110 <0.0001 0.8557 Cell cycle 424

5416 <0.0001 0.9113 Viral myocarditis 272

4010 <0.0001 0.0478 MAPK signaling pathway 694

4916 <0.0001 0.1945 Melanogenesis 273

5214 <0.0001 <0.0001 Glioma 299

5218 <0.0001 <0.0001 Melanoma 315

4360 <0.0001 0.8120 Axon guidance 337

5220 <0.0001 0.9733 Chronic myeloid leukemia 377

4670 <0.0001 0.9809 Leukocyte transendothelial migration 360

5219 <0.0001 <0.0001 Bladder cancer 259

4520 <0.0001 1 Adherens junction 329

5216 <0.0001 0.3898 Thyroid cancer 168

4210 <0.0001 0.0540 Apoptosis 299

4666 <0.0001 0.3209 Fc gamma R-mediated phagocytosis 251

562 <0.0001 0.3360 Inositol phosphate metabolism 103

The down- and mixed-regulated KEGG pathways were listed in Tables 4-20 and 4-21.

We found 10 cancer pathways were differentially expressed regardless of the direction (see

Table 4-21) and they were also significantly up-regulated except for two pathways, i.e., Small

cell lung cancer (KEGG ID 5222) and Renal cell carcinoma (KEGG ID 5211).

KEGG ID Down-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

3010 <0.0001 1 Ribosome 232

190 <0.0001 0.6578 Oxidative phosphorylation 173

5012 0.0003 0.6039 Parkinson's disease 194

4260 0.0010 0.8469 Cardiac muscle contraction 136

5410 0.0487 0.1102 Hypertrophic cardiomyopathy (HCM) 238

KEGG ID Down-regulated FDR Up-regulated FDR Mixed-regulated FDR KEGG Pathway No. of Probes

5218 1 <0.0001 <0.0001 Melanoma 315

4115 1 0.0546 <0.0001 p53 signaling pathway 293

5215 1 <0.0001 <0.0001 Prostate cancer 433

5214 1 <0.0001 <0.0001 Glioma 299

350 1 0.0010 <0.0001 Tyrosine metabolism 72

5219 1 <0.0001 <0.0001 Bladder cancer 259

5200 1 <0.0001 <0.0001 Pathways in cancer 1292

5020 1 0.0005 <0.0001 Prion diseases 138

5211 1 <0.0001 <0.0001 Renal cell carcinoma 283

4150 1 0.4020 <0.0001 mTOR signaling pathway 165

5212 1 <0.0001 <0.0001 Pancreatic cancer 342

4370 1 0.1095 0.0003 VEGF signaling pathway 256

5223 1 <0.0001 0.0003 Non-small cell lung cancer 237

4720 1 0.0105 0.0004 Long-term potentiation 197

4914 1 0.1095 0.0005 Progesterone-mediated oocyte maturation 241

4012 1 0.0004 0.0006 ErbB signaling pathway 326

4060 1 0.0006 0.0009 Cytokine-cytokine receptor interaction 583

5014 1 0.0042 0.0011 Amyotrophic lateral sclerosis (ALS) 173

360 1 0.1826 0.0011 Phenylalanine metabolism 35

5213 1 <0.0001 0.0011 Endometrial cancer 256

5010 1 0.1706 0.0012 Alzheimer's disease 391

5144 1 0.0003 0.0047 Malaria 237

5160 1 <0.0001 0.0076 Hepatitis C 356

4960 1 0.6578 0.0142 Aldosterone-regulated sodium reabsorption 112

4810 1 <0.0001 0.0167 Regulation of actin cytoskeleton 536

3320 1 0.1125 0.0184 PPAR signaling pathway 158

4610 0.3898 1 0.0238 Complement and coagulation cascades 175

4912 1 0.0020 0.0271 GnRH signaling pathway 287

4540 1 <0.0001 0.0299 Gap junction 224

330 0.0278 1 0.0363 Arginine and proline metabolism 112

5222 1 0.0004 0.0452 Small cell lung cancer 329

4010 1 <0.0001 0.0478 MAPK signaling pathway 694

4.1.2 Self-Contained Gene Set Test

The self-contained gene set test investigated in this project is the rotation gene set test in the limma package. The rotation gene set test (Wu et al. 2010) is considered a self-contained gene set test because only the information contained in the gene set of interest is used to test whether any of the genes in the set are differentially expressed.

Gene Ontology (GO) Terms

No GO terms were found to be statistically significant in the contrast between insulin resistant patients and lean control when controlling the false discovery rate at the level of 0.1. The same was true for the contrast between diabetic patients and lean control and for the contrast between insulin sensitive patients and lean control.

KEGG Pathways

No KEGG pathways were found to be significant in the contrast between insulin resistant patients and lean control when controlling the false discovery rate at the level of 0.1. In the contrast between diabetic patients and lean control, gene sets defined by 5 KEGG pathways were differentially expressed regardless of direction when a threshold of 0.1 was used to control the false discovery rate. All 5 of these KEGG pathways were significantly mixed-regulated in the mean-rank gene set test. The top 8 mixed-regulated KEGG pathways are shown in Table 4-22. No significantly up- or down-regulated KEGG pathways were found in the contrast between diabetic patients and lean control at the same false discovery rate threshold of 0.1.

In the contrast between insulin sensitive patients and lean control, gene sets associated with 18 KEGG pathways were significantly up-regulated when controlling the false discovery rate at the level of 0.1 (see Table 4-23).
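The FDR thresholds used in this section can be illustrated with a short sketch. It assumes the Benjamini-Hochberg step-up adjustment, which is an assumption here because the text does not name the adjustment method. The function name `bh_adjust` and the p-values are hypothetical and are not taken from the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    m = len(pvalues)
    # Visit the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted p-values for five gene sets.
raw = [0.001, 0.02, 0.03, 0.04, 0.5]
fdr = bh_adjust(raw)
# Gene sets declared significant when the FDR is controlled at 0.1.
significant = [i for i, q in enumerate(fdr) if q < 0.1]
```

A gene set is reported as significant when its adjusted value falls below the chosen level, here 0.1.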

Table 4-22: Rotation Gene Set Test - Top 8 mixed-regulated KEGG pathways in the contrast between diabetic and lean control

KEGG ID  Mixed-regulated FDR  Up-regulated FDR  KEGG Pathway  No. of Probes

360 0.0376 0.3052 Phenylalanine metabolism 35

10 0.0940 0.3052 Glycolysis / Gluconeogenesis 113

5146 0.0940 0.3354 Amoebiasis 346

350 0.0940 0.3513 Tyrosine metabolism 72

620 0.0977 0.7738 Pyruvate metabolism 68

4144 0.1170 0.3052 Endocytosis 554

5014 0.1170 0.3360 Amyotrophic lateral sclerosis (ALS) 173

4910 0.1170 0.6640 Insulin signaling pathway 321

Table 4-23: Rotation Gene Set Test - Top 20 up-regulated KEGG pathways in the contrast between insulin sensitive and lean control

KEGG ID  Mixed-regulated FDR  Up-regulated FDR  KEGG Pathway  No. of Probes

5213 0.2464 0.0188 Endometrial cancer 256

4070 0.2464 0.0188 Phosphatidylinositol signaling system 153

4320 0.2464 0.0301 Dorso-ventral axis formation 85

5223 0.2464 0.0301 Non-small cell lung cancer 237

562 0.2464 0.0301 Inositol phosphate metabolism 103

603 0.2464 0.0313 Glycosphingolipid biosynthesis - globo series 25

5214 0.2464 0.0322 Glioma 299

5210 0.2464 0.0376 Colorectal cancer 316

5218 0.2464 0.0418 Melanoma 315

4360 0.2747 0.0478 Axon guidance 337

5221 0.2464 0.0478 Acute myeloid leukemia 224

4962 0.3107 0.0533 Vasopressin-regulated water reabsorption 78

4916 0.2464 0.0636 Melanogenesis 273

4010 0.2464 0.0929 MAPK signaling pathway 694

5212 0.2464 0.0929 Pancreatic cancer 342

5216 0.2464 0.0929 Thyroid cancer 168

5160 0.2464 0.0929 Hepatitis C 356

5200 0.2464 0.0940 Pathways in cancer 1292

4144 0.2464 0.1009 Endocytosis 554

4114 0.2747 0.1056 Oocyte meiosis 254

We notice that several cancer related KEGG pathways were again significant when the rotation gene set test was applied to the comparison of insulin sensitive patients versus healthy controls. Probes common to some of the cancer related pathways of interest were identified, and boxplots were produced to compare their gene expression levels across samples. For example, 62 probes are common to the "Pancreatic cancer" and "Colorectal cancer" pathways, and the median value of these 62 probes was computed for each sample (see the boxplot in Figure 4-1). The medians, upper quartiles and lower quartiles of the 3 groups of obese patients appear to be higher than those of the controls. Another example is given in Figure 4-2.
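The per-sample summary behind these boxplots can be sketched as follows. This is a minimal illustration under stated assumptions: the probe identifiers, expression values and the function name `common_probe_medians` are hypothetical, and the actual analysis was presumably carried out in R.

```python
from statistics import median


def common_probe_medians(pathway_a, pathway_b, expression):
    """Median expression, per sample, over the probes shared by two pathways.

    expression maps a probe identifier to a list of values, one per sample.
    """
    shared = sorted(set(pathway_a) & set(pathway_b))
    n_samples = len(next(iter(expression.values())))
    medians = [
        median(expression[probe][s] for probe in shared)
        for s in range(n_samples)
    ]
    return shared, medians


# Hypothetical probes and log-expression values for three samples.
expression = {
    "p1": [7.0, 8.0, 9.0],
    "p2": [6.0, 7.5, 9.5],
    "p3": [5.0, 6.0, 7.0],
    "p4": [4.0, 4.0, 4.0],
}
shared, medians = common_probe_medians(
    ["p1", "p2", "p3"], ["p2", "p3", "p4"], expression
)
```

The resulting per-sample medians would then be grouped by patient class before drawing the boxplot.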
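
Figure 4-1: Boxplot of the per-sample medians of the probes common to the "Pancreatic cancer" and "Colorectal cancer" KEGG pathways on the Agilent human array

Figure 4-2: Boxplot of the per-sample medians of the probes common to the "Pancreatic cancer" and "Endometrial cancer" KEGG pathways on the Agilent human array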

To draw basic biological inferences from the results generated by the gene set tests, I reviewed some recent publications concerning various GO terms and KEGG pathways and their roles in the development of type 2 diabetes. The discussion is given in Section 6.2.

4.1.3 Comparison of Three Gene Set Tests for Insulin Related GO Terms

All GO terms containing the word

Insulin Resistance versus Lean Control

Gene sets associated with 12 GO categories were significantly down-regulated using the

mean-rank gene set test (FDR<0.1). None of these GO categories were formally significant

when using either the rotation gene set test (Roast) or the correlation adjusted mean-rank gene

set test (Camera). However, the ranks of the GO categories in terms of the statistical

significance were very similar for the three different methods (see Table 4-24).

Table 4-24: Comparison of down-regulated insulin related GO terms in the contrast between insulin resistance and lean control

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR  No. of Probes

GO:0016942 insulin-like growth factor binding protein complex <0.0001 0.1233 0.1790 28

GO:0048009 insulin-like growth factor receptor signaling pathway 0.0002 0.1233 0.1790 40

GO:0005158 insulin receptor binding 0.0009 0.1233 0.1790 87

GO:0005159 insulin-like growth factor receptor binding 0.0010 0.1692 0.1790 45

GO:0005520 insulin-like growth factor binding 0.0010 0.2417 0.3259 56

GO:0043559 insulin binding 0.0020 0.1692 0.2223 18

GO:0043560 insulin receptor substrate binding 0.0117 0.2562 0.3462 44

GO:0005010 insulin-like growth factor receptor activity 0.0121 0.2417 0.3259 18

GO:0031994 insulin-like growth factor I binding 0.0543 0.3002 0.3898 41

GO:0061179 negative regulation of insulin secretion involved in cellular response to glucose 0.0739 0.1233 0.1790 2

GO:0032869 cellular response to insulin stimulus 0.0753 0.4152 0.4576 163

GO:0046627 negative regulation of insulin receptor signaling pathway 0.0832 0.4012 0.4535 64

To evaluate the potential relationship between the results from each pair of the three

methods, unadjusted p-values were used to produce scatterplot matrices. An extremely strong

positive correlation is evident (see Figure 4-3) between the unadjusted p-values generated from

Roast and Camera. Figure 4-3 shows that there is a moderate positive correlation between the

unadjusted p-values from the mean-rank gene set test and Roast, or between the mean-rank

gene set test and Camera.

[Scatterplot matrix: "Down-regulated P values - Contrast Between Insulin Resistance and Lean Control"; panels include MeanRank and Camera, axes from 0.0 to 1.0]

Figure 4-3: Scatterplot matrix of down-regulated unadjusted p-values from three methods

Two GO categories were significantly up-regulated (see Table 4-25) using the mean-

rank gene set test (FDR<0.1). Both Roast and Camera generated large adjusted p-values and

the scatterplot matrix in Figure 4-4 shows some similar positive linear trends.

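Table 4-25: Comparison of up-regulated insulin related GO terms in the contrast between insulin resistance and lean control

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR  No. of Probes

GO:0043569 negative regulation of insulin-like growth factor receptor signaling pathway 0.0149 0.2030 0.2746 9

GO:0008286 insulin receptor signaling pathway 0.0149 0.9940 0.9882 346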

[Scatterplot matrix: "Up-regulated P values - Contrast Between Insulin Resistance and Lean Control"; panels include MeanRank and Camera, axes from 0.0 to 1.0]

Figure 4-4: Scatterplot matrix of up-regulated unadjusted p-values from three methods

In the contrast between diabetic patients and the lean control, gene sets defined by 9 insulin related GO terms were significantly down-regulated (see Table 4-26) using the mean-rank gene set test (FDR<0.1). Both Roast and Camera appear to be very conservative in detecting differentially expressed (DE) gene sets, and their unadjusted p-values are highly positively correlated even though their null hypotheses are quite different (see Figure 4-5).

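Table 4-26: Comparison of down-regulated insulin related GO terms in the contrast between diabetic and lean control

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR  No. of Probes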

GO:0048009 insulin-like growth factor receptor signaling pathway 0.0018 0.2127 0.3456 40

GO:0005010 insulin-like growth factor receptor activity 0.0072 0.2958 0.3492 18

GO:0043559 insulin binding 0.0124 0.3142 0.3492 18

GO:0005158 insulin receptor binding 0.0132 0.3190 0.3492 87

GO:0005520 insulin-like growth factor binding 0.0135 0.4036 0.4270 56

GO:0005159 insulin-like growth factor receptor binding 0.0259 0.3190 0.3898 45

GO:0016942 insulin-like growth factor binding protein complex 0.0286 0.4082 0.4270 28

GO:0050796 regulation of insulin secretion 0.0671 0.2958 0.3492 172

[Scatterplot matrix: "Down-regulated P values - Contrast Between Obese Diabetic and Lean Control"; panels include MeanRank and Camera, axes from 0.0 to 1.0]

Figure 4-5: Scatterplot matrix of down-regulated unadjusted p-values from three methods in the

contrast between diabetic and lean control

Using the mean-rank gene set test, gene sets defined by 2 GO terms were significantly

up-regulated in diabetic relative to the lean control. Roast also detected one gene set to be up-

regulated (FDR<0.1) as listed in Table 4-27.

Table 4-27: Comparison of up-regulated insulin related GO terms in the contrast between diabetic and lean control

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR  No. of Probes

GO:0046676 negative regulation of insulin secretion <0.0001 0.0435 0.1449 49

GO:0043569 negative regulation of insulin-like growth factor receptor signaling pathway 0.0291 0.2030 0.2931 9

Gene sets related to 2 GO terms were significantly up-regulated (see Table 4-28) using

the mean-rank gene set test in the contrast between insulin sensitive patients and the lean

control. No GO terms were found to be significantly down-regulated using any of the three

methods.

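Table 4-28: Comparison of up-regulated insulin related GO terms in the contrast between insulin sensitive and lean control

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR  No. of Probes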

GO:0008286 insulin receptor signaling pathway <0.0001 0.1740 0.9182 346

GO:0032869 cellular response to insulin stimulus 0.0571 0.5075 0.9431 163

4.1.4 Comparison of Three Gene Set Tests for Glucose Related GO Terms

GO terms containing

Using the mean-rank gene set test, gene sets defined by 4 GO terms were down-

regulated (see Table 4-30) whereas genes associated with 4 other GO terms were up-regulated

(see Table 4-31). Again, Roast and Camera found no DE gene sets after adjusting for multiple

testing.

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR

GO:0015758 glucose transport 0.0091 0.1643 0.2912

GO:0005536 glucose binding 0.0153 0.1643 0.2765

GO:0001678 cellular glucose homeostasis 0.0486 0.1643 0.2765

GO:0046323 glucose import 0.0867 0.1500 0.2765

Table 4-31: Comparison of up-regulated glucose related GO terms in the contrast between

GO ID  GO Term  MeanRank FDR  Roast FDR  Camera FDR

GO:0006006 glucose metabolic process 0.0307 0.5167 0.9885

GO:0042593 glucose homeostasis 0.0307 0.7500 0.9885

GO:0003980 UDP-glucose:glycoprotein glucosyltransferase activity 0.0727 0.1500 0.7729

GO:0009749 response to glucose stimulus 0.0727 0.6800 0.9885

No glucose related GO terms were found to be down- or up-regulated in the contrast

between insulin sensitive patients and the lean control after three different gene set tests were

carried out.

4.1.5 Comparison of Three Gene Set Tests for the FOXO Gene Set

Four FOXO genes have been identified in mammals and three of them are found in humans, i.e., FOXO1, FOXO3 and FOXO4. The inter-gene correlation of this set is 0.2727. Using both the mean-rank gene set test and the rotation gene set test, the FOXO gene set was found to be significantly up-regulated in two contrasts (i.e., insulin resistant patients versus controls and insulin sensitive patients versus controls) using a false discovery rate threshold of 0.05. When testing a single gene set, the unadjusted p-values from the mean-rank and the rotation gene set tests appear to be similar. However, Camera returned no significant results in any of the contrasts. A summary of the results is shown in Table 4-32.
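Camera's more conservative behaviour for this set can be related to its variance inflation factor. As implemented in limma's camera, the variance of the set statistic is inflated by 1 + (m - 1)ρ, where m is the number of genes in the set and ρ is the mean inter-gene correlation. The sketch below only evaluates this factor for the reported correlation, assuming the set contributes three genes with one probe each. It is an illustration and not the computation performed in this project, and the function name is hypothetical.

```python
def variance_inflation_factor(set_size, mean_correlation):
    """VIF = 1 + (m - 1) * rho for a gene set of m correlated genes."""
    return 1.0 + (set_size - 1) * mean_correlation


# Reported inter-gene correlation; a set size of 3 is an assumption.
vif = variance_inflation_factor(3, 0.2727)
```

Under these assumptions the variance of the set statistic is inflated by a factor of about 1.55 relative to the independence case, which may partly explain the larger Camera p-values.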

Table 4-32: Summary of three gene set tests for the FOXO gene set in three contrasts

Contrast  Direction  MeanRank P-value  Roast P-value  Camera P-value

Insulin Resistant versus Lean Control  Down-regulated  0.9965  0.9610  0.9394

Insulin Resistant versus Lean Control  Up-regulated  0.0034  0.0400  0.0605

Insulin Resistant versus Lean Control  Mixed-regulated  0.2738  0.2200  0.1211

Diabetic versus Lean Control

Insulin Sensitive versus Lean Control

4.2 Hypergeometric Test for Gene Set Enrichment Analysis

4.2.1 The longitudinal mouse study involving the comparison of a high-fat diet to the control

In the longitudinal mouse study, we first need to define the gene universe. Non-specific filtering was carried out to remove probe sets with little variation across samples, i.e., those with an inter-quartile range of less than 0.5. Probes with no annotation to Gene Ontology (Biological Process) terms were excluded. The same 1146 DE probes from the comparison between 42 days of high-fat diet and the control in the muscle tissue group were mapped to the genes of interest using the Entrez gene identifier. Because GO terms have parent-child hierarchies, a conditional hypergeometric test is required to decorrelate the results (Falcon & Gentleman 2007). Because of the way the conditional hypergeometric test operates, it is difficult to adjust the resultant p-values directly (Falcon & Gentleman 2007). Hence, a p-value cutoff of 0.01 is used for conditional hypergeometric tests in this section. Using the conditional test, 206 GO (Biological Process) terms were over-represented (p-value <0.01) in the list of DE genes (Falcon & Gentleman 2007). The top 30 over-represented GO terms are given in Table 4-33.
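The non-specific filtering step described above can be sketched as follows. This is a minimal illustration: the probe names and values are hypothetical, the functions `iqr` and `filter_by_iqr` are not from the original analysis, and the quartile convention ("inclusive") is an assumption because the text does not state which one was used.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range (quartile convention is an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(expression, cutoff=0.5):
    """Keep probe sets whose IQR across samples is at least `cutoff`."""
    return {probe: v for probe, v in expression.items() if iqr(v) >= cutoff}


# Hypothetical log-expression values across five samples.
expression = {
    "probe_flat": [7.0, 7.1, 7.0, 7.2, 7.1],
    "probe_variable": [5.0, 6.5, 7.0, 8.0, 9.5],
}
kept = filter_by_iqr(expression)
```

Only probe sets that vary enough across samples remain in the gene universe used for the hypergeometric tests.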

Table 4-33: Top 30 Over-represented GO (BP) terms after 42 days of high-fat diet in the longitudinal mouse study

GO ID (BP)  P value  Odds Ratio  Exp Count  Count  Size  Term

GO:0008104 <0.0001 2.2 57.0 107 620 protein localization

GO:0080090 <0.0001 1.7 172.0 243 1870 regulation of primary metabolic process

GO:0044238 <0.0001 1.7 125.8 179 1546 primary metabolic process

GO:0048519 <0.0001 1.6 123.9 176 1347 negative regulation of biological process

GO:0044249 <0.0001 1.5 191.8 252 2086 cellular biosynthetic process

GO:0015031 <0.0001 2.5 22.6 48 252 protein transport

GO:0023051 <0.0001 1.8 70.3 111 764 regulation of signaling

GO:0009059 <0.0001 1.5 155.5 211 1691 macromolecule biosynthetic process

GO:0034641 <0.0001 1.5 207.3 268 2254 cellular nitrogen compound metabolic process

GO:0010467 <0.0001 1.5 145.7 199 1608 gene expression

GO:0016043 <0.0001 1.5 146.3 198 1591 cellular component organization

GO:0051234 <0.0001 1.5 134.9 185 1485 establishment of localization

GO:0070727 <0.0001 2.2 28.1 54 306 cellular macromolecule localization

GO:0001932 <0.0001 2.1 29.0 54 315 regulation of protein phosphorylation

GO:0019220 <0.0001 2.0 33.5 60 364 regulation of phosphate metabolic process

GO:0032268 <0.0001 2.0 34.3 61 377 regulation of cellular protein metabolic process

GO:0051169 <0.0001 2.8 13.2 31 144 nuclear transport

GO:0016070 <0.0001 1.5 110.5 154 1202 RNA metabolic process

GO:0050657 <0.0001 5.4 3.7 14 40 nucleic acid transport

GO:0051236 <0.0001 5.4 3.7 14 40 establishment of RNA localization

GO:0060255 <0.0001 1.6 104.5 146 1180 regulation of macromolecule metabolic process

GO:0010608 <0.0001 2.7 12.5 29 136 posttranscriptional regulation of gene expression

GO:0043549 <0.0001 2.2 21.3 42 232 regulation of kinase activity

GO:0070647 <0.0001 2.2 20.1 39 219 protein modification by small protein conjugation or removal

GO:0008380 <0.0001 2.7 11.3 26 124 RNA splicing

GO:0009889 <0.0001 1.5 91.6 127 1020 regulation of biosynthetic process

GO:0060548 <0.0001 2.0 27.0 48 294 negative regulation of cell death

GO:0048583 <0.0001 1.5 78.7 111 856 regulation of response to stimulus

GO:0034440 <0.0001 4.0 4.5 14 49 lipid oxidation

GO:0045740 <0.0001 8.0 1.7 8 18 positive regulation of DNA replication

Using the non-conditional hypergeometric test, 16 KEGG pathways were found to be over-

represented (using p-value <0.01) in the list of DE genes (see Table 4-34).

Table 4-34: 16 over-represented KEGG pathways after 42 days of high-fat diet in the longitudinal mouse study

KEGG ID  P value  Odds Ratio  Exp Count  Count  Size  KEGG Pathway

3015 <0.0001 4.3230 4.8927 15 43 mRNA surveillance pathway

4722 <0.0001 3.0508 8.7614 21 77 Neurotrophin signaling pathway

3018 0.0001 4.8113 3.6411 12 32 RNA degradation

5100 0.0003 3.8625 4.5514 13 40 Bacterial invasion of epithelial cells

4141 0.0004 2.4729 11.7198 24 103 Protein processing in endoplasmic reticulum

5211 0.0006 3.3045 5.4617 14 48 Renal cell carcinoma

4510 0.0018 2.0511 15.2472 27 134 Focal adhesion

4720 0.0018 3.1950 4.7790 12 42 Long-term potentiation

62 0.0020 15.7554 0.6827 4 6 Fatty acid elongation in mitochondria

4920 0.0028 2.9929 5.0065 12 44 Adipocytokine signaling pathway

4910 0.0043 2.1819 10.1268 19 89 Insulin signaling pathway

4210 0.0050 2.7332 5.3479 12 47 Apoptosis

5212 0.0050 2.7332 5.3479 12 47 Pancreatic cancer

4912 0.0051 2.4859 6.7133 14 59 GnRH signaling pathway

4114 0.0060 2.4309 6.8271 14 60 Oocyte meiosis

4010 0.0062 1.7758 19.0021 30 167 MAPK signaling pathway
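The columns of Table 4-34 can be related to the non-conditional hypergeometric test with a small sketch. In the table, Exp Count divided by Size is about 0.114 in every row (for example 4.8927/43 and 8.7614/77), which is the proportion of DE genes in the gene universe. The function below is a plain implementation of the upper-tail hypergeometric probability. The function name and the numbers in the example are hypothetical, since the size of the gene universe is not reported in this section.

```python
from math import comb


def hypergeom_enrichment(universe, de_total, set_size, de_in_set):
    """Over-representation test for one gene set.

    universe  : number of genes in the gene universe
    de_total  : number of DE genes in the universe
    set_size  : number of universe genes annotated to the set ("Size")
    de_in_set : number of DE genes annotated to the set ("Count")
    Returns the expected count and P(X >= de_in_set).
    """
    expected = set_size * de_total / universe
    upper = min(set_size, de_total)
    total = comb(universe, de_total)
    # Sum the hypergeometric probabilities of the observed count or more.
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, de_total - k)
        for k in range(de_in_set, upper + 1)
    )
    return expected, tail / total


# Hypothetical example: 20 genes, 5 DE, a set of 4 genes with 3 DE.
expected, p_value = hypergeom_enrichment(20, 5, 4, 3)
```

A gene set is reported as over-represented when this p-value falls below the cutoff, which is 0.01 in this section.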

4.2.2 The cross-sectional human study comparing expression in tissue samples for a control group of healthy patients and obese patients

In the cross-sectional human study, a similar non-specific gene filtering process was

performed in order to determine the gene universe. When choosing the list of interesting genes,

we focused on the top 50 probes from each of the 3 contrasts separately because very few DE

probes were detected.

Using Top Ranked Probes from Insulin Resistant Versus Lean Control

Using the top 50 ranked probes from the comparison between the insulin resistant patients and the control group, the conditional hypergeometric test (Falcon & Gentleman 2007) was performed on their corresponding GO terms. Twelve GO terms were over-represented in the list of top ranked genes at a p-value below 0.01 (see Table 4-35). One KEGG pathway was over-represented (p-value <0.01). The top 10 over-represented KEGG pathways can be found in Table 4-36.

Table 4-35: Top 12 Over-represented GO terms in the contrast between insulin resistant patients and the control

GO:0046627 0.0003 98.0588 0.0284 2 10 negative regulation of insulin receptor signaling pathway

GO:0032869 0.0022 13.5680 0.2668 3 94 cellular response to insulin stimulus

GO:0010980 0.0057 370.8333 0.0057 1 2 positive regulation of vitamin D 24-hydroxylase activity

GO:0035408 0.0057 370.8333 0.0057 1 2 histone H3-T6 phosphorylation

GO:0042369 0.0057 370.8333 0.0057 1 2 vitamin D catabolic process

GO:0051345 0.0058 9.5160 0.3746 3 132 positive regulation of hydrolase activity

GO:0009636 0.0067 18.5826 0.1249 2 44 response to toxin

GO:0009103 0.0085 185.3889 0.0085 1 3 lipopolysaccharide biosynthetic process

GO:0010966 0.0085 185.3889 0.0085 1 3 regulation of phosphate transport

GO:0046325 0.0085 185.3889 0.0085 1 3 negative regulation of glucose import

GO:0055062 0.0085 185.3889 0.0085 1 3 phosphate ion homeostasis

GO:0007202 0.0096 15.2826 0.1504 2 53 activation of phospholipase C activity

Table 4-36: Top 10 Over-represented KEGG Pathways in the contrast between insulin resistant patients and the control

KEGG ID  P value  Odds Ratio  Exp Count  Count  Size  KEGG Pathway

4960 0.0007 99.0800 0.0431 2 27 Aldosterone-regulated sodium reabsorption

4010 0.0118 21.1416 0.1836 2 115 MAPK signaling pathway

5200 0.0265 13.4624 0.2793 2 175 Pathways in cancer

5130 0.0331 41.3667 0.0335 1 21 Pathogenic Escherichia coli infection

5110 0.0347 39.3810 0.0351 1 22 Vibrio cholerae infection

5143 0.0362 37.5758 0.0367 1 23 African trypanosomiasis

5223 0.0409 33.0267 0.0415 1 26 Non-small cell lung cancer

5214 0.0440 30.5556 0.0447 1 28 Glioma

590 0.0471 28.4253 0.0479 1 30 Arachidonic acid metabolism

4370 0.0486 27.4667 0.0495 1 31 VEGF signaling pathway

Using Top Ranked Probes from Diabetic versus Lean Control

Using the top 50 ranked probes from the comparison between the diabetic patients and the lean control group, 2 GO terms were over-represented in the list of top ranked genes (using p value <0.01). Only one KEGG pathway with a size of 13 genes,

Table 4-37: Top 15 Over-represented GO terms in the contrast between diabetic patients and the control

GO:0031099 0.0086 16.2326 0.1412 2 45 regeneration

GO:0048642 0.0094 166.8000 0.0094 1 3 negative regulation of skeletal muscle tissue development

GO:0006488 0.0125 111.1833 0.0125 1 4 dolichol-linked oligosaccharide biosynthetic process

GO:0009225 0.0156 83.3750 0.0157 1 5 nucleotide-sugar metabolic process

GO:0046622 0.0156 83.3750 0.0157 1 5 positive regulation of organ growth

GO:0006044 0.0187 66.6900 0.0188 1 6 N-acetylglucosamine metabolic process

GO:0048635 0.0187 66.6900 0.0188 1 6 negative regulation of muscle organ development

GO:0030260 0.0218 55.5667 0.0220 1 7 entry into host cell

GO:0045736 0.0218 55.5667 0.0220 1 7 negative regulation of cyclin-dependent protein kinase activity

GO:0051828 0.0218 55.5667 0.0220 1 7 entry into other organism involved in symbiotic interaction

GO:0052126 0.0218 55.5667 0.0220 1 7 movement in host environment

GO:0006040 0.0279 41.6625 0.0282 1 9 amino sugar metabolic process

GO:0019059 0.0279 41.6625 0.0282 1 9 initiation of viral infection

GO:0031103 0.0309 37.0278 0.0314 1 10 axon regeneration

GO:0046627 0.0309 37.0278 0.0314 1 10 negative regulation of insulin receptor signaling pathway

Using Top Ranked Probes from Insulin Sensitive versus Lean Control

Using the top ranked 50 probes from the contrast of insulin sensitive patients versus the

control group, 12 GO terms were over-represented (using p value <0.01). Table 4-38 shows the

results.

Table 4-38: 12 Over-represented GO terms in the contrast between insulin sensitive patients and the control

GO:0051098 0.0006 12.1297 0.4109 4 131 regulation of binding

GO:0007190 0.0049 21.8487 0.1066 2 34 activation of adenylate cyclase activity

GO:0031281 0.0052 21.1834 0.1098 2 35 positive regulation of cyclase activity

GO:0051349 0.0052 21.1834 0.1098 2 35 positive regulation of lyase activity

GO:0032656 0.0063 333.6500 0.0063 1 2 regulation of interleukin-13 production

GO:0045404 0.0063 333.6500 0.0063 1 2 positive regulation of interleukin-4 biosynthetic process

GO:0051895 0.0063 333.6500 0.0063 1 2 negative regulation of focal adhesion assembly

GO:0030100 0.0089 15.8612 0.1443 2 46 regulation of endocytosis

GO:0043011 0.0094 166.8000 0.0094 1 3 myeloid dendritic cell differentiation

GO:0046855 0.0094 166.8000 0.0094 1 3 inositol phosphate dephosphorylation

GO:0046856 0.0094 166.8000 0.0094 1 3 phosphatidylinositol dephosphorylation

GO:0050765 0.0094 166.8000 0.0094 1 3 negative regulation of phagocytosis

Four KEGG pathways were over-represented (using p value <0.05) in this contrast (see Table

4-39).

Table 4-39: 4 Over-represented KEGG pathways in the contrast between insulin sensitive patients and the control

KEGG ID  P value  Odds Ratio  Exp Count  Count  Size  KEGG Pathway

4141 0.0182 12.0246 0.2153 2 60 Protein processing in endoplasmic reticulum

592 0.0319 38.9219 0.0323 1 9 alpha-Linolenic acid metabolism

1040 0.0388 31.1125 0.0395 1 11 Biosynthesis of unsaturated fatty acids

4977 0.0492 23.9038 0.0502 1 14 Vitamin digestion and absorption

Chapter 5 - Cluster Analysis

In this chapter, we first use hierarchical clustering to explore the structure of any underlying groups in each of the three data sets separately. Both the Euclidean distance and the Pearson correlation were used as dissimilarity measures before applying hierarchical clustering. Secondly, the two mouse data sets were integrated, as they both used the same type of Affymetrix mouse arrays. Hierarchical clustering was then performed to discover any patterns that differ from the dendrograms produced from the individual data sets alone. Lastly, the longitudinal mouse study (comparing a high-fat diet to the control in two tissues) and the cross-sectional human study (comparing expression in tissue samples for a control group of healthy patients, obese insulin sensitive patients and patients with two stages of the disease) were integrated after standardisation. Hierarchical clustering was then applied to the samples to look for any group structure shared between the mouse model and the human model.

5.1 Hierarchical Clustering for Mouse Data Sets

In general, no gene filtering was performed in the hierarchical clustering of individual data sets, i.e., all the genes were used to calculate distance matrices and produce dendrograms, because selecting genes on the basis of differential expression can bias the clustering. For the longitudinal mouse study comparing a high-fat diet to the control group in two tissues, adipose and muscle, samples from the two tissue groups were clearly separated in both dendrograms (see Figures 5-1 and 5-2). The structure of the cluster trees was reasonably similar whether the Euclidean distance or the correlation distance was used as the measure of dissimilarity. Overall, the control mice (i.e., those on a standard low-fat diet) were grouped together within each tissue type. In the adipose tissue group, two samples from the 42-day high-fat diet group (i.e., Ahi42.1 and Ahi42.4) were clustered together with the samples from the control group in both dendrograms.
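The two dissimilarity measures compared in this section can be sketched in a few lines. The correlation distance is taken as 1 minus the Pearson correlation, which matches the `as.dist(1 - cor(...))` expression visible in the plot subtitles. The expression profiles and function names below are hypothetical.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_distance(x, y):
    """Dissimilarity defined as 1 - Pearson correlation."""
    return 1.0 - pearson(x, y)


# Two hypothetical samples with the same shape but a different scale.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
d_euc = euclidean(a, b)
d_cor = correlation_distance(a, b)
```

In this example the Euclidean distance is large while the correlation distance is zero, which shows why the two measures need not produce identical dendrograms.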

[Dendrogram: complete linkage, Euclidean distance; vertical axis: Height]

Figure 5-1: Hierarchical clustering of the mouse high-fat diet data based on Euclidean distance.

[Dendrogram: complete linkage, correlation matrix; vertical axis: Height]

Figure 5-2: Hierarchical clustering of the mouse high-fat diet data based on Pearson correlation.

For the mouse cell line study, little difference can be found in the structure of the two

dendrograms produced based on either the Euclidean distance or the Pearson correlation

distance (see Figures 5-3 and 5-4).
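The dendrograms in this chapter are labelled as using complete linkage. A naive version of that agglomerative procedure is sketched below for illustration only. It is not the routine used to produce the figures, and the sample coordinates and function name are hypothetical.

```python
from math import dist


def complete_linkage(points):
    """Naive agglomerative clustering with complete linkage.

    Returns the merges as (members_a, members_b, height) in merge order.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                height = max(
                    dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Four hypothetical samples forming two well separated pairs.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
merges = complete_linkage(samples)
```

The merge heights correspond to the Height axis of a dendrogram: the two tight pairs join first, and the final merge occurs at the largest pairwise distance between the two groups.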

[Dendrogram: "Mouse Cell Line Cluster Dendrogram"; complete linkage, Euclidean distance, dist(t(X.CL))]

Figure 5-3: Hierarchical clustering of the mouse cell line data based on Euclidean distance.

[Dendrogram: "Mouse Cell Line Cluster Dendrogram (Correlation Matrix)"; complete linkage, correlation matrix, as.dist(1 - cor(X.CL))]

Figure 5-4: Hierarchical clustering of the mouse cell line data based on Pearson correlation.

5.2 Hierarchical Clustering for the Human Data Set

In the human model, the normalized expression values were used to create the distance matrix before hierarchical clustering was applied. Meaningful group structure was much less clear in both dendrograms than in those generated from the two mouse data sets. For example, in Figure 5-5, patients with type 2 diabetes were clustered together with insulin sensitive and non-obese healthy patients (i.e., lean control) when the Euclidean distance was used as the distance measure; another healthy patient was grouped together with a patient with insulin resistance. Similar examples can be seen in the dendrogram based on the correlation matrix (see Figure 5-6). In general, healthy patients from the control group were rarely grouped together, and in some cases they fell in the same branches of the cluster tree as obese insulin resistant and diabetic patients.

[Dendrogram: complete linkage, Euclidean distance; vertical axis: Height]

Figure 5-5: Hierarchical clustering of the human data based on Euclidean distance.

[Dendrogram: complete linkage, correlation matrix; vertical axis: Height]

Figure 5-6: Hierarchical clustering of the human data based on Pearson correlation.

5.3 Hierarchical Clustering for the Combined Mouse Data Sets

Since both mouse data sets used the Affymetrix mouse 430_2 arrays, we were able to combine the two data sets according to their probe identifiers. A standardization process was performed before combining the data sets: for each mouse data set, we standardized (or normalized) the arrays (i.e., samples or mice) by dividing each expression measure by the median value of the corresponding array. The dendrogram in Figure 5-7 shows two distinct groups with no intersection, i.e., samples in the mouse cell line data were completely separated from those in the longitudinal mouse study involving high-fat diet feeding.
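The median standardization described above amounts to one division per array. A minimal sketch is given below. The sample names and values are hypothetical, and the function `median_standardize` is not from the original analysis.

```python
from statistics import median


def median_standardize(arrays):
    """Divide every expression value by the median of its own array.

    arrays maps a sample name to its list of expression values.
    """
    scaled = {}
    for name, values in arrays.items():
        centre = median(values)
        scaled[name] = [v / centre for v in values]
    return scaled


# Two hypothetical arrays measured on different overall scales.
arrays = {"sample_a": [2.0, 4.0, 8.0], "sample_b": [5.0, 10.0, 20.0]}
scaled = median_standardize(arrays)
```

After this step every array has a median of 1, so arrays from the two studies are on a comparable scale before they are merged by probe identifier.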

[Dendrogram: "Cluster Dendrogram after normalisation"; complete linkage, Euclidean distance, dist(t(n.merged))]

Figure 5-7: Hierarchical clustering of the combined mouse data based on Euclidean distance.

The top 100 differentially expressed (DE) genes from each mouse data set were identified using the moderated F-statistic, which combines the moderated t-statistics for all the contrasts into an overall test of significance for each gene (Smyth 2004). We used these top DE genes to select a subset of the combined data. The same standardization process was carried out, and hierarchical clustering was then applied to this subset. Again, none of the samples from the mouse cell line study were grouped together with any of the samples from the longitudinal mouse study (see Figure 5-8).

Figure 5-8: Hierarchical clustering of the combined mouse data based on the top DE genes

[Dendrogram: "Dendrogram - Top 100 DE genes from Mouse Cell Line and High Fat Diet"; complete linkage, Euclidean distance, dist(t(subset2))]

5.4 Hierarchical Clustering for the Integrated Mouse and Human Data Sets

Based on the results from the analysis of differential expression in Chapter 3, we have

obtained top differentially expressed (DE) genes from both the longitudinal mouse model

involving high-fat feeding and the human study. We intend to use the top DE genes to integrate

these two data sets. Since no genes in the adipose group were detected to be differentially

expressed, we will focus our attention on the top DE genes found in the muscle tissue group.

The top 50 DE probes in the muscle group of the longitudinal mouse study were selected based

on the moderated F-statistic (Smyth 2004). Very few DE genes were found in the human Agilent

arrays for each of the three contrasts, so we will choose the top 50 probes based on the

moderated F-statistic. All the selected top ranked probes were mapped to their corresponding

gene symbols. Some genes encoded on the Agilent human arrays might not necessarily exist

on the mouse arrays and vice versa. Therefore we only keep those genes that are encoded on

both human and mouse arrays so that we are able to select their expression values from both

the mouse and human data sets. For those probes that mapped to the same gene symbol, the

average of the probe level expression values was kept as the expression measure for that gene

symbol. We performed a standardization process for each data set by using the z-

transformation: subtracting the mean of the array before dividing by the standard deviation of

the array across all genes. By applying the hierarchical clustering algorithm using the Euclidean

distance to the arrays, we generated the dendrogram below (see Figure 5-9). The hierarchical

clustering split the combined data set into two groups: the branches containing the mouse arrays were separated from those containing the human arrays. The dendrogram therefore provides little evidence of expression patterns shared between the two data sets.
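The integration steps in this section (averaging probes that map to the same gene symbol, keeping only genes present on both platforms, and applying the z-transformation to each array) can be sketched as below. The probe names, mappings and values are hypothetical. Applying the z-transformation after restricting to the shared genes is an assumption of this sketch, and the standard deviation used is the sample standard deviation.

```python
from statistics import mean, stdev

def collapse_to_genes(probe_values, probe_to_symbol):
    """Average the probes that map to the same gene symbol on one array."""
    grouped = {}
    for probe, value in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is not None:
            grouped.setdefault(symbol, []).append(value)
    return {symbol: mean(values) for symbol, values in grouped.items()}

def z_transform(gene_values):
    """Centre one array on its mean and scale by its SD across genes."""
    values = list(gene_values.values())
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {gene: (value - mu) / sd for gene, value in gene_values.items()}

# One hypothetical mouse array and one hypothetical human array.
mouse_probes = {"m1": 2.0, "m2": 4.0, "m3": 6.0, "m4": 9.0, "m5": 1.0}
mouse_map = {"m1": "ITGB1", "m2": "ITGB1", "m3": "LAMB2",
             "m4": "PRG4", "m5": "MOUSE_ONLY"}
human_probes = {"h1": 10.0, "h2": 20.0, "h3": 30.0, "h4": 5.0}
human_map = {"h1": "ITGB1", "h2": "LAMB2", "h3": "PRG4", "h4": "HUMAN_ONLY"}

mouse_genes = collapse_to_genes(mouse_probes, mouse_map)
human_genes = collapse_to_genes(human_probes, human_map)

# Keep only the gene symbols found on both platforms.
shared = sorted(set(mouse_genes) & set(human_genes))
mouse_z = z_transform({g: mouse_genes[g] for g in shared})
human_z = z_transform({g: human_genes[g] for g in shared})
```

Because each array is centred and scaled separately, the two platforms are placed on a comparable scale before the Euclidean distance is computed.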

[Dendrogram of the human and mouse arrays: complete linkage, Euclidean distance; y-axis: Height]

Figure 5-9: Hierarchical clustering of arrays using the combined longitudinal mouse study and the human model.

On the other hand, hierarchical clustering of top-ranked genes is also of interest. The dendrogram in Figure 5-10 shows the structure of the potential subgroups of the top-ranked genes in the combined mouse and human studies.

[Dendrogram of genes: "Clustering Top DE genes combined using Mouse HF and Human model"; complete linkage, Euclidean distance, dist(X.combined)]

Figure 5-10: Hierarchical clustering of the top differentially expressed (DE) genes using the combined longitudinal mouse study and the human model.

We can also display both pieces of information together in a heat map. Figure 5-11 shows a combined dendrogram clustering both the top genes and the arrays. The mouse samples are clearly separated from the human samples for all clusters of genes except for a cluster in which a combination of red and yellow cells was observed. This particular cluster contains the following 17 genes: ARIH1, CNOT4, DBNDD1, DNM1L, FXYD4, HEXIM1, ITGB1, LAMB2, LRRC23, MLL3, PRG4, RCN3, RFFL, SLC44A2, SYMPK, TPP2 and ZFAND5. We used these 17 genes to cluster the arrays (i.e., samples) in the combined data set (see Figure 5-12). Figure 5-12 shows a few interesting clusters. For example, four samples from the human data (obese insulin sensitive, diabetic and insulin resistant) were grouped with one sample from the longitudinal mouse adipose tissue group (Ahi42.1). Four samples from the lean control group in the human data were grouped together in one cluster. Most of the other samples from the human data were separated from the mouse samples.

[Heat map with dendrograms for both genes and arrays]

Figure 5-11: Hierarchical clustering of both the top DE genes and arrays using the combined longitudinal mouse study and the human model.

Figure 5-12: Hierarchical clustering of the arrays based on a selected cluster of DE genes.

[Dendrogram of the arrays based on the 17 selected genes: complete linkage, Euclidean distance, dist(t(sub.combined)); y-axis: Height]

Chapter 6 - Conclusions

6.1 Addressing the Research Questions

We first summarise the findings in order to address all the research questions listed in

Chapter 1. The findings are given in the following four sections: differentially expressed (DE)

genes, cross-species gene set tests, over-representation analysis and the selected gene set of

interest.

Differentially Expressed Genes

1. Which genes are differentially expressed in each condition in the human data, relative to

healthy controls?

We found 12 DE genes in the contrast between type 2 diabetic patients and healthy controls (FDR<0.1) and 2 other DE probes in the contrast between insulin sensitive patients and healthy controls. No DE genes were detected in the patients with insulin resistance relative to healthy controls. The false discovery rate (FDR) was controlled at the level of 0.1 because very few DE genes were found in the human data.

2. Which genes are differentially expressed in each treatment group in the mouse data

involving the comparison of a high-fat diet to controls?

In the muscle tissue group, 1146 probes were differentially expressed (FDR<0.01) in the

contrast between mice on 42 days of a high-fat diet and controls. A threshold of 0.01 is

chosen to control the false discovery rate given the large number of DE probes found in this

case. However, only 3 and 20 DE probes (FDR<0.01) were found in the contrasts between

mice on 5 days of a high-fat diet versus controls and on 14 days of a high-fat diet versus

controls respectively. This shows that the high-fat diet feeding over a long period of time

such as 6 weeks has a great impact on differential expression of genes in the muscle tissue.

No differentially expressed genes were found in any of the contrasts between three time

points and controls in the adipose tissue group even when using FDR<0.1.

3. Which genes are differentially expressed in each treatment group in the mouse cell line

study?

Using the moderated F-statistic, 15933 probes were differentially expressed (FDR<0.01) in at least one of the contrasts between the seven treatment groups and the control.
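The FDR thresholds quoted in the answers above refer to adjusted p-values. As an illustration, the sketch below implements the Benjamini-Hochberg adjustment (Benjamini & Hochberg 1995). That this particular adjustment was the one applied is an assumption here, and the p-values in the example are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values for five probes.
p = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = bh_adjust(p)

# Number of probes declared DE at two different FDR thresholds.
n_de_at_10_percent = sum(a < 0.1 for a in adj)
n_de_at_1_percent = sum(a < 0.01 for a in adj)
```

The same list of p-values gives a different number of DE probes at each threshold, which is why a looser cutoff (0.1) was reasonable for the human data and a stricter one (0.01) for the mouse data.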

Cross-species Gene Set Tests

involving a high-fat diet differentially expressed in obese patients with insulin resistance

in the human data relative to healthy controls?

We found 3082 GO terms in the human arrays after mapping the DE genes from the mouse

muscle tissue group to their corresponding GO terms in the human Agilent chip. Gene sets

defined by 420 GO terms were significantly up-regulated in obese patients with insulin

resistance relative to healthy controls using the mean-rank gene set test (FDR<0.05). Gene

sets associated with 121 GO terms were significantly down-regulated whereas 128 GO

terms were significantly mixed-regulated. A summary of the number of significant GO terms

with a size of no more than 500 probes is given in Table 6-1. However, the rotation gene set

test returned no DE gene sets (FDR<0.1).

Table 6-1: Summary of the number of significant GO terms in the contrast between insulin resistant patients and controls

Mean-Rank Gene Set Test    No. of GO terms containing    Total No. of GO terms
                           no more than 500 probes
Up-regulated               378                           420
Down-regulated             115                           121
Mixed-regulated            127                           128

The DE genes in the muscle tissue group were mapped to 188 KEGG pathways. 55 KEGG

pathways were significantly up-regulated using the mean-rank gene set test (FDR<0.05).

Gene sets in 19 KEGG pathways were significantly down-regulated and 4 KEGG pathways

were significantly mixed-regulated. Using the rotation gene set test, no KEGG pathways

were found formally differentially expressed.

involving a high-fat diet differentially expressed in obese patients with type 2 diabetes in

the human data relative to healthy controls?

In the contrast between type 2 diabetic patients and healthy controls, 488 GO terms were

significantly up-regulated (FDR<0.05) using the mean-rank gene set test. Gene sets defined

by 120 GO terms were significantly down-regulated and 182 GO terms were differentially

expressed regardless of the direction. Table 6-2 shows a summary of the number of

significant GO terms with a size of no more than 500 probes. The rotation gene set test led

to no formally significant gene sets.

Table 6-2: Summary of the number of significant GO terms in the contrast between diabetic

patients and controls


Based on the results of the mean-rank gene set test, 56 KEGG pathways were significantly

up-regulated, 17 KEGG pathways were significantly down-regulated and 20 KEGG

pathways were significantly mixed-regulated. When the rotation gene set test was applied, 5

KEGG pathways were significantly mixed-regulated, all of which were mixed-regulated in the

mean-rank gene set test.

involving a high-fat diet differentially expressed in obese insulin sensitive patients in the

human data relative to healthy controls?

Using the mean-rank gene set test, 428 GO terms were significantly up-regulated

(FDR<0.05) in the contrast between insulin sensitive patients and healthy controls. Gene

sets associated with 97 GO terms were significantly down-regulated and 192 GO terms

were significantly mixed-regulated. The number of significant GO terms with a size of no

more than 500 probes can be found in Table 6-3. The rotation gene set test found no

significant gene sets.

Table 6-3: Summary of the number of significant GO terms in the contrast between insulin sensitive patients and controls

We found 70 significantly up-regulated KEGG pathways using the mean-rank gene set test.

Among the top 30 KEGG pathways, 12 were related to different types of cancer. The

rotation gene set test found 20 up-regulated KEGG pathways and 9 of them were cancer

related pathways. These 9 cancer pathways were all listed as significant using the mean-

rank gene set test.

Over-representation Analysis

ranked genes in the mouse data with a high-fat diet?

It was found that 206 GO terms (Biological Process) and 16 KEGG pathways were over-

represented in the list of DE genes in the mouse data involving a high-fat diet. Among the 16

KEGG pathways, 2 were cancer related pathways, i.e., Pancreatic cancer and Renal cell

carcinoma.

ranked genes in obese patients with insulin resistance relative to healthy controls?

We found 12 GO terms (Biological Process) and one KEGG pathway (i.e.,

controls. The two GO terms are

unadjusted p-values whereas Camera appears to be the most conservative one, finding

almost no DE gene sets in this current situation.
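For the over-representation analysis summarised in this section, a common formulation is the upper tail of the hypergeometric distribution: the probability of seeing at least the observed number of annotated genes in the DE list by chance. The sketch below assumes that formulation, and the counts in the example are hypothetical rather than taken from the results.

```python
from math import comb

def over_representation_p(universe, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric X.

    universe  : number of genes on the array
    annotated : number of those genes in the GO term or KEGG pathway
    selected  : number of DE genes
    overlap   : number of DE genes that are also annotated
    """
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes, 5 in the pathway, 5 DE, 3 in both.
p_value = over_representation_p(20, 5, 5, 3)
```

A small p-value indicates that the pathway contains more DE genes than would be expected if the DE genes were drawn at random from the array.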

6.2 Discussion

In our lists of gene set test results, we identify some significant GO terms or KEGG

pathways that have been previously discussed and confirmed by a number of research teams.

Wnt signaling pathway

Several key components of the Wnt signaling pathway are implicated in metabolic homeostasis and the development of type 2 diabetes (Ip, Chiang & Jin 2012). In our findings, both the Wnt signaling pathway (KEGG: 4310) and the canonical Wnt receptor signaling pathway (GO: 0060070) were significantly up-regulated in two contrasts (obese insulin resistant patients versus healthy controls and diabetic patients versus healthy controls).

p53 signaling pathway

p53 activation is induced by a number of stress signals, including DNA damage,

oxidative stress and activated oncogenes. It was found that p53 expression in adipose tissue is

crucially involved in the development of insulin resistance, which underlies age-related

cardiovascular and metabolic disorders (Minamino et al. 2009). We found that the p53 signaling

pathway (KEGG: 4115) was significantly up-regulated in diabetic patients versus healthy

controls. It was also significantly mixed-regulated in the comparison of obese insulin sensitive

patients versus controls.

Adipocytokine signaling pathway

The adipocytokine signaling pathway, which is related to insulin resistance, was found to be significantly up-regulated in type 2 diabetic patients (Manoel-Caetano et al. 2012). Our results show that the adipocytokine signaling pathway (KEGG: 4920) is over-represented in the list of DE genes in the longitudinal mouse study involving a high-fat diet.

Oxidative phosphorylation

According to a gene expression analysis completed by a research collaboration in Japan, the oxidative phosphorylation (OXPHOS) pathway may predict the existence of diabetes because it was down-regulated in the peripheral blood mononuclear cells of patients with type 2 diabetes (Takamura et al. 2007). The down-regulation of this pathway was also detected by another study (Manoel-Caetano et al. 2012). Interestingly, we found the oxidative phosphorylation pathway (KEGG: 190) to be significantly down-regulated in all three contrasts (i.e., obese insulin sensitive, insulin resistant, and diabetic patients versus healthy controls) in the human data.

Citrate cycle (TCA cycle)

Some studies have demonstrated an important role for cyclic pathways of pyruvate metabolism (the pyruvate/malate, pyruvate/citrate, and pyruvate/isocitrate cycles) in the control of insulin secretion (Jensen et al. 2008). The citrate cycle (TCA cycle) pathway (KEGG: 20) was reported as the most significant KEGG pathway by Manoel-Caetano et al. (2012). In our results, the citrate cycle (TCA cycle) pathway appeared to be significantly down-regulated in insulin resistant patients versus controls as well as in diabetic patients versus controls.

6.3 Comments on the Experimental Design

The longitudinal mouse study used inbred mice for both control and treatment groups.

Hence a large number of DE genes were detected even with a small sample size. However, the

details of the experimental design of the human study are somewhat unclear. We also expect to

observe more genetic variation in humans. Given the limited number of samples, this could

potentially reduce the chances of finding true DE genes in any of the disease conditions relative

to healthy controls in the human data.

6.4 Further Work

Because of the nature of this experiment, biologists need to investigate these results further before valid biological inferences can be made. Simulation studies are also needed to compare the advantages and disadvantages of the three gene set testing methods: the mean-rank gene set test, Roast and Camera.

In the gene set tests, we focused on significant Gene Ontology (GO) terms with no more than 500 probes because of the parent-child hierarchical structure of GO. Other methods of addressing the GO hierarchy problem could be explored further.

For KEGG pathways, the KEGG.db package is now considered to be deprecated and

future versions of Bioconductor may not have it available. Hence, other possible alternatives

could be explored such as the reactome.db package.

References

Australia's Health 2010, 12th Biennial Health Report, Australian Institute of Health and Welfare, Canberra.

Benjamini, Y, Drai, D, Elmer, G, Kafkafi, N & Golani, I 2001, 'Controlling the false discovery rate in behavior genetics research', Behavioural Brain Research, vol. 125, no. 1, pp. 279-84.

Benjamini, Y & Hochberg, Y 1995, 'Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing', Journal of the Royal Statistical Society. Series B (Methodological), vol. 57, no. 1, pp. 289-300.

Carter, ME & Brunet, A 2007, 'FOXO transcription factors', Current biology : CB, vol. 17, no. 4, pp. R113-R4.

Copeland, NG, Jenkins, NA & O'Brien, SJ 2002, 'Mmu 16: Comparative Genomic Highlights', Science, vol. 296, no. 5573, pp. 1617-8.

de Wilde, J, Mohren, R, van den Berg, S, Boekschoten, M, Dijk, KW-V, de Groot, P, Muller, M, Mariman, E & Smit, E 2008, 'Short-term high fat-feeding results in morphological and metabolic adaptations in the skeletal muscle of C57BL/6J mice', Physiological Genomics, vol. 32, no. 3, pp. 360-9.

Falcon, S & Gentleman, R 2007, 'Using GOstats to test gene lists for GO term association', Bioinformatics, vol. 23, no. 2, pp. 257-8.

Gentleman, R, Ding, B, Dudoit, S & Ibrahim, J 2005, 'Distance Measures in DNA Microarray Data Analysis', in R Gentleman, V Carey, W Huber, R Irizarry & S Dudoit (eds), Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Springer New York, pp. 189-208.

Goeman, JJ & Bühlmann, P 2007, 'Analyzing gene expression data in terms of gene sets: methodological issues', Bioinformatics, vol. 23, no. 8, pp. 980-7.

Hayashi, Y, Kajimoto, K, Iida, S, Sato, Y, Mizufune, S, Kaji, N, Kamiya, H, Baba, Y & Harashima, H 2010, 'DNA microarray analysis of whole blood cells and insulin-sensitive tissues reveals the usefulness of blood RNA profiling as a source of markers for predicting type 2 diabetes', Biological & pharmaceutical bulletin, vol. 33, no. 6, pp. 1033-42.

Houstis, N 2006, 'Reactive oxygen species have a causal role in multiple forms of insulin resistance', Nature, vol. 440, no. 7086, pp. 944-8.

Ip, W, Chiang, YT & Jin, T 2012, 'The involvement of the wnt signaling pathway and TCF7L2 in diabetes mellitus: The current understanding, dispute, and perspective', Cell Biosci, vol. 2, no. 1, p. 28.

Jensen, MV, Joseph, JW, Ronnebaum, SM, Burgess, SC, Sherry, AD & Newgard, CB 2008, 'Metabolic cycling in control of glucose-stimulated insulin secretion', Am J Physiol Endocrinol Metab, vol. 295, no. 6, pp. E1287-97.

Manoel-Caetano, FS, Xavier, DJ, Evangelista, AF, Takahashi, P, Collares, CV, Puthier, D, Foss-Freitas, MC, Foss, MC, Donadi, EA, Passos, GA & Sakamoto-Hojo, ET 2012, 'Gene expression profiles displayed by peripheral blood mononuclear cells from patients with type 2 diabetes mellitus focusing on biological processes implicated on the pathogenesis of the disease', Gene, vol. 511, no. 2, pp. 151-60.

Minamino, T, Orimo, M, Shimizu, I, Kunieda, T, Yokoyama, M, Ito, T, Nojima, A, Nabetani, A, Oike, Y, Matsubara, H, Ishikawa, F & Komuro, I 2009, 'A crucial role for adipose tissue p53 in the regulation of insulin resistance', Nat Med, vol. 15, no. 9, pp. 1082-7.

Smyth, GK 2004, 'Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments', Statistical applications in genetics and molecular biology, vol. 3, no. 1, pp. 1-25.

Smyth, GK & Speed, T 2003, 'Normalization of cDNA microarray data', Methods (San Diego, Calif.), vol. 31, no. 4, pp. 265-73.

Stenbit, AE, Tsao, T-S, Li, J, Burcelin, R, Geenen, DL, Factor, SM, Houseknecht, K, Katz, EB & Charron, MJ 1997, 'GLUT4 heterozygous knockout mice develop muscle insulin resistance and diabetes', Nat Med, vol. 3, no. 10, pp. 1096-101.

Subramanian, A, Tamayo, P, Mootha, VK, Mukherjee, S, Ebert, BL, Gillette, MA, Paulovich, A, Pomeroy, SL, Golub, TR, Lander, ES & Mesirov, JP 2005, 'Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles', Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 43, pp. 15545-50.

Takamura, T, Honda, M, Sakai, Y, Ando, H, Shimizu, A, Ota, T, Sakurai, M, Misu, H, Kurita, S, Matsuzawa-Nagata, N, Uchikata, M, Nakamura, S, Matoba, R, Tanino, M, Matsubara, K-i & Kaneko, S 2007, 'Gene expression profiles in peripheral blood mononuclear cells reflect the pathophysiology of type 2 diabetes', Biochemical and Biophysical Research Communications, vol. 361, no. 2, pp. 379-84.

Tumurkhuu, G, Koide, N, Dagvadorj, J, Hassan, F, Islam, S, Naiki, Y, Mori, I, Yoshida, T & Yokochi, T 2007, 'MnTBAP, a synthetic metalloporphyrin, inhibits production of tumor necrosis factor

Weyer, C, Bogardus, C, Mott, DM & Pratley, RE 1999, 'The natural history of insulin secretory dysfunction and insulin resistance in the pathogenesis of type 2 diabetes mellitus', The Journal of clinical investigation, vol. 104, no. 6, pp. 787-94.

Wu, D, Lim, E, Vaillant, F, Asselin-Labat, ML, Visvader, JE & Smyth, GK 2010, 'ROAST: rotation gene set tests for complex microarray experiments', Bioinformatics, vol. 26, no. 17, pp. 2176-82.

Wu, D & Smyth, GK 2012, 'Camera: a competitive gene set test accounting for inter-gene correlation', Nucleic Acids Research.
