Bioinformatic analysis of the interface between mitochondrial biogenesis and apoptotic cell death signaling pathways in Parkinson’s disease.

Robert Bentham

Supervised by Dr G. Szabadkai and Dr K. Bryson

March 3, 2012

Contents

1 Introduction

2 Microarray Analysis
2.1 Data acquisition
2.2 Quality Control
2.2.1 Normalisation
2.3 LIMMA
2.3.1 Results
2.4 Gene Set Analysis
2.4.1 GSEA
2.4.2 GAGE
2.4.3 Results

3 Conclusion

References

A Tables

B R code

1 Introduction

Mitochondria are subcellular organelles present in most eukaryotic cells. They have a complex evolutionary history: according to endosymbiotic theory, they evolved from free-living bacteria that became incorporated within a cell. They have their own DNA (known as mtDNA), which is inherited from the mother only. Their primary function is to provide the rest of the cell with ATP, which is used as a source of energy, and this makes them essential for the healthy function of a cell.

Cell survival depends on the maintenance of a healthy cellular mitochondrial pool, which in turn depends on two processes: the degradation of damaged mitochondria by autophagy, and mitochondrial renewal through mitochondrial biogenesis. This project chiefly concerns the latter. Mitochondrial biogenesis is simply the process by which new mitochondria are formed; the precise biological machinery controlling it, however, is highly complex. Despite this complexity, the PGC-1 family of transcriptional coactivators has been identified as the master regulators of mitochondrial biogenesis [14].


Cancer, cardiovascular disease and neurodegenerative diseases such as Parkinson’s have all been associated with dysfunction of the mitochondria [14] [5]. A recent review of the overlapping pathways involved in Parkinson’s and cancer [3] stresses the role of mitochondria in both.

It has previously been shown that PGC-1α is downregulated in Parkinson’s disease [19]. This could contribute to the pathogenesis of Parkinson’s disease through mitochondrial dysfunction, which makes PGC-1α a potential therapeutic target. Additionally, a previous bioinformatic analysis of the role of PGC-1 in cancer [1] found that PGC-1 also downregulates stress pathways involved in DNA damage. Interestingly, DNA damage has also been suggested to be associated with Parkinson’s disease [12].

The aim of this work is to test the hypothesis that, in clinical samples of Parkinson’s disease, there are alterations in pathways involving DNA damage besides the downregulation of mitochondrial pathways. To do this, previously published microarray data will be studied, and significantly differentially expressed genes and gene pathways identified.

2 Microarray Analysis

A microarray is a device for measuring the expression levels of large numbers of genes. It does this by utilising the process of DNA hybridisation, which is illustrated in Figure 1. The expression level of each gene is detected by hybridising with a number of oligonucleotide fragments on the chip that act as probes. For a single gene there are 11 perfect match (PM) probes and 11 mismatch (MM) probes, in which the sequence differs by a single base. The MM probes are important for quality control: they measure the specificity of the hybridisation by giving an indication of any cross-hybridisation that has occurred. The chip is thus covered with a large number of DNA probes of both PM and MM type. The target RNA from the experimental sample is processed and fluorescently labelled, so that when hybridisation occurs with the probes, the intensity of the fluorescence at each spot on the microarray gives a measure of gene expression. For Affymetrix chips, two microarrays from different experimental conditions, one being from a control sample, can then be compared, and differences arising from the experimental condition inferred.

Figure 1: Image from Affymetrix illustrating the construction and workings of an Affymetrix microarray.

There are numerous issues in the use of microarrays, as with any other high-throughput technique. Firstly, there is a huge amount of data that must be analysed in a statistically robust manner. To maintain this robustness, quality control is an essential part of any analysis; these issues and others are discussed in [20]. Unfortunately, with microarrays different statistical techniques can lead to quite different results, so one must proceed with care. There are also many things beyond our control, such as technical variability in the actual experiment. This comes from differences in temperature and pH, which affect hybridisation on the microarray. Additionally, not every probe can be optimised equally for hybridisation; combined with the stochastic nature of biological systems, this leads to very noisy results with large systematic bias. Any statistical analysis must deal with these levels of noise and judge when to reject a microarray from the analysis because its systematic bias can no longer be tolerated.

2.1 Data acquisition

For the aims of this report, four datasets involving microarrays from patients with Parkinson’s disease were identified for analysis. The first dataset, which will be referred to as the Zheng dataset (available from GEO Series accession number GSE24378 [9]), is part of the meta study that identified PGC-1α as a potential target for Parkinson’s disease [19]. This particular study is made up of 17 samples, with 8 replicates for Parkinson’s disease and 9 replicates for the controls. The RNA used on the microarray is from 500 dopamine (DA) neurons from the pars compacta of the substantia nigra (SNc).

Three further datasets were selected for analysis. The Middleton dataset (available through GEO Series accession number GSE20292 [8]) was also used in the meta study [19] [28]; it has 18 control replicates and 11 replicates with Parkinson’s disease. The Mullen dataset (available through GEO Series accession number GSE7621 [6]) has 16 replicates for Parkinson’s disease and 9 replicates for the controls [17]. The final dataset will be referred to as Moran (available through GEO Series accession number GSE8397 [7] [21]). The Moran dataset has microarrays from both the Affymetrix U133A and U133B chips; of these, only the U133A chips were used, and only microarrays taken from the substantia nigra, with no distinction made between the lateral and medial parts. After this selection the Moran dataset contained 24 replicates for Parkinson’s disease and 15 replicates for the controls.

All of the datasets chosen were from microarrays using Affymetrix chips. Middleton and Moran used U133A chips, while Mullen used the more recent U133 Plus 2.0; the two differ in the number of genes they detect, the Plus 2.0 having probes for an additional 6500 genes. In contrast, the Zheng study used the U133 X3P chip, whose probes are designed to examine sequences closer to the 3’ end of transcripts. This is useful in cases of bad RNA degradation, which proceeds from the 5’ end of transcripts.

2.2 Quality Control

The purpose of quality control is to identify arrays that cannot be corrected and used in the analysis. Problems may include mistakes in the experimental procedure or a very low signal-to-noise ratio. For a comprehensive look at array quality a variety of measures should be examined. This can be quite time consuming; however, it is possible to automate the process somewhat with the R package arrayQualityMetrics [15]. A few of the main methods of quality control used will be discussed here, though there are many other techniques, many of which are generated automatically by the arrayQualityMetrics package.

The first thing to check for is array defects, by looking at a spatial plot of intensities. Areas of high intensity could indicate uneven hybridisation, while patterns in the spatial plot could indicate a particle being loose in the chip and scratching the surface while hybridisation occurs in a centrifuge. Figure 2a shows the spatial plots for all chips in the Middleton dataset; in this case, and in all other datasets examined, there were no problems with either array defects or hybridisation effects.

The next quantity to check for is RNA degradation or poor labelling. It is well known that RNA degradation starts from the 5’ end of a molecule and finishes at the 3’ end, a feature that the U133 X3P chip makes use of. For this reason, if RNA degradation has occurred the mean intensity of the probes at the 3’ end should be much higher; this can easily be checked and plotted in R. Figure 2b shows an increase in the intensities of probes at the 3’ end in the Zheng dataset. Indeed, all the other datasets showed similar results. This result could also be due to inefficient labelling, as the labelling reaction used in preparing the RNA sample proceeds from the 3’ end; however, since all samples were taken post mortem, it is very likely that the cause is RNA degradation. As all samples within each dataset have comparable degradation, this should not affect the analysis [2].

(a) Spatial plot showing probe intensities of microarrays for the Middleton dataset; all microarrays here are normal. This plot was generated with the arrayQualityMetrics package.

(b) RNA degradation plot (x axis: probe number, 5’ to 3’; y axis: mean intensity, shifted and scaled; samples C1 to C9 and PD1 to PD8) showing severe degradation for samples in the Zheng dataset, despite the special U133 X3P chip designed for cases with bad RNA degradation.

(c) PM and MM log2 intensity graph for data in the Middleton study, generated with the arrayQualityMetrics package.

Figure 2: Quality Control measures used in analysis

Additional measures of quality control include checking the density histogram of the PM and MM log2 intensities. MM probes measure the non-specific hybridisation, or cross-hybridisation, that occurs. The RNA is expected to bind more strongly to the PM probes than to the MM probes; if this is not the case then there will be a low signal-to-noise ratio in the results. This graph for the Middleton dataset is shown in Figure 2c. For all studies it was found that the RNA bound more strongly to the PM probes, though the graphs suggest that the levels of noise are possibly quite high.


2.2.1 Normalisation

An abundance of variation exists between chips in microarray analysis, even between replicates from the same sample. Variations have a combination of technical and biological causes: technical, such as the temperature and pH during hybridisation, and biological, such as differences between two samples coming from patients with the same condition. Therefore, for a fair comparison, all chips being analysed must be normalised with respect to each other. Checking for successful normalisation is the final step in quality control. The method of normalisation used in this report was Robust Multichip Average (RMA), which is fully described, along with possible alternative methods, in [11].
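RMA consists of several steps (background correction, quantile normalisation and probe summarisation). As a rough illustration of the normalisation idea only, the sketch below implements plain quantile normalisation in Python. It is not the RMA implementation used for this report: the function name and input layout are assumptions made for the example, and ties are handled naively.

```python
def quantile_normalise(arrays):
    """Quantile-normalise several chips so they share one intensity distribution.

    arrays: list of equal-length lists of log2 intensities, one list per chip
            (hypothetical layout chosen for this example).
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all chips must have the same number of probes")

    # Sort each chip, then average across chips at every rank.
    sorted_chips = [sorted(a) for a in arrays]
    rank_means = [
        sum(chip[i] for chip in sorted_chips) / len(arrays)
        for i in range(n_probes)
    ]

    normalised = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the mean intensity of its rank.
            out[probe_index] = rank_means[rank]
        normalised.append(out)
    return normalised
```

After this step every chip has exactly the same set of intensity values, only assigned to different probes, which is why the post-normalisation boxplots of the chips line up.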

Successful normalisation can be checked by examining boxplots and MvA plots both pre and post normalisation. Figure 3 shows the effect of normalisation on boxplots of the intensity values on each chip. In an MvA plot, M is the log2 fold change between the intensity values of each probeset on two different arrays, while A is the average log2 intensity of each probeset on those arrays. Every probeset is plotted with M on the y axis and A on the x axis. Figure 4 shows MvA plots post normalisation; ideally an MvA plot is symmetrical about the x axis and resembles a comet shape [24]. Once all the data has been normalised, the next task is to find the significantly differentially expressed genes.
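The M and A values defined above are simple to compute once the intensities are on the log2 scale. The following Python sketch shows the arithmetic for a pair of arrays; the function name and the list-based input are assumptions made for illustration.

```python
def mva_values(array_1, array_2):
    """Return (M, A) for two arrays of log2 probeset intensities.

    M is the log2 fold change between the arrays for each probeset,
    A is the average log2 intensity of each probeset.
    """
    if len(array_1) != len(array_2):
        raise ValueError("both arrays must contain the same probesets")
    m = [x - y for x, y in zip(array_1, array_2)]
    a = [(x + y) / 2 for x, y in zip(array_1, array_2)]
    return m, a
```

For well-normalised replicates the M values should scatter symmetrically around zero across the whole range of A.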

(a) Boxplot showing pre normalised data for the intensity of each chip in the Middleton study

(b) Boxplot showing post normalised data for the intensity of each chip in the Middleton study

Figure 3: The effect of normalisation on the boxplot showing intensity for each chip, graphs generated with the arrayQualityMetrics package. Normalisation is needed so all chips can be compared fairly.

2.3 LIMMA

LIMMA, or linear models for microarray data [25], is a package in R designed for finding significant genes by estimating the log fold changes in expression level between different experimental conditions. The method LIMMA uses is fully explained in [24]. The first step is to calculate the log fold change, for which LIMMA assumes a linear model:

$$E[y_j] = D\alpha_j \tag{1}$$

Here $y_j$ is the vector of expression data for gene $j$ across the samples, and $E[y_j]$ is its expected value. $D$ is the design matrix, which will be explained shortly, and $\alpha_j$ is the vector of coefficients, containing the differences between the experimental conditions. The design matrix and vector of coefficients can be constructed so that the comparison of interest, here the log fold change between the control and Parkinson’s cases, is built into the fitted model. To see this, suppose that there are 4 samples, 2 replicates for the control case and 2 for Parkinson’s disease. Then the design matrix and vector of coefficients can be written as:

$$D = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \alpha_j = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \tag{2}$$

Here $\theta_2$ can be written as $x_{pd} - x_c$, where $x_c$ and $x_{pd}$ are the log expression levels of a particular gene in the control and Parkinson’s samples respectively. Written like this, $\theta_2$ gives the difference between the log expression levels in the Parkinson’s and control cases, while $\theta_1$ gives the log expression level in the control samples. LIMMA estimates both coefficients, but only the value of $\theta_2$, representing the log fold change, is of interest.

Figure 4: MvA plots between a sample of control replicates in the Moran study shown post normalisation. The ideal shape of an MvA plot for replicates post normalisation is symmetrical about the x axis and resembles a comet shape. Here the LOESS line is shown in red and, if normalisation is done well, should lie on the x axis. The replicates shown here have all been normalised fairly well.
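The role of the two coefficients in the four-sample design can be checked numerically. The sketch below is an ordinary least squares fit for a two-column design matrix, written in plain Python; it is not LIMMA itself (there is no empirical Bayes step), and the function name and the expression values in the usage note are made up for illustration.

```python
def fit_two_column_design(design, y):
    """Least squares estimate of (theta_1, theta_2) for a two-column design.

    design: list of [d1, d2] rows, one per sample.
    y:      list of log2 expression values for one gene, one per sample.
    Solves the normal equations (D^T D) theta = D^T y directly.
    """
    a11 = sum(row[0] * row[0] for row in design)
    a12 = sum(row[0] * row[1] for row in design)
    a22 = sum(row[1] * row[1] for row in design)
    b1 = sum(row[0] * value for row, value in zip(design, y))
    b2 = sum(row[1] * value for row, value in zip(design, y))

    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("design matrix is not of full rank")

    theta_1 = (a22 * b1 - a12 * b2) / det
    theta_2 = (a11 * b2 - a12 * b1) / det
    return theta_1, theta_2
```

With the design `[[1, 0], [1, 0], [1, 1], [1, 1]]` and hypothetical log2 values `[7.0, 7.4, 8.1, 8.5]`, the fit gives the control mean (7.2) as the first coefficient and the difference between the group means (1.1) as the second, which is the log fold change of interest.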

LIMMA then uses an empirical Bayes method to adjust these estimates. Empirical Bayes borrows information across genes and keeps the analysis stable, which is especially needed for experiments with small numbers of arrays [24]. After this, LIMMA automatically calculates FDR adjusted p values, which are needed because multiple hypotheses are being tested for significance. For example, if 1000 genes were tested at a significance level of 0.01, about 10 genes would be expected to be deemed significant even if no gene were truly differentially expressed. The false discovery rate (FDR) adjustment modifies the p values so that this false discovery rate is controlled. The adjusted p value is essentially the expected proportion of false discoveries among those genes which have been classified as differentially expressed. If a particular gene has an FDR adjusted p value of 0.07, an estimated 7% of the genes with adjusted p values at or below this value are false positives.
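The text does not say which FDR procedure is applied, so the sketch below assumes the widely used Benjamini-Hochberg step-up procedure. It is a plain Python illustration of how raw p values become FDR adjusted p values, not the code used in this report; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    n = len(p_values)
    # Indices of the p values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])

    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        candidate = p_values[index] * n / rank
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted
```

Genes whose adjusted value falls below the chosen threshold (0.05 in the results of this report) are then called significant.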

2.3.1 Results

Running LIMMA on the Zheng dataset identifies precisely zero significantly differentially expressed genes after multiple hypothesis adjustment. This is surprising given the difference that is expected between patients with Parkinson’s and patients without. One possible reason lies in the experimental design of the Zheng dataset: samples were taken from only 500 DA neurons, which is a very small amount of material, so it is perhaps understandable that little is found in the analysis [2]. Another telling sign is that this dataset is only part of a meta study [19] and has no papers published using its results alone, suggesting that by itself it yields no significant findings. For these reasons the Zheng dataset was discarded from further analysis.

The other three datasets did yield significantly differentially expressed genes: 91 in the Middleton study, 180 in Mullen and 3360 in Moran, all with multiple hypothesis adjusted p values of less than 0.05. Clearly the Moran study found a much greater number of significant genes. This could be due to Moran having the largest number of replicates of all the studies and therefore more power to detect significant genes; in contrast, Middleton had the smallest number of Parkinson’s disease replicates and the fewest significant genes.

2.4 Gene Set Analysis

Using LIMMA, a list of significant genes has been found for each study. This tells us that there are differences between the controls and those with Parkinson’s disease; however, extracting biological meaning from these lists is difficult. In biology a gene does not act in isolation but in concert with many others. An improved approach is to examine differences in sets of genes (gene sets or gene pathways) that share a common function or purpose. These gene sets are largely taken from major databases such as Gene Ontology or KEGG. Finding significant gene pathways involves a set of statistical techniques known as Gene Set Analysis (GSA); two of these techniques will be described here.


2.4.1 GSEA

GSEA, or gene set enrichment analysis, is one of the standard methods for GSA. It was originally developed by Subramanian et al. [26] in 2005. Since then it has found wide use in the bioinformatics community, and it is certainly one of the most popular methods of GSA. The original method takes a ranked gene list, such as the ones generated by a LIMMA analysis, and calculates what is referred to as an ‘Enrichment Score’ for each gene set. This enrichment score is fully described in [26]; it gives a score based on whether the genes in a gene set lie towards the top or bottom of the ranked list. Significance is then inferred from the enrichment score by using sample permutations to derive a null distribution from which the p-values can be calculated.
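To give a feel for what an enrichment score measures, the Python sketch below computes a simplified, unweighted running-sum score. This is an assumption-laden illustration: the score described in [26] weights each step by the gene's correlation with the phenotype, which is omitted here, and the function name is hypothetical.

```python
def unweighted_enrichment_score(ranked_genes, gene_set):
    """Simplified running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers ordered from most up to most down regulated.
    gene_set:     identifiers belonging to the pathway of interest.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set)
    n_hit = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Step up on a member of the set, step down otherwise.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 means the set is concentrated at the top of the ranked list, and a score near -1 means it is concentrated at the bottom.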

Many varieties of GSEA can be found in the literature, as the calculation of the enrichment score has been seen as overcomplicated. Other approaches include using a two sample t test, as in [13] and [27]. In particular, [13] introduces the Jiang and Gentleman statistic, or J-G statistic:

$$\tau_k = \frac{\sum_{g \in S_k} t_g}{\sqrt{|S_k|}} \tag{3}$$

Here $t_g$ is the t statistic for a single gene $g$ and $|S_k|$ is the size of the gene pathway of interest. The J-G statistic is normalised by the square root of the pathway size, so that as $|S_k|$ approaches infinity the distribution of the J-G statistic approaches the standard normal. The significance of gene pathways is then inferred, as in the original GSEA paper, using sample permutations.
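The J-G statistic and its permutation-based significance can be sketched in a few lines of Python. The code below is illustrative only: it assumes a Welch t statistic per gene, a two-sided comparison, and a dictionary-based data layout, none of which are specified in this report, and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Two sample t statistic with unequal variances (assumed form)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def jg_statistic(t_stats, gene_set):
    """Sum of per-gene t statistics divided by sqrt of the set size."""
    members = [g for g in gene_set if g in t_stats]
    if not members:
        raise ValueError("gene set has no genes with a t statistic")
    return sum(t_stats[g] for g in members) / math.sqrt(len(members))


def permutation_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Observed J-G statistic and a two-sided sample-permutation p value.

    expression: dict mapping gene -> list of log2 values, one per sample.
    labels:     list of 0 (control) / 1 (case), one per sample.
    """
    def tau(current_labels):
        t_stats = {}
        for gene, values in expression.items():
            case = [v for v, lab in zip(values, current_labels) if lab == 1]
            ctrl = [v for v, lab in zip(values, current_labels) if lab == 0]
            t_stats[gene] = welch_t(case, ctrl)
        return jg_statistic(t_stats, gene_set)

    observed = tau(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute sample labels, keeping group sizes
        if abs(tau(shuffled)) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p value is never zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

Permuting the sample labels, rather than the genes, preserves the correlation between genes within a pathway, which is the reason sample permutations are preferred in this setting.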

This method was slightly adapted in [22], where instead of the J-G statistic, based on an aggregation of t statistics, a statistic based on the aggregation of gene level regression residuals was used. This method assumes that there is a linear relationship between the mean response variable, i.e. gene expression, and the explanatory covariates, such as the presence of Parkinson’s disease. If such a linear regression model holds, the regression residuals can be calculated and aggregated in a similar way to the J-G statistic:

$$R_{ki} = \frac{\sum_{g \in S_k} r_{gi}}{\sqrt{|S_k|}} \tag{4}$$

Significance is again inferred using sample permutations. This last procedure, calculating the significant pathways with regression residuals, is implemented in the R package GSEAlm [23].

2.4.2 GAGE

GAGE, or Generally Applicable Gene-set Enrichment for pathway analysis, is a method for gene set analysis developed by Luo et al. [18], in which the authors claim to improve on previous GSA methods such as GSEA and PAGE [16], another popular method for GSA. GAGE, like PAGE, determines the significance of gene sets with a parametric analysis, as opposed to the permutation of sample labels used to calculate significance in GSEA. Some claim that GSEA has low sensitivity, while the authors of GAGE claim that PAGE is overly sensitive.

The procedure of GAGE is outlined in [18] and will be given in brief here. As with all GSA methods, the aim is to rank gene sets and assign a significance to each. GAGE does this by comparing the mean fold change of the target gene set with that of the background by means of a two sample t test; PAGE, in comparison, uses a z-test. The two sample t test and its degrees of freedom are defined as follows:

$$t = \frac{m - M}{\sqrt{\dfrac{s^2}{n} + \dfrac{S^2}{n}}} \tag{5}$$

$$df = \frac{(n-1)(s^2 + S^2)^2}{s^4 + S^4} \tag{6}$$

Here $m$, $s$ and $n$ are the mean fold change, standard deviation and number of genes in the gene set respectively, while $M$ and $S$ are the mean fold change and standard deviation of all the genes. The t test essentially compares the gene set of interest with a gene set of identical size whose mean fold change and standard deviation are derived from the background.
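Equations (5) and (6) translate directly into code. The Python sketch below evaluates them from summary statistics, with a small wrapper that derives the summaries from raw fold changes; the function names are hypothetical, and the use of the sample (n - 1) variance in the wrapper is an assumption.

```python
import math
from statistics import mean, variance


def gage_t_from_summary(m, s, n, big_m, big_s):
    """t statistic (eq. 5) and degrees of freedom (eq. 6).

    m, s, n:      mean fold change, standard deviation and size of the gene set.
    big_m, big_s: mean fold change and standard deviation of all genes.
    """
    s2, big_s2 = s * s, big_s * big_s
    t = (m - big_m) / math.sqrt(s2 / n + big_s2 / n)
    df = (n - 1) * (s2 + big_s2) ** 2 / (s2 ** 2 + big_s2 ** 2)
    return t, df


def gage_t(set_fold_changes, all_fold_changes):
    """Same statistic computed from raw log fold changes."""
    return gage_t_from_summary(
        mean(set_fold_changes),
        math.sqrt(variance(set_fold_changes)),
        len(set_fold_changes),
        mean(all_fold_changes),
        math.sqrt(variance(all_fold_changes)),
    )
```

Note that both variance terms are divided by the gene set size n, which is what makes the comparison one against a background-derived set of identical size.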

P values can then be obtained from the t test. GAGE, however, combines the p values from the different replicates into a global p value. GAGE has two modes: it either compares experiment-control sample pairs one to one, if the samples are paired, or compares the experimental samples with the average gene expression levels, if they are unpaired. Since all studies used in this report were unpaired, only the latter case will be discussed.

With $k = 1, \ldots, K$ experimental samples and $l = 1, \ldots, L$ control samples, where $L \neq K$, the p values need to be combined in a way in which each p value is independent. If the null hypothesis is true, the p values from the two sample t test follow a Uniform(0,1) distribution. Additionally, it is known that the negative log sum of $K$ independent p-values follows a Gamma(K,1) distribution. A global p-value is therefore simple to calculate using the gamma distribution:

$$P(X > x), \qquad X \sim \mathrm{Gamma}(K, 1) \tag{7}$$

The only issue, therefore, is constructing K independent p-values for unpaired data such as that used in this report; this turns out to be fairly simple. For the first experimental sample, $P_1$ is calculated as the average of the one-on-one comparisons of that experimental sample with all of the control samples. In this way K independent p-values are constructed, whose negative log sum follows the gamma distribution:

$$x = -\frac{1}{L} \sum_{k,l} \log P_{kl} \tag{8}$$
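The combination of p-values in equations (7) and (8) can be written out as a short Python sketch. It follows equation (8) literally, averaging the log p-values over the L controls, and uses the closed form of the Gamma(K,1) upper tail for integer K; the function names and the nested-list input are assumptions for illustration, and all p-values are assumed to be strictly positive.

```python
import math


def gamma_upper_tail(x, k):
    """P(X > x) for X ~ Gamma(shape=k, rate=1) with integer k >= 1.

    Uses the closed form exp(-x) * sum_{i=0}^{k-1} x^i / i!.
    """
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= x / i
        total += term
    return math.exp(-x) * total


def global_p_value(p_matrix):
    """Combine one-on-one p-values into a global p-value.

    p_matrix[k][l] is the p-value from comparing experimental sample k
    with control sample l (K rows, L columns).
    Returns (x, global p) following equations (8) and (7).
    """
    n_experimental = len(p_matrix)
    n_control = len(p_matrix[0])
    x = -sum(math.log(p) for row in p_matrix for p in row) / n_control
    return x, gamma_upper_tail(x, n_experimental)
```

As a sanity check, with a single experimental sample and a single control the global p-value reduces to the original p-value.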

Running GAGE with its R package is extremely simple; it automatically ranks the gene sets and corrects the p-values for multiple testing. The results gained from the GAGE and GSEA analyses are discussed below.

2.4.3 Results

Both GSEAlm and GAGE, as implemented in R, were used to analyse the data. Of these, only the results for GAGE are presented in this report, owing to problems with the GSEAlm analysis. GSEAlm predicted that there were no significant pathways with p values less than 0.05 for the Middleton study. Since the LIMMA analysis showed earlier that there were significantly differentially expressed genes between the Parkinson’s cases and the controls in the Middleton study, this seems surprising and biologically unrealistic. It could be explained by the low sensitivity of GSEA which has been suggested in the literature [4]. Additionally, there seems to be a problem with this version of GSEA, GSEAlm: the output has many different pathways with exactly the same p value, and from online resources [10] this seems to be a common feature of the program. Because of this, the GSEAlm output fails to give a definitive ranking of the significance of the gene pathways and fails to identify pathways which are biologically relevant, and so it was judged unsuitable for use in this report.

The results from the GAGE analysis are given in Tables 1, 2 and 3. Table 1 shows the consensus pathways across all three studies that have been significantly regulated up or down. Table 2 shows the significant pathways relevant to DNA damage and stress in each individual study, and Table 3 shows the significantly differentially expressed genes in these pathways. The full implications of these results will be discussed in the conclusion.

3 Conclusion

Evidence from the bioinformatic results in this report suggests that the hypothesis given in the introduction is correct. The clearest demonstration of this is in Tables 1 and 2. Table 1 shows many mitochondrial pathways downregulated, as expected, but also that DNA damage and stress related pathways are altered. Table 2 gives the significant pathways related to DNA damage and stress in each of the datasets analysed, showing just how many DNA damage and stress related pathways were found to be involved. Table


1(a)

GO Term      Description
GO:0016564   transcription repressor activity
GO:0004861   cyclin-dependent protein kinase inhibitor activity
GO:0007050   cell cycle arrest
GO:0005540   hyaluronic acid binding
GO:0006954   inflammatory response
GO:0007507   heart development
GO:0006968   cellular defense response
GO:0042326   negative regulation of phosphorylation
GO:0030511   positive regulation of transforming growth factor beta receptor signaling pathway

1(b)

GO Term      Description
GO:0006887   exocytosis
GO:0007268   synaptic transmission
GO:0003924   GTPase activity
GO:0005743   mitochondrial inner membrane
GO:0016192   vesicle-mediated transport
GO:0051437   positive regulation of ubiquitin-protein ligase activity during mitotic cell cycle
GO:0030426   growth cone
GO:0031145   anaphase-promoting complex-dependent proteasomal ubiquitin-dependent protein catabolic process
GO:0042416   dopamine biosynthetic process
GO:0007626   locomotory behavior
GO:0001975   response to amphetamine
GO:0051436   negative regulation of ubiquitin-protein ligase activity during mitotic cell cycle
GO:0006836   neurotransmitter transport
GO:0048169   regulation of long-term neuronal synaptic plasticity
GO:0051281   positive regulation of release of sequestered calcium ion into cytosol
GO:0005759   mitochondrial matrix
GO:0006886   intracellular protein transport
GO:0006108   malate metabolic process
GO:0043524   negative regulation of neuron apoptosis
GO:0008344   adult locomotory behavior
GO:0000502   proteasome complex
GO:0043274   phospholipase binding
GO:0008198   ferrous iron binding
GO:0005777   peroxisome
GO:0030424   axon
GO:0051258   protein polymerization
GO:0048854   brain morphogenesis
GO:0030666   endocytic vesicle membrane
GO:0006099   tricarboxylic acid cycle
GO:0007264   small GTPase mediated signal transduction
GO:0070469   respiratory chain
GO:0015992   proton transport
GO:0030672   synaptic vesicle membrane
GO:0006120   mitochondrial electron transport, NADH to ubiquinone
GO:0006626   protein targeting to mitochondrion
GO:0006096   glycolysis
GO:0009636   response to toxin
GO:0005978   glycogen biosynthetic process
GO:0016829   lyase activity
GO:0019717   synaptosome
GO:0016820   hydrolase activity, acting on acid anhydrides, catalyzing transmembrane movement of substances
GO:0005747   mitochondrial respiratory chain complex I
GO:0008137   NADH dehydrogenase (ubiquinone) activity
GO:0030170   pyridoxal phosphate binding
GO:0022900   electron transport chain
GO:0046933   hydrogen ion transporting ATP synthase activity, rotational mechanism
GO:0006091   generation of precursor metabolites and energy
GO:0000226   microtubule cytoskeleton organization
GO:0045263   proton-transporting ATP synthase complex, coupling factor F(o)
GO:0005838   proteasome regulatory particle
GO:0051289   protein homotetramerization
GO:0006800   oxygen and reactive oxygen species metabolic process
GO:0017157   regulation of exocytosis
GO:0007269   neurotransmitter secretion
GO:0019003   GDP binding
GO:0042776   mitochondrial ATP synthesis coupled proton transport
GO:0017075   syntaxin-1 binding
GO:0007612   learning
GO:0005504   fatty acid binding
GO:0046961   proton-transporting ATPase activity, rotational mechanism
GO:0015078   hydrogen ion transmembrane transporter activity
GO:0006413   translational initiation
GO:0048488   synaptic vesicle endocytosis
GO:0051246   regulation of protein metabolic process
GO:0044262   cellular carbohydrate metabolic process
GO:0005883   neurofilament
GO:0030234   enzyme regulator activity
GO:0009055   electron carrier activity
GO:0019787   small conjugating protein ligase activity
GO:0051287   NAD or NADH binding
GO:0051536   iron-sulfur cluster binding
GO:0004298   threonine-type endopeptidase activity

Table 1: 1(a) and (b) show all the significant gene pathways found from the Gene Ontology database using GAGE. Table 1(a) shows all the pathways which were significantly up regulated, while Table 1(b) shows all the pathways that were significantly down regulated. The results have been annotated, with the help of Dr Gyorgy Szabadkai, into pathways which are related to DNA damage and stress, and pathways related to mitochondrial functions. Pathways in blue represent those associated with DNA damage and stress, while those in red are the mitochondrial (PGC-1 dependent) pathways. As can be seen, the up regulated pathways are strongly related to DNA damage and stress, while the down regulated pathways contain many associated with mitochondria which are PGC-1 dependent. These results confirm the conclusions in [19], where PGC-1α was shown to be down regulated, leading to defects in mitochondrial function. Unlike in [19], a large meta study, PGC-1α was not shown to be statistically significant in the LIMMA analysis, but PGC-1α related pathways clearly are significant here.


GO Term Description
GO:0016564 transcription repressor activity
GO:0000122 negative regulation of transcription from RNA polymerase II promoter
GO:0007050 cell cycle arrest
GO:0000080 G1 phase of mitotic cell cycle
GO:0016563 transcription activator activity
GO:0004861 cyclin-dependent protein kinase inhibitor activity
GO:0032582 negative regulation of gene-specific transcription
GO:0006968 cellular defense response

(a) Significantly up regulated GO pathways related to stress/DNA damage in the Middleton study

GO Term Description
GO:0051437 positive regulation of ubiquitin-protein ligase activity during mitotic cell cycle

(b) Significantly down regulated GO pathways related to stress/DNA damage in the Middleton study

GO Term Description
GO:0000122 negative regulation of transcription from RNA polymerase II promoter
GO:0016564 transcription repressor activity
GO:0004861 cyclin-dependent protein kinase inhibitor activity
GO:0043065 positive regulation of apoptosis
GO:0030530 heterogeneous nuclear ribonucleoprotein complex
GO:0007050 cell cycle arrest
GO:0000123 histone acetyltransferase complex
GO:0045941 positive regulation of transcription
GO:0032582 negative regulation of gene-specific transcription
GO:0003705 RNA polymerase II transcription factor activity, enhancer binding
GO:0000118 histone deacetylase complex
GO:0000060 protein import into nucleus, translocation
GO:0008285 negative regulation of cell proliferation
GO:0016563 transcription activator activity
GO:0003676 nucleic acid binding
GO:0006968 cellular defense response
GO:0043966 histone H3 acetylation
GO:0042771 DNA damage response, signal transduction by p53 class mediator resulting in induction of apoptosis
GO:0008656 caspase activator activity
GO:0006357 regulation of transcription from RNA polymerase II promoter
GO:0005667 transcription factor complex
GO:0043066 negative regulation of apoptosis
GO:0003727 single-stranded RNA binding
GO:0003714 transcription corepressor activity
GO:0006950 response to stress
GO:0008630 DNA damage response, signal transduction resulting in induction of apoptosis
GO:0045893 positive regulation of transcription, DNA-dependent
GO:0006309 DNA fragmentation involved in apoptosis
GO:0006978 DNA damage response, signal transduction by p53 class mediator resulting in transcription of p21 class mediator
GO:0003950 NAD+ ADP-ribosyltransferase activity
GO:0003690 double-stranded DNA binding
GO:0008284 positive regulation of cell proliferation
GO:0048384 retinoic acid receptor signaling pathway
GO:0006281 DNA repair

(c) Significantly up regulated GO pathways related to stress/DNA damage in the Moran study

GO Term Description
GO:0016564 transcription repressor activity
GO:0045892 negative regulation of transcription, DNA-dependent
GO:0032583 regulation of gene-specific transcription
GO:0003704 specific RNA polymerase II transcription factor activity
GO:0004861 cyclin-dependent protein kinase inhibitor activity
GO:0010553 negative regulation of gene-specific transcription from RNA polymerase II promoter
GO:0043433 negative regulation of transcription factor activity
GO:0016566 specific transcriptional repressor activity
GO:0005667 transcription factor complex
GO:0035257 nuclear hormone receptor binding
GO:0043984 histone H4-K16 acetylation
GO:0005694 chromosome
GO:0043966 histone H3 acetylation
GO:0042800 histone methyltransferase activity (H3-K4 specific)
GO:0030530 heterogeneous nuclear ribonucleoprotein complex
GO:0006950 response to stress
GO:0007050 cell cycle arrest
GO:0000118 histone deacetylase complex
GO:0003714 transcription corepressor activity
GO:0000084 S phase of mitotic cell cycle
GO:0008656 caspase activator activity
GO:0016581 NuRD complex
GO:0006260 DNA replication
GO:0045941 positive regulation of transcription
GO:0006338 chromatin remodeling
GO:0016605 PML body
GO:0010552 positive regulation of gene-specific transcription from RNA polymerase II promoter
GO:0012501 programmed cell death
GO:0006281 DNA repair
GO:0044428 nuclear part
GO:0045767 regulation of anti-apoptosis

(d) Significantly up regulated GO pathways related to stress/DNA damage in the Mullen study

Table 2: Gene Pathways related to stress/DNA damage that have significantly been up or down regulated with p values < 0.05.


Gene Name logFC Adjusted P value
GAS1 1.094643 2.055961E-02
BANF1 0.541628 2.219783E-02
DNAJB6 1.241308 2.219783E-02
MYST3 0.484861 2.349617E-02
HSPA1L 0.766496 2.784302E-02
PHF21A 0.503459 2.784302E-02
INSR 0.759715 2.934864E-02
HNRNPH3 0.552047 2.934864E-02
CXXC1 0.430215 2.934864E-02
CUL2 -0.672837 2.943436E-02
TRIM28 0.372411 3.553511E-02
KAT2A 0.567463 3.926393E-02
HBP1 0.491259 4.561574E-02
HNRNPH3 0.548458 4.884567E-02
PHF15 0.873731 4.901984E-02

(a) Genes in the Mullen study with P values < 0.05 in gene pathways related to DNA damage and stress

Gene Name Moran Middleton Mullen
DNAJB6 ✓ ✗ ✓
HSPA1L ✓ ✗ ✓
PHF21A ✓ ✗ ✓
INSR ✓ ✗ ✓
HNRNPH3 ✓ ✗ ✓
CXXC1 ✓ ✗ ✓
CUL2 ✓ ✗ ✓
KAT2A ✓ ✗ ✓
HBP1 ✓ ✗ ✓
PHF15 ✓ ✗ ✓
MBD3 ✓ ✓ ✗
IKBKB ✓ ✓ ✗
MAP3K11 ✗ ✓ ✗
TCIRG1 ✗ ✓ ✗
FOXO1 ✗ ✓ ✗
GAS1 ✗ ✗ ✓
BANF1 ✗ ✗ ✓
MYST3 ✗ ✗ ✓
TRIM28 ✗ ✗ ✓

(b) Genes which are significant (P value < 0.05) in multiple studies. Most significant genes in the Moran study omitted here and fully given in Appendix A

Gene Name logFC Adjusted P value
MBD3 0.588015 2.692973E-02
MAP3K11 0.656159 4.069257E-02
IKBKB 0.211414 4.160097E-02
TCIRG1 0.509469 4.639875E-02
FOXO1 0.528559 4.903848E-02

(c) Genes in the Middleton study with P values < 0.05 in gene pathways related to DNA damage and stress

Table 3: Tables showing significant genes in pathways related to DNA damage and stress. (a) and (c) show significant genes in the Mullen and Middleton studies, while (b) shows which genes are significant in multiple studies. 381 genes related to DNA damage and stress pathways were significant in the Moran study and are fully listed in Appendix A.
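The cross-study comparison in Table 3(b) is essentially a set-membership check. A minimal sketch in base R, using only the gene names printed in Tables 3(a) and 3(c); the variable names are illustrative, and the Moran column is left out because its 381 genes are only listed in Appendix A:

```r
# Gene symbols copied from Table 3(a) (Mullen) and Table 3(c) (Middleton).
# HNRNPH3 appears twice in Table 3(a) and is listed once here.
mullen <- c("GAS1", "BANF1", "DNAJB6", "MYST3", "HSPA1L", "PHF21A", "INSR",
            "HNRNPH3", "CXXC1", "CUL2", "TRIM28", "KAT2A", "HBP1", "PHF15")
middleton <- c("MBD3", "MAP3K11", "IKBKB", "TCIRG1", "FOXO1")

# One row per gene, one logical column per study.
genes <- union(mullen, middleton)
membership <- data.frame(
  Gene      = genes,
  Middleton = genes %in% middleton,
  Mullen    = genes %in% mullen
)

# Genes significant in both of these two studies.
shared <- intersect(mullen, middleton)

print(membership)
print(shared)
```

Adding a third logical column for the Moran genes would reproduce the full layout of Table 3(b).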


3 then shows significant genes involved in these DNA damage related pathways, and which genes were significant in more than one of the datasets. These significant genes could be useful in finding a new therapeutic target for Parkinson’s disease.

A larger study, or a meta study with more microarray data, would give a clearer picture of the genes and pathways that have been up or down regulated than the fairly noisy one presented in this study. However, despite the relatively small sample sizes and accompanying noise, the overall trend of down regulated mitochondrial pathways and altered DNA damage and stress pathways is clear. More research at the interface of these two areas, and into the role of PGC-1 in Parkinson’s disease, would improve our understanding and could help develop new approaches for the treatment of Parkinson’s disease.


References

[1] T.E. Bartlett. Bioinformatic analysis of the interface between mitochondrial biogenesis and apoptotic cell death signalling pathways in cancer. MRes Summer Project, 2011.

[2] Kevin Bryson. Private communication, 2012.

[3] M.J. Devine, H. Plun-Favreau, and N.W. Wood. Parkinson’s disease and cancer: two wars, one front. Nature Reviews Cancer, 11(11):812–823, 2011.

[4] I. Dinu, J. Potter, T. Mueller, Q. Liu, A. Adewale, G. Jhangri, G. Einecke, K. Famulski, P. Halloran, and Y. Yasui. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics, 8(1):242, 2007.

[5] M.R. Duchen and G. Szabadkai. Roles of mitochondria in human disease. Essays Biochem, 47:115–137, 2010.

[6] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE7621, 2007. Accessed: 28/02/2012.

[7] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE8397, 2008. Accessed: 28/02/2012.

[8] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE20292, 2010. Accessed: 28/02/2012.

[9] GEO. http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE24378, 2011. Accessed: 28/02/2012.

[10] Daniel Gusenleitne. Gene set enrichment analysis (gsealm) tutorial. http://bcb.dfci.harvard.edu/~aedin/courses/cccb-introduction-to-r-and-bioconductor-may-2011/tutorial.pdf. Accessed: 28/02/2012.

[11] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, and T.P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249, 2003.

[12] D.K. Jeppesen, V.A. Bohr, and T. Stevnsner. DNA repair deficiency in neurodegeneration. Progress in Neurobiology, 2011.

[13] Z. Jiang and R. Gentleman. Extensions to gene set enrichment. Bioinformatics, 23(3):306, 2007.

[14] A.W.E. Jones, Z. Yao, J.M. Vicencio, A. Karkucinska-Wieckowska, and G. Szabadkai. PGC-1 family coactivators and cell fate: Roles in cancer, neurodegeneration, cardiovascular disease and retrograde mitochondria-nucleus signalling. Mitochondrion, 2011.

[15] Audrey Kauffmann, Robert Gentleman, and Wolfgang Huber. arrayQualityMetrics–a Bioconductor package for quality assessment of microarray data. Bioinformatics, 25(3):415–6, 2009.

[16] S.Y. Kim and D. Volsky. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics, 6(1):144, 2005.

[17] T.G. Lesnick, S. Papapetropoulos, D.C. Mash, J. Ffrench-Mullen, L. Shehadeh, M. De Andrade, J.R. Henley, W.A. Rocca, J.E. Ahlskog, and D.M. Maraganore. A genomic pathway approach to a complex disease: axon guidance and Parkinson disease. PLoS Genetics, 3(6):98, 2007.


[18] W. Luo, M. Friedman, K. Shedden, K. Hankenson, and P. Woolf. GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics, 10(1):161, 2009.

[19] J.K. McGill and M.F. Beal. PGC-1α, a new therapeutic target in Huntington’s disease? Cell, 127(3):465–468, 2006.

[20] M. Miron and R. Nadon. Inferential literacy for experimental high-throughput biology. Trends in Genetics, 22(2):84–89, 2006.

[21] L.B. Moran, D.C. Duke, M. Deprez, D.T. Dexter, R.K.B. Pearce, and M.B. Graeber. Whole genome expression profiling of the medial and lateral substantia nigra in Parkinson’s disease. Neurogenetics, 7(1):1–11, 2006.

[22] A.P. Oron, Z. Jiang, and R. Gentleman. Gene set enrichment analysis using linear models and diagnostics. Bioinformatics, 24(22):2586–2591, 2008.

[23] Assaf Oron, Robert Gentleman (with contributions from S. Falcon, and Z. Jiang). GSEAlm: Linear Model Toolset for Gene Set Enrichment Analysis. R package version 1.8.0.

[24] G. Smyth. Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420, 2005.

[25] Gordon K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420. Springer, New York, 2005.

[26] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545, 2005.

[27] L. Tian, S.A. Greenberg, S.W. Kong, J. Altschuler, I.S. Kohane, and P.J. Park. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America, 102(38):13544, 2005.

[28] Y. Zhang, M. James, F.A. Middleton, and R.L. Davis. Transcriptional analysis of multiple brain regions in Parkinson’s disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 137(1):5–16, 2005.


Appendices

A Tables

Gene name logFC Adjusted P value Gene name logFC Adjusted P value Gene name logFC Adjusted P value
CUX2 -0.957411 3.110158E-08 FGFR1 0.220495 3.832065E-03 MYST1 0.147059 1.765914E-02
NR4A2 -1.206677 2.511338E-07 ORC2L -0.238690 3.850204E-03 KRT18 -0.293551 1.768079E-02
YTHDC2 -0.956063 5.114295E-07 SIRT2 0.337159 3.852516E-03 ERI3 -0.357999 1.789257E-02
PSEN2 -0.597398 6.201595E-07 DRD2 -0.323548 3.898446E-03 PFDN5 0.257103 1.796585E-02
RBM9 -0.691800 9.260810E-07 MAFF 0.312910 3.917795E-03 DLG3 -0.201248 1.810632E-02
ICMT -0.454469 1.077351E-06 VEZF1 0.327521 4.006702E-03 APEX1 -0.229858 1.863233E-02
DRD2 -0.833074 1.736227E-06 C1D -0.376084 4.010220E-03 WARS -0.465903 1.888945E-02
SUB1 -1.168283 2.409076E-06 SMARCA4 -0.421016 4.029591E-03 MAB21L1 -0.242650 1.902206E-02
NR4A2 -1.305759 2.505296E-06 NDRG4 -0.964717 4.030569E-03 CUL3 -0.280591 1.915695E-02
PBX1 -1.030599 3.204438E-06 KDM5A 0.359093 4.083478E-03 CDKN1C 0.580427 1.927160E-02
ATP8A2 -0.757973 9.099608E-06 TPD52L1 0.553849 4.128617E-03 DUSP10 0.227539 1.970679E-02
MED24 -0.337476 1.521774E-05 RPS9 0.213239 4.145544E-03 RPS9 0.331523 1.970843E-02
FABP7 -0.948737 1.784250E-05 DEAF1 -0.456666 4.187411E-03 PPM1D 0.292983 1.991855E-02
PAN2 0.587286 1.862470E-05 TFEB 0.377877 4.333228E-03 SMARCC1 0.311631 1.996298E-02
BASP1 -1.101985 2.422178E-05 ZC3H7B 0.323857 4.454371E-03 ATRX -0.252841 2.003025E-02
RNF14 -0.762596 2.657807E-05 NAB1 -0.390073 4.481002E-03 CAPN10 0.171640 2.032344E-02
MXI1 0.426898 2.777041E-05 RPS4X 0.301142 4.481002E-03 IKBKB 0.233266 2.032344E-02
OBFC1 -0.345506 3.074203E-05 INSR 0.571061 4.494150E-03 HSPA1L 0.337327 2.060335E-02
NR4A2 -0.858372 3.837342E-05 MED7 -0.229032 4.634760E-03 RNF14 -0.613698 2.060335E-02
DRD2 -0.528345 4.053527E-05 RASSF1 0.217266 4.686959E-03 FOXO4 0.261129 2.123364E-02
ZBTB16 0.804387 4.154816E-05 SAP30 0.370421 4.750235E-03 CHD3 -0.231212 2.129811E-02
CASC3 0.459059 4.227519E-05 CIAO1 -0.233149 4.828717E-03 CIZ1 0.198300 2.132111E-02
HLTF -0.610986 4.407303E-05 CDKN1C 0.630543 4.884169E-03 RAF1 0.297875 2.213752E-02
RNF10 -0.363115 4.492968E-05 LRCH4 0.240531 4.909274E-03 MBD3 0.232923 2.216529E-02
HBP1 0.430443 6.173774E-05 PHF17 0.288786 5.008478E-03 CDKN2C 0.191727 2.217483E-02
TOB2 0.625310 7.242554E-05 RYBP 0.437083 5.008588E-03 DNAJA2 -0.565387 2.217483E-02
ODZ1 -0.774075 8.018387E-05 AGGF1 -0.357470 5.177286E-03 PTBP1 0.312594 2.278352E-02
FOXA1 -0.493710 1.082717E-04 HTATIP2 0.260162 5.269930E-03 WARS -0.405446 2.320651E-02
SIN3B -0.362442 1.132238E-04 YWHAB -0.278063 5.304984E-03 TFDP2 -0.173172 2.323106E-02
SORT1 0.567520 1.433114E-04 RAN -0.494002 5.435535E-03 ZCCHC14 0.311412 2.339757E-02
FABP7 -0.856448 1.466798E-04 ABCA2 0.597212 5.484714E-03 CREBBP 0.273296 2.360719E-02
RBM9 -0.790446 1.526903E-04 RBM9 -0.576966 5.580142E-03 ZNF274 0.205700 2.373376E-02
DNAJB6 1.030789 1.622180E-04 YWHAB -0.526150 5.645709E-03 PRKAR1A -0.349559 2.384159E-02
AZGP1 1.429307 1.780275E-04 SMARCA4 -0.309234 5.717543E-03 MLH1 -0.151832 2.420398E-02
R3HDM1 -0.407426 1.873502E-04 RBMS1 -0.331339 5.727718E-03 CUL2 -0.244754 2.434984E-02
PKNOX2 -0.323459 1.917057E-04 TARDBP 0.366918 5.785979E-03 DLG3 -0.186573 2.493001E-02
DNAJB2 0.566402 1.917057E-04 MAPK1 -0.465367 5.900381E-03 RTEL1 0.097418 2.494372E-02
MAPK9 -0.697986 1.950452E-04 C11orf9 0.530671 5.954850E-03 FOXA2 -0.248195 2.496548E-02
RXRA 0.486204 2.353479E-04 TPD52L1 0.708256 6.003614E-03 NBN -0.389994 2.509963E-02
SCFD1 -0.435021 2.383114E-04 ZC3H11A 0.286946 6.123835E-03 FOXL2 0.129132 2.518111E-02
PSEN2 -0.360370 2.443693E-04 SMARCC1 0.298454 6.163572E-03 DBP -0.215670 2.519681E-02
TRAK1 -0.334989 2.905580E-04 KDM4B 0.215654 6.163572E-03 ARNTL 0.187634 2.555248E-02
LRCH4 0.280912 3.346337E-04 PHB 0.351591 6.179972E-03 CLEC11A 0.211593 2.560748E-02
ATR -0.487612 3.365620E-04 FTH1 0.333135 6.211837E-03 CDKN1C 0.592224 2.578780E-02
PBX1 -0.977497 3.587020E-04 FOXO3 0.414799 6.499113E-03 ENPP2 0.411463 2.582428E-02
SCG2 -1.678182 3.704575E-04 CAND1 -0.421703 6.700786E-03 ASH2L -0.208881 2.603854E-02
SIN3B -0.458179 3.775224E-04 RBMS1 -0.379899 6.708206E-03 EIF1 0.184540 2.616135E-02
TXNIP 0.763070 3.835557E-04 EIF1 0.227191 6.810593E-03 OLIG2 0.361263 2.620637E-02
TCF12 0.521666 3.865018E-04 TIPARP 0.423771 6.953670E-03 CTBP2 0.186639 2.696214E-02
TCF25 -0.371687 4.053739E-04 ADARB2 0.367469 7.017789E-03 TBL1X 0.182123 2.738215E-02
MYT1L -1.093676 4.053739E-04 ING1 0.186267 7.034542E-03 PEX14 -0.219804 2.757462E-02
DRD2 -0.351110 4.269094E-04 AHSA1 0.583125 7.146545E-03 CUL4B -0.190014 2.793711E-02
PPP2R5C -0.556722 4.727540E-04 PTPRU -0.249405 7.167527E-03 DDX23 0.216878 2.814519E-02
MTMR15 0.426398 4.882360E-04 CNOT8 0.349014 7.291980E-03 GAS7 0.281240 2.832397E-02
CUL2 -0.204652 5.021639E-04 SOX10 0.519145 7.291980E-03 EXOG -0.243930 2.834072E-02
PIAS2 -0.439832 5.083587E-04 KPNB1 -0.304317 7.508054E-03 NFE2 0.171944 2.899066E-02
SMARCA4 -0.523213 5.326174E-04 HUS1 -0.151773 7.508054E-03 ZNF143 0.249454 2.919955E-02
HTR2A -0.603349 5.429389E-04 TRAK1 -0.360190 7.714198E-03 SERTAD2 0.285793 2.950488E-02
HNRNPA0 -0.394298 5.512777E-04 C16orf5 0.480987 7.784919E-03 CAPNS1 -0.265057 2.958784E-02
NFKBIA 0.848332 5.735661E-04 SMARCD3 -0.376585 7.784919E-03 HRAS -0.315547 3.000099E-02
AZGP1 0.876736 5.963782E-04 HR -0.251331 7.966696E-03 ZNF862 0.168448 3.027087E-02
USP21 0.384855 6.023984E-04 FADS1 -0.339560 7.966696E-03 PRKRIR -0.197368 3.055533E-02
ATF4 0.493282 6.129153E-04 SERP1 0.493409 8.013586E-03 PINK1 -0.331692 3.103766E-02
UCHL1 -1.130250 6.129153E-04 MTDH -0.430631 8.035706E-03 CAT 0.316974 3.121511E-02
ATR -0.197517 6.612463E-04 CDC7 -0.229709 8.044592E-03 SQSTM1 0.298721 3.255658E-02
PHF21A 0.360997 6.661077E-04 BCL6 0.684969 8.434040E-03 ELF2 0.232207 3.255658E-02
SGK1 0.808574 7.873610E-04 CTBP2 0.301135 8.446550E-03 CALCOCO1 0.235879 3.285726E-02
P2RX7 0.851832 7.873610E-04 TCEB1 -0.200684 8.638273E-03 NCOA6 0.167214 3.374271E-02
FOXA2 -0.297080 7.874555E-04 CA2 0.830987 8.686105E-03 HTATIP2 0.370030 3.377235E-02
SIAH2 0.263142 7.961976E-04 YBX1 0.380355 8.805353E-03 PARP4 0.253744 3.379494E-02
CXXC1 0.304683 8.022894E-04 EIF1 0.202097 9.381447E-03 PRKCZ -0.380026 3.393437E-02
TOB2 0.803282 8.116605E-04 RPS4X 0.257787 9.594234E-03 SP1 0.133998 3.402302E-02
MXD4 0.330607 8.430294E-04 BECN1 -0.358866 9.722317E-03 IGF1R 0.663056 3.447343E-02
PRKDC -0.419466 8.506714E-04 MKRN1 -0.419459 9.889990E-03 SERTAD2 0.319964 3.451420E-02
MUS81 0.231609 8.752821E-04 CTCF 0.281895 9.943180E-03 ADRA2A -0.233688 3.482569E-02
PHF15 0.480905 9.156354E-04 HNRNPD -0.251050 9.943782E-03 SF3A3 -0.182652 3.499074E-02
KIFAP3 -0.812213 1.086406E-03 NME1 -0.563889 9.944915E-03 PTPRF -0.222122 3.531795E-02
RAD17 -0.225555 1.125597E-03 SIRT4 0.223675 1.003110E-02 PIAS2 -0.083276 3.626421E-02
KRAS -0.511846 1.139155E-03 CCK -0.777129 1.067137E-02 SSR1 -0.285912 3.632779E-02
MKNK2 1.029514 1.139155E-03 TMEM161A 0.169364 1.071402E-02 HTR2A -0.210176 3.670744E-02
PNKP -0.365547 1.233909E-03 CAT 0.447409 1.085852E-02 SMARCD2 0.190971 3.749642E-02
GADD45G 0.441461 1.254979E-03 SAP30 0.387418 1.102105E-02 USP22 -0.230039 3.773870E-02
CAND2 -0.326311 1.278468E-03 ZC3H15 -0.401399 1.114595E-02 LUC7L3 0.364510 3.776656E-02
NDN -0.469761 1.292416E-03 TXNIP 0.783723 1.122945E-02 SMARCA4 -0.351617 3.785179E-02
ZCCHC24 0.446347 1.357363E-03 APC -0.443112 1.123172E-02 LITAF 0.273056 3.849488E-02
TRMT11 -0.427785 1.388872E-03 BRPF1 0.233774 1.150433E-02 RPS4X 0.169256 3.864644E-02
PTN -0.449501 1.388872E-03 FTSJ2 -0.196468 1.150433E-02 IL6ST 0.300552 3.947735E-02


GAS7 0.422373 1.401236E-03 RPS6 0.245071 1.218943E-02 FTSJ1 -0.269207 3.952213E-02
CXXC1 0.223148 1.463128E-03 NFKB1 0.235819 1.218943E-02 NOTCH1 0.235740 3.976578E-02
KDM1A -0.226117 1.477855E-03 HNRNPH3 0.286614 1.247986E-02 NFX1 0.136595 3.977656E-02
BTG1 0.659258 1.477855E-03 REXO4 0.230182 1.258635E-02 APBB2 0.185202 4.005845E-02
TXNIP 0.769755 1.512227E-03 PRPF19 -0.214907 1.265220E-02 DBC1 -0.355175 4.005845E-02
PBX1 -0.403184 1.539236E-03 PTBP1 0.456295 1.279337E-02 PIAS4 0.265963 4.073692E-02
NOP2 0.304194 1.574186E-03 HNRNPH3 0.415568 1.296613E-02 TRAP1 -0.162923 4.076317E-02
YWHAB -0.496887 1.577064E-03 HNRNPH3 0.401294 1.296613E-02 STK3 0.256124 4.077380E-02
TCEA2 -0.538326 1.586035E-03 BCL2 0.419075 1.307179E-02 RB1CC1 -0.238072 4.137764E-02
PHLDA3 0.233626 1.586035E-03 CTBP2 0.420036 1.307265E-02 KDM4B 0.141568 4.174237E-02
ZNF24 0.349494 1.598659E-03 DFFB 0.228604 1.325943E-02 PMS2L3 -0.191117 4.193934E-02
MXD4 0.287742 1.619508E-03 SUB1 -0.344900 1.340769E-02 HINFP 0.219310 4.205649E-02
STS -0.526712 1.923440E-03 PTBP1 0.408555 1.360090E-02 PTBP1 0.291009 4.218885E-02
HNRNPF 0.435092 1.940266E-03 ZHX2 0.249500 1.393329E-02 KHDRBS1 -0.275841 4.314293E-02
TASP1 -0.205034 2.116698E-03 PTN -0.406477 1.425780E-02 BARD1 0.226402 4.330154E-02
SESN1 0.317038 2.167532E-03 HDAC3 -0.152992 1.427661E-02 MXD4 0.384053 4.344766E-02
TGIF1 0.413216 2.291665E-03 RPS6 0.179358 1.459720E-02 KPNB1 -0.194875 4.373219E-02
ATP8A2 -0.947873 2.312591E-03 TBP -0.133868 1.462089E-02 CAND1 -0.319295 4.409346E-02
CHD3 -0.319365 2.413011E-03 NARS -0.231926 1.463586E-02 SETMAR 0.202307 4.425921E-02
HDAC1 0.443182 2.416774E-03 NFE2L2 0.447991 1.470688E-02 ATF3 0.289605 4.451829E-02
EIF1 0.305392 2.455681E-03 MAFF 1.314520 1.482125E-02 CDC123 -0.195283 4.469899E-02
HNRNPH3 0.443930 2.456575E-03 CHD1L 0.256668 1.512965E-02 TAF9 -0.341545 4.469899E-02
SIRT3 -0.315936 2.462529E-03 ASNS -0.651617 1.520338E-02 MAPK9 -0.295685 4.526086E-02
SMARCA4 -0.366475 2.571390E-03 DDX39 0.380958 1.523832E-02 ST18 0.501593 4.539651E-02
CAND1 -0.295231 2.644765E-03 NARS2 -0.262346 1.547714E-02 NPAT -0.197968 4.600492E-02
HTRA2 -0.269759 2.680301E-03 CIZ1 0.215167 1.547714E-02 RELA 0.228300 4.623955E-02
TGFB3 0.443781 2.806006E-03 SMARCC1 0.260351 1.553457E-02 CDKN1C 0.583804 4.636397E-02
SATB1 -0.472312 2.811070E-03 LRCH4 0.224864 1.555004E-02 BRD1 0.138380 4.648539E-02
ARID5B 0.381866 2.857923E-03 TMEM204 0.309821 1.561598E-02 HDAC4 0.204620 4.663850E-02
RING1 0.174505 2.864779E-03 L3MBTL 0.210720 1.569126E-02 CCND1 -0.441077 4.689704E-02
KAT2A 0.293493 2.902642E-03 EDNRB -0.924036 1.570343E-02 CIZ1 0.169355 4.735165E-02
CDKN2C 0.433614 3.136708E-03 MEIS2 0.374867 1.593708E-02 BRD7 0.260610 4.755995E-02
BNIP3L 0.279732 3.194006E-03 SRCAP 0.143141 1.630483E-02 PHF16 0.205263 4.785161E-02
MCTS1 -0.452188 3.226521E-03 NCOR1 0.215121 1.654489E-02 KCNMA1 0.267207 4.785472E-02
GAS7 0.300674 3.336826E-03 PBXIP1 0.410710 1.674284E-02 PFDN5 0.190537 4.824541E-02
ZNF282 0.206052 3.484791E-03 AARSD1 -0.230421 1.677547E-02 HNRNPD -0.328193 4.874030E-02
SMAD2 -0.327491 3.649291E-03 YTHDC2 -0.209308 1.692677E-02 RNASE4 0.209620 4.923652E-02
ZNF423 -0.464177 3.728742E-03 YBX1 0.451395 1.694194E-02 ZEB1 -0.306020 4.923652E-02
RBMS1 -0.340321 3.728742E-03 ZFP161 0.208603 1.714257E-02 PATZ1 0.186908 4.941602E-02
APC -0.279449 3.789623E-03 BCL2L13 -0.314110 1.744508E-02 CDKN1C 0.542014 4.942412E-02

Genes in the Moran study with P values < 0.05 in gene pathways related to DNA damage and stress

B R code

Below is the R code for this report showing the major steps, as used for the Middleton dataset.

Quality Control

#Quality control for Middleton Study, step 1 make sure CEL files are in wd and load them into R

library("affy")
library("arrayQualityMetrics")
library("limma")

Middleton<-ReadAffy();

#Define pheno_data for AffyBatch to include PD/C info
Middleton_Status<-c("C",rep("PD",7),"C","PD",rep("C",7),"PD","C","C","PD","PD",rep("C",7));
#The sample index runs over all 29 arrays (18 controls, 11 PD)
Middleton_pheno_data<-new("AnnotatedDataFrame",data=data.frame(sample=seq_along(Middleton_Status),Status=Middleton_Status));
sampleNames(Middleton_pheno_data)<-list.celfiles();
phenoData(Middleton)<-Middleton_pheno_data;

#Calculate and plot RNA degradation graph
Middleton_degrade<-AffyRNAdeg(Middleton,log.it=TRUE);
plotAffyRNAdeg(Middleton_degrade,transform="shift.scale");

#Plot MvA plot PreNorm (controls are split into two groups of nine to keep the plots readable)
Middleton_controls<-which(Middleton_Status=="C");
Middleton_park<-which(Middleton_Status=="PD");
mva.pairs(exprs(Middleton[,Middleton_controls[1:9]]),log.it=TRUE,plot.method="smoothScatter");
mva.pairs(exprs(Middleton[,Middleton_controls[10:18]]),log.it=TRUE,plot.method="smoothScatter");
mva.pairs(exprs(Middleton[,Middleton_park]),log.it=TRUE,plot.method="smoothScatter");

#Note for full quality analysis use arrayQualityMetrics package:
# arrayQualityMetrics(expressionset = Middleton, outdir = "Middleton_QAraw", force = FALSE, do.logtransform = TRUE, intgroup = "Status")
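The MvA plots used above compare two arrays through the log-ratio M and the mean log-intensity A of each probe. A small sketch of that calculation in base R, on made-up intensities (the two vectors are invented for illustration and are not taken from any of the datasets):

```r
# Hypothetical raw intensities for the same four probes on two arrays.
array1 <- c(100, 200, 400, 800)
array2 <- c(100, 400, 200, 800)

# M is the log2 ratio of the two arrays, A is their average log2 intensity.
M <- log2(array1) - log2(array2)
A <- (log2(array1) + log2(array2)) / 2

print(data.frame(M = M, A = A))
```

For well-behaved arrays the M values scatter around zero across the whole range of A; a trend in M against A suggests an intensity-dependent bias that normalisation should remove.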

Normalisation, Quality Control and LIMMA Analysis

# Normalisation and postNorm quality control and LIMMA analysis

#Normalise using RMA
Middleton_normed<-rma(Middleton);

#Do post normalisation Quality control
#rma() returns log2 expression values, so no further log transform is applied here
mva.pairs(exprs(Middleton_normed[,Middleton_controls[1:9]]),log.it=FALSE,plot.method="smoothScatter");
mva.pairs(exprs(Middleton_normed[,Middleton_controls[10:18]]),log.it=FALSE,plot.method="smoothScatter");
mva.pairs(exprs(Middleton_normed[,Middleton_park]),log.it=FALSE,plot.method="smoothScatter");

#Note for full quality analysis use arrayQualityMetrics package:
# arrayQualityMetrics(expressionset = Middleton_normed, outdir = "Middleton_QAnorm", force = FALSE, do.logtransform = FALSE, intgroup = "Status")

#Continue with LIMMA analysis - Create design matrix
Middleton_design<-model.matrix(~Middleton_normed$Status);
colnames(Middleton_design)<-c("C","PDvsC")

#Run lmFit and eBayes
Middleton_fit<-lmFit(Middleton_normed,Middleton_design);
Middleton_fit<-eBayes(Middleton_fit);

# Do multiple hypothesis adjustment
Middleton_top<-topTable(Middleton_fit,coef="PDvsC",adjust.method="BH",number=nrow(Middleton_normed));
Middleton_results<-decideTests(Middleton_fit,adjust.method="fdr",p.value=0.05);

#Write significant genes to file (column 2 of the results is the PDvsC coefficient)
Middleton_sig<-rownames(Middleton_results)[which(as.integer(Middleton_results[,2])!=0)];
write(Middleton_sig,file="Middleton_sig_genes.txt");
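The "BH" and "fdr" options used in the LIMMA step both refer to the Benjamini-Hochberg adjustment, which is also available in base R through p.adjust(). A minimal sketch on invented P values (the numbers are illustrative only and do not come from any of the studies):

```r
# Invented raw P values for five probes.
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment; "fdr" is an alias for "BH" in p.adjust().
p_bh  <- p.adjust(p_raw, method = "BH")
p_fdr <- p.adjust(p_raw, method = "fdr")

# Probes that remain significant at the 0.05 level after adjustment.
significant <- which(p_bh < 0.05)

print(p_bh)
print(significant)
```

Two of the raw P values lie below 0.05 but no longer pass the threshold after adjustment, which is the behaviour that keeps the false discovery rate under control when thousands of probes are tested at once.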

Gene Set analysis

GSEAlm

#GSEAlm method for Gene Set Analysis for pathways from Gene Ontology (GO) database

library(genefilter)
library(hgu133a.db) #Check with annotation() if this is right, note for Mullen need hgu133plus2.db
#library(KEGG.db)
library(GO.db)
library(GSEABase)
library(GSEAlm)

#Get GeneSetCollection from GO with all pathways
Middleton_gsc<-GeneSetCollection(Middleton_normed,setType=GOCollection());

#Create Incidence matrix from the GeneSetCollection describing all pathways
Middleton_Am<-incidence(Middleton_gsc);

#Create expression set with only genes in incidence matrix
Middleton_nsF<-Middleton_normed[colnames(Middleton_Am), ];

#Only select the pathways with greater than 10 genes as short pathways are difficult to analyse statistically
Middleton_selectedrows<-(rowSums(Middleton_Am)>10);
Middleton_Am2<-Middleton_Am[Middleton_selectedrows,];

#Apply the GSEAlm algorithm with 2000 permutations
Middleton_perm<-gsealmPerm(Middleton_nsF,~Status,mat=Middleton_Am2,nperm=2000);

#Prepare the output file: keep the smaller of the two one-sided permutation P values
Middleton_permA<-pmin(Middleton_perm[,1],Middleton_perm[,2]);

#Direction of change, taken from the smaller tail
#(assumption: column 1 holds the lower-tail and column 2 the upper-tail P value)
Middleton_permB<-ifelse(Middleton_perm[,1]<Middleton_perm[,2],"DOWN","UP");

Middleton_permAdj<-p.adjust(Middleton_permA,method="fdr",n=length(Middleton_permA));

#GO IDs of the retained pathways
Middleton_GOID<-rownames(Middleton_Am2);

Middleton_GO<-cbind(as.vector(Middleton_GOID),as.vector(Term(Middleton_GOID)),as.vector(Middleton_permB),
as.vector(Middleton_permA),as.vector(Middleton_permAdj));
Middleton_GO<-Middleton_GO[order(as.numeric(Middleton_GO[,4])),];
colnames(Middleton_GO)<-c("GOID","GO Term", "UP/DOWN", "P value","Adjusted P value");

#Save results to file.
write.table(Middleton_GO,file="Middleton_GO_terms.txt",sep="\t");
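The bookkeeping around the permutation test (filtering gene sets by size, taking the smaller tail and adjusting it) can be tried out in base R without any Bioconductor package. The sketch below uses a toy incidence matrix and invented permutation P values; the column names Lower and Upper, and the rule that a small lower-tail P value means down regulation, are assumptions made for illustration:

```r
# Toy incidence matrix: rows are gene sets, columns are probes (1 = member).
Am <- rbind(
  setA = c(1, 1, 1, 0, 0, 0),
  setB = c(1, 0, 0, 0, 0, 0),
  setC = c(0, 1, 1, 1, 1, 0)
)
colnames(Am) <- paste0("probe", 1:6)

# Keep only sets with more than min_size members (the listing above uses 10).
min_size <- 2
Am2 <- Am[rowSums(Am) > min_size, , drop = FALSE]

# Invented lower- and upper-tail permutation P values for the retained sets.
perm <- cbind(Lower = c(0.01, 0.70), Upper = c(0.99, 0.30))
rownames(perm) <- rownames(Am2)

# Smaller tail, direction of change, and FDR adjustment.
p_small   <- pmin(perm[, "Lower"], perm[, "Upper"])
direction <- ifelse(perm[, "Lower"] < perm[, "Upper"], "DOWN", "UP")
p_adj     <- p.adjust(p_small, method = "fdr")

print(data.frame(direction, p_small, p_adj))
```

Using drop = FALSE keeps the result a matrix even when only one gene set survives the size filter.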

GAGE

1 #GAGE method for Gene Set Analysis for pathways from Gene Ontology (GO) database

23 #library(KEGG.db)

4 library(GO.db)

56 #Use GSEABase package to get Gene set collection, format needs to be changed slightly to work with GAGE.

18

Robert Bentham

7 Middleton_gsc<-GeneSetCollection(Middleton_normed,setType=GOCollection());

8 Middleton_geneset<-geneIds(Middleton_gsc);

9 Middleton_genesetnames<-names(Middleton_gsc);

1011 #Apply GAGE algorithm

12 Middleton_gage <- gage(exprs(Middleton_normed), gsets = Middleton_geneset, ref = Middleton_controls, samp = Middleton_park,compare=’unpaired’);

1314 #Get GOID from Middleton_genesetname in right format

15 Middleton_lessterms<-c(1:length(Middleton_genesetnames));

16 Middleton_greaterterms<-c(1:length(Middleton_genesetnames));

17 for (i in 1:length(Middleton_lessterms)){

18 Middleton_lessterms[i]<-Middleton_genesetnames[as.numeric(substring(rownames((Middleton_gage$less[, 1:5]))[i],2,nchar(rownames((Middleton_gage$less[,

1:5]))[i])))]

19 Middleton_greaterterms[i]<-Middleton_genesetnames[as.numeric(substring(rownames((Middleton_gage$greater[,

1:5]))[i],2,nchar(rownames((Middleton_gage$greater[, 1:5]))[i])))]

20 }

#Prepare file for saving
Middleton_GOgageless<-cbind(Middleton_lessterms,as.vector(Term(Middleton_lessterms)),as.vector((Middleton_gage$less[, 3])),as.vector((Middleton_gage$less[, 4])));
Middleton_GOgagegreater<-cbind(Middleton_greaterterms,as.vector(Term(Middleton_greaterterms)),as.vector((Middleton_gage$greater[, 3])),as.vector((Middleton_gage$greater[, 4])));
colnames(Middleton_GOgageless)=c("GOID","GO Term", "P value","Adjusted P value");
colnames(Middleton_GOgagegreater)=c("GOID","GO Term", "P value", "Adjusted P value");

write.table(Middleton_GOgagegreater,file="Middleton_GO_gage_greater.txt",sep="\t");
write.table(Middleton_GOgageless,file="Middleton_GO_gage_less.txt",sep="\t");
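
The GO ID lookup in the GAGE listing above strips the first character of each result row name and uses the remainder as an index into the gene set names. A minimal vectorised sketch of the same idea is given here. It assumes row names of the form one prefix character followed by an index (the "X" prefix is an assumption), and the toy vectors `genesetnames` and `result_rownames` are hypothetical stand-ins, not data from the study.

```r
# Toy stand-ins (hypothetical): names of five gene sets and the row names of
# a results table sorted by significance.
genesetnames <- c("GO:0000001", "GO:0000002", "GO:0000003", "GO:0000004", "GO:0000005")
result_rownames <- c("X4", "X1", "X5", "X2", "X3")

# Drop the one-character prefix and convert the remainder to an index.
set_index <- as.numeric(substring(result_rownames, 2))

# Look up all identifiers at once; this matches what the loop does per element.
ordered_ids <- genesetnames[set_index]
print(ordered_ids)
```

Because `substring()` and vector indexing are both vectorised, no explicit loop or pre-allocated vector is needed.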

Find GO Pathways which are significant

#Find GO pathways that are significantly over expressed in all studies

#The tables were built with cbind, so the adjusted P values are stored as text and are converted before comparing.
Middleton_GOgagegreatersig<-Middleton_GOgagegreater[which(as.numeric(Middleton_GOgagegreater[,4])<0.05),];
Moran_GOgagegreatersig<-Moran_GOgagegreater[which(as.numeric(Moran_GOgagegreater[,4])<0.05),];
Mullen_GOgagegreatersig<-Mullen_GOgagegreater[which(as.numeric(Mullen_GOgagegreater[,4])<0.05),];

#GO IDs significant in all three studies
sig_genes<-intersect(Mullen_GOgagegreatersig[,1],intersect(Moran_GOgagegreatersig[,1],Middleton_GOgagegreatersig[,1]));

GO_gagegreatersig<-cbind(sig_genes,Term(sig_genes));

write.table(Middleton_GOgagegreatersig,file="Middleton_GO_gage_greater_sig.txt",sep="\t");
write.table(Moran_GOgagegreatersig,file="Moran_GO_gage_greater_sig.txt",sep="\t");
write.table(Mullen_GOgagegreatersig,file="Mullen_GO_gage_greater_sig.txt",sep="\t");
write.table(GO_gagegreatersig,file="GO_gage_greater_sig.txt",sep="\t");
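
The filtering and intersection step above can be illustrated on toy data. The sketch below is not the study's code: the three tables, the helper names `make_table` and `significant_ids`, and all values are hypothetical. It shows two points: result tables built with `cbind()` hold their p-values as character strings, which can be misordered if compared as text (for example `1e-05`), and `Reduce()` can intersect any number of studies in one call.

```r
# Hypothetical toy result tables, built with cbind() so that every column,
# including the p-values, is stored as character.
make_table <- function(ids, adj_p) {
  tab <- cbind(ids, paste("term", ids), adj_p, adj_p)
  colnames(tab) <- c("GOID", "GO Term", "P value", "Adjusted P value")
  tab
}
study_a <- make_table(c("GO:1", "GO:2", "GO:3"), c(1e-05, 0.01, 0.2))
study_b <- make_table(c("GO:1", "GO:2", "GO:4"), c(0.03, 0.04, 0.001))
study_c <- make_table(c("GO:1", "GO:3", "GO:2"), c(0.002, 0.5, 0.049))

# Keep the IDs whose adjusted p-value is below the threshold; as.numeric()
# avoids comparing the values as strings.
significant_ids <- function(tab, alpha = 0.05) {
  tab[as.numeric(tab[, 4]) < alpha, 1]
}

# Intersect the significant identifiers of all studies in one step.
common_ids <- Reduce(intersect, lapply(list(study_a, study_b, study_c), significant_ids))
print(common_ids)
```

With these toy values only `GO:1` and `GO:2` pass the threshold in all three tables.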

Table for Significant genes in pathways related to DNA damage and stress

#Create table of significant genes in pathways related to DNA damage in the Middleton study.

#Create mapping between Affy probes and gene names
library(hgu133a.db)
a<-hgu133aSYMBOL;
mapped_probes<-mappedkeys(a);
xx<-as.list(a[mapped_probes]);

#Import relevant GO pathways from premade file
Middleton_GO_DNA <- read.table("~/CP2/NewStudies/Middleton_GO_DNA", quote="\"")

#Find all genes involved in a DNA damage related pathway
A=c(1:length(Middleton_GO_DNA));
A[1]=which(Middleton_genesetnames==Middleton_GO_DNA[1]);
C1=genepaths[[A[1]]]
for (i in 2:length(Middleton_GO_DNA)){
  A[i]=which(Middleton_genesetnames==Middleton_GO_DNA[i])
  C2=genepaths[[A[i]]]
  C1=union(C1,C2)}

#Select genes of interest
D1=which(Middleton_top[,1] %in% C1);

#Map probe Affy ID to gene ID (one entry per selected row of Middleton_top)
C1a=character(length(D1));
for (i in seq_along(D1)){
  C1a[i]=xx[[as.character(Middleton_top[D1[i],1])]]}

#Select only significant genes and save file
Middleton_DNA_genes<-cbind(C1a,Middleton_top[D1,c(2,6)]);
Middleton_DNA_genes_sig<-Middleton_DNA_genes[which(Middleton_DNA_genes[,3]<0.05),];
for (i in 1:5){
  write.table(sprintf("%s & %f & %E \\\\ \\hline",Middleton_DNA_genes_sig[i,1],Middleton_DNA_genes_sig[i,2],Middleton_DNA_genes_sig[i,3]),
    file="Middletontable.txt",append=TRUE,row.names=FALSE,col.names=FALSE)
}

#Create table for significant genes common to all datasets

for (i in 1:10){
  cat(sprintf("%s & \\Checkmark & \\XSolidBrush & \\Checkmark \\\\ \\hline",intersect(Mullen_DNA_genes_sig[,1],Moran_DNA_genes_sig[,1])[i]))}

for (i in 1:2){
  cat(sprintf("%s & \\Checkmark & \\Checkmark & \\XSolidBrush\\\\ \\hline",intersect(Middleton_DNA_genes_sig[,1],Moran_DNA_genes_sig[,1])[i]))}
for (i in 1:3){
  cat(sprintf("%s & \\XSolidBrush & \\Checkmark & \\XSolidBrush\\\\ \\hline",setdiff(Middleton_DNA_genes_sig[,1],Moran_DNA_genes_sig[,1])[i]))}
for (i in 1:4){
  cat(sprintf("%s & \\XSolidBrush & \\XSolidBrush & \\Checkmark \\\\ \\hline",setdiff(Mullen_DNA_genes_sig[,1],Moran_DNA_genes_sig[,1])[i]))}
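
The LaTeX table rows above are produced one at a time inside loops with hard-coded lengths. As a hedged alternative, the sketch below formats every row of a table in a single `sprintf()` call and writes the result with `writeLines()`. The data frame `genes`, its column names and its values are hypothetical and are not results from the study.

```r
# Hypothetical table of gene symbols, log fold changes and adjusted p-values.
genes <- data.frame(
  symbol = c("GENE1", "GENE2"),
  logFC = c(1.25, -0.5),
  adj_p = c(0.001, 0.02),
  stringsAsFactors = FALSE
)

# sprintf() is vectorised, so one call formats every row of the table.
latex_rows <- sprintf("%s & %f & %E \\\\ \\hline", genes$symbol, genes$logFC, genes$adj_p)

# writeLines() writes the rows as they are, without the quotes that
# write.table() adds to character data by default.
out_file <- tempfile(fileext = ".txt")
writeLines(latex_rows, out_file)
print(latex_rows)
```

Formatting all rows at once means the number of rows follows the data, so no loop bound has to be updated when the list of significant genes changes.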
