
RESEARCH Open Access

RMaNI: Regulatory Module Network Inference framework

Piyush B Madhamshettiwar1,2, Stefan R Maetschke1,2, Melissa J Davis1,2, Mark A Ragan1,2*

From Asia Pacific Bioinformatics Network (APBioNet) Twelfth International Conference on Bioinformatics (InCoB2013)
Taicang, China. 20-22 September 2013

Abstract

Background: Cell survival and development are orchestrated by complex interlocking programs of gene activation and repression. Understanding how this gene regulatory network (GRN) functions in normal states, and is altered in cancer subtypes, offers fundamental insight into oncogenesis and disease progression, and holds great promise for guiding clinical decisions. Inferring a GRN from empirical microarray gene expression data is a challenging task in cancer systems biology. In recent years, module-based approaches for GRN inference have been proposed to address this challenge. Despite the demonstrated success of module-based approaches in uncovering biologically meaningful regulatory interactions, their application remains limited to a single condition, without support for the comparison of multiple disease subtypes or conditions. Also, their use remains unnecessarily restricted to computational biologists, as accurate inference of modules and their regulators requires integration of diverse tools and heterogeneous data sources, which in turn requires scripting skills, data infrastructure and powerful computational facilities. New analytical frameworks are required to make module-based GRN inference approaches more generally useful to the research community.

Results: We present the RMaNI (Regulatory Module Network Inference) framework, which supports cancer subtype-specific or condition-specific GRN inference and differential network analysis. It combines transcriptomic and genomic data sources, and integrates heterogeneous knowledge resources and a set of complementary bioinformatic methods for the automated inference of modules and their condition-specific regulators; it also facilitates downstream network analyses and data visualization. To demonstrate its utility, we applied RMaNI to a hepatocellular microarray dataset containing normal and three disease conditions. We demonstrate how RMaNI can be employed to understand the genetic architecture underlying three disease conditions. RMaNI is freely available at http://inspect.braembl.org.au/bi/inspect/rmani

Conclusion: RMaNI makes available a workflow with a comprehensive set of tools that would otherwise be challenging for non-expert users to install and apply. The framework presented in this paper is flexible and can be easily extended to analyse any dataset with multiple disease conditions.

* Correspondence: [email protected] for Molecular Bioscience, The University of Queensland, 306 Carmody Road, St Lucia, Brisbane, Queensland 4072, Australia
Full list of author information is available at the end of the article

Madhamshettiwar et al. BMC Bioinformatics 2013, 14(Suppl 16):S14
http://www.biomedcentral.com/1471-2105/14/S16/S14

© 2013 Madhamshettiwar et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Complex cellular behaviour in cancer is orchestrated by the action of transcriptional regulatory networks [1,2]. Computational inference of transcriptional regulatory networks, referred to as Gene Regulatory Networks (GRN), from microarray gene expression data is one of the fundamental goals of systems biology and its translation to genomic medicine [3]. GRN inference and analysis, especially when integrated with experimental validation, has proven to be a powerful tool in understanding how regulatory networks are disrupted and rewired in normal and cancer conditions, and in identifying novel regulatory interactions as well as broader systemic disruptions in key oncogenic processes [4-6]. Many methods have been developed to infer GRNs from microarray gene expression data. These approaches include unsupervised, semi-supervised and supervised methods based on computational mathematics, multivariate statistics and information science [7-11].

Although diverse computational and statistical approaches have been applied to this problem, the accuracy of edge-wise network inference methods remains poor [11-14]. Novel approaches are needed to address the genome-wide network inference problem. A promising direction is the inference of transcriptional modules instead of individual edges. Module inference is simpler than edge-wise network inference [15,16], and higher accuracies can be achieved [7,17].

Transcriptional module networks

Several studies have revealed that regulatory networks are modular in nature and organised hierarchically [18]. According to Oltvai and Barabasi’s “complexity of life” pyramid, functional modules are less complex than individual transcriptional programs, which in turn are the building blocks for these modules [15]. Therefore, inferring modules instead of the individual interactions of complete networks drastically reduces the complexity of the inference problem, and shows great promise for network analysis in complex disease conditions including cancer [17,19-21]. A transcriptional-module network is composed of clusters of co-expressed genes collaboratively or alternatively regulated by one or several transcription factors (TFs) via convergent or divergent regulatory programs. A convergent regulatory program represents a particular set of target genes (TGs) regulated by different sets of TFs, whereas a divergent regulatory program represents a given set of TFs regulating distinct sets of TGs [7,22].

Several methods have been developed to infer modules from microarray data, including a range of clustering methods such as k-means, hierarchical clustering and self-organizing maps. However, all these approaches suffer from certain limitations; for instance, the number of clusters is not determined automatically but must be pre-specified [23-26]. WGCNA [27], based on the weighted gene co-expression network analysis approach [28], is the most widely used method and has been applied to a number of diseases [29-32]. It also uses a clustering approach to infer modules, but it optimizes the threshold to achieve a scale-free topology. Assuming scale-freeness, several model-based clustering approaches have been developed [33-35]. Model-based approaches allow a statistical analysis of the inferred modules and automatically estimate the number of modules [34]. For example, Genomica [20] uses expectation maximisation (EM) to identify modules [16,20].

Other methods [22,36-39] use additional experimental data such as protein-protein interactions, TF binding affinity data, in vitro DNA binding specificities, DNA motifs and ChIP-chip data. Such integrative approaches are attractive and promising for inferring modules, as they take into account different sources of biological information [40]. However, they do not natively integrate methods for module inference, identification of regulators, or comprehensive downstream analysis and visualization. Also, they support only the analysis of individual datasets arising from a single condition, without differential analysis of other conditions or subtypes.

Integrating diverse data sources as well as multiple methods brings many challenges. These challenges range from methodological to practical in nature, and can arise from the computational or statistical complexity of methods and the dimensionality of omic data [41,42]. For instance, combining heterogeneous data requires extensive file formatting at different stages of analysis, while integrating different methods involves the selection or optimization of diverse parameters and other user-control features. As a consequence of these challenges, it is difficult for biologists or clinicians (without strong informatic skills) to chain multiple methods together into comprehensive, flexible workflows to address substantial questions. For example, to identify the modules involved in any disease condition one must retrieve data from different repositories (e.g. motif data from Transfac [43] or Genomatix [44]), map the identifiers e.g. using Biomart [45], perform differential gene expression analysis e.g. using LIMMA [46], infer the modules and identify regulators e.g. using Genomica [20], integrate the inferred modules and regulators for visualization e.g. using Cytoscape [47], and finally perform functional analysis of module genes e.g. using DAVID [48,49]. This work focuses on making available a workflow and computational resources for the inference of modules and their regulators, downstream analyses and visualization.

RMaNI - Regulatory Module Network Inference framework

Here, we present a novel integrative and automated analytical framework, “RMaNI - Regulatory Module Network Inference”, for disease condition-specific or subtype-specific module network inference, analysis and data visualization. It uses the Learning Module Networks (LeMoNe) algorithm [50] and Regulatory Impact Factors (RIF) [51] to identify relevant regulatory TFs. The LeMoNe algorithm uses a Bayesian probabilistic model-based approach for clustering genes and, in selecting thresholds, does not assume that networks necessarily have a scale-free topology [50].

RMaNI combines transcriptomic and genomic data sources, and integrates heterogeneous knowledge resources and a set of complementary bioinformatic methods for microarray data processing, differential expression (DE) analysis, module detection and regulator identification, gene and module significance measure calculations, functional enrichment analysis of module genes, and visualization of data and networks.

Case study - application to hepatocellular carcinoma

To demonstrate its utility, we applied RMaNI to a hepatocellular microarray dataset containing normal tissue and three disease conditions: pre-malignant (cirrhosis), cirrhosis with hepatocellular carcinoma (cirrhosisHCC), and hepatocellular tumor (HCC). We illustrate that the identification and analysis of a transcriptional module network can give insight into the common and unique genetic architecture underlying hepatocellular carcinoma conditions.

Implementation

The RMaNI web interface has been created using Rwui [52], a Java-based application that uses the Apache Struts framework. The complete application runs on a Tomcat server on a high-performance computing cluster. The workflow integrates publicly available R [53], Bioconductor [54] and custom packages and functions for data import, processing, analysis, integration and visualization. All packages currently run under R version 2.15.2, and can be easily updated as newer versions of R are released. RMaNI is freely available as a user-friendly web application at http://inspect.braembl.org.au/bi/inspect/rmani, with a comprehensive manual available (Additional File 1).

In the next section, we describe the RMaNI workflow and provide a brief overview of the methods used in each step. Then, we present a case study showing how RMaNI can be employed to understand the genetic architecture underlying three hepatocellular carcinoma conditions.

RMaNI: structure and functionalities

Figure 1 illustrates the workflow in RMaNI. The workflow is divided into three main stages: 1) data preparation, 2) inference of modules and regulators, and 3) integration of module networks and analysis. In this section, we describe these stages and the individual steps involved therein.

Stage 1 - Data Preparation

At this stage, the pre-processed (background corrected and normalized) microarray gene expression data and sample annotations are imported from files uploaded by the user.

Step 1.1 - Datasets

RMaNI can be applied to gene expression datasets arising from multiple conditions. Currently, we support datasets arising from 13 different types of Affymetrix chips: hgu133a, hgu133a2, hgu133b, hgu133plus2, hgu219, hgu95a, hgu95av2, hgu95b, hgu95c, hgu95d, hgu95e, hthgu133a and hthgu133b.

Step 1.2 - Feature selection for input to module inference workflow

Once a user has the microarray dataset, the question arises: which and how many features should one input to the network inference step? Because there is no standard feature-selection method or recommendation on the minimum number of features, the workflow compares different feature-selection methods for different gene-set sizes, and identifies the optimal combination of these two parameters.

The user compares three feature-selection methods: differentially expressed genes between normal and all subtypes (DE_all), differentially expressed genes between normal and each subtype (DE_pair), and the most-variable genes across the dataset based on the coefficient of variation (Var). For differential expression analysis RMaNI uses the LIMMA package [46], and to select variable genes it uses a custom R function. To find the optimal number of genes, for each of the three feature-selection methods it selects eight candidate subsets of 10 to 4000 genes (10, 50, 100, 200, 500, 1000, 2000 and 4000 genes) as input to the network inference step. To identify the optimal feature-selection method and number of genes, it examines how well each gene set groups the samples into the known classes.

The user compares the different gene sets using seven different clustering methods (clues, kmeans, PAM, AGNES, Fanny, SOTA and MCLUST) [55-57]. The workflow uses the Rand Index (RI) [58] as a measure for evaluating the clustering performance. RI measures the similarity between two data clusterings (known against predicted). An RI equal to 1 indicates perfect clustering, while an RI of 0 indicates that the two clusterings agree on no pair of samples (it is the chance-corrected, adjusted form of the index for which 0 corresponds to chance-level agreement). These methods are implemented in the R packages clValid [59], clues [55], cluster [57] and mclust [56]. A brief description of each clustering method is given below.

clues (clustering based on local shrinking) is a nonparametric clustering method using local shrinking [55]. It estimates the number of clusters and simultaneously finds a partition of a data set via three steps: shrinking, partition, and determination of the optimal number of partitions.

kmeans is a parametric, centroid-based clustering method. Given the number of clusters, it starts with an initial estimate for the cluster centroids, and each sample is assigned to the cluster with the nearest mean [60]. The cluster centroids are then updated, and the entire process is iterated until the cluster centroids become stable.

PAM (Partitioning Around Medoids) is a parametric method similar to k-means, but PAM is a medoid-based method. A medoid is a representative object of a cluster, such that its average dissimilarity to all objects in that cluster is minimal [61]. Given the number of clusters, PAM starts with an initial estimate for the cluster medoids, and calculates the dissimilarity matrix using the Euclidean or Manhattan distance [61]. Based on this matrix, each sample is assigned to the cluster with the nearest medoid.

AGNES (AGglomerative NESting) is a hierarchical clustering method which groups a dataset into a tree of clusters [61]. It is a bottom-up clustering method that starts with small clusters of single samples and then, at each step using a specified distance metric, merges the clusters into larger clusters. This is repeated iteratively until a single cluster containing all samples is obtained.

Figure 1 RMaNI workflow. Stages involved in the RMaNI workflow. The workflow is divided into three main stages: data preparation, inference of modules and regulators, and integration of module networks and analysis.



Fanny is a fuzzy or soft clustering method [61]. With this method each sample has partial membership in each cluster rather than belonging exclusively to a single cluster; each sample is described by probability scores for its cluster memberships. After the number of clusters has been chosen, the method starts by assigning random cluster probabilities to each sample and then updates them iteratively until convergence.

SOTA (Self-Organising Tree Algorithm) is a divisive clustering method [62]. It generates an unsupervised neural network with a binary tree topology. Contrary to AGNES, SOTA is a top-down clustering method. It starts the clustering process with a binary tree consisting of a root node with two leaves, each representing one cluster. The self-organizing process then grows the tree by converting the leaf with the highest score into a node and attaching two new leaves to it. The score for each cluster is defined as the mean value of the distances between the cluster and the samples associated with it [63].

MCLUST (Model based clustering) is a nonparametric, model-based clustering method that uses finite normal mixture modelling and the expectation maximisation (EM) algorithm. Unlike other methods, it does not require the number of clusters as input, but instead infers the number of clusters from the data.

In summary, stage 1 provides an estimate of the feature-selection method and the number of genes that best explain the given data. The user can then use this feature-selection method and number of genes as input for finding clusters of co-expressed genes in the next stage. For ease and flexibility of processing the user’s own data, the feature selection step is not supported through the RMaNI web interface.
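The evaluation described in Step 1.2 can be illustrated with a minimal Python sketch (the workflow itself uses R). It ranks genes by coefficient of variation, as in the Var method, and scores a predicted sample grouping against the known classes with the unadjusted Rand Index. The function names and toy data are hypothetical, and the clustering step that would produce the predicted labels is left out.

```python
from itertools import combinations
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the absolute mean (0.0 if the mean is zero)."""
    m = mean(values)
    return pstdev(values) / abs(m) if m != 0 else 0.0


def top_variable_genes(expression, n_genes):
    """Return the n_genes gene names with the largest coefficient of variation.

    expression maps a gene name to its expression values, one per sample.
    """
    ranked = sorted(
        expression,
        key=lambda gene: coefficient_of_variation(expression[gene]),
        reverse=True,
    )
    return ranked[:n_genes]


def rand_index(known, predicted):
    """Fraction of sample pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(known)), 2))
    agree = sum(
        (known[i] == known[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy data (hypothetical): three genes measured in four samples.
expression = {"g1": [1, 1, 1, 1], "g2": [1, 5, 1, 5], "g3": [2, 3, 2, 3]}
selected = top_variable_genes(expression, 2)

known = ["normal", "normal", "HCC", "HCC"]
perfect = rand_index(known, [0, 0, 1, 1])   # every pair agrees
poor = rand_index(known, [0, 1, 0, 1])      # only 2 of 6 pairs agree
```

In the toy example the poor grouping still scores 2/6 rather than 0, which is why the unadjusted index should be read alongside its chance-corrected form when gene sets are compared.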

Stage 2 - Clustering of genes to modules and identification of regulators

This is the main stage of the RMaNI workflow. It takes the gene set optimized in the feature selection step, and uses the corresponding gene expression data for module network inference. Below we provide the details of the individual steps.

Step 2.1 - Inference of transcriptional module networks

Given a gene expression dataset and a set of candidate regulators (TFs, microRNAs or a clinical variable of interest such as stage or grade), inference of modules is composed of two steps: first, clustering of co-expressed genes to identify modules, and second, the inference of links between regulators and modules.

Step 2.1.1 - Clustering of genes

RMaNI uses the LeMoNe (Learning Module Networks) algorithm for inferring modules from microarray data. LeMoNe performs a two-way Bayesian clustering of genes and uses a Gibbs sampling procedure to iteratively update the cluster assignments of genes [34,50]. Each inferred module contains the genes for which the expression profiles best fit the same multivariate normal distribution [7]. LeMoNe has been successfully applied to different conditions including cancer [17,64-67]. LeMoNe outputs the ensemble of clustering solutions represented as a gene-to-cluster probability matrix reflecting the probability of the assignment of a gene to each module, referred to as fuzzy clustering (one gene can belong to multiple modules, each with a certain probability). Using a graph spectral method and a probability cut-off, it then outputs tight clusters (in which one gene belongs to only one cluster) from the fuzzy clusters [50].
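The step from fuzzy to tight clusters can be pictured with a deliberately simplified Python sketch. It is not LeMoNe's graph spectral method: as a stand-in, it assigns each gene to its most probable module if that probability reaches a cut-off, and discards modules below a minimum size. The function name, the input layout and the default values are assumptions made for illustration.

```python
def tight_clusters(prob, cutoff=0.2, min_size=4):
    """Turn fuzzy memberships into non-overlapping clusters by simple thresholding.

    prob maps gene -> {module: probability}. Each gene goes to its most
    probable module if that probability is at least `cutoff`; modules with
    fewer than `min_size` genes are dropped. This is a simplification and
    does not reproduce the spectral procedure used by LeMoNe.
    """
    clusters = {}
    for gene, memberships in prob.items():
        module, p = max(memberships.items(), key=lambda item: item[1])
        if p >= cutoff:
            clusters.setdefault(module, []).append(gene)
    return {m: sorted(genes) for m, genes in clusters.items() if len(genes) >= min_size}


# Toy membership matrix (hypothetical gene and module names).
prob = {
    "g1": {"M1": 0.9, "M2": 0.1},
    "g2": {"M1": 0.7, "M2": 0.3},
    "g3": {"M1": 0.8, "M2": 0.2},
    "g4": {"M1": 0.6, "M2": 0.4},
    "g5": {"M1": 0.4, "M2": 0.6},   # M2 ends up with one gene and is dropped
    "g6": {"M1": 0.1, "M2": 0.1},   # below the cut-off, left unassigned
}
modules = tight_clusters(prob)
```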

Step 2.2 - Inferring the regulators

To identify and prioritize potential TFs regulating modules, candidate TFs are gathered by integrating lists of TFs from Vaquerizas [68], Ravasi [69], TCOF-DB [70] and Transfac [43]. To infer the potential regulators of each module, two methods are employed: LeMoNe’s regulatory program (LRP) and the Regulatory Impact Factor (RIF) analysis algorithm. Below, we briefly describe these methods.

Step 2.2.1 - LeMoNe regulatory program

In LeMoNe’s regulatory program, two types of regulators can be assigned: regulators with continuous values or with discrete values. Continuous values include expression values measured, for example, for TFs, signal transducers, kinases and/or microRNAs. Discrete values can be clinical variables like tumor stage or grade. In this workflow the focus is on TFs. Transcriptional regulatory programs are inferred using a hierarchical decision-tree model. The regulator assigned to each module consists of the set of TFs for which the expression profiles best explain all or part of the conditions. TFs receive a regulatory score reflecting the statistical confidence with which a TF regulates genes in the cluster. The collection of the regulatory scores for each TF is then converted into a global score. Finally, the TFs are sorted by their scores to construct a ranked list of potential regulators.

Step 2.2.2 - Regulatory Impact Factor (RIF) analysis

RIF analysis was initially developed to identify TFs that contribute to the differential expression in a particular condition, even when the TF itself is not differentially expressed [51]. RIF is based on the differential correlation between a TF and the genes differentially expressed (DE) under two conditions. To compute a regulatory confidence score, it integrates three sources of information into a single measure: (a) the change in correlation between the TF and the DE genes, referred to as differential wiring; (b) the amount of differential expression of DE genes; and (c) the abundance of DE genes under the two conditions. It assigns a score (RIF1) to those TFs that are consistently most differentially co-expressed with the highly abundant and highly expressed DE genes, and another score (RIF2) to those TFs with the most altered ability to predict the abundance of DE genes [51].
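The three ingredients listed above can be combined as in the following Python sketch. The exact formulas are an assumption based on one common statement of the RIF metrics: RIF1 averages abundance × differential expression × squared differential wiring over the DE genes, and RIF2 averages the difference of squared (expression × correlation) terms. They should be checked against [51] before use. All function and variable names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def rif_scores(tf_cond1, tf_cond2, de_cond1, de_cond2):
    """Return (RIF1, RIF2) for one TF under the assumed formulas.

    tf_cond1, tf_cond2: TF expression across samples in conditions 1 and 2.
    de_cond1, de_cond2: dicts mapping each DE gene to its expression across
    the same samples in conditions 1 and 2.
    """
    rif1 = rif2 = 0.0
    for gene in de_cond1:
        e1, e2 = mean(de_cond1[gene]), mean(de_cond2[gene])
        r1 = pearson(tf_cond1, de_cond1[gene])
        r2 = pearson(tf_cond2, de_cond2[gene])
        abundance = (e1 + e2) / 2        # (c) abundance of the DE gene
        diff_expression = e1 - e2        # (b) amount of differential expression
        diff_wiring = r1 - r2            # (a) change in TF-gene correlation
        rif1 += abundance * diff_expression * diff_wiring ** 2
        rif2 += (e1 * r1) ** 2 - (e2 * r2) ** 2
    n_de = len(de_cond1)
    return rif1 / n_de, rif2 / n_de


# Toy data: the TF is positively correlated with gene "a" in condition 1
# and negatively correlated with it in condition 2.
rif1, rif2 = rif_scores(
    [1, 2, 3, 4], [1, 2, 3, 4],
    {"a": [2, 4, 6, 8]}, {"a": [4, 3, 2, 1]},
)
```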

Step 2.3 - Gene significance measures for module ranking

RMaNI uses two measures to rank the modules for each subtype: average gene significance (GS) and modScore. The average gene significance reflects the differential expression between two conditions of the genes in each module, and the modScore represents the overall correlation between genes in each module. RMaNI combines these two scores into a single score referred to as the standard score:

averageGS = average(-log10(DE pvalue of a gene))
modScore = sum(abs(correlation of genes)) / choose(ngenes, 2)
standardScore = 1 - ((1 - averageGS) * (1 - modScore))
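The three scores can be computed as in the Python sketch below, which applies the formulas literally. The function names and inputs are hypothetical. The source does not say whether averageGS is rescaled before it enters the standard score; because the mean of -log10(p) can exceed 1, the sketch leaves any rescaling to the caller.

```python
from itertools import combinations
from math import log10
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def average_gs(p_values):
    """Mean of -log10(p) over the DE p-values of the genes in a module."""
    return mean(-log10(p) for p in p_values)


def mod_score(profiles):
    """Mean absolute pairwise correlation between gene profiles in a module.

    Dividing by the number of gene pairs is the same as dividing by
    choose(ngenes, 2).
    """
    pairs = list(combinations(profiles, 2))
    return sum(abs(pearson(x, y)) for x, y in pairs) / len(pairs)


def standard_score(avg_gs, m_score):
    """Combine the two measures exactly as in the formula above."""
    return 1 - (1 - avg_gs) * (1 - m_score)


# Toy module (hypothetical): three genes, all perfectly (anti-)correlated.
profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
gs = average_gs([0.1, 0.001])        # mean of 1 and 3
ms = mod_score(profiles)
combined = standard_score(0.5, 0.5)  # 1 - 0.5 * 0.5
```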

Stage 3 - Integration of transcriptional module networks and topological analysis

At this stage, the workflow combines all subtype-specific modules and regulators to build a transcriptional module network. In the topological analysis of such a module network, RMaNI calculates the overlap of TFs, TGs and interactions across subtypes, and generates node and edge attributes to aid in visualization.

Step 3.1 - Functional enrichment analysis of the inferred modules

The genes in each of the modules are subjected to a functional GO enrichment analysis using BiNGO [71]. Significantly enriched GO terms are detected by a hypergeometric test with Benjamini-Hochberg False Discovery Rate (FDR) correction [72] at significance level 0.05, using all other genes in the network as the background.
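The statistical test behind this step can be sketched in plain Python as follows. This is not BiNGO's implementation; it only illustrates a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment. The function names and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a module.

    N: genes in the background, K: background genes carrying the GO term,
    n: genes in the module, k: module genes carrying the GO term.
    """
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical module: 20 genes, 8 annotated with a term that covers
# 40 of the 1000 background genes.
p = hypergeom_pvalue(8, 20, 40, 1000)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
significant = [q <= 0.05 for q in adjusted]
```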

Step 3.2 - Cluster similarity measures

To visualize the similarities between different modules, RMaNI uses the Jaccard similarity index as an external measure and GO Biological Process (BP) and Molecular Function (MF) semantic similarity as biological measures. The Jaccard index is calculated as the number of unique genes common to two clusters divided by the total number of unique genes in the two clusters (their union). The BP and MF similarity measures are calculated by the GOSemSim package in Bioconductor [73].
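As a small illustration of the Jaccard index defined above, the following Python sketch compares every pair of modules. The module and gene names are hypothetical, and returning 0.0 for two empty modules is a convention chosen here, not something stated in the source.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Shared genes divided by all distinct genes in the two modules."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(modules):
    """Jaccard index for every unordered pair of modules.

    modules maps a module name to an iterable of gene names.
    """
    return {
        (m1, m2): jaccard(modules[m1], modules[m2])
        for m1, m2 in combinations(sorted(modules), 2)
    }


modules = {
    "M1": ["A", "B", "C"],
    "M2": ["B", "C", "D"],
    "M3": ["E"],
}
similarity = pairwise_jaccard(modules)
```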

Step 3.3 - Visualization

Throughout the analysis a number of figures are generated for data visualization, including the representation of inferred modules, significance measures calculated for each module, and overlaps of TFs, TGs and interactions across all subtypes. To visualize the network, the workflow exports interactions to a Cytoscape [47]-compatible file. Node attributes (such as subtype and module memberships, number of modules regulated by a TF, and GO annotations) and edge attributes (such as the subtype membership of an interaction and the regulatory score for an edge) are also provided for further exploration.
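The source does not state which Cytoscape-compatible format is written. As one possibility, the Python sketch below assumes Cytoscape's simple interaction format (SIF, one tab-separated "source, interaction type, target" line per edge) plus a tab-delimited edge attribute table. The interaction label, column names and target gene names are hypothetical.

```python
def to_sif(edges, interaction="regulates"):
    """Render edges as SIF text: source <TAB> interaction <TAB> target.

    edges is an iterable of (tf, target, condition, score) tuples; only the
    first two fields are used here.
    """
    return "".join(f"{tf}\t{interaction}\t{target}\n" for tf, target, *_ in edges)


def edge_attribute_table(edges):
    """Render edge attributes as a tab-delimited table with a header row."""
    lines = ["source\ttarget\tcondition\tscore"]
    lines += [f"{tf}\t{target}\t{cond}\t{score}" for tf, target, cond, score in edges]
    return "\n".join(lines) + "\n"


# Hypothetical edges; TG1 and TG2 are placeholder target names.
edges = [
    ("CBFB", "TG1", "cirrhosis", 0.8),
    ("CBFB", "TG2", "HCC", 0.6),
]
sif_text = to_sif(edges)
attr_text = edge_attribute_table(edges)
```

Writing both strings to files would give one file to import as the network and one to import as an edge table.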

Application of RMaNI to hepatocellular carcinoma

To demonstrate the utility of RMaNI, we applied this workflow to a hepatocellular carcinoma dataset (GSE14323) [74], containing normal tissues and three disease conditions: pre-malignant (cirrhosis), cirrhosis with hepatocellular carcinoma (cirrhosisHCC), and hepatocellular tumor (HCC). We investigated the ability of RMaNI to infer condition-specific transcriptional module networks and to find common and unique TFs and regulatory interactions, in order to examine the genetic architecture and ultimately to understand the differences and similarities between conditions. Below, we present the results for the individual steps to demonstrate the workflow.

Dataset

We used a Robust Multiarray Averaging (RMA) normalised and standardised hepatocellular carcinoma microarray gene expression dataset, based on 115 samples (Table 1).

Table 1 Description of the hepatocellular carcinoma microarray dataset.

| Dataset | Normal | Cirrhosis | CirrhosisHCC | HCC | Platform |
| --- | --- | --- | --- | --- | --- |
| GSE14323 (115 samples) | 19 | 41 | 17 | 38 | HG-U133A (12079 probes) |

The Normal, Cirrhosis, CirrhosisHCC and HCC columns give the number of samples in each condition. In the next step, we input this dataset to the LeMoNe algorithm to infer modules.

Inference of module networks in hepatocellular carcinoma conditions

We selected the top 4000 differentially expressed genes (based on BH-adjusted p-value) between normal and the three conditions to infer the modules as described in the workflow. For this step RMaNI uses the LeMoNe algorithm. Michoel et al. [65] evaluated the performance of LeMoNe against the state-of-the-art method Genomica, and Smet and Marchal [7] compared LeMoNe against other network inference methods. For each of the normal-to-condition paired datasets (Table 2), 10 clustering solutions were generated. For each run LeMoNe used the default setting of 50 burn-in steps and 100 Gibbs sampling steps, where the minimum number of genes in a cluster was set to 4. The default probability score cut-off of 0.2 was used to uniquely assign genes to clusters.

Table 3 summarizes the clustering results. For each condition, it shows the number of clusters generated, with their number of genes and the maximum and minimum module sizes. We also performed a GO enrichment analysis on each module using BiNGO to measure the functional coherence of genes in the modules. Table 3 also shows the total number of modules, in each subtype, with at least one significant GO category enriched (BH-adjusted p-value threshold 0.05). For instance, in cirrhosis, RMaNI generated a set of 74 modules comprising a total of 3794 genes. The largest module had 302 genes and the smallest 4. In the next step we identified the regulators of the modules.

Identification and ranking of regulators

To assign potential regulators (TFs) to the inferred modules, two data-driven approaches were employed in RMaNI: LRP and RIF. This step resulted in the potential regulatory TFs ordered according to their LRP and RIF scores. To find the most confident regulators for each cluster, RMaNI used the intersection of regulators identified by both methods and integrated both scores into one score (stdScore). Table 4 presents the TFs predicted to have a regulatory role in at least two conditions. For instance, it reveals that the TF CBFB regulates at least one module in each of the cirrhosis, cirrhosisHCC and HCC conditions and has 557 interactions across the three conditions. Previous studies were limited to a set of prior candidate TFs only, e.g. differentially expressed TFs or TFs involved in a particular pathway. However, because the detection of DE TFs from expression data is limited by their low and sparse expression levels, RMaNI uses all the TFs of a species (human in this study) without the need for prior TF identification. Its applicability to organisms without known TFs will, however, largely be determined by the completeness of TF databases and annotations, which are expected to improve over time with advances in ChIP-chip and ChIP-seq studies. Other continuous regulatory factors such as microRNAs, signal transducers and kinases, and discrete regulatory factors such as clinical parameters (e.g. stage, grade or treatment), can also be used.

Identification of modules with the highest DE and correlation

To identify the modules with high DE as well as high correlation for each condition (referred to as best modules), we ordered the modules by their standard score, generated from averageGS and modScore, and for each condition the workflow detected the knee-point (the maximum inflection point of a graph) of the ordered standard scores to select the best modules. Table 5 shows the total number of best modules selected for each condition and the number of TFs and target genes in the selected modules. For instance, in cirrhosis, 7 modules corresponding to 200 genes were selected. The 200 genes include 191 TGs and 9 TFs.
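The source does not specify how the knee-point is located. One widely used heuristic, assumed here purely for illustration, takes the point of the descending score curve that lies farthest from the straight line joining its first and last points. The function names and scores below are hypothetical.

```python
def knee_index(scores):
    """Index of the knee in the scores sorted in descending order.

    Uses the maximum distance to the chord between the first and last
    points; the constant normalising factor of the distance is omitted
    because it does not change which point is farthest.
    """
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    if n < 3:
        return n - 1
    x1, y1, x2, y2 = 0, ordered[0], n - 1, ordered[-1]
    best, best_dist = 0, -1.0
    for i, y in enumerate(ordered):
        dist = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best, best_dist = i, dist
    return best


def best_modules(score_by_module):
    """Names of the modules ranked at or above the knee."""
    ranked = sorted(score_by_module, key=score_by_module.get, reverse=True)
    return ranked[: knee_index(list(score_by_module.values())) + 1]


# Hypothetical standard scores with a clear drop after the third module.
scores = {"M1": 0.90, "M2": 0.88, "M3": 0.86, "M4": 0.30, "M5": 0.25, "M6": 0.20}
selected = best_modules(scores)
```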

Network analysis

We aggregated all the module networks inferred for each condition to construct an overall network. For this purpose, RMaNI generates the network around the regulators predicted with highest confidence according to stdScore. The generated hepatocellular carcinoma network includes 24 TFs and 557 TGs connected by 5897 edges. We found 144 nodes unique to cirrhosis, 342 nodes unique to cirrhosisHCC, and 71 nodes unique to HCC. In total, 1296, 4104 and 497 edges were unique to the cirrhosis, cirrhosisHCC and HCC conditions, respectively. Previous approaches do not identify unique or shared TFs between modules, or between subtypes or conditions. By contrast, in this analysis we examined convergent and divergent regulatory programs via TF overlap analysis. Figure 2 illustrates the TF overlap across the three conditions. We found one TF (CBFB) associated with all three conditions, two TFs (TCF4 and USF2) associated with two conditions (Table 4), and 21 TFs unique to one condition.
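The TF overlap analysis amounts to grouping TFs by the set of conditions in which they regulate at least one module. A minimal Python sketch follows, using only the three shared TFs reported in Table 4; the 21 condition-specific TFs are omitted because this section does not list them by condition. The function name is hypothetical.

```python
from collections import defaultdict


def tf_overlap(tfs_by_condition):
    """Group TFs by the exact set of conditions in which they appear.

    tfs_by_condition maps a condition name to the TFs predicted to regulate
    at least one module in that condition.
    """
    conditions_of = defaultdict(set)
    for condition, tfs in tfs_by_condition.items():
        for tf in tfs:
            conditions_of[tf].add(condition)
    groups = defaultdict(set)
    for tf, conditions in conditions_of.items():
        groups[frozenset(conditions)].add(tf)
    return dict(groups)


# Shared TFs from Table 4 only.
tfs_by_condition = {
    "cirrhosis": {"CBFB", "TCF4", "USF2"},
    "cirrhosisHCC": {"CBFB", "USF2"},
    "HCC": {"CBFB", "TCF4"},
}
overlap = tf_overlap(tfs_by_condition)
in_all_three = overlap[frozenset({"cirrhosis", "cirrhosisHCC", "HCC"})]
```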

Network visualization

We imported the inferred module network into Cytoscape for visualization and exploration. To demonstrate topological analysis of the inferred network, we extracted a sub-network of 70 nodes (TFs and TGs). Figure 3 shows the hepatocellular carcinoma sub-network, which includes 6 TFs and 64 TGs connected by 110 edges. Nodes and edges are rendered according to different types of evidence. For instance, node shape represents its type (TF or TG), and node colour gradient represents DE in cirrhosis against normal tissues. Edge colour represents condition membership. The figure shows distinct modules regulated by different TFs. It also illustrates the TF overlap analysis result (Table 4): for instance, CBFB, TCF4 and USF2 regulate TGs in at least two conditions, whereas TCF21, ATF1 and BRCA2 are predicted to have a regulatory role in only one of the conditions.

Table 2 Summary of the datasets used in the study; three sets of normal and subtype paired data were input to LeMoNe.

| Datasets | No. of DE Genes | No. of Samples |
| --- | --- | --- |
| Normal + cirrhosis | 4000 | 60 |
| Normal + cirrhosisHCC | 4000 | 36 |
| Normal + HCC | 4000 | 57 |

Table 3 Summary of gene clustering results.

| Conditions | No. of Modules | No. of Genes | Max Module Size | Min Module Size |
| --- | --- | --- | --- | --- |
| Normal + cirrhosis | 74 | 3794 | 302 | 4 |
| Normal + cirrhosisHCC | 59 | 3813 | 342 | 4 |
| Normal + HCC | 78 | 3772 | 219 | 4 |

Table 4 TFs that are predicted to have a regulatory role in at least two conditions.

| TFs | Conditions | TGs in cirrhosis | TGs in cirrhosisHCC | TGs in HCC | Total Edges |
| --- | --- | --- | --- | --- | --- |
| CBFB | cirrhosis, cirrhosisHCC, HCC | 144 | 342 | 71 | 557 |
| TCF4 | cirrhosis, HCC | 144 | 0 | 71 | 215 |
| USF2 | cirrhosis, cirrhosisHCC | 144 | 342 | 0 | 486 |

Remaining TFs are unique to individual conditions.

Table 5 Summary of the modules with highest DE and correlation (best modules).

| Conditions | Total Modules | No. of best Modules | No. of Genes | No. of TFs | No. of TGs |
| --- | --- | --- | --- | --- | --- |
| cirrhosis | 74 | 7 | 200 | 9 | 191 |
| cirrhosisHCC | 59 | 6 | 183 | 11 | 172 |
| HCC | 78 | 6 | 255 | 30 | 225 |
| Total | 211 | 18 | 638 | 50 | 588 |
| Unique | 211 | 50 | 548 | 47 | 548 |

Best modules were selected, for each condition, from all the modules inferred.

Figure 2 TF overlap analysis result. Three TFs are predicted to have a regulatory role in at least two conditions. Remaining TFs are unique to individual conditions.

Conclusions

We have presented the RMaNI workflow, designed from the end-user perspective of a biologist or clinician. It provides an easy-to-use interface to a comprehensive, integrated suite of tools for the inference and analysis of condition- or subtype-specific transcriptional module networks. We described the RMaNI workflow and applied it to hepatocellular carcinoma data. We demonstrated that identifying the transcriptional module network, and analysing and visualizing the inferred network, can give insight into the common as well as the unique regulatory architecture underlying different disease conditions. We anticipate integrating additional tools and workflows in future to meet the distinct needs of researchers confronting the complexity of cancer.

Additional material

Additional File 1: RMaNI User Manual

List of abbreviations used

TF: Transcription Factor; TG: Target Gene; LeMoNe: Learning Module Networks; RIF: Regulatory Impact Factors; clues: clustering based on local shrinking; PAM: Partitioning Around Medoids; AGNES: AGglomerative NESting; SOTA: Self-Organising Tree Algorithm; MCLUST: Model-based clustering; LRP: LeMoNe's regulatory program; averageGS: average Gene Significance; GO: Gene Ontology.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

PBM developed the RMaNI framework and wrote the code and the manuscript. SRM, MJD and MAR advised on the design and features of RMaNI, provided overall scientific and technical guidance, and assisted with the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Mr Gavin Graham, Dr Gerald Hartig and Mr A. Varlokov from Bioinformatics Resource Australia-EMBL for high-performance computing and web-development support; Dr Toni Reverter and Dr Sriganesh Srihari for helpful discussions; Dr Richard Newton for help with Rwui; and the R/Bioconductor research community, who have made their programs and source codes publicly available. Computational resources were provided by National Computational Infrastructure Specialised Facility in Bioinformatics. Access to Transfac was provided by QFAB Bioinformatics through Australian Research Council grant LE098933. PBM, SRM, MJD and MAR acknowledge support of Australian Research Council grants CE0348221, DP110103384 and LE0989334.

Declarations

Publication of this article was funded by The University of Queensland. This article has been published as part of BMC Bioinformatics Volume 14 Supplement 16, 2013: Twelfth International Conference on Bioinformatics (InCoB2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S16.

Authors’ details

1 Institute for Molecular Bioscience, The University of Queensland, 306 Carmody Road, St Lucia, Brisbane, Queensland 4072, Australia. 2 Australian

Figure 3 Hepatocellular carcinoma transcriptional sub-network. The hepatocellular carcinoma sub-network showing TGs (rectangles) and TFs (circles). Node colour gradient: red, up-regulation; green, down-regulation; yellow, no change. Edge colours: blue, cirrhosis; black, cirrhosisHCC; cyan, HCC.



http://www.biomedcentral.com/content/supplementary/1471-2105-14-S16-S14-S1.PDF


Research Council Centre of Excellence in Bioinformatics, The University of Queensland, 306 Carmody Road, St Lucia, Brisbane, Queensland 4072, Australia.

Published: 22 October 2013

References1. Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. Cell

2011, 144(5):646-674.2. Gentles AJ, Gallahan D: Systems biology: confronting the complexity of

cancer. Cancer Res 2011, 71(18):5961-5964.3. Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based

approach to human disease. Nat Rev Genet 2011, 12(1):56-68.4. Madhamshettiwar PB, Maetschke SR, Davis MJ, Reverter A, Ragan MA: Gene

regulatory network inference: evaluation and application to ovariancancer allows the prioritization of drug targets. Genome Med 2012, 4(5):41.

5. He F, Chen H, Probst-Kepper M, Geffers R, Eifes S, Del Sol A, Schughart K,Zeng AP, Balling R: PLAU inferred from a correlation network is criticalfor suppressor function of regulatory T cells. Mol Syst Biol 2012, 8:624.

6. Choi JK, Yu U, Yoo OJ, Kim S: Differential coexpression analysis usingmicroarray data and its application to human cancer. Bioinformatics 2005,21(24):4348-4355.

7. De Smet R, Marchal K: Advantages and limitations of current networkinference methods. Nat Rev Micro 2010.

8. Jérôme A, Annie R, Benoit M, Jean-Luc G: Transcriptional NetworkInference from Functional Similarity and Expression Data: A GlobalSupervised Approach. Statistical Applications in Genetics and MolecularBiology 2012, 11(1).

9. Cerulo L, Elkan C, Ceccarelli M: Learning gene regulatory networks fromonly positive and unlabeled data. BMC Bioinformatics 2010, 11:228.

10. de Jong H: Modeling and simulation of genetic regulatory systems: aliterature review. J Comput Biol 2002, 9(1):67-103.

11. Maetschke SR, Madhamshettiwar PB, Davis MJ, Ragan MA: Supervised,semi-supervised and unsupervised inference of gene regulatorynetworks. arXiv 2013, arXiv:1301.1083.

12. Stolovitzky G, Prill RJ, Califano A: Lessons from the DREAM2 Challenges.Ann N Y Acad Sci 2009, 1158:159-195.

13. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G:Revealing strengths and weaknesses of methods for gene networkinference. Proc Natl Acad Sci USA 2010, 107(14):6286-6291.

14. Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X,Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a Rigorous Assessment ofSystems Biology Models: The DREAM3 Challenges. PloS one 2010, 5:(2):e9202.

15. Oltvai ZN, Barabasi AL: Systems biology. Life’s complexity pyramid. Science2002, 298(5594):763-764.

16. Segal E, Friedman N, Koller D, Regev A: A module map showingconditional activity of expression modules in cancer. Nat Genet 2004,36(10):1090-1098.

17. Michoel T, De Smet R, Joshi A, Van de Peer Y, Marchal K: Comparativeanalysis of module-based versus direct methods for reverse-engineeringtranscriptional regulatory networks. BMC Syst Biol 2009, 3:49.

18. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealingmodular organization in the yeast transcriptional network. Nat Genet2002, 31(4):370-377.

19. Bonneau R: Learning biological networks: from modules to dynamics.Nat Chem Biol 2008, 4(11):658-664.

20. Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N: Modulenetworks: identifying regulatory modules and their condition-specificregulators from gene expression data. Nat Genet 2003, 34(2):166-176.

21. Wong DJ, Chang HY: Learning more from microarrays: insights frommodules and networks. The Journal of investigative dermatology 2005,125(2):175-182.

22. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB,Fraenkel E, Jaakkola TS, Young RA, et al: Computational discovery of genemodules and regulatory networks. Nat Biotechnol 2003, 21(11):1337-1342.

23. Jain AK, Murty MN, Flynn PJ: Data clustering: a review. ACM Comput Surv1999, 31(3):264-323.

24. Dalton L, Ballarin V, Brun M: Clustering algorithms: on learning, validation,performance, and applications to genomics. Current genomics 2009,10(6):430-445.

25. Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation andcomparison of gene clustering methods in microarray analysis.Bioinformatics 2006, 22(19):2405-2412.

26. Miller CA, Settle SH, Sulman EP, Aldape KD, Milosavljevic A: Discoveringfunctional modules by identifying recurrent and mutually exclusivemutational patterns in tumors. BMC Med Genomics 2011, 4:34.

27. Langfelder P, Horvath S: WGCNA: an R package for weighted correlationnetwork analysis. BMC Bioinformatics 2008, 9(1):559.

28. Zhang B, Horvath S: A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics andMolecular Biology 2005, 4(1).

29. Winden KD, Karsten SL, Bragin A, Kudo LC, Gehman L, Ruidera J,chwind DH, Engel J Jr: A systems level, functional genomics analysis ofchronic epilepsy. PloS one 2011, 6(6):e20763.

30. Rosen EY, Wexler EM, Versano R, Coppola G, Gao F, Winden KD,Oldham MC, Martens LH, Zhou P, Farese RV Jr, et al: Functional genomicanalyses identify pathways dysregulated by progranulin deficiency,implicating Wnt signaling. Neuron 2011, 71(6):1030-1042.

31. Saris C, Horvath S, van Vught P, van Es M, Blauw H, Fuller T, Langfelder P,DeYoung J, Wokke J, Veldink J, et al: Weighted gene co-expressionnetwork analysis of the peripheral blood from Amyotrophic LateralSclerosis patients. BMC Genomics 2009, 10(1):405.

32. Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF,Zhao W, Qi S, Chen Z, et al: Analysis of oncogenic signaling networks inglioblastoma identifies ASPM as a molecular target. Proc Natl Acad SciUSA 2006, 103(46):17402-17407.

33. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-basedclustering and data transformations for gene expression data.Bioinformatics 2001, 17(10):977-987.

34. Joshi A, Van de Peer Y, Michoel T: Analysis of a Gibbs sampler method formodel-based clustering of gene expression data. Bioinformatics 2008,24(2):176-183.

35. McNicholas PD, Murphy TB: Model-based clustering of microarrayexpression data via latent Gaussian mixture models. Bioinformatics 2010,26(21):2705-2712.

36. Lemmens K, Dhollander T, De Bie T, Monsieurs P, Engelen K, Smets B,Winderickx J, De Moor B, Marchal K: Inferring transcriptional modulesfrom ChIP-chip, motif and microarray data. Genome Biol 2006, 7(5):R37.

37. Reimand J, Tooming L, Peterson H, Adler P, Vilo J: GraphWeb: miningheterogeneous biological networks for gene modules with functionalsignificance. Nucleic Acids Res 2008, 36(Web Server):W452-459.

38. Qi J, Michoel T, Butler G: An integrative approach to infer regulationprograms in a transcription regulatory module network. J BiomedBiotechnol 2012, 2012:245968.

39. McCord RP, Berger MF, Philippakis AA, Bulyk ML: Inferring condition-specific transcription factor function from DNA binding and geneexpression data. Mol Syst Biol 2007, 3:100.

40. Baitaluk M, Kozhenkov S, Ponomarenko J: An integrative approach toinferring gene regulatory module networks. PLoS One 2012, 7(12):e52836.

41. Vega VB, Woo XY, Hamidi H, Yeo HC, Yeo ZX, Bourque G, Clarke ND:Inferring Direct Regulatory Targets of a Transcription Factor in theDREAM2 Challenge. Challenges of Systems Biology: Community Efforts toHarness Biological Complexity 2009, 1158:215-223.

42. Hurley D, Araki H, Tamada Y, Dunmore B, Sanders D, Humphreys S,Affara M, Imoto S, Yasuda K, Tomiyasu Y, et al: Gene network inferenceand visualization tools for biologists: application to new humantranscriptome datasets. Nucleic Acids Res 2011.

43. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K,Karas D, Kel AE, Kel-Margoulis OV, et al: TRANSFAC(R): transcriptionalregulation, from patterns to profiles. Nucl Acids Res 2003, 31(1):374-378.

44. Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd andMatInspector: new fast and versatile tools for detection of consensusmatches in nucleotide sequence data. Nucleic Acids Res 1995,23(23):4878-4884.

45. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W:BioMart and Bioconductor: a powerful link between biological databasesand microarray data analysis. Bioinformatics 2005, 21(16):3439-3440.

46. Gordon S: Limma: linear models for microarray data. In Bioinformatics andComputational Biology Solutions using R and Bioconductor. New York:Springer;Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W 2005:397-420.


of 11

http://www.ncbi.nlm.nih.gov/pubmed/21376230?dopt=Abstract

















































































47. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N,Schwikowski B, Ideker T: Cytoscape: a software environment forintegrated models of biomolecular interaction networks. Genome Res2003, 13(11):2498-2504.

48. Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysisof large gene lists using DAVID bioinformatics resources. Nat Protocols2008, 4(1):44-57.

49. Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools:paths toward the comprehensive functional analysis of large gene lists.Nucleic Acids Res 2009, 37(1):1-13.

50. Joshi A, De Smet R, Marchal K, Van de Peer Y, Michoel T: Module networksrevisited: computational assessment and prioritization of modelpredictions. Bioinformatics 2009, 25(4):490-496.

51. Reverter-Gomez A, Hudson NJ, Nagaraj SH, Perez-Enciso M, Dalrymple BP:Regulatory Impact Factors: Unraveling the transcriptional regulation ofcomplex traits from expression data. Bioinformatics 2010, btq051.

52. Newton R, Wernisch L: Rwui: A web application to create user friendlyweb interfaces for R scripts. R News 2007, 7(2):32-35.

53. R Development Core Team: R: A Language and Environment forStatistical Computing. Vienna, Austria: R Foundation for StatisticalComputing 2012.

54. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B,Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software developmentfor computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.

55. Fang C, Weiliang Q, Ruben HZ, Ross L, W X: clues: An R Package forNonparametric Clustering Based on Local Shrinking. Journal of StatisticalSoftware 2010, 33(4):1-16.

56. Fraley C, Raftery AE: MCLUST Version 3: An R Package for Normal MixtureModeling and Model-Based Clustering. Seattle, WA 98195-4322 USA:Department of Statistics, University of Washington 2006.

57. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K: cluster: ClusterAnalysis Basics and Extensions. R package version 1143 2012.

58. Rand WM: Objective Criteria for the Evaluation of Clustering Methods.Journal of the American Statistical Association 1971, 66(336):846-850.

59. Datta S: clValid: An R Package for Cluster Validation. Journal of StatisticalSoftware 2008, 25(4).

60. Hartigan JA, Wong MA: Algorithm AS 136: A k-means clusteringalgorithm. Applied Statistics 1979, 28(1):100-108.

61. Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction toCluster Analysis. 1990, Wiley-Interscience.

62. Dopazo J, Carazo JM: Phylogenetic Reconstruction Using anUnsupervised Growing Neural Network That Adopts the Topology of aPhylogenetic Tree. J Mol Evol 1997, 44(2):226-233.

63. Yin L, Huang CH, Ni J: Clustering of gene expression data: performanceand similarity analysis. BMC Bioinformatics 2006, , 7 Suppl 4: S19.

64. Bonnet E, Tatari M, Joshi A, Michoel T, Marchal K, Berx G, Van de Peer Y:Module network inference from a cancer gene expression data setidentifies microRNA regulated modules. PloS one 2010, 5(4):e10162.

65. Michoel T, Maere S, Bonnet E, Joshi A, Saeys Y, Van den Bulcke T, VanLeemput K, van Remortel P, Kuiper M, Marchal K, et al: Validating modulenetwork learning algorithms using simulated data. BMC Bioinformatics2007, 8(Suppl 2):S5.

66. Bonnet E, Michoel T, Van de Peer Y: Prediction of a gene regulatorynetwork linked to prostate cancer from gene expression, microRNA andclinical data. Bioinformatics 2010, 26(18):i638-i644.

67. Vermeirssen V, Joshi A, Michoel T, Bonnet E, Casneuf T, Van de Peer Y:Transcription regulatory networks in Caenorhabditis elegans inferredthrough reverse-engineering of gene expression profiles constitutebiological hypotheses for metazoan development. Mol Biosyst 2009.

68. Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM: A census ofhuman transcription factors: function, expression and evolution. Nat RevGenet 2009, 10(4):252-263.

69. Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, Akalin A,Schmeier S, Kanamori-Katayama M, Bertin N, et al: An atlas ofcombinatorial transcriptional regulation in mouse and man. Cell 2010,140(5):744-752.

70. Schaefer U, Schmeier S, Bajic VB: TcoF-DB: dragon database for humantranscription co-factors and transcription factor interacting proteins.Nucleic Acids Res 2011, 39(Database):D106-110.

71. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assessoverrepresentation of Gene Ontology categories in Biological Networks.Bioinformatics 2005, 21(16):3448-3449.

72. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practicaland Powerful Approach to Multiple Testing. Journal of the Royal StatisticalSociety Series B (Methodological) 1995, 57(1):289-300.

73. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S: GOSemSim: an R package formeasuring semantic similarity among GO terms and gene products.Bioinformatics 2010, 26(7):976-978.

74. Mas VR, Maluf DG, Archer KJ, Yanek K, Kong X, Kulik L, Freise CE, Olthoff KM,Ghobrial RM, McIver P, et al: Genes involved in viral carcinogenesis andtumor initiation in hepatitis C virus-induced hepatocellular carcinoma.Mol Med 2009, 15(3-4):85-94.

doi:10.1186/1471-2105-14-S16-S14

Cite this article as: Madhamshettiwar et al.: RMaNI: Regulatory Module Network Inference framework. BMC Bioinformatics 2013, 14(Suppl 16):S14.
