proceedings open access the molgenis toolkit: rapid ... · infrastructure accommodates a particular...

PROCEEDINGS Open Access

The MOLGENIS toolkit rapid prototyping ofbiosoftware at the push of a buttonMorris A Swertz12345678 Martijn Dijkstra17 Tomasz Adamusiak28 Joeri K van der Velde145Alexandros Kanterakis1 Erik T Roos1 Joris Lops1 Gudmundur A Thorisson210 Danny Arends1 George Byelas1Juha Muilu29 Anthony J Brookes210 Engbert O de Brock11 Ritsert C Jansen145 Helen Parkinson238

From The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010Boston MA USA 9-10 July 2010

Abstract

Background There is a huge demand on bioinformaticians to provide their biologists with user friendly andscalable software infrastructures to capture exchange and exploit the unprecedented amounts of new omicsdata We here present MOLGENIS a generic open source software toolkit to quickly produce the bespokeMOLecular GENetics Information Systems needed

Methods The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological datastructures and user interfaces At the push of a button MOLGENISrsquo generator suite automatically translates thesemodels into a feature-rich ready-to-use web application including database user interfaces exchange formats andscriptable interfaces Each generator is a template of SQL JAVA R or HTML code that would require much effort towrite by hand This lsquomodel-drivenrsquo method ensures reuse of best practices and improves quality because themodeling language and generators are shared between all MOLGENIS applications so that errors are found quicklyand improvements are shared easily by a re-generation A plug-in mechanism ensures that both the generatorsuite and generated product can be customized just as much as hand-written software

Results In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of manytypes of biomedical applications including next-generation sequencing GWAS QTL proteomics and biobankingWriting 500 lines of model XML typically replaces 15000 lines of hand-written programming code which allows forquick adaptation if the information system is not yet to the biologistrsquos satisfaction Each application generated withMOLGENIS comes with an optimized database back-end user interfaces for biologists to manage and exploit theirdata programming interfaces for bioinformaticians to script analysis tools in R Java SOAP RESTJSON and RDF atab-delimited file format to ease upload and exchange of data and detailed technical documentation Existingdatabases can be quickly enhanced with MOLGENIS generated interfaces using the lsquoExtractModelrsquo procedure

Conclusions The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexibleweb platforms for all possible genomic molecular and phenotypic experiments with a richness of interfaces notprovided by other tools All the software and manuals are available free as LGPLv3 open source at httpwwwmolgenisorg

Correspondence maswertzrugnl1Genomics Coordination Center Groningen Bioinformatics Center Universityof Groningen amp Dept of Genetics University Medical Center Groningen POBox 30001 9700 RB Groningen The NetherlandsFull list of author information is available at the end of the article

Swertz et al BMC Bioinformatics 2010 11(Suppl 12)S12httpwwwbiomedcentralcom1471-210511S12S12

copy 2010 Swertz et al licensee BioMed Central Ltd This is an open access article distributed under the terms of the Creative CommonsAttribution License (httpcreativecommonsorglicensesby20) which permits unrestricted use distribution and reproduction inany medium provided the original work is properly cited

BackgroundHigh-throughput technologies have boosted biologicaland medical research and the need for software infra-structures to manage and process the large datasets pro-duced is widely accepted [1-3] Bioinformaticians areunder continuous pressure to both tackle the complexityand diversity of new biological systems and analyticalmethods and to translate these quickly into flexibleinformatics infrastructures while keeping up with theunpredictable evolution of molecular biotechnologiesand the increasing scale of experiments While standar-dization of tools and data formats in open source pro-jects like the Generic Model Organism DatabaseGMOD [4] and the Open Bioinformatics FoundationOBF [5] have been indispensable in reducing the devel-opment efforts needed via reusable and easy to integratecomponents new research must also be quickly accom-modated for which efficient software variation mechan-isms are neededFigure 1 outlines the lsquomodel-drivenrsquo development

method that several bioinformatics projects adopted inrecent years to enable fast and flexible infrastructure

development [1] for example Taverna and Galaxy foranalysis workflows [67] CCPN for processing tools [8]and the early MOLGENIS for biological data manage-ment [9] See our review [1] for a more complete over-view This method consists of three componentsextensible lsquostandardrsquo software that provides a vast arrayof reusable components a high-level modeling language(domain-specific language DSL) to specify biology-spe-cific customizations to this software and a softwarecode generator to automatically translate (or execute)such custom models into all lower level program logicof the complete working software saving all the effortneeded to write the software by handIn this paper we present the evolution of MOLGENIS

into a generic model-driven toolkit for the rapid genera-tion of bespoke data-intensive biosoftware applications[10] We demonstrate step-by-step how bioinformaticianscan use a domain-specific language to efficiently modelthe biological details of their particular biological systemand use MOLGENIS software generation tools to auto-matically generate a web application tailored to theexperiments of their biologists building on reusable

Figure 1 Model-driven development Many minor and major changes have to be written in software code before a lsquostandardrsquo softwareinfrastructure accommodates a particular research Using lsquomodel-drivenrsquo development methods a bioinformatician only needs to model what isneeded for his experiment using a therefore optimized domain specific language (DSL) Generators quickly produce all the software logic tocompose a full software infrastructure that accommodates these needs When experimental needs change a bioinformatician can (re)run thesame generator with an adapted model file to quickly produce another variant of software infrastructure This vastly reduces lsquotime-to-researchrsquoand enables bioinformaticians to quickly develop a suite of software infrastructures with each variant accommodating a specific research taskwhile still on track to reuse integrate and share the best standard features with other labs and bioinformaticians


Page 2 of 9

components Next we evaluate the results of these meth-ods in the development of a range of MOLGENIS appli-cations [911-15] that is software applications generatedusing the MOLGENIS toolkit We found up to 30 timesefficiency improvement compared to hand-writing soft-ware while providing a richness of features practicallyunfeasible to produce by hand but not yet provided byrelated projects We conclude by inviting the bioinfor-matics community to add more MOLGENIS modelscomponents and generators to quickly generate all thesoftware infrastructures biologists want to have

MethodsThe MOLGENIS toolkit is based on the method ofmodel-driven development which emerged in the 1990sfrom the computer industry The key to success is theclear scope of the toolbox (ie what family of softwareapplications should be produced with it) and separatingwhich features should be fixed (eg reusable compo-nents common to all MOLGENIS applications) andwhich features should be variable (ie modeled and gen-erated per MOLGENIS application instance) a processknown as domain analysis [16] Below we discuss MOL-GENISrsquo initial domain analysis its modeling languagegenerators and reusable components

Domain analysisTable 1 summarizes the initial set of features werequired from MOLGENIS information systems whenwe started it explains why these features are indeedrequired and describes what parts of the features arecommon and variable over experiment databases Toobtain this picture we analyzed 20 existing microarraydatabases next to many requirements interviews seeTable 1 in [9]

The second step was to implement the common andvariable parts which we started with a prototype Herewe applied the donrsquot repeat yourself principle (DRY)[17] every piece of design knowledge must have a sin-gle unambiguous authoritative representation Wetherefore searched through the prototype software codeIf we found identical pieces then we put them into thelibrary of reusable components If we found very similarpieces of software code we put the common parts intoa generator and the variation points into the modelinglanguage In each subsequent step we evolved the MOL-GENIS generator only incorporating new functionalitywhen we repeatedly needed itDuring the next six years of using the MOLGENIS

generator we added numerous functions and optimiza-tions such as filters for the data viewing data as alsquomatrixrsquo downloading data as CSV files enabling pro-gramming interaction via R and web services and so onThe generators ensure that lsquooldrsquo MOLGENIS applicationvariants can benefit from these improvements when aMOLGENIS instance is re-generated these improve-ments are automatically integrated into the new version

Modeling languageFigure 2 shows how a custom MOLGENIS application canbe defined in a single file The file is written in MOL-GENISrsquo modeling language This enables compact specifi-cation of what experiment database is needed ie todeclare how an experiment is organized in terms of datatypes and their relationships and how these data are to beshown on the screen Figure 2 shows the following fea-tures Three data entities ❶Experiment Sample and Hybri-dization the Experiment entity has six fields including IDMedium and Stress (because it needs to administratemicrobe experiments) To minimize the modeling work

Table 1 Common and variable features of MOLGENIS information systems

MOLGENIS Features (F) Common parts (C) and Variable parts (V)

F1 DataStore and find lab activities datasets andbiomaterials

C1 Logic to add update remove find and count data entities in a database read and write datafilesV1 Data structures that suit the research eg samples in a clinical lab have a ldquotissuerdquo while microbesamples do not

F2 ControlManipulate lab entities such that they suit theresearch process

C2 Logic to select navigate (first previous next last) find (filter) and edit data entities (using thelogic of C1)V2 Control structures that suit the research eg experiments are shown with a menu with sub-forms for Samples and Hybridizations

F3 ViewView entities and control interactively (via theInternet)

C3 Presentation of logic that shows F1 and F2 with usable layout and formattingV3 Presentation of structure of the specific entity (V1) and control structure (V2) of a system variantvia the Internet (option to have this in company style)

F4 SecurityEnsure that the right people get access to theright results

C4 To manage users roles and privileges and have authentication and authorization in placeV4 To set Roles and Privileges to entities and controls eg only spotters (role) are allowed to addarrays (privilege)

F5 ExtensibilityAllow addition of components for dataprocessing and visualizations

C5 To have a plug-in mechanism to integrate external programs so that these programs canbenefit from entity and control logicV5 To extend a system variant with logic beyond the family eg analysis scripts quantification filevalidation complex data aggregation and export to files


Page 3 of 9

we choose sensible defaults in the domain-specific lan-guage a principle known as convention over configurationeach field has to be set to a value by the researcher unlessspecified to be nillable❷ field can be edited (updated)unless specified to be read only❸ each field is default oftype lsquostringrsquo (a variable character string of length 255)unless otherwise specified to eg rsquodecimalrsquo❹ and fields canbe defined as having a relation to fields in other entitiesvia a cross-reference (xref)❺ The user interface consists ofa plugin❻ that renders the MOLGENIS header and toolmenu one user interface form❼ to control Experimentswith a sub menu❽ consisting of two child forms for Sam-ples and Hybridizations Child forms are automaticallylinked to the parent form based on cross references egthe field lsquoExperimentrsquo of lsquoSamplersquo references to the lsquoIDrsquo ofan lsquoExperimentrsquo❾ By default forms show each entity asone-record-per-screen unless specified as a list❿ Themodeling language includes advanced object-orientationfeatures like inheritance as well as extensive help to docu-ment your model (not shown)One can think of MOLGENISrsquo modeling language as a

lsquodomain-specific languagersquo (DSL) that is optimized toefficiently express a particular problem task or area

[1819] in this case to compose biosoftware infrastruc-tures The level of abstraction is raised so no lengthytechnical or redundant details on how each featureshould be implemented in general programming lan-guages have to be given [2021] Examples of otherdomain-specific languages include RSplus for statisticsMatLab for mathematics SQL for databases HTML forlayouting and now MOLGENISrsquo modeling language forbiological software infrastructuresIn most cases knowledge of the DSL is all that is

needed to produce a custom MOLGENIS applicationvariant The domain-specific language was implementedusing XML so that model files can be edited using off-the-shelf XML editors However you may want toinclude hand-programmed components into a particularMOLGENIS instance For example for the eXtensibleGenotype And Phenotype (XGAP) database applicationof MOLGENIS [11] we developed a lsquoMatrixViewerrsquo thatbuilds on the generated components which saved us thework of writing the plug-in from scratch This requires amodel sentence that points to the lsquoplug-inrsquo (allowing itto be seamlessly integrated) as well as hand-program-ming of the plug-in itself

Figure 2 Example model The detailed software needed for an experiment can be described in domain-specific language (DSL left) TheMOLGENIS generator reads the model and automatically produces the custom software infrastructure specified (right) The screenshot includesexample data See main text for a description of the numbers


Page 4 of 9

Reusable componentsEach MOLGENIS application follows the widelyaccepted three-layered architecture design of web appli-cations Figure 3 summarizes some of MOLGENISrsquo reu-sable components and their variation mechanismsMOLGENISrsquo reusable components provide buildingblocks with a modular structure which allows them tobe assembled in diverse combinations similar to prefab-ricated houses that are built from modular walls insteadof bricks Some building blocks are semi-finished andneed to be lsquocompletedrsquo before use (which is automatedin MOLGENIS via the generators and inheritance) Webased the design of MOLGENIS on industry-provendesign patterns from the lsquopatterns for enterprise applica-tion architecturersquo (PEAA) a catalog of proven solutionsfor software design problems that we used as a guideline[22] The logic of the reusable components is implemen-ted using Java (httpjavasuncom) the HTML layout

for the user interface is encoded in Freemarker tem-plates (httpfreemarkersourceforgenet) and the data-base back-end using MySQL PostgreSQL or HSQLDB

GeneratorsThe generators are compact specifications of how eachdatabase feature should be implemented The MOL-GENIS toolkit now has over 20 generators but normalusers will never need to take a look inside However forreaders wanting to create their own generators Figure 4provides an example of the simple text-based genera-tors we use Each generator consists of two files a Free-marker template that describes the code to be generated(similar to that shown in Figure 4a) and a Java lsquoGenera-torrsquo class that controls the generating process A newgenerator can be developed as follows first write someexamples of the desired programs by hand where possi-ble using similar patterns (see Figure 4b) and markwhich parts are variable between them Then copy oneof these examples into a generator template (text file)and replace all variable parts with lsquoholesrsquo that are to befilled by the code generator based on parameters fromDSL (see Figure 4a) At each generation the template isthen automatically copied and the lsquoholesrsquo filled based onparameters described in the domain-specific languagesaving much laborious manual work

ResultsTo start generating your own MOLGENIS applicationyou can download a ready-to-use lsquoworkspacersquo fromhttpwwwmolgenisorg which can be edited using thecommonly used Eclipse integrated development environ-ment (IDE) tool (httpwwweclipseorg) Extensivemanuals are available to help install the Java MySQLTomcat and Eclipse software needed and to learn howto walk through the Eclipse workspace to edit modelsand generate and run MOLGENIS instances most newusers can complete this part in about three hoursBelow we summarize the output you can expect as wellas recent experiences from using this toolbox Detailedexamples on how these features can be used to supportactual microarray or genetical genomics experimentscan be found in [111415]

Expected outputAfter completing a MOLGENIS model and running thegenerator as described above you have a ready-to-usesoftware application Figure 5 summarizes the featuresyou get when running the generated result as a webapplication a fully functional system where researcherscan upload manage browse and query their biologicaldata that conform to the model optionally enhancedwith analysis tools to explore and annotate (dependingon the plug-ins)

parameter

ExperimentMapper

HybMapper

FormControlName

Experiment

FormControlName

Hyb

MenuControlName

ExpSub

MenuView- toHtml(Menu)html getName(String) getSubForms()

FormView- toHtml(Form) html- getPage Listltentitygt getName(String) getSubForms() toHtml(ltentitygt) html

Menu Controller- select(Event)

FormController- prev(Event)- next(Event)- filter(Event)

DataMapper- find(filter) Listltentitygt- count(filter) int- add(ltentitygt e)- update(ltentitygt e)- remove(ltentitygt e) findSql() sqlstring addSql() sqlstring updateSql() sqlstring removeSql() sqlstring

inheritance

inheritance

parameter

relational

database

ExperimentTable

HybTable

HybFormView

ExperimentFormView

ExpSubMenuView

Hyb Form

ExpSub Menu

(B) MOLGENIS instance(A) Reusable components

model + generator Experiment Form

Figure 3 Reusable components (A) shows finished and semi-finished components that provide reusable features for displayingscreens (FormView and MenuView) handling user requests (Form-and MenuController) and reading and writing to the database(DataMapper) (B) shows components of a completed softwarevariant as described in Figure 2 Only the lsquodifferencesrsquo needed to beadded using systematic variation mechanisms (dotted lines) such asinheritance or parameterization


Page 5 of 9

Figure 4 Example generator MOLGENIS generators are implemented as templates This example shows the generator for a databasecomponent (A) This template is applied to each ltentitygt in the model to generate many complete DataMappers that would otherwise need tobe written by hand (B) shows an example of the generated source files in this case for ltentity name=Experimentgt as described in Figure 1The command $Name(entity) translates to the name of the entity (ldquoExperimentrdquo) and command $csv($entityFields x) means that commandlsquoxrsquo is applied to each field of the entity and returned as a comma separated string (csv)

Figure 5 Expected output Overview of a typical MOLGENIS application in this case customized in EBI style See main text for a description ofthe numbers


Page 6 of 9

An important feature is human readable and printabledocumentation of your model including a graphicaloverview showing relationships in UML❶ which is ofgreat use when still designing and discussing the modelin a team The next step is typically using the web userinterface to populate and test your application with realdata❷ To enable batch loading from a spreadsheetapplication such as Excel the system comes with a tab-delimited importexport tool tailored for your datawhich you can use from the user interface as well as viaa command-line tool ie the headers of your Excel filehave to match the fields you have defined in the model❸ In our experience most computational biologistsgreatly appreciate the use of the R interface to load ana-lyze and re-store data from within the R statistical envir-onment❹ with web services to connect to workflowtools❺ Finally advanced programmers may want to cus-tomize the layout or integrate their own scripts into theuser interface that is create plug-ins that are seamlesslyintegrated with the generated software❻ Typical exam-ples here are the integration of R scripts that producegraphical overviews of the data enabling them to be runby non-technical research colleagues Alternatively youcan use SOAP REST and RDF interfaces for integrationwith workflow tools like Taverna or for use with com-monly used JavaScript frameworks like jQuery to createlsquoWeb 20rsquo interactive websites When satisfied with yourMOLGENIS system it can be shared as a simple JARexecutable using an embedded web server or as a WARfile that can be run on public web servers

ApplicationsSince the earliest MOLGENIS application [9] we havesuccessfully evaluated use of the MOLGENIS toolkit tobuild a wide range of biomedical applications [11-15]ranging from sequencing to proteomics includingbull XGAP an eXtensible Genotype And Phenotype plat-

form [11] for systems genetics (GWAS GWL) to storeall kinds of omics data ranging from genotype to tran-script and protein data XGAP comes with plug-ins toview large data matrices and run processing tools on acluster See httpwwwxgaporgbull Pheno-OM to integrate any phenotype data from

locus-specific annotations to rich biobank cohort reportswith the help of the OntoCAT ontology toolkit to createsemantic mappings between related data items [23] Seehttpwwwebiacukmicroarray-srvphenobull FINDIS a mutation database for monogenic diseases

belonging to the Finnish disease heritage See httpwwwfindisorgmolgenis_findisbull HGVBaseG2P the data management and curation

interface complement for HGVbaseG2P a central data-base of genotype to phenotype association studies [12]See httpwwwhgvbaseg2porg

bull MAGETAB-OM a microarray experiment data plat-form based on the MAGE-TAB data format standard tocreate a local microarray repository that is compatiblewith the public ArrayExpress and GEO repositories Seehttpmagetab-omsourceforgenetbull NordicDB the database of high-density genome-

wide SNP information from 5000 controls originatingfrom Finnish Swedish and Danish studies [13] Seehttpwwwnordicdborgbull DesignGG a web tool to optimally design such

genetical genomics experiments [14] See httpgbicbiolrugnldesignGGMore MOLGENIS applications can be found at http

wwwmolgenisorg Each of these MOLGENIS projectsreported major benefits from the short cycle frommodel to running system to enable quick evaluation(500 lines of model XML replaces 15000 lines of pro-gramming code) and use of the batch loading of data toevaluate how the newly built system works with realdata More often than not MOLGENIS helped in find-ing inconsistencies in existing data that would otherwisehave gone unnoticed leading to experimental errors Inour experience a typical MOLGENIS generator rungives you about 90 of the application that is desiredlsquofor freersquo with the remaining 10 typically filled in usingplug-ins that are written by hand The MOLGENIStoolkit has also been used to extend or replace existingsoftware applications the ExtractModel tool allows youto generate a MOLGENIS application from an existingdatabase which can then be run side-by-side with codedeveloped previously providing the best of both gener-ated and hand-written worlds

Richness of featuresMOLGENIS provides a richness of features not yet pro-vided by other projects BioMart [1024] and InterMine[25] generate powerful query interfaces for existing databut are not suited for bespoke data managementOmixed [26] generates programmatic interfaces ontodatabases including a security layer but lacks userinterfaces PEDROPierre [27] generates data entry andretrieval user interfaces but lacks programmatic inter-faces and general generators such as AndroMDA [28]and Ruby-on-Rails [29] require much more program-ming and configuration efforts compared to tools speci-fic to the biological domain Turnkey [30] seems tocome close to MOLGENIS having GUI and SOAPinterfaces but lacks auto-generation of R interfaces andfile exchange format

ConclusionsIn a recent perspective paper [1] we evaluated the gen-eral benefits and pitfalls of model-driven developmentsuch as the ability to develop infrastructure in short


Page 7 of 9

cycles to get the application right ensuring developersand biologists are thinking along the same lines andincreasing quality and functionality for all We furtherevaluated applying this method to both microarray andgenetical genomics experiments [9] [11]Here we have presented MOLGENIS in detail and

reported the results of using this method against awider range of applications We conclude that usingmodel-driven methods enables bioinformaticians tobuild biological software infrastructures faster thanbefore with the additional benefit of much easier shar-ing of models data and components Much less time isspent on customizing and gluing together individualcomponents The result is of higher quality becausefewer incidental errors creep into the applications as aconsequence of the automated procedures best prac-tices are applied instead of reinvented And you do notneed heavy-weight technology to implement a model-driven generator simple text-based templates suffice tocreate biological software generatorsAs a next step we want to expand the MOLGENIS

toolkit to also generate data processing tools includinguser friendly interaction building on other lsquomodel-dri-ven bioinformaticsrsquo projects in this area such as Taverna[6] to modelexecute analysis workflows and Galaxy [7]to generate user interfaces for processing tools Wehope that many bioinformaticians will enforce our opensource efforts and share their best models plug-ins andgenerators at httpwwwmolgenisorg so that in timeevery biologist may find a MOLGENIS variant that suitshisher needs

Availability and requirementsProject name MOLGENISProject homepage httpwwwmolgenisorgOperating systems Windows Linux AppleProgramming language Java JRE 15 or higherOther requirements MySQL or Postgresql Tomcat or

other J2EE containerLicense GNU Lesses General Public License version 3

(GNU LGPLv3)Any restrictions to use by non-academics No

AbbreviationsAPI application programming interface CPNN collaborative computingproject for NMR CSV comma separated values DesignGG experimentaldesign of genetical genomics software DRY principle of donrsquot repeatyourself DSL domain specific language EBI European BioinformaticsInstitute FINDIS finish disease database GEN2PHEN EU project to unifyhuman and model organism genetic variation databases GMOD genericmodel organism database project GUI graphical user interface GWASgenome wide association study GWL genome wide linkage analysisHGVBaseG2P human genome variation database of genotype-to-phenotypeinformation HTML hypertext markup language IDE integrated developmentenvironment JAR Java Software Archive LGPL lesser general public licenseMAGE-TAB microarray gene expression tab delimited file format MOLGENIS

molecular genetics information systems toolkit NordicDB Nordic ControlCohort Database with harmonized SNP information from Denmark EstoniaFinland and Sweden OBF Open Bioinformatics Foundation OntoCATontology common API toolkit PEAA patterns for enterprise applicationarchitecture QTL quantitative trait locus RDF resource description formatREST representative state transfer web services SNP single nucleotidepolymorphism SOAP simple object access protocol SQL structured querylanguage UML uniform data modeling language WAR web applicationarchive file XML extensible markup language XGAP extensible genotypeand phenotype software platform

AcknowledgementsThe authors thank Jackie Senior for editing this manuscript and PANACEA(funded by the European Commission FP7 contract 222936) NWO (RubiconGrant 82509008) the Netherlands Bioinformatics Center (NBIC) BBMRI-NL(funded by the Netherlands Organization for Scientific Research NWO)CASIMIR (funded by the European Commission under contract numberLSHG-CT-2006-037811) GEN2PHEN (funded by the European CommissionFP7-HEALTH contract 200754) and the EMBL for financial supportThis article has been published as part of BMC Bioinformatics Volume 11Supplement 12 2010 Proceedings of the 11th Annual Bioinformatics OpenSource Conference (BOSC) 2010 The full contents of the supplement areavailable online at httpwwwbiomedcentralcom1471-210511issue=S12

Author details1Genomics Coordination Center Groningen Bioinformatics Center Universityof Groningen amp Dept of Genetics University Medical Center Groningen POBox 30001 9700 RB Groningen The Netherlands 2EU-GEN2PHEN consortiumhttpwwwgen2phenorg 3EU-CASIMIR consortium httpwwwcasimiracuk 4EU-PANACEA consortium httpwwwpanaceaprojecteu 5EU-EUROTRANS cosortium httpwwweuratranseu 6BBMRI-NL Postzone S4-PPO Box 9600 2300 RC Leiden The Netherlands httpwwwbbmrinl7Netherlands Bioinformatics Centre Geert Grooteplein 28 6525 GANijmegen The Netherlands httpwwwnbicnl 8European BioinformaticsInstitute Wellcome Trust Genome Campus Cambridge CB10 1SD UK9Institute for Molecular Medicine Finland University of HelsinkiHaartmaninkatu 8 FIN-00290 Helsinki Finland 10Department of GeneticsUniversity of Leicester University Road Leicester LE1 7RH UK 11ClusterInformation Systems Faculty of Economics and Business University ofGroningen PO Box 800 9700 AV Groningen The Netherlands

Authorsrsquo contributionsMAS and EOB conceived the method and designed and implemented thefirst MOLGENIS generator suite MAS MD KJV TER AK JL DA GB GAT JMand TA participated in the development of the MOLGENIS toolkit andorhave been developing applications using MOLGENIS as a platform RCJ HPGAT and AJB have been feeding requirements to steer future developmentMAS drafted the manuscript All authors read and approved the finalmanuscript

Competing interestsThe authors declare that they have no competing interests

Published 21 December 2010

References1 Swertz MA Jansen RC Beyond standardization dynamic software

infrastructures for systems biology Nat Rev Genet 2007 8235-2432 Stein LD Towards a cyberinfrastructure for the biological sciences

progress visions and challenges Nat Rev Genet 2008 9678-6883 Thorisson GA Muilu J Brookes AJ Genotype-phenotype databases

challenges and solutions for the post-genomic era Nat Rev Genet 2009109-18

4 Generic Model Organism Database (GMOD) [httpgmodorg]5 Open Bioinformatics Foundation (O|B|F) [httpwwwopen-bioorg]6 Oinn T Addis M Ferris J Marvin D Senger M Greenwood M Carver T

Glover K Pocock MR Wipat A Li P Taverna a tool for the compositionand enactment of bioinformatics workflows Bioinformatics 2004203045-3054


Page 8 of 9

7 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 2010 11R86

8 Fogh RH Boucher W Vranken WF Pajon A Stevens TJ Bhat TNWestbrook J Ionides JMC Laue ED A framework for scientific datamodeling and automated software development Bioinformatics 2005211678-1684

9 Swertz MA de Brock EO van Hijum SAFT de Jong A Buist G Baerends RJSKok J Kuipers OP Jansen RC Molecular Genetics Information System(MOLGENIS) alternatives in developing local experimental genomicsdatabases Bioinformatics 2004 202075-2083

10 MOLGENIS a open source software toolkit to rapidly generate bespokebiosoftware web applications [httpwwwmolgenisorg]

11 Swertz MA Velde KJ Tesson BM Scheltema RA Arends D Vera G Alberts RDijkstra M Schofield P Schughart K Hancock JM Smedley DWolstencroft K Goble C de Brock EO Jones AR Parkinson HE Jansen RCXGAP a uniform and extensible data model and software platform forgenotype and phenotype experiments Genome Biol 2010 11R27

12 Thorisson GA Lancaster O Free RC Hastings RK Sarmah P Dash DBrahmachari SK Brookes AJ HGVbaseG2P a central genetic associationdatabase Nucleic Acids Res 2009 37D797-802

13 Leu M Humphreys K Surakka I Rehnberg E Muilu J Rosenstrom PAlmgren P Jaaskelainen J Lifton RP Kyvik KO Kaprio J Pedersen NLPalotie A Hall P Gronberg H Groop L Peltonen L Palmgren J Ripatti SNordicDB a Nordic pool and portal for genome-wide control data Eur JHum Genet 2010

14 Li Y Swertz MA Vera G Fu J Breitling R Jansen RC designGG an R-package and web tool for the optimal design of genetical genomicsexperiments BMC Bioinformatics 2009 10188

15 Smedley D Swertz MA Wolstencroft K Proctor G Zouberakis M Bard JHancock JM Schofield P Solutions for data integration in functionalgenomics a critical assessment and case study Brief Bioinform 20089532-544

16 Neighbors J Software construction using components Irvine PhD ThesisUniversity of California 1981

17 Hunt A Thomas D The pragmatic programmer from Journeyman toMaster Addison-Wesley 1999

18 Bentley J Little languages Communications of the ACM 1986 29711-72119 van Deursen A Klint P Little languages little maintenance Journal of

software maintenance 1998 101720 Greenfield J Short K Cook S Kent S Software Factories Assembling

Applications with Patterns Models Frameworks and Tools John Wiley ampSons 2004

21 Brooks F The Mythical Man Month Essays on Software Engineering20th anniversary edition Addison-Wesley 1995

22 Fowler M Patterns of Enterprise Application Architecture Addison-Wesley 2002

23 OntoCAT ontology common API (application programming interface)toolkit [httpontocatsourceforgenet]

24 Smedley D Haider S Ballester B Holland R London D Thorisson GKasprzyk A BioMartndashbiological queries made easy BMC Genomics 20091022

25 Lyne R Smith R Rutherford K Wakeling M Varley A Guillier F Janssens HJi W McLaren P North P Rana D Riley T Sullivan J Watkins XWoodbridge M Lilley K Russell S Ashburner M Mizuguchi K Micklem GFlyMine an integrated database for Drosophila and Anophelesgenomics Genome Biol 2007 8R129

26 Omixed customisable storage system for scientific data [httpwwwomixedorg]

27 Jameson D Garwood K Garwood C Booth T Alper P Oliver SG Paton NWData capture in bioinformatics requirements and experiences withPedro BMC Bioinformatics 2008 9183

28 AndroMDA extensible generator framework that adheres to the ModelDriven Architecture (MDA) paradigm [httpwwwandromdaorg]

29 Ruby on Rails open source web framework in the Ruby language[httpwwwrubyonrailsorg]

30 OrsquoConnor BD Day A Cain S Arnaiz O Sperling L Stein LD GMODWeb aweb framework for the Generic Model Organism Database Genome Biol2008 9R102

doi1011861471-2105-11-S12-S12Cite this article as Swertz et al The MOLGENIS toolkit rapidprototyping of biosoftware at the push of a button BMC Bioinformatics2010 11(Suppl 12)S12

Submit your next manuscript to BioMed Centraland take full advantage of

bull Convenient online submission

bull Thorough peer review

bull No space constraints or color figure charges

bull Immediate publication on acceptance

bull Inclusion in PubMed CAS Scopus and Google Scholar

bull Research which is freely available for redistribution

Submit your manuscript at wwwbiomedcentralcomsubmit


Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications

Richness of features

Conclusions

Availability and requirements

Acknowledgements

Author details

Authors contributions

Competing interests

References

BackgroundHigh-throughput technologies have boosted biologicaland medical research and the need for software infra-structures to manage and process the large datasets pro-duced is widely accepted [1-3] Bioinformaticians areunder continuous pressure to both tackle the complexityand diversity of new biological systems and analyticalmethods and to translate these quickly into flexibleinformatics infrastructures while keeping up with theunpredictable evolution of molecular biotechnologiesand the increasing scale of experiments While standar-dization of tools and data formats in open source pro-jects like the Generic Model Organism DatabaseGMOD [4] and the Open Bioinformatics FoundationOBF [5] have been indispensable in reducing the devel-opment efforts needed via reusable and easy to integratecomponents new research must also be quickly accom-modated for which efficient software variation mechan-isms are neededFigure 1 outlines the lsquomodel-drivenrsquo development

method that several bioinformatics projects adopted inrecent years to enable fast and flexible infrastructure

development [1] for example Taverna and Galaxy foranalysis workflows [67] CCPN for processing tools [8]and the early MOLGENIS for biological data manage-ment [9] See our review [1] for a more complete over-view This method consists of three componentsextensible lsquostandardrsquo software that provides a vast arrayof reusable components a high-level modeling language(domain-specific language DSL) to specify biology-spe-cific customizations to this software and a softwarecode generator to automatically translate (or execute)such custom models into all lower level program logicof the complete working software saving all the effortneeded to write the software by handIn this paper we present the evolution of MOLGENIS

into a generic model-driven toolkit for the rapid genera-tion of bespoke data-intensive biosoftware applications[10] We demonstrate step-by-step how bioinformaticianscan use a domain-specific language to efficiently modelthe biological details of their particular biological systemand use MOLGENIS software generation tools to auto-matically generate a web application tailored to theexperiments of their biologists building on reusable

Figure 1 Model-driven development Many minor and major changes have to be written in software code before a lsquostandardrsquo softwareinfrastructure accommodates a particular research Using lsquomodel-drivenrsquo development methods a bioinformatician only needs to model what isneeded for his experiment using a therefore optimized domain specific language (DSL) Generators quickly produce all the software logic tocompose a full software infrastructure that accommodates these needs When experimental needs change a bioinformatician can (re)run thesame generator with an adapted model file to quickly produce another variant of software infrastructure This vastly reduces lsquotime-to-researchrsquoand enables bioinformaticians to quickly develop a suite of software infrastructures with each variant accommodating a specific research taskwhile still on track to reuse integrate and share the best standard features with other labs and bioinformaticians


of 9




















Page 3 of 9







Page 4 of 9






parameter

ExperimentMapper

HybMapper

FormControlName

Experiment

FormControlName

Hyb

MenuControlName

ExpSub






inheritance

inheritance

parameter

relational

database

ExperimentTable

HybTable

HybFormView

ExperimentFormView

ExpSubMenuView

Hyb Form

ExpSub Menu





Page 5 of 9




Page 6 of 9














Page 7 of 9





















Page 8 of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References




















of 9







Page 4 of 9






parameter

ExperimentMapper

HybMapper

FormControlName

Experiment

FormControlName

Hyb

MenuControlName

ExpSub






inheritance

inheritance

parameter

relational

database

ExperimentTable

HybTable

HybFormView

ExperimentFormView

ExpSubMenuView

Hyb Form

ExpSub Menu





Page 5 of 9




Page 6 of 9














Page 7 of 9





















Page 8 of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References







of 9






parameter

ExperimentMapper

HybMapper

FormControlName

Experiment

FormControlName

Hyb

MenuControlName

ExpSub






inheritance

inheritance

parameter

relational

database

ExperimentTable

HybTable

HybFormView

ExperimentFormView

ExpSubMenuView

Hyb Form

ExpSub Menu





Page 5 of 9




Page 6 of 9














Page 7 of 9





















Page 8 of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References






parameter

ExperimentMapper

HybMapper

FormControlName

Experiment

FormControlName

Hyb

MenuControlName

ExpSub






inheritance

inheritance

parameter

relational

database

ExperimentTable

HybTable

HybFormView

ExperimentFormView

ExpSubMenuView

Hyb Form

ExpSub Menu





of 9




Page 6 of 9














Page 7 of 9





















Page 8 of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References




of 9














Page 7 of 9





















Page 8 of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References














of 9





















Page 8 of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References





















of 9



































Page 9 of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References



































of 9

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Domain analysis

Modeling language

Reusable components

Generators

Results

Expected output

Applications


Conclusions


Acknowledgements

Author details


Competing interests

References

proceedings open access the molgenis toolkit: rapid ... · infrastructure accommodates a particular...

Documents