Designing Integrated Computational Biology Pipelines Visually

Hasan M. Jamil

Abstract—The long-term cost of developing and maintaining a computational pipeline that depends upon data integration and sophisticated workflow logic is too high to even contemplate "what if" or ad hoc queries. In this paper, we introduce a novel application-building interface for computational biology research, called VizBuilder, by leveraging a recent query language for life sciences databases called BioFlow. Using VizBuilder, it is now possible to develop complex ad hoc computational biology applications at throw-away cost. The underlying query language supports data integration and workflow construction almost transparently and fully automatically, using a best effort approach. Users express their application by drawing it with VizBuilder icons and connecting them in a meaningful way. Completed applications are compiled and translated into BioFlow queries for execution by the data management system LifeDB, for which VizBuilder serves as a front end. After a brief introduction to BioFlow, we discuss VizBuilder's features and functionality in the context of a real-life application. The architecture and design principles of VizBuilder are also discussed. Finally, we outline future extensions of VizBuilder. To our knowledge, VizBuilder is a unique system that allows visually designing computational biology pipelines involving distributed and heterogeneous resources in an ad hoc manner.

Index Terms—Data integration, computational pipelines, what if queries, systems biology, visual programming, ad hoc queries, workflow

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 3, MAY/JUNE 2013, p. 605. The author is with the Department of Computer Science, University of Idaho, Janssen Engineering, Room 236, Moscow, ID 83844. E-mail: [email protected]. Manuscript received 2 July 2012; revised 3 Feb. 2013; accepted 31 May 2013; published online 12 June 2013. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2012-07-0163. Digital Object Identifier no. 10.1109/TCBB.2013.69. 1545-5963/13/$31.00 © 2013 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

1 INTRODUCTION

Availability of huge volumes of a wide variety of data raises an intriguing question: can a user fashion a hypothetical question and compute a response just to initiate an investigation, or pose a "what if" query to understand the consequences of a contemplated change in the experimental setup?

While the answers are in the affirmative, the real question is, at what cost? Most computational biology studies are meticulously designed, implemented, and used for a long time, for example, [1], [2], [3], [4], [5], [6]. Many such applications survive several years, are used by many researchers, and are maintained by computationally savvy users or experts. The long-term cost of development and maintenance is therefore very high. The added complexity of constructing computational pipelines over distributed, heterogeneous databases makes it highly unlikely that such endeavors will even be contemplated, preventing biologists from asking ad hoc investigative or what-if queries, even though the possibilities are exciting.

However, despite remarkable progress in technology, the task of developing application logic has largely remained a writing exercise. End users such as domain scientists therefore find it extremely difficult to gain access to needed information scattered across the globe without the help of technical experts or predesigned interfaces. Although modern compilers help programmers with automatic code generation, interface building, and so on, they operate on the premise of a technically savvy developer with good command of a textual programming language. For a developer, a text-based approach makes more sense than a graphical alternative, as it allows fine control over complex control flows and data structures. Once the application logic is encapsulated in an easy-to-use interface in the form of point-and-click operations, end users can begin to exploit it.

This situation is changing with the introduction of high-level languages in emerging fields such as data integration [7], [8], [9], business workflow management [10], network analysis [11], and gene expression data management [12]. The user demographics of these languages include people from disparate scientific fields, user bases in e-commerce communities, first responders in disaster mitigation systems, and so on. To facilitate access and querying, most applications and web-based services now provide form- and GUI-centered interfaces. Such static, form-based approaches are bound to fall short of capturing the full capabilities of these powerful languages. The plurality of such languages, and a complexity that confronts even skilled developers with significant hurdles and a steep learning curve, leaves no room for domain scientists; end users instead rely on prefabricated interfaces and workflow processes to access and analyze data of interest. Given these difficulties, visual programming systems seem to be the most logical solution to these problems.

To support ad hoc and on-the-fly data integration in the life sciences, we are developing a new query language called BioFlow [8] for our biological data management system, LifeDB [13]. Even though BioFlow offers high-level abstractions for programming support, our experience with end users in the life sciences tells us that developing applications




involving diverse internet resources is still a challenge for them. On the other hand, depicting a program with diagrams, flow charts, and sequences of conceptual constructs was more user friendly and appealing than writing it in textual languages such as SQL or BioFlow. This served as the motivation behind our graphical editor, VizBuilder [14], a visual toolkit that aims to generate BioFlow scripts from application description diagrams for execution by the LifeDB engine.

Our principal goals in the design of VizBuilder were simplicity of design, support for conceptual application development at a higher abstraction level than currently possible, and adherence to best practices in workflow systems [15], [16]. The design was inspired by several academic [17], [18], [19] and industrial [20] workflow systems. We preferred independence from underlying computational structures, such as grids or web services, for wider acceptance and applicability. Unlike systems such as [21], [22], [20], we also preferred a decoupled environment, as opposed to a tight coupling of the interface and execution engine, so that modular design and leveraging existing systems such as LifeDB are possible. In summary, we use VizBuilder to visually express an application at a conceptual level; a translation mechanism then converts the visual specification into a BioFlow script for onward execution by the LifeDB engine. Although BioFlow is largely declarative, it contains programming constructs such as loops and conditions, which is a departure from [23]. Finally, from a design standpoint, VizBuilder extends the ideas of graph grammars in visual languages such as [21], [24], [22] by enriching them with features from domain-specific modeling to support abstraction and decoupling from lower-level engines.

The remainder of this paper is organized as follows: In Section 4, we introduce salient features of BioFlow on intuitive grounds using a real-life data integration application in microRNA data analysis. In Section 5, we discuss the system architecture of VizBuilder, and in Section 6, application building using VizBuilder. We then discuss the graph grammar used in VizBuilder for the BioFlow language and show the reconstruction of the BioFlow script from Section 4 with VizBuilder visual operators, to impress upon the reader the flexibility of our system. In Section 8, we discuss where VizBuilder stands in relation to other similar systems and within data integration generally. Finally, we conclude in Section 10 with plans for future research.

2 DATA INTEGRATION USING BIOFLOW

From an application standpoint, an online resource made available as an interface may be considered a function that, on submission of a set of permissible inputs, returns a response, often in the form of a table. A typical exploration session is shown in Fig. 1, where a user submits two rows of values, yellow and pink, from one of her tables to a website and receives two tables in response. She then collects two different subsets of the responses, combines them into one green table, and repeats the process on another online site.

During the collection process, users often perform one of two operations, which we call combine and link. The combine operation is characterized by collecting a set of objects with identical properties into one group or table. The link operation, on the other hand, usually broadens or extends what is known about a set of objects, and is thus reminiscent of a join in the relational data model. The two operations are shown schematically in Fig. 2. In the absence of schema heterogeneity, and when the schemes of all the tables are known, these operations usually reduce to the traditional relational union and join operations. In reality, however, schema heterogeneity is common and schema reconciliation becomes essential. To further complicate matters, heterogeneity also manifests itself in representation in addition to schema mismatches, necessitating semantic reconciliation based on entity resolution [25], [26], [27]. Once we have at our disposal the machinery to perform these three operations (submit, link, and combine) in a schema-independent way, it is plausible that all we need is a simple mechanism to sequence operations, choose alternatives, repeat actions as needed, and reuse previously defined actions: the necessary ingredients for specifying workflows or computational pipelines.
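To make the two collection operations concrete, here is a minimal Python sketch of their semantics. This is our own illustrative model, not BioFlow or LifeDB code: the function names, the dict-based table encoding, and the explicit schema_map argument (standing in for what a schema matcher would discover automatically) are all assumptions.

```python
def combine(r, s, schema_map):
    """Union rows of r and s after renaming s's columns via schema_map,
    dropping duplicate objects (rows with identical attribute values)."""
    renamed = [{schema_map.get(k, k): v for k, v in row.items()} for row in s]
    seen, out = set(), []
    for row in r + renamed:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def link(r, s, key):
    """Extend each row of r with the matching rows of s that share `key`,
    broadening what is known about each object, much like a join."""
    index = {}
    for row in s:
        index.setdefault(row[key], []).append(row)
    return [{**a, **b} for a in r for b in index.get(a[key], [])]
```

For example, combining a table keyed on miRNA with one keyed on microRNA under the mapping {"microRNA": "miRNA"} yields one table with duplicates removed, while link widens each miRNA row with, say, a pValue column from a second table.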

2.1 BioFlow Statements for Data Integration

We are developing a new open source data management system, LifeDB, for life sciences research, with the singular goal of supporting integrated data analysis using distributed, heterogeneous online resources. LifeDB relies on a declarative query language called BioFlow as its host language to support data definition and data manipulation. BioFlow in turn is firmly grounded in an algebraic language called Integra [28]. The well-defined declarative nature and the closure property of BioFlow make it


Fig. 1. Online database interaction model.

Fig. 2. Model for combine and link operation.


possible to cast each BioFlow sentence into a set of self-definable visual operators in the form of graphical icons, and to compose them to create nested expressions or sequence them as needed. Since the underlying data model is still relational, it is possible to retain SQL as the main vehicle for data management and to treat BioFlow as an extension only for interfacing with online databases and performing collection operations in the form of combine and link. In conventional relational databases, users are confined to a single database and the schema is well defined; but in applications such as the ones we are trying to model, users are not restricted to a single database. Instead, they are trying to integrate multiple databases, perhaps with conflicting representations of the same set of objects. Thus, as we will see, in all three statements (submit, link, and combine), online schema matching, wrapper generation, and object recognition play a significant role. In the following sections, we summarize the three BioFlow statements for completeness and refer readers to the above articles for a complete discussion.

2.1.1 Extract and Call Statements

To maintain the closure property and to bring uniformity to BioFlow, we view a web form as a function, part of an operator called transform, that accepts a table and a Boolean condition and returns a table as a response. This abstraction helps hide the fact that each web form actually returns a semistructured, and often unstructured, document, in contrast to the relational model where every statement returns a table. So, to sanitize the response generated by a web form and convert it into a table, we use a wrapper generator as a parameter to the transform operator. Furthermore, we also use a schema matcher to resolve any semantic heterogeneity that potentially exists between the user view and the physical website. The structure of the extract statement is as follows:

define function f
extract a1 t1, ..., am tm
using wrapper w matcher m in ontology o
from s
submit v1 T1, ..., vk Tk.

The define function construct above was first introduced in [29] as a mechanism for the uniform treatment of relational tables and tables returned by web forms in SQL applications, where explicit extraction expressions for wrappers were required. Such expressions were manually created specific to each site and supplied as part of the statement, making it completely procedural, and query and site specific. We extend and adapt this statement, in a way similar to user-defined functions, so that it is activated only when a call is made using a call statement. The define function statement only states the structure of the interface, like a view definition.

The functionality of this statement can be explained as follows: it expects to be supplied with a list of values vi of types Ti, which the function submits to the URL s. Once a response page is received, it uses the wrapper w to extract a set of objects as rows of a table r(R). It then matches the scheme a1, ..., am with R using schema matcher m and projects the matched columns a1, ..., am. In LifeDB, we have made OntoMatch [30] available for schema matching and FastWrap [31] as the wrapper generator, while options are available to use other user-preferred schema matchers and wrappers by selecting the appropriate ontology o where they are defined.
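The extract semantics just described can be sketched in Python. Everything here is an invented stand-in: toy_wrapper and toy_matcher are hypothetical placeholders for tools like FastWrap and OntoMatch, and the synonym table and page layout are fabricated for illustration only.

```python
def transform(response_page, wrapper, matcher, wanted_columns):
    """Sketch of the extract semantics: the wrapper w turns the response
    page into rows, the matcher m maps each user-side column name a_i to
    a page-side name, and the matched columns are projected."""
    rows = wrapper(response_page)                      # page -> list of dicts
    mapping = matcher(wanted_columns, list(rows[0]))   # user name -> page name
    return [{c: row[mapping[c]] for c in wanted_columns} for row in rows]

# Hypothetical wrapper: treats the page as a header tuple plus data tuples.
def toy_wrapper(page):
    return [dict(zip(page[0], line)) for line in page[1:]]

# Hypothetical matcher with an invented synonym table.
def toy_matcher(wanted, available):
    synonyms = {"geneName": "geneID"}
    return {c: (c if c in available else synonyms[c]) for c in wanted}

page = [("geneID", "targetSites"), ("GJA1", "3"), ("TP63", "2")]
print(transform(page, toy_wrapper, toy_matcher, ["geneName", "targetSites"]))
# [{'geneName': 'GJA1', 'targetSites': '3'}, {'geneName': 'TP63', 'targetSites': '2'}]
```

The point of the sketch is the division of labor: the wrapper recovers table structure from an arbitrary response page, while the matcher reconciles the user's view of the schema with the site's.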

It is important to note here that we adopt the well-known best effort integration approach [32], [33], [34] in LifeDB. This approach to information integration and extraction recognizes the fact that some queries may fail due to the method used for schema matching and table extraction. Therefore, it is important that we offer an array of choices of schema matchers and wrapper generators, as a family of functions, for users to choose from as appropriate. This is exactly why we have included the using clause and the in ontology option: a single fixed choice is not always the right one.

The corresponding call statement has the following structure and syntax. In this statement, u is a relation, or a query that returns a relation or a tuple of values that matches the scheme of the submit clause of f. If u is a relation, then the result is a relation v = ∪_{t ∈ u} r_t, where r_t is the relation returned by f for each t ∈ u:

call f with u.
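The semantics of call, invoking f once per tuple and taking the union of the returned relations, can be modeled with a few lines of Python. This is an illustrative sketch only; the function and the mock target table are our own inventions, and for simplicity the union here is a bag union that does not deduplicate.

```python
def call(f, u):
    """Model of BioFlow's `call f with u`: invoke f once per tuple t in
    relation u and union the returned relations r_t (duplicates are not
    removed in this simplified sketch)."""
    result = []
    for t in u:
        result.extend(f(t))   # r_t, the relation f returns for tuple t
    return result

# Mock function standing in for a web-form interface such as getMiRNA;
# the gene-target data below is invented for illustration.
targets = {"mir-1": ["GJA1"], "mir-2": ["TP63", "GJA1"]}

def get_targets(t):
    return [{"miRNA": t["miRNA"], "geneID": g} for g in targets[t["miRNA"]]]

rows = call(get_targets, [{"miRNA": "mir-1"}, {"miRNA": "mir-2"}])
```

Here rows collects one relation per input miRNA into a single three-row result, mirroring how a single call statement drives a web form once per tuple of its input table.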

2.1.2 Combine Statement

It is often necessary to collect a set of objects from multiple sites or, as shown in Figs. 1 and 2, to extract different segments of objects and collate them into one single table. Since the structure and representation of these objects may vary, it is essential that in addition to matching schemes, we also recognize "objects" or "entities" semantically [25], [26], [27], so that we are able to avoid collecting identical objects multiple times. The statement used in BioFlow is called combine and has the following syntax:

combine r, s using matcher m identifier k.

This statement takes a union of the semantic objects in tables r and s, even though they are not union compatible. It uses schema matcher m to first reconcile the schemes, and then applies object identification or entity resolution operation k to identify objects uniquely before the union is applied. Since we adopted the best effort information extraction principle, it is possible that two completely different sets of objects may yield a combined relation of objects having nothing in common. Therefore, to limit such unions of unrelated objects, and analogous to the notion of union compatibility, we introduced combine compatibility in [8], which essentially requires the relations to share at least one candidate key. For the purpose of entity resolution and key identification, we have adopted Gordian [35] in LifeDB, while users are allowed to include their favorite function.
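A minimal check for combine compatibility might look as follows. This is our reading of the requirement stated above (sharing at least one candidate key, with keys compared as attribute sets after schema reconciliation); the function name and the list-of-key-lists encoding are assumptions, not LifeDB internals.

```python
def combine_compatible(r_candidate_keys, s_candidate_keys):
    """True if the two relations share at least one candidate key.
    Each argument is a list of candidate keys, each key being a list
    of attribute names (already reconciled by the schema matcher)."""
    r = {frozenset(k) for k in r_candidate_keys}
    s = {frozenset(k) for k in s_candidate_keys}
    return bool(r & s)

# The two microRNA tables share the key {miRNA} once geneID/geneName
# and microRNA/miRNA have been matched, so they are combine compatible.
print(combine_compatible([["miRNA"]], [["miRNA"], ["geneID", "pValue"]]))
```

A key discovery tool such as Gordian would supply the candidate-key lists; the compatibility test itself then reduces to set intersection.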

2.1.3 Link Statement

Analogous to the Cartesian product and join operations, new relationships are constructed between semantically identical objects using the link operation as follows:

link r, s using matcher m identifier k.

While the combine operation requires the two relations to have shared candidate keys, the link operation requires one relation to have a candidate key and the other to include it as a foreign key. Thus, the link operation is somewhat analogous to the potentially one-to-many classical natural join in terms of its function, but semantically it is fundamentally different. In the case of natural join, two relations are joined if they have common attributes and values. Link compatibility instead depends on key constraints, where the keys are usually discovered in the context of the participating relations.

3 WORKFLOW DESCRIPTION IN BIOFLOW

BioFlow supports arbitrary computational pipeline construction using desktop and online resources. It uses named, possibly stored, processes as units of operation, two process statements, perform and wait, and two control statements, if and repeat, to help specify complex and powerful pipelines or workflows. The combination of the data integration primitives we introduced in Section 2 and the workflow statements makes BioFlow a powerful ad hoc pipeline language, and LifeDB a robust system. While Kepler [17] and Taverna [18] are well-regarded scientific workflow management systems, we believe that LifeDB offers a combination of declarative workflow design and data integration that none of the other systems does. Nonetheless, it is important to note that these other systems are very powerful in their own right and have many distinctive features that only a handful of systems such as LifeDB can match. Since LifeDB does not require web service compliance to access websites, it can access virtually all life science resources that offer their services through web interfaces. LifeDB also does not require any customization to access a web interface. These features are leveraged in VizBuilder to provide a robust interface for building computational pipelines by naive end users without much effort. The VizBuilder front end is thus our way of offering a user experience comparable to Kepler and Taverna without the technicalities.

3.1 Process Graphs: Sequencing Operations

A named process p is a list of BioFlow statements enclosed within a pair of braces. In a BioFlow script, a named process is executed only when invoked with a perform statement. These named processes can be organized in specific sequences depending on the application's needs and logic. In our quest to stay as close to SQL and as declarative as possible, to allow maximum compatibility, we do not allow nested processes or local scoping. Named processes are just a mechanism to group operations and invoke them when needed. Thus, all statements and variables have global scope in BioFlow and VizBuilder.

3.1.1 Perform Statement

Since we allow the use of multiple online resources in real time, there is potential to use parallelism and distributed computing to our advantage. Unless there is a need to process the services of websites in sequence (such as when the input to one website's form depends on the output of a previous website), we can submit requests in parallel and collect the responses in a desirable sequence. The syntax and semantics of the perform statement below allow this flexibility:

perform [parallel] p1, ..., pk [after q1, ..., qn] [leave];

perform offers three optional controls: parallel, after, and leave. If used, parallel means the named processes p1, ..., pk will be invoked in parallel. If parallel is not used, p1, ..., pk are invoked in the sequence listed. The after clause results in invoking p1, ..., pk, as stated, after the sequence of processes q1, ..., qn has executed. The leave option transfers control to the next statement in the script and makes the processes run in the background. Technically, without this option all invocation is in synchronous mode, i.e., the next sentence begins execution when the current sentence completes; with leave, invocation is asynchronous, i.e., the next sentence starts executing immediately and does not depend on the completion of the current perform statement.

3.1.2 Wait on Statement

To enforce synchronization of processes left to execute offline, BioFlow uses the wait on statement below. This statement forces control to wait for the processes to complete execution so that synchronization can be restored:

wait on p1, ..., pm.

For example, the statement sequence below has the effect of executing o and then q, then scheduling p1 and p2 in parallel and moving on to execute r. Once r is completed, it waits until p2, which BioFlow executes in the background, is completed. Once p2 completes execution, process s is invoked:

perform o, q;
perform parallel p1, p2 leave;
perform r;
wait on p2;
perform s.

Note that it does not matter whether p1 is still executing. This example illustrates the power of the leave option in the perform and wait on statements, and how they help define complex process graph sequences. It should be obvious that a perform p1, p2 leave statement instead would force BioFlow to wait until process p1 finishes before executing p2.
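The schedule above can be mimicked with standard Python threading. This is purely our illustrative model of the perform/wait on semantics, not LifeDB code, and it simplifies one point: the wait_on below joins both background processes, whereas the BioFlow example waits on p2 only.

```python
import threading

log, lock = [], threading.Lock()

def process(name):
    """Make a trivial named process that just records its execution."""
    def run():
        with lock:
            log.append(name)
    return run

def perform(*procs, parallel=False, leave=False):
    """Run processes; `parallel` starts them concurrently, `leave`
    returns immediately, handing back the threads for a later wait."""
    threads = [threading.Thread(target=p) for p in procs]
    for t in threads:
        t.start()
        if not parallel:
            t.join()          # sequential mode: finish each before the next
    if leave:
        return threads        # asynchronous: caller may `wait on` these later
    for t in threads:
        t.join()              # synchronous: block until all complete

def wait_on(threads):
    for t in threads:
        t.join()

# perform o, q; perform parallel p1, p2 leave; perform r; wait on p2; perform s
perform(process("o"), process("q"))
background = perform(process("p1"), process("p2"), parallel=True, leave=True)
perform(process("r"))
wait_on(background)           # simplification: waits on p1 and p2, not p2 alone
perform(process("s"))
```

After the run, o and q are guaranteed to appear first in the log and s last, while p1, p2, and r may interleave freely, which is exactly the freedom the leave option buys.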

3.2 Control Statements: Branching and Looping

Since perform is basically a process scheduler, in sequence or in parallel, it is complemented with two control statements for choosing between alternatives and repeating them as needed. Although the repeat construct is not usually considered declarative, we have included it to support basic functions found in many workflows, and because its simple structure is amenable to casting into a VizBuilder icon without any drawback.

3.2.1 If Else Statement

To help choose between two paths of execution in a process graph, BioFlow supports the simple branching statement if else, where c is a simple Boolean condition, and s1 and s2 are one or more of the four BioFlow workflow statements: perform, wait on, if else, or repeat:

if c then s1 else s2.



3.2.2 Repeat Statement

To allow repetition, BioFlow supports a simple loop construct, shown below, where s1, ..., sk are a set of BioFlow workflow statements:

repeat s1, ..., sk until (c).

3.3 BioFlow Sentences as Visual Templates

The hierarchical construction of BioFlow statements and their self-contained semantics allow us to cast each one of them as an abstract task icon in a graphical interface such as VizBuilder. We borrow conceptual programming constructs such as read, write, repeat, branch, and predefined process, and define them in terms of BioFlow statement constructs to allow application design at an even higher conceptual level, so that users need not think in terms of BioFlow statements at all. The information needed for each of the lower-level statements to which an icon maps is carefully blended into the forms that define it. Once the necessary information is captured, the mapping follows predefined algorithms to generate the target BioFlow script. However, as we will discuss shortly, the model and the process are not as simple and straightforward as they may appear.

4 AN ILLUSTRATIVE EXAMPLE

To illustrate the capabilities of BioFlow, we adapt a real-life life sciences application discussed in [36], which has been used as a use case for many other systems and, as such, can be considered a benchmark application for data integration. A substantial amount of glue code was written to implement the application in [36], manually reconciling the source schemas to filter and extract information of interest. Our goal in this section is to show how simply and efficiently this application can be developed in LifeDB.

In this example, the user wants to validate the hypothesis that "the human p63 transcription factor indirectly regulates certain target mRNAs via direct regulation of miRNAs" by submitting several queries to multiple internet sites and combining information from them in a specific way. If positive, the user also wants to know the list of miRNAs that indirectly regulate other target mRNAs with a high enough confidence score (i.e., pValue ≤ 0.0006 and targetSites ≥ 2), and so he proceeds as follows. He collects 52 genes along with their chromosomal locations (shown partially in Fig. 3a as the table genes) from a wet lab experiment using the host miRNA genes, and maps them at or near genomic p63 binding sites in the human cervical carcinoma cell line ME180. Several thousand direct and indirect protein-coding genes are available to him as potential targets of p63 in ME180 and are stored as candidates in the table proteinCodingGenes (shown partially in Fig. 3d). The rest of the exploration then proceeds as follows.

He first collects a set of genes (geneIDs) for each of the miRNAs in the table genes from the website www.microrna.org by submitting one gene at a time to the form, which returns, for each such gene, a set of gene names known to be targets of that miRNA. The response is in the form of a table, from which the user collects the targetSites along with the gene name, partially shown as the table micrornaRegulation in Fig. 3c.

He also collects the set of gene names for each miRNA in table genes from www.microrna.sanger.ac.uk in a similar fashion, shown partially in table sangerRegulation in Fig. 3b. Notice that this time the column targetSites is not available, so he collects the pValue values instead. Also note that the schemes of these tables are syntactically heterogeneous but semantically similar (i.e., miRNA ≈ microRNA, geneName ≈ geneID, and so on). He queries both sites because the data in the two databases are not identical, and querying only one site may not return all possible responses. Once these two tables are collected, he takes the union of the two sets of gene names (in micrornaRegulation and sangerRegulation), and finally selects, as his response, the genes in the intersection of the table proteinCodingGenes (restricted to genes that bind to p63, i.e., p63Binding = "N") and micrornaRegulation ∪ sangerRegulation.
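The union-then-intersect logic of this exploration can be sketched in Python over toy data. The table names mirror Fig. 3, but every row below is invented for illustration, and this set-based sketch deliberately ignores the schema and entity heterogeneity that BioFlow's combine and link statements handle automatically.

```python
# Toy rows standing in for the tables in Fig. 3 (all values invented).
micrornaRegulation = [{"miRNA": "mir-203", "geneName": "TP63", "targetSites": 3}]
sangerRegulation   = [{"miRNA": "mir-203", "geneName": "GJA1", "pValue": 0.0001}]
proteinCodingGenes = [{"geneName": "GJA1", "p63Binding": "N"},
                      {"geneName": "XYZ1", "p63Binding": "Y"}]

# Union of target gene names collected from both sites (the combine step).
targets = {r["geneName"] for r in micrornaRegulation} | \
          {r["geneName"] for r in sangerRegulation}

# Intersect with protein-coding genes that bind p63 (p63Binding == "N").
answer = sorted(g["geneName"] for g in proteinCodingGenes
                if g["p63Binding"] == "N" and g["geneName"] in targets)
```

With this toy data, answer contains only GJA1: it is the one gene that both appears as a miRNA target on at least one site and binds p63 in the candidate table.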

To compute his answers in BioFlow using LifeDB, all he will need to do is execute the script in Fig. 1, which completely implements the application. It is interesting to note that in this application, the total number of data manipulation statements used is only seven (statements numbered (2) through (8)). The rest of the statements are data definition statements needed in solutions in any other system. We will describe shortly what these data manipulation sentences mean in this context. For now, a short and intuitive explanation is in order, while we refer interested readers to [8] for a more complete exposition.

In program 1, the statements numbered (1) through (7) are most interesting and unique to BioFlow. The define function statements essentially declare an interface to the websites at the URLs in the respective from clauses, i.e., www.microrna.org and www.microrna.sanger.ac.uk. The extract clause specifies what columns are of interest when the results of computation from the sites are available, whereas the submit clauses say what inputs need to be submitted. In these statements, it is not necessary for users to supply the exact variable names at the website, or in the database. The wrapper (FastWrap) and the matcher (OntoMatch) named in the using clause and available in the named ontology mirnaOntology actually establish the needed schema correspondence and the extraction rules needed to identify the results in the response page. Essentially, the define function statement acts as an interface between LifeDB and the websites used in the applications.
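The schema correspondence established by the matcher can be pictured with a toy synonym table. The entries below are illustrative stand-ins for what an ontology such as mirnaOntology would supply; they are not OntoMatch's actual interface or output:

```python
# Toy schema matcher: map a site's column names onto the user's schema
# using synonym pairs. (Illustrative: a real matcher like OntoMatch
# computes these correspondences rather than reading a fixed table.)
SYNONYMS = {"microRNA": "miRNA", "geneID": "geneName",
            "target sites": "targetSites"}

def match_columns(site_columns):
    """Return a site-column -> user-schema-column correspondence,
    leaving unmatched columns unchanged."""
    return {c: SYNONYMS.get(c, c) for c in site_columns}

mapping = match_columns(["microRNA", "geneID", "pValue"])
```

This is why users need not supply the exact variable names used at the website: the correspondence is computed for them.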

JAMIL: DESIGNING INTEGRATED COMPUTATIONAL BIOLOGY PIPELINES VISUALLY 609

Fig. 3. User tables and data collected from microRNA.org and microrna.sanger.ac.uk.


To invoke and compute at these sites, we use call statements at (4) and (5). The first statement calls getMiRNA for every tuple in table genes, while the second call only sends one tuple to getMiRNASanger to collect the results in tables micrornaRegulation and sangerRegulation. The statements (6) and (7) are also new in BioFlow. They capture, respectively, the concepts of vertical and horizontal integration in the literature. The combine statement collects objects from multiple tables, possibly having conflicting schemes, into one table. To do so, it also uses a key identifier (such as Gordian [35]) to recognize objects across tables. Such concepts have been investigated in the literature under the titles record linkage or object identification. For the purpose of this example, we adapted Gordian as one of the key identifiers in BioFlow. The purpose of using a key identifier is to recognize the fields in the constituent relations that essentially make up the object key¹ so that we can avoid collecting nonunique objects in the result. The link statement, on the other hand, extends an object in a way similar to the join operation in relational algebra. Here too, the schema matcher and the key identifier play an important role. Finally, the whole script can be stored as a named process and reused using BioFlow's perform statement. In this example, line (1) shows that this process is named compute_mirna and can be stored as such for later use. We are now ready to introduce VizBuilder, which can be used to develop this entire application using its visual constructs and operators.
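A rough approximation of these two operations over lists of dicts is sketched below, assuming the schemas have already been matched and with Gordian's key discovery reduced to a caller-supplied key column:

```python
def combine(key, *tables):
    """Vertical integration: union rows from several tables, keeping one
    row per object key to avoid collecting nonunique objects."""
    seen, result = set(), []
    for table in tables:
        for row in table:
            if row[key] not in seen:
                seen.add(row[key])
                result.append(row)
    return result

def link(key, left, right):
    """Horizontal integration: extend each left row with the fields of
    the matching right row, similar to a relational join on the key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

merged = combine("geneName",
                 [{"geneName": "TP53INP1", "targetSites": 3}],
                 [{"geneName": "TP53INP1", "pValue": 0.0001}])
# merged keeps only the first TP53INP1 row: one object per key
```

In the real system, of course, the object key is discovered rather than named by the caller, and the constituent schemas may conflict.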

5 VIZBUILDER SYSTEM OVERVIEW

As shown in Fig. 4, the VizBuilder system is divided into two major components: the editor and the kernel. The editor has several predefined features such as the capability to open, close, or save an editing session, undo/redo operations, and several other language-specific features. These operations are embodied in the visual icons and constructs supported by VizBuilder. Consequently, when a new language is being embedded as the target language, each of its language features needs to be embodied in the icons as a combination of forms and a series of defined steps so that, once followed, VizBuilder can translate these interactions and collected descriptions into statements of the target language, which in our case is BioFlow. So, during the initial deployment phase of VizBuilder, a language designer needs to define the VizBuilder visual language as an instance of the eXtensible VizBuilder Markup Language (XVML). The kernel then converts the XVML definition to a set of VizBuilder-specific Java classes and objects. The toolbar of the editor is reconstituted from the visual language alphabet of BioFlow. A set of BioFlow syntax-directed editing rules are then implanted into the editor. At this stage, the kernel derives a translation scheme from the visual model of VizBuilder to the textual language of BioFlow. The system is then recompiled with these components and deployed on the web for programming purposes.² Once a

610 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 3, MAY/JUNE 2013

Fig. 4. System overview of VizBuilder.

1. Note that the object key in this case is not necessarily the primary key of the participating relations.

2. It is readily apparent that VizBuilder is more than just a visual client for BioFlow. It can be extended to support other languages expressed in a declarative manner. This is illustrated in Fig. 4.


language is embedded, end users need only download and install the system for ready use.

To develop application programs, the end user opens up the editor in the browser and draws the visual program by dragging the operators and the edges from the toolbar.³ The syntax-directed editing rules, decided in the deployment phase, guide the workflow development with on-the-fly syntax checking, for example, denying a connection between operator A and operator B if any such connection is prohibited in the rule base. When the user chooses to compile his or her workflow, it is checked for syntactic and semantic errors. If the program is error free, then it is translated into the code of the target textual language using the model2code translation scheme set in the deployment phase.
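The rule-base check can be as simple as a set of permitted operator pairs. The pairs below are made up for illustration and are not VizBuilder's actual rule base:

```python
# On-the-fly syntax check: an edge from operator A to operator B is
# permitted only if the deployment-phase rule base allows it.
# (Illustrative rules; not the actual VizBuilder rule base.)
ALLOWED_EDGES = {
    ("Process", "Read"), ("Process", "Write"),
    ("Read", "Screen"), ("Table", "Slice"),
}

def can_connect(src, dst):
    return (src, dst) in ALLOWED_EDGES

ok = can_connect("Process", "Write")   # allowed by the rule base
bad = can_connect("Screen", "Table")   # denied: Screen has no outgoing edge
```

Checking each edge as it is drawn is what lets the editor reject malformed workflows before compile time.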

From the implementation point of view, VizBuilder is developed in Java, which makes it operating system independent. It is an applet which can be deployed with standard web servers. So, the end user just needs a Java-enabled browser like Internet Explorer, Mozilla Firefox, Safari, or Opera to draw a visual program with VizBuilder. We used the Java Universal Network/Graph Framework (JUNG) [37] as the library for drawing and maintaining graphical structures.

5.1 Operator Definition Using XVML

VizBuilder captures the semantics of its visual constructs expressed in XVML, a markup language designed after XAML [38]. In this section, we will only present a flavor of XVML because a complete exposition is outside the scope of this paper. Similar to the code-behind mechanism in XAML, we allow the Class attribute to link to Java classes. Thus, we are able to create the user interface elements declaratively, while hiding the imperative logic for syntax checking and translation inside the class definitions. These classes are extended from the base class Operator, which is embedded in VizBuilder. The overridden abstract functions in these classes ensure orderly translation from the visual model to the textual program. The id attributes are unique throughout the definition to avoid ambiguity of reference.

As shown in Fig. 5, every instance of XVML starts with an Operator node. Each operator must have a child node from one of two possible types, namely, the NestedPanel or the InputPanel. A NestedPanel holds a number of other operators in its toolbar. NestedPanels come with a default drawing area. For example, the Process operator in Fig. 5 is the root node of any visual program. The root operator does not appear physically in VizBuilder. It is the NestedPanel, embedded inside the Process operator, that the end user will see when she opens up her editor. This NestedPanel holds a number of operators in its toolbar, two of which (Write and Assign) are shown in Fig. 5. The recursive chain of operator-panel ends with an operator with an InputPanel, as the InputPanels are not allowed to contain any Operator or drawing pane. For example, if VizBuilder is assembled following Fig. 5, then the top two layers (Process and Write) will allow workflow design, while the Table layer will only support textual input. We support a variety of GUI elements like text boxes, combo boxes, lists, and so on. During the translation process, the overridden translate functions in the linked classes are called in a bottom-up fashion. The function converts the workflow drawn on the NestedPanel or the user inputs from the InputPanel into the script.
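The bottom-up translation can be sketched with a minimal operator class. The class name mirrors the paper's Operator base class, but the emitted script fragments below are placeholders, not BioFlow output:

```python
# Bottom-up model-to-code translation: each operator emits its script
# fragment only after its children have been translated.
class Operator:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def translate(self):
        inner = "; ".join(child.translate() for child in self.children)
        return f"{self.name}({inner})" if inner else self.name

# Process -> Write -> Table mirrors the operator-panel nesting of Fig. 5.
program = Operator("Process", [Operator("Write", [Operator("Table")])])
script = program.translate()  # "Process(Write(Table))"
```

In VizBuilder, each concrete operator overrides translate to emit its own BioFlow statement; the recursion order is the same.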

5.2 A VizBuilder Application Design Session

In Fig. 6, we have three layers of panels marked by three green circles. Layer 1 is the NestedPanel for the Process operator from Fig. 5. The user has dragged down three operators from the toolbar and drawn a visual sentence. One of them, namely, Write, is expanded in layer 2. This layer is also of the NestedPanel type, allowing the user to draw workflows. The user has drawn a workflow where data from a disk file is uploaded to a database table. Layer 3 is different from the previous two layers. It is an InputPanel describing the schema of the table. The hierarchy of the panels makes programming relatively straightforward. For example, if the user switches to one of the Read operators in layer 1, all the nested windows for the Write operator will disappear. Users can work on multiple processes simultaneously by switching between the tabs containing them. Since no ordering of tasks is enforced, users are free to draw the graph with as many operators as they need and place them in a manner that is convenient and aesthetically pleasing to them.


Fig. 5. Partial definition of BioFlow with XVML.

3. We will take a closer look at a programming session in Section 5.2.


6 BIOFLOW USING VIZBUILDER

Although it is possible to use VizBuilder for different programming languages by changing the XVML instances, our primary target is BioFlow. The statements in BioFlow can be divided into three categories, namely, control, data definition, and data manipulation statements. Processes, like compute_mirna in program 1, are independent programs written in BioFlow. The tables and functions defined inside a Process have global scope, and the tables can be manipulated by the two basic data manipulation statements, Select and Insert, which add on to the functionalities of their SQL counterparts. Apart from the tables, BioFlow allows the use of local variables as temporary storage, and supports conditional branching and loops to control the flow of execution.

6.1 Process Panel

The root node of the XVML definition of BioFlow is the Process operator (see Fig. 5). The flow of execution within a process is drawn on the NestedPanel of this operator. The operators in this layer are the Read, Write, Assign, Condition, and end of block (EOB). The first two operators have a one-to-one relationship with the Select and Insert statements in BioFlow. The Assign operator is responsible for accessing local variables, while the remaining two operators define the control blocks. We do not have any visual operator for the data definition statements in the process layer. They are handled in subsequent nested layers. Three types of edges are allowed in this layer. Fig. 7a describes the production rules of the grammar for the Process panel with the terminal symbols shaded in blue. The edges out of the Condition nodes are labeled t and f, and the regular edges without any label represent the flow of execution.

In the XVML definition of the Process layer, every child operator except the EOB has a child panel describing that operator. For the Assign and Condition operators, these panels are of the type InputPanel. The Read and Write operators, on the other hand, have their own NestedPanels with different graph grammars, described next.

6.2 Read and Write Panels

Unlike the control flows in the Process panel, the Read and Write panels describe the flow of data in different directions. In a Write statement, data can travel from the input devices, disk storage, or source tables, through a number of intermediate tables and modifying functions, to a destination table. On the other hand, the Read statements extract data from one or more tables, modify them to desired formats by applying different operators, and display them on output devices. The following set of


Fig. 6. A sample session.

Fig. 7. Graph grammars for BioFlow.


operators is employed to express the data flow in the Read and Write panels:

- Table operator denotes the schema of the data table in consideration. An outgoing edge from the Table node holds the relation that is stored in the table. An incoming edge to a Table node describes one or more rows that will be inserted into the table. One does not need to define the schema every time a table is accessed. Instead, the user can choose from a pool of already defined schemas, which ensures the global scope of BioFlow.

- Function operator represents the callable functions in BioFlow. Its use is restricted to the Write operations only. An outgoing edge from a Function node holds the resultant relation from a function call. The incoming edges provide the input parameters to a function.

- Input operator can be connected to a Table or a Function node with an outgoing edge. It collects data from the user through the input devices. The Input operator can be used to extract raw data from a file.

- Combiner operator takes in multiple relations and transforms them into one using one of the available algorithms. Currently, we support the natural join of SQL along with the BioFlow-specific functions, link and combine, as possible merging algorithms.

- Slice operator changes the input relation by renaming or discarding one or more columns of the input schema. This operator can also change the values of every tuple of a column by applying some arithmetic function. In other words, Slice is analogous to the projection clause in relational algebra.

- Condition operator can be used to build the where clause in an insert or a select statement. This operator can only be connected to a Slice or a Group node by an outgoing edge for proper functioning. It trims out the rows that fail the logical condition set by the operator.

- Group operator can be connected to a Slice operator with an outgoing edge. The Group operator helps realize complex conditions of the form ((condition 1 and condition 2) or condition 3).

- Screen operator can be used within a Read statement only. It cannot have any outgoing edge. It is allowed to have a single incoming edge from a Table, Function, Slice, or a Combiner operator. It projects the incoming relation onto the screen.

The graph grammar for the Read panel is shown in Fig. 7b. The grammar for the Write panel is similar to this one. Again, the terminal symbols are shaded.
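The data flow through these operators can be mimicked with plain functions over lists of dicts. The table contents and column names below are invented for illustration:

```python
# Read-panel flow sketch: a Table supplies rows, Condition filters them,
# Slice projects columns, and Screen displays the final relation.
def condition(rows, predicate):
    return [r for r in rows if predicate(r)]

def slice_(rows, columns):          # projection, as in relational algebra
    return [{c: r[c] for c in columns} for r in rows]

def screen(rows):                   # terminal operator: no outgoing edge
    for r in rows:
        print(r)

table = [{"gene": "TP53INP1", "p63Binding": "N"},
         {"gene": "CDKN1A", "p63Binding": "Y"}]
result = slice_(condition(table, lambda r: r["p63Binding"] == "N"), ["gene"])
screen(result)  # prints {'gene': 'TP53INP1'}
```

The composition order mirrors the edges of the Read panel: Table to Condition to Slice to Screen.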

7 REVISITING MIRNA REGULATION EXAMPLE

Armed with our understanding of VizBuilder and the visual language for BioFlow, let us revisit the compute_mirna process from Section 4. In Fig. 8, four representative statements from the seven data manipulation statements


Fig. 8. Partial miRNA application script development using VizBuilder; the indices following the subtitles are the line numbers of program 1.


are shown. Fig. 8a holds the first load statement, where data from "/genes.txt" are uploaded onto the genes table. The process layer with all seven data manipulation statements is also shown in this figure. The schema of the genes table is shown as a case of InputPanel at work. Fig. 8b corresponds to statement (4) of program 1. Here, we have data flowing from the table genes to a webpage encapsulated by the web function getMiRNA. The result of the function call is stored in the micrornaRegulation table for later use. The InputPanel of the Slice operator displays the SQL projection of the two columns, namely, miRNA and "9606." Fig. 8c demonstrates the use of the Combiner operator in statement (7) of program 1. It links the tables regulation and proteinCodingGene. Finally, in Fig. 8d, we have statement (8) of program 1. The content of the table proteinCodingGeneRegulation is displayed to the user after performing an SQL selection over the rows based on the condition p63Binding = "N."

7.1 Modeling Non-BioFlow Functions

Although applications mostly need data processing functions supported by query languages, they often also need trivial tasks such as displaying data, executing a specific task, and so on. In this section, we highlight how a few such tasks can be modeled in VizBuilder using the constructs supported by it.

7.1.1 Loading Data into BioFlow Server

The textual command in BioFlow for this purpose is "LOAD DATA", which is borrowed from SQL. In visual BioFlow, it is recognized as a write operation. In Fig. 8b, we show one of the write operations of Fig. 8a expanded into its nested layer, where the operation is defined visually.

7.1.2 Executing BioFlow Functions

A BioFlow function can be a command-line program residing in the local disk space or a call to a hidden web source with given input parameters. In this particular example, it is the latter case, where we extract data from the microrna.org site. A function call must be enclosed within a write operation. In Fig. 9, we have the complete operation with source and destination tables (an example similar to the one in Fig. 8b). As discussed in Section 6 and shown in Fig. 9, a slice operation is dedicated to changing the shape of a relation. In this example, the web function is called with the microRNA column from dataset1 along with a synthetic column with "Homo sapiens" in every row. The resultant set of microRNA target genes is stored in dataset3.

7.1.3 Displaying Results

Fig. 10 shows a join operation followed by a projection, and finally, the results on the screen using a read operation. These actions are defined with various operators of the read panel. In this figure, we also find the use of two conditions in a group. They define the Boolean predicate dataset2.genbank = dataset3.gene AND p63binding = "N", as shown in program 1.

8 RELATED RESEARCH

VizBuilder can be viewed as a novel visual query language (VQL) as well as a data integration and workflow composition system for distributed life sciences resources. From this standpoint, it can be compared to contemporary representative systems in these two classes. Since VizBuilder and LifeDB are brand new systems, their current user base includes only researchers at Wayne State University and the University of Idaho, where they are being used for original research in protein-protein interaction and gene regulation. Although the results of this research are still being studied and are ongoing, we do believe that VizBuilder aided a speedy implementation of the computational pipelines we needed involving nontraditional and less popular databases and resources that we could not directly implement using systems such as Taverna and Galaxy.

8.1 Visual Query Languages

While a higher abstraction level in query languages tends to imply better user friendliness, it unfortunately also implies less expressiveness in terms of capabilities. From this standpoint, declarative languages such as SQL and Datalog, and their domain-specific extensions, have been identified as major artifacts for application development, even in the life sciences [39]. Though more limited, these languages embody higher abstractions compared to procedural languages such as C and Java to ease querying by end users. However, though simpler, declarative languages can still be intimidating to master before they can be used to develop users' own applications.


Fig. 9. A Write operation with a function call.
Fig. 10. A Read operation with a join over two relations.


To this end, visual query languages⁴ focus on easing the complexity of application development in textual languages by making programming more intuitive and less cumbersome with the help of visual aids and artifacts. While most visual languages are effective in narrow domains, in the area of database querying and workflow composition, success has been modest. For example, it is usually very difficult for an end user without any database knowledge to use entity relationship-based query languages like QBD* [41] and VKQL [42]. DFQL [43] is a dataflow-based query language where a large number of operations such as complex Boolean predicates, aggregate functions, and grouping are performed textually. While very powerful, the visual language HyperFlow [44] is far more complex than VizBuilder to program, with its complex set of workflow constructs. Though the language of BioFlow on which VizBuilder is based shares many constructs with Kaleidoquery [45], the 3D representation of Kaleidoquery requires a significant amount of CPU power without obvious gains in utility. Also, its join, self-join, and nested query constructs are less intuitive than visual BioFlow in VizBuilder. Since most visual languages need to be translated into a textual language for execution, our thesis is that declarative languages make better candidates for visual language front ends. From this standpoint, we believe that VizBuilder, combined with its back-end language BioFlow, is an effective and powerful tool.

8.2 Computational Pipelines and Workflow Queries

The research on data integration and workflow composition in the life sciences has been intense in recent years. Broadly, these efforts can be classified into three, possibly overlapping, categories within the context of this domain. First are the standards- and technology-based approaches. In this category, systems such as Taverna, SeaHawk [46], and SADI [47] (all derivatives of the BioMOBY [6] suite of efforts), Bio-jETI [48], and jORCA [49] were designed to leverage standards such as SOAP and WSDL, and technologies such as web services and the semantic web. This implies they are primarily geared to function with resources that are SOAP or WSDL compliant, although they can be tailored to design applications for resources that do not comply with these standards [50], [51], [52]. By design and in principle, derivatives of BioMOBY (e.g., Taverna, SeaHawk, SADI), jORCA, and Bio-jETI do not rely on site cooperation, as long as the resources are standards compliant and can be found in the service registry.

In contrast, systems such as Galaxy [53] and Kepler in the second category rely on the client-server model and cannot function without significant server cooperation. In particular, Galaxy is designed to establish dedicated communication with service providers and requires substantial site cooperation to be able to receive data from remote sites in the format of choice. It is a graphical application design tool well known for its ability to support application design with databases such as the UCSC genome browser. In this case, the UCSC database is directly linked and accessible by a Galaxy client. The application designed at a client desktop uses direct knowledge of the UCSC database, and the UCSC database in turn has intimate knowledge of Galaxy's needs in terms of representation formats and so on for the purpose of compatibility. Such a tight coupling requires substantial installation effort and commits both the server and the client side to a long-term agreement, preventing independent changes in architecture and representation, and slowing down needed updates and upgrades. The Kepler workflow system, on the other hand, supports scientists from different domains including biology, ecology, and astrophysics, in which users are allowed to construct workflows using local applications as well as external sources. Users need to manually wrap and add the resources an application needs to Kepler's library of actors before they can be used in a workflow. Therefore, complex ad hoc workflow orchestration using a large number of resources becomes daunting and prohibitive. However, for both these categories, data integration for meshing data/tables from multiple sites needs to be handled manually, for which no specific support is extended.

The third and final category of systems is designed for specific functions or limited objectives, unlike the general-purpose systems in the former two categories. BioGuideSRS, SeaLife [54], Anduril [1], Goober [2], S3QL [7], and LIFEdb [55] are representative of this class. For example, BioGuideSRS is an effective biological information integration system designed for a set of specific resources. But the precomputed path-based approach to querying database resources is not flexible enough compared to SQL-based BioFlow, where ad hoc integration with arbitrary resources is supported. The system mostly allows prefabricated queries that are abstracted based on their assumed usefulness and need. Similar comments can be made for LIFEdb, Goober, and Anduril, because these systems are also designed for a specific set of databases and for specific goals, although they heavily rely on data integration and workflow construction. Compared to these systems, SeaLife is even more specific. Web service discovery [56], [57] becomes essential when users are not well aware of the resources they need to access, or when better resources become available and could potentially improve the quality of information being gathered. SeaLife is a system designed to connect user needs, specified in the form of keywords, to registered services, while the system in [58] offers the same functionality using NLP techniques.

In contrast to all the above systems, VizBuilder distinguishes itself mainly as a visual language hybrid that supports both abstract and ad hoc workflow querying and data integration over arbitrary resources. It is not designed for a specific function or application either. It has built-in abstractions for workflows, data integration, local functions, and databases, combined with visual programming capabilities that are not as generic as some visual languages, yet lightweight and powerful enough to tackle a wide range of computational pipeline construction for life sciences research, all using high-level declarative abstractions.

8.3 Current Limitations of VizBuilder

VizBuilder is designed as a front-end user interface for the LifeDB system to capture BioFlow programs visually. Therefore, to use it for application design, users will be required to also install LifeDB. VizBuilder is also not a system for arbitrary programming; rather, it is a database workflow and data integration language, and therefore "exotic" application development that requires features not supported by traditional first-normal-form relational databases


4. A comprehensive survey of visual query language systems can be found in [40].


may not be feasible. The current release of VizBuilder also does not support the XML format, although an extension is being planned. The technical problem is that, once supported, users of VizBuilder may want to process XML and relational data in the same query, thereby necessitating a uniform treatment of these formats, which is difficult.

While VizBuilder is a powerful system, its functionality is limited by the technical qualities of the family of schema matching, wrapping, and entity recognition systems it uses for data integration purposes. As in all live, ad hoc, and best effort [34], [32], [33] integration systems, some applications may fail to access resources if the underlying matching, wrapping, or entity recognition process fails due to its limitations. Currently, most wrappers and matchers do not function well for multipage documents, and some do not function well for HTML documents that involve nonstandard practices. We refer readers to [59] for a detailed discussion of some of the current and known limitations of LifeDB.

9 FUTURE PLANS

Our goal in developing VizBuilder is to free non-computer-savvy users from thinking in terms of a textual query language, which may be intimidating. Instead, such users are encouraged to conceive and define an application using common sense algorithmic concepts in a platform- and language-independent manner by appropriately sequencing logical steps using visual icons which they can describe. From the system's viewpoint, we also wanted to allow language evolution so that future changes in BioFlow would not require substantial modifications or a complete redesign of VizBuilder logic. The key to the platform independence we seek to achieve is a clear separation of the application logic and its representation layer from the logic and representation of the execution layer. None of the current approaches, for example, domain-specific modeling (DSM), the unified modeling language (UML), and visual languages (VL), support this separation and capability adequately.

For example, UML is so general that often it is not useful for actual code generation. Empirical studies [60], [61] have consistently shown that only about half of all development projects use UML methods. Among those, over 50 percent either modify the methods to better fit their needs or even develop their own methods [62], [63]. Due to its lack of support for higher-level abstraction and its general-purpose nature, UML does little to help software developers in terms of solving the problem in domain terms. DSM introduces the idea of metaCASE tools [64]. In standard CASE tools, the method supported by the tool is fixed, and so it cannot be changed. In a metaCASE tool, there is complete freedom to change the method, or even to develop an entirely new method. More importantly, metaCASE tools allow automatic, full code generation, similar to the way today's compilers generate assembler from another programming language. Unfortunately, this apparent advantage creates the problem that users now must design a domain-specific language that the system will understand.

Visual languages [21], [65], [66], on the other hand, try to bridge the gap between DSM and UML by leveraging the advantages of both in a way similar to HyperFlow [44]. They try to provide a fixed set of abstractions, for which they include language constructs, notations, and code generators. By doing so, they provide application-specific modeling support with one-step code generation. Our goal is to overcome the limitations of UML, DSM, and VL, and bridge conceptual-level applications by end users to language-specific code generation as transparently as possible. We have noted previously, however, that some visual languages can actually be used to develop sophisticated applications for analytics in the life sciences, but most often they are somewhat weak in supporting computational pipeline applications involving heterogeneous databases that are commonplace in this domain.

In VizBuilder, we currently require that a mapping from a lower-level language, i.e., BioFlow, for which the code has to be generated, must be supplied first. VizBuilder uses that mapping in the deployment phase to instantiate the icon behaviors and panels so that appropriate inputs and structures can be captured from users. Although this is a significant departure from most visual languages, this separation is still not enough to support 1) complete independence from lower-level languages, and 2) multiple language and execution platforms. To understand this, consider the scenario where an application A has already been designed and mapped to a script S based on a language L using the mapping μ supplied to the system. Now, if language L changes or is modified, the script S will be inapplicable and will not execute, because the application is directly tied to the language L through the mapping μ. Once L has changed, μ will also change, requiring a change in S as well. On the other hand, if the application A is mapped to a conceptual language C, which is stable, and the mapping μ for L is used to map C to S, changing L will not require A to be redesigned, because a change in L only requires us to replace μ with μ̂ and map C using μ̂ to S′ without making the user aware. This separation into application view, conceptual view, and physical view supports logical and physical independence in a way similar to the database domain. This independence also allows finished product design on multiple platforms for the same application. All we now need to do is use mappings μ1 through μk for languages L1 through Lk. The general idea is depicted in Fig. 11, which we plan to pursue in future research.
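A minimal sketch of this indirection treats the conceptual language C as a list of abstract steps and each mapping as a dictionary of code templates. All operation names and templates below are invented for illustration:

```python
# The application A stays in a stable conceptual form; swapping the
# language mapping regenerates the script without redesigning A.
application = [("read", "genes"), ("write", "results")]   # conceptual view C

def generate(program, mapping):
    """Map the conceptual program to a concrete script via a mapping mu."""
    return "\n".join(mapping[op].format(arg) for op, arg in program)

mu = {"read": "SELECT * FROM {};", "write": "INSERT INTO {} VALUES ...;"}
mu_hat = {"read": "READ {};", "write": "WRITE {};"}       # language L changed

script = generate(application, mu)
new_script = generate(application, mu_hat)  # same A, only the mapping swapped
```

Regenerating the script requires no change to the application itself, which is exactly the logical independence the three-view separation is after.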

10 CONCLUSION

We have introduced a novel visual platform for computational biology application development that supports data

616 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 3, MAY/JUNE 2013

Fig. 11. Enhanced VizBuilder architecture.


integration and computational pipelines in user-transparent ways. With our visual interface VizBuilder, users are now able to conceive applications in terms of conceptual constructs like read, conditional, and repeat, and resources such as tables and websites, and compose them by stitching together logical descriptions of smaller subparts. VizBuilder hides BioFlow language-specific idiosyncrasies, such as the extract statement and its various options, and encourages users to think in terms of the behavior of the operation. VizBuilder is probably the first such application builder for systems biology to address, in a comprehensive manner, data integration and computational pipelines that are so critical to this domain.

Our departure from DSM, UML, and VL in general, and our future plans for VizBuilder, coupled with the power of ad hoc querying supported in BioFlow, make it possible to embrace it as a viable computational platform, as it allows most needed functionalities in computational biology. As we extend BioFlow to include the power of declarative querying of gene expression resources [12], interaction [67], [11], and pathway data [68], the extensions we envision for VizBuilder, discussed in Section 9, will soon morph it into an even more powerful system for application development. Our commitment to continued improvement and maintenance of VizBuilder and LifeDB is also one of the major strengths that we hope the community will leverage.

ACKNOWLEDGMENTS

This research was partially supported by the National Science Foundation grant IIS 0612203. The author acknowledges the contributions of Shahriyar Hossain, who helped implement VizBuilder's first edition as an interface to our LifeDB system available at http://dblab.nkn.uidaho.edu:8080/lifedb/.

REFERENCES

[1] K. Ovaska, M. Laakso, S. Haapa-Paananen, R. Louhimo, P. Chen, V. Aittomaki, E. Valo, J. Nunez-Fontarnau, V. Rantanen, S. Karinen, K. Nousiainen, A.-M. Lahesmaa-Korpinen, M. Miettinen, L. Saarinen, P. Kohonen, J. Wu, J. Westermarck, and S. Hautaniemi, "Large-Scale Data Integration Framework Provides a Comprehensive View on Glioblastoma Multiforme," Genome Medicine, vol. 2, no. 9, article 65, 2010.

[2] W. Luo, M. Gudipati, K. Jung, M. Chen, and K.B. Marschke, "Goober: A Fully Integrated and User-Friendly Microarray Data Management and Analysis Solution for Core Labs and Bench Biologists," J. Integrative Bioinformatics, vol. 6, no. 1, article 108, 2009.

[3] B.H.J. van den Berg, J.H. Konieczka, F.M. McCarthy, and S.C. Burgess, "ArrayIDer: Automated Structural Re-Annotation Pipeline for DNA Microarrays," BMC Bioinformatics, vol. 10, article 30, 2009.

[4] K.J. Thompson, H. Deshmukh, J.L. Solka, and J.W. Weller, "A White-Box Approach to Microarray Probe Response Characterization: The BaFL Pipeline," BMC Bioinformatics, vol. 10, article 449, 2009.

[5] S.C. Boulakia, O. Biton, S.B. Davidson, and C. Froidevaux, "BioGuideSRS: Querying Multiple Sources with a User-Centric Perspective," Bioinformatics, vol. 23, no. 10, pp. 1301-1303, 2007.

[6] M.D. Wilkinson and M. Links, "BioMOBY: An Open Source Biological Web Services Proposal," Briefings in Bioinformatics, vol. 3, no. 4, pp. 331-341, 2002.

[7] H. Deus, M. Correa, R. Stanislaus, M. Miragaia, W. Maass, H. de Lencastre, R. Fox, and J. Almeida, "S3QL: A Distributed Domain Specific Language for Controlled Semantic Integration of Life Sciences Data," BMC Bioinformatics, vol. 12, no. 1, article 285, 2011.

[8] H.M. Jamil, A. Islam, and S. Hossain, "A Declarative Language and Toolkit for Scientific Workflow Implementation and Execution," Int'l J. Business Process Integration and Management, vol. 5, no. 1, pp. 3-17, 2010.

[9] S. Gupta, "A Unified Data Model and Declarative Query Language for Heterogenous Life Sciences Data," Technical Report SDSC TR-2011-3, San Diego Super Computing Center, UCSD, http://www.sdsc.edu/pub/techreports/SDSC-TR-2011-3-graphitti.pdf, 2011.

[10] M. Westergaard, "Better Algorithms for Analyzing and Enacting Declarative Workflow Languages Using LTL," Proc. Ninth Int'l Conf. Business Process Management, pp. 83-98, 2011.

[11] H.M. Jamil, "Design of Declarative Graph Query Languages: On the Choice between Value, Pattern and Object Based Representations for Graphs," Proc. ICDE Int'l Workshop Graph Data Management: Techniques and Applications (GDM '12), Apr. 2012.

[12] H.M. Jamil and A. Islam, "Managing and Querying Gene Expression Data Using Curray," Proc. BMC Conf., vol. 5, no. Supplement 2, article S10, Apr. 2011.

[13] A. Bhattacharjee, A. Islam, M.S. Amin, S. Hossain, S. Hosain, H.M. Jamil, and L. Lipovich, "On-the-Fly Integration and Ad Hoc Querying of Life Sciences Databases Using LifeDB," Proc. 20th Int'l Conf. Database and Expert Systems Applications (DEXA '09), pp. 561-575, 2009.

[14] S. Hossain and H.M. Jamil, "A Visual Interface for On-the-Fly Biological Database Integration and Workflow Design Using VizBuilder," Proc. Sixth Int'l Workshop Data Integration in the Life Sciences (DILS '09), pp. 157-172, July 2009.

[15] Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox, D. Gannon, C. Goble, M. Livny, L. Moreau, and J. Myers, "Examining the Challenges of Scientific Workflows," Computer, vol. 40, no. 12, pp. 24-32, 2007.

[16] T. McPhillips, S. Bowers, D. Zinn, and B. Ludascher, "Scientific Workflow Design for Mere Mortals," Future Generation Computer Systems, vol. 25, pp. 541-551, 2008.

[17] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E.A. Lee, J. Tao, and Y. Zhao, "Scientific Workflow Management and the Kepler System," Concurrency and Computation: Practice and Experience, vol. 18, no. 10, pp. 1039-1065, 2006.

[18] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. Pocock, A. Wipat, and P. Li, "Taverna: A Tool for the Composition and Enactment of Bioinformatics Workflows," Bioinformatics, vol. 20, no. 17, pp. 3045-3054, Nov. 2004.

[19] I. Taylor, M. Shields, I. Wang, and A. Harrison, "The Triana Workflow Environment: Architecture and Applications," Workflows for e-Science, I. Taylor, E. Deelman, D. Gannon, and M. Shields, eds., pp. 320-339, Springer, 2007.

[20] Microsoft Developer Network, "VPL Introduction," http://msdn.microsoft.com/en-us/library/bb483088.aspx, 2013.

[21] D.-Q. Zhang and K. Zhang, "VisPro: A Visual Language Generation Toolset," Proc. IEEE Symp. Visual Languages, pp. 195-202, 1998.

[22] J.-P. Tolvanen and M. Rossi, "MetaEdit+: Defining and Using Domain-Specific Modeling Languages and Code Generators," Proc. Companion of the 18th Ann. ACM SIGPLAN Conf. Object-Oriented Programming, Systems, Languages, and Applications, pp. 92-93, 2003.

[23] S. Bowers, B. Ludascher, A.H.H. Ngu, and T. Critchlow, "Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow," Proc. 22nd Int'l Conf. Data Eng. Workshops, pp. 70-72, 2006.

[24] K. Ehrig, C. Ermel, S. Hansgen, and G. Taentzer, "Generation of Visual Editors as Eclipse Plug-Ins," Proc. IEEE/ACM Int'l Conf. Automated Software Eng., pp. 134-143, 2005.

[25] G. Papadakis, E. Ioannou, C. Niederee, and P. Fankhauser, "Efficient Entity Resolution for Large Heterogeneous Information Spaces," Proc. Fourth ACM Int'l Conf. Web Search and Data Mining (WSDM '11), pp. 535-544, 2011.

[26] H. Kopcke, A. Thor, and E. Rahm, "Evaluation of Entity Resolution Approaches on Real-World Match Problems," Proc. VLDB Endowment, vol. 3, no. 1, pp. 484-493, 2010.

[27] M. Michelson and C.A. Knoblock, "Learning Blocking Schemes for Record Linkage," Proc. 21st Nat'l Conf. Artificial Intelligence (AAAI '06), 2006.

[28] S. Hosain and H.M. Jamil, "An Algebraic Foundation for Semantic Data Integration on the Hidden Web," Proc. IEEE Third Int'l Conf. Semantic Computing, pp. 237-244, Sept. 2009.



[29] L. Chen and H.M. Jamil, "On Using Remote User Defined Functions as Wrappers for Biological Database Interoperability," Int'l J. Cooperative Information Systems, vol. 12, no. 2, pp. 161-195, 2003.

[30] A. Bhattacharjee and H.M. Jamil, "OntoMatch: A Monotonically Improving Schema Matching System for Autonomous Data Integration," Proc. IEEE 10th Int'l Conf. Information Reuse and Integration (IRI '09), pp. 318-323, Aug. 2009.

[31] M.S. Amin and H.M. Jamil, "An Efficient Web-Based Wrapper and Annotator for Tabular Data," Int'l J. Software Eng. and Knowledge Eng., vol. 20, no. 2, pp. 215-231, 2010.

[32] W. Shen, P. DeRose, R. McCann, A. Doan, and R. Ramakrishnan, "Toward Best-Effort Information Extraction," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 1031-1042, 2008.

[33] M.J. Cafarella, A.Y. Halevy, and N. Khoussainova, "Data Integration for the Relational Web," Proc. VLDB Endowment, vol. 2, no. 1, pp. 1090-1101, 2009.

[34] T. Cheng and K.C.-C. Chang, "Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web," Proc. Third Conf. Innovative Data Systems Research (CIDR '07), pp. 108-113, 2007.

[35] Y. Sismanis, P. Brown, P.J. Haas, and B. Reinwald, "GORDIAN: Efficient and Scalable Discovery of Composite Keys," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 691-702, 2006.

[36] D. Gusfield and J. Stoye, "Relationships between p63 Binding, DNA Sequence, Transcription Activity, and Biological Function in Human Cells," Molecular Cell, vol. 24, no. 4, pp. 593-602, 2006.

[37] JUNG: Java Universal Network/Graph Framework, "Overview," http://jung.sourceforge.net, 2013.

[38] Microsoft Developer Network, "XAML Overview (WPF)," http://msdn.microsoft.com/en-us/library/ms752059.aspx, 2013.

[39] J.M. Patel, "The Role of Declarative Querying in Bioinformatics," OMICS, vol. 7, no. 1, pp. 89-92, 2003.

[40] T. Catarci, M.F. Costabile, S. Levialdi, and C. Batini, "Visual Query Systems for Databases: A Survey," J. Visual Languages and Computing, vol. 8, no. 2, pp. 215-260, 1997.

[41] G. Santucci and P.A. Sottile, "Query by Diagram: A Visual Environment for Querying Databases," Software: Practice and Experience, vol. 23, no. 3, pp. 317-340, 1993.

[42] K. Siau, H. Chan, and K. Tan, "Visual Knowledge Query Language as a Front-End to Relational Systems," Proc. Ann. Int'l Computer Software and Applications Conf., pp. 373-378, Sept. 1991.

[43] S. Dogru, V. Rajan, K. Rieck, J.R. Slagle, B.S. Tjan, and Y. Wang, "A Graphical Data Flow Language for Retrieval, Analysis, and Visualization of a Scientific Database," J. Visual Languages and Computing, vol. 7, no. 3, pp. 247-265, 1996.

[44] D. Dotan and R.Y. Pinter, "HyperFlow: An Integrated Visual Query and Dataflow Language for End-User Information Analysis," Proc. IEEE Symp. Visual Languages and Human-Centric Computing, pp. 27-34, 2005.

[45] N. Murray, N.W. Paton, C.A. Goble, and J. Bryce, "Kaleidoquery: A Flow-Based Visual Language and Its Evaluation," J. Visual Languages and Computing, vol. 11, no. 2, pp. 151-189, 2000.

[46] P.M.K. Gordon and C.W. Sensen, "Seahawk: Moving Beyond HTML in Web-Based Bioinformatics Analysis," BMC Bioinformatics, vol. 8, article 208, June 2007.

[47] M.D. Wilkinson, B.P. Vandervalk, and E.L. McCarthy, "The Semantic Automated Discovery and Integration (SADI) Web Service Design-Pattern, API and Reference Implementation," J. Biomedical Semantics, vol. 2, article 8, 2011.

[48] A.-L. Lamprecht, T. Margaria, and B. Steffen, "Bio-jETI: A Framework for Semantics-Based Service Composition," BMC Bioinformatics, vol. 10, no. Suppl 10, article 8, 2009.

[49] V. Martin-Requena, J. Ríos, M. García, S. Ramírez, and O. Trelles, "jORCA: Easily Integrating Bioinformatics Web Services," Bioinformatics, vol. 26, no. 4, pp. 553-559, 2010.

[50] P. Sztromwasser, P. Puntervoll, and K. Petersen, "Data Partitioning Enables the Use of Standard SOAP Web Services in Genome-Scale Workflows," J. Integrative Bioinformatics, vol. 8, no. 2, article 163, 2011.

[51] T. Paterson and A. Law, "An XML Transfer Schema for Exchange of Genomic and Genetic Mapping Data: Implementation as a Web Service in a Taverna Workflow," BMC Bioinformatics, vol. 10, article 252, 2009.

[52] P. Romano, D. Marra, and L. Milanesi, "Web Services and Workflow Management for Biological Resources," BMC Bioinformatics, vol. 6, no. Suppl 4, article S24, 2005.

[53] D. Blankenberg, N. Coraor, G. Von Kuster, J. Taylor, and A. Nekrutenko, "Integrating Diverse Databases into an Unified Analysis Framework: A Galaxy Approach," Database, http://database.oxfordjournals.org/content/2011/bar011.full, Jan. 2011.

[54] K. Sutherland, K. McLeod, G. Ferguson, and A. Burger, "Knowledge-Driven Enhancements for Task Composition in Bioinformatics," BMC Bioinformatics, vol. 10, no. Suppl 10, article 12, 2009.

[55] A. Mehrle, H. Rosenfelder, I. Schupp, C. del Val, D. Arlt, F. Hahne, S. Bechtel, J. Simpson, O. Hofmann, W. Hide, K.-H. Glatting, W. Huber, R. Pepperkok, A. Poustka, and S. Wiemann, "The LIFEdb Database in 2006," Nucleic Acids Research, vol. 34, pp. 415-418, 2006.

[56] X. Wang and W.A. Halang, Discovery and Selection of Semantic Web Services, vol. 453, Springer, 2013.

[57] M. Crasso, A. Zunino, and M. Campo, "A Survey of Approaches to Web Service Discovery in Service-Oriented Architectures," J. Database Management, vol. 22, no. 1, pp. 102-132, 2011.

[58] A. Adala, N. Tabbane, and S. Tabbane, "A Framework for Automatic Web Service Discovery Based on Semantics and NLP Techniques," Advances in Multimedia, vol. 2011, 2011.

[59] H.M. Jamil, "Integrating Large and Distributed Life Sciences Resources for Systems Biology Research: Progress and New Challenges," Trans. Large-Scale Data- and Knowledge-Centered Systems, vol. 3, no. 6790, pp. 208-237, 2011.

[60] C.R. Necco, C.L. Gordon, and N.W. Tsai, "Systems Analysis and Design: Current Practices," MIS Quarterly, vol. 11, no. 4, pp. 461-476, 1987.

[61] B. Fitzgerald, "The Use of Systems Development Methodologies in Practice: A Field Study," Information Systems J., vol. 7, no. 3, pp. 201-212, 1997.

[62] N. Russo, J. Wynekoop, and D. Walz, "The Use and Adaptation of System Development Methodologies," Proc. Int'l Conf. Information Resources Management Assoc. (IRMA '95), May 1995.

[63] C.J. Hardy, J.B. Thompson, and H.M. Edwards, "The Use, Limitations and Customization of Structured Systems Development Methods in the United Kingdom," Information and Software Technology, vol. 37, no. 9, pp. 467-477, 1995.

[64] H. Isazadeh and D. Lamb, "CASE Environments and MetaCASE Tools," Technical Report 1997-403, Feb. 1997.

[65] W. Citrin, S. Ghiasi, and B. Zorn, "VIPR and the Visual Programming Challenge," J. Visual Languages and Computing, vol. 9, pp. 241-258, 1998.

[66] R. Bardohl and H. Ehrig, "Conceptual Model of the Graphical Editor GenGEd for the Visual Definition of Visual Languages," Proc. Theory and Application of Graph Transformations Conf., pp. 252-266, 1998.

[67] M.S. Amin, R.L. Finley Jr., and H.M. Jamil, "Top-k Similar Graph Matching Using TraM in Biological Networks," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1790-1804, Nov./Dec. 2012.

[68] K.Z. Sultana, A. Bhattacharjee, and H.M. Jamil, "Querying KEGG Pathways in Logic," to be published in Int'l J. Data Mining and Bioinformatics.

Hasan M. Jamil received the BS and MS degrees in applied physics and electronics from the University of Dhaka, Bangladesh, in 1982 and 1984, respectively, and the PhD degree in computer science from Concordia University, Canada, in 1996. His current research interests include the areas of databases, bioinformatics, and knowledge representation. In particular, he is interested in the management and querying of interaction and gene expression data, and their applications in disease gene prioritization and unfolded protein response. He is an associate professor in the Department of Computer Science, University of Idaho. He was previously on the faculty of Macquarie University, Sydney, Australia, Mississippi State University, and Wayne State University. He is a member of the IEEE, the Association for Computing Machinery, the ACM Special Interest Group on Management of Data, the Association for Logic Programming, and the International Society for Computational Biology.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
