SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF
BIOINFORMATIC DATA AND APPLICATIONS
A Dissertation
Submitted to the Graduate School
of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
by
Xiaorong Xiang, B.S., M.S.
Gregory R. Madey, Director
Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
April 2007
© Copyright by
Xiaorong Xiang
2007
All Rights Reserved
SERVICE-ORIENTED ARCHITECTURE FOR INTEGRATION OF
BIOINFORMATIC DATA AND APPLICATIONS
Abstract
by
Xiaorong Xiang
Service-oriented architecture (SOA) is a new paradigm for future distributed
computing that originated in industry. It is recognized as a promising architecture
for application integration within and across organizations. Since their introduction,
semantic web and web services technologies have gained increasing interest
for the implementation of e-Science infrastructures. In this dissertation, we survey
current research trends and challenges in adopting SOA in general. We present
a practical experiment in building a service-oriented system for data integration
and analysis using current web services technologies and bioinformatics middleware.
The system is enhanced with an ontological model for semantic annotation
of services and data. It demonstrates that adopting SOA in the e-Science field
can accelerate the scientific research process. A new methodology and an enhanced
system design are proposed to facilitate the reuse of workflows and verified
knowledge.
DEDICATION
To my parents, my husband, and my son
ii
CONTENTS
FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Main contributions of the dissertation . . . . . . . . . . . . . . . . 4
1.2 Organization of the dissertation . . . . . . . . . . . . . . . . . . . 7

CHAPTER 2: RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED COMPUTING . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Overview of related concepts and technologies . . . . . . . . . . . 10
2.2.1 Web services . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Semantic web . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Grid computing . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Peer-to-peer computing . . . . . . . . . . . . . . . . . . . . 15
2.3 Issues in service-oriented computing . . . . . . . . . . . . . . . . . 16
2.3.1 Service description . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Service discovery . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Service composition . . . . . . . . . . . . . . . . . . . . . . 29
2.3.4 Service execution . . . . . . . . . . . . . . . . . . . . . . . 32
2.4 Service-oriented computing in e-Science . . . . . . . . . . . . . . . 34
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

CHAPTER 3: A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS ENVIRONMENT FOR BIOINFORMATICS RESEARCH . . . . 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 Use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Operational barriers . . . . . . . . . . . . . . . . . . . . . 51
3.4 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 Data storage and access service . . . . . . . . . . . . . . . 54
3.4.2 Service and workflow registry . . . . . . . . . . . . . . . . 55
3.4.3 Indexing and querying metadata . . . . . . . . . . . . . . . 56
3.4.4 Service and workflow enactment . . . . . . . . . . . . . . . 57
3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.1 Development and deployment tools . . . . . . . . . . . . . 59
3.5.2 Services provision . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.3 Workflow engine . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.4 Building workflows . . . . . . . . . . . . . . . . . . . . . . 62
3.5.5 Web interface . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6.1 Issues with the first prototype . . . . . . . . . . . . . . . . 65
3.6.2 Extension of the system . . . . . . . . . . . . . . . . . . . 67
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

CHAPTER 4: EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH THE MOGSERV . . . . . . . . . . . . . . . . . . . . . . 73
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 System and methods . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 Data model . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.2 Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.3 Data collection . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.4 Local query . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.5 Set management . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.6 ClustalW . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.7 Blast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.8 Phylip and Paup . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.9 Data conversion . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Results of case studies . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.1 Case study: the rediscovery of Erythrobacter litoralis . . . 88
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

CHAPTER 5: ONTOLOGICAL REPRESENTATION MODEL . . . . . . 91
5.1 The MoG life sciences project and biomedical application . . . . . 92
5.2 Ontological representation model . . . . . . . . . . . . . . . . . . 93
5.2.1 RDF, OWL, and DIG reasoner . . . . . . . . . . . . . . . 94
5.2.2 Generic service description ontology . . . . . . . . . . . . . 97
5.2.3 Service domain ontology . . . . . . . . . . . . . . . . . . . 98
5.2.4 MoG application domain ontology . . . . . . . . . . . . . . 99
5.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

CHAPTER 6: IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 A hierarchical workflow structure . . . . . . . . . . . . . . . . . . 109
6.3 An enhanced workflow system . . . . . . . . . . . . . . . . . . . . 113
6.3.1 Knowledge management . . . . . . . . . . . . . . . . . . . 116
6.3.2 Knowledge discovery . . . . . . . . . . . . . . . . . . . . . 120
6.4 Translation process . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.4.1 Service discovery and matchmaking process . . . . . . . . 120
6.4.2 Knowledge reuse . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.3 Implementation and evaluation . . . . . . . . . . . . . . . 124
6.5 Workflow reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . 129

CHAPTER 7: SUMMARY AND FUTURE WORK . . . . . . . . . . . . 131
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 Limitations and future work . . . . . . . . . . . . . . . . . . . . . 132

APPENDIX A: GLOSSARY . . . . . . . . . . . . . . . . . . . . . . . . . 135
A.1 Pictures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

APPENDIX B: MOGSERV MANUAL . . . . . . . . . . . . . . . . . . . 141
B.1 Main . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
B.2 Retrieve genome and gene data from the NCBI database . . . . . 141
B.3 Query local database . . . . . . . . . . . . . . . . . . . . . . . . . 141
B.4 Set management . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
B.5 Data analysis services . . . . . . . . . . . . . . . . . . . . . . . . . 143
B.6 Job management . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

APPENDIX C: DEVELOPMENT AND DEPLOYMENT TOOLKITS . . 155

APPENDIX D: SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
D.1 Complete genome sequence in XML . . . . . . . . . . . . . . . . . 157
D.2 Example of an ATP synthase subunit B sequence . . . . . . . . . 159
D.3 Protein name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
D.4 Syntax of searching the local database . . . . . . . . . . . . . . . 160
D.5 Workflow of retrieving sequences . . . . . . . . . . . . . . . . . . 160
D.6 ClustalW input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
D.7 Blast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
D.8 PAUP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

APPENDIX E: SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
FIGURES
1.1 The evolution of the Web: yesterday's web was a repository for text and images; today's web is a platform to publish and access dynamically changing new types of content provided by a variety of services. [8] . . . 2

2.1 Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester. . . . 12

2.2 The web services standards stack includes multiple layered and interrelated open standards. . . . 13

2.3 Venn diagram representation of the integration of web services, grid computing, semantic web, and peer-to-peer technologies into the realization of a service-oriented architecture . . . 17

2.4 A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes. . . . 18

2.5 Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates the requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters' needs. . . . 24

2.6 P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers. . . . 25

2.7 Summary of existing service discovery systems with different discovery mechanisms, mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic operation. . . . 28
3.1 A manual phylogenetic data collection and data analysis process . 50
3.2 MoGServ system architecture includes a service access client, the MoGServ middle layer, and other data and service providers . . . 54

3.3 Asynchronous service and workflow invocation model . . . 58

3.4 A workflow built using the Taverna workbench to get complete genome sequences and specific gene sequences . . . 71

3.5 A workflow for querying two subset sequences from the local database, filtering out sequences coming from the same organism, and performing sequence alignment analysis . . . 72

3.6 Abstraction of user-defined workflows . . . 72

4.1 The growth of sequence databases (NCBI GenBank and EBI Swiss-Prot) and annotations. This figure is from Folker Meyer [57] . . . 76

4.2 Entity relationship diagram of the data model in MoGServ, created by SQL::Translator . . . 90

5.1 An RDF graph model representing some of the information describing the MoG project web site . . . 95

5.2 Main concepts and partial relationships defined in the MoG application domain ontology . . . 101

5.3 The software component implementation of annotating and querying metadata . . . 105

6.1 A four-level hierarchical workflow structure representation and transformation of scientific processes . . . 109

6.2 An example illustrating the user-oriented workflow definition with different levels of knowledge . . . 112

6.3 An enhanced workflow system with two added components: knowledge management and knowledge discovery . . . 115

6.4 The mismatching problem may be introduced by inaccurate annotation, incomplete semantic annotation, and inaccurate ontological reasoning during the translation process. . . . 122

6.5 The creation process of the connectivity graph when a new service is added to the registry; the connectivity is refined and updated during the workflow translation process. . . . 124

6.6 The graph representation of a workflow for describing a scientific process . . . 128

A.1 Timeline for the origin of life and the major invasions giving rise to mitochondria and plastids. [27] . . . 137
A.2 Gene transfer to the nucleus. [27] . . . 138

A.3 Symbiosis process [69] . . . 139

A.4 ATP synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny. . . . 140

B.1 The main menu of MoGServ . . . 142

B.2 A web interface that provides users a way to define data of interest. . . . 143

B.3 Input the query term from this interface and choose the gene or genome database . . . 144

B.4 The results of querying the local database . . . 146

B.5 Users may copy and paste particular sequences and upload them to the local database . . . 147

B.6 Set information . . . 148

B.7 The set filter service is used to find the intersection of organisms among multiple sets. . . . 149

B.8 tblastn interface in MoGServ . . . 150

B.9 ClustalW interface in MoGServ . . . 151

B.10 The job management interface shows the status, input link, and output link of a job . . . 152

B.11 An example input of a ClustalW analysis; the set id is a hyperlink through which users can view the sequence information in the set. . . . 153

B.12 An example output of a ClustalW analysis; users can download, convert, and view the results. . . . 154

D.1 Phylogenetic tree generated by PAUP . . . 166

D.2 Phylogenetic tree file generated by PAUP can be viewed with other programs . . . 167

E.1 The WSDL description of the QueryLocal service hosted in MoGServ, which provides an operation to create a set in the local database. This operation accepts two parameters and returns the set id. . . . 170

E.2 An example of using the Taverna workbench to create, test, and run a workflow. This workflow accepts user input, searches the local database, creates a set, aligns the set using ClustalW, and converts the ClustalW result to NEXUS format, which can be fed to PAUP. . . . 171

E.3 XScufl workflow format representing the workflow created with the Taverna workbench. . . . 172
E.4 Annotation of job and set information using the defined ontological model. The sample RDF file is displayed using RDF Gravity. . . . 173

E.5 Annotation of a service using the defined ontological model. The sample RDF file is displayed using RDF Gravity. . . . 174
TABLES
2.1 SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES . . . 20

2.2 EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES . . . 33

2.3 LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES . . . 36

3.1 ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION . . . 56

6.1 PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS . . . 125

6.2 PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS . . . 126

C.1 OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT . . . 156

D.1 NAME OF ATP SYNTHASE . . . 161

D.2 SYNTAX OF SEARCHING LOCAL DATABASE . . . 161

D.3 INDEXING FIELDS OF LOCAL DATABASE . . . 162
ACKNOWLEDGMENTS
I would like to thank Dr. Gregory Madey for his encouragement and guidance
on my research. I thank him for always saying "Life is short," and for his kindness,
patience, and confidence in me. I appreciate that he gave his students as much
freedom as possible in selecting research topics in our best interests, and that he
sought collaborative opportunities to help us fulfill our goals. His spirit of never
ceasing to learn new material and never fearing to explore new research areas
encouraged me on the way to finishing this dissertation and will encourage me
in my future work. I thank him for his efforts to educate us as independent
researchers in numerous ways.
Many thanks go to Dr. Jeanne Romero-Severson for providing me with use cases
and training in the biological field, and for her prompt feedback on my work. I
would like to thank Dr. Amitabh Chaudhary for answering my questions about
algorithms and for discussions about my research topics.
I would like to thank my committee members Dr. Patrick J. Flynn, Dr. Aaron
Striegel, and Dr. Jeanne Romero-Severson for their valuable contributions.
I would also like to thank my son for trying hard not to bother me too much
while I was busy working, and for giving me excuses to relax. Many thanks go to
my husband, my parents, and my friends for their emotional support, always, no
matter how much frustration I had.
This research work is partially supported by the Indiana Center for Insect
Genomics (ICIG) with funding from the Indiana 21st Century Fund.
CHAPTER 1
INTRODUCTION
Since the first generation of the World Wide Web (the Web) appeared in 1990,
it mainly served as a repository for text and images presented in HTML format.
Nowadays, the Web is evolving as a platform to publish and access dynamically
changing new types of content provided by a variety of services that are realized
with web-accessible programs, databases, and physical devices. Tim Berners-Lee
et al. [8] present the evolution of the Web (see Figure 1.1); the authors emphasize
the importance of "understanding the current, evolving, and potential Web" in
the article "Creating a Science of the Web".
The Web has been used in e-commerce and Business-to-Business (B2B) ap-
plications to deliver information and provide services to customers and business
partners. For example, a travel agency provides services for travelers to view and
compare airfares and book tickets and hotels online. As service transactions
between businesses increase, there is a growing demand for interoperability
between these applications, and the service-oriented architecture (SOA) has been
proposed as an underlying architecture to provide this capability. Although many,
often non-standard, definitions exist, the service-oriented architecture is commonly
accepted as a new architectural style that enables the combination of, and
communication among, loosely coupled services. These services are described with
a standard interface definition that hides the implementation language and
Figure 1.1. The evolution of the Web: yesterday's web was a repository
for text and images; today's web is a platform to publish and access
dynamically changing new types of content provided by a variety of
services. Adapted from "Creating a Science of the Web" by Tim
Berners-Lee et al. [8]
platform of services in a SOA. A service can be called to perform a task without
having prior knowledge of the calling application, and without the application
having or needing knowledge of how the service actually performs its tasks.
The realization of a service-oriented architecture is not tied to specific technologies
and protocols. The web service standards, including SOAP, WSDL, and
UDDI, have been widely accepted as a realization of SOA, with support from
a number of tools. Therefore, the service-oriented architecture is often defined
as services exposed using this web service protocol stack, and a SOA-based system
can be referred to as a system developed using these technologies. Building
a SOA-based system can help businesses respond more quickly and cost-effectively
to changing market conditions. It promotes the reuse of existing legacy
applications as services and simplifies the interconnection of distributed business
processes within and across organizational boundaries.
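To make the protocol stack concrete, the sketch below assembles a minimal SOAP 1.1 request envelope using only the Python standard library. The getSequence operation, the accession parameter, and the example.org namespace are hypothetical, invented purely for illustration; they do not correspond to any service described in this dissertation.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (standard).
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical target namespace for an illustrative service.
SVC_NS = "http://example.org/mogserv"

def build_soap_request(operation, params):
    """Build a SOAP 1.1 request envelope as a UTF-8 byte string."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element lives in the service's own namespace.
    op = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(op, f"{{{SVC_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

request = build_soap_request("getSequence", {"accession": "NC_000913"})
```

Such an envelope would be POSTed to the service endpoint over HTTP and the response parsed the same way; in practice, a WSDL-aware toolkit generates this plumbing from the service's interface description, so neither requester nor provider writes envelopes by hand.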
As stated in the article [8], the Web "has changed the ways scientists communicate,
collaborate, and educate". The evolution of Web use in the e-Science field
parallels its evolution in the e-business domain. The effort to build e-Science
infrastructure started with the development of gateways or portals that provide
access to integrated databases and computing resources behind a web-based user
interface in multiple scientific fields.
Examples of this kind of science include social simulations, physics, environmental
sciences and bioinformatics. This infrastructure has been used to solve problems
such as distributed physical or astronomic data analysis, and remote access of the
information source and simulations. It facilitates the use of the computational
resources located in different physical sites, thereby allowing users at different
locations to easily share information and communicate with each other. More
recently, the service-oriented architecture, combined with semantic web, peer-to-peer
(P2P) computing, and grid computing technologies, has been identified as a
promising way to build such infrastructures for supporting e-Science, by providing
access to heterogeneous computational resources and integration of distributed
scientific and engineering applications developed by individual scientists and
groups [91] [93].
With the promising future of adopting the service-oriented architecture in
e-Science and e-business, a number of challenges arise: integrating independently
developed data systems without requiring global agreement on terms and concepts,
efficiently allocating computational resources, and addressing security and privacy
issues in accessing shared data resources. These challenges attract researchers
from diverse areas such as information retrieval, database systems, artificial
intelligence, software engineering, and distributed computing.
1.1 Main contributions of the dissertation
Our research work starts with an investigation of current research trends and
challenges in the SOA area. In order to discover best practices for building
SOA based systems, we demonstrate our design and implementation of a SOA
based system to support scientific research and increase productivity. It serves
as a prototype for our future research work in this field as well as an in-silico
investigation platform for scientists. A particular scientific domain – studying
the deep phylogeny of the plant chloroplast – is applied in this prototype. This
application shows that a SOA-based system can help scientists achieve research
goals that would be difficult, if not impossible, without such a system. We conduct our
research from both practical and theoretical aspects. We propose a hierarchical
structure for workflow by integrating semantic web technology to improve the
reuse of workflows. To address the security and resource allocation issue, we
propose integrating the current system with an existing grid computing platform.
The main contributions of this dissertation are:
A survey and analysis of current trends and research challenges in
the service-oriented architecture: Grid computing, peer-to-peer computing
(P2P), and semantic web technologies are related to SOA. A recently proposed grid
standard, the Open Grid Services Architecture (OGSA), built upon the service-oriented
architecture, demonstrates the convergence of grid computing with SOA. Semantic
web technology is used in grid services and SOA to enhance the automation of
scientific and engineering computational workflows. Applying P2P technology
in SOA makes service discovery and enactment more scalable than centralized
approaches. Much research has been done exploring the convergence of these
technologies so as to make this new distributed computing paradigm successful.
We present our investigation of the research issues and challenges in SOA. Our
discussion of open issues and future research trends focuses on several critical
aspects in SOA: service discovery, service composition, and service enactment.
A Service-oriented data integration and analysis environment for
In Silico experiments and bioinformatics research: As more public data
providers begin to provide their data in web service format to facilitate
better data integration in the bioinformatics community, we designed and
implemented a service-oriented architecture that integrates data and services to
support a deep phylogenetic study. This software environment focuses on repre-
senting both data access and data analysis as web services. We believe with this
common interface, it will be easy for other researchers who are interested in deep
phylogenetic analysis to integrate our data and services into their applications.
Based on a first prototype, we discuss several issues in the implementation and
indicate the possible integration with semantic web and grid computing technolo-
gies to address these limitations. We present a practical experiment of building a
service-oriented system upon current web services technologies and bioinformat-
ics middleware. The system allows scientists to extract data from heterogeneous
data sources and to generate phylogenetic comparisons automatically. This can
be difficult to accomplish with manual search tools, since sequence data is rapidly
accumulating and the process can be long and tedious.
An application for exploring the deep phylogeny of the plastids with
the SOA based system: To serve as an example and proof of concept that the
service-oriented architecture can help scientists increase their productivity and
solve more complex problems than is possible with traditional approaches, we
apply several use cases to the system. We detail the services provided in this
environment and illustrate the results which demonstrate that the environment
can help support scientific analysis and make new discoveries.
A methodology and a novel approach to facilitate the reuse of workflows
and the composition of services: Most current practical methodologies for
creating workflows rely heavily on users having complete knowledge and understanding
of individual services at a low level of description. Using semantic web
technology, services can be described with rich semantics, and recent research has
focused on supporting users in the discovery and composition of services by using
rich service annotations. Users can encapsulate a service in a workflow to achieve
particular goals based on the conceptual service definition, in semi-automatic or
automatic ways: users discover and select appropriate services to include in a
workflow based on the semantic and conceptual service definitions. This lifts the
burden on bioinformatics researchers of needing detailed knowledge and
understanding of each tool, service, and data type. Instead, more complex
middleware assists with the composition process and resolves incompatibilities
between services. Few approaches, however, consider the potential for complete
or partial reuse of existing workflows. We present a hierarchical workflow structure
with a four-level representation of workflow: abstract workflow, concrete workflow,
optimal workflow, and workflow instance. This four-level representation of
workflow provides more flexibility for the reuse of existing workflows. We believe
that reuse of complete or partial workflows takes advantage of the verified knowl-
edge learned in practice and can increase the soundness of the composed workflow.
We propose an ontological representation model of data and services, as well as
an approach that uses a graph matching algorithm to find similar workflows based
on their semantic annotations.
1.2 Organization of the dissertation
The rest of this dissertation is organized as follows: Chapter 2 introduces sev-
eral concepts and technologies related to SOA and discusses related research issues
and challenges. Chapter 3 presents the design and implementation of a SOA based
system for supporting bioinformatics research. Chapter 4 demonstrates a partic-
ular application that uses this system to discover new phylogenetic knowledge.
Chapter 5 presents an ontological model to annotate services and data; this
semantically enriched data makes reuse, sharing, and search-based experiments
easier to conduct. Chapter 6 proposes a methodology and a novel approach
that can facilitate the reuse of workflow and composition of services. Chapter 7
summarizes the dissertation and identifies potential future work.
CHAPTER 2
RESEARCH ISSUES AND CHALLENGES IN SERVICE-ORIENTED
COMPUTING
2.1 Introduction
The evolution of computing systems has progressed from monolithic through
client-server and 3-tier to N-tier architectures. The N-tier architecture layers
request and response calls among applications that may reside on multiple sites.
Service-oriented computing (SOC), a term frequently used interchangeably with
the service-oriented architecture (SOA), involves the service layers, functionality,
and roles described by SOA [70]. SOA can be considered a conceptual description
of a concrete implementation of a service-oriented computing infrastructure. It
is an emerging paradigm for distributed computing intended to enable system-
atic application-to-application interaction. Services are basic units on a service-
oriented computing platform. They are autonomous, platform-independent soft-
ware components that can be described, published, discovered, invoked, and com-
posed using standard protocols within and across organizational boundaries. A
service is a piece of work done by a service provider in order to produce desired
results for a service requester. Service providers and requesters are roles
played by software agents on behalf of their owners. The goal of this new dis-
tributed computing architecture is to enable interaction among loosely-coupled
software agents in a flexible and effective way.
SOC has been adopted in portal design, e-commerce, e-Science, legacy system
integration, and grid computing. One example is the integration of engineering
design processes, such as automobile and aircraft design, which typically involve
several partners located at different locations. These partners may be both coop-
erative and competitive. Successful engineering design requires well-coordinated
interactions between individuals or teams in specialized knowledge domains, together
with the exchange and integration of information and models, to achieve an optimal
design. However, a significant part of the design models and tools may contain
proprietary information that cannot be disclosed. Also, these models and tools are
normally written in a variety of programming languages and run on different plat-
forms. With service-oriented computing technologies, these models and tools can
be treated as black boxes and run at their original locations [5] [43].
Reusability, interoperability, security, and easy maintenance are major poten-
tial benefits of SOC.
• Reusability – services provide a higher-level standard abstraction that allows
the reuse of existing software.
• Interoperability – The standard abstraction of services enables the interop-
eration of software produced by different programmers and improves pro-
ductivity.
• Security – With the standard abstraction of services, software can be viewed
as a black box. The internal implementations or algorithms are not accessible
to competing partners.
• Maintenance – With the standard abstraction of services, changes to the
underlying implementation will not adversely impact the use of the services.
While the potential benefits of SOC are compelling, successful service-oriented
implementation requires solving several issues and challenges arising from these
promising features. These issues and challenges include service discovery, ser-
vice composition, and service invocation; monitoring the execution of services;
methodologies supporting service development, evaluation, and life-cycle man-
agement; approaches to guarantee quality, security, and reliability of services.
These challenges attract researchers from diverse research areas such as informa-
tion retrieval, database systems, artificial intelligence, software engineering, and
distributed computing.
In this chapter,1 we introduce several concepts and technologies related to
SOC and discuss related research issues and challenges.
2.2 Overview of related concepts and technologies
Several definitions of SOA are available; the W3C defines SOA as a form of
distributed systems architecture with the following properties [105]:
• The service is an abstracted, logical view of actual programs, databases, and
business processes.
• A service or a function is described using a description language.
• Services tend to use a small number of operations with relatively large and
complex messages.
• Services tend to be oriented toward use over a network.
1Portions of this chapter appear in “A Semantic Web Services Enabled Web Portal Architecture”, International Conference on Web Services (ICWS 2004) [108].
• Messages are sent through the interface in a platform-neutral, standardized
format; XML is the most obvious choice of format.
• The service is implemented as a software agent. The service is formally
defined in terms of the messages exchanged between provider agents and
requester agents, and not the properties of the agents themselves. By avoid-
ing any knowledge of the internal structure of an agent, one can incorporate
any software component or application that can be “wrapped” in message
handling code that allows it to adhere to the formal service definition.
There are two fundamental components in a basic service-oriented architecture,
as shown in Figure 2.1. A service requester at the right sends a service
request message to a service provider at the left. The service provider returns a
response message to the service requester. The request and subsequent response
connections are defined in some way that is understandable to both the service
requester and service provider.
2.2.1 Web services
Although there is no standard definition of “web services”, a web service is
generally considered one type of realization of SOA. Among various definitions,
we refer to the definition from W3C:
“A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.2”
2http://www.w3.org/TR/ws-arch/
Figure 2.1. Two basic components in a simple service-oriented architecture. A service requester at the right sends a service request message to a service provider at the left. The service provider returns a response message to the service requester.
Concrete software agents that implement an abstract service interface can be
written in different programming languages and can run on different platforms.
Since these concrete agents implement the same function defined in the abstract
interface, any change to the underlying implementation will not affect the use of
the service. A web service architecture is based upon many layered and interrelated
open standards and web technologies, as shown in Figure 2.2. The Web
Service Description Language (WSDL) defines the abstract interface of services.
The Simple Object Access Protocol (SOAP) is a protocol for exchanging messages
among requesters and providers. Universal Description, Discovery and Integra-
tion (UDDI) provides a standard registry for publishing, discovery, and reuse of
web services. WSDL, SOAP, and UDDI are core standards built on fundamental
web technologies including XML, TCP/IP, and FTP. There are also emerging
standards proposed for defining business, scientific, or engineering processes,
transactions, and security, e.g., BPEL4WS and WS-I. Two main styles of web services
are available: SOAP web services and REST (Representational State Transfer)3
web services. In this dissertation, we use the term “web services” to mean SOAP-style
web services.
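To make the message layer concrete, the following sketch constructs a minimal SOAP 1.1 envelope using only the Python standard library. The `getForecast` operation, its parameters, and the `urn:example:forecast` namespace are invented for illustration; a real client would obtain the operation name and binding address from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_soap_request(operation, params, service_ns):
    """Build a SOAP 1.1 envelope wrapping one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Hypothetical forecast operation; a real client would POST this envelope
# over HTTP to the endpoint given in the service's WSDL binding.
request = build_soap_request("getForecast", {"city": "South Bend"},
                             "urn:example:forecast")
```

The envelope itself carries no information about the provider's implementation language or platform, which is precisely the interoperability property discussed above.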
Figure 2.2. The web services standards stack includes multiple layered and interrelated open standards.
2.2.2 Semantic web
The vision of the semantic web is to represent units of web-based information
with well-defined and machine-understandable semantics so that intelligent soft-
ware agents can autonomously process them [7]. This information, including the
3http://www.xfront.com/REST-Web-Services.html
abstract descriptions of services, must be defined and linked in such a way that
it can be used for automation, sharing, integration, and reuse even when these
software agents are designed, developed, and owned by different groups or indi-
viduals. SOA, and more specifically web services, becomes a key component in
realizing the vision of the semantic web, since most sites on today’s web do not
merely provide static information but allow users to interact and generate dynamic
information through services. To make use of a web service, a software agent needs a
computer-interpretable description of the service.
Adding “meaningful” descriptions to the interface using semantic web technol-
ogy can avoid ambiguous interpretations of information and service descriptions
and increase the soundness of the results provided by service providers. The com-
bination of these two technologies results in the emergence of a new generation of
web services called semantic web services [54]. The proposed standards for knowl-
edge sharing and reuse in the semantic web range from the Resource Description
Framework (RDF) to the Web Ontology Language (OWL) [67]. These two stan-
dards have become W3C recommendations. The appearance of open source tools
that support creation, parsing, and reasoning using these standards makes the
addition of semantic web technology into SOC feasible.
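The core data model underlying RDF is the subject-predicate-object triple. The toy triple store below illustrates the idea of annotating a service with such triples; the vocabulary (`ex:hasInput`, `ex:hasOutput`, and the `BlastService` resource) is invented for illustration, whereas real systems would use published RDF/OWL vocabularies and a dedicated triple store.

```python
# A toy in-memory triple store: each fact is a (subject, predicate, object)
# triple, mirroring the RDF data model used for semantic annotation.
triples = {
    ("ex:BlastService", "rdf:type", "ex:SequenceAlignmentService"),
    ("ex:BlastService", "ex:hasInput", "ex:DNASequence"),
    ("ex:BlastService", "ex:hasOutput", "ex:AlignmentReport"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard,
    in the spirit of a SPARQL basic graph pattern."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]
```

A software agent can then ask, for example, which services produce alignment reports, without any knowledge of the service's internal implementation.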
2.2.3 Grid computing
Grid computing [32] is a computing platform that is intended to integrate
resources (both data and computational resources) from different organizations,
called virtual organizations, in a shared, coordinated and collaborative way to
solve large-scale science and engineering problems. The Globus toolkit [97] is
one implementation of the specifications for grid computing. It has become the
standard for grid middleware. The Open Grid Services Architecture (OGSA),
developed within the Global Grid Forum (GGF), builds upon service-oriented
architecture to describe a service-oriented grid computing environment for business
and scientific use. OGSA is based on several other web service technologies,
notably WSDL and SOAP. It is a distributed interaction and computing architecture
based around services, assuring interoperability on heterogeneous systems so that
different types of resources can communicate and share information.
The major goal of the grid computing platform is to provide an easy-to-use and
flexible computing infrastructure for supporting e-Science. The goal of e-Science
is to offer scientists and engineers an effective way to generate, analyze, and share
their experiments, data, instruments, computational tools, and results. The lack of
seamless automation of the scientific process remains a major gap between this vision
and reality. Grid computing shares some of these problems and technical challenges
with service-oriented computing in general. Incorporating semantic web technologies
into grid computing brings us a new concept, the semantic grid [21], which intends
to minimize this gap and solve the problem of achieving seamless integration and
automation of scientific and engineering workflows.
2.2.4 Peer-to-peer computing
Peer-to-Peer (P2P) computing has received significant attention due to the
popularity of P2P file-sharing systems such as Napster, Gnutella, Freenet, Morpheus,
BitTorrent, and KaZaa. Peers are autonomous agents that exchange information
in a completely decentralized manner. The P2P architecture does not have
a single point of failure. Since nodes contact each other directly, the information
they receive is up to date. The P2P model can provide an alternative for dynamic
service discovery without relying on centralized registries, as well as an alternative
for interaction between web services. We discuss the research done in this direction
in the following sections.
Semantic web technology enhances the capability for automation in SOA and
grid computing. Grid computing built upon SOA increases flexibility, while the P2P
computing model increases scalability and reliability. Figure 2.3 gives an overview
of current research trends that intend to use these technologies together.
2.3 Issues in service-oriented computing
Figure 2.4 shows service publication, service discovery, and service invocation
stages in the life cycle of a service. This process involves three roles in the SOA:
service provider, service requester, and service discovery system. Service providers
create services and provide platforms to execute these services. Service requesters
query the service discovery system to find appropriate services. To enable service
requesters to find services, service providers need to publish their service
interfaces in a publicly available location. Specifying the capability and quality
of services, and finding a matched service based on these descriptions are usually
done as two separate activities. The more information that is given for describing
services, the more accurate are the matched results that are returned. Services
can be categorized into simple services (atomic services) and complex services
(composite services). Generating and executing a composite service to solve
a complicated problem is an important feature leading to the adoption of SOA.
Figure 2.3. Venn diagram representation of the integration of web services, grid computing, semantic web, and peer-to-peer technologies into the realization of a service-oriented architecture.
In the following sections, we discuss several active research issues in SOA: service
description, service discovery, service composition, and service execution.
2.3.1 Service description
One requirement of the service-oriented architecture is to provide meaningful
descriptions for services so that software agents can understand their features and
learn how to interact with them. A service description gives a formal representa-
tion for the properties of a service. These properties can be classified into functional
and non-functional properties.
Functional properties contain the details of a service interface and service
Figure 2.4. A common service lifecycle in a service-oriented architecture includes service publication, service discovery, and service invocation processes.
behavior including data types, operations, transport protocol information, and
binding address. WSDL is the first W3C standard that is widely used for service
descriptions.
There may be multiple service providers who offer the same functionalities
defined in a service interface. Determining and choosing the best service becomes
important for service requesters. The information in WSDL descriptions is not
sufficient for ranking the best services. Non-functional properties, including
specifications of the cost, performance, security, and trustworthiness of a service,
are introduced for measuring the Quality of Service (QoS). There are many aspects
of QoS that can be organized into categories with sets of quantifiable parameters [75]. The
“best” service may have different meanings for different requesters. One may
prefer security over cost, while another may prefer lower cost over performance.
Measurement of these non-functional properties can be achieved using statistical
analysis, data mining, and text mining technologies; it is normally done by a
third party through the collection of subjective evaluations from requesters. This
information changes dynamically over time.
Pure syntactic descriptions of services require requesters to fully understand
the capability of a service before using it. The selection of a web service among
several ones with similar WSDL descriptions requires more information than what
WSDL actually defines. The semantic web, supported by the use of an ontology, is
likely to provide better qualitative and scalable solutions to overcome these issues.
There are two directions for enhancing the semantics of web service descriptions
(see Table 2.1). The first is to enhance the WSDL description itself. The Semantic
Annotations for Web Services Description Language Working Group [81] has the
objective of developing a mechanism to enable annotation of web service descriptions.
This mechanism takes advantage of the existing WSDL standard (WSDL 2.0) to
build simple and generic support for semantics in web services. Some systems
[54] [55] define an ontology for web services using emerging languages, such as
DAML+OIL and OWL. The second is OWL-S, recently proposed to provide an
ontology description of web services using OWL. OWL-S enables description not
only of the functional properties of a service but also of the non-functional properties.
This domain-independent service ontology is augmented by domain-specific
ontologies in real applications.
Enhancing service descriptions with ontological representations increases the
cost and complexity of service annotation in several respects.
Creation of domain ontology Use of ontologies is considered to be the most
promising basis for defining the semantics of objects and allowing meaning-
ful information exchange among machines and humans. A commonly used
definition of ontology is “a specification of a conceptualization” [40]. An
ontology is intended to give a concise, uniform, and declarative description
of information and knowledge that is interesting and useful to a community
TABLE 2.1

SYNTACTIC AND SEMANTIC DESCRIPTION METHODS FOR WEB SERVICES

Description method   Representation            Challenges
syntactic            WSDL                      No representation of non-functional
                                               properties; not sufficient for meaningful
                                               descriptions; no representation for
                                               processes; supports only keyword search
semantic             domain ontology + WSDL    No representation for processes;
                                               complexity of service annotation
semantic             domain ontology + OWL-S   Complexity of service annotation
of users, using a common vocabulary and language.
Construction of a knowledge base involves investigating a particular domain,
determining important concepts in that domain, and creating a formal rep-
resentation (ontology) of the objects and relations in the domain. A general
ontology represents a broad selection of objects and relations at a higher-
level of abstraction [79]. Miller et al. [59] investigate ontologies for simulation
modeling. Christley et al. [15] present an ontology for agent-based
modeling and simulation.
An ontology is normally defined and revised (if needed) by an authority.
Usually the authority needs to collaborate with the real experts in the do-
main before or during the process of creating formal representations. Large-
scale ontologies can be constructed by publishing a prototype ontology for
the research community. The Gene Ontology (GO) Consortium produces
a controlled vocabulary in the biological sciences field for classifying gene
product attributes: molecular functions, cellular components, and biological
processes [35]. It grew from 17,838 terms (as of September 27, 2004) to
22,742 terms (as of March 11, 2007).
Integration of ontologies Vast amounts of information may come from many
different ontologies. For this reason, and because many heterogeneous data
repositories developed by different research groups reside at different
research institutes and organizations, it is impossible to process this
information and data without knowledge of the semantic mappings between
them. Much research has been done to explore the mapping and matching
of concepts and the integration of different ontologies using sophisticated
algorithms and AI techniques, such as machine learning [25] [62]. There are
two approaches for ontology integration. One approach involves integra-
tion of different ontologies that are developed by different groups for data
representation into a common global ontology. While this approach makes
the information correlation in the query processing easier, it increases the
complexity of integrating the ontologies and maintaining consistency among
concepts. The other approach is interoperation across different ontologies
via “terminological relationships” between terms instead of integration of
ontologies into a global one [56] [66]. Interontological relationships are spec-
ified using description logics in an interontological relationships manager to
handle vocabulary heterogeneity between ontologies. Although the adoption
of this approach increases the scalability, extensibility, and maintainability,
it shifts the burden to the interoperation mechanisms.
Annotation of services The annotation of services using ontologies is generally
done manually. It is a complex process, since there may be multiple
ontologies related to a single service, and these ontologies may be developed
by different groups. Different groups may represent the same concept using
different vocabularies, or different concepts may be represented using the same
vocabulary. Some systems, such as MWSAF (Meteor-S web service annotation
framework) [71], provide graphical tools that enable users to annotate existing
web service descriptions with ontologies in a semi-automatic way, using
AI technologies such as machine learning. The IBM ETTK [30] technology
provides a set of toolkits, including a graphical editor for annotating services
compatibly with WSDL-S.
2.3.2 Service discovery
Without prior knowledge of a service, service requesters may not know the
location or even the existence of the services they desire. The goal of the service
discovery process is to find the services best suited to the requirements of the
requester.
A basic service discovery process can be described as follows.
1. Service providers provide descriptions of their services and advertise these
services in a service registry. A service registry is a service discovery system
consisting of mechanisms for efficiently searching for appropriate services
and physical space for storing the characteristics of services. UDDI is
a registry standard.
2. Service requesters request desired services using keywords or complicated
query languages.
3. A service discovery system accepts requests from requesters. It searches
service descriptions in its database and tries to find services that match
requests. This process is also called matchmaking.
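The three steps above can be sketched as a toy keyword-based matchmaker; the service names and description keywords are invented for illustration, and a real registry such as UDDI would of course store far richer metadata.

```python
class ServiceRegistry:
    """A minimal keyword-based service registry (matchmaker)."""

    def __init__(self):
        self._services = {}  # service name -> set of description keywords

    def publish(self, name, description):
        # Step 1: the provider advertises its service description.
        self._services[name] = set(description.lower().split())

    def find(self, query):
        # Steps 2-3: match the request keywords against stored descriptions;
        # a service matches only if it covers every requested keyword.
        wanted = set(query.lower().split())
        return sorted(
            name for name, words in self._services.items()
            if wanted <= words
        )

registry = ServiceRegistry()
registry.publish("WeatherWS", "forecast weather temperature city")
registry.publish("BlastWS", "sequence alignment blast dna")
```

The all-keywords-must-match rule here is deliberately crude; it already exhibits the precision and recall problems of keyword retrieval discussed later in this section.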
As the number of web services grows, new registries appear as needed, and a
service may be registered in several registries. A service discovery broker accepts
requests from service requesters, translates the requests into appropriate formats, and
sends them to multiple registries. The returned results may be unified and
distilled based on the requesters’ needs (see Figure 2.5). In this mechanism, the broker
may issue the request to multiple registries in parallel; however, there is still a
communication bottleneck at the broker, and a single point of failure may occur.
An alternative to the centralized discovery mechanism is P2P-based discovery.
In this approach, each service provider acts as a peer in the
P2P network. Each provider has its own way of storing information about other
service providers, called neighbors, and provides the resources to relay or pass
information through; a network resembling a social network is eventually formed.
During the discovery process, a requester queries its neighbors for a desired
service, and the query propagates through the network until a suitable service is
found or the query terminates [105]. This approach provides higher reliability than
a centralized approach: it avoids the single point of failure and the latency of
providing up-to-date descriptions for updated services. However, since each service
provider is a peer, a huge peer community may result in inefficient search. Instead
of treating every provider as a peer, each registry can act as a peer in the network
to overcome this problem.
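The query-propagation idea can be sketched as a depth-first flood with a time-to-live (TTL) bound. The peer names, topology, and service sets below are invented for illustration; real P2P systems add duplicate suppression, asynchronous messaging, and smarter routing.

```python
# A sketch of decentralized discovery by query flooding with a TTL.
def discover(peers, start, keyword, ttl=3, seen=None):
    """Propagate a query from `start` through neighbors until a peer
    offering `keyword` is found or the TTL is exhausted."""
    seen = seen if seen is not None else set()
    if start in seen or ttl < 0:
        return None
    seen.add(start)
    if keyword in peers[start]["services"]:
        return start
    for neighbor in peers[start]["neighbors"]:
        hit = discover(peers, neighbor, keyword, ttl - 1, seen)
        if hit is not None:
            return hit
    return None

# A hypothetical four-peer network.
peers = {
    "A": {"services": set(), "neighbors": ["B", "C"]},
    "B": {"services": set(), "neighbors": ["D"]},
    "C": {"services": {"forecast"}, "neighbors": []},
    "D": {"services": {"blast"}, "neighbors": []},
}
```

The TTL bounds the flood, which is exactly the trade-off noted above: a small TTL keeps traffic manageable but may miss services that exist farther away in the network.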
Much research has been done toward realizing a P2P discovery mechanism. Schmidt
and Parashar present a P2P-based keyword web service discovery system on the
Figure 2.5. Broker-based service discovery mechanism. A service discovery broker accepts requests from service requesters, translates requests into appropriate formats, and sends them to multiple registries. The returned results may be unified and distilled based on requesters’ needs.
Chord overlay network [82]4. In this system, a set of keywords is extracted from
the web service descriptions, and the descriptions are indexed using these
keywords. The index is stored at the peers in the P2P system, and each web
service description is mapped into the index space. The underlying node joins,
departures, failures, and data lookups are built upon the Chord network’s lookup
protocol. Speed-R [88] is a JXTA-based P2P network system supporting semantic
publication and discovery of web services. In this system, each service registry is
controlled by a peer. Dogac et al. [26] describe a way to expose the semantics
of web service registries and connect the service registries through a P2P network
for the travel industry. A general P2P discovery system (see Figure 2.6) contains
4http://en.wikipedia.org/wiki/Chord project
a data layer, a communication layer, and peers that control registries or service
providers. The data layer can be formed by registries or service providers. The
communication layer is an implementation of a P2P network, such as JXTA or Chord.
Semantically enriched services and registries make possible the automation of
service discovery and the discovery of service registries.
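The keyword-to-index-space mapping used in Chord-style systems can be sketched as follows. The 8-bit ring and the node identifiers are toy parameters chosen for illustration (real Chord uses a 160-bit SHA-1 identifier space); the successor rule, however, is the one Chord actually uses to assign keys to nodes.

```python
import hashlib

RING_BITS = 8  # a tiny identifier space for illustration

def ring_position(key):
    """Hash a keyword onto the identifier ring."""
    digest = hashlib.sha1(key.encode()).digest()
    return digest[0] % (2 ** RING_BITS)

def responsible_node(nodes, position):
    """Chord's successor rule: the first node id >= position,
    wrapping around the ring."""
    nodes = sorted(nodes)
    for n in nodes:
        if n >= position:
            return n
    return nodes[0]

# Keywords extracted from a service description, each stored at the node
# that succeeds its ring position; node ids are hypothetical.
keywords = ["forecast", "weather", "temperature"]
placement = {k: responsible_node([10, 80, 200], ring_position(k))
             for k in keywords}
```

Because the hash spreads keywords roughly uniformly over the ring, the index load is balanced across peers, and a lookup can locate the responsible node without any central registry.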
Figure 2.6. P2P-based discovery mechanism containing a data layer, a communication layer, and peers that control registries or service providers.
The traditional service discovery method, static or manual discovery, relies on
human intervention: a person uses a discovery system to locate and select a
service description that meets the desired criteria at design time. A dynamically
changing service environment requires that service discovery also be possible
via a software agent at run time. The realization of this dynamic discovery
mechanism needs machine-processable semantics to describe services.
The implementation and performance of a service discovery system depend
on the available information in service descriptions. The more information the
system can gather, the more accurate results the system can give back to the
requester. The implementation also depends on the kind of query that can be
given by the requester. Two examples are “give a forecast service” and “give
a forecast service which has the fastest response time.” For the first query,
a simple keyword-based discovery system is sufficient. For the second query, the
discovery system needs to gather information on quality of service, find several
forecast services in registries, and rank them based on response time.
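The QoS-aware second query can be sketched as a ranking step over candidate services. The service names and measured response times below are invented for illustration; in practice these measurements would come from the third-party QoS collection described earlier.

```python
# QoS-aware selection: among services offering the same functionality,
# rank by a measured non-functional property (here, response time).
candidates = [
    {"name": "ForecastA", "response_ms": 120},
    {"name": "ForecastB", "response_ms": 45},
    {"name": "ForecastC", "response_ms": 300},
]

def fastest(services):
    """Answer the second example query: the forecast service with the
    fastest measured response time."""
    return min(services, key=lambda s: s["response_ms"])["name"]
```

A requester who instead preferred security over cost would simply rank by a different QoS attribute, which is why no single "best" service exists for all requesters.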
The service discovery problem is related to the information retrieval problem.
Two key quality measurements in information retrieval are also applicable when
evaluating the performance of service discovery systems [45]. Recall is the number
of relevant items retrieved, divided by the total number of relevant items in the
collection. Precision is the number of relevant items retrieved, divided by the
total number of items retrieved.
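These two measures translate directly into code. The sketch below evaluates a hypothetical query whose collection holds four relevant services, of which the discovery system returns three among five results; the service identifiers are invented for illustration.

```python
def recall(retrieved, relevant):
    """Relevant items retrieved, divided by all relevant items."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Relevant items retrieved, divided by all retrieved items."""
    return len(retrieved & relevant) / len(retrieved)

# Hypothetical evaluation: 4 relevant services in the collection,
# 5 results returned, 3 of them relevant.
relevant = {"s1", "s2", "s3", "s4"}
retrieved = {"s1", "s2", "s3", "x1", "x2"}
```

Here recall is 3/4 and precision is 3/5, illustrating the usual tension: returning more results tends to raise recall while lowering precision.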
The discovery mechanism in the traditional UDDI standard, which supports only
static service discovery, has been recognized as insufficient. It often gives no
result at all, or many irrelevant results, because keywords are a poor means of
capturing the semantics of a request. Synonyms (syntactically different words may
have the same meaning) and homonyms (the same word may have different meanings
in different domains) cannot be distinguished in keyword-based retrieval. Also,
relationships between the different keywords in a request cannot be captured. This
mechanism therefore offers low retrieval precision and recall.
WordNet [102] has been used to handle synonyms and to employ an information
retrieval model in the service retrieval process [99], so as to improve precision
and recall. WordNet is a lexical reference system developed by the Cognitive Science
Laboratory at Princeton University. English nouns, verbs, adjectives, and
adverbs are organized into synonym sets, each representing one underlying lexical
concept, and the synonym sets are linked by different relations. WordNet is
distributed as a data set. However, WordNet supports only the lookup of common
words; vocabularies for a particular domain are most likely not included in
WordNet.
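Synonym expansion of this kind can be sketched as a preprocessing step on the query terms. The two-entry synonym table below is a tiny invented stand-in for WordNet synsets, not actual WordNet data.

```python
# A stand-in synonym table in the spirit of WordNet synsets.
SYNONYMS = {
    "forecast": {"forecast", "prediction"},
    "prediction": {"forecast", "prediction"},
}

def expand(terms):
    """Expand each query term with its synonym set; terms without
    an entry (e.g. domain-specific vocabulary) pass through unchanged."""
    expanded = set()
    for t in terms:
        expanded |= SYNONYMS.get(t, {t})
    return expanded

def matches(query_terms, description_terms):
    """Match if any expanded query term appears in the description."""
    return bool(expand(query_terms) & set(description_terms))
```

A query for "prediction" now matches a service described with "forecast", raising recall; the pass-through branch in `expand` illustrates exactly the limitation noted above, since domain-specific terms gain no synonyms.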
With rich formal semantic descriptions added to web services, a service discov-
ery system can provide more accurate results with high precision and recall. It also
reduces human interference in the discovery process and makes dynamic discovery
possible. Therefore, semantic web technologies become a solution for this
matchmaking process [47] [26] [28]. In the meantime, quality of service becomes
an interesting topic for selecting optimal services from a subset of services that
have the same functionality the requester asked for [10] [53] [48] [19]. Two types of
semantic descriptions result in two types of semantic discovery systems: (1) adding
semantics to current web services standards (UDDI and WSDL) [85]; (2) using
DAML-S or OWL-S to represent both the functional and non-functional properties
of web services, which enables software agents or search engines to automatically
find appropriate web services via ontologies and reasoning-enriched methods.
However, the high cost of formally defining large and complicated services makes
adoption of this improvement unlikely at the current stage.
Figure 2.7 shows, in three dimensions, existing service discovery mechanisms
currently used in implementations of service discovery systems. A is a keyword-based
system, such as traditional UDDI. B is a semantically enriched UDDI system
[85]. C is a keyword-based P2P system [82]. D is a semantic-based system using
DAML-S or OWL-S [47] [26] [28]. E is a semantic-based system on a P2P network
[88] [26].
Figure 2.7. Summary of existing service discovery systems with different discovery mechanisms, mapped relative to three characteristics: degree of decentralization, richness of service descriptions, and static or dynamic discovery.
The research challenges in the service discovery process suggest a way forward:
integrating semantic and P2P technologies to build a discovery system that allows
automatic service discovery while providing high precision and recall. However,
the cost of implementing such a system makes it hard to adopt at this time.
2.3.3 Service composition
One of the most attractive features of service-oriented computing is that atomic
services can be combined into a large application to solve complicated problems.
The orchestration of a set of services to accomplish a larger and sophisticated goal
is called a workflow. In the business world, a workflow is referred to as a business
process. In the scientific domain, a workflow is sometimes referred to a scientific
process.
Several different approaches and platforms are being developed to achieve the common goal of web service composition. These approaches range from the adoption of industry standards to the adoption of semantic web technology, and from manual or static composition to automatic, dynamic composition [90]. Since there is no standard service composition specification, each approach and platform defines its own way of composing services, provides its own specifications and languages, and executes the workflow on a specific workflow execution engine.
Current solutions for web service composition include the adoption of industrial standards, semantic web technologies [86] [29] [41], web components [111], Petri nets [112], and so on. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58].
Adoption of industrial standards and adoption of semantic web technologies are the two most active research directions among current service composition mechanisms. Both support complex process activities, such as sequences, branching, etc.
Current industrial standards include WSDL, UDDI, SOAP, and a set of workflow specification languages (BPEL4WS, WSFL, BPML, WSCI, and XLANG) used to support data flow and control flow representations [98]. Among these specifications, BPEL4WS is the most mature and the most widely supported by industry and the research community. Service compositions described in the BPEL4WS format may be deployed on execution engines such as BPWS4J [11] and the Collaxa BPEL server [17].
The other approach is based on semantic web technologies and AI planning techniques [84] [13]. In this model, services are semantically annotated with RDF/RDF Schema, DAML-S, or OWL-S. The objective is to enable the automated discovery, invocation, composition, and execution of web services. At the current research stage, however, there is limited implementation and product support for generating service descriptions automatically.
Most service composition models require application developers to possess complete knowledge of the available services and the exact process logic; it is up to developers to choose a particular service at each step. Adopting semantic web technologies makes automation of the composition process possible. There are two types of automation, semi-automatic and automatic, and both require the existence of a domain ontology. A typical system using the semi-automatic method [84] maintains a knowledge base containing an ontology of services, such as DAML-S or OWL-S. A matchmaker is used to find a service with the required functionality. At each step, all candidate services that meet the requirement are presented to the user, ranked by quality; the user makes a choice and continues the process. A typical system using the automatic method incorporates AI planning technology [13]. The composition process starts from an explicitly defined goal: the workflow composition engine lets the service requester provide the input and output information, which is fed into an AI planner. The planner returns one plan, multiple plans, or no results to the end user for a further decision. Although the service composition problem is closely related to the AI planning problem, current planning technologies cannot be directly applied [90].
Services change dynamically and may fail during execution. A composed workflow that performs the desired work at one time may not work at another. Preventing run-time failure at design time is therefore important.
An issue in the automatic composition of web services is defining the compatibility [55] or connectivity [58] of services. It can be a time-consuming process to check whether the services to be composed can actually interact with each other; for example, the output of one service must be a required input of the subsequent service in a workflow. It also requires a way to verify the soundness and correctness of the composite services. Much research has explored AI planning techniques for automating the composition process, but it remains an open research problem whether current planning techniques can be used or extended for service composition and the modeling of services. The application used most often to motivate research in automatic service composition is a virtual travel agent; the motivating examples typically lack real-world scale. This approach may now be practical in domains with well-defined ontologies and a small number of available services. We believe the semi-automatic approach is more practical when a large number of services exist in the domain.
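The pairwise input/output connectivity check described above can be sketched in a few lines. This is a minimal sketch with an invented service registry and type names; a real checker would compare WSDL message types or ontology concepts rather than string equality.

```python
# Invented registry: each service declares one input and one output type.
SERVICES = {
    "FetchGene":  {"in": "GeneID",          "out": "DNASequence"},
    "Translate":  {"in": "DNASequence",     "out": "ProteinSequence"},
    "RunBlast":   {"in": "ProteinSequence", "out": "AlignmentReport"},
}

def check_connectivity(workflow):
    """Return the first incompatible adjacent pair in the workflow,
    or None if every service's output matches the next service's input."""
    for prev, nxt in zip(workflow, workflow[1:]):
        if SERVICES[prev]["out"] != SERVICES[nxt]["in"]:
            return (prev, nxt)
    return None

# A well-connected pipeline passes; skipping the translation step fails.
assert check_connectivity(["FetchGene", "Translate", "RunBlast"]) is None
assert check_connectivity(["FetchGene", "RunBlast"]) == ("FetchGene", "RunBlast")
```

Running such a check at design time catches one class of composition error before any service is invoked, which is exactly the motivation for verifying connectivity rather than discovering failures at run time.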
2.3.4 Service execution
Service execution is a process in which an atomic service or a composite service
is invoked and results are returned to requesters.
Atomic web services can be created in different languages and deployed on various platforms; the two major platforms are J2EE and .NET. Since execution of an atomic service does not require results from other services, the technologies supporting atomic services are relatively mature (see Table 2.2).
Service execution for composite services depends on the composition model and the available execution engine support. An industrial-standard-based composition can be translated to a particular workflow specification, such as BPEL4WS, and executed on a workflow engine. A semantic-web-based composition can be represented in the DAML-S specification and executed on a DAML-S Virtual Machine [84] or an OWL-S execution engine. Because there is no standard service composition specification, each approach provides its own specifications and languages for composite services and executes the workflow on a specific workflow execution engine. There are also composition toolkits that convert a visual graph composition of services into a language-specific workflow. Several issues exist in the service execution process.
Synchronous vs. asynchronous communication Web service technology is message-passing oriented, so the architecture should support different message passing methods. Most service-oriented frameworks, such as Axis [3], only provide support for synchronous invocation, which blocks the requesting process until the response from the service provider arrives. The loosely coupled nature of web services requires a more flexible invocation method: the requester should not be blocked while waiting for the response from
TABLE 2.2

EXISTING DEPLOYMENT AND EXECUTION ENGINES FOR ATOMIC AND COMPOSITE SERVICES

Service type        Specification   Execution engine
Atomic service      WSDL            Implemented using Java, C++, Perl, or Python on .NET, J2EE, gSOAP, or SOAP::Lite
                    OWL-S           OWL-S execution engine
Composite service   BPEL            BPWS4J
                    OWL-S           OWL-S execution engine
                    DAML-S          DAML-S virtual machine
                    XScufl          Freefluo
providers. Various research efforts support this asynchronous communication method [107] [113].
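The contrast between blocking and non-blocking invocation can be illustrated with a thread pool standing in for the remote call; `slow_service` below is an invented stand-in for a web service operation, not a real framework API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_service(query):
    """Stand-in for a remote web service call with network latency."""
    time.sleep(0.1)                      # simulated network + processing delay
    return f"result for {query}"

# Synchronous invocation: the requester blocks until the provider responds.
blocking_result = slow_service("q1")

# Asynchronous invocation: submit the call, keep working, collect later.
with ThreadPoolExecutor() as pool:
    future = pool.submit(slow_service, "q2")
    # ... the requester is free to do other work here instead of blocking ...
    async_result = future.result()       # rendezvous when the response arrives

print(blocking_result, async_result)
```

Real asynchronous SOAP frameworks achieve the same decoupling with callbacks or message queues rather than local threads, but the control-flow difference for the requester is the same.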
Centralized vs. decentralized execution of composite web services Although most composite service execution engines invoke individual atomic services on distributed service providers, the engine acts as a centralized coordinator for all interactions among these atomic services. Decentralized execution allows independent sub-workflows to interact with each other without any centralized control, which can reduce the amount of network traffic. Nanda et al. [60] present an algorithm that partitions a composite service in BPEL into independent sub-processes, with each service provider hosting a BPEL engine; their experimental results show that decentralized execution can increase throughput substantially. Weber et al. [100] present a peer-to-peer based execution system in which, when a node finishes its part of the work, the data is migrated to nodes offering a suitable service for one of the next steps in the process. Benatallah et al. [6] present an environment where a composite service can be executed in a decentralized way within a dynamic environment.
Monitoring service and workflow execution One issue in service execution is that a selected service in the workflow may be unavailable or temporarily off-line; the execution engine then invokes an alternative service if one was defined in the workflow at the composition stage. Service execution takes time to complete, so requesters may need a monitoring service through which they can query the status of their requested services. Monitoring the service execution status is thus another important issue. Experience from grid computing research may be adopted in SOA to build a reliable infrastructure for service execution.
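The fallback-and-monitor behaviour described above can be sketched as follows. The service functions and the status store are invented stand-ins for real providers and a monitoring service; a real engine would update a queryable status endpoint rather than a dictionary.

```python
STATUS = {}   # stand-in for a monitoring service's status store

def primary(data):
    """Stand-in for the selected service; simulate a provider outage."""
    raise ConnectionError("provider off-line")

def alternative(data):
    """Stand-in for the alternative service defined at composition time."""
    return data.upper()

def invoke_with_fallback(task_id, data):
    """Try the selected service first; on failure invoke the alternative,
    recording status transitions so requesters can poll progress."""
    STATUS[task_id] = "running"
    for service in (primary, alternative):
        try:
            result = service(data)
            STATUS[task_id] = "done"
            return result
        except ConnectionError:
            STATUS[task_id] = "retrying with alternative"
    STATUS[task_id] = "failed"
    return None

result = invoke_with_fallback("job-1", "atgc")
print(STATUS["job-1"], result)
```

A requester can poll `STATUS` at any point during execution, which is the essence of the monitoring requirement: status is observable while the long-running invocation is still in flight.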
2.4 Service-oriented computing in e-Science
An individual life sciences researcher or research group starts a scientific project
by developing hypotheses, designing experiments to test those hypotheses, collect-
ing observational data, and publishing results. The published data allows other
researchers to build upon or verify the results. With the assistance of computer software, users can import raw data, click buttons, and retrieve results. The analysis process, however, requires knowledge of how to use these toolkits and how to access the data from different locations. Even for users who possess this knowledge, this manual analysis process is a bottleneck when large data
sets are involved. As the World Wide Web becomes a platform for scientific study
(e-Science), research data can be published on the web to be shared with other
researchers. These data can be distributed in various formats (such as RDBMS
tables, text files, or XML documents) depending on the preferences and needs
of research groups. Manually accessing these data files becomes difficult as these
data may come from different institutes, different research groups, and in different
formats. There is a need for a methodology that frees users from having to locate
the data sources, interact with each data source, and manually combine data in
multiple formats from multiple sources. Applying semantic web and web services
technologies to support life sciences research becomes a promising solution to this
difficulty.
As the adoption of web services in the life sciences field grows, many large public resource sites are publishing web services interfaces in WSDL format to make their data and analysis tools accessible to the research community (see Table 2.3).
TABLE 2.3

LIFE SCIENCES RESOURCES AVAILABLE AS WEB SERVICES

Service Provider: NCBI (the National Center for Biotechnology Information)
Description: Provides a variety of E-Utility web services to allow data retrieval against the NCBI database using WSDL and SOAP
Resources URL: http://www.ncbi.nlm.nih.gov/entrez/query/static/esoaphelp.html

Service Provider: EMBL-EBI (the European Bioinformatics Institute)
Description: Provides a number of web services for data retrieval, data analysis tools, and ontology lookup using WSDL and SOAP
Resources URL: http://www.ebi.ac.uk/Tools/webservices/

Service Provider: DDBJ (the DNA Database of Japan)
Description: Provides web services for data retrieval and data analysis against the DDBJ database using WSDL and SOAP
Resources URL: http://xml.nig.ac.jp/index.html

Service Provider: KEGG (the Kyoto Encyclopedia of Genes and Genomes)
Description: Provides web services for data retrieval and data analysis against the KEGG database
Resources URL: http://www.genome.ad.jp/kegg/soap/

Service Provider: SeqHound
Description: Provides web services for data retrieval from the sequence and structure database
Resources URL: http://www.blueprint.org/seqhound/seqhounddocumentation.html
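The resources in Table 2.3 all expose SOAP interfaces, so at the wire level a client ultimately posts an XML envelope to the provider's endpoint. A minimal sketch of constructing such an envelope with Python's standard library follows; the operation name, namespace, and parameters are invented placeholders, since a real client would take them from the provider's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_envelope(operation, service_ns, params):
    """Build a minimal SOAP 1.1 request envelope as an XML string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{service_ns}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")

# Placeholder operation and namespace, NOT the actual NCBI/EBI interface.
xml_request = build_envelope(
    "run_eSearch", "http://example.org/sequence-service",
    {"db": "nucleotide", "term": "plastid"})
print(xml_request)
```

In practice a SOAP toolkit generates this envelope (and the matching response parser) from the WSDL automatically, which is precisely what makes the services in Table 2.3 consumable across languages and platforms.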
In e-Science, a number of legacy data analysis tools are designed as command-line applications. Soaplab 5, developed by EBI, is a SOAP-based web service utility used to wrap such command-line applications into web services. Recently, service-oriented computing middleware capable of supporting life science experiments has been developed. We believe an "ideal" service-oriented architecture should allow service and data providers to publish their information into registries with semantically defined properties using domain ontologies; allow not only experts but also end-users to define workflows at a high level of abstraction using the vocabulary provided in the domain ontology; allow execution of the workflow and monitoring of the workflow execution process; and allow full or partial reuse of existing workflows and support data provenance management. Several workflow management systems have been developed to meet this goal.
Discovery Net 6 is a service-oriented computing system for knowledge discovery, based on an open architecture that reuses common protocols and common infrastructure such as the Globus Toolkit. It is a multidisciplinary project serving application scientists from fields including biology, combinatorial chemistry, renewable energy research, and geology. The system allows service providers to publish data mining and data analysis software components as services; allows data owners to provide interfaces and access to scientific databases, data stores, sensors, and experimental results as services; and allows users (scientists) to plan, manage, share, and execute complex knowledge discovery and data analysis procedures. Besides reusing common protocols and infrastructure, Discovery Net defines its own protocol, DPML (Discovery Process Markup Language), for constructing and managing knowledge discovery procedures, as well as recording their history. A defined data analysis task (scientific workflow) can be executed on distributed resources, stored, shared, and re-executed.

5 http://www.ebi.ac.uk/Tools/webservices/soaplab/overview
6 http://www.discovery-on-the.net/
Pegasus 7 [34] [23] [2] is a framework that enables the mapping of complex scientific workflows onto the Grid. In the Pegasus system, an abstract workflow is one in which the workflow activities (software components) are independent of the Grid resources used to execute them. The abstract workflow depicts the main steps in the scientific analysis, including the data used and generated, but does not include information about the resources needed for execution. Abstract workflows can be constructed using Chimera – VDS (the GriPhyN Virtual Data System) 8 – or written by users from a workflow template.

The concrete workflow represents an executable workflow that includes details of the execution environment, including the data movement needed to stage data in and out of the computations. Other nodes in the concrete workflow may include data publication activities, where newly derived data products are published into the Grid environment. A major research focus in mapping abstract workflows to concrete workflows in the Grid computing environment is finding an appropriate, currently registered resource for each step. Extra service components, such as data transfer and data registration in the grid environment, may have to be encapsulated in the workflow. This mapping process may be automated with algorithms and AI planning technologies if the resources are semantically well described. During the mapping, the workflow may be restructured, reordered, and refined to improve overall performance and to adapt to dynamically changing execution environments. The concrete workflow can be given to Condor's DAGMan 9 for execution.

7 http://pegasus.isi.edu/
8 http://www.ci.uchicago.edu/wiki/bin/view/VDS/VDSWeb/WebMain
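The abstract-to-concrete mapping step can be illustrated with a toy resource binder. The site names, software catalogue, and transfer-insertion rule below are invented simplifications of what a mapper like Pegasus actually does, intended only to show the shape of the transformation.

```python
# Invented site catalogue: grid site -> software components it hosts.
SITES = {
    "siteA": {"blastall"},
    "siteB": {"clustalw", "blastall"},
}

def make_concrete(abstract_workflow):
    """Bind each abstract activity to a site that hosts it, inserting a
    data-transfer step whenever consecutive activities land on
    different sites (staging data in for the next computation)."""
    concrete, prev_site = [], None
    for activity in abstract_workflow:
        site = next(s for s, sw in SITES.items() if activity in sw)
        if prev_site and site != prev_site:
            concrete.append(("transfer", prev_site, site))
        concrete.append(("run", activity, site))
        prev_site = site
    return concrete

plan = make_concrete(["blastall", "clustalw"])
print(plan)
```

The abstract workflow names only the analysis steps; the concrete plan additionally pins each step to a site and makes the staging traffic explicit, which is what an engine such as DAGMan needs in order to execute it.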
myGrid 10 is service-oriented computing middleware that supports life sciences researchers in constructing, executing, and sharing scientific workflows using the Taverna 11 workbench. Researchers can use the graphical workbench to drag and drop service components into the model explorer. Recent myGrid development focuses on supporting users in the discovery and composition of services by using rich service annotations to make workflow design more accessible to non-expert users. With the incorporated semantic web technology, services and workflows can be described using domain-specific ontologies, a valuable capability in a system potentially searching over thousands of services. Rather than locating available Grid resources, the semantically enabled service annotation and discovery in myGrid is used to locate software components or data exposed as web services. The executable workflow is written in the XScufl language and executed in the Freefluo workflow engine, and researchers can monitor the execution status through the Taverna workbench. In the myGrid system, the Feta data model is used to represent the semantic description of available services [50]. Web services are annotated with terms from an OWL-based myGrid domain ontology [103] through a GUI-based interface, Pedro [33]. This approach is more lightweight than the OWL-S and WSMO ontologies, although less expressive of details that could better support automation. Although the description methods adopted in myGrid have limited expressivity, they are sufficient for describing most services, and their simplicity makes them more practical for describing large numbers of services.

9 http://www.cs.wisc.edu/condor/dagman/
10 http://www.mygrid.org.uk
11 http://taverna.sf.net
The IRIS project [74] also targets the discovery, composition, and interoperability of services required within in silico life science experiments. IRIS uses a semi-automatic procedure for identifying and placing customizable adapters (mediators) into workflows built by service composition. In IRIS, the capabilities of a mediator are described using the Mediator Profile Language (MPL), developed as a top-level ontology in the Web Ontology Language (OWL).
BioMoby 12 is an open source research project which aims to generate an architecture for the discovery and distribution of biological data through web services [101]. Decentralized data and services are registered at a centralized registry called MOBY Central. The BioMOBY project focuses on service description, discovery, transactions, and simple input/output object type definitions. This foundational set of functionality allows client programs to expand on the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY). The BioMoby project recently integrated access to many BioMoby features into the Taverna workbench through a Taverna plug-in; users are guided through the construction of syntactically and semantically correct workflows from the graphical interface [44].
The Open Middleware Infrastructure Institute UK (OMII-UK) 13 is a project that aims to provide software and support for building collaborative infrastructure for the UK e-Science community and its international collaborators. The OMII environment integrates other open-source software components to provide users with a secure web services hosting and execution environment. Users can deploy web services at different levels in the OMII server architecture: as a normal Axis web service or as a secure web service with WS-Security support. GridSAM provides a web service for submitting and monitoring jobs managed by a variety of Distributed Resource Managers (DRMs); its modular design allows third parties to provide submission and file-transfer plug-ins. OMII also integrates GRIMOIRES, a registry service that provides descriptions of services and workflows; the GRIMOIRES implementation extends the UDDI specification to provide not only syntactic but also semantic descriptions. The OGSA-DAI middleware provides data integration and a secure infrastructure for exposing data resources as web services in a grid or any other context. WSRF::Lite, which follows on from OGSI::Lite, is a Perl implementation of the Web Services Resource Framework (WSRF), which was inspired by and supersedes the Open Grid Services Infrastructure (OGSI). WSRF::Lite supports the following web service specifications: WS-Addressing, WS-ResourceProperties, WS-ResourceLifetimes, WS-BaseFaults, WS-ServiceGroups, and WS-Security.

12 http://biomoby.open-bio.org/
13 http://www.omii.ac.uk/
2.5 Conclusion
In this chapter, we introduce several concepts related to SOA and discuss the integration of these technologies to solve some open issues in SOA research. Applying semantic web technology is intended to automate the web service discovery and composition process with little or no human guidance. The challenges are: 1) defining a high-quality domain ontology; 2) achieving interoperability of ontologies among different domains; 3) correctly annotating large numbers of web services and data using the ontology; and 4) reaching an agreed-upon definition of service composability, soundness, and scalability. AI planning technologies for service composition are studied largely at the theoretical level and are often demonstrated on a well-defined, small domain, such as a travel agency, instead of large real-world applications.
Services provided in the Grid architecture, in particular the Globus Toolkit, can be exposed with a web services interface and composed into a workflow. When combined with Grid computing technology, this allows the creation of virtual organizations and groups, provides a service-oriented architecture that is more efficient and flexible in resource allocation and data transfer (such as the GridFTP tool), and enables an increased level of privacy inside and between virtual organizations. Because Grid computing and service-oriented architecture are converging, many standards and specifications are constantly being expanded, updated, refined, and made obsolete rather rapidly, and it is hard to keep up with them. For example, the Open Grid Services Infrastructure (OGSI) was published by the Global Grid Forum (GGF) as a proposed recommendation in June 2003, intended to provide an infrastructure layer for the Open Grid Services Architecture (OGSA). OGSI is now obsolete, superseded by the Web Services Resource Framework (WSRF). With the release of GT4, the open source toolkit is migrating back to a pure web services implementation (rather than OGSI) via integration of WSRF.

Applying peer-to-peer technology can help avoid central points of failure and increase scalability during the service discovery and workflow execution processes.
Service-oriented computing is a young research area, with many in-progress frameworks and middleware, workflow specifications, WS-* standards, and ontological representations that have been presented without complete tool support. Many research areas still need to be addressed in order to build a complete, reliable, and ideal service-oriented architecture.
CHAPTER 3
A SERVICE-ORIENTED DATA INTEGRATION AND ANALYSIS
ENVIRONMENT FOR BIOINFORMATICS RESEARCH
In this chapter 1, we present a practical experiment in building a service-oriented system upon current web services technologies and bioinformatics middleware. The system allows scientists to extract data from heterogeneous data sources and generate phylogenetic comparisons automatically. This is difficult to accomplish with manual search tools, since sequence data is rapidly accumulating and those tools must be repeatedly invoked as new data becomes available. A web-based environment enables scientists to more effectively define a task, perform the task at a desired time, monitor the execution status, and view the results. The first prototype of this system is evaluated on a phylogenetic research application, Mother of Green (MoG). Our evaluation demonstrates that a service-oriented architecture can accelerate scientific research, increase research productivity, and provide a new approach to doing science. We also discuss issues in the design and implementation of the system and identify future research directions to enhance it.

1 Portions of this chapter appear in the 40th Annual Hawaii International Conference on System Sciences, HICSS40, Hawaii, 2007 [110]
3.1 Introduction
As biological research becomes increasingly data-driven, scientists are conducting experiments using the cyberinfrastructure (in silico experiments) to gather information from public online databases and to test their hypotheses. These heterogeneous, independently developed data sources make traditional approaches insufficient for this type of research and experimentation. Complex queries against several of these databases may provide valuable new insights, but interoperability problems make this difficult. The researcher must often manually cut and paste data from one database resource to another and repeatedly use multiple tools to format and analyze the data, a process that may take days or weeks. In many investigations, the process stops once the scientist requires a workflow that is not feasible using manual retrieval and analysis.
There is a demand for a methodology that frees users from having to locate
the data sources, interact with each data source, and manually combine data
in multiple formats from multiple sources. A promising solution to achieve the
seamless interoperability among these data sources and analysis tools relies on the
emerging technology of service-oriented architecture (SOA). SOA has been recog-
nized during the past few years as an approach to achieve interoperability among
multiple data sources [91] [92]. Many large bioinformatics database providers,
such as NCBI, EMBL, DDBJ, already make their databases available via a SOA.
Emerging toolkits and platforms, such as Soaplab [87] enable many data analysis
tools to be wrapped as web services. These existing services permit software engi-
neers to build unified interfaces for scientists to access heterogeneous data sources.
The platform independent feature of SOA makes it a feasible solution to integrate
increasingly available data analysis tools.
While protocols, toolkits, and middleware are increasingly available to address the majority of the technical issues in building a data integration and data analysis environment, the question of how real-world problems can be solved successfully using these technologies needs to be answered through practical implementations in a real-world context. In this chapter, we describe the design
and implementation of a web-based data integration and analysis environment.
The underlying infrastructure is built upon current web service technologies and
bioinformatics middleware to enable biologists to better utilize heterogeneous ge-
nomic data. The first prototype of the system is used in a phylogenetic research
application, the Mother of Green (MoG). MoG is a collaborative research project
on plastid phylogenetic analysis involving information technologists and biologists.
Genomic sequence data is accumulating faster than scientists can find and ana-
lyze it using manual search tools. The SOA-based platform allows scientists to
extract data and analyze phylogenetic comparisons automatically. The web-based
environment enables scientists to more effectively define a task, perform the task
at a desired time, monitor the execution status, and view the results. The over-
all aim of this project is to provide an easy-to-use environment for biologists to
research the puzzle of plastid phylogeny and to answer an open question on the
phylogenetic history of the plastid genome.
In the rest of this chapter, we briefly review web service technologies and related work, followed by an overview of the MoG project and a description of the overall system architecture. We then describe a prototype implementation of the system, related issues, and extensions of the system.
3.2 Related work
The service-oriented architecture (SOA) was proposed initially as an emerging
paradigm for business process integration inside or across organization boundaries.
It is gaining significant attention from the scientific research community for use
in building e-science infrastructures. The proposed standard in grid computing,
Open Grid Service Architecture (OGSA) [63], is built upon service-oriented ar-
chitecture and demonstrates the convergence of the Grid with SOA. Three basic
standards in SOA, Simple Object Access Protocol (SOAP), Web Services De-
scription Language (WSDL), and Universal Description, Discovery and Integra-
tion (UDDI), are sufficient for providing simple atomic services. However, single
atomic services are not adequate for developing complex applications. One of the
most important features of SOA is that services developed in different groups can
be combined as a workflow to solve complicated problems. This feature leads to
several research issues and challenges, including service discovery, service composition, and service enactment. Semantic web technology [54] [7] and peer-to-peer technology are used in SOA to automate the service discovery process and make service enactment more reliable.
BioMOBY is an open source research project which aims to generate an archi-
tecture for the discovery and distribution of biological data through web services
[101]. Decentralized data and services are registered at a centralized registry called
MOBY Central. The BioMOBY project focuses on service description, discovery, transactions, and simple input/output object type definitions. This foundational set of functionality allows client programs to expand on the specification to include additional new features. There are two development tracks with different architectures, MOBY-Services (MOBY-S) and Semantic-MOBY (S-MOBY).
REMORA [14] is a web server implementation based on the BioMOBY service specification. It provides life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows, and a survey system.
Another project, myGrid, provides e-Science application developers with a toolkit based upon a high-level middleware layer. It builds on and extends the Grid framework of distributed computing through a SOA. It provides not only a semantic-based service discovery system but also the Taverna workflow workbench [65], personalized data repositories, provenance, and update notification. The direct users of myGrid are those who build applications using the myGrid toolkit [94]. Compared to the BioMOBY project, myGrid has more ambitious goals: bioinformaticians, tool builders, and service providers can collectively or selectively employ these middleware services to produce applications that support research in the biological and life sciences [36].
The IRIS project [74] is another active project targeting the discovery, composition, and interoperability of services required within in silico experiments. IRIS handles this problem through a semi-automatic procedure for identifying and placing customizable adapters into workflows built by service composition.
Web Service for Bioinformatic Analysis Workflow (WsBAW) [106] and Bioinformatic Workflow Builder Interface (BioWBI) [9] are two applications provided by IBM alphaWorks. WsBAW automates bioinformatic workflows by deploying a web service; it consists of a client application through which users send batch requests to a specific bioinformatic workflow execution engine, such as BioWBI. BioWBI is an easy-to-use, web-based working environment from which a life sciences researcher or a research community can build and execute bioinformatic workflows and share their analysis processes.
3.3 Motivation
The motivating application is the phylogenomics of the plastid. Named the
Mother of Green (MoG) project by a multidisciplinary team of computer scientists and biologists, MoG aims to identify the most recent common ancestor of all
plastids. While many biologists support the view that all plastids are descended
from a single endosymbiont ancestor, the data are not conclusive due to missing
information and inefficient use of existing information. Using the nucleotide and
amino acid sequences of expressed genes to infer ancient ancestral relationships, MoG
investigators hope to identify which of the ancestral plastid genes have traveled
into the host nucleus and why some genes are more likely to be transferred than
others. The rate of data accumulation, the rapid development of new phyloge-
netic analysis tools, and the refinement of existing tools simply overwhelm the
researchers. The biologists need a better approach than manual or ad-hoc script-
ing to accumulate and analyze enough relevant data to rigorously test the single
ancestor hypothesis.
3.3.1 Use case
A typical phylogenetic analysis process consisting of multiple manual data
collection and data analysis steps is described below and shown in Figure 3.1.
Figure 3.1. A manual phylogenetic data collection and data analysis process: A) query complete genome sequences given a taxon; B) query protein coding genes for each genome sequence; C) eliminate vector sequences; D) sequence alignment; E) phylogenetic analysis.
A) Biologists send a query to a data provider, NCBI for example, through a
web-based interface to retrieve the whole genome sequence of a specified taxon.
After recording the query terms and results, the investigator must examine the list
of sequences, delete inappropriate entries and then add new entries based on their
knowledge of plastid phylogenomics or from sequences generated in their own lab.
B) For each whole genome sequence, biologists need to find specific protein
coding genes, or the specific subunits of protein coding genes, or specific active
sites within a specific gene or subunit. This is an iterative process for each entry
in the list.
C) Each nucleotide sequence must be checked for vector sequences, a com-
mon contaminant of nucleotide sequences in unvetted public databases, and any
detected vector contaminants removed.
D) Biologists then choose a subset of these genes and use a sequence alignment
program, (e.g. ClustalW), to align the sequences. After viewing the results,
biologists may decide to choose another subset for sequence alignment analysis or
continue the comparison using phylogenetic tree building tools.
E) Once the initial sequence alignment results prove satisfactory, biologists
convert the alignment output to the appropriate data format required by the
phylogenetic analysis programs, such as PAUP or Phylip.
3.3.2 Operational barriers
The data retrieval and data analysis processes need to be repeated multiple
times, as different hypotheses are evaluated and new data pours into the public
databases. From an operational perspective, this repetition makes the research
process time consuming or even impossible using manual approaches. Other bar-
riers also make this particular scientific research process even more difficult.
• Data collection The capabilities offered by a data retrieval system can-
not always meet the requirements of scientists. Entrez [61] is a web-based
data retrieval system available from NCBI that provides integrated access
to multiple databases covering a variety of data domains, including com-
plete genomes, nucleotide and protein sequences, gene sequences, three-
dimensional molecular structures, literature, and more. However, sometimes
scientists are not able to get desired information with a simple query. For
instance, “find all of the subunits for the plastid ATP synthase” requires
that the investigator first identify the official protein names of all subunits
of which there are many (atp alpha, atp beta, atp gamma, atp delta, atp epsilon, and so on) for the plastid-specific ATP synthase. The next step is
to retrieve these sequences for each new genome and to merge these data
with the data previously retrieved.
• Analysis tool usage Each data analysis program may have different requirements for input data formats, even among programs providing similar functionality. Correct use of these programs and correct implementation of this
workflow relies heavily on the researcher having detailed knowledge and un-
derstanding of each tool. A typical work unit might be: “find all of the
sequences for atp synthase alpha subunit that are most similar to the atp
alpha synthase sequence found in Prochlorococcus, align the sequence using
clustalW, save that output, then reformat the data and submit the sequences
to Phylip for phylogenetic analysis”. The output from one data analysis pro-
gram needs to be fed into the next program as its input with appropriate
conversion to the required data format. The rapid development of new data
analysis tools and the refinement of existing tools make the manual data
conversion process even more difficult.
• Experimental record keeping Accurate recording of an in silico investigation,
including materials, methods, and results is as important as accurate record-
ing of bench top or field experiments. Keeping the provenance data, includ-
ing the input, output, and intermediate data sets is also critical. Manual
organization of these metadata quickly approaches impossibility for anything
but the most trivial of queries.
An easy-to-use environment is essential to support the automation of deep phylogenetic analysis. For many years the data were sparse. Now mountains of data exist, but our limited 20th-century tools do not properly equip us to mine for the gems within them. Automation has become necessary.
3.4 System architecture
The whole system, MoGServ, includes an underlying infrastructure, the MoGServ middle layer, and a web-based environment that provides an easy-to-use interface for scientists to access functions provided by the middle layer. The system acts
as both service consumer and service provider in the context of SOA. While it
consumes and aggregates services provided by other service providers, the system
also provides services that can be used and integrated by other applications.
There are two roles in the design and implementation of the system, end-
users and software developers. End-users are biologists who focus on the study
of what information needs to be gathered and what data analysis needs to be
performed. The software developers are responsible for several tasks based on end-users' requirements: collecting and annotating available services; creating services
to implement functions in the specific application; building workflows to automate
a variety of tasks required by end-users; providing a flexible, high performance,
fault-tolerant infrastructure to execute the workflows; providing a mechanism for
end-users to keep track of the origin of the data (data provenance); and providing
end-users a web interface to configure a task, monitor the execution status, and
view results. An overview of the MoGServ system architecture is given in Figure
3.2.
Figure 3.2. The MoGServ system architecture includes a services access client (web interface and applications), the MoGServ middle layer (application server with data access services, data analysis services, job manager, job launcher, service/workflow registry, metadata search, local data storage, and workflow/SOAP engines), and external data/services providers (NCBI, DDBJ, EMBL, and others).
3.4.1 Data storage and access service
Data collection from multiple distributed data resources is one of the first steps
of a bioinformatics research project. In the MoG project, an in silico experiment
involves the collection of large data sets, a computational and memory intensive
process that involves daily checking for new information and quality control for
each new sequence detected. Some data service providers limit the number of
connections to their data server for performance concerns. The refresh rate of the
data in a data source is much lower than the rate of end-user requests for the data.
Therefore, a local data storage is required to store biological data collected from
remote data providers, to avoid repeated vetting of the same data, and to ensure
access to the data for time sensitive projects. The biological data from remote data
sources is gathered, aggregated, and integrated into the local database through a
set of data access services.
An in silico experiment also requires the integration of results from numerous
data analysis tools. Recording the intermediate data in the local database allows
MoGServ to preserve the data provenance and provides opportunity for end-users
to keep track of where a piece of data has come from. The information stored in
the local database can be accessed through a set of data access services.
3.4.2 Service and workflow registry
A service and workflow registry provides a repository to store descriptions of
services and workflows that may be used in a phylogenetic study. These services
and workflows include both locally constructed and preexisting services. The reg-
istry also provides functions to allow inquiries about services or workflows. In the
first prototype, neither a UDDI-based registry nor semantic-based descriptions are employed: a UDDI-type registry is business-oriented and not a perfect fit for this application, while a semantic-based description first requires the time-consuming definition of a commonly accepted ontology. The current registry is a simple table focused on capturing both functional and non-functional properties of services and workflows to support service selection, service and workflow enactment, and
provenance data representation. Semantic-based description and inquiry provides
the attractive capability of automating service discovery and will be used in the
TABLE 3.1
ATTRIBUTES FOR SERVICES AND WORKFLOWS DESCRIPTION

Attribute           Description
id                  a unique sequence number assigned to the service/workflow during the registration process
name                the name of the service or workflow
text description    description of the functions provided by the service or workflow
location            the URL of the definition of the workflow or the WSDL location of the service
input/output        description of input/output parameters
provider            the name of the service or workflow provider
version             the version of the service or workflow implementation
algorithm           the algorithm used in the service or workflow implementation
invocation method   the method used to execute the service or workflow
next version of MoGServ. The description of a service or workflow includes at-
tributes as shown in Table 3.1.
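The registry row described in Table 3.1 maps naturally onto a small value class. The sketch below is illustrative only; the actual registry is a database table, and all names here are our assumptions:

```java
// Illustrative sketch: one row of the service/workflow registry (Table 3.1).
// This class and its field names are hypothetical, mirroring the table.
public class RegistryEntry {
    public final int id;                  // unique sequence number from registration
    public final String name;             // service or workflow name
    public final String description;      // textual description of the function
    public final String location;         // workflow definition URL or WSDL location
    public final String inputOutput;      // description of input/output parameters
    public final String provider;         // provider name
    public final String version;          // implementation version
    public final String algorithm;        // algorithm used (e.g. ClustalW vs. SAM)
    public final String invocationMethod; // how the service/workflow is executed

    public RegistryEntry(int id, String name, String description, String location,
                         String inputOutput, String provider, String version,
                         String algorithm, String invocationMethod) {
        this.id = id; this.name = name; this.description = description;
        this.location = location; this.inputOutput = inputOutput;
        this.provider = provider; this.version = version;
        this.algorithm = algorithm; this.invocationMethod = invocationMethod;
    }

    // A consumer may prefer a particular algorithm (see below).
    public boolean usesAlgorithm(String wanted) {
        return algorithm.equalsIgnoreCase(wanted);
    }
}
```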
When end-users view results from their experiments, they may ask: “which algorithm was used to generate the data, and what is the source of the data?” Service consumers may prefer a service or workflow based on their
preference for a particular algorithm or provider. For example, a sequence align-
ment service can be implemented using the Sequence Alignment and Modeling
System (SAM) or ClustalW.
3.4.3 Indexing and querying metadata
The data is best managed with a relational database; however, for searching
purposes, an indexer is more efficient. We identify and extract metadata about the actual data (sequences, experiments, and service and workflow descriptions) in the local database. For example, the metadata of a gene sequence includes the gid, accession number, name of the sequence, source organism, and taxonomy. An experiment can generate results that lead to new or more detailed information
requirements and a new series of experiments. End-users may need to know the
origin of a piece of data: “which query was used to get this subset of sequences,
when was the data generated, what process was used to generate the results”. This
may lead to new experiments using different data sets or even different methods.
These metadata are extracted and indexed by a metadata indexing service.
This service is triggered when new data is added into the database. A metadata
searching service provides functions to query an index.
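In MoGServ the indexing is handled by Apache Lucene (Section 3.5.1). The toy in-memory inverted index below illustrates only the idea behind the indexing/search service pair; every name in it is ours, not the system's:

```java
import java.util.*;

// Toy inverted index illustrating the metadata indexing/search services.
// The real system uses Apache Lucene; this sketch is for exposition only.
public class MetadataIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Triggered when a new record (e.g. a gene sequence's metadata) is stored.
    public void add(String recordId, String metadataText) {
        for (String term : metadataText.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(recordId);
            }
        }
    }

    // The query service: ids of records whose metadata contains the term.
    public Set<String> query(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```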
3.4.4 Service and workflow enactment
The system supports both synchronous and asynchronous invocation methods. Synchronous invocation is mostly used for services or workflows with short running times, e.g. querying sequence data or job information in a local database.
Asynchronous invocation is used for executing long-running services and workflows. As shown in Figure 3.3, the job manager accepts the input parameters of the
service/workflow, service/workflow id, and timer. The definition of the services
and workflows is found in the registry. A job definition including the services or
workflows URL, input parameter, timer, and other metadata of the job informa-
tion (such as when and who submitted this job) is stored in the database. A job
id used to identify the job is generated. The job launcher periodically checks the
database to retrieve any service or workflow which needs to be executed at that time point.
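The asynchronous path described above can be sketched as follows. The class and method names are assumptions for illustration, not MoGServ's actual API, and a Map stands in for the database:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the asynchronous invocation model (Figure 3.3); hypothetical names.
public class JobManager {
    public static class JobDescription {
        public final int jobId;
        public final String workflowUrl;  // resolved from the registry
        public final String input;        // XML input document
        public final long runAt;          // timer: when to execute
        public String status = "SUBMITTED";
        JobDescription(int jobId, String workflowUrl, String input, long runAt) {
            this.jobId = jobId; this.workflowUrl = workflowUrl;
            this.input = input; this.runAt = runAt;
        }
    }

    private final AtomicInteger nextId = new AtomicInteger(1);
    // Stands in for the database table of job descriptions.
    private final Map<Integer, JobDescription> jobStore = new HashMap<>();

    // Accepts the workflow location, input, and a timer; returns a job id.
    public int submit(String workflowUrl, String input, long runAt) {
        int id = nextId.getAndIncrement();
        jobStore.put(id, new JobDescription(id, workflowUrl, input, runAt));
        return id;
    }

    // The job launcher periodically calls this to find jobs due for execution.
    public List<JobDescription> dueJobs(long now) {
        List<JobDescription> due = new ArrayList<>();
        for (JobDescription j : jobStore.values()) {
            if ("SUBMITTED".equals(j.status) && j.runAt <= now) due.add(j);
        }
        return due;
    }
}
```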
Multiple workflow engines are deployed on different nodes to prevent single
engine failure and achieve higher performance. A similar mechanism is used for
deploying long running services to prevent service failure. Each node hosts a
service that is responsible for returning the current load information of the node.
This information is used by the job launcher to dispatch a job to an optimal node.
With the SOA, it is easy to distribute and invoke workflows and services remotely.
The execution status of the workflow or service is recorded into the database as
an attribute of a job description. This information can be used for implementing
failure recovery functions, such as restart. The job information accessed through
data access services allows end-users to monitor the execution and view the results.
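The load-based dispatch reduces to an argmin over the loads reported by each node's load service. A minimal sketch, with hypothetical names:

```java
import java.util.*;

// Sketch of dispatching a job to the least-loaded engine node.
// Each node hosts a service reporting its current load; names are ours.
public class NodeSelector {
    // Given node -> reported load, pick the node with the smallest load.
    public static String leastLoaded(Map<String, Double> loads) {
        String best = null;
        double bestLoad = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : loads.entrySet()) {
            if (e.getValue() < bestLoad) {
                bestLoad = e.getValue();
                best = e.getKey();
            }
        }
        return best; // null if no node responded
    }
}
```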
Figure 3.3. Asynchronous service and workflow invocation model: the job manager accepts input (parameters, task name, timer), finds the service/workflow definition in the registry by task name, forms a job description, and returns a job ID; the job launcher dispatches jobs to instances of the workflow/service engines.
3.5 Implementation
3.5.1 Development and deployment tools
Among a large number of programming platforms for web services development and deployment, Microsoft’s .NET and Sun’s J2EE are typically the two main choices for application and middleware developers. In consideration of future extensions of the system, as well as our previous experience with Java, the J2EE-based platform appeared more suitable for MoGServ. In particular, Apache’s open source tools, Tomcat (5.0.18) and Axis (1.2RC2), are used.
Tomcat/Axis are active projects with support from the open source commu-
nity. Another open source software tool, Eclipse, is used to develop the web
interface for the system.
There are more than a dozen proposed languages to coordinate messaging and
transactions among independent web services. The business process execution
language for web services (BPEL4WS) is a promising workflow language since it
has wide support from IBM, Microsoft, and BEA. Several workflow enactment
engines, such as BPWS4J, Collaxa, and ActiveBPEL, are already in place to support the execution of workflows. While a business-oriented workflow language and corresponding execution engine can be used in the scientific domain [20], the Taverna
[65] project possesses more attractive features and naturally fits the development
of our system. The Taverna project is open source and a part of the myGrid
project developed in the e-Science community to support data-intensive in sil-
ico bioinformatics experiments. The Taverna workbench provides a graphical
tool for building, editing, and browsing workflows and generates an XML-based Simple Conceptual Unified Flow Language (Scufl) document. The embedded workflow execution engine, Freefluo, facilitates testing during the development process.
Freefluo, a Java workflow enactment engine, which supports the Scufl specifica-
tion, coordinates execution of the parallel and sequential activities in the workflow
and supports data iteration and nested workflows. The enactor can invoke arbi-
trary WSDL type service operations as well as more specific bioinformatics service
operations such as Soaplab and BioMoby.
Apache Lucene [51] is used in our system for building a search engine to sup-
port full-text search on sequence data, intermediate data results, and job infor-
mation stored in the local database. Since Lucene is a search engine library
written entirely in Java instead of a command line toolkit, it provides flexibility
to write a variety of applications with rich search capabilities. These capabili-
ties include ranked searching, phrase queries, wildcard queries, proximity queries,
fielded searching, and so on.
PostgreSQL(8.0) is used to store all the intermediate data results, job infor-
mation, sequence data, and services/workflow descriptions.
3.5.2 Services provision
We create web services using the RPC style due to its easy implementation
with full support from most tools. As most bioinformatics applications take a
number of input parameters and produce a number of outputs, we use an XML
document to represent the input/output of a service for which a large number
of parameters are needed. The XML document is provided as a single input
parameter to the service or workflow and the output results are produced as a
single XML document. Using this method, the service consumers themselves
create a valid and accurate XML document for input while service providers parse
the XML and extract the input parameters.
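The single-XML-document convention can be illustrated with the JDK's own DOM API. The element names below ("input", "param") are invented for the example and are not MoGServ's actual schema:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: a service provider parsing the single XML input document into
// named parameters. The document structure here is hypothetical.
public class XmlParams {
    public static Map<String, String> parse(String xml) {
        try {
            Map<String, String> params = new HashMap<>();
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getDocumentElement().getElementsByTagName("param");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                params.put(e.getAttribute("name"), e.getTextContent());
            }
            return params;
        } catch (Exception ex) {
            throw new RuntimeException("invalid input document", ex);
        }
    }
}
```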
Multiple services are created and deployed on the Tomcat/Axis server using
the Java2WSDL and WSDL2Java toolkits. Individual services can be invoked
statically or dynamically through a client side application. They can also be
used as building blocks in the workflow creation process. We separate the services provided in the first prototype into the following categories.
Data collection The original data source is NCBI. NCBI’s Entrez Programming
Utilities (eUtils) provide access to Entrez data outside of the regular web query
interface and help in retrieving search results for future use in other environments.
With the eUtils SOAP interface, we create services to get data, such as complete
genome sequences and specific genes of interest.
Query local database All the intermediate data and job information are stored
in the local database to help biologists keep track of the data provenance and mon-
itor the job execution. Also in this particular application, biologists are interested
in selecting sequence subsets from the local database and using sequence align-
ment services to do preliminary comparisons. A set of services are implemented
to query desired information.
Indexing and querying metadata The creation and update of each of these
indices is done by a service operation. The index service is triggered whenever
new data is stored in the database. The query service accepts a query string and
an index name to search the index and return output.
Data format services Each particular data analysis tool used in a bioinformat-
ics study requires a specific data format as input. A set of data format services in
the system is implemented to convert data into an appropriate format. This type
of service can be used in a workflow creation process or used explicitly.
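As an illustration of such a conversion, the sketch below reshapes FASTA records into a simple sequential PHYLIP-style layout. It is deliberately simplified; a production converter must also handle interleaving, duplicate names, and alphabet checks:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a data format service: FASTA -> sequential PHYLIP-like output.
// Simplified for exposition; real converters handle many more cases.
public class FastaToPhylip {
    public static String convert(String fasta) {
        List<String> names = new ArrayList<>();
        List<StringBuilder> seqs = new ArrayList<>();
        for (String line : fasta.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                names.add(line.substring(1).split("\\s+")[0]);
                seqs.add(new StringBuilder());
            } else {
                seqs.get(seqs.size() - 1).append(line);
            }
        }
        StringBuilder out = new StringBuilder();
        // PHYLIP header: number of taxa, then sequence length.
        out.append(names.size()).append(" ").append(seqs.get(0).length()).append("\n");
        for (int i = 0; i < names.size(); i++) {
            // PHYLIP names are padded/truncated to 10 characters.
            out.append(String.format("%-10.10s", names.get(i)))
               .append(seqs.get(i)).append("\n");
        }
        return out.toString();
    }
}
```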
Data analysis services Many existing data analysis tools in bioinformatics re-
search are available as command line applications. The creation of a data analysis
service is a process of wrapping these toolkits as web services. JLaunch [42] is a lightweight Java library for launching command line applications from Java programs.
With the JLaunch library, we can write Java programs to execute any type of
command line programs.
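JLaunch's role can be approximated with the JDK's ProcessBuilder. This sketch is our own stand-in, not JLaunch's API: it runs a command-line tool and captures its standard output:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

// Sketch of wrapping a command-line analysis tool: run it, capture stdout.
// The system uses the JLaunch library; this ProcessBuilder version is ours.
public class ToolRunner {
    public static String run(List<String> command) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // merge stderr into stdout
            Process p = pb.start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) out.append(line).append("\n");
            }
            p.waitFor();
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException("tool execution failed", e);
        }
    }
}
```

A web service method wrapping ClustalW, for example, would build the command list from the parsed input parameters and return the captured output (or a reference to it) in the result document.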
3.5.3 Workflow engine
The Freefluo workflow engine is deployed on an application server. The invocation of the workflow engine is done by generating a local stub specific to the Freefluo web services API. The local stub is implemented as part of the job launcher in our system.
The execution of a workflow on the Freefluo engine proceeds in the following steps: 1) obtain a proxy to the remote Freefluo server; 2) create a Scufl model; 3) pass an XScufl workflow to the Scufl model and form the input using the Baclava data model, a representation of Taverna data types 2; 4) compile the XScufl workflow as
a workflow instance; 5) execute the workflow instance and obtain an ID from the
server; 6) poll the Freefluo engine until the execution has completed; 7) retrieve
a list of outputs from the server; 8) extract the required output from the Baclava
data model; and 9) destroy the workflow instance.
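The nine steps above can be condensed into a short orchestration loop. The Engine interface here is a hypothetical stand-in for the Freefluo web services API, whose real method names differ:

```java
import java.util.Map;

// Sketch of the nine-step invocation sequence against a hypothetical
// engine interface (the real Freefluo API uses different operations).
public class WorkflowRunner {
    public interface Engine {
        String compile(String scuflXml);                  // steps 2-4: model + compile -> instance id
        void execute(String id, Map<String, String> in);  // step 5: start execution
        boolean isFinished(String id);                    // step 6: polled by the client
        Map<String, String> outputs(String id);           // steps 7-8: retrieve outputs
        void destroy(String id);                          // step 9: release the instance
    }

    public static Map<String, String> run(Engine engine, String scuflXml,
                                          Map<String, String> inputs) {
        try {
            String id = engine.compile(scuflXml);
            engine.execute(id, inputs);
            while (!engine.isFinished(id)) {
                Thread.sleep(100); // poll until the execution has completed
            }
            Map<String, String> out = engine.outputs(id);
            engine.destroy(id);
            return out;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```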
3.5.4 Building workflows
A Scufl workflow represents a procedure as a set of processes and the rela-
tionships between these processes. Our workflow design uses available services
as building blocks whenever possible and creates new ones when necessary. The
2 http://taverna.sourceforge.net/index.php?doc=usingbaclava.html
Taverna workbench provides a graphical tool to build and test workflow as well
as a number of integrated bioinformatics services. The Scufl language has some
useful features such as implicit iteration and conditional branching that are most
important for building workflows in this application. During the construction of
workflows, we often encounter the case that the output of one service cannot be fed directly into the input of the next chosen service. One approach we take is to
create a new service, such as the type of data format service described above, and
expose it in the same way as other services. An alternative approach provided
in the Taverna workbench is to use the Beanshell scripts [4] to convert the out-
put to appropriate input. We create a number of workflows using the Taverna
workbench to support the research. One example is shown in Figure 3.4. It is
a workflow used to retrieve a complete genome sequence and particular gene se-
quences from the NCBI site. The workflow accepts two inputs, the query term and
the particular gene group. The service genome gids by terms returns a String
of gids and a Beanshell script converts the String to a list of gids. The service
Get Nucleotide Fasta, a third party service, accepts a gid and returns a sequence
in fasta format. The implicit iteration method in the Xscufl workflow enables it-
eration for all the gids in the list. With the service-oriented architecture, the same
services can be used for different workflows, minimizing the need to create new
services.
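The Beanshell shim in this workflow does little more than split the gid String. The same conversion in plain Java (our sketch, not the workflow's actual script):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the shim between the gid-listing service (one String of gids)
// and the fasta-retrieval service (one gid per implicit iteration).
public class GidShim {
    // Split a delimiter-separated gid String into a list, dropping blanks.
    public static List<String> toList(String gids) {
        List<String> out = new ArrayList<>();
        for (String g : gids.split("[,\\s]+")) {
            if (!g.isEmpty()) out.add(g);
        }
        return out;
    }
}
```

The workflow engine's implicit iteration then invokes the downstream service once per element of the resulting list.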
3.5.5 Web interface
The web interface provides scientists a convenient interface to configure their
tasks, monitor the job execution status and view results. It is implemented
with a number of server side JSPs (Java Server Pages). The returned results
are transformed with appropriate XSLT to HTML pages. The service-oriented
architecture provides flexibility of building the front-end web application with dif-
ferent languages, e.g. Perl, and deploying on a different web service engine, e.g.
Apache/SOAP::lite.
3.6 Discussion
Although current development and deployment tools have not implemented all the features promised by service-oriented architecture specifications, they are actively evolving in that direction. In particular, Apache Tomcat/Axis, the
Taverna workbench, and Freefluo engine enabled the implementation of our first
prototype.
In general, SOA offers considerable benefits for building the system: 1) The
loosely coupled feature of SOA facilitates the distribution of computational in-
tensive processes across multiple nodes; 2) The platform independent feature
of SOA facilitates the integration of data from heterogeneous data resources
through distributed web services; 3) The composition-of-services feature allows
reuse of a service in multiple workflows, minimizing the need to create new services; and 4) SOA also provides flexibility for building the front-end web application with different languages, e.g. Perl, and deploying it on a different web service engine, e.g. Apache/SOAP::lite.
While we believe a simple SOA is appropriate in the design and
implementation of our system, there are various aspects of the system that need
to be improved. We summarize issues and the directions to enhance the system
in this section.
3.6.1 Issues with the first prototype
Security Although security was not our major concern during the first proto-
type implementation, it is an important component in the next implementation.
Services and workflows provided in the system allow users to access the compu-
tational and data resources in the system with no restrictions. A certain level
of security is required to prevent abuse of the system and to protect sensitive
data and analysis results. An authorization component should be built in the sys-
tem to enable users to access the permitted services and to personalize their own
workspace. A web portal will be built to enable users to create an account, login
and logout with username and password. The user account information including
the access level will be stored in a database. The GridSphere portal framework
[39], an open-source portlet based web portal, is one of the candidates.
Service and workflow description and selection In the first prototype
implementation, the same development group acts in both the service provider (services/workflow creation) and service consumer (building the web-based application using these services and workflows) roles. Since there is thus no demand for supporting the selection of appropriate services/workflows, the major capability of the index-based services/workflows registry is to keep track of data provenance and to provide definitions for executing services/workflows.
However, the index-based syntactic description of services/workflows provides limited flexibility for third-party service consumers to choose appropriate services/workflows provided in the system and to integrate them into their applications without prior knowledge.
Failure tolerance and recovery The workflow or service execution may fail
at some point due to the failure of the enactment engine, failure of the service,
and failure of the network fabric [64]. Our system handles these failures during
the static workflow design stage and services or workflows invocation stage.
Multiple workflow engines and long-running services are deployed at different physical locations. This allows a submitted task to be invoked on the most idle site to achieve higher performance. More importantly, this approach can prevent
dispatching services/workflows to the engine with a physical failure. Recording
execution status of long running services/workflows in the database allows us to
add policies for determining if a failed service/workflow should be restarted. The
Taverna workbench and Xscufl provide a capability that allows users to specify
an alternate service and to configure basic fault tolerance mechanisms during the
workflow design stage, which can prevent the failure of services to a certain degree.
Another more promising, yet more complicated approach for failure recovery
is to support the dynamic selection of alternate services during execution time.
However, the implementation of this feature requires services to be described in
rich semantic formats using a widely accepted ontology.
Data provenance In the system, the metadata description of sequence, job
information, and services/workflows are stored in the database. A set of indexing
and querying services allows end-users to trace the origin of the data, which is
a desired feature for scientists. Also, the workflow engine and XScufl provide
mechanisms to record more detailed information including the type of processor,
status, start and end time, and a description of the service operation. A sys-
tems administrator may be interested in using this information to investigate how
results, in particular erroneous or unexpected ones, were produced by workflow
processes.
3.6.2 Extension of the system
Although the first prototype of the system focuses on design and implemen-
tation based on relatively mature technologies in service-oriented architecture,
we are extending the system to address some issues described above with grid
computing and semantic web technologies.
Grid technologies specify the mechanisms for distributed resource manage-
ment, coordinated fail-over, and security. As the Grid technologies, and Grid
framework Globus toolkit [97] in particular, are evolving towards the OGSA stan-
dard, integration of the Grid technologies into the system can help address some
issues discussed above. The convergence of service-oriented architecture and Grid
technology allows us to enhance the system through the integration of existing
components.
In a scientific domain, the process used to generate the output of a service and
workflow is often as important as the result. As is the case with bench scientists,
in silico investigators will decide for themselves which methods and which data
will be used for their study as well as what kind of outputs they are expecting.
In the first prototype implementation, this requirement is satisfied through close
collaboration among team members.
As this system will be used by a phylogenomics research community that
spans multiple disciplines, different investigators will have their own methods for
approaching problems of common interest. A mechanism that allows end-users
to define the workflow at a higher level of abstraction is required. Instead of
choosing specific services to form a workflow, scientists would rather define a
workflow by specifying functions that a service should provide. Different levels of
training and experience also require different levels of abstraction. For example, a
graduate student in a particular research domain may have limited knowledge of
the methods available to perform an experiment, while an experienced investigator
may know ahead of time which building blocks are required and which approach
is most efficient for the scientific hypothesis to be tested. We represent different
abstraction levels in Figure 3.6. End-users may need to define the workflow at
any one of these four stages based on their knowledge of provided services.
A concrete workflow, which can be sent to a workflow engine, is represented at the fourth phase. The conversion from the third phase to the fourth phase involves choosing an instance of a service using Quality of Service (QoS) metrics. One service interface may have multiple implementations provided by different service providers. These implementations have different quality properties, such as trustworthiness, cost, and execution time. An optimal service should be chosen during this conversion process. The conversion from the second phase to the third phase requires mapping a particular task to a service, or to a sequence of multiple services.
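As an illustration of this conversion step, QoS-based instance selection might be sketched as a weighted scoring over candidate implementations. The `ServiceInstance` class, its fields, and the weights below are hypothetical, not part of MoGServ:

```java
import java.util.*;

// Hypothetical sketch: choosing an "optimal" service instance by a weighted
// QoS score. Class and field names are illustrative only.
public class QosSelector {

    static class ServiceInstance {
        final String endpoint;
        final double trust;      // 0..1, higher is better
        final double cost;       // arbitrary units, lower is better
        final double execTime;   // seconds, lower is better

        ServiceInstance(String endpoint, double trust, double cost, double execTime) {
            this.endpoint = endpoint;
            this.trust = trust;
            this.cost = cost;
            this.execTime = execTime;
        }
    }

    // Score = weighted sum; cost and time are subtracted so lower is better.
    static ServiceInstance selectBest(List<ServiceInstance> candidates,
                                      double wTrust, double wCost, double wTime) {
        ServiceInstance best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (ServiceInstance s : candidates) {
            double score = wTrust * s.trust - wCost * s.cost - wTime * s.execTime;
            if (score > bestScore) {
                bestScore = score;
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<ServiceInstance> impls = Arrays.asList(
            new ServiceInstance("http://provider-a.example/align", 0.9, 5.0, 30.0),
            new ServiceInstance("http://provider-b.example/align", 0.7, 1.0, 10.0));
        // Weighting execution time heavily, provider B wins despite lower trust.
        System.out.println(selectBest(impls, 1.0, 0.1, 0.05).endpoint);
    }
}
```

Changing the weights changes which implementation is chosen, which is the essence of the third-to-fourth-phase conversion.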
This mapping process can be accomplished manually by software developers
in an ad-hoc way, like the approach we took in the implementation of the first
prototype. This approach relies heavily on developers’ knowledge of services and
logical ordering in the workflow.
Preferably, this process should be done partially or wholly automatically. To support such semi-automatic or automatic processing, a complete representation of knowledge must be in place so that software agents can substitute for human effort. Using semantic web technology, in particular OWL and OWL-S, to build an ontological representation of domain knowledge and semantic descriptions of services is a promising approach. Semantic web technology offers promising features for supporting bioinformatics research [12]. Some
bioinformatics middleware, such as the myGrid and BioMoby projects, have their
own approaches to support automated discovery and composition of services using
semantic web technology [49]. Much research has been done exploring AI planning
techniques for automating the composition process. In the long term, a successful composition mechanism should meet several requirements: connectivity, quality of service, correctness, and scalability [58].
Although there are still practical difficulties in developing semantic web services, we believe that the appearance of tools for creating ontologies and annotating services [89], together with the development of widely accepted domain ontologies, will allow us to add semantics to our system and support the automation of the mapping process.
3.7 Conclusion
As both data and tool providers begin to present their resources through web services interfaces, and as open source tools and middleware supporting web services, workflow generation, and enactment become more available, biologists will start to use the available services as well as to provide service access to their own databases and programs for sharing within the bioinformatics community [65]. Our system is a demonstration of progress toward this goal.
In summary, current SOA standards and toolkits are sufficient to build the first prototype of MoGServ. MoGServ is in an early stage of development, with limited services and workflows available. The basic implemented functionality enables users to collect data and perform preliminary data analysis as well as metadata searching.
Using the system, scientists gained scientific insight into the alpha subunit of ATP synthase, finding that it retains the signal of a very ancient line of descent while exhibiting enough polymorphism to infer phylogenetic relationships [78].
Building the system upon SOA gives us the flexibility to integrate services, to build a variety of workflows, and to build a web portal through which scientists access the system via a web interface. New features and services are continuously being
added to the system in response to scientists’ feedback and requirements. The
future direction of our research will be to focus on enhancing the system using
semantic web and grid computing technologies.
Figure 3.4. A workflow built using Taverna workbench to get complete genome sequences and specific gene sequences
Figure 3.5. A workflow for querying two subset sequences from the local database, filtering out sequences coming from the same organism, and doing sequence alignment analysis
Figure 3.6. Abstraction of user defined workflows
CHAPTER 4
EXPLORING THE DEEP PHYLOGENY OF THE PLASTIDS WITH
MOGSERV
In this chapter, we illustrate a research application that uses MoGServ to investigate the deep phylogeny of the plastids and attempts to answer an open question on the phylogenetic history of the plastid genome.
4.1 Introduction
Plastids are important organelles found only in plants and algae. Chloroplasts
are the photosynthetic form of a plastid. Like mitochondria, plastids have their own DNA and are involved in energy metabolism. Other forms of a
plastid may be responsible for storage of products like starch and for the synthesis
of many classes of molecules such as fatty acids which are needed as cellular
building blocks and/or for the functioning of the plant.
Phylogenetics is the study of the evolutionary relationships among various groups of organisms. The origin and evolution of a group of organisms is called its phylogeny or phylogenesis.
The endosymbiont hypothesis suggests that mitochondria were free-living bacteria that were engulfed and subsequently enslaved by a primitive ancestor of all living eukaryotes [27, 69]. Between 1.2 and 1.5 Ga (billion years ago), one
or more of these early eukaryotic cell lineages captured a cyanobacterium and
produced three primary plastid lineages: green plant lineage (chlorophytes), red
algal lineage (rhodophytes), and glaucophyte lineage (a group of freshwater algae)
[69]. Surviving endosymbionts include the green algal and red algal photosynthetic chloroplasts and the cyanelle, the endosymbiont in the glaucophytes that retains more of the character of its cyanobacterial progenitor. Plastids have also spread
by secondary endosymbiosis, in which a cell engulfs a cell already containing an endosymbiont. In secondary endosymbiosis, the nuclear genome of the engulfed cell usually disappears. Seven lineages were produced from green algae and red algae in secondary endosymbiosis [69]; see Figure A.1. During the evolution of these secondary plastids, their genomes were reduced by gene transfer into the nucleus [77]. The red algal lineage also includes organisms that have lost the capacity to photosynthesize but still retain a degenerate plastid. Apicomplexans, produced from the red algal lineage, are non-photosynthetic intracellular parasites whose members include Toxoplasma gondii and Plasmodium falciparum.
Plasmodium falciparum (P. falciparum) is a protozoan parasite, one type of apicomplexan, which causes malaria in humans. P. falciparum has three genomes: nuclear, mitochondrial, and plastid (apicoplast). Phylogenetic analysis of plastid genes provides a new avenue for targeted antiparasitic drug design [31].
Organisms generally inherit genes from their parents (Vertical Gene Trans-
fer), or receive genes from other organisms through Horizontal Gene Transfer
(HGT) and Lateral Gene Transfer (LGT). Most plastid genomes are circular, do not recombine, and are inherited through only one parent. The highly conserved character of the plastid genome makes phylogenetic analysis possible. However, HGT and LGT in multiple endosymbiotic events complicate the phylogeny
of plastids. Another complication is gene duplication and loss within the plastid
itself. While there is a broad consensus that all plastids are descended from a sin-
gle endosymbiont ancestor, some researchers also suggest an alternative hypothesis
of multiple origins that is “at least equally consistent in most cases” [95]. Plastid
phylogenetic analysis must account for multiple endosymbiotic events, superim-
posed upon a process of LGT that occurs throughout the process of converting a
free-living cell to an endosymbiont. Accumulating and analyzing enough data to rigorously test the single ancestor hypothesis is a promising research direction.
The development of advanced sequencing techniques makes a large amount of
DNA and amino acid sequences available for phylogenetic analysis. The commonly
used methods for inferring phylogenies include parsimony, maximum likelihood,
and Bayesian inference. The rate of data accumulation, the rapid development of new phylogenetic analysis tools, and the refinement of existing tools, however, make manually collecting and analyzing these sequences difficult. For example, the number of cyanobacterial sequences in the NCBI database increased from 42 to 57 within about six months (June 2006 - December 2006). Figure 4.1 shows the growth of sequence databases over the last few years [57].
In this chapter, we describe a scientific application that uses this cyberinfrastructure to collect and analyze data and to gain biological insight from the analysis. The use of a web-based system, MoGServ, significantly increases a scientist's productivity over a manual process.
Figure 4.1. The growth of sequence databases (NCBI GenBank and EBI Swiss-Prot) and annotations. This figure is from Folker Meyer [57]
4.2 System and methods
MoGServ is a service-oriented environment, described in detail in Chapter 3. It facilitates scientific research and discovery in several ways:
• Easy and rapid extraction of DNA and protein sequences from public databases into a local database, saving scientists months of repetitive searching, downloading, and data management.
• Painless reformatting of the extracted data for commonly used analytical
tools.
• Preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates, enabling the investigator to rapidly determine the suitability of the chosen gene for deep phylogenetic analysis.
• User-specified additions to the local database, which allow users to upload their own sequences.
• User-specified additions to the automated queries, which provide a free-text searching interface for constructing data sets of interest.
Deep phylogenetic analyses are highly context-dependent. The addition of a
single new cyanobacterial or algal genome would fundamentally change the result.
MoGServ permits an investigator to address these hypotheses using the most
current data available and rapidly reanalyze data as more genomes and genes are
sequenced. This enables rapid hypothesis testing and creates an environment in
which genuine discovery is possible. The most exciting form of discovery is the
surprise result, the result that leads to an entirely new hypothesis.
4.2.1 Data model
As web services technologies have been adopted by several large data source providers, such as DDBJ, EMBL, and NCBI, to expose their data and computational services, accessing up-to-date sequences has become more flexible and feasible. Because the data sources are accessed over the Internet, however, the requirement for efficient and reliable on-the-fly data retrieval cannot be fulfilled easily. Additional requirements for data manipulation and information management cannot be met without local database support. Also, incomplete and incorrect annotation of biological data requires biologists' expertise to ensure accuracy before the data are used for analysis.
In order to provide the capability of storing, integrating, and accessing se-
quences from diverse data sources, a data model needs to be developed to meet
the following requirements:
• Store sequences from distributed data sources and provide general annota-
tion to facilitate querying.
• Ensure the integrity of sequences during the data collection process and the
efficiency of updating the database periodically.
• Provide an easy way for scientists to manipulate their data sets and manage their scientific experiment records and data provenance.
The custom data model of MoGServ consists of four modules: sequence module,
set module, user module, and job module. Figure 4.2 shows the entity-relationship
(ER) diagram. An alternative data model, the Chado database schema, one of the components of GMOD 1, is the foundation of interoperability among GMOD applications. We did not use the Chado database schema because it contains a large number of modules and tables that are not necessary for our system; it also does not model some of the information we are trying to capture in this system.
A sequence in the system is a biological sequence that comes either from a
public data resource or from laboratory experiments uploaded by scientists.
A sequence can be a nucleotide sequence or protein sequence. It can also
be a complete genome sequence or a gene product sequence. Each sequence
can be classified using a taxonomy defined by the public database and terms
that scientists used to find this sequence.
A set is a group of sequences that scientists put together to support their research interests and is usually used for subsequent data analysis. The properties
1. Generic Model Organism Database (http://www.gmod.org/), an open source project to develop a set of software for creating and administering a model organism database.
of a set include not only the sequences in the set but also the provenance of the set. For example, a set may be created by a user querying the local database, or generated from a previous data analysis.
A job is defined as a task involving data collection or data analysis. It contains
input, output, execution status, and other properties.
4.2.2 Services
MoGServ provides a number of services to support deep phylogenetic research. These services are integrated in the system and accessible from a web interface. Each service can also be used as a component in a workflow.
4.2.3 Data collection
Data collection is a suite of services used to retrieve desired data from public data providers into the local database. The data collection service updates the local database periodically. Users can define query terms using any NCBI query format from the web interface provided in the MoGServ system. Users can also specify a particular gene name or gene product, as shown in Figure B.2. The retrieved data is indexed using the Lucene indexer and search engine [51], which supports free-text search. The search syntax is shown in Figure D.4.
The data collection suite consists of five components: retrieve genome sequence, retrieve gene sequence, convert genome sequence to file, convert sequence names, and index sequences. When combined with the database model, these services ensure data integrity, data accuracy, and data consistency, and provide exception handling.
Data integrity: Since users are allowed to use any appropriate query term to search the public database (NCBI), it is likely that the same sequence will be retrieved with different query terms. The classification and taxonomy of sequences in the public data sources can also introduce duplicate sequences. For example, both the query terms “chloroplast” and “cyanobacteria” retrieve the sequence
>gi|72381840|ref|NC 007335.1|Prochlorococcus marinus str. NATL2A.
the query terms “chloroplast”, “cyanobacteria”, and “plastid” all retrieve the sequence
>gi|42592260|ref|NC 003070.5|Arabidopsis thaliana chromosome 1.
and the query terms “apicoplast”, “chloroplast”, and “plastid” all retrieve the sequence
>gi|31442363|ref|NC 004823.1|Eimeria tenella chloroplast.
The design of the data model ensures that the same sequence cannot be inserted into the table twice. However, every query term used to retrieve a sequence is recorded in the query by term field. This information may help scientists better understand the relationships among sequences and discover new insights.
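The duplicate-handling idea can be sketched as follows. The class and method names are hypothetical; the real system enforces uniqueness with a database constraint rather than an in-memory map:

```java
import java.util.*;

// Illustrative sketch: a sequence, keyed by its accession, is stored only
// once, while every query term that retrieved it is accumulated in a
// "query by term" style field. All names are hypothetical.
public class SequenceStore {

    private final Map<String, Set<String>> queryTermsByAccession = new LinkedHashMap<>();

    // Returns true if the sequence was newly inserted, false if it was a duplicate.
    public boolean record(String accession, String queryTerm) {
        boolean isNew = !queryTermsByAccession.containsKey(accession);
        queryTermsByAccession
            .computeIfAbsent(accession, k -> new LinkedHashSet<>())
            .add(queryTerm);
        return isNew;
    }

    public Set<String> termsFor(String accession) {
        return queryTermsByAccession.getOrDefault(accession, Collections.emptySet());
    }

    public static void main(String[] args) {
        SequenceStore store = new SequenceStore();
        store.record("NC_003070.5", "chloroplast");
        store.record("NC_003070.5", "cyanobacteria");  // duplicate sequence
        store.record("NC_003070.5", "plastid");
        System.out.println(store.termsFor("NC_003070.5"));
        // -> [chloroplast, cyanobacteria, plastid]
    }
}
```

The sequence itself is stored once, but all three retrieving terms remain queryable.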
Data accuracy: Since scientists are interested in particular gene sequences that reside within a complete genome sequence, instead of searching the NCBI gene database we parse the XML file of the complete genome sequence to extract particular gene sequences and gene products. In this way, the accuracy of the data can be better guaranteed. The NCBI service
provides the search result in XML format; an example is shown in Figure D.1. The data collection service provided in MoGServ parses the XML file to find the INSDFeature key tag for each INSDFeature and then checks whether it is a CDS (CoDing Sequence, i.e., a region of nucleotides that corresponds to the sequence of amino acids in the predicted protein). The next step is to find the INSDQualifier name and INSDQualifier value pair for
the gene name (e.g., atpD) or gene product description (e.g., ATP synthase subunit B). Gene names and gene product descriptions of scientific interest are defined by users through the web interface. In most cases, gene names are enough to identify the desired CDS. However, due to incomplete annotation, a gene name may not be available for a particular CDS in the nucleotide sequence. The gene product description then becomes another criterion for retrieving the correct CDS; an example of a gene sequence in fasta and TinySeq XML formats is shown in Appendix D.2.
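Assuming the INSD tag layout described above (INSDFeature_key, INSDQualifier_name / INSDQualifier_value), the CDS-extraction step might be sketched as below. The sample XML is hand-written for illustration and is far smaller than real NCBI output; MoGServ's actual parser may differ:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Sketch: walk INSDFeature elements, keep CDS features, and read the gene
// name, falling back to the product description when no gene is annotated.
public class CdsExtractor {

    public static List<String> cdsGenes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> genes = new ArrayList<>();
            NodeList features = doc.getElementsByTagName("INSDFeature");
            for (int i = 0; i < features.getLength(); i++) {
                Element feature = (Element) features.item(i);
                if (!"CDS".equals(text(feature, "INSDFeature_key"))) continue;
                String gene = null, product = null;
                NodeList quals = feature.getElementsByTagName("INSDQualifier");
                for (int j = 0; j < quals.getLength(); j++) {
                    Element q = (Element) quals.item(j);
                    String name = text(q, "INSDQualifier_name");
                    String value = text(q, "INSDQualifier_value");
                    if ("gene".equals(name)) gene = value;
                    if ("product".equals(name)) product = value;
                }
                // Fall back to the product description when no gene name exists.
                genes.add(gene != null ? gene : product);
            }
            return genes;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        String xml =
            "<INSDSeq><INSDSeq_feature-table>" +
            "<INSDFeature><INSDFeature_key>source</INSDFeature_key></INSDFeature>" +
            "<INSDFeature><INSDFeature_key>CDS</INSDFeature_key><INSDFeature_quals>" +
            "<INSDQualifier><INSDQualifier_name>gene</INSDQualifier_name>" +
            "<INSDQualifier_value>atpD</INSDQualifier_value></INSDQualifier>" +
            "</INSDFeature_quals></INSDFeature>" +
            "</INSDSeq_feature-table></INSDSeq>";
        System.out.println(cdsGenes(xml));  // -> [atpD]
    }
}
```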
Exception handling: Since the data is retrieved from a remote data source, NCBI, using web service interfaces, failures may occur because of the network, hardware, or the services themselves. Recording the execution status of a data collection service is important for detecting and recovering from failures. Since the data collection service normally runs periodically as a batch job, we record its status in a log file on the file system. To reduce repetitive work when a failure occurs, we treat the retrieval of a single sequence as a transaction. In other words, we sacrifice the database I/O performance that would be possible with batched transactions.
Data consistency: Data analysis is an important component of MoGServ. Different data analysis tools require different input formats for a set. There are two ways to provide the desired data format: converting the data on-the-fly, or preparing the data and storing it in the database during the data collection process. The first approach is flexible; however, it may lead to inconsistent sequence naming. For example, the same sequence in set A may have a different name in set B. Therefore, we apply a name-mapping algorithm at data collection time: each sequence has a fixed name for each format, and duplicate names are disambiguated by appending a numeric suffix.
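The naming rule can be sketched in a few lines; the method name and suffix format here are illustrative, not necessarily those used by MoGServ:

```java
import java.util.*;

// Minimal sketch: the first occurrence of a name is kept as-is, and
// repeated names are disambiguated with a running numeric suffix.
public class NameMapper {

    public static List<String> uniqueNames(List<String> baseNames) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String name : baseNames) {
            int count = seen.merge(name, 1, Integer::sum);
            // First occurrence keeps the fixed name; duplicates get _2, _3, ...
            result.add(count == 1 ? name : name + "_" + count);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(uniqueNames(Arrays.asList("atpA", "atpB", "atpA", "atpA")));
        // -> [atpA, atpB, atpA_2, atpA_3]
    }
}
```

Because the mapping runs once at data collection time, a sequence keeps the same name in every set and format in which it appears.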
4.2.4 Local query
After the desired sequences are stored in the local database, users need a way to find a subset of interest in order to perform further data analysis. The system provides an interface for users to query the local database using free-text search. The underlying search engine is built with the Lucene search library. The index contains the metadata used to describe a sequence, such as taxonomy, term, and name. For example, a user can issue the query “atp synthase AND B AND plastid” to retrieve a number of sequences (see Figure B.4). Users can manipulate the returned sequences and group them into a set. Users can also download these sequences in a variety of formats.
4.2.5 Set management
To help scientists prepare data sets for subsequent data analysis, MoGServ provides the following set management services:
Create set: With an appropriate query to the local database, users can inspect the list of sequences returned and delete undesired ones. Users can create a new set from these sequences, or add them to an existing set.
Upload set: Users can upload a set of sequences in fasta format into the local
database. These sequences can be from users’ own lab experiments, which
may not be ready to submit to the public database. They can also be a
small number of sequences not in the local database at that time. These
sequences are annotated using the appropriate metadata description.
Show set: Users can query the information of a set, as shown in Figure B.6, such as the creation date, the origin of the set, etc.
Download set: Users can download a set in a variety of formats, such as fasta and NEXUS.
Set filter: This service finds the intersection of the organisms (species) across a number of sets containing gene or protein sequences from different species. The purpose of this service is to help scientists prepare data to determine whether the gene genealogies for different subunits are different. For example, scientists may be interested in determining whether the gene genealogies for the subunits α, β, γ, δ, and ε of ATP synthase CF1 differ. The first step is to form five sets using queries such as “ATP AND synthase AND delta AND CF1.” The set filter service is then used to find all organisms (species) that contain all of the gene or protein sequence types. These sequence sets are used in subsequent data analysis, such as using ClustalW to construct phylogenetic trees. While constructing a phylogenetic tree based on a single gene or protein taken from a group of organisms (species) can be problematic, an analysis based on multiple unrelated gene or protein sequences may increase the soundness of the results.
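At its core, the set filter reduces to an intersection over per-set organism lists, as in this sketch (the organism names are invented for illustration):

```java
import java.util.*;

// Sketch of the set-filter idea: given several sequence sets, keep only
// the organisms that appear in every set.
public class SetFilter {

    public static Set<String> commonOrganisms(List<Set<String>> sets) {
        if (sets.isEmpty()) return Collections.emptySet();
        Set<String> common = new LinkedHashSet<>(sets.get(0));
        for (Set<String> s : sets.subList(1, sets.size())) {
            common.retainAll(s);  // keep only organisms present in this set too
        }
        return common;
    }

    public static void main(String[] args) {
        Set<String> alpha = new LinkedHashSet<>(Arrays.asList("Synechocystis", "Arabidopsis", "Eimeria"));
        Set<String> beta  = new LinkedHashSet<>(Arrays.asList("Synechocystis", "Arabidopsis"));
        Set<String> delta = new LinkedHashSet<>(Arrays.asList("Arabidopsis", "Synechocystis", "Plasmodium"));
        System.out.println(commonOrganisms(Arrays.asList(alpha, beta, delta)));
        // -> [Synechocystis, Arabidopsis]
    }
}
```

Only organisms represented in every subunit set survive, giving each subsequent analysis the same taxon sample.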
4.2.6 ClustalW
Multiple alignment of sequences provides information for identifying conserved sequence regions. ClustalW is a tool for global multiple alignment (across the entire length) of DNA and protein sequences. EMBL-EBI provides a SOAP-based web service that allows programmatic access to the tool [72]. Two other services, T-Coffee and Muscle, implement newer algorithms that improve accuracy and achieve higher performance. Based on user preference, we integrated the ClustalW service into MoGServ.
A service is integrated into the system by creating a new Java-based program that uses a web service interface to invoke the remote service. Instead of copying, pasting, or uploading a sequence file, users set up the parameters from a web interface as shown in Figure B.9. These parameters are combined into an XML file that is sent to the new program as input; an example file is shown in Appendix D.6. The input and output are stored in the database, so the information can be queried later and displayed with XSLT.
The input and output information are delivered via XML/XSLT. Depending on the parameter settings, the output from ClustalW includes phylogram tree, cladogram tree, distance, and ph files. The binary results can be viewed using Jalview, a Java-based multiple alignment editor [16].
4.2.7 Blast
The Basic Local Alignment Search Tool (BLAST) algorithm, and its implementation at NCBI [1], is one of the most widely used bioinformatics programs. It is used to compare nucleotide or protein sequences against sequence databases and to calculate the statistical significance of matches. With well-designed queries and alignments, BLAST results can suggest functional and evolutionary relationships between sequences and may provide important clues to the function of uncharacterized sequences. Several alternative implementations (WU-BLAST 2, FSA-BLAST 3, parallel BLAST 4) are available for better performance with minimal loss of sensitivity. EBI and NCBI provide web-based WU-BLAST and/or NCBI-BLAST. However, these could not meet the requirements of this particular application in two respects: 1) a large number of sequences needs to be downloaded and copied and pasted into the interface; 2) sequence alignments can only be compared against databases at EBI or NCBI, so users cannot define their own data sets for comparison.
BLAST requires two inputs: a query sequence (also called the target sequence) and a sequence database. BLAST finds subsequences in the database that are similar to subsequences in the query.
Hosting the service on MoGServ eliminates these two limitations. Users can define the query set and the database sets. The result is stored in the local database, and the job information is accessible any time it is needed. The service has two execution modes, synchronous and asynchronous, which are the same for every data analysis service provided in MoGServ. Like the ClustalW service, the Blast service accepts input in XML format, as shown in Appendix D.7. A tblastn web interface is shown in Figure B.8.
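The two execution modes can be sketched with a thread pool and a job table: a synchronous call blocks until the analysis finishes, while an asynchronous call returns a job id whose status and result can be queried later. All class and method names below are hypothetical, not MoGServ's actual API:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of synchronous vs. asynchronous execution of a data analysis job.
public class AnalysisRunner {

    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Map<String, Future<String>> jobs = new ConcurrentHashMap<>();

    // Stand-in for a real analysis such as a BLAST run against a user-defined set.
    private String analyze(String input) {
        return "result-for-" + input;
    }

    public String runSynchronous(String input) {
        return analyze(input);            // caller blocks until the result is ready
    }

    public String submitAsynchronous(String input) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, pool.submit(() -> analyze(input)));
        return jobId;                     // caller gets a job id immediately
    }

    public String status(String jobId) {
        Future<String> f = jobs.get(jobId);
        if (f == null) return "UNKNOWN";
        return f.isDone() ? "DONE" : "RUNNING";
    }

    public String result(String jobId) {
        try {
            return jobs.get(jobId).get(); // blocks until the job completes
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) {
        AnalysisRunner runner = new AnalysisRunner();
        System.out.println(runner.runSynchronous("setA"));   // result-for-setA
        String jobId = runner.submitAsynchronous("setB");
        System.out.println(runner.result(jobId));            // result-for-setB
        runner.shutdown();
    }
}
```

Persisting the job table in the database, as MoGServ does, is what makes the job information "accessible any time it is needed."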
2. http://blast.wustl.edu/
3. http://www.fsa-blast.org/
4. http://www-users.cs.umn.edu/ rangwala/final bglBLAST.pdf
4.2.8 Phylip and Paup
PAUP* 5 is a program for phylogenetic analysis using parsimony, maximum
likelihood, and distance methods. The program features an extensive selection of
analysis options and model choices, and accommodates DNA, RNA, protein and
general data types. Among the many strengths of the program is the rich array
of options for dealing with phylogenetic trees including importing, combining,
comparing, constraining, rooting and testing hypotheses.
PAUP* uses the NEXUS file format, which is a modular format used by sev-
eral programs. All versions require data and commands to be present in the
NEXUS format (with the exception that commands can additionally be executed
interactively from the command prompt).
PHYLIP 6 is a set of modular programs for performing numerous types of
phylogenetic analysis. Individual programs are broadly grouped into several cat-
egories: molecular sequence methods; distance matrix methods; analyses of gene
frequencies and continuous characters; discrete characters methods; and tree draw-
ing, consensus, tree editing, and tree distances. Together the programs accommodate a broad range of data types including DNA, RNA, protein, restriction sites,
and general data types. The programs encompass a broad variety of analysis types
including parsimony, compatibility, distance, invariants and maximum likelihood,
and also include both jackknife and bootstrap re-sampling methods. Therefore, for a typical analysis, the user makes choices regarding each aspect of the analysis and chooses specific programs accordingly. Programs are run interactively via a
text-based interface that provides a list of choices and prompts users for input.
5. http://paup.csit.fsu.edu/
6. http://evolution.genetics.washington.edu/phylip.html
Phylogenetic trees generated from these phylogenetic analysis tools can be viewed using TreeView 7 [68]. TreeView is a simple program that displays a phylogenetic tree of up to a certain number of taxa. Phylogenies may be displayed
either as slanted or rectangular cladograms. TreeView provides a way to view the
contents of a NEXUS, PHYLIP, ClustalW or ClustalX, or other format tree files.
4.2.9 Data conversion
In a typical workflow, one program's output may be used as the next program's input. A data conversion step is therefore needed to make the output suitable as input for the next program. MoGServ provides a number of services to convert fasta format to ClustalW format, fasta format to NEXUS format, and so on.
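A simplified fasta-to-NEXUS conversion, assuming aligned DNA sequences of equal length, might look like the following sketch; real NEXUS output (such as that produced by readseq) is richer, and the class name is illustrative:

```java
import java.util.*;

// Simplified sketch of converting aligned fasta records into a NEXUS DATA block.
public class FastaToNexus {

    public static String convert(String fasta) {
        // Parse fasta: lines starting with '>' begin a new named sequence.
        Map<String, StringBuilder> seqs = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : fasta.split("\n")) {
            if (line.startsWith(">")) {
                current = new StringBuilder();
                seqs.put(line.substring(1).trim().split("\\s+")[0], current);
            } else if (current != null) {
                current.append(line.trim());
            }
        }
        int ntax = seqs.size();
        int nchar = seqs.values().iterator().next().length();
        StringBuilder out = new StringBuilder("#NEXUS\nBEGIN DATA;\n");
        out.append("  DIMENSIONS NTAX=").append(ntax)
           .append(" NCHAR=").append(nchar).append(";\n");
        out.append("  FORMAT DATATYPE=DNA GAP=-;\n  MATRIX\n");
        for (Map.Entry<String, StringBuilder> e : seqs.entrySet()) {
            out.append("    ").append(e.getKey()).append(" ").append(e.getValue()).append("\n");
        }
        out.append("  ;\nEND;\n");
        return out.toString();
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nACGT-ACGT\n>seq2\nACGTTACGT\n";
        System.out.print(convert(fasta));
    }
}
```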
The program readseq 8, developed by D. Gilbert, is used to reformat DNA or protein sequence data. It accepts single or multiple sequences in 18 different formats and converts them to a specified format. MoGServ integrates the readseq program as a service to convert the output from ClustalW to NEXUS format.
4.3 Results of case studies
The evolution of ATP synthase is considered severely constrained; the structure of ATP synthase is shown in Figure A.1. This makes it a candidate for ascertaining deep phylogeny. Our approach to testing the hypothesis is first to identify each individual subunit genealogy, and then to merge the data and reanalyze.
7. http://www.molecularevolution.org/software/treeview/
8. http://iubio.bio.indiana.edu/soft/molbio/readseq/java/
4.3.1 Case study: the rediscovery of Erythrobacter litoralis
The MoGServ local database includes whole genome sequences from chloroplasts, cyanobacteria, plastids, and apicoplasts. The biological investigator hypothesized that the amino acid sequences of the chloroplast subunits of ATP synthase would be a good choice for a deep phylogenetic analysis, a departure from established procedures. DNA sequences from ribosomal genes, the protein-synthesizing machinery, are the traditional choice for deep phylogenetic analysis. Preliminary analyses on 33 taxa revealed that the α and β subunits of this enzyme have a stunningly high degree of amino acid sequence conservation across cyanobacterial genomes and chloroplast genomes from a wide array of algal taxa and green plants. As the nuclear genomes of the algal taxa are as phylogenetically distinct from one another as humans are from fungi, this result indicated that chloroplast ATP synthase was a suitable candidate enzyme and provided support for the single ancestor hypothesis.
The problem now was one of excessive conservation in the α and β subunits.
Comparison of the most conserved region of the α subunit against all sequences
at NCBI revealed that this region is so conserved that it matches that of an ATP
synthase subunit in the mitochondrial genome. Phylogenetic evidence clearly
indicates that mitochondria descend from a single bacterial ancestor and that this
ancestor was related to the alpha proteobacteria, a group closely related to the
cyanobacteria. MoGServ enabled the investigator to add to the already convincing evidence that the mitochondrial and chloroplast genomes are related. This was not the hypothesis of interest, but it led the investigator to try a different approach.
The investigator then examined the amino acid sequence of the ε subunit of
ATP synthase for the same 33 taxa examined previously. Sequence conservation
was evident but somewhat less than that seen in the α and β subunits. The local
database query was relaxed to permit inclusion of the ATP synthase ε subunits of
both cyanobacteria and alpha proteobacteria. More than a dozen proteobacteria
were identified, all of which except one are nonphotosynthetic.
The surprise bacterium was Erythrobacter litoralis. This organism is a facultative photoheterotroph, able to photosynthesize in the light and catabolize organic sources in the dark. It was found in the Sargasso Sea in 1994 and sequenced in 2005. This discovery suggests that the Mother of Green may not be a cyanobacterium but an α proteobacterium.
4.4 Summary
In this chapter, we detail the data and services integrated in the MoGServ system to support deep phylogenetic investigations. We describe one case study of a phylogenetic investigation 9. This case study shows that an investigator is able to gather data, perform advanced data analysis, and discover new knowledge using the web-based environment and services provided in the MoGServ system.
9. The case study on the use of MoGServ for a phylogenetic investigation was conducted in collaboration with Professor Jeanne Romero-Severson [78], Department of Biological Sciences, University of Notre Dame, and partially supported by the Indiana Center for Insect Genomics (ICIG) with funding from the Indiana 21st Century fund.
Figure 4.2. Entity relationship diagram of the data model in MoGServ, created by SQL::Translator
CHAPTER 5
ONTOLOGICAL REPRESENTATION MODEL
MoG (Mother of Green), a project involving deep phylogeny of plastids, in-
cludes the development of a system (MoGServ) to enable life scientists to easily
aggregate heterogeneous data and conduct data analysis using the growing array
of web-based scientific databases and analysis tools. MoGServ, a SOA-based data integration environment, is built using current web service technology and existing middleware for life sciences research. Based on the successful design and implementation of this prototype, in this chapter we present an enhanced system with semantic annotation of services and data. The enhancement aims at allowing life science researchers to define their experiments at different levels of abstraction based on their knowledge of the tools, data, and the system. The semantically enriched data allows easier reuse, sharing, and searching of experiments.
While service-oriented architecture is used in the implementation of e-Science infrastructure, semantic web technology is gaining increasing interest for annotating life science and medical information [12]. For example, the UniProt RDF1 project provides all UniProt protein sequence and annotation data in RDF. These efforts make the vision of the semantic web [7] more
1 http://dev.isb-sib.ch/projects/uniprot-rdf/
practical. Other open source projects, such as Haystack2 and SIMILE3, aim at delivering these semantically annotated data to web browsers. The appearance of open source tools that support the semantic web and service-oriented computing encourages the life science community to share their data, analysis tools, and scientific experiments using these technologies.
5.1 The MoG life sciences project and biomedical application
As part of the Mother-of-Green (MoG) project,4 we are developing scientific workflow tools (MoGServ) that enable end-user-composed semantic web services to increase the interoperability of the growing array of web-based life science databases and analysis tools. These workflow tools are built from available and emerging open-source, open-standards technology.
The prototype problem domain that guides this project, the phylogenomics of
the plastid, includes genomic, transcriptomic, and proteomic data. Plastids are
hypothesised to be descendants of cyanobacterial ancestors captured by eukaryote
hosts. As more cyanobacterial and plastid genomes are sequenced, information
accumulates that could shed light on plastid genomics and phylogeny. One of the
major plagues of humankind, malaria, is caused by a parasite containing a plastid:
Plasmodium falciparum. A new pharmaceutical drug that disrupts the function of
this plastid (the apicoplast) might be harmless to humans, who, like all animals,
have no plastids.
Examination of the genes, the linear order of the genes, the proteins, and the
temporal order of protein expression of related organisms can suggest possible
2 http://haystack.lcs.mit.edu/
3 http://simile.mit.edu/
4 http://www.nd.edu/~mog/
apicoplast functions. The problem is the accurate identification of relatives or
even closely related plastid genes of known function. At present, the phylogeny of
the apicoplast is not clear. A phylogenomics approach requires the extraction and
analysis of genomic information from diverse scientific disciplines: plant, algal and
cyanobacterial systematics, plant biochemistry, animal parasitology, genetics and
cell biology. This phylogenomics investigation provides software design use-cases,
testing, and an opportunity for the evaluation of scientific workflow composition
tools and technology.
5.2 Ontological representation model
Metadata about services, sequences, and users' experimental results are captured in MoGServ to support inquiries from application developers searching for appropriate services and from end-users keeping track of their in-silico experiments. The inquiry system in the prototype is initially based on a keyword search method for ease of implementation. With the prospect of hosting MoGServ at multiple sites in the phylogenetic research community, applying the semantic web approach to representing the metadata allows for much more focused and structured queries and the possibility of answering questions based on logical inference rather than text associations. An ontology that describes the concepts relevant to a given domain, along with properties characterizing these concepts, can meet these requirements. By relying on shared ontologies and agreements on the definition of common concepts, data and information can be annotated using the shared vocabularies in these ontologies.
Since most semantic web services standards are not yet mature and stable, we build an application-specific ontology using a distributed and modularized
ontology structure and reuse some cross-domain ontologies, such as Dublin Core5, and other well-defined bioinformatics ontologies. The use of well-defined ontologies could potentially increase interoperability when information is published on the web.
There are three ontology sets that are clearly differentiated in the system: the MoG application domain ontology, which represents concepts and information unique to the MoGServ system, such as jobs, sequence collections, etc.; a generic service description ontology, such as OWL-S, which specifies generic web service concepts such as service inputs, outputs, preconditions, and effects; and the service domain ontology, which is designed and used for the semantic description of web services in the bioinformatics domain.
5.2.1 RDF, OWL, and DIG reasoner
The Resource Description Framework (RDF)6 has been proposed as a W3C standard to enable distributed knowledge representation on the Semantic Web. It is a graph model of statements that encode the metadata description of web resources, people, places, and other concepts. RDF is based on the idea of identifying things using Uniform Resource Identifiers (URIs) and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, their properties, and values. An RDF graph is a set of triples. Each triple consists of a subject (start node), a predicate (edge), and an object (end node). A fact is expressed as a Subject-Predicate-Object triple, also known as a statement. A triple can be written as P(S, O), that is, a subject
5 http://dublincore.org
6 http://www.w3.org/TR/rdf-primer/
S has P (predicate or property) with value O. RDF/XML and Notation 3 (N3) are two formats for representing RDF models. Figure 5.1 is an RDF graph model that represents some information describing the MoG project web site. Facts are expressed as subject-predicate-object triples:
<http://www.nd.edu/~mog> <#hasCreator> <#gmadey>
<#gmadey> <#hasFullName> "Gregory Madey"
Note: # stands for some URI prefix.
The RDF/XML representation:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://someexample.org#">
  <rdf:Description rdf:about="http://www.nd.edu/~mog">
    <ex:hasCreator rdf:resource="ex:gmadey" />
  </rdf:Description>
  <rdf:Description rdf:about="ex:gmadey">
    <ex:hasFullName>Gregory Madey</ex:hasFullName>
  </rdf:Description>
</rdf:RDF>
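The triple model just shown can be sketched in plain Python: a graph is a set of (subject, predicate, object) tuples, and a fact P(S, O) is one tuple. This is a minimal illustration only; the URIs and helper names are ours, not part of any RDF library.

```python
# A minimal sketch of the RDF triple model: a graph is a set of
# (subject, predicate, object) tuples. URIs and names are illustrative.
graph = {
    ("http://www.nd.edu/~mog", "#hasCreator", "#gmadey"),
    ("#gmadey", "#hasFullName", "Gregory Madey"),
}

def objects_of(graph, subject, predicate):
    """Return all values O such that the fact P(S, O) is in the graph."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

print(objects_of(graph, "#gmadey", "#hasFullName"))  # {'Gregory Madey'}
```

A real system would use an RDF framework such as Jena or Sesame rather than raw tuples, but the underlying data model is exactly this set of triples.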
Figure 5.1. An RDF graph model representing some information describing the MoG project web site
RDF Schema is a mechanism that allows developers to define a particular vocabulary for specifying the kinds of objects to which predicates can be applied. Pre-defined terminologies such as Class, subClassOf, and Property establish an agreement on the semantics of specified terms and the interpretation of given statements. The Web Ontology Language (OWL)7 is one type of ontology language available for describing semantic web information, and it is more complex and powerful than RDF Schema. It is built on top of the RDF graph model with better capabilities for describing the relationships among resources and their properties8. The OWL language is divided into three sublanguages: OWL Lite, OWL DL, and OWL Full. Classes (concepts), properties (roles, relationships), and individuals (instances) are the three components of the OWL language.
Consider the interpretation of domain knowledge using an interpretation function I over a domain of individuals D^I. Each concept C is interpreted as a subset of the domain, C^I ⊆ D^I. A concept may contain a number of individuals, and one individual may belong to different concepts; each individual x is interpreted as an element x^I ∈ D^I. A relationship (role) R between individuals is interpreted as a binary relation R^I ⊆ D^I × D^I. Web data can then be related by using the definitions of these concepts.
Jena9 is a semantic web framework for the creation of RDF and OWL models, as well as a common interface for parsing and reasoning. Protege10 is a free, open source ontology editor and knowledge-base framework that supports two main approaches to modeling ontologies via the Protege-Frames and Protege-OWL editors. An OWL DL ontology can be translated into a description logic
7 http://www.w3.org/TR/owl-ref/
8 http://jena.sourceforge.net/ontology/index.html
9 http://jena.sourceforge.net
10 http://protege.stanford.edu/
representation, which is a decidable fragment of First Order Logic (FOL)11. A Description Logic Reasoner can perform automated reasoning over an ontology, such as computing the inferred superclasses of a class, determining whether or not a class is consistent, and deciding whether or not one class is subsumed by another (subsumption reasoning). Pellet12, FaCT/FaCT++13, Racer/RacerPro14, and KAON215 are four popular DL reasoners among many. The DIG16 interface specifies a common interface for DL reasoners. A DIG-compliant reasoner is a DL reasoner that provides a standard access interface (the DIG interface), which enables the reasoner to be accessed over HTTP using the DIG language. Jena and Protege-OWL provide APIs that can be used to interact with any external DIG-compliant reasoner without requiring developers to have detailed knowledge of the reasoner.
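As a concrete illustration of one reasoning task named above, the following sketch computes the inferred superclasses of a class by taking the transitive closure of asserted subClassOf axioms. The class names are illustrative, and a real DL reasoner handles far more than this:

```python
# Sketch of subsumption reasoning: infer all superclasses of a class
# from asserted subClassOf axioms (transitive closure). Names are
# illustrative, not from a real ontology.
sub_class_of = {
    "ProteinSequence": {"Sequence"},
    "Sequence": {"BiologicalData"},
    "BiologicalData": set(),
}

def superclasses(cls, axioms):
    """All classes that subsume `cls`, directly or indirectly."""
    result, frontier = set(), {cls}
    while frontier:
        nxt = set()
        for c in frontier:
            for parent in axioms.get(c, ()):
                if parent not in result:
                    result.add(parent)
                    nxt.add(parent)
        frontier = nxt
    return result

print(sorted(superclasses("ProteinSequence", sub_class_of)))
# ['BiologicalData', 'Sequence']
```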
5.2.2 Generic service description ontology
OWL-S17 is an OWL-based ontology for the semantic representation of services. It is a complex and rich model that includes the representation of both atomic and composite services, as well as complicated control flow and data flow. Most of the current open-source APIs, editors, and annotation tools at this stage only partially support the OWL-S service model, with a primary focus on the
11 Logics are decidable if computations or algorithms based on the logic will terminate in a finite time.
12 http://pellet.owldl.com/
13 http://owl.man.ac.uk/factplusplus/
14 http://www.racer-systems.com/
15 http://kaon2.semanticweb.org/
16 http://dig.sourceforge.net/
17 http://www.w3.org/Submission/OWL-S/
OWL-S service profile and service grounding. Annotating a service with the OWL-S model is a non-trivial task even with support from annotation tools, such as the SRI OWL-S editor18.
The Feta [50] data model is used for the semantic description of services in the myGrid project. Web services can be annotated using terms in an OWL-based myGrid domain ontology [103] with a GUI-based interface, Pedro [33]. This approach is more lightweight than the OWL-S approach. Although OWL-S provides more support for automation, especially since its definition of preconditions and effects allows the possible application of AI planning technologies, it is difficult to utilize its full functionality. The Feta data model has limited expressivity, but it is sufficient for describing most services, and its simplicity makes it more practical for describing a large number of services.
We believe it is more practical to use the Feta model for service and workflow description at this stage. Since the semantic representation model in the system is modularized, it will be easy to convert to an OWL-S representation when the tools and APIs that support OWL-S become more stable and mature.
5.2.3 Service domain ontology
The service domain ontology should be generic enough to provide the concepts
needed by any web service in a certain domain, and rich enough to represent
the available knowledge for performing complex reasoning. The service domain
ontology plays an important role for the automation of service discovery. However,
building such a high-quality domain ontology is a challenging task. Sabou et al. [80] present an automatic method that learns a domain ontology for the purpose of
18 http://owlseditor.semwebcentral.org/
web service description from the natural language documentation of web services. It provides a guideline and tool for domain experts to inspect a large number of web services in a certain domain in order to build a high-quality generic ontology.
BioMOBY's object ontology, MOBY-S19, contains concepts related to the data formats and data types usually used in bioinformatics. The ontology places no restrictions on complex relationship definitions. It serves as a common vocabulary collection that can be used to define services that accept a particular type of data, in a certain format, as their input/output. The myGrid ontology20 describes the bioinformatics research domain and the dimensions with which a service can be characterised from the perspective of the scientist. The
scope of the ontology is limited to supporting service discovery. Descriptions
of services are constructed to present their properties such as “what the service
does”, “what data sources it accesses”, and “what domain specific methods the
analysis involves”. Each hierarchy contains abstract concepts to describe the
bioinformatics domain at a high level of abstraction. By describing the domain
of interest in this way, users should be able to find appropriate services for their
experiments from a high level view of the biological processes they wish to perform
on their data.
5.2.4 MoG application domain ontology
The MoG application domain ontology augments the two ontology sets described above, representing concepts that only exist in the MoGServ system, including jobs, collections of sequences, etc. The ontology definition provides vocabulary to annotate services that use data types and data formats not available
19 http://biomoby.org/RESOURCES/MOBY-S/Objects
20 http://www.mygrid.org.uk/ontology
elsewhere. It also allows the annotation of experimental data permitting users to
keep track of their data. The MoG application domain ontology also represents
the interactions between end-users and the system.
Sequence, SequenceSet, and Job are the three main concepts in a MoG application. The MoGServ system contains a local database that stores integrated sequences of scientific interest from multiple public databases, along with private data from the life scientists' own laboratory experiments. One activity a scientist may often need to do is query the local MoGServ database to get a collection of sequences supporting a particular research investigation, and use this collection for subsequent data analysis. We also define other concepts, User, Input, Output, and Privacy, to annotate the access permissions for data sets. For example, if a piece of data comes from a scientist's lab experiment and is not intended to be published at some point, it should be restricted to authorized persons only.
The ontology is defined with OWL using Protege. Each concept consists of
two main types of properties: object properties and datatype properties. An
object property represents the relationship between two individuals in the domain.
A datatype property links an individual to an XML Schema data type or an RDF
literal. Figure 5.2 demonstrates the main concepts and relationships defined in
the MoG application domain ontology.
The Sequence class has multiple properties: 1) hasSequenceID, a unique identifier of the sequence, which may be in the Life Science Identifier (LSID) format; 2) hasSequenceName, a string of XML data type; and 3) hasTaxonomy, a datatype property with a string of XML-typed data. Each individual of the Sequence class may either be retrieved from a public database or uploaded by
Figure 5.2. Main concepts and partial relationships defined in the MoG application domain ontology
scientists from their own laboratory experiments.
The SequenceSet class has the property isChildOf, a functional property, which means there can be at most one individual related to a given individual via the property. A sequence set can only be a child of one sequence set, no matter how the sequence set gets created. A sequence set can have multiple child sequence sets. A sequence set can be a sibling of another sequence set only when both sequence sets are generated from the setFilter service. The property isSiblingOf is a symmetric property. The existential restriction ∃hasSequence.Sequence indicates a necessary condition for an individual to belong to the class SequenceSet.
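The two property characteristics used above can be sketched as checks over explicit assertions. The data and function names below are illustrative; in OWL these characteristics are declared, and a reasoner flags violations:

```python
# Sketch of the two OWL property characteristics discussed above,
# checked over explicit assertions. Set names are illustrative.
is_child_of = [("setB", "setA"), ("setC", "setA")]   # each set: one parent
is_sibling_of = [("setB", "setC"), ("setC", "setB")]

def is_functional(assertions):
    """Functional: each subject relates to at most one object."""
    seen = {}
    for s, o in assertions:
        if s in seen and seen[s] != o:
            return False
        seen[s] = o
    return True

def is_symmetric(assertions):
    """Symmetric: (x, y) asserted implies (y, x) asserted."""
    pairs = set(assertions)
    return all((y, x) in pairs for (x, y) in pairs)

assert is_functional(is_child_of)    # a set cannot have two parents
assert is_symmetric(is_sibling_of)
print("property characteristics hold")
```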
The Job class has execution time properties such as submittedAt, startedAt, and finishedAt. This information provides data for measuring Quality of Service (QoS). It also provides information for end-users to monitor their job execution.
5.3 Implementation
Given a well-defined domain ontology, the associated services, workflows, and the data products generated from them can be annotated using a common vocabulary. The metadata with semantic annotation is stored in an RDF repository. From a number of RDF storage packages, we chose Sesame 1.2.621 as the repository. Sesame is an open source Java framework for storing, querying, and reasoning with RDF and RDF Schema. Using RDF as the main storage and exchange method makes knowledge in the field portable to other applications and readable by machines as well as by humans.
The annotation in RDF/XML format of one service provided in the MoGServ system is shown below and displayed in Figure E.5. It is a service that accepts a sequence set id and sequence type as input parameters, executes a ClustalW sequence analysis, and returns the result.
<rdf:RDF xmlns:mygrid="http://www.mygrid.org.uk/ontology#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:mog="http://almond.cse.nd.edu:10000/mog#">
  <mygrid:service>
    <mygrid:hasOperation>
      <mygrid:operation>
        <mygrid:isFunctionOf>
          <mygrid:operationApplication>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#aligning"/>
          </mygrid:operationApplication>
        </mygrid:isFunctionOf>
        <mygrid:outputParameter>
          <mygrid:sequence_alignment_report>
            <mygrid:mygInstance rdf:resource="http://www.mygrid.org.uk/ontology#sequence_alignment_report"/>
            <mygrid:hasParameterDescriptionText>ClustalW alignment file</mygrid:hasParameterDescriptionText>
            <mygrid:hasParameterNameText>filename</mygrid:hasParameterNameText>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#parameter"/>
          </mygrid:sequence_alignment_report>
        </mygrid:outputParameter>
        <mygrid:usesResource>
          <mygrid:operationResource>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#sequence_database"/>
          </mygrid:operationResource>
        </mygrid:usesResource>
        <mygrid:performsTask>
          <mygrid:aligning>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#operationTask"/>
          </mygrid:aligning>
        </mygrid:performsTask>
        <mygrid:hasOperationNameText>runClustalWdf</mygrid:hasOperationNameText>
        <mygrid:inputParameter>
          <mog:set>
            <mygrid:mygInstance rdf:resource="http://almond.cse.nd.edu:10000/mog#set"/>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#parameter"/>
            <mygrid:hasParameterNameText>setid</mygrid:hasParameterNameText>
          </mog:set>
        </mygrid:inputParameter>
        <mygrid:inputParameter>
          <mygrid:parameter>
            <mygrid:mygInstance rdf:resource="http://www.mygrid.org.uk/ontology#biological_sequence"/>
            <rdf:type rdf:resource="http://www.mygrid.org.uk/ontology#biological_sequence"/>
            <mygrid:hasParameterNameText>sequenceType</mygrid:hasParameterNameText>
          </mygrid:parameter>
        </mygrid:inputParameter>
      </mygrid:operation>
    </mygrid:hasOperation>
    <mygrid:hasServiceNameText>mog:service:ClustalW</mygrid:hasServiceNameText>
    <mygrid:locationURI rdf:resource="http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl"/>
    <mygrid:hasServiceType>WSDL</mygrid:hasServiceType>
    <mygrid:publishedBy>
      <mygrid:organisation>
        <mygrid:hasOrganisationNameText>MoG</mygrid:hasOrganisationNameText>
        <mygrid:hasOrganisationDescriptionText>MoG</mygrid:hasOrganisationDescriptionText>
      </mygrid:organisation>
    </mygrid:publishedBy>
    <mygrid:hasServiceDescriptionText>This is a service that accepts setid and sequenceType as parameters and returns the name of the alignment report stored in the local database</mygrid:hasServiceDescriptionText>
    <mygrid:hasServiceDescriptionLocation>http://almond.cse.nd.edu:10000/axis/services/ClustalW?wsdl</mygrid:hasServiceDescriptionLocation>
  </mygrid:service>
</rdf:RDF>

21 http://www.openrdf.org/
All the data sets stored in the local database are generated by a service or a workflow. The annotation of experimental data is done through services provided in the MoGServ system. These services are invoked automatically when an individual is created. Each sequence, set of sequences, and job is identified with an LSID. The Life Science Identifier (LSID)22 is a special kind of Uniform Resource Name (URN) for biological entities. The LSID concept defines an approach for naming and identifying data resources stored in multiple, distributed data stores. Since adoption of the LSID in the life sciences is increasing, using it as an identifier for experimental data makes it easier for our system to publish those data.
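An LSID is a URN of the form urn:lsid:&lt;authority&gt;:&lt;namespace&gt;:&lt;object&gt;[:&lt;revision&gt;]. A small parser sketch follows; the example identifier is hypothetical, not an actual MoGServ LSID:

```python
# Sketch of parsing the LSID URN scheme:
#   urn:lsid:<authority>:<namespace>:<object>[:<revision>]
# The example identifier below is hypothetical.
def parse_lsid(lsid):
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not a valid LSID: " + lsid)
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) > 5 else None,
    }

print(parse_lsid("urn:lsid:mogserv.nd.edu:sequence:12345"))
# {'authority': 'mogserv.nd.edu', 'namespace': 'sequence',
#  'object': '12345', 'revision': None}
```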
We implemented a number of software components to annotate and query metadata, including job information and service/workflow descriptions (see Figure 5.3). The query components embed queries in the SeRQL (Sesame RDF Query Language) format supported by Sesame.
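Conceptually, a query component resolves triple patterns against the annotation store. The sketch below is a plain Python stand-in rather than actual SeRQL evaluated by Sesame, and the store contents are illustrative:

```python
# Stand-in for a query component: match a triple pattern (None = wildcard)
# against the annotation store. Real queries are SeRQL run on Sesame;
# the triples here are illustrative.
store = {
    ("mog:service:ClustalW", "mygrid:performsTask", "mygrid:aligning"),
    ("mog:service:ClustalW", "mygrid:hasServiceType", "WSDL"),
    ("mog:service:queryGene", "mygrid:performsTask", "mygrid:retrieving"),
}

def match(store, s=None, p=None, o=None):
    """Return every triple matching the given pattern."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which services perform an aligning task?
hits = match(store, p="mygrid:performsTask", o="mygrid:aligning")
print([t[0] for t in hits])  # ['mog:service:ClustalW']
```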
5.4 Conclusion
In this chapter, we present an ontological model that is used to semantically an-
notate data and services in the MoGServ system. This ontological model contains
three ontology sets: MoG application domain ontology, generic service description
22 http://lsid.sourceforge.net/
Figure 5.3. The software components implementing annotation and querying of metadata
ontology, and service domain ontology. Using a distributed and modularized ontology structure and reusing well-defined ontologies could potentially increase interoperability when the data generated from MoGServ are shared with other researchers. At this stage, the developed MoG application domain ontology simply serves as a common vocabulary definition to capture the relationships among data sets, sequences, jobs, and other properties related to these three concepts. The Feta data model is used to annotate services in MoGServ. Compared to the table- and index-based metadata search method, the semantically annotated experimental data provides a better, more flexible approach for users to search and share their experiments. However, annotating the metadata accurately and efficiently becomes the major difficulty in applying the ontological model.
CHAPTER 6
IMPROVING THE REUSE OF THE SCIENTIFIC WORKFLOW
Most current practical methodologies and workflow systems for service composition and workflow creation in e-Science pursue a semi-automatic approach that allows users to discover and select appropriate services to include in a workflow based on semantic and conceptual service definitions. This effort reduces the burden on users of having detailed knowledge and understanding of each tool, service, and data type. However, few of these approaches consider the potential for reuse: sharing the knowledge gained during the service composition process and reusing complete or partial existing workflows. We believe that providing a capability for reuse of this knowledge and workflows could be an important component in a workflow system. In this chapter [109], we present a methodology and an enhanced system design to facilitate the reuse of knowledge and workflows. It contains 1) a hierarchical workflow structure representation, 2) knowledge management and knowledge discovery components to capture and manage the reusable knowledge in a system, and 3) an approach for using a graph matching algorithm to discover similar workflows.
6.1 Introduction
As more data, analysis tools, and other resources are delivered as services on
the web, the major benefit of adopting service-oriented architecture in e-Science
is that of allowing scientists to describe and enact their experimental processes by
orchestrating distributed and local services into a workflow. Service orchestration,
also called service composition, is a difficult and complex task. It often involves
choosing a set of appropriate services based on the functional and non-functional
properties of services, ordering them in sequence, resolving connectivity between
the services, and converting the complex process into a target workflow language
that can be deployed and invoked on a platform.
Over the past several years, much research has been done on approaches for
service discovery and composition in order to achieve the goal of seamless web
service composition [58]. These approaches range from the adoption of industry standards to the adoption of semantic web technology, and from manual or static composition to automatic dynamic composition [90]. A significant portion of the
work aims at automating discovery and composition by combining ontological
annotation of services and AI planning technology. In the literature, these approaches are largely demonstrated on virtual travel agencies or other small, well-defined domains. Applying them to larger, more complex, and
less-defined applications can be difficult, especially before a complete strong on-
tological agreement is established in the application domain or across multiple
domains.
Most current practical methodologies for service composition or workflow cre-
ation employ a semi-automatic design that allows users to discover and select
appropriate services to include in a workflow based on semantic and conceptual
service definitions. This partially lifts the load on the users of requiring detailed
knowledge and understanding of each tool, service, and data type. In the mean-
time, it increases the complexity of building such middleware to support workflow
creation at a higher level of abstraction. Mediator, shim, and adaptor technologies [74] are applied to resolve connectivity between services. Several workflow management systems and service-oriented middleware, such as Pegasus [34], myGrid/Taverna [65], Kepler [52], and Triana [96], have been developed with the intent to streamline workflow design, execution, monitoring, and re-running.
Most of these systems and approaches provide users an environment to compose services from scratch, focusing on choosing appropriate services more accurately by considering semantic matching and quality of service (QoS). Fewer of them consider the potential of reusing and sharing the knowledge gained during the service composition process and reusing complete or partial existing workflows. We believe that providing a capability to reuse this knowledge and these workflows
is an important component in such a system. This reusability will lead to a more efficient and more structured composition process that accelerates rapid application development. It will provide more valuable guidelines to assist users with their workflow creation, using knowledge that has been gained and verified by others. Reuse of verified knowledge will potentially increase the correctness of composed workflows and reduce the errors that may be caused by misannotation, inaccurate annotation, and incomplete annotation of services. The requirement of complete information about the world poses challenges for applying traditional AI planning technologies to the service composition process, since it is neither feasible nor possible to collect all the information needed to form a complete initial state of the world [46, 58]. The knowledge gradually gathered in the system during the service composition process may help accumulate more complete information for an AI planner.
In this chapter, we present a methodology and an enhanced system design
to facilitate the reuse of knowledge and workflows. It contains a hierarchical
workflow structure, knowledge management and knowledge discovery components
to capture and manage the reusable knowledge in a workflow system, and an
approach for using a graph matching algorithm to discover similar workflows.
The methodology proposed is being used in the design and implementation of a
service-oriented based system for supporting bioinformatics research.
6.2 A hierarchical workflow structure
We define a hierarchical workflow structure that contains four levels of repre-
sentation (see Figure 6.1): abstract workflow, concrete workflow, optimal work-
flow, and workflow instance.
Figure 6.1. A four-level hierarchical workflow structure representation and transformation of scientific processes
Abstract workflow is a definition of a scientific process with emphasis on the analytical operations or functions to be performed rather than the mechanisms for performing these operations.
Concrete workflow is a definition of a number of tasks represented as actual executable services. A concrete workflow can be converted to a specific workflow language and sent to a workflow engine to be executed.
Optimal workflow is a concrete workflow where individual executable services are replaced by alternatives with the highest quality.
Workflow instance is an actual run of a concrete workflow or optimal workflow
with input data and generated output data.
Users can use a GUI-based interface to define an abstract workflow by dragging and dropping high-level abstract components provided in the system. An alternative is to define an abstract workflow using standardized syntax, vocabularies, and semantics developed in their scientific communities. Users logically create each task in terms of the functions they wish the task to accomplish.
The translation of an abstract workflow into a concrete workflow is a process
of discovering suitable services that implement these functions and solving the
connectivity between services.
The optimization of a concrete workflow into an optimal workflow is a process
of ranking services based on a set of metrics and selecting an optimal service to
replace each service in the workflow.
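These two transformations can be sketched as follows, under an illustrative service registry and made-up QoS scores (the service names echo those used elsewhere in the dissertation, but the registry itself is hypothetical):

```python
# Sketch of the two transformations described above:
#   abstract -> concrete: map each task to a discovered service
#   concrete -> optimal:  replace each service by its best-ranked peer
# The registry and QoS scores are illustrative, not real measurements.
registry = {                      # task -> candidate services
    "retrieving": ["queryGene"],
    "aligning": ["clustalW", "clustalW-mirror"],
}
qos = {"queryGene": 0.9, "clustalW": 0.7, "clustalW-mirror": 0.95}

def to_concrete(abstract_workflow):
    """Pick the first discovered service for each task."""
    return [registry[task][0] for task in abstract_workflow]

def to_optimal(concrete_workflow):
    """Replace each service by the highest-scoring equivalent."""
    best = {}
    for task, services in registry.items():
        top = max(services, key=qos.get)
        for s in services:
            best[s] = top
    return [best[s] for s in concrete_workflow]

concrete = to_concrete(["retrieving", "aligning"])
print(concrete)                  # ['queryGene', 'clustalW']
print(to_optimal(concrete))      # ['queryGene', 'clustalW-mirror']
```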
A concrete workflow can be invoked repeatedly with different input parame-
ters. Since a scientific process is a process for discovering new knowledge, keeping
track of the source of a workflow result can be as important as the result itself.
Data provenance is metadata recording the process of experiment workflows, annotations, and notes about experiments. It provides significant added value in such data-intensive e-Science [83]. Many data provenance systems in e-Science
have focused on recording the data from which a data product evolved and the
process of transformation of these data, i.e., input data, output data, and process.
This may include information on running time and failure rates of each running
instance of a workflow; these can provide measurements for profiling the quality
of services and workflow. This information can be used to assist the workflow
optimization process.
This hierarchical workflow structure definition provides several benefits:
Allows users to define workflow at different abstract levels. Less experienced
users may define a workflow in terms of the functions they wish a task to perform.
Intermediate users may define a workflow with more detailed properties of each task,
such as the algorithm and data source they want to use. Expert users may define a
workflow in an ad-hoc manner by choosing appropriate executable services and forming
a workflow with appropriate logic. An example is shown in Figure 6.2. Suppose users
would like to conduct an experiment to determine whether gene genealogies for ATP
subunits α, β, and γ are different. A less experienced user may define a workflow
with two tasks, retrieving and aligning.
An intermediate user may have knowledge of two particular services (queryGene,
clustalW) that should be used in the workflow in order to perform each task. An
expert bioinformatician may know that in order to get more accurate results, it
is necessary to encapsulate a service (setFilter) to compute the intersection of
all the organisms in the sequence sets.
Figure 6.2. An example illustrating the user-oriented workflow definition with
different levels of knowledge. Three user-defined workflows answer the question
"are gene genealogies for ATP subunits α, β, γ different?": Workflow A, defined
by a less experienced user using the functional definition of services
(Retrieving → Aligning); Workflow B, defined by an intermediate user with
executable services (queryGene → clustalW); and Workflow C, defined by an expert
user with two extra executable services (setIds, setFilter) to ensure the
accurate output of the biological process.
Allows the transformation of workflows in semi-automatic or auto-
matic ways. The transformation from abstract workflow to the concrete workflow
can be completed by an expert bioinformatician with assistance from a service
discovery agent provided in the system. The myGrid/Taverna [65] workbench
provides users not only a visual workflow building tool but also supports the
annotation and discovery of services using an ontology. IRIS [74] provides an
approach to create, discover, and manage adapters (mediators) that are intended
to glue two bioinformatics services together with appropriate data transformation,
identifier mapping, and so forth. The BioMoby [44] project integrates access to
many of BioMoby's features into the Taverna interface in the form of a Taverna
plug-in. Users are guided through the construction of syntactically and
semantically correct workflows through plug-in calls to the Moby Central registry.
The transformation from the concrete workflow to the optimal workflow can be
completed automatically by ranking services and choosing optimal ones. Most
previous work bound the information regarding the quality of service to the
translation process, resulting in more sophisticated and complex composition
methods. Since most measurements of the quality of services change dynamically,
this tightly-coupled representation and composition method is not easily adapted
to these changes. Separating the optimal workflow from the concrete workflow
allows the easy integration of Grid computing technology to address the resource
allocation and security issues of data and computation resources.
Allows the full or partial reuse of workflows defined at different
levels. Reuse of a workflow may occur when users need to replicate their data
sets or rerun the same workflow using different input data. For example, consider
a scientist who is interested in a data set generated from a given workflow. Using
the recorded data provenance, the corresponding concrete workflow that was used
to generate this data set can be discovered. The concrete workflow can be re-
optimized and invoked with different input data.
The reuse of a workflow may also occur during workflow design. For example, a
scientist may have a high-level or partial representation of a workflow;
searching the workflow repository may return a number of similar workflows at
the abstract and/or concrete level. This scientist may choose a candidate to
reuse, or modify the workflow to meet the goal.
6.3 An enhanced workflow system
A general workflow system contains most of the components illustrated in
Figure 6.3 to support the semi-automated workflow composition process.
• Ontologies serve as a common vocabulary for semantic annotation of services
and data in the system.
• Semantics-enabled service registry is responsible for storing the semantic
and syntactic information of services as well as answering inquiries. The
semantic information can be provided by service providers or by third-party
annotation.
• Workflow composer discovers appropriate services and resolves the connectivity
between them. It is also responsible for converting the workflow into a workflow
language that can be executed on a workflow engine.
• Data provenance management keeps track of the origin of the data products.
Few workflow systems have the capability for the reuse of the knowledge gained
during the service discovery, service composition, and service invocation process.
We add two components – knowledge discovery and knowledge management – to
the workflow system and discuss how this knowledge can be used over time to
provide more accurate guidelines to users.
As most current semantic web services standards are relatively mature and
stable, the ontology model used in a system is built upon a distributed and
modularized ontology structure and reuses some cross-domain ontologies such as
Dublin Core (http://dublincore.org). The use of a well-defined ontology could
potentially increase the interoperability of information published on the web.
The ontology model used in a system normally contains two modules: a generic
service description ontology, such as OWL-S, is an ontology module used to
specify generic web service concepts including service inputs, outputs,
preconditions, and effects;
Figure 6.3. An enhanced workflow system with two added components, knowledge
management and knowledge discovery. Users create abstract workflows using the
ontology, and annotators annotate services with the same ontology. A
semantics-enabled service registry supports semantics-enabled service discovery
and service matchmaking with a DL reasoner. The workflow composer (a software
agent or experienced users) finds appropriate services to produce a concrete
workflow for the workflow execution engine; data provenance management collects
and manages information about data origination, feeding the knowledge discovery
and knowledge base management components.
a service domain ontology is an ontology module designed and used for the
semantic description of web services in a particular domain, and is normally
represented with OWL-Lite or OWL-DL.
We give a definition of service in our system as a tuple with several important
attributes:
service_i(description_i, operation_i, ...) – a service contains text descriptions
of its features, a set of operations (must not be ∅), and other attributes;
operation_ij(description_ij, input_ij, output_ij, quality_ij, performtask_ij, ...)
– an operation in a service contains text descriptions of its features, a set of
input parameters (may be ∅), a set of output parameters (may be ∅), a set of
quality metrics, a semantic description of the features using vocabulary from the
service domain ontology, and others;
parameter_k(semantic_k, datatype_k) – a parameter contains a semantic description
using vocabulary from the service domain ontology, and the data type.
The semantic annotation of services and workflows can be represented as an
RDF model and stored in an RDF repository.
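The tuple definition above translates almost directly into data types. A minimal sketch (the container types and field names are our choice; a real system would store these as RDF):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Parameter:
    semantic: str                  # term from the service domain ontology
    datatype: str                  # syntactic data type

@dataclass
class Operation:
    description: str
    inputs: List[Parameter] = field(default_factory=list)    # may be empty
    outputs: List[Parameter] = field(default_factory=list)   # may be empty
    quality: Dict[str, float] = field(default_factory=dict)  # quality metrics
    perform_task: str = ""         # semantic description of the feature

@dataclass
class Service:
    description: str
    operations: List[Operation]    # must not be empty

    def __post_init__(self):
        # enforce the "must not be ∅" constraint on operations
        if not self.operations:
            raise ValueError("a service must contain at least one operation")
```

Constructing a Service with no operations is rejected, mirroring the constraint stated in the tuple definition.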
6.3.1 Knowledge management
The knowledge management component is responsible for collecting, analyz-
ing, and handling inquiries on the knowledge base. The knowledge base holds
information gathered incrementally during workflow translation and service com-
position processes. This information provides increasingly accurate guidelines for
users over time. Four types of information are classified:
- Connectivity of services. A concrete workflow can be viewed as a graph
with a number of linked services in a certain order and logic. Each node
in the workflow is an operation of an executable service. In a simple case,
two nodes are connected if an output parameter of one operation maps to an
input parameter of another operation, based on their syntactic and semantic
descriptions.
Rule 1: operation_ij → operation_mn
if ∃ parameter_k ∈ output_ij and ∃ parameter_o ∈ input_mn and
datatype(parameter_o) = datatype(parameter_k) and
semantics(parameter_o) = semantics(parameter_k)
Rule 2: if operation_ij → operation_mn then service_i → service_m
While the composability of services can be determined by the simple rules
above, it can also be identified using more complex models [55].
The connectivity between two services can be identified automatically, based
on the rules defined above, when a new service is added to the system. This is
a computationally intensive process when the number of services in the system
and the number of parameters per operation are large. Also, incorrect
identification of the connectivity between two services is most likely caused
by misannotation of services or an incomplete ontological model. Therefore,
during the translation process, the connectivity structure should be refined
and updated based on human judgment. After a concrete workflow is created and
verified, the connectivity of its services can be added into the system. Over
time, the connectivity of services in a system forms a graph of the knowledge
space. A vertex in the graph can be represented as
(service_i, operation_ij, parameter_ijk), or as (service_i, operation_ij) if
an operation does not have parameters, and an edge represents the connectivity
of two vertices.
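Rule 1 and Rule 2 reduce to a pairwise comparison of output and input parameters. A minimal sketch of the connectivity-graph construction, assuming each parameter is a (semantic, datatype) pair and each service exposes one operation (the service names and annotations below are illustrative):

```python
def operations_connect(outputs, inputs):
    """Rule 1: some output parameter matches some input parameter
    on both the semantic annotation and the data type."""
    return any(o == i for o in outputs for i in inputs)

def build_connectivity(services):
    """Rule 2: service a connects to service b if their operations connect.
    `services` maps a name to {"in": [...], "out": [...]} parameter lists,
    each parameter being a (semantic, datatype) pair."""
    edges = set()
    for a, ops_a in services.items():
        for b, ops_b in services.items():
            if a != b and operations_connect(ops_a["out"], ops_b["in"]):
                edges.add((a, b))
    return edges

# illustrative annotations: queryGene's output can feed clustalW's input
services = {
    "queryGene": {"in": [("query_term", "string")],
                  "out": [("sequence_set", "fasta")]},
    "clustalW":  {"in": [("sequence_set", "fasta")],
                  "out": [("alignment", "clustal")]},
}
print(build_connectivity(services))  # {('queryGene', 'clustalW')}
```

The quadratic pairwise loop also makes visible why the process becomes computationally intensive as the numbers of services and parameters grow.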
- Alternativity of services. In the context of our research, we define
service_i as an alternative of service_m if, for every operation_ij ∈ service_i,
there is a corresponding operation_mn ∈ service_m whose syntactic and semantic
descriptions are the same except for the quality properties. For example, two
services that implement the same WSDL interface are alternatives of each other.
These two services may implement the WSDL interface using different underlying
technologies, charging different fees, and offering different performance.
The execution of workflows and services takes place in a distributed comput-
ing environment. The execution may fail at some point due to the failure of
the workflow engine, failure of the service, and failure of the network fabric
[64]. The capability to dynamically select alternative services ensures the
recovery from service failure. The myGrid/Taverna project provides users
a way to encapsulate alternative services into the workflow at the design
time. Another approach is to find an alternative service during run time
using general semantic service discovery technologies. We believe that iden-
tifying and storing the alternatives of a service ahead of time can increase
the performance by eliminating this semantic service discovery process. The
method can also improve the correctness of finding alternative services. The
alternativity of services can be automatically identified when a new service
is added in the system and refined during the workflow translation process.
The alternatives of a service_i can be represented as a named property of
a service – alternativeOf. The alternativeOf property is a transitive
property, which means that if service_i is an alternative of service_m, and
service_m is an alternative of service_x, then service_i is an alternative of
service_x.
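Closing off the stored alternativeOf pairs under transitivity (and symmetry, since alternatives are mutual) can be sketched with a simple traversal; in practice an OWL reasoner would infer this from the transitive property declaration. The service names below are hypothetical:

```python
from collections import defaultdict, deque

def alternatives_of(service, pairs):
    """All services reachable from `service` over alternativeOf pairs,
    treating the relation as symmetric and transitive."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)                  # alternatives are mutual
    seen, queue = {service}, deque([service])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur] - seen:    # breadth-first transitive closure
            seen.add(nxt)
            queue.append(nxt)
    seen.discard(service)              # a service is not its own alternative
    return seen

pairs = [("blast_at_NCBI", "blast_at_EBI"), ("blast_at_EBI", "blast_local")]
print(sorted(alternatives_of("blast_at_NCBI", pairs)))
# ['blast_at_EBI', 'blast_local']
```

Precomputing this closure when a service is registered is what lets a failed service be swapped for an alternative at run time without a fresh semantic discovery step.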
- Quality profile of services. As more services with similar functionalities
are published, it is important to define qualitative metrics that help the
selection of optimal services. Modeling the quality of service and approaches
for choosing optimal services have been well studied for several years [10].
While there are a number of quality criteria that can be used for ranking
services, different systems choose different sets of metrics and quality models
for computing the overall quality of service. We define quality with four
attributes, Quality(cost, trustness, executiontime, failurerate):
– cost is the fee needed to execute an operation_ij, and it is provided by the
service provider;
– trustness defines users' preference for using the operation_ij based on their
experiences, and it is annotated by users;
– executiontime and failurerate define the performance of an operation_ij, and
they are collected and calculated from each run of a workflow or service.
Other QoS properties, such as security, may also be added when needed. The
overall quality of each service can be computed periodically, or during the
optimization process, using a QoS computation model similar to the algorithm
defined in [48].
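A weighted-sum version of such an overall-quality computation might look as follows (a sketch; the weights and the example metric values are illustrative and not the model of [48]):

```python
def overall_quality(q, weights=None):
    """Score an operation from its Quality(cost, trustness, executiontime,
    failurerate) attributes; higher is better. Cost, execution time, and
    failure rate count against the score; trustness counts for it."""
    w = weights or {"cost": 0.25, "trustness": 0.25,
                    "executiontime": 0.25, "failurerate": 0.25}
    return (w["trustness"] * q["trustness"]
            - w["cost"] * q["cost"]
            - w["executiontime"] * q["executiontime"]
            - w["failurerate"] * q["failurerate"])

def rank(alternatives):
    """Order alternative operations best-first by overall quality."""
    return sorted(alternatives, key=lambda sq: overall_quality(sq[1]),
                  reverse=True)

# illustrative metric values for two alternative services
alts = [("blast_at_NCBI", {"cost": 0.0, "trustness": 0.9,
                           "executiontime": 0.4, "failurerate": 0.1}),
        ("blast_local",   {"cost": 0.2, "trustness": 0.6,
                           "executiontime": 0.1, "failurerate": 0.05})]
print(rank(alts)[0][0])  # blast_at_NCBI ranks first with these weights
```

Because executiontime and failurerate are recalculated from provenance records after each run, re-ranking with current values is what distinguishes the optimal workflow from the concrete one.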
- Mapping between abstract workflow and concrete workflow. The construction
of the abstract workflow represents the knowledge that scientists have about
their domain and the services/tools provided in the system. The abstract
workflow and the semantic annotation of the concrete workflow are represented
using the ontology; the concrete workflow is also represented in a particular
workflow specification that can be invoked on the workflow engine. Recording
the mapping between abstract and concrete workflows enables finding similar
workflows in the system given a workflow in a different representation format.
The knowledge about the connectivity of services, alternativity of services,
quality of services, and workflow representations is typically stored in tables.
6.3.2 Knowledge discovery
The knowledge discovery component resides in the workflow composer. It is
responsible for communicating between the workflow composer and the knowledge
management component during the workflow translation process to find appropriate
knowledge in the system. It is also responsible for selecting and replacing
services with their optimal alternatives during the optimization process, and
for finding a replacement during run time. The knowledge discovery component
sends requests to, and accepts responses from, the knowledge management
component.
6.4 Translation process
The process of translating an abstract workflow into a concrete workflow
involves discovering appropriate services and resolving the connectivity
between them in order to accomplish the tasks defined in the abstract workflow.
6.4.1 Service discovery and matchmaking process
During the translation process, the workflow composer issues a query to find
appropriate services that can be used to accomplish the defined task. For exam-
ple, the composer is interested in finding an operation which performs the task
“aligning”. We assume that one property of an operation is annotated using
#performTask, a vocabulary term defined in the OWL-based bioinformatics
ontology of the myGrid project (http://www.mygrid.org.uk/ontology).
A general query returns all services whose #performTask property equals #aligning.
More sophisticated discovery processes use reasoning capabilities to infer a
subsumption relationship between the requested service and the services
described using the ontology. For example, suppose an operation has been
annotated with the property #performTask using the vocabulary term
#pairwise local aligning. In the ontology definition, the class
#pairwise local aligning is not an asserted subclass but an inferred subclass
of #aligning. With subsumption reasoning, not only services annotated with
#aligning are returned, but also services annotated with
#pairwise local aligning.
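The effect of subsumption on discovery can be imitated with a transitive walk over subclass links (a deliberately simplified sketch: it only follows declared links, whereas a DL reasoner can also return inferred subclasses; the ontology fragment and annotations below are illustrative):

```python
def subclasses(cls, subclass_of):
    """`cls` plus everything transitively declared a subclass of it."""
    found, frontier = {cls}, [cls]
    while frontier:
        parent = frontier.pop()
        for child, declared_parent in subclass_of:
            if declared_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def discover(task, annotations, subclass_of):
    """Services whose performTask annotation is `task` or a subclass of it."""
    accepted = subclasses(task, subclass_of)
    return {svc for svc, t in annotations.items() if t in accepted}

# illustrative ontology fragment and service annotations
subclass_of = [("local_aligning", "aligning"),
               ("pairwise_local_aligning", "local_aligning")]
annotations = {"clustalW": "aligning",
               "blast": "pairwise_local_aligning",
               "queryGene": "retrieving"}
print(sorted(discover("aligning", annotations, subclass_of)))
# ['blast', 'clustalW']
```

Here a query for "aligning" returns the service annotated with the more specific "pairwise_local_aligning" term as well, which a plain equality query would miss.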
The general translation from an abstract workflow to a concrete workflow
requires solving the connectivity between two executable services with mismatched
or inappropriate input to output. The mismatching problem may be introduced by
inaccurate semantic annotations, incomplete semantic annotations, and inaccurate
ontological reasoning (see Figure 6.4). An example of a false positive is a
DDBJ-XML service whose output carries the semantic annotation Sequence Data
Record but actually returns a document in a self-defined format. The NCBI blast
service, whose input carries the semantic annotation Sequence Data Record,
requires FASTA-formatted sequence data. The connectivity of these two services
is identified as positive but in fact is not. This type of error can be detected
by experts at design time, or after the formed workflow runs and returns
incorrect results.
The true negative case can be detected automatically at translation time.
Adaptor, shim, or mediator [74] technologies are used to align or modify poorly
typed inputs and outputs of consecutive services in a workflow. These mediators
are stored in mediator pools, and discovering such a mediator is achieved with
ontologies and machine reasoning, the same as the discovery of normal services.
Most research has focused on how to discover these mediators using semantic web
technology and machine reasoning. A general mediation process ends with methods
for translating the output of one web service into the input for the next.
Figure 6.4. The mismatching problem may be introduced by inaccurate annotation,
incomplete semantic annotation, and inaccurate ontological reasoning during the
translation process. The figure contrasts the match-detection output with the
real match as true/false positives and negatives: false positives (e.g., a
DDBJ-XML output in a self-defined format fed to NCBI blast, which expects FASTA
format) may be detected by experts at design time or after a run, while true
negatives can be detected automatically and repaired with a mediator, adaptor,
or shim (e.g., between a GenBank record output and a Blastp protein-sequence
input).
6.4.2 Knowledge reuse
With the incrementally added information in the knowledge base, solving
connectivity can be done completely at the syntax level, without the need to
consult the domain ontology. Over time, converting the abstract workflow to the
concrete workflow may be achieved by finding a mediator between two services in
the knowledge base. Thus, the use of ontologies will be confined to those parts
of the workflow that were never used before. The manual translation process will
be required just once for every new element of the set of components in a
workflow, and when a new service is added to the registry. The problem of
solving the connectivity between two services can thus be converted to the
problem of finding a path between two nodes in a connectivity graph.
During the translation process, instead of resolving the connectivity from
scratch using semantic reasoning technology, the composer can reuse stored knowl-
edge to support the semi-automatic and automatic composition.
1. Given a service or operation, all services or operations connected to it
can be found by table lookup and presented to the users. Users can choose one
based on their expertise. Since the connectivity stored in the table was
verified during previous workflow creation processes, we expect finding an
accurate match to be both more likely and faster than applying semantic
reasoning techniques from scratch.
2. Given two services or operations, find one or a sequence of services or
operations between them (mediators) that can connect the two together. This
can be converted to the problem of finding a path from service or operation A
to service or operation B. Since the connectivity structure of services and
operations in the knowledge base is a graph, a shortest-path algorithm
(Dijkstra's) is applicable to this problem.
3. This concept can be extended to a wider use case, when users know the exact
input they can provide and the output they are trying to get. A general
planning technique first finds a service or operation that accepts this input
and a service or operation that generates this output. Using the connectivity
structure, a path between the input and the output can then be found, if one
exists.
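Case 2 reduces to a shortest-path query over the connectivity graph. With unit edge weights, a sketch using Dijkstra's algorithm (the service names in the example graph are illustrative):

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over the unit-weight connectivity graph;
    `graph` maps a service to the services its output can feed."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:                      # reconstruct the mediator chain
            path = [goal]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(node, float("inf")):  # skip stale heap entries
            continue
        for nxt in graph.get(node, ()):
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                prev[nxt] = node
                heapq.heappush(heap, (d + 1, nxt))
    return None                               # no connecting path exists

# hypothetical connectivity graph: DDBJ-XML reaches NCBI blast via a shim
graph = {"DDBJ-XML": ["shim"], "shim": ["NCBI_blast"], "NCBI_blast": []}
print(shortest_path(graph, "DDBJ-XML", "NCBI_blast"))
# ['DDBJ-XML', 'shim', 'NCBI_blast']
```

The interior nodes of the returned path are exactly the mediators needed to connect the two endpoint services.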
6.4.3 Implementation and evaluation
The connectivity between two services is identified automatically when a new
service is registered in the semantics-enabled registry, using the matching
rules defined in Section 6.3. As more services are registered, the connectivity
graph is formed. Since the automatic identification process may introduce
mismatches, these cases can be corrected during the workflow translation process
with knowledge from experts (see Figure 6.5). During the workflow translation
process, the knowledge discovery component can find the path between two
services/operations and suggest the next available services/operations by
searching the knowledge base at the syntactic level. The search function is
implemented with Dijkstra's algorithm.
Figure 6.5. The creation process of the connectivity graph: when a new service
is added to the registry, its connectivity is identified automatically and
stored in the knowledge base; the connectivity is then refined, updated, and
decomposed during the workflow translation / service composition process.
TABLE 6.1

PERFORMANCE EVALUATION OF MATCH DETECTION PROCESS

Number of   Number of       Load RDF repository   Average time of match detection
Services    Matched Pairs   (milliseconds)        per single service (milliseconds)

200         10              1547                  12.02
400         34              2346                  13.01
600         84              2600                  12.31
800         138             3015                  12.35
1000        225             3325                  12.51
The connectivity graph approach is evaluated on a Dell laptop with a 1.5GHz
Pentium M CPU and 512MB of RAM. Service descriptions are randomly generated
using 418 concepts from domain ontologies (myGrid and MoGServ) for the semantic
type and 10 defined concepts for the data type. Each service contains 1
operation; each operation has 1 input and 1 output. The measured performance of
the match detection process during service registration is reported in Table
6.1. The number of matched pairs reports the identified pairs of services in
which one service's output can be fed as input to the other. Although the time
to load the RDF repository (Sesame) increases as the number of generated
services increases, this loading is typically done once. The average time of the
matching process when a new service is registered in the repository is about
12-13 milliseconds.
The shortest-path search function is evaluated using the connectivity graph
created from the 1000 randomly generated semantic web services.
TABLE 6.2

PERFORMANCE EVALUATION OF PATH SEARCHING PROCESS

Number of nodes   Number of arcs   Average path search     Connectivity graph
                                   time (milliseconds)     load time (milliseconds)

724               587              Less than 1             220
The graph is formed from the matched pairs and the input/output parameters of
each service in a matched pair. The measured performance of the path-finding
process is reported in Table 6.2. Loading the connectivity graph is typically
done once. The average path search time is less than 1 millisecond. The longest
path between two nodes has 9 additional nodes.
These preliminary results suggest that the performance of our implementation is
acceptable. Further testing with real services and workflows is needed to fully
validate our approach.
6.5 Workflow reuse
Both abstract workflows and concrete workflows can be viewed as graphs. With
this type of graph representation, graph matching techniques can be applied to
find similar workflows in the system. Although in-depth graph-theoretic research
is not the main focus of this investigation, we are interested in applying an
efficient algorithm to find similar workflows in the system, given the graph
representation of an abstract or concrete workflow.
SUBDUE (available at http://cygnus.uta.edu/subdue/) is a graph-based knowledge
discovery system that finds structural and relational patterns in data
representing entities and relationships. SUBDUE represents data using a labeled,
directed graph in which entities are represented by labeled vertices or
subgraphs, and relationships are represented by labeled edges between the
entities. The SUBDUE graph match utility [18] is a part of the SUBDUE data
mining system. The graph match utility can perform exact and inexact graph
matches on directed or undirected graphs with labeled vertices and edges. It
addresses the graph isomorphism problem, which is defined as: given two graphs
G1 and G2, is it possible to permute (or relabel) the vertices of one graph so
that it is equivalent to the other?
For example, a scientist may have a scientific process in her mind such as:
“I'd like to get all ATP alpha units of plastids in my MoG investigation, do
multiple sequence alignments, and get an alignment report in a format that I am
able to feed into my local PAUP program.” A possible abstract workflow she may
define is similar to Figure 6.6.
The given workflow is converted to the graph representation that can be fed
into the match algorithm. The match algorithm computes the similarity of the
given workflow against all the workflows stored in the knowledge base. The
match cost returned by the SUBDUE algorithm is the measurement that we use to
rank the similarity of the workflows. If two graphs are identical, the match
cost is 0. The costs of the various graph match transformations affect the
results, and can be adjusted based on the importance of each transformation.
For example, we might define the cost of substituting a vertex or edge label
to be higher than the cost of deleting the vertex or edge. With the costs
specified this way, the algorithm can find better-suited results.
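A toy version of such a cost-weighted match, comparing only the multisets of vertex and edge labels rather than performing the full (in)exact graph match that SUBDUE implements, illustrates how transformation costs shape the ranking (the workflow graphs below are hypothetical):

```python
from collections import Counter

def match_cost(g1, g2, costs=None):
    """Count the label insertions/deletions needed to turn g1 into g2,
    weighting vertex and edge changes separately; identical graphs cost 0.
    Each graph is a (vertex_labels, edge_labels) pair of label lists."""
    c = costs or {"vertex": 1.0, "edge": 1.0}
    v1, e1 = Counter(g1[0]), Counter(g1[1])
    v2, e2 = Counter(g2[0]), Counter(g2[1])
    vdiff = sum(((v1 - v2) + (v2 - v1)).values())   # symmetric difference
    ediff = sum(((e1 - e2) + (e2 - e1)).values())
    return c["vertex"] * vdiff + c["edge"] * ediff

# hypothetical query and stored workflow graphs
wf_query  = (["task", "task"], ["hasNext"])
wf_stored = (["task", "task", "task"], ["hasNext", "hasNext"])
print(match_cost(wf_query, wf_query))    # 0.0 for identical graphs
print(match_cost(wf_query, wf_stored))   # 2.0: one extra vertex, one extra edge
```

Raising the vertex weight relative to the edge weight makes structural additions of tasks count more heavily than extra links, which is the kind of tuning the transformation costs allow.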
A threshold on returned workflows is defined based on the match cost. One or
more workflows with the highest similarity are returned and presented to users.
Users
Figure 6.6. The graph representation of a workflow for describing a scientific
process. The graph view contains vertices for the input, output, two tasks, the
query_term parameter, the retrieving and aligning functions, and the
multiple_alignment_report, linked by hasNext, hasInput, hasOutput, performTask,
and hasParameter edges; the same graph is listed in SUBDUE's vertex/edge input
format.
may decide to use these workflows as templates to refine their workflow
definition at the abstract or concrete level. Alternatively, users may decide
to use a returned workflow directly to conduct their experiments.
6.6 Related work
Abstract and concrete workflows have been introduced in various scientific
workflow literature and systems [22, 24, 73]. These two representations create
views of certain aspects of a workflow that meet the interests of users with
different levels of knowledge of the services and of a particular domain.
However, in these systems and in the literature, the notions of concrete
workflow and optimal workflow are combined and often not distinguished as two
separate representations. We believe that separating these two workflow
representations provides the flexibility of dynamic binding, the ability to
select optimal services, and easier integration of Grid resource management
services.
The translation of an abstract workflow into a concrete workflow is a process
of service discovery and service composition. It normally uses an ontology to
annotate services and applies reasoning and matchmaking technologies to compose
a workflow. A number of research investigations focus on automating this
process and assume that the ontological model is well defined and services are
correctly annotated, which is not always the case. Rao et al. [76] present an
approach that addresses the reality of incomplete annotation; their framework
helps users become better at annotating composable functionality over time.
The enhanced system and methodology proposed in this chapter are intended to
reuse knowledge that has been verified by others, providing users more accurate
guidance for service discovery using that stored knowledge.
The importance of reuse and repurposing of workflows has been reported in
[37, 104]. Antoon Goderis et al. [38] present a graph-based approach to finding
similar concrete workflows on the web. This approach is similar to ours, but
uses a different graph matching algorithm and different graph representations.
6.7 Conclusion and future work

In this chapter, we present the importance of implementing workflow and
knowledge reuse. In order to support that reuse, we propose and describe a
methodology and an enhanced workflow system. It includes a hierarchical
workflow structure consisting of four levels that allows users to specify
workflows at different levels of abstraction, based on their knowledge and
experience. Two components are added to the workflow system to collect and
analyze the reusable products of the workflow translation process and of the
registration of new services in the system.
The proposed methodology is being used in the design and implementation of a
service-oriented system for supporting bioinformatics research. Based on the
successful design and implementation of the system (MoGServ) [110], we
developed an ontological model for data and service annotation in the system.
At the current stage, the number of services, operations, and workflows in the
system is relatively small, but it is expected to grow with usage. The future
MoGServ is intended to support genomic research and provide a workbench for
biologists in the Indiana Center for Insect Genomics (ICIG)1, a research center
composed of three academic institutional partners. Users can define a genomic
research workflow through a web interface for a particular application. This
may result in higher productivity for genomics researchers, and in synergy
resulting from the transparent integration of data and analysis tools from
multiple locations. We believe that the enhanced workflow system with the
knowledge reuse capability can provide more accurate guidance during the
workflow creation process and make the process more efficient. A systematic
evaluation is being conducted.
1http://ctdrt.bio.nd.edu/index.php?content=projectinfo.php&projectno=4
CHAPTER 7
SUMMARY AND FUTURE WORK
7.1 Summary
In this dissertation, we present a practical experiment of building a
service-oriented system upon current web services technologies and
bioinformatics middleware. The first prototype of this system integrates data
and services from other service providers. It is being evaluated on a
phylogenetic research application, Mother of Green (MoG). Our evaluation
demonstrates that a service-oriented architecture can accelerate scientific
research, increase research productivity, and provide a new approach to doing
science.
Based on the successful design and implementation of this prototype, we
present an enhanced system with semantic annotation of services and data. The
enhancement aims at allowing life science researchers to define their
experiments at different levels, based on their knowledge of the tools, data,
and the system. The semantically enriched data allow easier reuse and sharing,
and enable search-based experiments to be conducted.
Few current practical methodologies and workflow systems for service com-
position and workflow creation in e-Science consider the potential for reuse: to
share the knowledge gained during the service composition process and to reuse
complete or partial existing workflows. We believe that providing a capability
131
for reuse of this knowledge and these workflows could be an important component of a
workflow system. We propose a methodology and an enhanced system design to
facilitate the reuse of knowledge and workflows. It contains a hierarchical work-
flow structure representation, knowledge management and knowledge discovery
components to capture and manage the reusable knowledge in a system, and an
approach that uses a graph matching algorithm to discover similar workflows.
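The hierarchical workflow structure can be illustrated with a small sketch. The representation below is a simplified illustration, not the system's actual data model, and the workflow and service names are invented: a workflow node is either a primitive service operation or a named sub-workflow, and flattening a node yields the primitive operations it reuses.

```python
# Illustrative sketch of a hierarchical workflow: a node is either a
# primitive service operation (a string) or a sub-workflow (a dict with
# a name and an ordered list of steps). Reusing a sub-workflow means
# embedding it as a step of a larger workflow.

def flatten(node):
    """Return the ordered list of primitive operations in a workflow node."""
    if isinstance(node, str):          # primitive service operation
        return [node]
    ops = []                           # composite node: a named sub-workflow
    for child in node["steps"]:
        ops.extend(flatten(child))
    return ops

# A phylogenetic workflow reusing an "alignment" sub-workflow.
alignment = {"name": "alignment", "steps": ["QueryLocal", "ClustalW"]}
phylogeny = {"name": "phylogeny", "steps": [alignment, "Nexus", "PAUP"]}

print(flatten(phylogeny))   # ['QueryLocal', 'ClustalW', 'Nexus', 'PAUP']
```

Flattening to primitive operations is also a convenient input for the graph matching step, since two workflows that share a sub-workflow share its whole operation sequence.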
7.2 Limitations and future work
The future MoGServ is intended to support genomic research and provide a
workbench for biologists in the Indiana Center for Insect Genomics (ICIG). The
ICIG includes three partners: the University of Notre Dame, Purdue University, and
Indiana University. The future MoGServ can help a user at an ICIG site discover
data or computational web services that are available at that site, at other ICIG
partners' locations, or elsewhere on the web. In Chapter 3, we discussed several
limitations of the initial MoG implementation that must be addressed before the
system can be used across multiple sites. These limitations include security,
resource management, and end-user oriented workflow creation. Several improvements
and theoretical approaches are described in Chapter 5 and Chapter 6; however, there
is still more work to be done. Future work may be conducted along several directions.
• Integration of GridSAM. We will explore a way to integrate the MoG system
with a grid computing architecture so that security, resource allocation,
and resource management can be shifted to existing grid computing
technologies. In the MoGServ implementation, we have a simple resource
management mechanism implemented by two components, a job manager
and a job launcher. A more sophisticated mechanism could be used
132
to integrate into the MoGServ system. The GridSAM1 Web Service is a
WS-I compliant Web Service implementation of the GridSAM service inter-
face as well as the upcoming Global Grid Forum Basic Execution Service
interface. It integrates with the GridSAM Core Engine to provide remote
job launching and file staging capabilities, as described in a Job Submission
Description Language document. As a new feature introduced in GridSAM
2.0.0, a sophisticated authorization mechanism provides a powerful capabil-
ity to control incoming service requests on a user/group basis.
• Enhancement of the user interface. In the current MoGServ, the data model
design has a table to keep personalized information for each individual user. An
authorization component should be built into the system to enable users to
access the permitted services and to personalize their own workspaces. A web
portal will be built to enable users to create an account and to log in and out
with a username and password. The user account information, including the
access level, will be stored in a database. The GridSphere portal framework
[39], an open-source portlet-based web portal, is one of the candidates.
• Enhancement of data annotation and the ontological model. The current onto-
logical model captures the metadata that users need to query their data
provenance. As the system is used more, the ontological model
may need to be updated with new properties and concepts.
• Integration of the presented new functionalities into the system. In this dis-
sertation research, we present a new approach to improve the reuse of
workflows and their by-products, as well as a hierarchical workflow struc-
ture. The future work is to add these functionalities to the system by
developing an easy-to-use interface for users to define workflows at multiple
levels, and by allowing users to choose similar workflows and manipulate
them as desired.
1http://gridsam.sourceforge.net/2.0.0/index.html
133
134
APPENDIX A
GLOSSARY
BPEL4WS – Business Process Execution Language for Web Services provides a language for the formal specification of business processes and business interaction protocols.
BLAST – the Basic Local Alignment Search Tool algorithm is used to compare nucleotide or protein sequences to sequence databases and to calculate the statistical significance of matches.
ClustalW – a tool for global multiple alignment of DNA and protein sequences.
FASTA – a common sequence format that begins with a single-line description followed by lines of sequence data.
HGT – Horizontal gene transfer is a process in which an organism transfers genetic material to another cell that is not its offspring. HGT occurs outside of the mechanisms of Mendelian genetics, crossing species, order, and family reproductive barriers.
J2EE – Java 2 Platform, Enterprise Edition defines the standard for developing component-based multitier enterprise applications.
JXTA – an open source peer-to-peer platform created by Sun Microsystems in 2001.
LGT – Lateral gene transfer is a process in which an organism transfers genetic material to another cell that is not its offspring. LGT occurs within the cell, from endosymbiont genomes to the host cell nucleus.
MoG – Mother of Green is a collaborative research project on plastid phylogenetic analysis involving information technologists and biologists.
MoGServ – a service-oriented system for data integration and data analysis for phylogenetic analysis.
135
NEXUS – the NEXUS format was designed by David Maddison, Wayne Maddison, and David Swofford to facilitate the interchange of input files between programs used in phylogeny and classification.
OGSA – Open Grid Services Architecture.
OWL – Web Ontology Language.
OWL-S – an ontology for describing web services using OWL.
PAUP – a program for phylogenetic analysis using parsimony, maximum likelihood, and distance methods.
Phylip – a set of modular programs for performing numerous types of phylogenetic analysis.
Phylogeny – also called phylogenesis; the origin and evolution of a group of organisms.
Phylogenetics – the study of the evolutionary relationships among various groups of organisms.
RDF – Resource Description Framework is the basic standard for knowledge sharing and reuse in the semantic web.
REST – Representational State Transfer is a term coined by Roy Fielding to describe an architectural style for networked systems.
SAM – Sequence Alignment and Modeling System.
SOA – Service-oriented architecture.
SOAP – Simple Object Access Protocol is a protocol for exchanging messages among requesters and providers.
SOC – Service-oriented computing.
UDDI – Universal Description, Discovery and Integration provides a standard registry for the publishing, discovery, and reuse of web services.
WSDL – Web Service Description Language defines the abstract interface of services.
WS-I – an open industry organization chartered to promote Web services interoperability; it creates, promotes, and supports generic protocols for the interoperable exchange of messages between Web services.
WSRF – Web Services Resource Framework.
136
XML – Extensible Markup Language.
XSLT – XSL Transformations is a language for transforming XML documents into other XML documents. XSL specifies the styling of an XML document by using XSLT to describe how the document is transformed into another XML document that uses the formatting vocabulary.
A.1 Pictures
Figure A.1. Time line for the origin of life and major invasions giving rise to mitochondria and plastids [27].
137
Figure A.2. Gene transfer to the nucleus. [27]
138
Figure A.3. Symbiosis process [69]
139
Figure A.4. ATP Synthase: the wheel that powers life. It is a candidate for ascertainment of deep phylogeny.
140
APPENDIX B
MOGSERV MANUAL
B.1 Main
MoGServ is accessible through the URL http://almond.cse.nd.edu:10000/bioinfor1.
If you are inside the ND network, you may also access another MoGServ host at
http://biocomp.science.nd.edu:8080/mog. (See Figure B.1.)
B.2 Retrieve genome and gene data from NCBI database
The data collection service retrieves complete genome sequences and gene sequences
using terms defined by users. Retrieved sequences are stored in the local database.
The service is executed weekly on weekends, or daily at night, to update
the database. See Figure B.2.
B.3 Query local database
This service allows users to create gene sequence sets or genome sequence sets
by querying the local database. The metadata of these sequences are indexed
using the Lucene indexing and search engine. Valid queries, which follow the
Lucene syntax, include "chlo*", "ATP and atp", and so on. Users input their query
and choose either "gene" or "complete genome" sequences. A set of sequences is
returned. Users can examine the set and delete sequences from it. Then
141
Figure B.1. The main menu of the MoGServ
users can choose either "create new set" or "add to an existing set". "Create
new set" puts these sequences together in order to do sequence alignment; a
set id is returned to users for further reference. "Add to an existing set" puts these
sequences into an existing set (whose id is input by users). See Figures B.3 and B.4.
Users can also download these sets in different formats.
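As an illustration of how such queries behave, the following Python sketch mimics Lucene-style wildcard and AND matching over a toy set of sequence metadata records. The records, field values, and the `matches` helper are invented for this example; the real system uses the Lucene library itself.

```python
# Illustrative sketch (not the actual Lucene engine) of how queries such
# as "chlo*" or "ATP and atp" match sequence metadata; fnmatch stands in
# for Lucene's wildcard handling. All records here are made up.
from fnmatch import fnmatch

records = [
    {"name": "chloroplast ATP synthase", "taxonomy": "Cyanobacteria"},
    {"name": "atpB beta subunit", "taxonomy": "Viridiplantae"},
]

def matches(record, term):
    """True if any word in any field matches the (possibly wildcard) term."""
    return any(fnmatch(word.lower(), term.lower())
               for value in record.values()
               for word in value.split())

def search(query):
    """AND together all terms, as in 'ATP AND atp'."""
    terms = [t for t in query.split() if t.upper() != "AND"]
    return [r for r in records if all(matches(r, t) for t in terms)]

print(len(search("chlo*")))   # 1 -- only the chloroplast record matches
```

Real Lucene also supports phrases, field-qualified terms, and grouping (see the syntax examples in Appendix D); this sketch only shows the wildcard and AND cases mentioned here.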
B.4 Set management
Users can upload a set of sequences in fasta format to the local database.
These sequences can be from users’ own lab experiments, which may not be ready
to submit to the public database. They can also be a small number of sequences
142
Figure B.2. A web interface provides users a way to define data of interest.
not in the local database at that time. These sequences are annotated using the
appropriate metadata description. See Figure B.5.
Users can query the information of a set, as shown in Figure B.6, such as the
creation date, the origin of the set, etc.
Users can also use the set filter service to find the intersection of organisms
among multiple sets. See Figure B.7.
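The set filter operation amounts to a set intersection; a minimal Python sketch, with invented organism names, is:

```python
# Minimal sketch of the set filter service: given several sequence sets
# keyed by organism name, report the organisms common to all of them.
def set_filter(*seq_sets):
    """Return the sorted intersection of organism names across all sets."""
    common = set(seq_sets[0])
    for s in seq_sets[1:]:
        common &= set(s)
    return sorted(common)

set_a = {"Zea mays", "Saccharum", "Pinus koraiensis"}
set_b = {"Zea mays", "Marchantia", "Saccharum"}
print(set_filter(set_a, set_b))   # ['Saccharum', 'Zea mays']
```

Restricting several sets to their common organisms in this way is what makes downstream alignments comparable across sets.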
B.5 Data analysis services
The MoGServ system provides seven data analysis services: blastn, blastp, blastx,
tblastn, tblastx, MegaBLAST, and ClustalW.
In order to use blast and megablast to do sequence alignment, users need to
143
Figure B.3. Input the query term from this interface and choose the gene or genome database
input two sequence sets: a base set and a compare set. A base set is a set of sequences
analogous to the "database" field on the NCBI blast search website. A compare
set is a set of sequences that is compared against a base set; it is analogous to the
"search" field on the NCBI blast search website. Base sets and compare sets need to be
created using the "Query Local" or "Set management" services. Users can define a few
parameters, such as e-value, window size, and so on. A job id will be returned and
shown in the browser. Users should record this id number for further reference.
When the task is executed, the required sequences are retrieved from the local database
and input to the blast (megablast) program. Comparison results are stored in the
local file system for downloading. Figure B.8 shows the tblastn service.
144
In order to use the ClustalW service, users need to define the set id and the sequence
type. See Figure B.9. The job id is returned for further reference.
B.6 Job management
This service allows users to query job information and monitor the execu-
tion status of their submitted jobs. There are three execution statuses: "submit",
"start", and "finish". The "output" field becomes a hot link when the execution
status turns to "finish". Users can follow the link to view the input and output
of each data analysis job. See Figures B.10, B.11, and B.12.
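The job life cycle described above can be sketched as a tiny state machine. The `Job` class and its fields are illustrative, not the MoGServ implementation:

```python
# Sketch of the job life cycle: a job moves through "submit" -> "start"
# -> "finish", and its output link becomes active only at "finish".
STATUSES = ["submit", "start", "finish"]

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.status = "submit"

    def advance(self):
        """Move to the next execution status, if any."""
        i = STATUSES.index(self.status)
        if i + 1 < len(STATUSES):
            self.status = STATUSES[i + 1]

    @property
    def output_available(self):
        return self.status == "finish"

job = Job(142)
job.advance()   # submit -> start
job.advance()   # start -> finish
print(job.status, job.output_available)   # finish True
```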
145
Figure B.4. The results from querying local database
146
Figure B.5. Users may copy and paste particular sequences and upload them to the local database
147
Figure B.6. Set information
148
Figure B.7. The set filter service is used to find the intersection of organisms among multiple sets.
149
Figure B.8. tblastn interface in MoGServ
150
Figure B.9. ClustalW Interface in MoGServ
151
Figure B.10. The job management interface shows the status, input link, and output link of a job
152
Figure B.11. An example input of a ClustalW analysis; the set id is a hot link, so users can view sequence information in this set.
153
Figure B.12. An example output of a ClustalW analysis; users can download, convert, and view the results.
154
APPENDIX C
DEVELOPMENT AND DEPLOYMENT TOOLKITS
Some development and deployment toolkits we used for the implementation
are listed in Table C.1. All the software packages are open source and can be
downloaded from the listed URLs.
155
TABLE C.1

OPEN SOURCE SOFTWARE PACKAGES USED FOR DEVELOPMENT AND DEPLOYMENT

Packages       Version                Descriptions                                                                        URL

Apache Axis    axis-1.2RC2            SOAP engine for developing and hosting web services                                 http://ws.apache.org/axis/
Tomcat         jakarta-tomcat-5.0.18  J2EE compliant servlet container                                                    http://tomcat.apache.org/
Taverna        1.4                    a GUI based workbench for creating, executing, and monitoring workflows             http://taverna.sourceforge.net/
Apache Lucene  1.4.3                  a high-performance, full-featured text search engine library written in Java       http://lucene.apache.org/java/docs
PostgreSQL     8.0.3                  a relational database system                                                        http://www.postgresql.org/
Protege        3.2                    an ontology editor and knowledge-base framework with OWL support                    http://protege.stanford.edu/
Pellet         1.3-beta2              an open-source Java based OWL DL reasoner                                           http://pellet.owldl.com/
Sesame         1.2.6                  an open source RDF framework with support for RDF Schema inferencing and querying   http://www.openrdf.org/about.jsp
Subdue         5.1.4                  a graph-based data mining system                                                    http://cygnus.uta.edu/subdue/
156
APPENDIX D
SUPPLEMENTARY MATERIAL FOR CHAPTER 3 AND CHAPTER 4
D.1 Complete genome sequence in XML
<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN"
  "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
<INSDSeq>
  <INSDSeq_locus>NC_005042</INSDSeq_locus>
  <INSDSeq_length>1751080</INSDSeq_length>
  <INSDSeq_strandedness>double</INSDSeq_strandedness>
  <INSDSeq_moltype>DNA</INSDSeq_moltype>
  <INSDSeq_topology>circular</INSDSeq_topology>
  <INSDSeq_division>BCT</INSDSeq_division>
  <INSDSeq_update-date>24-JUL-2006</INSDSeq_update-date>
  <INSDSeq_create-date>25-JUL-2003</INSDSeq_create-date>
  <INSDSeq_definition>Prochlorococcus marinus subsp. marinus str. CCMP1375,
    complete genome</INSDSeq_definition>
  <INSDSeq_primary-accession>NC_005042</INSDSeq_primary-accession>
  <INSDSeq_accession-version>NC_005042.1</INSDSeq_accession-version>
  <INSDSeq_other-seqids>
    <INSDSeqid>ref|NC_005042.1|</INSDSeqid>
    <INSDSeqid>gnl|NCBI_GENOMES|310</INSDSeqid>
    <INSDSeqid>gi|33239452</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_project>419</INSDSeq_project>
  <INSDSeq_source>Prochlorococcus marinus subsp. marinus str. CCMP1375
    (Prochlorococcus marinus SS120)</INSDSeq_source>
  <INSDSeq_organism>Prochlorococcus marinus subsp. marinus str. CCMP1375</INSDSeq_organism>
  <INSDSeq_taxonomy>Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae;
    Prochlorococcus</INSDSeq_taxonomy>
  ....
  <INSDSeq_feature-table>
    ....
    <INSDFeature>
      <INSDFeature_key>CDS</INSDFeature_key>
      <INSDFeature_location>1447640..1449106</INSDFeature_location>
157
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1447640</INSDInterval_from>
          <INSDInterval_to>1449106</INSDInterval_to>
          <INSDInterval_accession>NC_005042.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>atpD</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>locus_tag</INSDQualifier_name>
          <INSDQualifier_value>Pro1591</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>note</INSDQualifier_name>
          <INSDQualifier_value>Produces ATP from ADP in the presence of a proton
            gradient across the membrane. The beta chain is a regulatory subunit</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>codon_start</INSDQualifier_name>
          <INSDQualifier_value>1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>transl_table</INSDQualifier_name>
          <INSDQualifier_value>11</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>product</INSDQualifier_name>
          <INSDQualifier_value>ATP synthase subunit B</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>protein_id</INSDQualifier_name>
          <INSDQualifier_value>NP_875982.1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GI:33241040</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GeneID:1462973</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>translation</INSDQualifier_name>
          <INSDQualifier_value>MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGK
NPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIF
158
NVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    ....
  </INSDSeq_feature-table>
  <INSDSeq_sequence> ....... </INSDSeq_sequence>
</INSDSeq>
The size of this example XML file is about 7.7 MB; the size of the complete genome
sequence in fasta format is about 1.7 MB. The actual length of this sequence is
1751080 nt.
D.2 Example of an ATP synthase subunit B sequence
Fasta format:
>gi|33241040|ref|NP_875982.1| ATP synthase subunit B [Prochlorococcus marinus subsp. marinus str. CCMP1375]
MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGKNPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIFNVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK
TinySeq XML:
<?xml version="1.0"?>
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN"
  "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeq>
  <TSeq_seqtype value="protein"/>
  <TSeq_gi>33241040</TSeq_gi>
  <TSeq_accver>NP_875982.1</TSeq_accver>
  <TSeq_sid>gnl|REF_uproscoff|Pro1591</TSeq_sid>
  <TSeq_taxid>167539</TSeq_taxid>
  <TSeq_orgname>Prochlorococcus marinus subsp. marinus str. CCMP1375
159
  </TSeq_orgname>
  <TSeq_defline>ATP synthase subunit B [Prochlorococcus marinus subsp.
    marinus str. CCMP1375]</TSeq_defline>
  <TSeq_length>488</TSeq_length>
  <TSeq_sequence>MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGKNPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGSAISVPVGEATLGRIFNVLGEPVDEQGPVKTKTTSPIHREAPKLTDLETKPKVFETGIKVIDLLAPYRQGGKVGLFGGAGVGKTVLIQELINNIAKEHGGVSVFGGVGERTREGNDLYEEFKESGVINADDLTQSKVALCFGQMNEPPGARMRVGLSALTMAEHFRDVNKQDVLLFVDNIFRFVQAGSEVSALLGRMPSAVGYQPTLGTDVGELQERITSTLEGSITSIQAVYVPADDLTDPAPATTFAHLDATTVLARALAAKGIYPAVDPLDSTSTMLQPSVVGDEHYRTARAVQSTLQRYKELQDIIAILGLDELSEDDRRTVDRARKIEKFLSQPFFVAEIFTGMSGKYVKLEDTIAGFNMILSGELDDLPEQAFYLVGNITEVKEKAQKISADAKK</TSeq_sequence>
</TSeq>
There are a total of 182 complete genome sequences in the database and 878 ATP gene
sequences.
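A minimal parser for the FASTA format shown above can be written in a few lines. This is a generic sketch, not the MoGServ upload code:

```python
# Parse FASTA text: a '>' description line followed by lines of
# sequence data, possibly with several entries per file.
def parse_fasta(text):
    """Return a list of (description, sequence) pairs."""
    entries = []
    desc, chunks = None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if desc is not None:                 # close the previous entry
                entries.append((desc, "".join(chunks)))
            desc, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())          # accumulate sequence lines
    if desc is not None:
        entries.append((desc, "".join(chunks)))
    return entries

fasta = """>gi|33241040|ref|NP_875982.1| ATP synthase subunit B
MAAAATASTGTKGVVRQVIGPVLDVEFPAGKLPKILNALRIEGK
NPAGQDVALTAEVQQLLGDHRVRAVAMSGTDGLVRGMEAIDTGS
"""
desc, seq = parse_fasta(fasta)[0]
print(desc.split("|")[1], len(seq))   # 33241040 88
```

The gi number is recoverable from the pipe-delimited description line, which is how sequences like the one above can be cross-referenced against the database indexes described in Section D.4.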
D.3 Protein name
For each whole genome sequence, we find all of the proteins that make up ATP
synthase (see Table D.1).
D.4 Syntax for searching the local database
The local database is indexed using the Lucene search engine. Refer to the Lucene
query parser documentation1 for the complete syntax description. There are two
tables that store complete genome sequences and gene sequences, respectively.
Table D.2 lists the syntax and examples for searching these databases. Table D.3
summarizes the fields used when creating the index.
D.5 Workflow for retrieving sequences
Since we use web services provided by NCBI to retrieve the sequences, there
may be failures during the data collection process. Recording the status of data retrieval
1http://lucene.apache.org/java/docs/queryparsersyntax.html
160
TABLE D.1
NAME OF ATP SYNTHASE
Protein name description
atpC gamma chain
atp1 protein 1
atpI chain a
atpH subunit c
atpG chain b’
atpF chain b
atpD delta chain
atpA alpha chain
atpB beta subunit
atpE epsilon subunit
N/A “ATP synthase”
ch1M Mg-protoporphyrin IX methyl transferase
ftrC ferredoxin-thioredoxin reductase, catalytic chain
TABLE D.2
SYNTAX OF SEARCHING LOCAL DATABASE
Query type Example
single words cyanobacteria
phrase “ATP synthase”
field name:ATP AND gamma AND plastid
boolean atpa NOT bacteria
grouping atpa AND (plastid or cyanobacteria)
161
TABLE D.3
INDEXING FIELD OF LOCAL DATABASE
Field Comments
gi gi number of the sequence
accver accver number of the sequence
name name of the sequence
term query defined by users and used to get this sequence from NCBI
taxonomy taxonomy of the sequence provided by NCBI
cds name of protein that make up atp synthase (only in gene table)
nucleotide gi gi number of corresponding nucleotide gi which is also the gi fromthe complete genome (only in gene table)
nucleotide name name of corresponding nucleotide sequence(only in gene table)
default the default field contains all the information described above, with-out specify the field name
in the database enables us to examine the integrity of the data. Parsing the XML
file requires a large amount of memory. We handle redundant sequences but record
each query term. The database is updated weekly or daily.
Pseudo code for retrieving complete genome sequences:
get search term from ncbi_retrieve table
for each term
    get sequence in fasta format
    set retrieve_gene_status as 'ready'
Pseudo code for retrieving gene sequences:
get acceid from ncbi_genomes table where retrieve_gene_status is 'ready'
for each acceid
    update retrieve_gene_status as 'start' in ncbi_genome table
    get sequence in GB XML format
    parse the XML to get particular protein sequence acceid
    use acceid to get protein sequence in fasta format
    compute the corresponding nucleotide sequence
    get taxonomy of the sequence
162
    update retrieve_gene_status as 'finish' in ncbi_genome table
    update the taxonomy for the sequence in ncbi_genome table
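The pseudo code above can be sketched in Python as a status-driven loop. Here `fetch_fasta` is a stand-in for the NCBI web service call, which may fail, so the per-record status lets a later run resume where the previous one stopped; all names in this sketch are illustrative, not the actual implementation.

```python
# Status-driven retrieval loop: only 'ready' records are processed, and a
# record that fails stays in 'start' so its integrity can be examined later.
def retrieve_genes(table, fetch_fasta):
    """Process records whose retrieve_gene_status is 'ready'."""
    for record in table:
        if record["retrieve_gene_status"] != "ready":
            continue
        record["retrieve_gene_status"] = "start"
        try:
            record["sequence"] = fetch_fasta(record["acceid"])
            record["retrieve_gene_status"] = "finish"
        except IOError:
            pass   # leave the record in 'start' so the failure is visible

table = [{"acceid": "NC_005042", "retrieve_gene_status": "ready"}]
retrieve_genes(table, lambda acc: ">%s\nATGC" % acc)
print(table[0]["retrieve_gene_status"])   # finish
```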
D.6 ClustalW input
An example of the ClustalW input file:
<?xml version=’1.0’ encoding=’utf-8’?><inputparams><setid>142</setid><sequencetype>nucleotid</sequencetype><title>Sequence</title><topdiags></topdiags><alignment>full</alignment><window></window><gapext></gapext><outputtree></outputtree><output>aln1</output><tossgaps>true</tossgaps><ktup></ktup><kimura>true</kimura><matrix>blosum</matrix><scores>percent</scores><outorder>aligned</outorder><gapopen></gapopen><gapclose></gapclose><gapdist></gapdist><pairgap></pairgap>
</inputparams>
An example of the ClustalW output file:
<?xml version=’1.0’ encoding=’utf-8’?><output><title>Sequence</title><ebiid>clustalw-20060925-04170320</ebiid><file>clustalw-20060925-04170320.txt</file><file>clustalw-20060925-04170320.aln</file><file>clustalw-20060925-04170320.dnd</file>
</output>
D.7 Blast
An example blastn input:
<?xml version=’1.0’ encoding=’utf-8’?>
163
<inputparams>
  <expect>10</expect>
  <wordsize>11</wordsize>
  <matrix></matrix>
  <opengap></opengap>
  <extendgap></extendgap>
  <searchSetId>130</searchSetId>
  <searchSetType>gene</searchSetType>
  <searchSeqType>nucleotide</searchSeqType>
  <dbSetId>130</dbSetId>
  <dbSetType>gene</dbSetType>
  <dbSeqType>nucleotide</dbSeqType>
</inputparams>
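Input documents of this shape can be generated with a few lines of code. The following Python sketch mirrors the element names in the example above but is not the MoGServ serialization code:

```python
# Build an <inputparams> document from a dict of parameter values using
# the standard library's ElementTree.
import xml.etree.ElementTree as ET

def blastn_input(params):
    """Serialize blastn parameters to an <inputparams> XML string."""
    root = ET.Element("inputparams")
    for name, value in params.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = blastn_input({"expect": 10, "wordsize": 11,
                         "searchSetId": 130, "dbSetId": 130})
print(xml_text)
```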
D.8 PAUP
The result generated from the ClustalW program is converted to NEXUS format
from the web interface (see Figure B.12). The data conversion is done with a
service provided in the system. Here is a portion of the NEXUS file for all
ATP beta subunits.
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=27 NCHAR=1503;
FORMAT DATATYPE=DNA INTERLEAVE MISSING=-;

[Name: Saccharum1   Len: 1503  Check: 0]
[Name: Saccharum2   Len: 1503  Check: 0]
[Name: Zea_mays     Len: 1503  Check: 0]
[Name: Triticum_a   Len: 1503  Check: 0]
...

MATRIX
Saccharum1 ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGGTTTCCA CAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
Saccharum2 ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGGTTTCCA CAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
Zea_mays   ATGAGAACCAATCCTACTAC TTCGCGTCCCGGGATTTCCA CAATTGAAGAAAAAA----- -GCGTAGGGCGTATTGATCA AATTATTGGACCCGTGCTGG
...
Calycanthu TGA
Pinus_kora ---
Pinus_thun ---
Marchantia ---
Physcomitr ---
164
Anthoceros ---
Huperzia_l ---
;
END;
Here is the configuration file used for PAUP to generate a phylogenetic tree using the NEXUS file:
#NEXUS
begin paup;
    set autoclose=yes warntree=no warnreset=no;
    log start file=thisfile.log replace;
    execute atpb_27.nex;
    set criterion=distance;
    dset dist = hky85;
    showdist;
    nj;
    nj breakties = random;
    bootstrap nreps=100 brlens=yes keepall=yes search=heuristic;
    savetrees from=1 to=1 savebootp=both maxdecimals=0;
    contree all/strict=no file=thisfilename.tre replace showtree=yes;
end;
Figures D.1 and D.2 show the generated tree results.
165
Figure D.1. Phylogenetic tree generated by PAUP
166
Figure D.2. The phylogenetic tree file generated by PAUP can be viewed by other programs
167
APPENDIX E
SUPPLEMENTARY MATERIAL FOR CHAPTER 5 AND CHAPTER 6
This is a sample output of comparing two workflows using SUBDUE. The
inexact graph match program computes the cost of transforming the larger of the
input graphs into the smaller one according to predefined transformation costs.
The program returns this cost and the mapping of vertices in the larger
graph to vertices in the smaller graph. A smaller match cost represents a
higher structural similarity between the two workflows.
// Costs of various graph match transformations
#define INSERT_VERTEX_COST             1.0  // insert vertex
#define DELETE_VERTEX_COST             1.0  // delete vertex
#define SUBSTITUTE_VERTEX_LABEL_COST   1.0  // substitute vertex label
#define INSERT_EDGE_COST               1.0  // insert edge
#define INSERT_EDGE_WITH_VERTEX_COST   1.0  // insert edge with vertex
#define DELETE_EDGE_COST               1.0  // delete edge
#define DELETE_EDGE_WITH_VERTEX_COST   1.0  // delete edge with vertex
#define SUBSTITUTE_EDGE_LABEL_COST     1.0  // substitute edge label
#define SUBSTITUTE_EDGE_DIRECTION_COST 1.0  // change directedness of edge
#define REVERSE_EDGE_DIRECTION_COST    1.0  // change direction of directed edge

[xxiang1@localhost subdue-5.1.4]$ bin/gm graphs/graph1.g graphs/mytest1.g
Match Cost = 15.000000
Mapping (vertices of larger graph to smaller):
1 -> deleted
2 -> 3
3 -> 1
4 -> 2
5 -> deleted
6 -> deleted
7 -> 4
8 -> deleted
[xxiang1@localhost subdue-5.1.4]$
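A toy version of the inexact graph match can make the cost model concrete: try every mapping of the larger graph's vertices onto the smaller graph's vertices (or onto deletion) and keep the cheapest, with unit costs as in the listing above. Real SUBDUE searches this space heuristically; the exhaustive Python sketch below, with invented workflow graphs, only works for tiny graphs.

```python
# Brute-force inexact graph match with unit costs for vertex deletion,
# vertex label substitution, and edge insertion/deletion (including
# edges deleted along with a deleted vertex).
from itertools import permutations

def match_cost(large, small):
    """Each graph is (vertex_labels, edge_set): ({v: label}, {(u, v), ...})."""
    (lv, le), (sv, se) = large, small
    candidates = list(sv) + [None] * len(lv)        # None = vertex deleted
    best = float("inf")
    for perm in permutations(candidates, len(lv)):
        mapping = dict(zip(lv, perm))
        cost = 0.0
        for v, label in lv.items():
            target = mapping[v]
            if target is None:
                cost += 1.0                          # delete vertex
            elif sv[target] != label:
                cost += 1.0                          # substitute vertex label
        kept = set()
        for (u, v) in le:
            if mapping[u] is None or mapping[v] is None:
                cost += 1.0                          # delete edge with vertex
            else:
                kept.add((mapping[u], mapping[v]))
        cost += len(kept - se)                       # delete edge
        cost += len(se - kept)                       # insert edge
        best = min(best, cost)
    return best

g_small = ({1: "blast", 2: "clustalw"}, {(1, 2)})
g_large = ({"a": "blast", "b": "clustalw", "c": "paup"},
           {("a", "b"), ("b", "c")})
print(match_cost(g_large, g_small))   # 2.0 -- delete one vertex and its edge
```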
168
An example of the WSDL description for a service provided in MoGServ is shown
in Figure E.1. Figure E.2 shows the creation of a workflow using the Taverna
workbench, and Figure E.3 shows the corresponding XScufl format. Figure E.4
shows a sample data annotation in RDF format displayed with RDF Gravity1, and
Figure E.5 shows a sample service annotation in RDF format displayed with
RDF Gravity2.
1http://semweb.salzburgresearch.at/apps/rdf-gravity/download.html
2http://semweb.salzburgresearch.at/apps/rdf-gravity/download.html
169
Figure E.1. The WSDL description of the QueryLocal service hosted in MoGServ, which provides an operation to create a set in the local database. This operation accepts two parameters and returns the set id.
170
Figure E.2. One example of using the Taverna workbench to create, test, and run a workflow. This workflow accepts user input, searches the local database, creates a set, aligns the set using ClustalW, and converts the ClustalW result to NEXUS format, which can be fed to PAUP.
171
Figure E.3. The XScufl workflow format representing the workflow created using the Taverna workbench.
172
Figure E.4. Annotation of job and set information using the defined ontological model. The sample RDF file is displayed using RDF Gravity.
173
Figure E.5. Annotation of a service using the defined ontological model. The sample RDF file is displayed using RDF Gravity.
174
BIBLIOGRAPHY
1. S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403–10, 1990.
2. K. Amin, G. von Laszewski, M. Hategan, N. J. Zaluzec, S. Hampton, and A. Rossi. GridAnt: A client-controllable grid workflow system. In Proceedings of the 37th Hawaii International Conference on System Science, 2004.
3. Axis. Apache Axis, Apache Software Foundation. URL http://ws.apache.org/axis.
4. BEANSHELL. Lightweight scripts for Java. URL http://www.beanshell.org/.
5. K. A. Beiter and K. Ishii. Integration of producibility and product performance tools within a web-service environment. In ASME 2003 Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2003.
6. B. Benatallah, M. Dumas, Q. Z. Sheng, and A. H. Ngu. Declarative composition and peer-to-peer provisioning of dynamic web services. In Proceedings of the 18th International Conference on Data Engineering (ICDE'02), 2002.
7. T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May 2001.
8. T. Berners-Lee, W. Hall, J. Hendler, N. Shadbolt, and D. J. Weitzner. Creating a science of the web. Science, 313(5788):769–771, August 2006.
9. BIOWBI. Bioinformatic Workflow Builder Interface (BioWBI). URL http://www.alphaworks.ibm.com/tech/biowbi.
10. P. A. Bonatti and P. Festa. On optimal service selection. In Proceedings of the 14th International Conference on World Wide Web, 2005.
11. BPWS4J. The IBM Business Process Execution Language for Web Services Java run time. URL http://www.alphaworks.ibm.com/tech/bpws4j.
175
12. D. Buttler, M. Coleman, T. Critchlow, R. Fileto, W. Han, C. Pu, D. Rocco, and L. Xiong. Querying multiple bioinformatics information sources: Can semantic web research help? SIGMOD Record, 31(4):59–64, 2002.
13. M. Carman, L. Serafini, and P. Traverso. Web service composition as planning. In ICAPS 2003 Workshop on Planning for Web Services, 2003.
14. S. Carrere and J. Gouzy. Remora: a pilot in the ocean of BioMoby web-services. Bioinformatics, 22(7), 2006.
15. S. Christley, X. Xiang, and G. Madey. An ontology for agent-based modeling and simulation. In Agent 2004 Conference, 2004.
16. M. Clamp, J. Cuff, S. M. Searle, and G. J. Barton. The Jalview Java alignment editor. Bioinformatics, 20(3):426–7, 2004.
17. Collaxa. Collaxa BPEL server. URL http://www.collaxa.com/.
18. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32–41, 2000.
19. J. Day and R. Deters. Selecting the best web service. In Proceedings of the 2004 Conference of the Centre for Advanced Studies on Collaborative Research, pages 293–307, 2004.
20. R. de Knikker, Y. Guo, J. long Li, A. K. Kwan, K. Y. Yip, D. W. Cheung, and K.-H. Cheung. A web services choreography scenario for interoperating bioinformatics applications. BMC Bioinformatics, 5(25), 2004.
21. D. de Roure, N. R. Jennings, and N. Shadbolt. The semantic grid: Past, present and future. Proc. of the IEEE, 93(3), March 2005.
22. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1(1), 2003.
23. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, and M. Livny. Grid Computing, volume 3165/2004 of Lecture Notes in Computer Science, chapter Pegasus: Mapping Scientific Workflows onto the Grid, pages 11–20. Springer Berlin / Heidelberg, 2004.
24. L. A. Digiampietri, C. B. Medeiros, and J. C. Setubal. A framework based on web service orchestration for bioinformatics workflow management. Genetics and Molecular Research, 4(3):535–542, 2005.
176
25. A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learning to match ontologies on the semantic web. The VLDB Journal (The International Journal on Very Large Data Bases), 12, 2003.
26. A. Dogac, Y. Kabak, G. Laleci, S. Sinir, A. Yildiz, S. Kirbas, and Y. Gurcan. Semantically enriched web services for the travel industry. SIGMOD Record, 33(3), 2004.
27. S. D. Dyall, M. T. Brown, and P. J. Johnson. Ancient invasions: From endosymbionts to organelles. Science, 304(9), April 2004.
28. I. Elgedawy, Z. Tari, and M. Winikoff. Exact functional context matching for web services. In ICSOC'04, 2004.
29. V. Ermolayev, N. Keberle, O. Kononenko, S. Plaksin, and V. Terziyan. Towards a framework for agent-based semantic web service composition. International Journal of Web Service Research, 2004.
30. ETTK. Emerging technologies toolkit. URL http://www.alphaworks.ibm.com/tech/wssem.
31. N. M. Fast, J. C. Kissinger, D. S. Roos, and P. J. Keeling. Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids. Mol. Biol. Evol., 18(3):418–426, 2001.
32. I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Lecture Notes in Computer Science, 2150, 2001.
33. K. Garwood, P. Lord, H. Parkinson, N. Paton, and C. Goble. Pedro ontology services: A framework for rapid ontology markup. In Proc of 2nd European Semantic Web Conference, pages 578–591. Springer Verlag, 2005.
34. Y. Gil, E. Deelman, J. Blythe, C. Kesselman, and H. Tangmunarunkit. Artificial intelligence and grids: Workflow planning and beyond. IEEE Intelligent Systems, special issue on E-Science, Jan/Feb 2004.
35. GO. Gene ontology consortium. URL http://www.geneontology.org/.
36. C. Goble, C. Wroe, R. Stevens, and the myGrid consortium. The myGrid project: services, architecture and demonstrator. In UK e-Science AHM, September 2003.
37. A. Goderis, U. Sattler, P. Lord, and C. Goble. Seven bottlenecks to workflow reuse and repurposing. In Fourth International Semantic Web Conference (ISWC 2005), volume 3792, pages 323–337, Galway, Ireland, 2005.
38. A. Goderis, P. Li, and C. Goble. Workflow discovery: the problem, a case study from e-science and a graph-based solution. In IEEE International Conference on Web Services (ICWS'06), 2006.
39. GridSphere. Gridsphere portal framework. URL http://www.gridsphere.org/gridsphere/gridsphere?cid=2.
40. T. Gruber. What is an ontology? URL http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.
41. A. Gómez-Pérez, R. González-Cabero, and M. Lama. A framework for design and composition of semantic web services. American Association for Artificial Intelligence, 2004.
42. JLaunch. JLaunch from Duke bioinformatics shared resource. URL http://dbsr.duke.edu/.
43. B. Johansson and P. Krus. A web service approach for model integration in computational design. In ASME 2003 design engineering technical conferences and computers and information in engineering conference, 2003.
44. E. Kawas, M. Senger, and M. D. Wilkinson. BioMoby extensions to the Taverna workflow management and enactment software. BMC Bioinformatics, Nov. 2006.
45. M. Klein and A. Bernstein. Toward high-precision service retrieval. Internet Computing, 8(1):30–36, January/February 2004.
46. U. Kuter, E. Sirin, D. Nau, B. Parsia, and J. Hendler. Information gathering during planning for web service composition. In The third international semantic web conference (ISWC2004), Hiroshima, Japan, 2004.
47. L. Li and I. Horrocks. A software framework for matchmaking based on semantic web technology. In Proceedings of the 12th international conference on World Wide Web, 2003.
48. Y. Liu, A. H. Ngu, and L. Zeng. QoS computation and policing in dynamic web service selection. In WWW2004, 2004.
49. P. Lord, S. Bechhofer, M. Wilkinson, G. Schiltz, D. Gessler, D. Hull, C. Goble, and L. Stein. Applying semantic web services to bioinformatics: Experiences gained, lessons learnt. In Third International Semantic Web Conference (ISWC2004), 2004.
50. P. Lord, P. Alper, C. Wroe, and C. Goble. Feta: A light-weight architecture for user oriented semantic service discovery. In Proceedings of Second European Semantic Web Conference, ESWC 2005, pages 17–31. Springer-Verlag LNCS 3532, May-June 2005.
51. Lucene. Apache Lucene. URL http://lucene.apache.org/java/docs/index.html.
52. B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, Dec 2005.
53. E. M. Maximilien and M. P. Singh. Toward autonomic web services trust and selection. In ICSOC'04, 2004.
54. S. A. McIlraith, T. C. Son, and H. Zeng. Semantic web services. IEEE Intelligent Systems, pages 46–53, March/April 2001.
55. B. Medjahed, A. Bouguettaya, and A. K. Elmagarmid. Composing web services on the semantic web. The VLDB Journal, 2003.
56. E. Mena, V. Kashyap, A. Sheth, and A. Illarramendi. Observer: An approach for query processing in global information systems based on interoperation across pre-existing ontologies. In Intl. Conf. on Cooperative Information Systems (CoopIS 96), 1996.
57. F. Meyer. Genome sequencing vs. Moore's law: Cyber challenges for the next decade. CTWatch Quarterly, 2(3), August 2006.
58. N. Milanovic and M. Malek. Current solutions for web service composition. IEEE Internet Computing, 8(6):51–59, November/December 2004.
59. J. A. Miller and P. A. Fishwick. Investigating ontologies for simulation modeling. In The 37th Annual Simulation Symposium, April 2004.
60. M. G. Nanda, S. Chandra, and V. Sarkar. Decentralizing execution of composite web services. In OOPSLA'04, 2004.
61. NCBI. Entrez: Making use of its power. Briefings in bioinformatics, 4(2),June 2003. URL http://www.ncbi.nih.gov/.
62. N. F. Noy and M. A. Musen. Prompt: Algorithm and tool for automated ontology merging and alignment. In The proceedings of the National conference on artificial intelligence (AAAI), 2000.
63. OGSA. Links to open grid service architecture. URL http://www.globus.org/ogsa/.
64. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, A. Wipat, and P. Li. Taverna, lessons in creating a workflow environment for the life sciences. In GGF workflow workshop, 2004.
65. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17), 2004.
66. M. Ouzzani, B. Benatallah, and A. Bouguettaya. Ontological approach for information discovery in internet databases. Distributed and Parallel Databases, 8(3), 2000.
67. OWL. W3C OWL web ontology language overview. URL http://www.w3.org/TR/owl-features/.
68. R. D. M. Page. Treeview: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences, 12:357–358, 1996.
69. J. D. Palmer. The symbiotic birth and spread of plastids: How many times and whodunit? J. Phycol., 39, 2003.
70. M. P. Papazoglou and D. Georgakopoulos. Service-oriented computing. Communications of the ACM, 46(10), 2003.
71. A. Patil, S. Oundhakar, A. Sheth, and K. Verma. Meteor-s web service annotation framework. In Proceedings of the World Wide Web Conference, July 2004.
72. S. Pillai, V. Silventoinen, K. Kallio, M. Senger, S. Sobhany, J. Tate, S. Velankar, A. Golovin, K. Henrick, P. Rice, P. Stoehr, and R. Lopez. SOAP-based services provided by the European Bioinformatics Institute (EBI). Nucleic Acids Res, 33(1):W25–W28, 2005. URL http://www.ebi.ac.uk/Tools/webservices/WSClustalW.html.
73. U. Radetzki and A. B. Cremers. Iris: A framework for mediator-based composition of service-oriented software. In 2004 IEEE International Conference on Web Services (ICWS 2004), July 2004.
74. U. Radetzki, U. Leser, S. Schulze-Rauschenbach, J. Zimmermann, J. Lussem, T. Bode, and A. Cremers. Adapters, shims, and glue–service interoperability for in silico experiments. Bioinformatics, 22(9):1137–1143, 2006.
75. S. Ran. A model for web services discovery with QoS. ACM SIGecom Exchanges, 4(1), 2003.
76. J. Rao, D. Dimitrov, P. Hofmann, and N. Sadeh. A mixed initiative approach to semantic web service discovery and composition: SAP's guided procedures framework. In Proceedings of the IEEE International Conference on Web Services (ICWS'06), pages 401–410, 2006.
77. J. A. Raven and J. F. Allen. Genomics and chloroplast evolution: what did cyanobacteria do for plants? Genome Biology, 4, 2003.
78. J. Romero-Severson. Use case: How mog web services enable scientific discovery. Technical report, University of Notre Dame, August 2006.
79. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
80. M. Sabou, C. Wroe, C. Goble, and H. Stuckenschmidt. Learning domain ontologies for semantic web service descriptions. Journal of Web Semantics, 3(4), 2005. Accessible from: http://www.websemanticsjournal.org/ps/pub/2005-28.
81. SAWSDL. Semantic annotations for web services description language working group. URL http://www.w3.org/2002/ws/sawsdl/.
82. C. Schmidt and M. Parashar. A peer-to-peer approach to web service discovery. World Wide Web, 7(2), 2004.
83. Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(5), Sept 2005.
84. E. Sirin, J. Hendler, and B. Parsia. Semi-automatic composition of web services using semantic descriptions. In "Web Services: Modeling, Architecture and Infrastructure" workshop in conjunction with ICEIS2003, 2003.
85. K. Sivashanmugam, K. Verma, A. Sheth, and J. Miller. Adding semantics to web services standards. In Proceedings of the 1st International Conference on Web Services (ICWS'03), 2003.
86. K. Sivashanmugam, J. Miller, A. Sheth, and K. Verma. Framework for semantic web process composition. Special Issue of the International Journal of Electronic Commerce (IJEC), 2004.
87. SoapLab. SOAP-based analysis web service developed at the European Bioinformatics Institute (EBI). URL http://www.ebi.ac.uk/soaplab/.
88. SpeedR. URL http://lsdis.cs.uga.edu/proj/meteor/mwsdi.html.
89. N. Srinivasan, M. Paolucci, and K. Sycara. Semantic web service discovery in the OWL-S IDE. In Proceedings of the 39th Hawaii International Conference on System Sciences, 2006.
90. B. Srivastava and J. Koehler. Web service composition: current solutions and open problems. In ICAPS 2003, 2003.
91. L. Stein. Creating a bioinformatics nation. Nature, 417(9), 2002.
92. L. D. Stein. Integrating biological databases. Nature Reviews Genetics, 4, 2003.
93. R. Stevens. Trends in cyberinfrastructure for bioinformatics and computational biology. CTWatch Quarterly, 2(3), August 2006. URL http://www.ctwatch.org/quarterly/.
94. R. Stevens, K. Glover, C. Greenhalgh, C. Jennings, S. Pearce, P. Li, M. Radenkovic, and A. Wipat. Performing in silico experiments on the grid: A user's perspective. In Proc UK e-Science programme All Hands Conference, 2003.
95. J. W. Stiller and D. C. Reel. A single origin of plastids revisited: Convergent evolution in organellar genome content. J. Phycol, 39, 2003.
96. I. Taylor, M. Shields, I. Wang, and A. Harrison. Visual Grid Workflow in Triana. Journal of Grid Computing, 3(3-4):153–169, September 2005. URL http://www.springerlink.com/openurl.asp?genre=article&issn=1570-7873&volume=3&issue=3&spage=153.
97. The Globus Project. The globus project. URL http://www.globus.org.
98. W. van der Aalst. Don't go with the flow: Web services composition standards exposed. IEEE Intelligent Systems, Jan/Feb 2003.
99. Y. Wang and E. Stroulia. Semantic structure matching for assessing web-service similarity. In M. E. Orlowska, S. Weerawarana, M. P. Papazoglou,and J. Yang, editors, Service-Oriented Computing - ICSOC 2003, 2003.
100. R. Weber, C. Schuler, P. Neukomm, H. Schuldt, and H.-J. Schek. Web service composition with O'GRAPE and OSIRIS. In Proceedings of the 29th VLDB Conference, 2003.
101. M. D. Wilkinson and M. Links. BioMOBY: An open source biological web service proposal. Briefings in bioinformatics, 3(4), 2002.
102. WordNet. WordNet: A large lexical database of English, developed under the direction of George A. Miller. URL http://wordnet.princeton.edu/.
103. C. Wroe, R. Stevens, C. Goble, A. Roberts, and M. Greenwood. A suite of DAML+OIL ontologies to describe bioinformatics web services and data. International Journal of Cooperative Information Systems, 12(4):197–224, June 2003.
104. C. Wroe, C. Goble, A. Goderis, P. Lord, S. Miles, J. Papay, P. Alper, and L. Moreau. Recycling workflows and services through discovery and reuse. Concurrency and Computation: Practice and Experience, 2007.
105. WS. Web services architecture. URL http://www.w3.org/TR/ws-arch/#service_oriented_architecture. W3C Working Group Note 11 February 2004.
106. WsBAW. Bioinformatic analysis workflow (WsBAW). URL http://www.alphaworks.ibm.com/tech/wsbaw.
107. WSIF. Web services invocation framework (WSIF), Apache Software Foundation. URL http://ws.apache.org/wsif/.
108. X. Xiang and G. Madey. A semantic web services enabled web portal architecture. In International Conference on Web Services (ICWS2004), 2004.
109. X. Xiang and G. Madey. Improving the reuse of scientific workflows and their by-products. URL http://www.nd.edu/~mog/Papers/papers.html. Working paper, 2007.
110. X. Xiang, G. Madey, and J. Romero-Severson. A service-oriented data integration and analysis environment for in-silico experiments and bioinformatics research. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences (CD-ROM), January 2007.
111. J. Yang. Web service componentization. Communications of the ACM, October 2003.
112. X. Yi and K. J. Kochut. Process composition of web services with complex conversation protocols: A colored petri nets based approach. In Design, Analysis and Simulation of Distributed Systems (DASD 2004), 2004.
113. U. Zdun, M. Voelter, and M. Kircher. Design and implementation of an asynchronous invocation framework for web services. In The International Conference on Web Services - Europe 2003 (ICWS-Europe'03), 2003.
This document was prepared & typeset with pdfLaTeX, and formatted with the nddiss2ε class file (v1.0 [2004/06/15]) provided by Sameer Vijay.