myGrid Users Day
EMBRACE Tutorial, Finland
15th June 2006
myGRID Users DayJune 2006 1
Contributors
All teaching material is provided by the following members of the myGrid team:
Katy Wolstencroft, Pinar Alper, Duncan Hall, Matthew Gamble, Antoon Goderis
and Paul Fischer, and Georgina Moulton, NIBHI Bioinformatics Education and
Development Fellow.
Material published by The University of Manchester for this event is copyright The University of Manchester and may not be reproduced without permission. Copyright exists in all other original material published by staff of The University of Manchester and may belong to the author or to The University of Manchester depending on the circumstances of publication.
All Material © 2006, The University of Manchester
Today’s course in association with:
Development Team
(http://www.mygrid.org.uk)
University of Manchester
Northwest Institute for Bio-Health Informatics (NIBHI)
myGrid collaborators include:
Funded by:
Course Outline
09.00 Introduction
09.30 Installation of workbench
09.45 ‘Hands-on’ workshop
Adding new services
Finding and invoking a service
Finding and using workflows
Building a simple workflow
Stringing services together
Saving results
Defining output formats (a break is scheduled mid-morning during this session)
12.00 Advanced Exercises
BioMart
BioMoby services
Iteration
Control flow
12.30 Lunch
13.30 Feta Semantic Discovery
14.00 SHIM Services
Introduction and ‘hands-on’ session
15.00 Break
15.30 Case study: Building workflows for Environmental Genomics
16.15 Provenance management Demonstration
16.45 Q&A session
17.00 Finish
Background Information
Motivation
Bioinformatics data in the public domain is vast, heterogeneous and distributed throughout the world. Whilst public availability has enabled great advances in the field of bioinformatics, the heterogeneity and distribution mean that using these resources together can be problematic. Traditionally, solutions have involved bespoke Perl scripting and transferring large data sets to local machines, but these approaches are not practical on a large scale. Biological workflows provide a distributed computing solution for interoperation, enabling researchers to use public databases and analysis tools, with their associated computational resources, from their own desktop.
Aims
The myGrid training day will provide a foundation in building workflows in the biological domain using Taverna and associated services. Participants will gain an understanding of both the issues and practicalities of building workflows through ‘hands-on’ sessions and will gain sufficient knowledge in order to build their own workflows and use those developed by others.
At the end of today you will be able to use basic Grid features such as:
adding new services
finding and invoking a service
finding and reusing workflows
building new workflows
In addition, you will be introduced to more advanced features such as:
configuring BioMoby services
handling service failures
iterating over datasets
provenance management
service discovery using FETA
Useful References
myGRID website: http://www.mygrid.org.uk
TAVERNA project website: http://taverna.sourceforge.net/
This website includes News, Downloads, Documentation.
TAVERNA references
Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004 Nov 22;20(17):3045-54
R. D. Stevens, H. J. Tipney, C. J. Wroe, T. M. Oinn, M. Senger, P. W. Lord, C. A. Goble, A. Brass, and M. Tassabehji. Exploring Williams–Beuren syndrome using myGrid. Bioinformatics, Aug 2004; 20: i303 - i310.
Li, P, Wipat, A., Hayward, K., Jennings, C., Owen, K., Oinn, T., Stevens, R. and Pearce, S. Association of variations in NFKBIE with Graves’ disease using classical and myGrid methodologies
Contact Information
Dr Katy Wolstencroft [email protected]
Dr Georgina Moulton [email protected]
What is myGrid?
• myGrid middleware components to support in silico experiments in biology
• Taverna – user interface for myGrid. Enables the design and enactment of life science workflows
[Diagram labels: Taverna, Freefluo, Grimoire Registry, Event Notification, mIR, Pedro Annotation, Feta Discovery, Info. Model, Soaplab, Gowlab, BioNanny, Mediator, Portal, LSIDs, KAVE, DQP]
myGrid Partners
EPSRC funded UK eScience Program Pilot Project
Now an OMII-UK node
The Motivation
Bioinformatics is an open Community
• Open access to data
• Open access to resources
• Open access to tools
• Open access to applications
Global in silico biological research
The User Community Problems
• Everything is Distributed
– Data, Resources and Scientists
• Heterogeneous data
• Very few standards
– I/O formats, data representation, annotation
– Everything is a string!
Integration of data and interoperability of resources is difficult
Traditional Approach to Integration
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Pipeline Programming
• Advantages
– Repeatable
– Allows automation
– Quick, reliable, efficient
• Disadvantages
– Requires programming skills
– Difficult to modify
– Requires local tool and database installation
– Requires tool and database maintenance!!!
Initial User Requirements
• Need an automated, reliable replacement
• Needs to deal with the data heterogeneity and open world community
• Needs to deal with the distributed resources – provide access to third party services not owned or controlled by myGrid
• Needs to run on ordinary PC or laptop
• Needs to be easy to use by non-programmers
myGrid Approach - Workflows
General technique for describing and enacting a process: describes what you want to do, not how you want to do it
Simple language specifies how bioinformatics processes fit together – processes are web services
- High level workflow diagram separated from any lower level coding – therefore, you don’t have to be a coder to build workflows
[Diagram labels: RepeatMasker Web service, GenScan Web Service, Blast Web Service, Sequence, Predicted Genes out]
Taverna Workflow Components
Scufl: Simple Conceptual Unified Flow Language
Taverna: Writing, running workflows & examining results
SOAPLAB: Makes applications available
Freefluo: Workflow engine to run workflows
[Diagram labels: Freefluo; SOAPLAB Web Service, Any Application; Web Service, e.g. DDBJ BLAST]
Williams-Beuren Syndrome (WBS)
• Contiguous sporadic gene deletion disorder
• 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
• Haploinsufficiency of the region results in the phenotype
• Multisystem phenotype – muscular, nervous, circulatory systems
• Characteristic facial features
• Unique cognitive profile
• Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
• Outgoing personality, friendly nature, ‘charming’
Williams-Beuren Syndrome Microdeletion
[Figure: physical map of the WBS microdeletion. Chr 7 (~155 Mb), with the ~1.5 Mb region at 7q11.23. Gene labels: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14. Block labels: C-cen, C-mid, A-cen, B-mid, B-cen, A-mid, B-tel, A-tel, C-tel; Block A (STAG3, PMS2L), Block C (FKBP6T, POM121, NOLR1), Block B (GTF2IP, NCF1P, GTF2IRD2P). Other labels: WBS, SVAS, Patient deletions, CTA-315H11, CTB-51J22, ‘Gap’, Physical Map.]
Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
Filling a genomic gap in silico
• Identify new, overlapping sequence of interest
• Characterise the new sequence at nucleotide and amino acid level
- Frequently repeated – info rapidly added to public databases
- Time consuming and mundane
- Don’t always get results
- Huge amount of interrelated data is produced
The Williams Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
The Biological Results
CTA-315H11 CTB-51J22
ELN
WBSCR14
RP11-622P13 RP11-148M21 RP11-731K22
314,004bp extension
All nine known genes identified
(40/45 exons identified)
CLDN4
CLDN3
STX1A
WBSCR18
WBSCR21
WBSCR22
WBSCR24
WBSCR27
WBSCR28
Four workflow cycles totalling ~ 10 hours
The gap was correctly closed and all known features identified
Data Management
• Workflows can generate vast amounts of data. How can we manage and track it?
• Taverna workflow workbench & plugins
– Ensure automated recording
• Data AND metadata AND experiment provenance
• Semantic Web technologies (RDF, Ontologies)
– To store knowledge provenance
[Diagram: myGrid architectural framework. Labels: Taverna Workflow Workbench; OGSA-Distributed Query Processing; Results management; LSID; mIR; e-Science coordination; e-Science mediator; e-Science process patterns; e-Science events; Notification service; myGrid information model; Metadata & provenance management using semantics; KAVE; Legacy integration; Publication and Discovery using semantics; Feta; Pedro; Ontology; Portal & Application tools]
Today
Practical experience of the Taverna
workbench
• Finding services
• Building workflows
• Reusing workflows
Other myGrid components – shim
services, semantic discovery, provenance
management
Acknowledgements
Core
Matthew Addis, Nedim Alpdemir, Pinar Alper, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Matthew Gamble, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Daniele Turi, Anil Wipat, Paul Watson, Katy Wolstencroft and Chris Wroe.
Users
Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
Martin Szomszor, Duncan Hull, Jun Zhao, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
Robin McEntire (GSK)
“Hands-on” Basic Exercises
myGrid Users Day
15th June 2006
EMBRACE Tutorial, Finland
Dr. K. Wolstencroft and Dr. G. Moulton
Exercise 1 Installing the Workbench
• Download Taverna from http://taverna.sourceforge.net
– Windows or Linux
If you are using either a modern version of Windows (Win2k or WinXP, with XP preferred) or any form of Linux, Solaris etc., you should download the workbench zip file. Windows users can simply unzip Taverna and use it; Linux users will also need to install GraphViz (http://www.graphviz.org/, the appropriate rpm for your platform).
– Mac OSX
If you are using Mac OSX you should download the .dmg workbench file. Double-click to open the disk image and copy both components (Taverna and GraphViz) onto your hard disk to run the application.
• YOU WILL ALSO NEED a modern Java Runtime Environment (JRE) or Java Software Development Kit (SDK) from http://java.sun.com. We recommend 1.4.2 or 1.5.
Workbench Layout
• AME – Advanced Model Explorer
The Advanced Model Explorer (AME) is the primary
editing component within Taverna. Through it you can
load, save and edit any property of a workflow.
- enables building, loading, editing and saving workflows
Workflow Diagram Window
Visual representation of workflow
• Shows inputs / outputs, services and control
flows
• Enables saving of workflow diagrams for
publishing and sharing
Available Services Panel
Lists services available by default in Taverna
• ~ 3000 services
– Local java services
– Simple web services
– Soaplab services – legacy command-line application
– Gowlab services
– BioMart database services
– BioMoby services
Allows the user to add new services or
workflows from the web or from file systems
Exercise 2 Adding New Services
New services can be gathered from anywhere on the web
Go to http://taverna.sourceforge.net/webservices/
These services are not all included by default when Taverna opens.
• Scroll down the page to DDBJ services. You will see a list of
available DDBJ services. Click on the DDBJ blast service
(http://xml.nig.ac.jp/wsdl/Blast.wsdl) and copy the web page address
Exercise 2 Adding New Services
• Go to the ‘Available services’ panel and right-click on
‘Available Processors’. For each type of service, you are given
the option to add a new service, or set of services.
• Select ‘Add new wsdl scavenger’. A window will pop-up
asking for a web address
• Enter the Blast Web service address
• Scroll down to the bottom of the ‘Available Services’
panel and look at the new DDBJ service that is now
included.
Exercise 3
Finding and invoking a Service
Go to the ‘Available Services’ Panel
• Search for Fasta in the ‘search list’ box at the top of the
panel (we will start with simple sequence retrieval)
• You will see several services highlighted in red
• Scroll down to ‘Get Protein FASTA’
This service returns a Fasta sequence from a database if
you supply it with a sequence id
Exercise 3
Invoking a single service
• Right click on the ‘Get Protein FASTA’ service and select
‘Invoke service’
• In the pop-up ‘Run workflow’ window add a protein
sequence GI by right-clicking on ID and entering the
value in the box on the right
– A GI is a GenBank sequence identifier. You don’t need the ‘gi:’ prefix, just the
number; for example, the MAP kinase phosphatase sequence
‘GI:1220173’ would be entered as ‘1220173’
• Click ‘Run workflow’ and the service is invoked
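Identifiers copied from other tools often still carry the ‘GI:’ prefix. The following is a minimal Python sketch of stripping it before the value is typed into the input box. The function name `normalise_gi` is hypothetical and is not part of Taverna.

```python
def normalise_gi(value: str) -> str:
    """Return the numeric part of a GI, e.g. 'GI:1220173' -> '1220173'."""
    text = value.strip()
    # Drop an optional 'gi:' / 'GI:' prefix (hypothetical helper, not a Taverna API).
    if text.lower().startswith("gi:"):
        text = text[3:].strip()
    if not text.isdigit():
        raise ValueError(f"not a numeric GI: {value!r}")
    return text
```

The value returned for ‘GI:1220173’ is the bare number that the ‘Run workflow’ window expects.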
Exercise 3 View Results
• Click on ‘Results’
– The fasta sequence is displayed on the right when you
select ‘click to view’
• Click on ‘Process Report’
– Look at the processes. This shows the experiment provenance:
where and when processes were run
• Click on ‘Status’
– Look at the options. As workflows run, you can monitor their
progress here.
Exercise 3 - Conclusion
Invoking and running a single service covers the basic
steps behind any workflow. Processes are tracked and
results are generated in the same way, however
complicated a workflow becomes
In the next few exercises, we will look at some example
workflows and build some of our own from scratch
Exercise 4 –
Finding and using workflows
• Reset the workbench (top right of the AME)
• Select ‘Load’ from the top left of the AME. You will see a
selection of .xml files in an examples directory. These are workflow
definition files
• Select ‘ConvertedEMBOSSTutorial.xml’ and a pre-
defined workflow will be loaded
• View the workflow diagram - you will see services in
different colours
Exercise 4 Workflow Documentation
• Find out what the workflow does by reading the workflow
metadata
• In the AME – click on the ‘workflow model’ and then
select the ‘workflow metadata’ tab at the top of the AME. You will see a text description of the workflow, its author and its
unique LSID. When publishing workflows for others, this annotation
is useful for information and for allowing acknowledgement of IP.
Exercise 4 Workflow Features
• Run the workflow by selecting the ‘Tools and Workflow
Invocation’ tab at the top of the workbench and selecting
‘Run workflow’
• Watch the progress of the workflow in the ‘enactor
invocation’ window. As services complete, the enactor
reports the events. If a service fails, the enactor reports
this also
Loading workflows from the Web
• Go to the webpage http://www.cs.man.ac.uk/~katy/taverna
• Select ‘conditional_control.xml’ and copy the web address
• Go back to the taverna workbench and select ‘Load from web’
• Run the workflow
• You will see at least one of the services fail. These are conditionals
– fail if false or fail if true. These are useful operators for controlling
progress of your workflow based on intermediate results
• You will see black arrows and white circles – black arrows show the
flow of the data and white circles are control links.
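The behaviour of the two conditionals can be modelled in a few lines of Python. This is only a sketch of the idea, not Taverna code: the exception class, the branch names and the `run_branches` helper are all hypothetical. A branch runs only when the test that controls it succeeds, so for any input exactly one of the two tests fails.

```python
class ServiceFailure(Exception):
    """Raised when a step fails; steps that depend on it are then skipped."""


def fail_if_false(value: str) -> None:
    if value.strip().lower() != "true":
        raise ServiceFailure("fail_if_false: input was not 'true'")


def fail_if_true(value: str) -> None:
    if value.strip().lower() == "true":
        raise ServiceFailure("fail_if_true: input was 'true'")


def run_branches(condition: str) -> list:
    """Run each (hypothetical) branch only if its controlling test succeeded."""
    executed = []
    for test, branch in ((fail_if_false, "branch_when_true"),
                         (fail_if_true, "branch_when_false")):
        try:
            test(condition)
        except ServiceFailure:
            continue  # models a control link: the dependent branch does not run
        executed.append(branch)
    return executed
```

Calling `run_branches("true")` runs only the first branch, and `run_branches("false")` runs only the second.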
Exercise 5
Building a simple workflow from scratch
• Import the ‘Get Protein FASTA’ service into a new
workflow model. First, you will need to reset the workflow in the
AME, then find the ‘Get Protein Fasta’ service again in the ‘Available
services’ panel.
• Right-click on ‘Get Protein Fasta’ and import it into the
workbench by selecting ‘Add to Model’
• Go to the AME and expand the [+] next to the newly
imported ‘Get Protein Fasta’ service. You will see:
– 1 input (Green arrow pointing up)
– 1 output (purple arrow pointing down)
Exercise 5 Adding Input
• Define a new workflow input by right-clicking on
‘Workflow Input’ and selecting ‘create new Input’
• Supply a suitable name e.g. ‘geneIdentifier’
• Connect this new input to the ‘Get Protein Fasta’ service
by right-clicking on ‘geneIdentifier’ and selecting
‘getFasta ->id’
Always build workflows following the flow of data, from where it is produced to where it is consumed
Exercise 5 Adding output
• Define a new workflow output by right-clicking on
‘workflow output’ and selecting ‘create new output’
• Supply a suitable name e.g. ‘fastaSequence’
• Connect this new output to the ‘Get Protein Fasta’
service, remembering to build with the flow of data
You have now built a simple workflow from scratch!
• Run the workflow by selecting ‘run workflow’ from the
‘Tools and Workflow Invocation’ menu at the very top of
the workbench. You will again need to supply a GI – for later
exercises, please use a protein GI – e.g. 1220173
Exercise 6 Stringing Services Together
• We have used ‘Get Protein Fasta’ to retrieve a sequence
from the GenBank database. What can we do with a
sequence?
• Blast it?
• Find features and annotate it?
• Find GO annotations?
Exercise 6 Blast It
• Search for ‘blast’ in the ‘Available Services’ panel. Again
you will see several services highlighted in red
• Scroll down the list until you find the DDBJ Blast service
we added earlier
• Select the ‘Search Simple’ service and add it to the
model
• In the AME expand the [+] for the ‘search simple’ service
and view the input/output parameters
Exercise 6 Blast it
• This time, you will see three inputs and two outputs. For the workflow to run, each input must be defined. If there are multiple outputs, a workflow will usually run if at least one output is defined.
• Create an output called ‘blast_report’ in the same way we did before
• The sequence input for the Blast will be the output from the ‘Get Protein Fasta’ service. Connect the two together, from ‘Get Protein Fasta Output Text’ to ‘searchsimple query’
• Create two more inputs called ‘database’ and ‘program’ and connect them to the ‘database’ and ‘program’ inputs on the ‘search simple’ service
Exercise 6 Blast it
• Once more select ‘run workflow’ from the ‘Tools and
Workflow Invocation’ menu. You will see a run workflow
window asking for 3 input values
• Insert a GI (e.g. 1220173), a program (blastp for protein-
protein blast), and a database, e.g. SWISS (for
swissprot)
• Click ‘run workflow’. This time you will see a blast report
and a fasta sequence as a result
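The data flow built in this exercise can be pictured as two functions chained together. The Python sketch below is a model only. Both service functions are stubs that return placeholder text instead of calling the remote services, and `run_workflow` is a hypothetical name. It also shows why every workflow input must be supplied before the run starts.

```python
def get_protein_fasta(gi: str) -> str:
    """Stub for 'Get Protein FASTA'; a real run would call the remote service."""
    return f">gi|{gi}\nPLACEHOLDERSEQUENCE"


def search_simple(program: str, database: str, query: str) -> str:
    """Stub for the DDBJ Blast 'search simple' service."""
    return f"{program} search of {database} with {len(query)} characters of query"


def run_workflow(inputs: dict) -> dict:
    # Every workflow input must be defined before the workflow can run.
    required = ("geneIdentifier", "program", "database")
    missing = [name for name in required if name not in inputs]
    if missing:
        raise ValueError(f"undefined workflow inputs: {missing}")
    # Data flows from the FASTA retrieval into the Blast query.
    fasta = get_protein_fasta(inputs["geneIdentifier"])
    report = search_simple(inputs["program"], inputs["database"], fasta)
    return {"fastaSequence": fasta, "blast_report": report}
```

A call such as `run_workflow({"geneIdentifier": "1220173", "program": "blastp", "database": "SWISS"})` returns both workflow outputs, mirroring the two results shown in the workbench.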
Exercise 6 Blast it
• For parameters that do not change often, you will not
wish to always type them in as input. In this example, the
database and blast program may only change
occasionally, so there is an alternative way of defining
them.
• Go back to the AME and remove the ‘database’ and
‘program’ inputs by right-clicking and selecting ‘remove
from model’
Exercise 6 String Constants
• Select ‘string constant’ from ‘Available Services’
• Right-click and select ‘add to model with name…’
• Insert ‘program’ in the pop-up window
• Select ‘string constant’ for a second time and repeat for
a string constant named ‘database’
• In the AME, right-click on ‘program’ and select ‘edit me’
• Edit the text to ‘blastp’. Repeat for ‘database’ and enter
‘SWISS’ for the swissprot database
• Run the workflow – it runs in the same way
• Save the workflow by selecting the ‘save’ icon at the top
of the AME.
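Replacing workflow inputs with string constants is similar to binding some arguments of a function in advance. The sketch below uses Python's `functools.partial` to illustrate this. The `search_simple` function is a stub standing in for the Blast service, not the real call.

```python
from functools import partial


def search_simple(program, database, query):
    """Stub standing in for the Blast service call; returns its arguments."""
    return (program, database, query)


# Bind the rarely changing parameters once, like the two string constants.
blast_swissprot = partial(search_simple, "blastp", "SWISS")

# Only the sequence still has to be supplied at run time.
result = blast_swissprot(">example\nSEQ")
```

As in the workbench, the only value left to supply at run time is the one that actually changes between runs.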
Exercise 7 Protein Annotation
How can we use Taverna to annotate our protein with
function descriptions?
• In the ‘available services’ panel, find the emboss soaplab
services and find the ‘protein_motifs’ section
Hint: use the simple text search at the top of the panel
• Find out which of these services enable searching of the
Prosite and Prints databases by fetching the service
descriptions. To do this right-click on ‘protein_motifs’ and
select ‘fetch descriptions’
• Import both services into the workflow model.
Exercise 7 Protein Annotation
• Connect these services up to the workflow so that you
can find prints and prosite matches in the query
sequence returned from ‘Get Protein Fasta’ – you will
see that soaplab services have many input values
Soaplab services have many input parameters, but many have
default values so may not always need to be altered. In this case,
you can run the services by simply adding the query sequence. Go
to the EMBOSS home page to find out which input(s) relate to the
query sequence.
This extra searching is impractical – the Feta Semantic Discovery
tool is designed to combat this problem (There will be a Feta talk
later in the day)
Exercise 7 Protein Annotation
• Run the workflow – now you have blast results and
protein domain/motif matches
• How else can you annotate your protein? As an
advanced exercise, you might want to search for other
ways of characterising your sequence e.g. structural
elements, GO annotation?
Saving Results
Taverna provides several options for saving data.
1. Individual data items can be saved by right-clicking on
them
2. All data can be saved to disk
3. Textual/tabular data can be saved to excel
• Save all the data from your workflow
“Hands-on” Advanced Exercises
Advanced Exercises
The previous exercises have covered the basics of
myGrid workflows. The following demos and exercises
cover more advanced features, such as rendering
output, configuring BioMart services, dealing with service
failure and iterating over datasets. You may not reach
the end of these exercises, but they will provide some
examples to take home
Exercise 8 Defining Output Formats
So far, most of the outputs we have seen have been
text, but in bioinformatics, we often want to view a graph,
a 3D structure, an alignment etc. Taverna is able to
display results using a specific type of renderer if the
workflow output is configured correctly.
• Reset the workbench and load
‘convertedEMBOSSTutorial’ from the ‘examples’
directory
• Look at the workflow diagram and read the workflow
metadata to find out what the workflow does
• Run the workflow
Exercise 8 Defining Output Format
• Look at the results. For ‘tmapPlot’ and ‘outputPlot’, you will see
the results are displayed graphically. This is achieved by specifying
a particular mime type in the output.
• Go back to the AME and look at the metadata for
‘tmapPlot’ and ‘outputPlot’.
• Select MIME Types. As you can see, each has the image/png
mime type associated with it. If you wish to render results in
anything other than plain text, you MUST specify the mime-type in
the workflow output
Exercise 8 Taverna MIME-Types
The following mime-types are currently used by Taverna:
text/plain=Plain Text
text/xml=XML Text
text/html=HTML Text
text/rtf=Rich Text Format
text/x-graphviz=Graphviz Dot File
image/png=PNG Image
image/jpeg=JPEG Image
image/gif=GIF Image
application/zip=Zip File
chemical/x-swissprot=SWISSPROT Flat File
chemical/x-embl-dl-nucleotide=EMBL Flat File
chemical/x-ppd=PPD File
chemical/seq-aa-genpept=Genpept Protein
chemical/seq-na-genbank=Genbank Nucleotide
chemical/x-pdb=Protein Data Bank Flat File
chemical/x-mdl-molfile
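The listing above has a simple ‘type=description’ shape, so it can be read into a lookup table. The Python sketch below parses a few of the entries. The function names are hypothetical, and an entry without a description (such as ‘chemical/x-mdl-molfile’) is stored with `None` and not given an invented label.

```python
MIME_LISTING = """\
text/plain=Plain Text
image/png=PNG Image
chemical/x-pdb=Protein Data Bank Flat File
chemical/x-mdl-molfile
"""


def parse_mime_listing(text: str) -> dict:
    """Map each MIME type to its description, or None when none is given."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        mime, _, label = line.partition("=")
        table[mime] = label or None
    return table


def major_type(mime: str) -> str:
    """Return the part before the slash, e.g. 'chemical' for 'chemical/x-pdb'."""
    return mime.split("/", 1)[0]


MIME_TABLE = parse_mime_listing(MIME_LISTING)
```

Grouping by `major_type` separates the ‘text/’, ‘image/’ and ‘chemical/’ families used in the list.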
Exercise 8 Taverna MIME types(2)
The ‘chemical/’ mime-types are rendered using SeqVista
to view formatted sequence data
• Reset the workbench and load ‘seqVistaRendering’ from
the ‘examples’ directory for a demo
The chemical/x-pdb can be used to view rotating 3D
protein images
Advanced Features
• Spotlight on BioMart
• BioMoby Services
• Iteration
• Control Flow
• Substituting Services and fault tolerance
Spotlight on Biomart
Biomart enables the retrieval of large amounts of
genomic data, e.g. from Ensembl and Sanger, as well as
Uniprot and MSD datasets
• After saving any workflows you want to keep, reset the
workbench in the AME
• Load the workflow ‘BiomartAndEMBOSSAnalysis.xml’
from the ‘examples’ directory
• Run the Workflow
Spotlight on Biomart
This workflow starts by fetching all gene IDs from Ensembl corresponding to human genes on chromosome 22 implicated in known diseases and with homologous genes in rat and mouse.
For each of these gene IDs it fetches the 200bp after the five prime end of the genomic sequence in each organism and performs a multiple alignment of the sequences using the EMBOSS tool 'emma' (a wrapper around ClustalW). It then returns PNG images of the multiple alignment along with three columns containing the human, rat and mouse gene IDs used in each case.
Configuring Biomart
• Right-click on the service and select ‘configure bioMart
query’
• By selecting ‘filters’ – change the chromosome from 22
to 21 – now the workflow will retrieve all disease genes
from chromosome 21 with rat and mouse homologues
• Run the workflow and look at the results
• See how the ‘disease gene’ filter was configured and the
‘sequence exports’ were configured on the other Biomart
queries for mouse and rat
Adding Extra Information
Find out which known diseases are associated with the genes on your
chosen chromosome by adding a new Biomart query
process
• Select ‘hsapiens_gene_ensembl’ from the available
services panel and select ‘invoke with name….’ (as there
is already a service with that name!)
• Call the service ‘hsapiens_disease’
• Configure ‘hsapiens_disease’ by selecting an ‘ensembl
gene IDs’ filter under the ‘gene’ tab
• Configure the output attribute ‘disease description’ under
the ‘gene’ tab in the attributes section
Adding Extra Information
• Connect the input to the ‘hsapiens_gene_ensembl’
service via the ‘gene_stable_id’
• Create a new workflow output for the
‘disease_description’ output
• Re-run the workflow and view which diseases are
associated with your chromosome
Spotlight on BioMoby
The process of adding a BioMoby service is different
from other services. BioMoby services need to be
defined using terms from the Moby Object ontology
• Reset the workflow and load the ‘blast-biomoby.xml’
workflow from
http://www.cs.man.ac.uk/~katy/taverna/
Spotlight on BioMoby
• Run the workflow and look at the results
As the workflow name suggests, a blast search is
performed on a sequence
• Look at the workflow diagram
Instead of simply giving the blast service a fasta
sequence, a ‘GenericSequence’ object is defined.
• Look at the inputs for ‘GenericSequence’
• Read the metadata for the ‘GenericSequence’
object in the AME window
Spotlight on BioMoby
The GenericSequence object is defined by
1. The sequence (as a plain string)
2. The length of the sequence (as an integer)
3. A unique identifier for the sequence
4. The namespace in which that identifier belongs
These extra definitions take time for the user to define,
but they have other advantages
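The four parts of the object can be pictured as a small record type. The Python dataclass below is only an illustration of the structure described above. It is not the BioMoby API, and the field names and the `make_generic_sequence` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericSequence:
    sequence: str    # the sequence as a plain string
    length: int      # the length of the sequence as an integer
    identifier: str  # a unique identifier for the sequence
    namespace: str   # the namespace in which that identifier belongs

    def __post_init__(self):
        # Guard against a length that disagrees with the sequence itself.
        if self.length != len(self.sequence):
            raise ValueError("length does not match sequence")


def make_generic_sequence(sequence: str, identifier: str, namespace: str) -> GenericSequence:
    """Build the record, deriving the length so it cannot be mistyped."""
    return GenericSequence(sequence, len(sequence), identifier, namespace)
```

Deriving the length from the sequence removes one of the values a user would otherwise have to type by hand.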
Spotlight on BioMoby
• Right-click on the ‘GenericSequence’ object in the AME
and select ‘Moby Object Details’
• A pop-up window will show you what BioMoby services a
‘GenericSequence’ is produced by and what services it
can feed into
• Right-click on the ‘MPIBlastBetterE13’ service and select
‘Moby Object Details’. This tells you what the service
requires as inputs and what it produces as output
Spotlight on BioMoby
• The BioMoby services are annotated using terms from
the Moby ontology to enable semantic searching for
services.
• BioMoby services are specialist kinds of service from a
closed community. The object model, ontology and
annotations have been agreed by the BioMoby service
providers.
• Semantic discovery queries over other myGrid services
will also be possible in the near future using the myGrid
ontology and the Feta semantic discovery component.
This will be covered in detail in a later session.
Iteration
Taverna has an implicit iteration framework. If you
connect a set of data objects (for example, a set of fasta
sequences) to a process that expects a single data item
at a time, the process will iterate over each sequence
• Reload the ‘Probeset_id_2_swissprot_id.xml’ from the
previous web directory and run it
• Watch the progress report. You will see several services
with ‘Invoking with Iteration’
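The idea of implicit iteration can be sketched in Python as follows. This is a model of the behaviour described above, not Taverna's implementation. The `invoke` helper and the toy `sequence_length` service are hypothetical.

```python
def invoke(service, data):
    """Call the service once for a single item, or once per item for a list."""
    if isinstance(data, list):
        return [invoke(service, item) for item in data]
    return service(data)


def sequence_length(fasta: str) -> int:
    """Toy single-item service: count residues, ignoring the header line."""
    lines = fasta.splitlines()
    return sum(len(line) for line in lines[1:])
```

Passing a single FASTA string gives one result, while passing a list of FASTA strings gives a list of results in the same order, without any change to the service itself.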
Iteration
The user can also specify more complex iteration
strategies using the service metadata tag
• Reset the workflow and load the
‘IterationStrategyExample.xml’
• Read the workflow metadata to find out what the
workflow does
• Select the ‘ColourAnimals’ service and read the
metadata for that service. Under the description is the
iteration strategy
• Click on ‘dot product’. This allows you to switch to cross
product
Iteration
• Run the workflow twice – once with ‘dot product’ and
once with ‘cross product’.
• Save the first results so you can compare them – what is
the difference? What does it mean to specify dot or cross
product?
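One way to think about the two strategies is in terms of how two input lists are paired. The Python sketch below uses `zip` for a dot product and `itertools.product` for a cross product. The example lists are made up for illustration and are not taken from the example workflow.

```python
from itertools import product


def dot_product(xs, ys):
    """Pair items position by position: first with first, second with second."""
    return list(zip(xs, ys))


def cross_product(xs, ys):
    """Pair every item of xs with every item of ys."""
    return list(product(xs, ys))


# Hypothetical inputs, chosen only to illustrate the pairing.
colours = ["red", "green"]
animals = ["cat", "dog"]

dot_pairs = dot_product(colours, animals)
cross_pairs = cross_product(colours, animals)
```

With two items in each list, the dot product yields two pairs and the cross product yields four. Note that Python's `zip` stops at the shorter list. That is a property of this sketch and says nothing about how Taverna treats lists of unequal length.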
Substituting Services and Fault Tolerance
Taverna does not own many of the bioinformatics
services it provides. This means that it cannot control
their reliability. Instead, Taverna provides strategies for
dealing with services being unavailable
• Reload the ‘convertedEMBOSSTutorial.xml’ from the
‘examples’ directory.
• Look at the metadata for the ‘emma’ service. It is an
implementation of clustalw
• Find the DDBJ clustalw service
Substituting Services
• Right-click on the ‘analyzeSimple’ part of DDBJ clustalw
service and select ‘add as alternate’
• In the resulting menu select ‘emma’
• The DDBJ version of the clustalw service is now added
as an alternative to emma in the AME. It will be called
‘alternate1’
• Select ‘alternate1’ and look at the inputs and outputs.
These need to be mapped to the correct inputs and
outputs in emma
Substituting Services
• Right-click on the ‘query’ input in alternate1 and map it to
‘sequence_direct_data’. In both services, these inputs
expect a set of fasta sequences.
• Right-click on the ‘result’ output and map it to ‘outseq’ in
emma in the same way.
• Now you have a workflow which will run using emma
when it is available, but will substitute DDBJ clustalw
for it if emma fails!
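The behaviour of an alternate can be pictured as a simple fall-back loop. The Python below is a minimal sketch under stated assumptions: `emma` and `ddbj_clustalw` are stand-in functions written for this example (they do not call the real services), and `invoke_with_alternates` is a hypothetical helper, not part of Taverna.

```python
def invoke_with_alternates(primary, alternates, data):
    """Try the primary service first, then each alternate in order."""
    errors = []
    for service in [primary, *alternates]:
        try:
            return service(data)
        except Exception as exc:  # any failure moves on to the next alternate
            errors.append(exc)
    raise RuntimeError(f"all services failed: {errors}")


def emma(sequences):
    # Stand-in for the primary service; pretend it is unavailable.
    raise ConnectionError("emma is unavailable")


def ddbj_clustalw(sequences):
    # Stand-in for the alternate; returns a placeholder, not a real alignment.
    return f"alignment of {len(sequences)} sequences"


result = invoke_with_alternates(emma, [ddbj_clustalw], [">a\nACGT", ">b\nACGA"])
print(result)
```

Mapping the inputs and outputs, as done in the steps above, is what lets the alternate be called with the same data as the primary service.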
Fault Tolerance
Taverna also allows the user to specify the number of
times a service is retried before it is considered to have
failed. Sometimes network traffic is heavy, so a working service
needs to be retried
• Select ‘tmap’ from the same workflow. To the right of the
service name are a series of 0s and 1s. By simply typing the
numbers, the user can specify the number of retries and the time
between the retries
• Change it to 3 retries for ‘tmap’ and set the status to
‘critical’ using the final tickbox. Because the service is now critical,
the whole workflow will be aborted if ‘tmap’ still fails after 3 retries. Failures
in non-critical services will not abort the workflow run.
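The retry and ‘critical’ settings can be sketched as follows. This is an illustration only: `run_with_retries` and `flaky_tmap` are hypothetical names, and the sketch assumes that "3 retries" means one initial attempt plus up to three further attempts.

```python
import time


def run_with_retries(service, data, retries=3, delay=0.0, critical=False):
    """Call `service`, retrying after failures; abort only if it is critical."""
    attempts = retries + 1  # assumption: initial attempt plus `retries` retries
    for attempt in range(attempts):
        try:
            return service(data)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(delay)  # wait before the next retry
    if critical:
        raise RuntimeError("critical service failed; aborting workflow")
    return None  # a non-critical failure does not abort the run


calls = {"count": 0}


def flaky_tmap(sequence):
    # Stand-in service that fails twice (e.g. heavy network traffic), then works.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("no response")
    return "tmap result"


outcome = run_with_retries(flaky_tmap, "MKT...", retries=3, critical=True)
print(outcome, "after", calls["count"], "calls")
```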
FETA Semantic Discovery
Semantic Service Discovery
with Feta
Pinar ALPER
University of Manchester
Outline
• Problem Definition
• myGrid Domain Ontology
• Discovery Dimensions
– Demo Discovery
– How does discovery work – Architectural View
• Annotation & Publication Process
– Demo Annotation
– How does Ann. & Pub. work – Architectural View
• Futures
Problem Characteristics
• Large # of services, 1800+
• No real description of capabilities
• A common abstraction “Processor”
• Users do the selection
myGrid Ontology
• Aims to capture domain knowledge
– Adopts the common abstraction
– Defines a set of dimensions with which service
capabilities can be defined
– Defines domain types for objects passed around in a
coarse grained manner
• Two forms exist
– OWL used for development
– RDF(S): read-only exports used during publication and discovery
• Similar to GO in terms of expressivity; used in Feta
myGrid Ontology (Closer Look)
[Figure: structure of the myGrid ontology. Component ontologies: Upper level, Task, Informatics, Molecular Biology, Publishing, Organisation, Bioinformatics and Web Service, linked by “Specialises” and “Contributes to” relations. Example concepts: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature. Example services: Similarity Search Service, BLAST service, BLASTp service, InterProScan service.]
Questions we can ask
• Dimensions for service capabilities
• Find me services that,
– Accepts input of domain type X or more general
• DNA_sequence would include biological_sequence
• E.g. BLAST service,
– Produces output of domain type Y or more specific
– Performs task T or more specific
• Alignment would include local_alignment
• E.g. BLAST, FASTA
Questions we can ask (contd.)
– Uses resource R or more specific
• Databases such as GENBANK, dbEST Database
• E.g. distinguish between BLAST over different DBs
– Uses method M or more specific
• Alignment algorithm will include Smith_Watermann algorithm
– Is part of Application A or more specific
• EMBOSS
• NCBI BLAST directly or BLAST through EMBOSS
– Uses Resources with content of
• Human Genome, Mouse Genome..
• DEMO
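The "or more general" and "or more specific" parts of these questions can be illustrated with a tiny hierarchy. The Python below is a sketch under stated assumptions, not Feta's implementation: the parent links, the three service descriptions and the helper names are all invented for this example, loosely based on the terms shown on the slides.

```python
# Assumed fragment of a type/task hierarchy: child -> more general parent.
PARENT = {
    "DNA_sequence": "nucleotide_sequence",
    "nucleotide_sequence": "biological_sequence",
    "protein_sequence": "biological_sequence",
    "biological_sequence": "sequence",
    "local_alignment": "alignment",
}

# Hypothetical service descriptions, for illustration only.
SERVICES = {
    "BLAST service": {"input": "biological_sequence", "task": "local_alignment"},
    "translate service": {"input": "DNA_sequence", "task": "translation"},
    "protein stats service": {"input": "protein_sequence", "task": "analysis"},
}


def ancestors(term):
    """Return the term followed by all of its more general terms."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def accepts_input(query_type):
    # Services whose input is the query type or something more general.
    general = set(ancestors(query_type))
    return sorted(name for name, d in SERVICES.items() if d["input"] in general)


def performs_task(query_task):
    # Services whose task is the query task or something more specific.
    return sorted(
        name for name, d in SERVICES.items() if query_task in ancestors(d["task"])
    )


print(accepts_input("DNA_sequence"))
print(performs_task("alignment"))
```

Here a query for services accepting a DNA_sequence also finds the service annotated with the more general biological_sequence, and a query for the task alignment finds the service annotated with local_alignment.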
Feta Architecture (discovery focused)
[Figure: Feta architecture, discovery view. Components: User, Taverna Workbench with Feta GUI Client, Feta Engine, Service Classification (in RDF(S)), Feta Descriptions. Labelled interactions: Semantic Discovery, Obtain Classification, Obtain descriptions.]
Generating/Publishing Feta Descriptions
• Feta Import - Not starting from scratch
– Import supported for all processors
– Comprehensive Import for
• BioMoby
• WSDL
• Soaplab
• Scufl processors
• Pedro Annotate – Bringing in ontology terms to the desc.
– Partly populated description is completed
– Ontology is used to mark-up
• Publish description
– Publish to Web accessible locations
• DEMO
Feta Architecture (publication focused)
[Figure: Feta architecture, publication view. Components: Service Provider, Taverna Workbench, Feta Importer (WSDL, SCUFL, SOAPLAB), Pedro Annotator, Feta Engine, Service Classification (in RDF(S)), Feta Descriptions. Labelled interactions: Publish service/workflow, Scavenge, Import, Interface Descriptions, Annotate, Access Classification, Publish Description, Publish, Notify service.]
Feta Futures
• Feedback is very valuable to us.
– Coverage of domain ontology
– Dimensions of search
– Modes of search
– Coverage of Feta Model
Access to Feta
• Included in myGrid 0.6 beta 2 release
– http://sourceforge.net/projects/mygrid-uk
– cvs.mygrid.org.uk/
• Included in Taverna Workbench 1.3 release
• User Guide within Taverna User Manual
• Contact [email protected]
Shim Services
Taverna / Web Services
• Taverna is a good way to access lots of
globally distributed services out there on
the Web
• This saves you the considerable trouble
of installing software (e.g. databases, tools
etc.) on your personal machine or
laboratory server
Lots of services…
• Last count, around 1,800 services
• Provided by third parties:
– EMBOSS / SOAPlab - UK
– BioMOBY - Canada/Germany/Australia etc
– BioMART / EBI - UK
– NCBI and PathPort - US
– KEGG - Japan
– SeqHound / BIND - Canada
– Probably more to come
But there’s a price to pay
• Lots of services don’t match: A into B doesn’t go
• Lots of services are poorly described
e.g. 1
[Figure: shim example 1. Labels: UniProt record, protein_sequence, hasPart, Mismatch, Shim, Match.]
e.g. 2: syntax
[Figure: shim example 2. Labels: DNA fasta, DNA embl, Mismatch, Shim, Match.]
e.g. 3 (today)
• Shim service removes redundant GO identifiers from
the list produced by GetXChromFunctions
What if the service doesn’t exist?
• If you’re lucky someone will have already created a
service that does what you want and will have
described/annotated it in a way that lets you find it
• … if no service exists, one way to solve the problem is to
create a lightweight BeanShell script, a bit like JavaScript,
that runs on your local machine (not on a server like the
services do)
• These come in two flavours: ready-made “Local Java
Widgets” and roll-your-own “Local Services”. See
Exercise…
Shim Services
This exercise highlights the services that do not perform
biological functions, but are vital for running life science
workflows
Finding Genes
• Load the workflow entitled genscan_shim_example.xml
from the page http://www.cs.man.ac.uk/~katy/taverna
• Look at the workflow metadata – what does the workflow
do?
• Run the workflow.
• For an input file, load example_input.txt from the same
web page
What happens? Did all the services return results?
Why did some fail?
Finding Genes
• Load the workflow entitled genscan_shim_example2.xml
from the page http://www.cs.man.ac.uk/~katy/taverna
• Look at the workflow metadata – what does the workflow
do? How is it different from the previous one?
• Run the workflow (using the same input) – what happens
this time?
Genscansplitter is a shim service – it performs no
biological function, it simply parses a results file.
Other Shims
• There are many myGrid shim services. These are
currently being described in a shim library, but for now, a
small collection are documented here
http://www.cs.man.ac.uk/~hulld/shims.html
From the list,
• Find a shim that will return a GenBank DNA file from an
id. Load the example workflow and run it in Taverna
• Find a shim that will translate DNA
Other Shims
• Reload the Probeset_id_2_Swissport_id.xml workflow from
http://www.cs.man.ac.uk/~katy/taverna/
This workflow contains several shims. The ‘split_by_regex’ is a
commonly useful shim. If a service produces a single string, that
string may actually contain several individual results.
• Find out how the string is split in this example
Another shim example is the beanshell script.
• Right-click on the brown ‘parse_swiss’ beanshell service and select
‘configure beanshell’. See if you can work out what it does
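What a split-by-regex shim does can be sketched in a few lines of Python. This is an illustration only: the identifiers in `single_string` are made up, and the newline pattern is an assumption, not necessarily the expression used in the example workflow.

```python
import re


def split_by_regex(text, pattern):
    """Split one string into a list of items, dropping empty pieces."""
    return [part for part in re.split(pattern, text) if part]


# A service returned one string that really holds several results.
single_string = "P12345\nQ67890\nO11111\n"  # made-up identifiers
ids = split_by_regex(single_string, r"\n")
print(ids)
```

Once the string is a list, a downstream service that expects one identifier at a time can be run over each item in turn.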
Other Shims
• The EMBOSS suite of programs has a subdivision – edit
• All the edit services are shims
• Experiment with the edit services
• Find a service that will remove gaps from sequences
Case Study: Building Workflows for Environmental Genomics
Aims
During the course so far you have used Taverna to build workflows in a detailed step-by-
step process; however when it comes to building your own workflows from scratch it can
be a daunting process. In this exercise you will build a simple biological workflow in the
environmental genomics discipline. The aim is to build on your understanding from
previous exercises and to show the importance of the reuse of (sub)workflows – after all
there is no point re-inventing the wheel!
Scenario
Chen K, Pachter L. Bioinformatics for whole-genome shotgun sequencing of microbial
communities. PLoS Comput Biol. Jul;1(2):106-12 (2005).
Baliga,N.S., Bonneau,R., Facciotti,M.T., Pan,M., Glusman,G., Deutsch,E.W.,
Shannon,P., Chiu,Y., Weng,R.S., Gan,R.R., Hung,P., Date,S.V., Marcotte,E., Hood,L.
and Ng,W.V. Genome sequence of Haloarcula marismortui: A halophilic archaeon from
the Dead Sea Genome Res. 14 (11), 2221-2234 (2004)
The genomic sequence of Haloarcula marismortui, a halophilic archaeon from the Dead
Sea, was completed in 2004. Halophilic archaeal organisms thrive in extreme
environments of ~4-5 molar salt (an environment found in the Dead Sea). This genome
was sequenced because it is of interest how protein structures, metabolic strategies
and physiological responses might be altered in this high-salt environment.
Previously sequenced Halobacterium species revealed several interesting physical
adaptations, and this more recent sequencing project hopes to shed light on common
features shared by all halophiles.
Of particular interest is that the species encodes a photobiology response: opsin proteins use
light energy to maintain physiological ion concentrations, facilitate phototaxis, and
generate chemical energy in the form of a proton gradient. The authors wanted to see whether
all opsins (Sop1, Sop2, Hop and Bop) were shared between the two species. From the
genomic sequence, you are to see whether H. marismortui shares one of these
opsins with Halobacterium and other species.
Using the methodologies outlined in the above papers, you are to build a workflow that
annotates a piece of genomic sequence taken from the above paper. A couple
of steps have already been carried out – I have searched the genomic sequence for
genes and one of them is to be annotated. It is usual when building a workflow to first
note the individual steps of analysis that are to be carried out in the workflow. As some of
you may not be familiar with the steps needed to annotate a DNA sequence, the steps
are detailed below.
Following this rough draft of the analysis, you can build a workflow using
Taverna.
At each stage of the flow diagram you will be able to find a service, and for each service
you should be able to define the inputs and outputs. Remember that the format of these is
important, as the service will fail if it is incorrect – this case study relies on services already
written for us to convert formats or parse results. You may find at each step that there
are multiple services available – feel free to use whichever you please, but I
would recommend using services that are easiest and simplest to use (e.g., from DDBJ,
EBI, SeqHound). Also, always make sure that your data has somewhere to go – either
into another process or an output! If you don’t do this the workflow will fail!
Exercise: reusing workflows
If you study the diagram above carefully – what do you notice? Is there anything that
overlaps with workflows built earlier today?
Even though the workflow repository is in its infancy, many workflows already exist that may
overlap with the workflow that you are attempting to build. For small workflows it may be
easier to build them yourself from scratch; on the other hand, it is
worth checking whether a suitable workflow, or part of one, already exists
for your own analysis, especially if you are hoping to annotate a sequence.
In this case, you can see that we could use the getFASTA and SearchSimple services
that we implemented earlier today. If you have kept that workflow open then you could
modify that, otherwise load the workflow from this website:
This can be done by clicking on the ‘Load from web’ button in the AME window.
Exercise: modification of workflow
In the first workflow, a GI number is used to get a protein sequence, which is then used
to search the SWISSPROT database. In this analysis we are going to start with a GI
number that represents a DNA sequence: <given on day>. For us to be able to annotate
this we need to translate it into a protein sequence. Find the service that translates a
DNA sequence and add it to the model – remember to look at all the inputs required! For the
inputs you will also have to set the translation table to 11 and the frame to +2. You
may want to add these as String Constants rather than as workflow inputs.
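To see what the frame setting means, here is a small Python sketch of forward-frame translation. It is an illustration, not the service used in the workflow. It relies on the codon assignments of the standard genetic code, which, to the best of our knowledge, NCBI translation table 11 shares (the tables differ in which codons may act as start codons). The function name and the test sequence are made up.

```python
BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order (standard codon assignments).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, frame=1):
    """Translate a DNA string in forward frame +1, +2 or +3."""
    if frame not in (1, 2, 3):
        raise ValueError("only forward frames +1, +2 and +3 are handled here")
    seq = dna.upper()
    start = frame - 1  # frame +2 starts at the second nucleotide
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        # Unknown codons (e.g. containing N) are shown as X.
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)


example = "AATGGCCTAA"  # made-up sequence
print(translate(example, frame=1))
print(translate(example, frame=2))
```

Shifting the frame by one nucleotide changes every codon that is read, which is why the frame has to be supplied along with the sequence.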
Exercise: adding alternative services
BLAST report simplification
A BLAST search identifies sequences that are similar to your query sequence and share
a common ancestor (homologues). The hits reported back suggest the potential function of your
query sequence, but for further analysis all these sequences need to be collected together
and analysed. The reported sequences are now going to be used to construct a multiple
sequence alignment.
The first thing you need to do is to get all the GI accession numbers from the BLAST
report – as no MSA service will take a BLAST report as input! A service at Manchester
has already been written to do this. First of all check whether the following address can
be seen on the list of available services; if not, ‘add a SOAPLAB scavenger’ using this
address:
http://phoebus.cs.man.ac.uk:1977/axis/services.
Look under the seq_analysis services added from the above address and add the
service that allows you to extract information from a BLAST output. (Hint: it will require
the BLAST report as one input and ‘gi’ for the other).
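The kind of parsing such a service performs can be sketched in Python. This is not the Manchester service itself: the sketch assumes that hits in the report carry identifiers of the form gi|number|..., and the report text, the GI numbers and the function name are made up for illustration.

```python
import re


def extract_gi_numbers(blast_report):
    """Return the GI numbers found in a report, in order, without repeats."""
    seen = []
    for gi in re.findall(r"gi\|(\d+)", blast_report):
        if gi not in seen:
            seen.append(gi)
    return seen


# Made-up fragment in the style of a BLAST report.
report = """\
gi|111111|sp|P00001|EXAMPLE_A  hypothetical hit   200  1e-50
gi|222222|sp|P00002|EXAMPLE_B  hypothetical hit   150  3e-30
>gi|111111|sp|P00001|EXAMPLE_A hypothetical hit
"""

gi_numbers = extract_gi_numbers(report)
print(gi_numbers)
```

The same identifier usually appears more than once in a report (in the summary and in the alignments), so repeats are dropped.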
Multiple Sequence Alignment
A program that is often used in bioinformatics is ClustalW – search for ClustalW in the
service list. You may have noticed that one of the services highlighted is the Biomoby
service – why cannot we use this? Are there any other services? Did you find a service
that can take multiple identifiers and aligns the sequences they correspond to?
The services reported back when searching for ClustalW are few and many are Biomoby
services, which in this instance we cannot use without conversion into many objects. A
service that actually is ClustalW is called ‘emma’. Search for this and add to your model
– at this stage you will not be able to add any inputs.
It is likely that you did not find a service that will take identifiers – all services take
multiple sequence strings. If you want to align your sequences using ClustalW you need
some way of getting the sequences for the list of GI numbers that have resulted from
your BLAST search. To do this, can you use a service you have already used? Or is there
an alternative? Again, it is not obvious! A service that can help you do this is seqret.
However, you will not be able to use this service on its own – you also need to use one
that splits up your sequence identifiers into a list.
Add all the services required to align the sequences to the model. (Hint: you may need
to use ‘split string into string list by regular expression’).
Visualising a Multiple Sequence Alignment
More often than not we want to visualise results. Search for a service that displays your
aligned sequences. You may wish to use this website to find a service that will help
you: http://bioweb.pasteur.fr/docs/EMBOSS/
Add the service to your model. You may wish to investigate the input options other
than just the aligned sequences.
Exercise: finishing the workflow
The final steps of the workflow are to construct a phylogenetic tree. This consists of two
steps: the construction of a distance matrix and the drawing of the tree using that
distance matrix. Can you find the services that carry out these steps and add them to
your workflow?
Provenance Management
Provenance in Taverna
• Overview of Provenance.
• Why it’s important in the context of Taverna & Workflows.
• Demo of the (Under Development) Provenance Browser.
• Coming Soon… to the Browser.
• Your Feedback/Ideas!?
Provenance
• Metadata about the origin of a resource (workflow, service, data,
experiment hypothesis etc) and the process of how a resource was
generated.
• The Who?, What?, When?, Where? and Why? about resources.
[Figure: a Provenance Record linking an Input to several Results.]
What is Provenance Data?
[Figure: Provenance Pyramid with four levels: Knowledge Level, Organization Level, Data Level, Process Level.]
Stored in Triples: <Subject, Predicate, Object>
4 Levels/Views of provenance
Organization: <Experimenter; is_part_of; project>
Data: <Process; has_input; Datafile>
Process: <workflow; runs_process; process>
Knowledge: the most interesting but also the most difficult to obtain;
found through applying domain knowledge to other provenance data
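The triple idea can be illustrated with a few lines of Python. This is a minimal sketch, not the store used by Taverna: the predicates come from the examples above, but the subjects, objects and the `match` helper are hypothetical.

```python
# Each provenance statement is a (subject, predicate, object) triple.
triples = [
    ("experimenter_1", "is_part_of", "project_A"),        # organisation level
    ("workflow_run_1", "runs_process", "blast_process"),  # process level
    ("blast_process", "has_input", "query.fasta"),        # data level
    ("workflow_run_1", "runs_process", "align_process"),
    ("align_process", "has_input", "blast_hits.txt"),
]


def match(subject=None, predicate=None, obj=None):
    """Return all triples that agree with every part that is not None."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Which processes did this workflow run?
processes = [o for _, _, o in match(subject="workflow_run_1", predicate="runs_process")]
# Which data went into those processes?
inputs = [o for proc in processes for _, _, o in match(subject=proc, predicate="has_input")]
print(processes)
print(inputs)
```

Chaining simple pattern matches like this is enough to answer questions such as which data files a given workflow run depended on.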
Provenance in Taverna
Currently:
• Targeting at recording machinery experiment process
• No support for data integration
• No support for using provenance
Aiming to address with provenance browser
Provenance in Taverna
Advantages to Using Provenance in Taverna:
• Retrieve results for a previously run workflow instantly
rather than re-run.
– large workflow may take hours!
• Can then Navigate & Validate Results
• Provides a backup mechanism as workflows can be
reconstructed from provenance data.
Provenance in Taverna
Cross-Referencing Provenance resources across experiments
• Where the power of provenance becomes apparent
• Share, discover, re-use resources:
e.g. Compare different runs of the same workflow – infer any
interesting similarities/differences
Provenance Browser
• Under construction!
• Provides quick access to provenance data for all your
Workflow Runs
• Currently the browser only supports Data / Process /
Organisation centric functionality
• Evolving daily, soon to provide an array of functionality!
Provenance Browser
Coming Soon…
• Workflow view on the provenance data.
• Ability to compare the runs of a Workflow.
• Filters: failed workflow runs, workflow runs with changed data
• Modify the Inputs and re-call the workflow