
myGrid Users Day

EMBRACE Tutorial, Finland

15th June 2006

myGRID Users Day, June 2006

Contributors

All teaching material is provided by the following members of the myGrid team: Katy Wolstencroft, Pinar Alper, Duncan Hull, Matthew Gamble, Antoon Goderis and Paul Fischer, and by Georgina Moulton, NIBHI Bioinformatics Education and Development Fellow.


Material published by The University of Manchester for this event is copyright The University of Manchester and may not be reproduced without permission. Copyright exists in all other original material published by staff of The University of Manchester and may belong to the author or to The University of Manchester depending on the circumstances of publication.

All Material © 2006, The University of Manchester


Today’s course in association with:

Development Team

(http://www.mygrid.org.uk)

University of Manchester

Northwest Institute for Bio-Health Informatics (NIBHI)

myGrid collaborators include:

Funded by:


Course Outline

09.00 Introduction

09.30 Installation of workbench

09.45 ‘Hands-on’ workshop

Adding new services

Finding and invoking a service

Finding and using workflows

Building a simple workflow

Stringing services together

Saving results

Defining output formats (a break is scheduled mid-morning during this session)

12.00 Advanced Exercises

BioMart

BioMoby services

Iteration

Control flow

12.30 Lunch

13.30 Feta Semantic Discovery

14.00 SHIM Services

Introduction and ‘hands-on’ session

15.00 Break

15.30 Case study: Building workflows for Environmental Genomics

16.15 Provenance management Demonstration

16.45 Q&A session

17.00 Finish


Background Information

Motivation

Bioinformatics data in the public domain is vast, heterogeneous and distributed throughout the world. Whilst public availability has enabled great advances in the field of bioinformatics, the heterogeneity and distribution mean that using these resources together can be problematic. Traditionally, solutions have involved bespoke Perl scripting and transferring large data sets to local machines, but these approaches are not practical on a large scale. Biological workflows provide a distributed computing solution for interoperation, enabling researchers to use public databases and analysis tools, with their associated computational resources, from their own desktop.

Aims

The myGrid training day will provide a foundation in building workflows in the biological domain using Taverna and associated services. Participants will gain an understanding of both the issues and practicalities of building workflows through ‘hands-on’ sessions and will gain sufficient knowledge in order to build their own workflows and use those developed by others.

At the end of today you will be able to use basic Grid features such as:

adding new services

finding and invoking a service

finding and reusing workflows

building new workflows

In addition, you will be introduced to more advanced features such as:

configuring BioMoby services

handling service failures

iterating over datasets

provenance management

service discovery using FETA


Useful References

myGRID website: http://www.mygrid.org.uk

TAVERNA project website: http://taverna.sourceforge.net/ (this website includes News, Downloads and Documentation)

TAVERNA references

Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004 Nov 22;20(17):3045-54

R. D. Stevens, H. J. Tipney, C. J. Wroe, T. M. Oinn, M. Senger, P. W. Lord, C. A. Goble, A. Brass, and M. Tassabehji. Exploring Williams–Beuren syndrome using myGrid. Bioinformatics, Aug 2004; 20: i303-i310.

Li, P, Wipat, A., Hayward, K., Jennings, C., Owen, K., Oinn, T., Stevens, R. and Pearce, S. Association of variations in NFKBIE with Graves’ disease using classical and myGrid methodologies

Contact Information

Dr Katy Wolstencroft [email protected]

Georgina Moulton [email protected]


What is myGrid?

• myGrid middleware components to support in silico experiments in biology

• Taverna – user interface for myGrid. Enables the design and enactment of life science workflows

[Figure: myGrid components. Labels: Taverna; Freefluo; Grimoire Registry; Event Notification; mIR; Pedro Annotation; Feta Discovery; Info. Model; Soaplab; Gowlab; BioNanny; Mediator; Portal; LSIDs; KAVE; DQP]

myGrid Partners

EPSRC funded UK eScience Program Pilot Project

Now an OMII-UK node

The Motivation

Bioinformatics is an open Community

• Open access to data

• Open access to resources

• Open access to tools

• Open access to applications

Global in silico biological research

The User Community Problems

• Everything is Distributed

– Data, Resources and Scientists

• Heterogeneous data

• Very few standards

– I/O formats, data representation, annotation

– Everything is a string!

Integration of data and interoperability of resources is difficult


Traditional Approach to Integration

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt

12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt

12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct

12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt

12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt

12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt

12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg

12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga

12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc

12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa

12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Pipeline Programming

• Advantages

– Repeatable

– Allows automation

– Quick, reliable, efficient

• Disadvantages

– Requires programming skills

– Difficult to modify

– Requires local tool and database installation

– Requires tool and database maintenance!!!

Initial User Requirements

• Need an automated, reliable replacement

• Needs to deal with the data heterogeneity and open world community

• Needs to deal with the distributed resources – provide access to third party services not owned or controlled by myGrid

• Needs to run on ordinary PC or laptop

• Needs to be easy to use by non-programmers

myGrid Approach - Workflows

General technique for describing and enacting a process

Describes what you want to do, not how you want to do it

Simple language specifies how bioinformatics processes fit together

– processes are web services

– High level workflow diagram separated from any lower level coding – therefore, you don’t have to be a coder to build workflows

[Figure: example workflow. Labels: RepeatMasker Web service; GenScan Web Service; Blast Web Service; Sequence; Predicted Genes out]


Taverna Workflow Components

Scufl: Simple Conceptual Unified Flow Language

Taverna: Writing, running workflows & examining results

SOAPLAB: Makes applications available

Freefluo: Workflow engine to run workflows

[Figure labels: Freefluo; SOAPLAB Web Service; Any Application; Web Service e.g. DDBJ BLAST]

Williams-Beuren Syndrome (WBS)

• Contiguous sporadic gene deletion disorder

• 1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis

• Haploinsufficiency of the region results in the phenotype

• Multisystem phenotype – muscular, nervous, circulatory systems

• Characteristic facial features

• Unique cognitive profile

• Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)

• Outgoing personality, friendly nature, ‘charming’

Williams-Beuren Syndrome Microdeletion

[Figure: physical map of the WBS region. Labels: Chr 7 ~155 Mb; ~1.5 Mb; 7q11.23; genes GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14, STAG3, PMS2L, FKBP6T, GTF2IP, NCF1P, GTF2IRD2P; repeats C-cen, C-mid, A-cen, B-mid, B-cen, A-mid, B-tel, A-tel, C-tel; Block A, Block B, Block C; WBS and SVAS patient deletions; clones CTA-315H11 and CTB-51J22; ‘Gap’; Physical Map]

Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354

Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164

Filling a genomic gap in silico

• Identify new, overlapping sequence of interest

• Characterise the new sequence at nucleotide and amino acid level

– Frequently repeated – info rapidly added to public databases

– Time consuming and mundane

– Don’t always get results

– Huge amount of interrelated data is produced


The Williams Workflows

A: Identification of overlapping sequence

B: Characterisation of nucleotide sequence

C: Characterisation of protein sequence

The Biological Results

CTA-315H11 CTB-51J22

ELN

WBSCR14

RP11-622P13 RP11-148M21 RP11-731K22

314,004bp extension

All nine known genes identified

(40/45 exons identified)

CLDN4

CLDN3

STX1A

WBSCR18

WBSCR21

WBSCR22

WBSCR24

WBSCR27

WBSCR28

Four workflow cycles totalling ~ 10 hours

The gap was correctly closed and all known features identified

Data Management

• Workflows can generate vast amounts of data. How can we manage and track it?

• Taverna workflow workbench & plugins

– Ensure automated recording

• Data AND metadata AND experiment provenance

• Semantic Web technologies (RDF, Ontologies)

– To store knowledge provenance

[Figure: myGrid architectural framework. Labels: Taverna Workflow Workbench; OGSA-Distributed Query Processing; Results management; LSID; mIR; e-Science coordination; e-Science mediator; e-Science process patterns; e-Science events; Notification service; myGrid information model; Metadata & provenance management using semantics; KAVE; Legacy integration; Publication and Discovery using semantics; Feta; Pedro; Ontology; Portal & Application tools]


Today

Practical experience of the Taverna workbench

• Finding services

• Building workflows

• Reusing workflows

Other myGrid components – shim services, semantic discovery, provenance management

Acknowledgements

Core

Matthew Addis, Nedim Alpdemir, Pinar Alper, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Matthew Gamble, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Daniele Turi, Anil Wipat, Paul Watson, Katy Wolstencroft and Chris Wroe.

Users

Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK

Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK

Postgraduates

Martin Szomszor, Duncan Hull, Jun Zhao, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire

Industrial

Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)

Robin McEntire (GSK)


“Hands-on” Basic Exercises

myGrid Users Day

15th June 2006

EMBRACE Tutorial, Finland

Dr. K. Wolstencroft and Dr. G. Moulton

Exercise 1 Installing the Workbench

• Download Taverna from http://taverna.sourceforge.net

– Windows or Linux

If you are using either a modern version of Windows (Win2k or WinXP, with XP preferred) or any form of Linux, Solaris etc., you should download the workbench zip file. On Windows, Taverna can simply be unzipped and used; on Linux you will also need to install GraphViz (from http://www.graphviz.org/, using the appropriate rpm for your platform).

– Mac OSX

If you are using Mac OSX you should download the .dmg workbench file. Double-click to open the disk image and copy both components (Taverna and GraphViz) onto your hard disk to run the application.

• YOU WILL ALSO NEED a modern Java Runtime Environment (JRE) or Java Software Development Kit (SDK) from http://java.sun.com. We recommend 1.4.2 or 1.5.
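Before starting the workbench it can help to confirm that the prerequisites above are visible on your system. The following is a minimal sketch, not part of Taverna: it assumes the Java launcher is called `java` and that GraphViz provides the `dot` executable, and it only checks that they can be found on the PATH. The function name `find_tools` is made up for this illustration.

```python
import shutil


def find_tools(names=("java", "dot")):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report what was found; nothing is installed or modified.
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A tool reported as not found may still be installed elsewhere; this check says nothing about the Java version.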

Workbench Layout

• AME – Advanced Model Explorer

The Advanced Model Explorer (AME) is the primary editing component within Taverna. Through it you can load, save and edit any property of a workflow.

– enables building, loading, editing and saving workflows

Workflow Diagram Window

Visual representation of workflow

• Shows inputs / outputs, services and control flows

• Enables saving of workflow diagrams for publishing and sharing


Available Services Panel

Lists services available by default in Taverna

• ~ 3000 services

– Local java services

– Simple web services

– Soaplab services – legacy command-line application

– Gowlab services

– BioMart database services

– BioMoby services

Allows the user to add new services or workflows from the web or from file systems

Exercise 2 Adding New Services

New services can be gathered from anywhere on the web

Go to http://taverna.sourceforge.net/webservices/

These services are not all included by default when Taverna opens.

• Scroll down the page to DDBJ services. You will see a list of available DDBJ services. Click on the DDBJ blast service (http://xml.nig.ac.jp/wsdl/Blast.wsdl) and copy the web page address

Exercise 2 Adding New Services

• Go to the ‘Available services’ panel and right-click on ‘Available Processors’. For each type of service, you are given the option to add a new service, or set of services.

• Select ‘Add new wsdl scavenger’. A window will pop up asking for a web address

• Enter the Blast Web service address

• Scroll down to the bottom of the ‘Available Services’ panel and look at the new DDBJ service that is now included.
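To give a feel for what a WSDL address provides, the sketch below lists the operation names declared in a WSDL 1.1 document. It is an illustration only and not how Taverna's scavenger is implemented. The embedded `SAMPLE_WSDL` is a made-up, heavily trimmed document (the operation names in it are placeholders, not taken from the real DDBJ file), used so that the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Namespace used by WSDL 1.1 documents.
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Hypothetical, trimmed WSDL used only for illustration.
SAMPLE_WSDL = """<?xml version="1.0"?>
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="Blast">
  <portType name="Blast">
    <operation name="searchSimple"/>
    <operation name="searchParam"/>
  </portType>
</definitions>
"""


def list_operations(wsdl_text):
    """Return the operation names declared in every portType, in document order."""
    root = ET.fromstring(wsdl_text)
    operations = []
    for port_type in root.iter(f"{{{WSDL_NS}}}portType"):
        for operation in port_type.findall(f"{{{WSDL_NS}}}operation"):
            operations.append(operation.get("name"))
    return operations


if __name__ == "__main__":
    print(list_operations(SAMPLE_WSDL))
```

Each operation found this way corresponds to one entry you would expect to see under the newly added service in the panel.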

Exercise 3

Finding and invoking a Service

Go to the ‘Available Services’ Panel

• Search for Fasta in the ‘search list’ box at the top of the panel (we will start with simple sequence retrieval)

• You will see several services highlighted in red

• Scroll down to ‘Get Protein FASTA’

This service returns a Fasta sequence from a database if you supply it with a sequence id


Exercise 3

Invoking a single service

• Right click on the ‘Get Protein FASTA’ service and select ‘Invoke service’

• In the pop-up ‘Run workflow’ window add a protein sequence GI by right-clicking on ID and entering the value in the box on the right

– A GI is a GenBank sequence identifier. You don’t need the ‘gi:’ prefix, just the number; for example, the MAP kinase phosphatase sequence ‘GI:1220173’ would be entered as ‘1220173’

• Click ‘Run workflow’ and the service is invoked
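The rule about entering only the number can be sketched as a small helper. This is an illustration written for this handout, not something Taverna provides; the function name `normalize_gi` is hypothetical.

```python
def normalize_gi(text):
    """Strip an optional 'GI:' prefix and return the bare numeric identifier."""
    value = text.strip()
    if value.lower().startswith("gi:"):
        value = value[3:].strip()
    if not value.isdigit():
        # Anything that is not purely numeric is not a GI number.
        raise ValueError(f"not a numeric GI: {text!r}")
    return value


if __name__ == "__main__":
    print(normalize_gi("GI:1220173"))
```

Both ‘GI:1220173’ and ‘1220173’ give the same value, which is what the service expects as input.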

Exercise 3 View Results

• Click on ‘Results’

– The fasta sequence is displayed on the right when you select ‘click to view’

• Click on ‘Process Report’

– Look at processes. This shows the experiment provenance – where and when processes were run

• Click on ‘Status’

– Look at options. As workflows run, you can monitor their progress here.

Exercise 3 - Conclusion

The processes for running and invoking a single service are the basics for any workflow, and the tracking of processes and generation of results are the same however complicated a workflow becomes.

In the next few exercises, we will look at some example workflows and build some of our own from scratch.

Exercise 4 – Finding and using workflows

• Reset the workbench (top right of the AME)

• Select ‘Load’ from the top left of the AME. You will see a selection of .xml files in an examples directory. These are workflow definition files

• Select ‘ConvertedEMBOSSTutorial.xml’ and a pre-defined workflow will be loaded

• View the workflow diagram - you will see services in different colours


Exercise 4 Workflow Documentation

• Find out what the workflow does by reading the workflow metadata

• In the AME – click on the ‘workflow model’ and then select the ‘workflow metadata’ tab at the top of the AME. You will see a text description of the workflow, its author and its unique LSID. When publishing workflows for others, this annotation is useful for information and for allowing acknowledgement of IP.

Exercise 4 Workflow Features

• Run the workflow by selecting the ‘Tools and Workflow Invocation’ tab at the top of the workbench and selecting ‘Run workflow’

• Watch the progress of the workflow in the ‘enactor invocation’ window. As services complete, the enactor reports the events. If a service fails, the enactor reports this also

Loading workflows from the Web

• Go to the webpage http://www.cs.man.ac.uk/~katy/taverna

• Select ‘conditional_control.xml’ and copy the web address

• Go back to the Taverna workbench and select ‘Load from web’

• Run the workflow

• You will see at least one of the services fail. These are conditionals – fail if false or fail if true. These are useful operators for controlling the progress of your workflow based on intermediate results

• You will see black arrows and white circles – black arrows show the flow of the data and white circles are control links.
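The idea behind the two conditionals can be sketched in a few lines. This is a conceptual illustration and not Taverna's implementation: a gate either passes or fails depending on an intermediate value, and a downstream step runs only if its gate did not fail. All names here (`ServiceFailure`, `run_branches`, the step names) are invented for the example.

```python
class ServiceFailure(Exception):
    """Raised when a conditional gate fails."""


def fail_if_false(value):
    # Fails unless the value is the string 'true' (case-insensitive).
    if str(value).strip().lower() != "true":
        raise ServiceFailure("condition was false")


def fail_if_true(value):
    # Fails when the value is the string 'true' (case-insensitive).
    if str(value).strip().lower() == "true":
        raise ServiceFailure("condition was true")


def run_branches(condition):
    """Return the names of the downstream steps whose gate did not fail."""
    ran = []
    gated_steps = ((fail_if_false, "step_if_true"), (fail_if_true, "step_if_false"))
    for gate, step in gated_steps:
        try:
            gate(condition)
        except ServiceFailure:
            continue  # the gate failed, so the step it controls is skipped
        ran.append(step)
    return ran


if __name__ == "__main__":
    print(run_branches("true"))
    print(run_branches("false"))
```

For any single condition exactly one gate fails, which matches the observation that at least one service fails on every run.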

Exercise 5

Building a simple workflow from scratch

• Import the ‘Get Protein FASTA’ service into a new workflow model. First, you will need to reset the workflow in the AME, then find the ‘Get Protein Fasta’ service again in the ‘Available services’ panel.

• Right-click on ‘Get Protein Fasta’ and import it into the workbench by selecting ‘Add to Model’

• Go to the AME and expand the [+] next to the newly imported ‘Get Protein Fasta’ service. You will see:

– 1 input (green arrow pointing up)

– 1 output (purple arrow pointing down)


Exercise 5 Adding Input

• Define a new workflow input by right-clicking on ‘Workflow Input’ and selecting ‘create new Input’

• Supply a suitable name e.g. ‘geneIdentifier’

• Connect this new input to the ‘Get Protein Fasta’ service by right-clicking on ‘geneIdentifier’ and selecting ‘getFasta ->id’

You always build workflows with the flow of data

Exercise 5 Adding output

• Define a new workflow output by right-clicking on ‘workflow output’ and selecting ‘create new output’

• Supply a suitable name e.g. ‘fastaSequence’

• Connect this new output to the ‘Get Protein Fasta’ service, remembering to build with the flow of data

You have now built a simple workflow from scratch!

• Run the workflow by selecting ‘run workflow’ from the ‘Tools and Workflow Invocation’ menu at the very top of the workbench. You will again need to supply a GI – for later exercises, please use a protein GI – e.g. 1220173

Exercise 6 Stringing Services Together

• We have used ‘Get Protein Fasta’ to retrieve a sequence from the GenBank database. What can we do with a sequence?

• Blast it?

• Find features and annotate it?

• Find GO annotations?

Exercise 6 Blast It

• Search for ‘blast’ in the ‘Available Services’ panel. Again you will see several services highlighted in red

• Scroll down the list until you find the DDBJ Blast service we added earlier

• Select the ‘Search Simple’ service and add it to the model

• In the AME expand the [+] for the ‘search simple’ service and view the input/output parameters


Exercise 6 Blast it

• This time, you will see three inputs and two outputs. For the workflow to run, each input must be defined. If there are multiple outputs, a workflow will usually run if at least one output is defined.

• Create an output called ‘blast_report’ in the same way we did before

• The sequence input for the Blast will be the output from the ‘Get Protein Fasta’ service. Connect the two together, from ‘Get Protein Fasta Output Text’ to ‘searchsimple query’

• Create two more inputs called ‘database’ and ‘program’ and connect them to the ‘database’ and ‘program’ inputs on the ‘search simple’ service
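The rule that every input must be defined before the workflow can run can be sketched as a simple check over the links you have drawn. This is a toy model for illustration only; the data structures and the function name `unbound_inputs` are assumptions and do not reflect Taverna's internal workflow model.

```python
def unbound_inputs(processor_inputs, links):
    """List (processor, input) pairs that have nothing connected to them.

    processor_inputs: dict mapping a processor name to its input port names.
    links: set of (processor, input) pairs that already receive data.
    """
    missing = []
    for processor, inputs in processor_inputs.items():
        for name in inputs:
            if (processor, name) not in links:
                missing.append((processor, name))
    return missing


# Port names follow the exercise text: one 'id' input on the FASTA service,
# and 'query', 'database' and 'program' inputs on the Blast service.
PROCESSORS = {
    "getFasta": ["id"],
    "searchSimple": ["query", "database", "program"],
}

if __name__ == "__main__":
    connected = {("getFasta", "id"), ("searchSimple", "query")}
    print(unbound_inputs(PROCESSORS, connected))
```

With only the sequence connected, the check reports that ‘database’ and ‘program’ still need a value, which is why the two extra workflow inputs are created.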

Exercise 6 Blast it

• Once more select ‘run workflow’ from the ‘Tools and Workflow Invocation’ menu. You will see a run workflow window asking for 3 input values

• Insert a GI (e.g. 1220173), a program (blastp for protein-protein blast), and a database, e.g. SWISS (for swissprot)

• Click ‘run workflow’. This time you will see a blast report and a fasta sequence as a result

Exercise 6 Blast it

• For parameters that do not change often, you will not wish to always type them in as input. In this example, the database and blast program may only change occasionally, so there is an alternative way of defining them.

• Go back to the AME and remove the ‘database’ and ‘program’ inputs by right-clicking and selecting ‘remove from model’

Exercise 6 String Constants

• Select ‘string constant’ from ‘Available Services’

• Right-click and select ‘add to model with name…’

• Insert ‘program’ in the pop-up window

• Select ‘string constant’ for a second time and repeat for a string constant named ‘database’

• In the AME, right-click on ‘program’ and select ‘edit me’

• Edit the text to ‘blastp’. Repeat for ‘database’ and enter ‘SWISS’ for the swissprot database

• Run the workflow – it runs in the same way

• Save the workflow by selecting the ‘save’ icon at the top of the AME.


Exercise 7 Protein Annotation

How can we use Taverna to annotate our protein with function descriptions?

• In the ‘available services’ panel, find the emboss soaplab services and find the ‘protein_motifs’ section

Hint: use the simple text search at the top of the panel

• Find out which of these services enable searching of the Prosite and Prints databases by fetching the service descriptions. To do this right-click on ‘protein_motifs’ and select ‘fetch descriptions’

• Import both services into the workflow model.


• Connect these services up to the workflow so that you can find prints and prosite matches in the query sequence returned from ‘Get Protein Fasta’ – you will see that soaplab services have many input values

Soaplab services have many input parameters, but many of these have default values and so may not need to be altered. In this case, you can run the services by simply adding the query sequence. Go to the EMBOSS home page to find out which input(s) relate to the query sequence.

This extra searching is impractical – the Feta Semantic Discovery tool is designed to combat this problem (there will be a Feta talk later in the day)


• Run the workflow – now you have blast results and protein domain/motif matches

• How else can you annotate your protein? As an advanced exercise, you might want to search for other ways of characterising your sequence e.g. structural elements, GO annotation?

Saving Results

Taverna provides several options for saving data.

1. Individual data items can be saved by right-clicking on them

2. All data can be saved to disk

3. Textual/tabular data can be saved to Excel

• Save all the data from your workflow


“Hands-on” Advanced Exercises

Advanced Exercises

The previous exercises have covered the basics of myGrid workflows. The following demos and exercises cover more advanced features, such as rendering output, configuring BioMart services, dealing with service failure and iterating over datasets. You may not reach the end of these exercises, but they will provide some examples to take home.

Exercise 8 Defining Output Formats

So far, most of the outputs we have seen have been text, but in bioinformatics, we often want to view a graph, a 3D structure, an alignment etc. Taverna is able to display results using a specific type of renderer if the workflow output is configured correctly.

• Reset the workbench and load ‘convertedEMBOSSTutorial’ from the ‘examples’ directory

• Look at the workflow diagram and read the workflow metadata to find out what the workflow does

• Run the workflow

Exercise 8 Defining Output Format

• Look at the results. For ‘tmapPlot’ and ‘outputPlot’, you will see the results are displayed graphically. This is achieved by specifying a particular MIME type in the output.

• Go back to the AME and look at the metadata for ‘tmapPlot’ and ‘outputPlot’.

• Select MIME Types. As you can see, each has the image/png MIME type associated with it. If you wish to render results in anything other than plain text, you MUST specify the MIME type in the workflow output

Exercise 8 Taverna MIME-Types

The following MIME types are currently used by Taverna:

text/plain=Plain Text

text/xml=XML Text

text/html=HTML Text

text/rtf=Rich Text Format

text/x-graphviz=Graphviz Dot File

image/png=PNG Image

image/jpeg=JPEG Image

image/gif=GIF Image

application/zip=Zip File

chemical/x-swissprot=SWISSPROT Flat File

chemical/x-embl-dl-nucleotide=EMBL Flat File

chemical/x-ppd=PPD File

chemical/seq-aa-genpept=Genpept Protein

chemical/seq-na-genbank=Genbank Nucleotide

chemical/x-pdb=Protein Data Bank Flat File

chemical/x-mdl-molfile
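As a rough illustration of how an output's MIME type can select how a result is described or displayed, the sketch below stores the table above in a dictionary and looks a type up. It is not Taverna code. The fallback to plain text for an unknown or missing type is an assumption made for this example, and `chemical/x-mdl-molfile` is left out because the list gives no description for it.

```python
# Descriptions copied from the list above.
MIME_DESCRIPTIONS = {
    "text/plain": "Plain Text",
    "text/xml": "XML Text",
    "text/html": "HTML Text",
    "text/rtf": "Rich Text Format",
    "text/x-graphviz": "Graphviz Dot File",
    "image/png": "PNG Image",
    "image/jpeg": "JPEG Image",
    "image/gif": "GIF Image",
    "application/zip": "Zip File",
    "chemical/x-swissprot": "SWISSPROT Flat File",
    "chemical/x-embl-dl-nucleotide": "EMBL Flat File",
    "chemical/x-ppd": "PPD File",
    "chemical/seq-aa-genpept": "Genpept Protein",
    "chemical/seq-na-genbank": "Genbank Nucleotide",
    "chemical/x-pdb": "Protein Data Bank Flat File",
}


def describe(mime_type=None):
    """Return the description for a MIME type, assuming plain text if it is unknown."""
    return MIME_DESCRIPTIONS.get(mime_type, MIME_DESCRIPTIONS["text/plain"])


if __name__ == "__main__":
    print(describe("image/png"))
    print(describe())
```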


Exercise 8 Taverna MIME types(2)

The ‘chemical/’ MIME types are rendered using SeqVista to view formatted sequence data

• Reset the workbench and load ‘seqVistaRendering’ from the ‘examples’ directory for a demo

The chemical/x-pdb type can be used to view rotating 3D protein images

Advanced Features

• Spotlight on BioMart

• BioMoby Services

• Iteration

• Control Flow

• Substituting Services and fault tolerance

Spotlight on Biomart

Biomart enables the retrieval of large amounts of genomic data, e.g. from Ensembl and Sanger, as well as Uniprot and MSD datasets

• After saving any workflows you want to keep, reset the workbench in the AME

• Load the workflow ‘BiomartAndEMBOSSAnalysis.xml’ from the ‘examples’ directory

• Run the Workflow

Spotlight on Biomart

This workflow starts by fetching all gene IDs from Ensembl corresponding to human genes on chromosome 22 that are implicated in known diseases and have homologous genes in rat and mouse.

For each of these gene IDs it fetches the 200bp after the five prime end of the genomic sequence in each organism and performs a multiple alignment of the sequences using the EMBOSS tool 'emma' (a wrapper around ClustalW). It then returns PNG images of the multiple alignment along with three columns containing the human, rat and mouse gene IDs used in each case.


Configuring Biomart

• Right-click on the service and select ‘configure bioMart query’

• By selecting ‘filters’ – change the chromosome from 22 to 21 – now the workflow will retrieve all disease genes from chromosome 21 with rat and mouse homologues

• Run the workflow and look at the results

• See how the ‘disease gene’ filter was configured and the ‘sequence exports’ were configured on the other Biomart queries for mouse and rat

Adding Extra Information

Find out which diseases the known disease genes on your chosen chromosome are associated with by adding a new Biomart query process

• Select ‘hsapiens_gene_ensembl’ from the available services panel and select ‘invoke with name….’ (as there is already a service with that name!)

• Call the service ‘hsapiens_disease’

• Configure ‘hsapiens_disease’ by selecting an ‘ensembl gene IDs’ filter under the ‘gene’ tab

• Configure the output attribute ‘disease description’ under the ‘gene’ tab in the attributes section

Adding Extra Information

• Connect the input to the ‘hsapiens_gene_ensembl’ service via the ‘gene_stable_id’

• Create a new workflow output for the ‘disease_description’ output

• Re-run the workflow and view which diseases are associated with your chromosome

Spotlight on BioMoby

The process of adding a BioMoby service is different from other services. BioMoby services need to be defined using terms from the Moby Object ontology

• Reset the workflow and load the ‘blast-biomoby.xml’ workflow from http://www.cs.man.ac.uk/~katy/taverna/



• Run the workflow and look at the results

As the workflow name suggests, a blast search is performed on a sequence

• Look at the workflow diagram

Instead of simply giving the blast service a fasta sequence, there is a ‘GenericSequence’ object defined.

• Look at the inputs for ‘GenericSequence’

• Read the metadata for the ‘GenericSequence’ object in the AME window


The GenericSequence object is defined by

1. The sequence (as a plain string)

2. The length of the sequence (as an integer)

3. A unique identifier for the sequence

4. The namespace in which that identifier belongs

These extra definitions take time for the user to define, but they have other advantages
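The four parts of a GenericSequence listed above can be pictured as a small record. The class below is only an illustration of that structure; it is not the BioMoby API, and the field names, the `from_string` helper and the namespace value used in the example are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericSequence:
    sequence: str    # the sequence as a plain string
    length: int      # the length of the sequence
    identifier: str  # a unique identifier for the sequence
    namespace: str   # the namespace the identifier belongs to

    @classmethod
    def from_string(cls, sequence, identifier, namespace):
        """Build the record, deriving the length from the sequence itself."""
        return cls(sequence, len(sequence), identifier, namespace)


if __name__ == "__main__":
    # 'example_namespace' and the short sequence are placeholders.
    print(GenericSequence.from_string("MKVLA", "1220173", "example_namespace"))
```

Deriving the length from the sequence removes one value the user would otherwise have to type and keep consistent.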


• Right-click on the ‘GenericSequence’ object in the AME and select ‘Moby Object Details’

• A pop-up window will show you what BioMoby services a ‘GenericSequence’ is produced by and what services it can feed into

• Right-click on the ‘MPIBlastBetterE13’ service and select ‘Moby Object Details’. This tells you what the service requires as inputs and what it produces as output


• The BioMoby services are annotated using terms from the Moby ontology to enable semantic searching for services.

• BioMoby services are specialist kinds of service from a closed community. The object model, ontology and annotations have been agreed by the BioMoby service providers.

• Semantic discovery queries over other myGrid services will also be possible in the near future using the myGrid ontology and the Feta semantic discovery component. This will be covered in detail in a later session.


Iteration

Taverna has an implicit iteration framework. If you connect a set of data objects (for example, a set of fasta sequences) to a process that expects a single data item at a time, the process will iterate over each sequence.

• Reload the ‘Probeset_id_2_swissprot_id.xml’ from the previous web directory and run it

• Watch the progress report. You will see several services with ‘Invoking with Iteration’
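The idea of implicit iteration can be sketched as follows. This is a conceptual illustration, not Taverna's enactor: when a process that handles one item is handed a list, it is simply applied to each element in turn. The function name `invoke` and the stand-in process are made up for the example.

```python
def invoke(process, data):
    """Apply process to data, iterating automatically if data is a list or tuple."""
    if isinstance(data, (list, tuple)):
        # One invocation per element; nested lists are handled recursively.
        return [invoke(process, item) for item in data]
    return process(data)


def sequence_length(sequence):
    # Stand-in for a service that expects a single sequence.
    return len(sequence)


if __name__ == "__main__":
    print(invoke(sequence_length, "ACGT"))
    print(invoke(sequence_length, ["ACGT", "AC", "ACGTAC"]))
```

The workflow author never writes the loop; supplying a set of items where a single item is expected is enough to trigger it.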

Iteration

The user can also specify more complex iteration strategies using the service metadata tag

• Reset the workflow and load the ‘IterationStrategyExample.xml’

• Read the workflow metadata to find out what the workflow does

• Select the ‘ColourAnimals’ service and read the metadata for that service. Under the description is the iteration strategy

• Click on ‘dot product’. This allows you to switch to cross product

Iteration

• Run the workflow twice – once with ‘dot product’ and

once with ‘cross product’.

• Save the first results so you can compare them – what is

the difference? What does it mean to specify dot or cross

product?
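One way to think about the two strategies is as two ways of pairing up input lists. The sketch below is plain Python, not Taverna code, and the colour and animal values are assumed inputs chosen for illustration; the example workflow may use different ones.

```python
from itertools import product


def dot_product(xs, ys):
    # Pair items position by position: first with first, second with second
    return list(zip(xs, ys))


def cross_product(xs, ys):
    # Pair every item of xs with every item of ys
    return list(product(xs, ys))


colours = ["red", "green"]   # assumed example inputs
animals = ["cat", "rabbit"]

dot = [f"{c} {a}" for c, a in dot_product(colours, animals)]
cross = [f"{c} {a}" for c, a in cross_product(colours, animals)]

print(dot)    # one result per position
print(cross)  # one result per combination
```

With two lists of length n, a dot product invokes the service n times, while a cross product invokes it n × n times.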

Substituting Services and Fault Tolerance

Taverna does not own many of the bioinformatics

services it provides. This means that it cannot control

their reliability. Instead, Taverna provides strategies for

dealing with services being unavailable

• Reload the ‘convertedEMBOSSTutorial.xml’ from the

‘examples’ directory.

• Look at the metadata for the ‘emma’ service. It is an

implementation of clustalw

• Find the DDBJ clustalw service


Substituting Services

• Right-click on the ‘analyzeSimple’ part of DDBJ clustalw

service and select ‘add as alternate’

• In the resulting menu select ‘emma’

• The DDBJ version of the clustalw service is now added

as an alternative to emma in the AME. It will be called

‘alternate1’

• Select ‘alternate1’ and look at the inputs and outputs.

These need to be mapped to the correct inputs and

outputs in emma

Substituting Services

• Right-click on the ‘query’ input in alternate1 and map it to

‘sequence_direct_data’. In both services, these inputs

expect a set of fasta sequences.

• Right-click on the ‘result’ output and map it to ‘outseq’ in

emma in the same way.

• Now you have a workflow which will run using emma

when it is available – but will substitute DDBJ

clustalw for it if emma fails!
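The behaviour of an alternate can be summarised as "try the primary service, and only if it fails call the substitute". The following is a minimal Python sketch of that idea, not Taverna's implementation; `emma` and `ddbj_clustalw` here are stand-in functions invented for the example, and the primary is made to fail on purpose.

```python
def run_with_alternate(primary, alternate, data):
    """Call the primary service; fall back to the alternate if it fails."""
    try:
        return primary(data)
    except Exception:
        return alternate(data)


def emma(sequences):
    # Stand-in for the primary service, simulated as unavailable
    raise RuntimeError("emma is unavailable")


def ddbj_clustalw(sequences):
    # Stand-in for the alternate service
    return "aligned:" + ",".join(sequences)


result = run_with_alternate(emma, ddbj_clustalw, ["seq1", "seq2"])
print(result)
```

The input and output mapping done in the exercise corresponds to making sure both functions accept and return the same kind of data.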

Fault Tolerance

Taverna also allows the user to specify the number of

times a service is retried before it is considered to have

failed. Sometimes network traffic is heavy, so a working service

needs to be retried

• Select ‘tmap’ from the same workflow. To the right of the

service name are a series of 0s and 1s. By simply typing the

numbers, the user can specify the number of retries and the time

between the retries

• Change it to 3 retries for ‘tmap’ and set the status to

‘critical’ using the final tickbox. Now that it is critical, the

whole workflow will be aborted if ‘tmap’ fails after 3 retries. Failures

in non-critical services will not abort the workflow run.
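The retry and ‘critical’ settings can be sketched as a small loop. This is an illustration of the idea in Python under assumed names (`invoke`, `WorkflowAborted`, `flaky_tmap`), not the code Taverna runs; the stand-in service fails twice and then succeeds.

```python
import time


class WorkflowAborted(Exception):
    """Raised when a critical service still fails after all retries."""


def invoke(service, data, retries=0, delay_seconds=0.0, critical=False):
    last_error = None
    for attempt in range(retries + 1):      # first call plus the retries
        try:
            return service(data)
        except Exception as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay_seconds)   # wait before the next retry
    if critical:
        raise WorkflowAborted(f"service failed after {retries} retries") from last_error
    return None                             # non-critical: the run carries on


calls = {"count": 0}


def flaky_tmap(data):
    # Stand-in service: fails on the first two calls, then works
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network busy")
    return f"tmap result for {data}"


result = invoke(flaky_tmap, "seq1", retries=3, delay_seconds=0.0, critical=True)
print(result, "after", calls["count"], "calls")
```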


FETA Semantic Discovery

Semantic Service Discovery

with Feta

Pinar ALPER

University of Manchester

[email protected]

Outline

• Problem Definition

• myGrid Domain Ontology

• Discovery Dimensions

– Demo Discovery

– How does discovery work – Architectural View

• Annotation & Publication Process

– Demo Annotation

– How does Ann. & Pub. work – Architectural View

• Futures

Problem Characteristics

• Large number of services, 1800+

• No real description of capabilities

• A common abstraction: “Processor”

• Users do the selection

myGrid Ontology

• Aims to capture domain knowledge

– Adopts the common abstraction

– Defines a set of dimensions with which service

capabilities can be defined

– Defines domain types for objects passed around in a

coarse grained manner

• Two forms exist

– OWL used for development

– RDF(S) – read-only exports used during publication and discovery

• Similar to GO in terms of expressivity; used in Feta


myGrid Ontology (Closer Look)

[Figure: structure of the myGrid ontology. Component ontologies: Upper level, Task, Informatics, Molecular Biology, Publishing, Organisation, Bioinformatics and Web Service ontology, linked by ‘Specialises’ and ‘Contributes to’ relations. Example data concepts: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature. Example service concepts: Similarity Search Service, BLAST service, BLASTp service, InterProScan service.]

Questions we can ask

• Dimensions for service capabilities

• Find me services that,

– Accepts input of domain type X or more general

• DNA_sequence would include biological_sequence

• E.g. BLAST service,

– Produces output of domain type Y or more specific

– Performs task T or more specific

• Alignment would include local_alignment

• E.g. BLAST, FASTA

Questions we can ask (contd.)

– Uses resource R or more specific

• Databases such as GENBANK, dbEST Database

• E.g Distinguish between BLAST over different DBs

– Uses method M or more specific

• Alignment algorithm will include Smith_Watermann algorithm

– Is part of Application A or more specific

• EMBOSS

• NCBI BLAST directly or BLAST through EMBOSS

– Uses Resources with content of

• Human Genome, Mouse Genome..

• DEMO
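The “X or more general” style of question can be sketched with a small parent table. This is an illustration in Python, not the Feta query engine; the exact parent links between the concepts and the annotation given to each service are assumptions made for the example, and ‘hypothetical DNA tool’ is an invented service.

```python
# Assumed is-a links between domain types (child -> parent)
PARENT = {
    "biological_sequence": "sequence",
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "DNA_sequence": "nucleotide_sequence",
}

# Assumed annotations: the input type each service declares
SERVICE_INPUT = {
    "BLAST service": "biological_sequence",
    "BLASTp service": "protein_sequence",
    "hypothetical DNA tool": "DNA_sequence",
}


def self_and_ancestors(term):
    """Return the term followed by all of its more general terms."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def services_accepting(input_type):
    """Services whose declared input is input_type or a more general type."""
    acceptable = set(self_and_ancestors(input_type))
    return sorted(name for name, declared in SERVICE_INPUT.items()
                  if declared in acceptable)


print(services_accepting("DNA_sequence"))
print(services_accepting("protein_sequence"))
```

A query for DNA_sequence therefore also returns services annotated with the more general biological_sequence, but not those annotated with protein_sequence.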

Feta Architecture (discovery focused)

[Figure: Feta discovery architecture. Components: User, Taverna Workbench with Feta GUI Client, Feta Engine, Service Classification (in RDF(S)), Feta Descriptions. Labelled steps: Semantic Discovery, Obtain descriptions, Obtain Classification.]


Generating/Publishing Feta Descriptions

• Feta Import - Not starting from scratch

– Import supported for all processors

– Comprehensive Import for

• BioMoby

• WSDL

• Soaplab

• Scufl processors

• Pedro Annotate – Bringing in ontology terms to the desc.

– Partly populated description is completed

– Ontology is used to mark-up

• Publish description

– Publish to Web accessible locations

• DEMO

Feta Architecture (publication focused)

[Figure: Feta publication architecture. Components: Service Provider, Taverna Workbench, Feta Importer (WSDL, SCUFL, SOAPLAB), Pedro Annotator, Feta Engine, Notify service, Service Classification (in RDF(S)), Feta Descriptions. Labelled steps: Publish service/workflow, Scavenge Interface Descriptions, Import, Annotate, Access Classification, Publish Description.]

Feta Futures

• Feedback is very valuable to us.

– Coverage of domain ontology

– Dimensions of search

– Modes of search

– Coverage of Feta Model

Access to Feta

• Included in myGrid 0.6 beta 2 release

– http://sourceforge.net/projects/mygrid-uk

– cvs.mygrid.org.uk/

• Included in Taverna Workbench 1.3 release

• User Guide within Taverna User Manual

• Contact [email protected]


Shim Services

Taverna / Web Services

• Taverna is a good way to access lots of

globally distributed services out there on

the Web

• This saves you the considerable trouble

of installing stuff (e.g. databases, tools

etc) on your personal machine or

laboratory server


Lots of services…

• At the last count, around 1,800 services

• Provided by third parties:

– EMBOSS / SOAPlab - UK

– BioMOBY - Canada/Germany/Australia etc.

– BioMART / EBI - UK

– NCBI and PathPort - US

– KEGG - Japan

– SeqHound / BIND - Canada

– Probably more to come


But there’s a price to pay

• Lots of services don’t match: A into B doesn’t go

• Lots of services are poorly described


e.g. 1

[Figure: labels ‘UniProt record’, ‘protein_sequence’, ‘hasPart’, ‘Mismatch’, ‘Shim’, ‘Match’.]

e.g. 2: syntax

[Figure: labels ‘DNA fasta’, ‘DNA embl’, ‘Mismatch’, ‘Shim’, ‘Match’.]

e.g. 3 (today)

• Shim service removes redundant GO identifiers from the list produced by GetXChromFunctions
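A shim of this kind does very little work. As a rough Python sketch of the idea (the real shim is a service used from Taverna, and its exact behaviour is not shown in this handout), removing repeated identifiers while keeping the first occurrence of each could look like this; the function name and the sample list are assumptions for illustration.

```python
def remove_redundant(identifiers):
    """Drop repeated identifiers, keeping the first occurrence of each."""
    # dict.fromkeys keeps insertion order, so the original order survives
    return list(dict.fromkeys(identifiers))


# Sample input made up for this example
go_ids = ["GO:0005524", "GO:0016020", "GO:0005524", "GO:0006810", "GO:0016020"]

unique_ids = remove_redundant(go_ids)
print(unique_ids)
```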

What if the service doesn’t exist?

• If you’re lucky someone will have already created a service that does what you want and will have described/annotated it in a way that lets you find it

• … if no service exists, one way to solve the problem is to create a lightweight BeanShell script, a bit like JavaScript, that runs on your local machine (not on a server like the services do)

• These come in two flavours: ready-made “Local Java Widgets” and roll-your-own “Local Services”. See Exercise…

Shim Services

This exercise highlights the services that do not perform biological functions, but are vital for running life science workflows

Finding Genes

• Load the workflow entitled genscan_shim_example.xml from the page http://www.cs.man.ac.uk/~katy/taverna

• Look at the workflow metadata – what does the workflow do?

• Run the workflow.

• For an input file, load example_input.txt from the same web page

What happens? Did all the services return results? Why did some fail?

Finding Genes

• Load the workflow entitled genscan_shim_example2.xml from the page http://www.cs.man.ac.uk/~katy/taverna

• Look at the workflow metadata – what does the workflow do? How is it different from the previous one?

• Run the workflow (using the same input) – what happens this time?

Genscansplitter is a shim service – it performs no biological function, it simply parses a results file.

Other Shims

• There are many myGrid shim services. These are currently being described in a shim library, but for now, a small collection is documented here: http://www.cs.man.ac.uk/~hulld/shims.html

From the list,

• Find a shim that will return a GenBank DNA file from an id. Load the example workflow and run it in Taverna

• Find a shim that will translate DNA

Other Shims

• Reload the Probeset_id_2_Swissport_id.xml workflow from http://www.cs.man.ac.uk/~katy/taverna/

This workflow contains several shims. The ‘split_by_regex’ is a commonly useful shim. If a service produces a single string, that string may actually contain several individual results.

• Find out how the string is split in this example

Another shim example is the beanshell script.

• Right-click on the brown ‘parse_swiss’ beanshell service and select ‘configure beanshell’. See if you can work out what it does
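The idea behind a split-by-regex shim can be shown in a few lines. This is a Python sketch of the technique rather than the Taverna local service itself (whose regular expression is for you to discover in the exercise); the function name, the sample strings and the patterns are assumptions.

```python
import re


def split_by_regex(text, pattern):
    """Split one string into a list of results, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


# A single string that really holds three results, one per line (made-up values)
single_string = "P12345\nQ67890\n\nO11111\n"

ids = split_by_regex(single_string, r"\n+")
fields = split_by_regex("a, b,c", r",\s*")
print(ids)
print(fields)
```

Once the string has become a list, the implicit iteration described earlier can pass each item to the next service in turn.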

Other Shims

• The EMBOSS suite of programs has a subdivision – edit

• All the edit services are shims

• Experiment with the edit services

• Find a service that will remove gaps from sequences

Case Study: Building Workflows for Environmental Genomics



Aims

During the course so far you have used Taverna to build workflows in a detailed step-by-

step process; however when it comes to building your own workflows from scratch it can

be a daunting process. In this exercise you will build a simple biological workflow in the

environmental genomics discipline. The aim is to build on your understanding from

previous exercises and to show the importance of the reuse of (sub)workflows – after all

there is no point re-inventing the wheel!

Scenario

Chen K, Pachter L. Bioinformatics for whole-genome shotgun sequencing of microbial

communities. PLoS Comput Biol. Jul;1(2):106-12 (2005).

Baliga,N.S., Bonneau,R., Facciotti,M.T., Pan,M., Glusman,G., Deutsch,E.W.,

Shannon,P., Chiu,Y., Weng,R.S., Gan,R.R., Hung,P., Date,S.V., Marcotte,E., Hood,L.

and Ng,W.V. Genome sequence of Haloarcula marismortui: A halophilic archaeon from

the Dead Sea. Genome Res. 14 (11), 2221-2234 (2004)

The genomic sequence of Haloarcula marismortui, a halophilic archaeon from the Dead

Sea, was completed in 2004. Halophilic archaeal organisms thrive in extreme

environments of ~4-5 molar salt (an environment found in the Dead Sea). This genome

was sequenced as it is of some interest how protein structures as well as metabolic

strategies and physiologic responses might be altered in this high-salt environment.

Previously sequenced Halobacterium species revealed several interesting physical

adaptations and this recent sequencing project hopes to shed light on common features

shared by all halophiles.

Of some interest was that the species encodes a photobiology response: opsin proteins use

light energy to maintain physiological ion concentrations, facilitate phototaxis, and

generate chemical energy in the form of a proton gradient. They wanted to see whether


all opsins (Sop1, Sop2, Hop and Bop) were shared between the two species. From the

genomic sequence, you are to see whether H. marismortui shares one of these

opsins with other Halobacterium and other species.

Using the methodologies outlined from the above papers you are to build a workflow that

annotates a piece of genomic sequence that has come from the above paper. A couple

of steps have already been carried out – I have searched the genomic sequence for

genes and one of them is to be annotated. It is usual when building a workflow to first

note the individual steps of analysis that are to be carried out in the workflow. As some of

you may not be familiar with the steps needed to annotate a DNA sequence, the steps

are detailed below.


Following this rough draft of analysis it is possible for you to build a workflow using

Taverna.

At each stage of the flow diagram you will be able to find a service, and for each service you should be

able to define the inputs and outputs. Remember that the format of these is

important, as the service will fail if it is incorrect – this case study relies on services already

written for us to convert formats or parse results. You may find at each step that there

are multiple services available – feel free to use whichever one you please, but I

would recommend using services that are easiest and simplest to use (e.g., from DDBJ,

EBI, SeqHound). Also, always make sure that your data has somewhere to go – either

into another process or an output! If you don’t do this the workflow will fail!

Exercise: reusing workflows

If you study the diagram above carefully – what do you notice? Is there anything that

overlaps with workflows built earlier today?

Even though the workflow repository is in its infancy, many workflows do exist that may

overlap with the workflow that you are attempting to build. When building small workflows it may be

easier to build them yourself from scratch; on the other hand, it is

worth having a look to see whether a workflow, or part of one, already exists that is

suitable for your own analysis, especially if you are hoping to annotate a sequence.

In this case, you can see that we could use the getFASTA and SearchSimple services

that we implemented earlier today. If you have kept that workflow open then you could

modify that, otherwise load the workflow from this website:

This can be done by clicking on the ‘Load from web’ button in the AME window.

Exercise: modification of workflow

In the first workflow, a GI number is used to get a protein sequence, which is then used

to search the SWISSPROT database. In this analysis we are going to start with the GI

number that represents a DNA sequence: <given on day>. For us to be able to annotate

this we need to translate it into a protein sequence. Find the service that translates a

DNA sequence and add it to the model – remember to look at all inputs required! For the

inputs you will also have to add the translation table as 11 and the frame as +2. You

may want to add these as String Constants rather than as workflow inputs.
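To see what the frame setting means, the sketch below translates a DNA string in a chosen forward frame. It is a simplified Python illustration, not the translation service used in the exercise. It relies on the fact that NCBI translation table 11 assigns the same amino acids to codons as the standard table (the two differ in which codons may act as start codons), so only the codon-to-amino-acid mapping is modelled. The function name and the short test sequence are assumptions.

```python
BASES = "TCAG"
# Amino acids for codons in NCBI order (TTT, TTC, TTA, TTG, TCT, ...)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, frame=1):
    """Translate the forward strand; frame 1, 2 or 3 picks the start offset."""
    if frame not in (1, 2, 3):
        raise ValueError("this sketch handles forward frames 1 to 3 only")
    seq = dna.upper()
    protein = []
    for start in range(frame - 1, len(seq) - 2, 3):
        # Unknown or ambiguous codons are reported as X
        protein.append(CODON_TABLE.get(seq[start:start + 3], "X"))
    return "".join(protein)


example = "AATGGCCTAA"  # made-up sequence
print(translate(example, frame=1))
print(translate(example, frame=2))  # frame +2 skips the first base
```

Choosing the wrong frame gives a completely different protein, which is why the frame has to be supplied as an explicit input.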

Exercise: adding alternative services

BLAST report simplification

A BLAST search identifies sequences that are similar to your query sequence and share

a common ancestor (homologues). The hits reported back suggest the potential function of your

query sequence, but for further analysis all these sequences need to be collected together

and analysed. Reported sequences are now going to be used to construct a multiple

sequence alignment.

The first thing you need to do is to get all the GI accession numbers from the BLAST

report – as no MSA service will take a BLAST report as input! A service at Manchester

has already been written to do this. First of all check whether the following address can

be seen on the list of available services; if not, ‘add a SOAPLAB scavenger’ using this

address:

http://phoebus.cs.man.ac.uk:1977/axis/services.

Look under the seq_analysis services added from the above address and add the

service that allows you to extract information from a BLAST output. (Hint: it will require

the BLAST report as one input and ‘gi’ for the other).
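What such an extraction step does can be sketched with a regular expression. The Python below is an illustration only, not the Manchester service; it assumes the report contains identifiers in the `gi|number|` style, and the report text, accession numbers and function name are all made up for the example.

```python
import re

GI_PATTERN = re.compile(r"gi\|(\d+)")


def extract_gi_numbers(report):
    """Return the GI numbers found in a report, without repeats, in order."""
    return list(dict.fromkeys(GI_PATTERN.findall(report)))


# Made-up fragment standing in for a BLAST report
report = """Sequences producing significant alignments:
gi|1111111|sp|P00001|EXAMPLE1   hypothetical protein 1    200  1e-50
gi|2222222|sp|P00002|EXAMPLE2   hypothetical protein 2    150  3e-35

>gi|1111111|sp|P00001|EXAMPLE1 hypothetical protein 1
"""

gi_numbers = extract_gi_numbers(report)
print(gi_numbers)
```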

Multiple Sequence Alignment

A program that is often used in bioinformatics is ClustalW – search for ClustalW in the

service list. You may have noticed that one of the services highlighted is the Biomoby

service – why cannot we use this? Are there any other services? Did you find a service

that can take multiple identifiers and align the sequences they correspond to?

The services reported back when searching for ClustalW are few and many are Biomoby

services, which in this instance we cannot use without conversion into many objects. A


service that actually is ClustalW is called ‘emma’. Search for this and add it to your model

– at this stage you will not be able to add any inputs.

It is likely that you did not find a service that will take identifiers – all services take

multiple sequence strings. If you want to align your sequences using ClustalW you need

some way of getting the sequences for the list of GI numbers that have resulted from

your BLAST search. To do this, can you use a service you have already used? Or is there

an alternative? Again, it is not obvious! A service that can help you do this is seqret.

However, you will not be able to use this service on its own – you also need to use one

that splits up your sequence identifiers into a list.

Add all the services required to align the sequences to the model. (Hint: you may need

to use ‘split string into string list by regular expression’).

Visualising a Multiple Sequence Alignment

More often than not we want to visualise results. Search for a service that displays your

aligned sequences. You may wish to use this website to find a service that will help

you: http://bioweb.pasteur.fr/docs/EMBOSS/

Add the service to your model. You may wish to investigate the other input options other

than just the aligned sequences.

Exercise: finishing the workflow

The final steps of the workflow are to construct a phylogenetic tree. This consists of two

steps: the construction of a distance matrix and the drawing of the tree using that

distance matrix. Can you find the services that carry out these steps and add them to

your workflow?
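To make the first of these two steps concrete, the sketch below computes a very simple distance matrix from aligned sequences: the proportion of differing positions for each pair, ignoring gapped columns. It is a Python illustration of the concept only; the services you find may use other distance measures, and the function names and the toy alignment are assumptions.

```python
def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(aligned):
    """Map every (name, name) pair to its p-distance."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}


# Toy alignment made up for this example
alignment = {"a": "ACGT-A", "b": "ACGTTA", "c": "ACCTTA"}

matrix = distance_matrix(alignment)
for pair, distance in matrix.items():
    print(pair, round(distance, 3))
```

A tree-drawing step would then take a matrix like this as its input.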


Provenance Management


Provenance in Taverna

Provenance

• Overview of Provenance.

• Why it’s important in the context of Taverna & Workflows.

• Demo of the (Under Development) Provenance Browser.

• Coming Soon… to the Browser.

• Your Feedback/Ideas!?

What is Provenance Data?

• Metadata about the origin of a resource (workflow, service, data, experiment hypothesis etc) and the process of how a resource was generated.

• The Who?, What?, When?, Where? and Why? about resources.

[Figure: labels ‘Provenance Record’, ‘Input’, ‘Result’ (×5).]

Provenance Pyramid

[Figure: pyramid with four levels: Knowledge Level, Organization Level, Data Level, Process Level.]

Stored in Triples: <Subject, Predicate, Object>

4 Levels/Views of provenance

Organization: <Experimenter; is_part_of; project>

Data: <Process; has_input; Data file>

Process: <workflow; runs_process; process>

Knowledge: the most interesting but also the most difficult to obtain; found through applying domain knowledge to other provenance data
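The triple form lends itself to simple pattern queries. The Python sketch below illustrates that idea with plain tuples; it is not how Taverna or the provenance browser stores or queries provenance, and the resource names (experimenter_1, workflow_1 and so on) and the `match` helper are invented for the example. Only the three predicates come from the slide.

```python
# Each triple is (subject, predicate, object); resource names are made up
triples = [
    ("experimenter_1", "is_part_of", "project_1"),    # organization level
    ("workflow_1", "runs_process", "process_1"),      # process level
    ("workflow_1", "runs_process", "process_2"),
    ("process_1", "has_input", "data_file_1"),        # data level
    ("process_2", "has_input", "data_file_1"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Which processes used data_file_1 as an input?
users = [s for s, _, _ in match(triples, predicate="has_input", obj="data_file_1")]
# Which processes did workflow_1 run?
processes = [o for _, _, o in match(triples, subject="workflow_1", predicate="runs_process")]
print(users)
print(processes)
```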

Provenance in Taverna

Currently:

• Targeting at recording machinery experiment process

• No support for data integration

• No support for using provenance

Aiming to address with provenance browser

Provenance in Taverna

Advantages to Using Provenance in Taverna:

• Retrieve results for a previously run workflow instantly rather than re-run.

– a large workflow may take hours!

• Can then Navigate & Validate Results

• Provides a backup mechanism, as workflows can be reconstructed from provenance data.

Provenance in Taverna

Cross-Referencing Provenance resources across experiments

• Where the power of provenance becomes apparent

• Share, discover, re-use resources: e.g. Compare different runs of the same workflow – infer any interesting similarities/differences

Provenance Browser

• Under construction!

• Provides quick access to provenance data for all your Workflow Runs

• Currently the browser only supports Data / Process / Organisation centric functionality

• Evolving daily, soon to provide an array of functionality!

Provenance Browser

Coming Soon…

• Workflow view on the provenance data.

• Ability to compare the runs of a Workflow.

• Filters: failed workflow runs, workflow runs with changed data

• Modify the Inputs and re-call the workflow
