G. Paolella Napoli, 21/2/ 2008 1

Progetto S.Co.P.E. – WP4

Bioinformatica nel progetto SCOPE

G. Paolella, M. Petrillo, L. Cozzuto, A. Boccia, C. Cantarella, L. Sepe

Our role within SCOPE

[Figure: layered view of the SCOPE infrastructure]
• Hardware: nodes
• Middleware: GRID software, high level middleware
• Application: SCOPE web site, with Astronomy, Chemistry, Physics and Bioinformatics

Tasks

• Provide a large number of users with general purpose bioinformatic services that take advantage of high performance hardware, allowing:
– Web access for quick operations, performed by the vast majority of users
– Unix level access in the form of an integrated problem solving environment

• Set up an automatic annotation system, based on the available services, to be used in specific computational or experimental projects. Two specific applications:
– CST analysis by comparative genomics
– Mining for regulatory RNA within completely sequenced genomes

Bioinfo portal

Available services

Programs

Graphic interface to programs

CAPRI workflow

Various operations in a row: DNA -> Complement -> Translation -> Isoelectric point of the resulting protein.
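The chain of operations named on this slide can be sketched in a few lines of Python. This is only an illustration, not the CAPRI implementation: the function names are hypothetical, "complement" is taken here to mean the reverse complement, and the pKa values are a generic textbook-style set chosen for the example.

```python
# Sketch of a "complement -> translation -> isoelectric point" chain.
# Assumptions: reverse complement is used, and the pKa table is illustrative.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG-ordered amino acid string.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Illustrative pKa values (assumed, not taken from the slides).
PKA_POS = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"Cterm": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate frame 1 up to the first stop codon; unknown codons become X."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def net_charge(protein: str, ph: float) -> float:
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))
    for aa in protein:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge


def isoelectric_point(protein: str) -> float:
    """Find the pH of zero net charge by bisection (charge decreases with pH)."""
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if net_charge(protein, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def workflow(dna: str):
    """Run the three steps in a row and return (protein, isoelectric point)."""
    protein = translate(reverse_complement(dna))
    return protein, isoelectric_point(protein)
```

Calling `workflow("TTATTTTTTCAT")` runs the three steps on a short made-up sequence; each step only consumes the output of the previous one, which is what makes this kind of chain easy to expose as a workflow.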

SRS: the database tool

SRS

Services organization

[Figure: organization of the services]
• Web server: CAPRI, SRS, PISE
• Programs: Emboss, Fasta, Blast, other
• Data: user data, DB, primary remote databases, ENSEMBL

Medicine peripheral site

Hardware attached to the system:
• 112-processor cluster
• Two 8-processor servers, several 2-processor servers
• Storage center (SCOPE)
• Campus GRID and beyond (SCOPE)

[Figure labels: Broker; virtual nodes; DB; Grid; nodes; low latency scheduler; high level scheduler; 500 tasks/sec; 20-50 ms delay]

Joining the GRID

Hardware attached to the system:
• 1 Computing Element (CE)
• 5 dual-processor Worker Nodes (WN), expandable up to 40
• 1 Storage Element (SE) with 50 GB
• 1 User Interface (UI)

GRID bioinformatic tools

Available at: lfn:/grid/scope/bioinfo/
– programs/ (executables)
– dbs/ (datasets)

Currently installed tools
• Blast
• Randfold
• Infernal package

Support databases
• RFAM
• Blast (human, rat, dog, chicken and macaque genomes)

Tools ready to be installed

• Blastz
• Clustalw
• Dialign-T
• Emboss package
• FASTA package
• Genscan
• Hmmer package
• MCL package
• Pcma
• Primer3
• RNAz
• Vienna package
• Multiz-tba

Two examples

Two examples of automatic annotation systems, used to identify and characterize DNA sequences with a possible functional role:

– small sequences conserved between human and other species (CSTs);

– sequences able to encode structured RNAs.

Objectives and methods

• Objective: an automatic sequence annotation system

• Motivation: computational analysis of non-coding sequences allows the identification of new functional elements

• Problem description and solution: several kinds of predictive tests are applied on a large scale to a large amount of data, extracted from publicly available databases or coming from experiments.

• Need for HPC: given the high number of tests, multiprocessor clusters are generally used. Using the GRID makes it possible to extend the analysis to even larger data sets.

• Description of the solution of the problem in the HPC environment

Identification of CSTs

Identification and characterization, in other species, of nucleotide sequences conserved between human and mouse (CSTs).

[Figure: CSTs shared between H. sapiens and M. musculus]

CSTs identified in disease-associated genes: 64,495.
Analysis to be carried out with BLAST against other genomes (rat, dog, monkey, chicken, etc.).

CST annotation

Annotation is carried out through a pipeline which goes through the various phases without requiring human assistance. Tasks requiring intensive CPU usage, such as BLAST homology search, are spread over several collaborating servers using a system specifically developed for load distribution and monitoring.

[Figure: CST annotation. CSTs are annotated by PHP scripts that query a DB and remote servers. Annotation fields, grouped by source:]
• ENSEMBL gene and gene structure data: chromosome position; type (i.e. intergenic, intronic, exonic, etc.); coding %; closest gene and relative distances; ...
• UCSC Log Score data: Max L-Score; Avg L-Score; ...
• BLAST: matches with EST, other genomes, proteins (BlastX)
• Repeat Masker: repeats type; repeats %
• CPS: Coding Potential Score
• PHP scripts: redundancy; overlapping; ...

DG-CST

1022 genes related to genetically transmitted disease

KinWeb

500 genes coding for human protein kinases

KinWeb DB

[Figure: panels (a) to (e)]

BLAST

• Executable submitted from a local program repository
• Genomic data libraries stored on the local SE and registered on the central SE scopelfc01.dsf.unina.it:/grid/scope/bioinfo

• Example: Blast of the 65597 CSTs against the dog, chicken, monkey and rat genomes.
– Number of jobs submitted: 67
– Input sequence group: 1000 sequences
– Total execution time of the 67 jobs: 4 hours
– Average time per job: 18 minutes (2 spent downloading the dataset).

• CPU time
– Search of 1 sequence in the mouse genome => 5 sec.
– 64,495 sequences => 3.75 days
– 10 genomes => 37.5 days
• MPIBLAST (installed only)
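The grouping of input sequences into fixed-size jobs described on this slide can be sketched as follows. This is a minimal illustration under assumptions: the function names and the sequence identifiers are hypothetical, and the actual submission scripts are not shown in the slides.

```python
# Sketch: split a list of sequences into fixed-size groups, one group per job.
# Names and identifiers are hypothetical; only the group size comes from the slide.
import math

GROUP_SIZE = 1000  # sequences per job, as reported on the slide


def batches(items, size=GROUP_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def job_count(n_items, size=GROUP_SIZE):
    """Number of jobs needed to cover n_items sequences."""
    return math.ceil(n_items / size)


if __name__ == "__main__":
    sequence_ids = [f"seq_{i}" for i in range(2500)]  # made-up identifiers
    for job_number, group in enumerate(batches(sequence_ids), start=1):
        print(job_number, len(group))
```

Each group can then be written to its own input file and submitted as an independent job, since a search on one sequence does not depend on any other.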

Bacterial SLSs

Pae-1 (Pseudomonas aeruginosa)
Eric (Escherichia coli)

Secondary structure search

Identification and characterization, in bacterial genomes, of families of repeated sequences that share a conserved secondary structure.

Analysis to be carried out with INFERNAL on more than 300 bacterial genomes.

Example:
Search for one family in one genome => 6 hours.
Search for 50 families in one genome => 12.5 days
Search for 50 families in 300 genomes => 10 years
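The estimate on this slide follows from treating every (family, genome) pair as an independent search. A minimal sketch of that decomposition is given below; the function names and the worker count used in the example are assumptions, and only the 6 hours per search comes from the slide.

```python
# Sketch: every (family, genome) pair is one independent search job.
# Only HOURS_PER_SEARCH comes from the slide; everything else is illustrative.
from itertools import product

HOURS_PER_SEARCH = 6  # one family against one genome


def job_matrix(families, genomes):
    """Enumerate all (family, genome) pairs to be searched."""
    return list(product(families, genomes))


def serial_days(n_jobs, hours_per_job=HOURS_PER_SEARCH):
    """Days needed when jobs run one after another on a single CPU."""
    return n_jobs * hours_per_job / 24


def ideal_wall_days(n_jobs, n_workers, hours_per_job=HOURS_PER_SEARCH):
    """Best-case days on n_workers CPUs, ignoring scheduling and transfer time."""
    rounds = -(-n_jobs // n_workers)  # ceiling division
    return rounds * hours_per_job / 24


if __name__ == "__main__":
    jobs = job_matrix(range(50), range(300))
    print(len(jobs), serial_days(len(jobs)), serial_days(len(jobs)) / 365)
    print(ideal_wall_days(len(jobs), n_workers=100))  # hypothetical worker count
```

With 50 families and 300 genomes this gives 15,000 jobs and 3,750 days of serial time, which is the "10 years" order of magnitude quoted above.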

Folding of the human genome

Aim: find potential regulatory sequences acting as structured RNAs.

Pilot project: analyses carried out on chromosome 21.

[Figure labels: DNA; mRNA; protein; structured RNA]

Genome plan

Chromosome length: 46,944,323 bp

Transcriptome length: 14,609,025 bp

Potentially transcribed sequences have been split into overlapping fragments 150 bp long.

Fragments: 290,904 sequences

Total length: 43,726,912 bp
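Splitting a transcribed region into overlapping fixed-length fragments can be sketched as below. The fragment length of 150 bp comes from the slide; the step between fragment starts is not stated there, so the value of 50 used here is purely an assumption, as is the function name.

```python
# Sketch: cut a sequence into overlapping windows of fixed length.
# size=150 is from the slide; step=50 is an assumed value for illustration.

def overlapping_fragments(sequence, size=150, step=50):
    """Return (start, fragment) pairs covering the whole sequence."""
    if not sequence:
        return []
    if len(sequence) <= size:
        return [(0, sequence)]
    starts = list(range(0, len(sequence) - size + 1, step))
    # Add a final window flush with the end if the regular grid misses it.
    if starts[-1] + size < len(sequence):
        starts.append(len(sequence) - size)
    return [(s, sequence[s:s + size]) for s in starts]


if __name__ == "__main__":
    toy = "ACGT" * 65  # 260 bp made-up sequence
    for start, fragment in overlapping_fragments(toy):
        print(start, len(fragment))
```

Because consecutive windows overlap, the summed fragment length is larger than the input length, which is consistent with the total fragment length on the slide exceeding the transcriptome length.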

Results

Length: 46,944,323 bp

Total genes: 392
> miRNA genes: 10
> rRNA genes: 3
> snRNA genes: 7
> snoRNA genes: 8
> miscRNA: 8

Found known RNAs: 9

Transcriptome length: 14,609,025 bp

Potentially transcribed sequences have been split into overlapping fragments 150 bp long: 290,904 sequences

Evaluation of the results obtained

• RANDFOLD
– randfold program
– Executable submitted from a local repository of bioinformatics programs
– Input sequence group: 2500 sequences from transcribed regions of chr 21
– Number of jobs submitted: 117

• CPU time required
– Sequences derived from the genes of chromosome 21: 291,589
– Prediction on 1 sequence => 45 sec.
– 291,589 sequences => 152 days.

How long?

Nodes    Sequences    Seconds       Days
1        1            45            0
1        291,589      13,121,505    152
117      2,500        112,500       1.3

About 3 days
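The rows of this table can be reproduced from the 45 seconds per sequence reported for randfold. The sketch below is only a check of the arithmetic; the function names are hypothetical, and the per-job figure is a best case that ignores queueing and data transfer.

```python
# Sketch: recompute the CPU time figures from 45 s per sequence.
# Function names are illustrative; the input numbers come from the slides.

SECONDS_PER_SEQUENCE = 45
SECONDS_PER_DAY = 86400


def cpu_seconds(n_sequences, seconds_per_sequence=SECONDS_PER_SEQUENCE):
    """Total CPU seconds for n_sequences predictions run one after another."""
    return n_sequences * seconds_per_sequence


def timing_rows(rows):
    """For each (nodes, sequences per node) pair return seconds and days per node."""
    out = []
    for nodes, n_sequences in rows:
        seconds = cpu_seconds(n_sequences)
        out.append((nodes, n_sequences, seconds, round(seconds / SECONDS_PER_DAY, 1)))
    return out


if __name__ == "__main__":
    for row in timing_rows([(1, 1), (1, 291_589), (117, 2_500)]):
        print(row)
```

The last row corresponds to 117 jobs of 2,500 sequences each, enough to cover all 291,589 sequences; if every job ran at the same time, the whole set would take about as long as a single job.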

Performance

[Plot: time (sec, 0 to 60000) against proc numbers (0 to 1200), comparing single node and grid]

Some extra applications

High throughput sequencing

[Figure: assembly into contigs and scaffolds, followed by annotation; example features: geneA, tRNA, prom, oprA, oprB, gene cluster A]

Annotation Steps

• Identification of genes and other genetic elements.
• Protein functional annotation.
• Cellular process annotation.

• Identification of ORFs, tRNAs, rRNAs
• Scanning for signals, such as promoters and microRNAs
• Identification of operons and gene clusters
• Comparison with known genomes/proteins
• Identification of orthologs and paralogs
• Characterization of protein domains
• Reconstruction of complete metabolic pathways
• …

Annotation

The image processing system: IPROC

Image processing modules

[Figure: image in -> iProcStep -> image out. The iProcStep variants (iProcStepImageMagick, iProcStepPHP, iProcStepPerl, commandLine program) connect through adapters to the ImageMagick package, the PHP package, the Perl package and command line packages.]

IPROC architecture

[Figure labels: HPC on cluster nodes; gateway; iPage; image area; data + images; page; iPane (three instances); proc-steps]

Parallel processing

[Figure labels: IPROC; access servers (three instances); cluster; cluster nodes]

The group

Angelo Boccia
Gianluca Busiello
Mauro Petrillo
Concita Cantarella*
Luca Cozzuto
Leandra Sepe*

Vittorio Lucignano
Marisa Passaro

NIHRas, NIH3T3, NIHSrc wound healing

Three cell subpopulations: front, middle, and far from the wound

          SPEED (μm/40 min)        ANGLE (degree)             R (coeff)
          FRONT   MIDDLE  FAR      FRONT    MIDDLE   FAR      FRONT         MIDDLE        FAR
NIH3T3    7.27    4.92    5.25     194.95   181.82   212.62   0.85 (0.49)   0.47 (0.56)   0.09 (0.30)
NIHRas    11.57   6.88    7.57     160.74   188.16   87.6     0.83 (0.51)   0.59 (0.54)   0.34 (0.30)
NIHSrc    6.05    5.1     3.71     181.08   168.29   156.95   0.89 (0.48)   0.74 (0.49)   0.56 (0.35)

          SPEED (μm/40 min)   R       Average angle (degree)
NIH3T3    8.43                0.02    226.73
NIHRas    11.73               0.37    203.79
NIHSrc    4.87                0.24    251.07

Version number 1 features tab-delimited

Name filename

Depth size 16bit

wdim size 4 where files

cdim size 3 where files

pdim size n where files

tdim size n unit min scale 10 where files

ldim size n unit µm scale 0.4 where layers

[Figure: dimensions of an acquisition: times (Time 1 ... Time n), wells (well1 to well4), channels (Channel 1 to Channel 3), positions (Position 1 ... Position n), layers (l1 ... ln)]

File format

Data input: text description

Image processing

[Screenshot of IPROC, with callouts: acquisition parameters; buttons to slide the acquisition; image processing menus; info panel for each frame; hide/show control command]

Hierarchical node organization

[Figure labels: Broker; virtual nodes; DB; Grid; nodes]
