V Jornada de Usuarios de la RES: Use of TBLASTX to find regions of homology among multiple large-size mammalian genomes. Francisco Câmara Ferreira, Bioinformatics & Genomics Unit (Roderic Guigó, CRG)

Post on 19-Aug-2020

Page 1: Use of TBLASTX to find regions of homology among multiple ... · Why TBLASTX? • SGP2 (Parra et al. 2003)• ab initio Geneid + sequence similarity search algorithm (TBLASTX) SGP2

“V Jornada de Usuarios de la RES”

Use of TBLASTX to find regions of homology

among multiple large-size mammalian genomes

Francisco Câmara Ferreira

Bioinformatics & Genomics Unit (Roderic Guigó, CRG)

Page 2

Why TBLASTX?

• SGP2 (Parra et al. 2003)

• ab initio Geneid + sequence similarity search algorithm (TBLASTX)

WHAT IS SGP2?

SGP2 is a comparative gene prediction tool: QUERY sequences from one genome (e.g. H. sapiens) are compared, using TBLASTX, against a collection of sequences from a second TARGET (REFERENCE; e.g. M. musculus) genome. The high-scoring segment pairs (“HSPs”) that result from the comparison are used to modify the scores of the exons produced by the underlying ab initio gene prediction tool GENEID.
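The idea of letting HSPs modify exon scores can be illustrated with a minimal Python sketch. It is not the SGP2 scoring scheme, which is not given in these slides: the linear bonus, the `weight` parameter, and the `Exon`/`HSP` classes are all hypothetical, and reading frame and strand are ignored.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # ab initio score, e.g. as produced by a tool like Geneid


@dataclass
class HSP:
    start: int    # 1-based, inclusive, on the query sequence
    end: int
    score: float  # similarity score of the segment pair


def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def rescore(exon, hsps, weight=0.01):
    """Return the exon score plus a bonus for HSP support.

    Assumption for illustration only: each HSP contributes its score,
    scaled by the fraction of the HSP that overlaps the exon, and the
    total is multiplied by an arbitrary `weight`.
    """
    bonus = 0.0
    for hsp in hsps:
        shared = overlap(exon.start, exon.end, hsp.start, hsp.end)
        hsp_length = hsp.end - hsp.start + 1
        bonus += hsp.score * shared / hsp_length
    return exon.score + weight * bonus
```

An exon with no overlapping HSP keeps its ab initio score, while an exon supported by conserved sequence is pushed up, which is the qualitative behaviour the slide describes.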

Page 3

Geneid:

• Geneid is a protein-coding gene prediction tool that can be optimized for prediction in different species.

• Geneid follows a hierarchical structure: signal -> exon -> gene

• Exon score: score of exon-defining signals + protein-coding potential

• Dynamic programming algorithm: maximize score of assembled exons -> assembled gene

SGP2
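The dynamic programming step (assemble the set of exons with the maximum total score) can be sketched as a weighted interval chaining problem. This is a simplified illustration, not Geneid's actual algorithm: the function name `best_gene` is hypothetical, and the only compatibility rule assumed here is that exons must not overlap, whereas a real gene assembler also has to respect reading frame, strand, and exon types.

```python
from bisect import bisect_left


def best_gene(exons):
    """Pick the non-overlapping chain of exons with the highest total score.

    exons: list of (start, end, score) tuples with inclusive coordinates.
    Returns (total_score, chain), where chain is ordered by position.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimum over the first i exons (in end order).
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_left(ends, start, 0, i)
        take = best[j][0] + score
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]
```

For example, given candidates `(1, 100, 3.0)`, `(50, 150, 5.0)`, `(120, 200, 4.0)` and `(160, 300, 2.0)`, the sketch selects the first and third exons for a total score of 7.0, rather than the single highest-scoring exon.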

Page 4

TBLASTX as a gene-prediction tool

Coding sequences evolve slowly compared to surrounding DNA

“Proper” evolutionary distance?

TBLASTX CHR5_1_5000000 CHR1_mm -hspmax=0 -gspmax=0 W=5 E=0.01 E2=0.01 -nogap -filter=xnu+seg S2=80 -matrix=blosum62 -altscore="* any -999" -altscore="any * -999"

TBLASTX is a computationally expensive “flavour” of BLAST

6-frame translation of query/target
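The 6-frame translation that makes TBLASTX expensive can be sketched in a few lines of Python: each nucleotide sequence yields three protein translations on the forward strand and three on the reverse complement, so every query/target comparison becomes 6 x 6 protein comparisons. The function names are illustrative, the standard genetic code is assumed, and codons containing anything other than A, C, G or T are translated as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

For the toy sequence `ATGGCC`, frame `+1` reads `MA` and frame `-1` (from the reverse complement `GGCCAT`) reads `GH`.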

Page 5

Why MareNostrum?

• H. sapiens vs. M. musculus

• 7-10 days on a 20-25 CPU grid

• 12-13 hours on 256 CPUs

• Multiple genomes compared concurrently

PARALLELIZATION!

LARGE SIZE OF MAMMALIAN GENOMES (e.g. Human & Cow ~3 Gbases, Mouse 2.5 Gbases…)
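A rough back-of-the-envelope comparison of the two timings quoted above, as a Python sketch. It assumes that the CPUs of the two systems are broadly comparable and that the quoted ranges are wall-clock times, neither of which is stated in the slides, so the numbers are only indicative.

```python
def cpu_hours(wall_hours, cpus):
    """Total CPU time consumed by a run, in CPU-hours."""
    return wall_hours * cpus


def timing_summary():
    """Low/high estimates derived from the ranges quoted in the slide."""
    grid = (cpu_hours(7 * 24, 20), cpu_hours(10 * 24, 25))   # 7-10 days, 20-25 CPUs
    marenostrum = (cpu_hours(12, 256), cpu_hours(13, 256))   # 12-13 hours, 256 CPUs
    speedup = ((7 * 24) / 13, (10 * 24) / 12)                # wall-clock ratio
    return grid, marenostrum, speedup
```

Under these assumptions the local grid run costs about 3,360 to 6,000 CPU-hours and the 256-CPU run about 3,072 to 3,328 CPU-hours, so the total work is of a similar order while the wall-clock time drops by a factor of roughly 13 to 20.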

Page 6

Strategy for MN TBLASTX:

• Fragment “query” genome:

• H. sapiens genome: >650 5-Mbase fragments

• Reference genome divided into 10-Mbase fragments (internally)

• 22 chromosomes for M. musculus
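The fragmentation strategy above can be sketched as follows. The naming scheme `CHROM_start_end` is inferred from the `CHR5_1_5000000` argument in the TBLASTX command line shown earlier, and the all-against-all pairing of query fragments with reference chromosomes is an assumption; the actual MareNostrum pipeline is not described in detail in these slides, and both function names are hypothetical.

```python
def fragment_names(chrom, length, size=5_000_000):
    """Split a chromosome of `length` bases into consecutive fragments.

    Coordinates are 1-based and inclusive; the last fragment may be shorter.
    """
    names = []
    for start in range(1, length + 1, size):
        end = min(start + size - 1, length)
        names.append(f"{chrom}_{start}_{end}")
    return names


def job_pairs(query_fragments, target_chromosomes):
    """One independent TBLASTX job per (query fragment, target chromosome)."""
    return [(q, t) for q in query_fragments for t in target_chromosomes]
```

If every one of the more than 650 human fragments is compared against each of the 22 mouse sequences, this gives more than 14,300 independent jobs, which is what makes the problem easy to spread over many CPUs.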

Page 7

TBLASTX MN PIPELINE: David García Cortés/Xavier Pastor

Page 8

Significant publications (MN-derived)

Page 9

SGP2 importance as an annotation tool

Component of the comparative gene prediction pipelines used to annotate:

• Human (MN)

• Mouse

• Rat

• Cow (MN)

• Chicken

• Paramecium

• Also several species of insects and plants (Melon/Bean)

Page 10

UCSC Genome browser: http://www.genome.ucsc.edu

Page 11

GBL Web server: http://genome.crg.es/genepredictions/

Page 12

Acknowledgments

• BSC-CNS/U. de Cantabria

• Xavier Pastor

• David García Cortés

• Genis Parra/Josep Abril/Roderic Guigó (developers of SGP2)