ortholog assignment

Computational Prediction of Orthologs Melvin Zhang School of Computing, National University of Singapore May 4, 2011

TRANSCRIPT

Page 1: Ortholog assignment

Computational Prediction of Orthologs

Melvin Zhang

School of Computing, National University of Singapore

May 4, 2011

Page 2: Ortholog assignment

A gene is a unit of heredity in a living organism

Page 3: Ortholog assignment

One gene may encode for multiple proteins

Page 4: Ortholog assignment

Two genes are homologous if they descended from a common ancestral gene [1]

In practice, homology is determined using sequence alignment.

Figure: A sequence alignment of two proteins

Have you seen phrases like “high homology”, “significant homology”, or “35% homology”?

[1] with respect to a specific speciation event

Page 7: Ortholog assignment

Orthologs are due to speciation, paralogs are due to duplication

Figure: gene tree rooted at the MRCA of G and H, with a speciation event and a duplication event; genes g, h, and h′ are labeled as main orthologs, orthologs, and paralogs.

Page 8: Ortholog assignment

Orthologs maintain their function

Annotate genes with unknown functions.

Infer protein-protein interactions.

Page 10: Ortholog assignment

Orthologs are not one-to-one due to lineage-specific gene duplications

Main orthologs are orthologs that have retained their ancestral position. [2]

Figure: gene tree rooted at the MRCA of G and H, with a speciation event and a duplication event; genes g, h, and h′ are labeled as main orthologs, orthologs, and paralogs.

[2] Burgetz et al., Evolutionary Bioinformatics 2006

Page 11: Ortholog assignment

Problem of identifying main orthologs

Input: Position and sequences of genes in 2 genomes

Output: For each gene in their common ancestor, find its direct descendant in G and H

Complications

- gene duplication
- gene loss
- horizontal gene transfer
- gene fusion, fission

Page 13: Ortholog assignment

Three main approaches for finding orthologs

Graph based, tree based, rearrangement based

Page 14: Ortholog assignment

Bidirectional Best Hit and variants

Most popular approach. High level of functional relatedness. [a]

Reciprocal smallest distance: use evolutionary distance estimate instead of BLAST scores

OMA stable pairs: introduce a tolerance interval and stable matching

[a] Altenhoff et al., PLoS CB 2009
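
The basic bidirectional best hit rule can be sketched in a few lines. This is a minimal illustration, not the code of any tool named on this slide: the function name, the score dictionary, and the toy gene names are all assumptions, and the scores stand in for whatever similarity measure is used (for example BLAST scores). Ties are broken in favor of the first pair seen.

```python
def bidirectional_best_hits(scores):
    """Return pairs (g, h) where h is g's best hit and g is h's best hit.

    scores maps (gene_in_G, gene_in_H) -> similarity, higher is better.
    """
    best_for_g = {}  # g -> (score, h)
    best_for_h = {}  # h -> (score, g)
    for (g, h), s in scores.items():
        if g not in best_for_g or s > best_for_g[g][0]:
            best_for_g[g] = (s, h)
        if h not in best_for_h or s > best_for_h[h][0]:
            best_for_h[h] = (s, g)
    # Keep a pair only when the best-hit relation holds in both directions.
    return sorted(
        (g, h) for g, (_, h) in best_for_g.items() if best_for_h[h][1] == g
    )


# Hypothetical toy scores between genes of G (g1..g3) and H (h1..h3).
toy_scores = {
    ("g1", "h1"): 90,
    ("g1", "h2"): 40,
    ("g2", "h1"): 95,
    ("g2", "h2"): 60,
    ("g3", "h3"): 70,
}
pairs = bidirectional_best_hits(toy_scores)
print(pairs)
```

In the toy data g1's best hit is h1, but h1's best hit is g2, so g1 stays unpaired. This is the kind of case that the variants listed above try to handle better.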

Page 15: Ortholog assignment

EnsemblCompara GeneTrees [3]

Figure: Species tree for 4 species on top gene tree for gene A

Based on reconciliation of gene trees with species tree.

1. Partition genes into families and construct gene trees

2. Reconcile each gene tree and species tree

[3] Vilella et al., Genome Res 2009
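
Step 2 can be illustrated with the textbook LCA-mapping rule: map each gene tree node to the lowest species tree node that contains all of its species, and call the node a duplication if it maps to the same place as one of its children. The sketch below assumes binary trees written as nested tuples and uses hypothetical names throughout. It is not the EnsemblCompara pipeline, which is considerably more involved.

```python
def leaves(tree):
    """Set of leaf labels under a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def lca(species_tree, species_set):
    """Smallest subtree of species_tree that contains every species in species_set."""
    if isinstance(species_tree, str):
        return species_tree
    for child in species_tree:
        if species_set <= leaves(child):
            return lca(child, species_set)
    return species_tree


def reconcile(gene_tree, species_tree, species_of, events):
    """Map gene_tree onto species_tree; append (genes below node, event) per internal node."""
    if isinstance(gene_tree, str):
        return lca(species_tree, {species_of[gene_tree]})
    child_maps = [reconcile(c, species_tree, species_of, events) for c in gene_tree]
    here = lca(species_tree, {species_of[g] for g in leaves(gene_tree)})
    # A node that maps to the same species node as a child is a duplication.
    kind = "duplication" if here in child_maps else "speciation"
    events.append((sorted(leaves(gene_tree)), kind))
    return here


# Toy instance: genomes G and H; gene g in G, genes h and h2 in H.
species_tree = ("G", "H")
gene_tree = ("g", ("h", "h2"))
species_of = {"g": "G", "h": "H", "h2": "H"}
events = []
reconcile(gene_tree, species_tree, species_of, events)
print(events)
```

Pairs of genes whose last common node in the gene tree is a speciation are orthologs; pairs that meet at a duplication are paralogs.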

Page 16: Ortholog assignment

MSOAR2 [4]

Figure: Rearrangement scenario between human and mouse

1. Partition genes into families and assign a unique symbol

2. Reconstruct the most parsimonious rearrangement (inversion, translocation, fusion, fission, duplication)

3. Extract the corresponding orthologs

[4] Fu et al., JCB 2007

Page 17: Ortholog assignment

Can conserved gene neighborhood improve ortholog predictions?

Page 18: Ortholog assignment

Human-mouse synteny blocks

Conserved synteny blocks between human and mouse genomes generated by the Cinteny web server [5]

[5] Sinha and Meller, BMC Bioinformatics 2007

Page 19: Ortholog assignment

Local synteny criteria [6]

Figure: Local synteny: more than one unique match within +/- 3 genes. Homology defined as BLASTP E-value < 1e-5

94% of sampled inter-species pairs are identified as orthologs by Inparanoid (based on BBH) and local synteny criteria.

[6] Jin Jun et al., BMC Genomics 2009

Page 20: Ortholog assignment

Local synteny score (LC)

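Figure: genes g (in genome G) and h (in genome H) with their neighboring genes and the matching between them

The local synteny score of g and h is 4 since there are 4 edges in the maximum matching.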
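
A rough sketch of how such a score could be computed: take the genes around g and around h, connect homologous genes, and count the edges in a maximum bipartite matching (found here with simple augmenting paths). The window size `w`, the choice to include g and h themselves in their windows, the `family` table used as the homology test, and all gene names are assumptions made for illustration.

```python
def neighborhood(genome, gene, w):
    """Genes within w positions of gene (inclusive) in an ordered gene list."""
    i = genome.index(gene)
    return genome[max(0, i - w): i + w + 1]


def max_matching_size(left, right, is_homolog):
    """Size of a maximum bipartite matching, via augmenting paths."""
    match_of_right = {}  # gene in right -> gene in left currently matched to it

    def try_augment(u, seen):
        for v in right:
            if is_homolog(u, v) and v not in seen:
                seen.add(v)
                # Use v if it is free, or if its current partner can move elsewhere.
                if v not in match_of_right or try_augment(match_of_right[v], seen):
                    match_of_right[v] = u
                    return True
        return False

    return sum(1 for u in left if try_augment(u, set()))


def local_synteny_score(genome_g, g, genome_h, h, is_homolog, w=2):
    return max_matching_size(
        neighborhood(genome_g, g, w), neighborhood(genome_h, h, w), is_homolog
    )


# Hypothetical toy genomes; genes in the same family count as homologous.
family = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "g": "F", "h": "F",
          "c1": "C", "d1": "D", "d2": "D", "x": "X"}
genome_g = ["a1", "b1", "g", "c1", "d1"]
genome_h = ["b2", "a2", "h", "d2", "x"]
score = local_synteny_score(genome_g, "g", genome_h, "h",
                            lambda u, v: family[u] == family[v])
print(score)
```

Because the score is a matching, gene order inside the window does not matter, and each neighbor contributes at most one edge even when it has several homologs nearby.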

Page 21: Ortholog assignment

Smith-Waterman alignment score (SW)
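
For reference, a minimal Smith-Waterman scorer is sketched below. It assumes a simple match/mismatch scheme and a linear gap penalty, with parameter values chosen only for illustration; scoring protein sequences in practice would use a substitution matrix and affine gap penalties, which this sketch leaves out.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows of the table are kept because just the score is needed, not the alignment itself.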

Page 22: Ortholog assignment

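BBH-LS: bidirectional best hits based on linear combination of SW and LC

Figure: genes g (in genome G) and h (in genome H); their sequence similarity and local synteny are added together

sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)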
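
The effect of the weight f can be seen on a toy case. The sketch below assumes that SW and LC are first rescaled to [0, 1] by dividing by their largest value; the slides do not say how the two scores are put on a common scale, so this normalization, the function names, and the numbers are all assumptions.

```python
def normalize(scores):
    """Scale scores to [0, 1] by dividing by the largest value (an assumption)."""
    top = max(scores.values())
    return {pair: (value / top if top else 0.0) for pair, value in scores.items()}


def combined_similarity(sw, lc, f):
    """sim = (1 - f) * SW + f * LC for a single gene pair."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    return (1.0 - f) * sw + f * lc


def best_partner(g, candidates, sw, lc, f):
    """Candidate h with the highest combined similarity to g."""
    return max(candidates,
               key=lambda h: combined_similarity(sw[(g, h)], lc[(g, h)], f))


# Hypothetical scores: h2 has the higher sequence score, h the better neighborhood.
toy_sw = normalize({("g", "h"): 900, ("g", "h2"): 1000})
toy_lc = normalize({("g", "h"): 4, ("g", "h2"): 1})
sequence_only = best_partner("g", ["h", "h2"], toy_sw, toy_lc, f=0.0)
with_synteny = best_partner("g", ["h", "h2"], toy_sw, toy_lc, f=0.5)
print(sequence_only, with_synteny)
```

With f = 0 the choice depends on sequence similarity alone; raising f lets a well-conserved neighborhood outweigh a slightly higher alignment score.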

Page 23: Ortholog assignment

Human-Mouse-Rat dataset

Input: Human, mouse, and rat genes downloaded from Ensembl.

Benchmark: No “golden” benchmark for true orthology. Assume that orthologs are assigned the same gene symbol.
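
Under this assumption, predicted pairs can be scored by comparing gene symbols. The sketch below makes two further assumptions that the slides do not state: symbols are compared case-insensitively (human and mouse symbols usually differ in capitalization), and pairs in which either gene has no symbol are skipped. The gene identifiers are hypothetical placeholders.

```python
def count_tp_fp(predicted_pairs, symbol_of):
    """Count pairs with the same gene symbol (TP) and with different symbols (FP)."""
    tp = fp = 0
    for g, h in predicted_pairs:
        sym_g, sym_h = symbol_of.get(g), symbol_of.get(h)
        if sym_g is None or sym_h is None:
            continue  # assumption: unnamed genes cannot be judged either way
        if sym_g.upper() == sym_h.upper():
            tp += 1
        else:
            fp += 1
    return tp, fp


# Hypothetical identifiers mapped to gene symbols.
toy_symbols = {"hs1": "RASGRF1", "hs2": "RASGRF2",
               "mm1": "Rasgrf1", "mm2": "Rasgrf2", "mm3": None}
toy_predictions = [("hs1", "mm1"), ("hs2", "mm1"), ("hs2", "mm3")]
tp, fp = count_tp_fp(toy_predictions, toy_symbols)
print(tp, fp)
```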

Page 24: Ortholog assignment

Tuning the BBH-LS method

sim(g, h) = (1 − f) × SW(g, h) + f × LC(g, h)

Figure: Performance of BBH-LS for different ratios of spatial similarity to sequence similarity on the human-mouse dataset.

Page 25: Ortholog assignment

Results for various methods on Human-Mouse

Figure: TP: same gene symbols, FP: different gene symbols

More true positives and fewer false positives than MSOAR2.

Page 26: Ortholog assignment

Results for various methods on Human-Rat

Figure: TP: same gene symbols, FP: different gene symbols

Page 27: Ortholog assignment

Results for various methods on Mouse-Rat

Figure: TP: same gene symbols, FP: different gene symbols

Page 28: Ortholog assignment

How local synteny helps

Figure: gene neighborhoods on human chr 15, human chr 5, mouse chr 9, and mouse chr 13 (genes shown: CTSH, MSH3, CKMT2, RASGRF2, RASGRF1, ANKRD34C). Edge labels: sw = 5265, ls = 1; sw = 2003, ls = 5; sw = 2466, ls = 5.

Bold edges are the pairing from BBH-LS, thin edges are the pairing from BBH.

BBH paired RASGRF2 (human) to RASGRF1 (mouse) due to high SW, corrected by BBH-LS with LC.

Page 29: Ortholog assignment

Summary: Identifying main orthologs

Figure: gene tree rooted at the MRCA of G and H, with a speciation event and a duplication event; genes g, h, and h′ are labeled as main orthologs, orthologs, and paralogs.

For each gene in their common ancestor, find its direct descendant in G and H

Page 30: Ortholog assignment

Summary: Three approaches

Graph based, tree based, rearrangement based

Page 31: Ortholog assignment

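BBH-LS: bidirectional best hits based on linear combination of SW and LC

Figure: genes g (in genome G) and h (in genome H); their sequence similarity and local synteny are added together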
