comparative genomics haixu tang school of informatics


Post on 14-Dec-2015


Comparative genomics

Haixu Tang

School of Informatics

WGS of human genome

• 2001 Two assemblies of initial human genome sequences published

– International Human Genome Project

– Celera Genomics: WGS approach

• 1995 Haemophilus influenzae sequenced

• 1997 E. coli sequenced

• 1998 Complete sequence of the Caenorhabditis elegans genome

• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome

Model organisms

Why model organisms?

• Testing and improvements of genome sequencing technology and strategy

• 1993 Whole genome shotgun sequencing proposed (J. C. Venter)

• 1995 Haemophilus influenzae sequenced, ~1.5-2 Mbp

• 1995 Automated fluorescent sequencing instruments and robotic operations (Perkin-Elmer, Inc.)

• 1996 Yeast sequenced

• 1996 Double-barrelled sequencing

• 1997 E. coli sequenced, ~4 Mbp

• 1998 Complete sequence of the Caenorhabditis elegans genome, ~100 Mbp

• 1998 Whole genome shotgun sequencing (Weber & Myers)

• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome, ~180 Mbp

Model organisms

Why model organisms?

• Testing and improvements of genome sequencing technology and strategy

• Model organisms have important biological implications themselves.

• 1995 Haemophilus influenzae sequenced (infectious disease)

• 1996 Yeast sequenced (industry and biology)

• 1997 E. coli sequenced (industry and biotechnology)

• 1998 Complete sequence of the Caenorhabditis elegans genome (multi-cellular organism, development)

• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (genetics, entomology)

Model organisms

Why model organisms?

• Testing and improvements of genome sequencing technology and strategy.

• Model organisms have important biological implications themselves.

• Genome sequences provide useful information to study genome function and evolution.

• 1995 Haemophilus influenzae sequenced (Bacterial)

• 1996 Yeast sequenced (Uni-cellular)

• 1997 E. coli sequenced (Bacterial)

• 1998 Complete sequence of the Caenorhabditis elegans genome (Multi-cellular organism, nematode)

• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (Multi-cellular organism, insect)

Model organisms

• 2001 Human genome

• 2002 Mouse genome

– Initial sequencing and comparative analysis of the mouse genome

• 2003 Rat genome

• 2004 Chicken genome (first bird)

• 2005 Chimpanzee genome

Model mammalian and vertebrate genomes

Comparative genomics

• Solving biological problems by comparing genomic sequences

– Function of genes and genomes

– Evolution of genes and genomes

• Data driven approaches

– Computational methods are the core

Which genomes to sequence?

• Species having important biological applications

• For comparative genomics studies

– Functional consideration

• Evolutionarily divergent genomes reveal conserved elements, e.g. human vs. mouse (~75% identical)

• Evolutionarily close genomes reveal divergent elements, e.g. human vs. chimpanzee (98.4% identical)

– Evolutionary consideration

• Specific evolutionary puzzles, e.g. whole genome duplications in yeast

Ongoing eukaryotic genome projects

• http://igweb.integratedgenomics.com/ERGO_supplement/genomes_eukarya.html

• >20 yeasts, insects (12 Drosophila, 2 mosquitoes, silkworm), flea, sea urchin, frog, fish (zebrafish, Fugu), mammals (mouse, rat, dog, cow, pig, monkey, etc.), plants (Arabidopsis, rice (>2), maize, etc.)

Comparative genomics: case studies

• Gene function and evolution

• Gene-gene relationship

• Genome evolution

• Orthologues: any pair of genes whose most recent common ancestor node in the gene tree is a speciation event

• Paralogues: any pair of genes whose most recent common ancestor node in the gene tree is a duplication event
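The two definitions above can be sketched in code. This is a minimal illustration, assuming a gene tree whose internal nodes have already been labelled as speciation or duplication events; the `Node` class, the `classify_pairs` helper, and the toy tree are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional


@dataclass
class Node:
    label: str                      # gene name for a leaf, free text otherwise
    event: Optional[str] = None     # "speciation" or "duplication" (internal nodes)
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Return the gene names found below a node."""
    if not node.children:
        return [node.label]
    return [name for child in node.children for name in leaves(child)]


def classify_pairs(root):
    """Label every pair of genes by the event at their last common ancestor."""
    relations = {}

    def visit(node):
        if not node.children:
            return
        # Genes under different children of this node have it as their
        # most recent common ancestor.
        child_leaves = [leaves(child) for child in node.children]
        kind = "orthologue" if node.event == "speciation" else "paralogue"
        for left, right in combinations(child_leaves, 2):
            for a in left:
                for b in right:
                    relations[frozenset((a, b))] = kind
        for child in node.children:
            visit(child)

    visit(root)
    return relations


# Hypothetical toy tree: one duplication followed by a speciation in each copy.
tree = Node("root", "duplication", [
    Node("copy1", "speciation", [Node("H1"), Node("M1")]),
    Node("copy2", "speciation", [Node("H2"), Node("M2")]),
])
relations = classify_pairs(tree)
for pair, kind in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), kind)
```

In this toy tree H1 and M1 are orthologues, while H1 and H2 (and H1 and M2) are paralogues because their common ancestor is the duplication node.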

[Figure: Homologue relationships of genes. A gene tree drawn against time, starting from an ancestral gene A and passing through duplication and speciation events to genes A1, A2, M1, M2, M2', H1, and H2; gene pairs are marked as orthologues, inparalogues, or outparalogues.]

Functional implications

• Orthologous genes: typically the same function in different species

• Paralogous genes: often different functions

Yeast species

[Figure: species tree covering cerevisiae, paradoxus, mikatae, bayanus, glabrata, castellii, lactis, gossypii, waltii, hansenii, albicans, lipolytica, crassa, graminearum, grisea, nidulans, and pombe, with branch labels ~5M and ~20M]

• 5-20 million years

• Sufficient conservation to align

• Sufficient divergence to identify conserved functional elements

Large scale genome evolution

• Most genes have a clear match

• Clear blocks of synteny

Human–chimpanzee comparisons

• Positive selection: a sequence change in a species that results in increased fitness is subject to positive selection. As a consequence, the change normally becomes fixed, leading to adaptive evolution of that species.

Genome vs. Genes

• The whole genome sequence reveals not only which genes exist in a genome, but also which genes are absent from it (e.g., deleted).

Phylogenetic profile analysis

• A non-homology-based approach to gene function prediction

• The phylogenetic profile of a gene is a string encoding the presence or absence of the gene in every sequenced genome

• The phylogenetic profiles of genes involved in the same biological process are often “similar”, since they may co-evolve.

Phylogenetic profile analysis

• Phylogenetic profile (against N genomes)

– For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows

– If gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”; otherwise it is “0”
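The construction above can be sketched as follows. This is a minimal illustration, assuming homolog detection has already been done and summarised in a lookup table; the genome and gene names, `homolog_table`, and `build_profiles` are hypothetical. In practice the table would come from a sequence-similarity search with some score cutoff, which is not specified in the slides.

```python
def build_profiles(genes, genomes, homolog_table):
    """Return {gene: bit string}, one bit per genome in the given order.

    Bit i is "1" when the gene has a homolog in genomes[i], else "0".
    """
    return {
        gene: "".join("1" if gene in homolog_table[g] else "0" for g in genomes)
        for gene in genes
    }


# Hypothetical presence/absence data for three genes across four genomes.
genomes = ["genome_A", "genome_B", "genome_C", "genome_D"]
homolog_table = {
    "genome_A": {"geneX", "geneY"},
    "genome_B": {"geneX"},
    "genome_C": {"geneY", "geneZ"},
    "genome_D": {"geneX", "geneY", "geneZ"},
}

profiles = build_profiles(["geneX", "geneY", "geneZ"], genomes, homolog_table)
for gene, bits in profiles.items():
    print(f"{gene}: {bits}")
```

The genome order must stay fixed for every gene, otherwise bits at the same position would no longer refer to the same genome.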

Phylogenetic profile analysis

• Example – phylogenetic profiles based on 89 genomes

orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000

Genes with similar phylogenetic profiles tend to have related functions or to be functionally linked (D. Eisenberg and colleagues, 1999)
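One simple way to make "similar profiles" concrete is to count the positions at which two profiles differ (the Hamming distance) and report gene pairs below a cutoff. The sketch below assumes that measure; the slides do not say which similarity score was used, and the toy profiles, `hamming`, `linked_pairs`, and the cutoff are hypothetical.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(profiles, max_distance):
    """Return (gene1, gene2, distance) for pairs within max_distance, closest first."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        d = hamming(p1, p2)
        if d <= max_distance:
            pairs.append((g1, g2, d))
    return sorted(pairs, key=lambda item: item[2])


# Hypothetical 10-genome profiles; real profiles would span every sequenced genome.
toy_profiles = {
    "geneA": "1101100110",
    "geneB": "1101100100",
    "geneC": "0010011001",
    "geneD": "1111111111",
}

print(linked_pairs(toy_profiles, max_distance=1))
```

A profile that is all ones (like the hypothetical geneD) carries little information, because a gene present everywhere matches any other widespread gene regardless of function.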

Genome evolution

• Genome rearrangement

• Whole genome duplication

Turnip vs Cabbage: Look and Taste Different

• Although cabbages and turnips share a recent common ancestor, they look and taste different

Turnip vs Cabbage: Comparing Gene Sequences Yields No Evolutionary Information

Turnip vs Cabbage: Different mtDNA Gene Order

• Gene order comparison:

[Figure: mtDNA gene order, before and after rearrangement]

Evolution is manifested as the divergence in gene order

Comparative Genomic Architecture of Human and Mouse Genomes

To locate where the corresponding gene is in humans, the relative architectures of the human and mouse genomes were analyzed.

Types of Rearrangements

(Below, | separates two chromosomes.)

Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6

Translocation: 1 2 3 | 4 5 6 → 1 2 6 | 4 5 3

Fusion: 1 2 3 | 4 5 6 → 1 2 3 4 5 6

Fission: 1 2 3 4 5 6 → 1 2 3 | 4 5 6
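The four rearrangement types can be sketched as operations on signed gene orders, where each chromosome is a list of integers and a negative sign marks a gene on the opposite strand. The function names and index conventions below are illustrative assumptions, not notation from the slides.

```python
def reversal(chrom, i, j):
    """Reverse the segment chrom[i:j] and flip the sign of every gene in it."""
    return chrom[:i] + [-gene for gene in reversed(chrom[i:j])] + chrom[j:]


def translocation(chrom1, chrom2, i, j):
    """Exchange the tails chrom1[i:] and chrom2[j:] between two chromosomes."""
    return chrom1[:i] + chrom2[j:], chrom2[:j] + chrom1[i:]


def fusion(chrom1, chrom2):
    """Join two chromosomes end to end."""
    return chrom1 + chrom2


def fission(chrom, i):
    """Split one chromosome into two at position i."""
    return chrom[:i], chrom[i:]


print(reversal([1, 2, 3, 4, 5, 6], 2, 5))         # [1, 2, -5, -4, -3, 6]
print(translocation([1, 2, 3], [4, 5, 6], 2, 2))  # ([1, 2, 6], [4, 5, 3])
print(fusion([1, 2, 3], [4, 5, 6]))               # [1, 2, 3, 4, 5, 6]
print(fission([1, 2, 3, 4, 5, 6], 3))             # ([1, 2, 3], [4, 5, 6])
```

Each function returns new lists rather than modifying its input, so a sequence of rearrangements can be replayed step by step from the same starting order.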

Comparative Genomic Architectures: Mouse vs Human Genome

• Humans and mice have similar genomes, but their genes are ordered differently

• ~245 rearrangements

– Reversals

– Fusions

– Fissions

– Translocations

Hypothesis (1997): Whole Genome Duplication

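[Figure: yeast species tree (cerevisiae, paradoxus, mikatae, bayanus, glabrata, castellii, lactis, gossypii, waltii, hansenii, albicans, lipolytica, crassa, graminearum, grisea, nidulans, pombe) with the hypothesized whole genome duplication marked "?" and a branch label of ~100M]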

Hypothetical resolution of WGD

• A 1:2 mapping where

– nearly every region in species Y would correspond to two sister regions in S. cerevisiae

– the two sister regions in S. cerevisiae would contain ordered interleaving subsequences of the genes in the corresponding region of species Y

– nearly every region of S. cerevisiae would correspond to one region of species Y, and thus be paired to a sister region in S. cerevisiae
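The "ordered interleaving subsequences" condition can be checked mechanically for a single region. The sketch below is a simplified illustration: it assumes each S. cerevisiae sister region has already been translated into the names of its matching species Y genes, ignores gene orientation and local inversions, and uses a hypothetical `min_coverage` threshold that does not come from the slides.

```python
def is_ordered_subsequence(sub, full):
    """True if the genes in `sub` appear in `full` in the same relative order."""
    remaining = iter(full)
    # `gene in remaining` advances the iterator, so order is enforced.
    return all(gene in remaining for gene in sub)


def supports_duplication(region_y, sister1, sister2, min_coverage=0.8):
    """Check the 1:2 pattern for one region of species Y.

    Both sisters must preserve the gene order of region_y, and together
    they must account for at least min_coverage of its genes.
    """
    ordered = (is_ordered_subsequence(sister1, region_y)
               and is_ordered_subsequence(sister2, region_y))
    covered = (set(sister1) | set(sister2)) & set(region_y)
    return ordered and len(covered) / len(region_y) >= min_coverage


# Hypothetical region of species Y and the genes matched in two sister regions.
region_y = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
sister1 = ["g1", "g3", "g4", "g7"]
sister2 = ["g2", "g3", "g5", "g6", "g8"]

print(supports_duplication(region_y, sister1, sister2))
```

Here g3 is kept in both sisters, which corresponds to a retained paralogous pair, while every other gene survives in only one of the two copies.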


Aligning the S. cerevisiae and K. waltii genomes

• Most regions in K. waltii mapped to two regions in S. cerevisiae with each containing matches to only a subset of the K. waltii genes

Duplication covers the whole S. cerevisiae genome

What happens to genes post WGD?

• 12% (457) of paralogous gene pairs were retained

• 76 of the 457 gene pairs (17%) show accelerated protein evolution