Comparative Genomics
Haixu Tang, School of Informatics
WGS of human genome
• 2001 Two assemblies of initial human genome sequences published
– International Human Genome Project
– Celera Genomics: WGS approach
• 1995 Haemophilus influenzae sequenced
• 1997 E. coli sequenced
• 1998 Complete sequence of the Caenorhabditis elegans genome
• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome
Model organisms
• 1993 Whole genome shotgun sequencing proposed (J. C. Venter)
• 1995 Haemophilus influenzae sequenced (~1.5-2 Mbp)
• 1995 Automated fluorescent sequencing instruments and robotic operations (Perkin-Elmer, Inc.)
• 1996 Yeast sequenced
• 1996 Double-barrelled sequencing
• 1997 E. coli sequenced (~4 Mbp)
• 1998 Complete sequence of the Caenorhabditis elegans genome (~100 Mbp)
• 1998 Whole genome shotgun sequencing (Weber & Myers)
• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (~180 Mbp)
Model organisms
Why model organisms?
• Testing and improvements of genome sequencing technology and strategy
• Model organisms have important biological implications themselves.
• 1995 Haemophilus influenzae sequenced (infectious disease)
• 1996 Yeast sequenced (industry and biology)
• 1997 E. coli sequenced (industry and biotechnology)
• 1998 Complete sequence of the Caenorhabditis elegans genome (multi-cellular organism, development)
• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (genetics, entomology)
Model organisms
Why model organisms?
• Testing and improvements of genome sequencing technology and strategy.
• Model organisms have important biological implications themselves.
• Genome sequences provide useful information to study genome function and evolution.
• 1995 Haemophilus influenzae sequenced (Bacterial)
• 1996 Yeast sequenced (Uni-cellular)
• 1997 E. coli sequenced (Bacterial)
• 1998 Complete sequence of the Caenorhabditis elegans genome (Multi-cellular organism, nematode)
• 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (Multi-cellular organism, insect)
Model organisms
• 2001 Human genome
• 2002 Mouse genome
– Initial sequencing and comparative analysis of the mouse genome
• 2003 Rat genome
• 2004 Chicken genome (first bird)
• 2005 Chimpanzee genome
Model mammalian and vertebrate genomes
Comparative genomics
• Solving biological problems by comparing genomic sequences
– Function of genes and genomes
– Evolution of genes and genomes
• Data driven approaches
– Computational methods are the core
Which genomes to sequence?
• Species having important biological applications
• For comparative genomics studies
– Functional consideration
• Evolutionarily divergent genomes: look for conserved elements, e.g. human vs. mouse (~75% identical)
• Evolutionarily close genomes: look for divergent elements, e.g. human vs. chimpanzee (98.4% identical)
– Evolutionary consideration
• Specific evolutionary puzzles, e.g. whole genome duplications in yeast
Ongoing eukaryotic genome projects
• http://igweb.integratedgenomics.com/ERGO_supplement/genomes_eukarya.html
• >20 yeast, insects (12 Drosophila, 2 mosquitoes, silkworm), flea, sea urchin, frog, fish (zebrafish, Fugu), mammals (mouse, rat, dog, cow, pig, monkey, etc.), plants (Arabidopsis, rice (>2), maize, etc.)
Comparative genomics: case studies
• Gene function and evolution
• Gene-gene relationship
• Genome evolution
• Orthologues: any pair of genes whose last common ancestor node is a speciation event
• Paralogues: any pair of genes whose last common ancestor node is a duplication event
Homologue relationships of genes
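The two definitions can be sketched in code as a lookup of the event at the last common ancestor of each gene pair. This is a minimal illustration, not the method used in the lecture: the nested-tuple tree encoding, the function names, and the toy topology (one duplication followed by a speciation in each copy) are all assumptions made for the example.

```python
# Hypothetical gene tree: an internal node is (event, left, right),
# a leaf is a gene name. The topology below is illustrative only.
tree = ("duplication",
        ("speciation", "M1", "H1"),
        ("speciation", "M2", "H2"))

def leaves(node):
    """Return the gene names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "orthologues" if event == "speciation" else "paralogues"
    # Genes on opposite sides of this node have this node as their
    # last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out

relations = classify_pairs(tree)
for pair, relation in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), relation)
```

In this toy tree, M1 and H1 are separated by a speciation node and are reported as orthologues, while M1 and H2 trace back to the duplication node and are reported as paralogues.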
Homologue Relationships
(Figure: gene tree drawn against time, with duplication and speciation nodes and leaves A 1, A 2, M 1, M 2, M 2', H 1, H 2; gene pairs are labelled as inparalogues, outparalogues, or orthologues.)
Functional implications
• Orthologous genes: tend to have the same function in different species
• Paralogous genes: tend to have different functions
Yeast species
(Figure: phylogenetic tree of the yeast species cerevisiae, paradoxus, mikatae, bayanus, glabrata, castellii, lactis, gossypii, waltii, hansenii, albicans, lipolytica, crassa, graminearum, grisea, nidulans, pombe; divergence labels ~5M and ~20M.)
• 5-20 million years
• Sufficient conservation to align
• Sufficient divergence to identify conserved functional elements
Human–chimpanzee comparisons
• Positive selection: a sequence change in a species that results in increased fitness is subject to positive selection. As a consequence, the change normally becomes fixed, leading to adaptive evolution of that species.
Genome vs. Genes
• A whole genome sequence reveals not only which genes are present in a genome, but also which genes are absent (for example, deleted).
Phylogenetic profile analysis
• A non-homologous approach to gene function prediction
• The phylogenetic profile of a gene is a string encoding the presence or absence of the gene in every sequenced genome
• The phylogenetic profiles of genes involved in the same biological process are often "similar", since the genes may co-evolve.
Phylogenetic profile analysis
• Phylogenetic profile (against N genomes)
– For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows
– If gene X has a homolog in genome #i, the ith bit of X's phylogenetic profile is "1"; otherwise it is "0"
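The construction rule above can be sketched as follows. The genome names, gene names, and the `homolog_hits` table are hypothetical; in practice the presence/absence calls would come from a homology search, which is not shown here.

```python
# Hypothetical reference genomes, in a fixed order (bit i = genome i).
genomes = ["genome_1", "genome_2", "genome_3", "genome_4"]

# Hypothetical homology results: gene -> set of genomes with a homolog.
homolog_hits = {
    "geneX": {"genome_1", "genome_3"},
    "geneY": {"genome_1", "genome_3", "genome_4"},
}

def phylogenetic_profile(gene, genomes, homolog_hits):
    """Return a bit string: '1' where the gene has a homolog, else '0'."""
    hits = homolog_hits.get(gene, set())
    return "".join("1" if g in hits else "0" for g in genomes)

for gene in homolog_hits:
    print(gene, phylogenetic_profile(gene, genomes, homolog_hits))
```

The order of `genomes` must be the same for every gene, since profiles are only comparable position by position.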
Phylogenetic profile analysis
• Example – phylogenetic profiles based on 89 genomes
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000
Genes with similar phylogenetic profiles tend to have related functions or to be functionally linked (D. Eisenberg and colleagues, 1999)
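One simple way to make "similar profiles" concrete is the Hamming distance, the number of genomes in which two genes disagree on presence or absence. This is an assumed similarity measure chosen for illustration; the slides do not say which measure was used. The gene names and short profiles below are hypothetical toy data.

```python
def hamming(p, q):
    """Count positions where two equal-length profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))

def similar_genes(query, profiles, max_distance):
    """Return (gene, distance) pairs within max_distance of the query gene."""
    target = profiles[query]
    found = [(gene, hamming(target, prof))
             for gene, prof in profiles.items() if gene != query]
    return sorted((g, d) for g, d in found if d <= max_distance)

# Hypothetical 6-genome profiles.
profiles = {
    "geneA": "110100",
    "geneB": "110101",
    "geneC": "001011",
    "geneD": "110100",
}

print(similar_genes("geneA", profiles, max_distance=1))
```

Genes returned by `similar_genes` are candidates for a functional link with the query; the threshold is a free parameter and would need tuning on real data.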
Turnip vs Cabbage: Look and Taste Different
• Although cabbages and turnips share a recent common ancestor, they look and taste different
Turnip vs Cabbage: Different mtDNA Gene Order
• Gene order comparison (figure: mtDNA gene order before and after rearrangement)
Evolution is manifested as the divergence in gene order
Comparative Genomic Architecture of Human and Mouse Genomes
To locate where the corresponding gene is in humans, the relative architecture of the human and mouse genomes was analyzed.
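Types of Rearrangements
• Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6
• Translocation: (1 2 3) (4 5 6) → (1 2 6) (4 5 3)
• Fusion: two chromosomes carrying genes 1 2 3 4 5 6 join into one
• Fission: one chromosome 1 2 3 4 5 6 splits into two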
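The four operations can be sketched on chromosomes represented as lists of signed integers, where the sign records gene orientation. The function names and index conventions (Python slices, 0-based) are assumptions for this illustration, and the split point used for fission is arbitrary because the slide does not specify one.

```python
def reversal(chrom, i, j):
    """Reverse the segment chrom[i:j] and flip the sign of each gene in it."""
    return chrom[:i] + [-g for g in reversed(chrom[i:j])] + chrom[j:]

def translocation(a, b, i, j):
    """Exchange the tails a[i:] and b[j:] between two chromosomes."""
    return a[:i] + b[j:], b[:j] + a[i:]

def fusion(a, b):
    """Join two chromosomes end to end."""
    return a + b

def fission(chrom, i):
    """Split one chromosome into two at position i."""
    return chrom[:i], chrom[i:]

print(reversal([1, 2, 3, 4, 5, 6], 2, 5))        # reversal of genes 3..5
print(translocation([1, 2, 3], [4, 5, 6], 2, 2))  # swap the last genes
print(fusion([1, 2, 3, 4], [5, 6]))               # split point is assumed
print(fission([1, 2, 3, 4, 5, 6], 4))             # split point is assumed
```

The first two calls reproduce the reversal and translocation examples listed above.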
Comparative Genomic Architectures: Mouse vs Human Genome
• Humans and mice have similar genomes, but their genes are ordered differently
• ~245 rearrangements
– Reversals
– Fusions
– Fissions
– Translocations
Hypothesis (1997): Whole Genome Duplication
(Figure: phylogenetic tree of the yeast species cerevisiae, paradoxus, mikatae, bayanus, glabrata, castellii, lactis, gossypii, waltii, hansenii, albicans, lipolytica, crassa, graminearum, grisea, nidulans, pombe, with a "?" marker and a ~100M label.)
Hypothetical resolution of WGD
• A 1:2 mapping where
– nearly every region in species Y would correspond to two sister regions in S. cerevisiae
– the two sister regions in S. cerevisiae would contain ordered interleaving subsequences of the genes in the corresponding region of species Y
– nearly every region of S. cerevisiae would correspond to one region of species Y, and thus be paired to a sister region in S. cerevisiae
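The "ordered interleaving subsequences" condition can be checked mechanically for one region. The sketch below is an assumed formalization, not the analysis actually performed: each sister region must preserve the gene order of the species Y region, and, as an added assumption, the two sisters together must account for every gene in that region. Gene names are hypothetical, and gene orientation is ignored.

```python
def is_ordered_subsequence(sub, full):
    """True if the genes of `sub` appear in `full` in the same relative order."""
    remaining = iter(full)
    # `gene in remaining` advances the iterator, so order is enforced.
    return all(gene in remaining for gene in sub)

def consistent_with_duplication(region_y, sister_1, sister_2):
    """Check the 1:2 pattern for one region of species Y."""
    return (is_ordered_subsequence(sister_1, region_y)
            and is_ordered_subsequence(sister_2, region_y)
            and set(sister_1) | set(sister_2) == set(region_y))

# Hypothetical region in species Y and two S. cerevisiae sister regions.
region_y = ["g1", "g2", "g3", "g4", "g5", "g6"]
sister_1 = ["g1", "g3", "g4"]
sister_2 = ["g2", "g4", "g5", "g6"]

print(consistent_with_duplication(region_y, sister_1, sister_2))
print(consistent_with_duplication(region_y, ["g3", "g1", "g4"], sister_2))
```

A gene kept in both sisters (here `g4`) is allowed; a sister whose genes are out of order relative to species Y fails the check.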
Aligning the S. cerevisiae and K. waltii genomes
• Most regions in K. waltii mapped to two regions in S. cerevisiae with each containing matches to only a subset of the K. waltii genes