Intro to Comp Genomics
Lecture 3: Genomic features and patterns
Figure (early evolution): RNA-based genomes; ribosome and proteins; genetic code; DNA-based genomes; membranes; diversity!
3.4 – 3.8 BYA – fossils??
3.2 BYA – good fossils
3 BYA – methanogenesis
2.8 BYA – photosynthesis
....
1.7-1.5 BYA – eukaryotes
..
0.55 BYA – Cambrian explosion
0.44 BYA – jawed vertebrates
0.4 BYA – land plants
0.14 BYA – flowering plants
0.10 BYA – mammals
Eukaryotes
Unikonts – one flagellum at some developmental stage
• Fungi
• Animals
• Animal parasites
• Amoebas
Bikonts – ancestrally two flagella
• Green plants
• Red algae
• Ciliates, Plasmodium
• Brown algae
• More amoebae
Strange biology!
A big bang phylogeny: speciations across a short time span? Ambiguity – and not much hope for really resolving it
Vertebrates
Sequenced Genomes phylogeny
Fossil based, large scale phylogeny
Figure: primate phylogeny of Marmoset, Macaque, Orangutan, Chimp, Human, Baboon, Gibbon and Gorilla, with divergence labels 0.5%, 0.5%, 0.8%, 1.5%, 3%, 9% and 1.2%.
Primates
Flies
Yeasts
Genome Size
Human: 2.8 Gb
Fly: 130 Mb
Arabidopsis: 115 Mb
Plasmodium: 22 Mb
S. cerevisiae / S. pombe: 12 Mb
E. coli: 4.6 Mb
Why larger genomes?
• Selfish DNA
  – Larger genomes are a result of the proliferation of selfish DNA
  – Proliferation stops only when it becomes too deleterious
• Bulk DNA
  – Genome content is a consequence of natural selection
  – A larger genome is needed to allow larger cell size, larger nuclear membrane etc.
Why smaller genomes?
• Metabolic cost: maybe cells lose excess DNA for energetic efficiency
  – But DNA is only 2-5% of the dry mass
  – No correlation between genome size and replication time in prokaryotes
  – Replication is much faster than transcription (10-20 times in E. coli)
Mutational balance
• Balance between deletions and insertions
  – May be different between species
  – Different balances may have evolved
• In flies, yeast (laboratory evolution)
  – 4-fold more 4kb spontaneous insertions
• In mammals
  – More small deletions than insertions
Mutational hazard
• No loss of function for inert DNA
  – But is it truly not functional?
• Gain of function mutations are still possible:
  – Transcription
  – Regulation
Differences in population size may make DNA purging more effective for prokaryotes and small eukaryotes
Differences in regulatory sophistication may make the DNA mutational hazard less of a problem for metazoans
Genome Structural features: centromeres/telomeres
Figure labels: Rat – partly acrocentric; Human
• Centromeres are essential and universally important for proper cell division, but are highly divergent among species
• S. cerevisiae: 100bp centromere
• S. pombe: repetitive (~50kb)
• Mouse: one arm is degenerate
• Human: both arms contain genes
• Pericentromeric regions – more repeats
• Telomeres are critical for genome maintenance
• Subtelomeric regions – also repetitive, and rearranged
• May be key to nuclear structure?
Substitution rates and stationary distributions
• A simple Markov chain
• P(X=A;t+1) = P(X=A;t)P(A|A) + P(X=C;t)P(A|C) + P(X=G;t)P(A|G) + P(X=T;t)P(A|T)
We represent the change using the transition probability matrix (rows: current nucleotide, columns: next nucleotide)

        A       C       G       T
A     0.991   0.001   0.001   0.007
C     0.003   0.991   0.003   0.003
G     0.003   0.003   0.99    0.003
T     0.007   0.001   0.001   0.991

Starting from P(C;t=0) = 1 and running for a long time, what would you expect P(A) to be?
Fixed point:
\[ \sum_{c \in \{C,G,T\}} P(A)\,P(c|A) = \sum_{c \in \{C,G,T\}} P(c)\,P(A|c) \]
More later in the course (and in the second term)
Differences in substitution rates result in major changes to the stationary distribution
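A minimal sketch of the long-run question, assuming the matrix entries read off the slide and a row-stochastic layout (rows: current nucleotide). The G row sums to 0.999 as extracted, so the sketch rescales rows before iterating; the function names are illustrative only.

```python
NUCLEOTIDES = "ACGT"

# Rows: current nucleotide, columns: next nucleotide (values read off the slide).
P = [
    [0.991, 0.001, 0.001, 0.007],  # from A
    [0.003, 0.991, 0.003, 0.003],  # from C
    [0.003, 0.003, 0.990, 0.003],  # from G
    [0.007, 0.001, 0.001, 0.991],  # from T
]


def normalise_rows(matrix):
    """Rescale every row to sum to 1 (assumption: fixes the 0.999 G row)."""
    return [[v / sum(row) for v in row] for row in matrix]


def step(dist, matrix):
    """One generation: P(X=b; t+1) = sum_a P(X=a; t) * P(b|a)."""
    n = len(matrix)
    return [sum(dist[a] * matrix[a][b] for a in range(n)) for b in range(n)]


def stationary(matrix, start, tol=1e-12, max_iter=100_000):
    """Iterate the chain until the distribution stops changing."""
    dist = list(start)
    for _ in range(max_iter):
        nxt = step(dist, matrix)
        if max(abs(x - y) for x, y in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist


pi = stationary(normalise_rows(P), start=[0.0, 1.0, 0.0, 0.0])  # P(C; t=0) = 1
for base, prob in zip(NUCLEOTIDES, pi):
    print(base, round(prob, 3))
```

Although the chain starts entirely at C, the limit does not depend on the start: with these rates about three quarters of the probability mass ends up on A+T, because C and G leave towards A/T three times faster than A and T leave towards C/G.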
Nucleotide composition: human vs. mouse
Figure: CpG content and methylation
CpG islands (mainly at promoters): low methylation (+ selection??), normal mutation, normal #CpGs
Surrounding regions: high methylation, deamination and slow correction (CG → TG / CA), small #CpGs
CpG Islands: %(G+C)>0.5 and %CpG/(%G*%C) > 0.6, for a “long” genomic interval
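The CpG island criterion can be sketched for a single interval as follows. The thresholds 0.5 and 0.6 come from the slide; the 200 bp minimum length is only an assumed stand-in for a “long” interval, and the function names are illustrative.

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed/expected ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0, 0.0
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    # Count CG dinucleotides over the n-1 dinucleotide positions.
    cpg = sum(1 for i in range(n - 1) if seq[i] == "C" and seq[i + 1] == "G")
    f_cpg = cpg / (n - 1)
    ratio = f_cpg / (f_c * f_g) if f_c > 0 and f_g > 0 else 0.0
    return f_c + f_g, ratio


def looks_like_cpg_island(seq, min_len=200):
    """Slide criterion: %(G+C) > 0.5 and %CpG/(%G*%C) > 0.6 (min_len is assumed)."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc > 0.5 and ratio > 0.6
```

A sequence can be G+C rich and still fail the test: "CCCCGGGG" repeated has 100% G+C but few CpG dinucleotides, so its observed/expected ratio stays below 0.6.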
K-mer distribution
Specialized proteins can bind DNA in a sequence specific fashion
Genomes can therefore control the level of affinity of each region to a large set of DNA binding proteins
DNA binding sites are typically short (<20bp)
Multiple binding sites at different affinities participate in regulation
The frequency of k-mer DNA words in the genome is called the k-spectrum of the genome
The k-spectrum is complex, due to multiple effects
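As a small illustration of the k-spectrum, the sketch below counts every overlapping k-mer in a sequence and normalises to frequencies. Skipping words that contain anything other than A, C, G, T (for example N) is an assumption of this sketch, as is the function name.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Frequency of each k-mer among all overlapping windows of length k."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


spectrum = kmer_spectrum("ACGTACGT", 2)
```

For real genomes the same loop applies, but the sequence would be streamed chromosome by chromosome rather than held as one short string.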
Figure: A, C, G, T frequencies by distance from insert
Genomic information: Protein coding genes
Defining and detecting genes
Predictions: using probabilistic models
• HMM based, using landmark features
• Differ between prokaryotes and eukaryotes
• Modest success; only work for “classical genes” (protein coding)
RNA based
• Sequencing RNAs from different tissues
• Mapping to the genome using standard alignment algorithms
• Comparing to known protein sequence databases
• Gold standard
EST based
• Sequencing expressed sequence tags (UniGene)
• Clustering and defining criteria for coverage
• Combine with RNA/predictions
Comparative
• Aligning gene models from one species to another based on sequence homology
• Effective for uncharacterized genomes, but of limited accuracy
Databases:
RefSeq: validated transcripts, high confidence, but incomplete
UCSC: knownGene (combining multiple sources) – moderately conservative
Ensembl: combining multiple sources
Model organisms: SGD, FlyBase, WormBase
Introns/Exons
Genomic information: the gene repertoire is evolving by duplication and loss
Strand asymmetry
Polak and Arndt
Structure meets information: HOX clusters as an example
Hox genes are important developmental regulators
Present in linear clusters, preserving order
Their expression is frequently coordinated with the gene order
4 HOX clusters are present in the human genome
Additional gene clusters: Protocadherins, Olfactory receptors, MAGE genes, Zinc fingers
Additional smaller groups of related regulators are co-located
miRNA clusters
Repeats: selfish DNA
Repetitive elements in the human genome

Class          Copies                           Genome fraction
LINEs          868,000 (only ~100 active!!)     20.4%
SINEs          1,558,000 (70% Alu)              13.1%
LTR elements   443,000                          8.3%
Transposons    294,000                          2.8%
Retrotransposition via RNA
Repeats: short tandems, satellites
DNA-based transposons do not involve an RNA intermediate, and are quite rare.
Satellite DNA duplicates by replication slippage, which is enhanced for specific sequences. It is abundant near telomeres and centromeres. Some of these are still a mystery.
Retrotransposition is generally sloppy and noisy – so elements die out quickly
Element proliferation appears in evolutionary bursts.
Pseudogenes
Genes that are becoming inactive due to mutations are called pseudogenes
mRNAs that jump back into the genome are called processed pseudogenes (they therefore lack introns)
Linear correlation
Figure: wikipedia
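\[ r_{XY} = \frac{COV(X,Y)}{\sqrt{V(X)\,V(Y)}} \]
For normalized vectors:
\[ r_{XY} = X \cdot Y \]
Easy to compute in one pass:
\[ r_{XY} = \frac{E(XY) - E(X)E(Y)}{\sqrt{\left(E(X^2) - E(X)^2\right)\left(E(Y^2) - E(Y)^2\right)}} \]
Hard to say if meaningful. For N > 500, the probability that uncorrelated data give a correlation r_c at least as extreme as r:
\[ P(|r_c| > |r|) = \mathrm{erfc}\left(\frac{|r|\sqrt{N}}{\sqrt{2}}\right) \]
Assuming binormality:
\[ t = r\sqrt{\frac{N-2}{1-r^2}} \]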
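The one-pass formula and the two significance statistics can be sketched as below. Function names are illustrative; converting t to a p-value needs Student's t distribution with N-2 degrees of freedom, which is left out of this stdlib-only sketch.

```python
import math


def pearson_one_pass(xs, ys):
    """Linear correlation from running sums: a single pass over the data."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in zip(xs, ys):
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    ex, ey = sx / n, sy / n
    cov = sxy / n - ex * ey          # E(XY) - E(X)E(Y)
    vx = sxx / n - ex * ex           # E(X^2) - E(X)^2
    vy = syy / n - ey * ey           # E(Y^2) - E(Y)^2
    return cov / math.sqrt(vx * vy)


def large_n_pvalue(r, n):
    """Two-sided p-value for large N: erfc(|r| * sqrt(N) / sqrt(2))."""
    return math.erfc(abs(r) * math.sqrt(n) / math.sqrt(2))


def t_statistic(r, n):
    """t = r * sqrt((N - 2) / (1 - r^2)), valid under binormality."""
    return r * math.sqrt((n - 2) / (1 - r * r))


r = pearson_one_pass([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

One caveat of the one-pass form: subtracting two large, nearly equal sums can lose floating-point precision when values are far from zero, so centring the data first is safer in that case.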
Spearman correlation
• Linear correlation is biased whenever the relationship is non-linear
• Linear correlation is extremely sensitive to outliers
• Non-parametric statistics transform all values to their rank statistic (by sorting)
• Ties are broken using “mid-ranking”
• Computing correlation on the rank statistics gives the Spearman correlation
• Spearman values of independent variables are distributed just like linear correlation
• Computing the p-value is done accordingly
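A minimal sketch of the rank transform with mid-ranking, followed by linear correlation on the ranks. The helper names are illustrative, not from the lecture.

```python
import math


def midranks(values):
    """1-based ranks; tied values all receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Linear correlation computed on centred values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman correlation: linear correlation of the mid-ranks."""
    return pearson(midranks(xs), midranks(ys))
```

For a monotone but non-linear relation such as y = x², `spearman` gives 1 while `pearson` stays below 1, which is the bias the first bullet refers to.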
Studying trend lines
• Strong correspondence between variables can be observed without any Spearman or Pearson correlation (can you think of an example?)
• One can try using a parametric transformation to fix such a problem
• In any case, looking at the data carefully and computing trend lines is essential
• Generating statistics on trend lines:
Fixed Span bins
Fixed Size bins
Sliding window
Weighted sliding window
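Three of the binning schemes listed above can be sketched as follows, each returning (x position, mean y) pairs for a trend line. The function names and return shapes are assumptions of this sketch; a weighted sliding window would replace the plain mean by a kernel-weighted mean and is not shown.

```python
def fixed_span_bins(xs, ys, n_bins):
    """Mean of y in n_bins equal-width intervals of x: (bin centre, mean y)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all x are equal
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), n_bins - 1)  # x == hi goes in the last bin
        sums[b] += y
        counts[b] += 1
    return [(lo + (b + 0.5) * width, sums[b] / counts[b])
            for b in range(n_bins) if counts[b]]


def sliding_window(xs, ys, size, step=1):
    """Mean x and mean y over windows of `size` consecutive points, sorted by x."""
    pairs = sorted(zip(xs, ys))
    out = []
    for i in range(0, len(pairs) - size + 1, step):
        chunk = pairs[i:i + size]
        out.append((sum(p[0] for p in chunk) / size,
                    sum(p[1] for p in chunk) / size))
    return out


def fixed_size_bins(xs, ys, size):
    """Non-overlapping bins of `size` points each (a trailing partial bin is dropped)."""
    return sliding_window(xs, ys, size, step=size)
```

Fixed-span bins keep the x scale readable but can leave sparse bins with noisy means; fixed-size bins give every point on the trend line the same statistical weight.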
Auto-correlation
• Computing correlation between point x and point x-d for different d’s
• Providing clues for different scales of correlation in the data
• For example:
  – Fragment lengths on arrays
  – Nucleosomes
• Related method: Fast Fourier Transform (FFT)
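Auto-correlation at a given lag d can be sketched by correlating the series with a shifted copy of itself. The names below are illustrative, and the signal is assumed to be sampled at evenly spaced positions.

```python
import math


def pearson(xs, ys):
    """Linear correlation computed on centred values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def autocorrelation(values, lag):
    """Correlation between the value at position x and the value at x - lag."""
    return pearson(values[lag:], values[:len(values) - lag])


signal = [0, 1] * 50  # toy signal with period 2
profile = [autocorrelation(signal, d) for d in range(1, 5)]
```

Scanning `lag` over a range and plotting the result shows at which distances the signal repeats; a periodic signal produces peaks at multiples of its period.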
Testing difference between samples
Model based:
assume normality, compute p-value
assume binomial distribution
assume Poisson distribution
Direct comparison:
T-test to compare means
ANOVA
Chi-square to compare contingency tables
Kolmogorov Smirnov
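As one example of a direct comparison, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical cumulative distributions. The sketch below computes only the statistic D; turning D into a p-value needs the Kolmogorov distribution and is omitted. The function name is illustrative.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:  # the maximum gap is always attained at a sample value
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Unlike the t-test, which compares only the means, D responds to any difference in distribution shape.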
Multi-variate cross correlation
• Pairwise correlation matrix
  – Plotting – but in which order?
• Multiple testing should be controlled for
  – Bonferroni’s union bound
  – False discovery rate (FDR)
• Control one/few parameters
  – Pros: results are robust if done right
  – Cons: fewer stats
• Normalize one/few parameters
• Clustering or model based approaches (see later in the course)
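The two multiple-testing controls named in the list above can be sketched as follows. The FDR procedure shown is Benjamini-Hochberg, which is one common choice and an assumption here, since the slide does not name a specific method; alpha = 0.05 is also assumed.

```python
def bonferroni(pvalues, alpha=0.05):
    """Union bound: reject only p-values below alpha / (number of tests)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """FDR control: reject the k smallest p-values, k = max rank with p <= alpha*rank/m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected


pvals = [0.001, 0.02, 0.03, 0.5]
```

On these four p-values Bonferroni rejects only the first test, while Benjamini-Hochberg rejects the first three, which illustrates how much more permissive FDR control is.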
Your Task
Preparations:
• Download chromosome 17 of the human genome
• Download the UCSC knownGene table
• Compute the moving average G+C content for bins of 500 bp, decide on 8 G+C content bins.
Modeling:
• Build a 3rd-order, G+C content dependent Markov model from all of the non-exonic sequences
Analysis:
• Compute the expected frequency of 4-mers in your genomic bins, using only G+C content or the 3rd-order Markov model
• Compute the observed/expected ratio for all 256 4-mers on all genomic bins. Compute correlations between them.
• Develop a p-value (using any reasonable method) for the difference between observed and expected k-mer frequencies, report the most significant p-values you discovered.
Preparations:
- Links to data in the wiki
- File format of the knownGene table at the UCSC site
- Reasonable G+C content bins: balance the bin span and the bin count (having very few cases in a bin will make your subsequent model unrealistic; having a large span of G+C in a bin will make your model inaccurate)
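One possible way to approach the preparation step, as a sketch under stated assumptions: G+C is measured in non-overlapping 500 bp windows (a true moving average would use overlapping windows), masked lower-case bases are counted like upper-case ones, and the 8 bins are chosen as quantiles so that each holds a similar number of windows. All names are illustrative.

```python
def gc_per_window(seq, window=500):
    """G+C fraction of each non-overlapping window (None if it has no A/C/G/T)."""
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        acgt = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / acgt if acgt else None)
    return fractions


def quantile_edges(values, n_bins=8):
    """Bin boundaries that split the observed values into n_bins equal-count bins."""
    vals = sorted(v for v in values if v is not None)
    return [vals[(len(vals) * i) // n_bins] for i in range(1, n_bins)]
```

Quantile edges address the first half of the trade-off (no nearly empty bins); the span of the extreme bins should still be checked by eye, since a very wide G+C range inside one bin makes the model for that bin inaccurate.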
Modeling:
Count k-mers for each G+C bin. Transform to a 3rd-order Markov model:
\[ \Pr(c_4 \mid c_1 c_2 c_3, GC) = \frac{n^{GC}_{c_1 c_2 c_3 c_4}}{\sum_{c} n^{GC}_{c_1 c_2 c_3 c}} \]
Use fixed-size, non-overlapping bins of 20kb for testing the observed/expected statistic.
Ignore repeat-masked sequence (lower-case characters).
If you believe a different bin size would work better – go for it (think of the expected number of each of the 256 4-mers – in a 20kb bin it is ~80, i.e. 20,000/256, if there is no masked sequence).
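The count-then-normalise step can be sketched for the sequence of a single G+C bin as below; in the task it would be run once per G+C bin on the non-exonic sequence. Function names are illustrative, and 4-mers never observed simply get no entry (a pseudocount would be one way to handle them).

```python
from collections import defaultdict


def count_kmers(seq, k=4):
    """Count upper-case k-mers; windows touching masked (lower-case) bases or N are skipped."""
    counts = defaultdict(int)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(ch in "ACGT" for ch in word):
            counts[word] += 1
    return counts


def markov_from_counts(counts):
    """Pr(c4 | c1c2c3) = n(c1c2c3c4) / sum over c of n(c1c2c3c)."""
    totals = defaultdict(int)
    for word, n in counts.items():
        totals[word[:-1]] += n
    return {word: n / totals[word[:-1]] for word, n in counts.items()}


counts = count_kmers("ACGTACGAacgtACGT")
model = markov_from_counts(counts)
```

In the toy string the context ACG is followed by T twice and by A once, so the model assigns Pr(T | ACG) = 2/3 and Pr(A | ACG) = 1/3; the lower-case stretch contributes nothing.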
Analysis:
The expected number of appearances for a 4-mer given only the G+C content:
\[ N_{bin} \cdot \Pr(c_1 c_2 c_3 c_4 \mid GC) \]
The expected number of appearances for a 4-mer given the 3rd-order Markov model and the G+C:
\[ \sum_{GC} N_{bin,GC,c_1 c_2 c_3} \cdot \Pr(c_4 \mid c_1 c_2 c_3, GC) \]
The obs/exp ratio should be studied in log-scale, carefully handling low values.
Computing 256*256/2 correlations of ~1000 values should not be difficult. We will use these correlations later.
P-value can be computed using a Z-score.
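A sketch of the analysis step under explicit assumptions: the G+C-only expectation treats bases as independent with P(G) = P(C) = GC/2 and P(A) = P(T) = (1-GC)/2; low counts are handled with a pseudocount of 1; and the Z-score uses a Poisson-like variance equal to the expected count. None of these choices is prescribed by the slides, and the function names are illustrative.

```python
import math


def expected_from_gc(word, gc, n_positions):
    """Expected count of a k-mer given only G+C content (independent bases assumed)."""
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in word:
        prob *= p[base]
    return n_positions * prob


def log_obs_exp(observed, expected, pseudo=1.0):
    """log2(obs/exp) with a pseudocount so that low or zero counts stay finite."""
    return math.log2((observed + pseudo) / (expected + pseudo))


def z_score_pvalue(observed, expected):
    """Z-score with variance ~ expected (assumption) and its two-sided normal p-value."""
    z = (observed - expected) / math.sqrt(expected)
    return z, math.erfc(abs(z) / math.sqrt(2))
```

Because 256 4-mers are tested in every bin, the resulting p-values still need a multiple-testing correction before the most significant ones are reported.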