Download - Intro to Comp Genomics Lecture 3: Genomic features and patterns

Intro to Comp Genomics

Lecture 3: Genomic features and patterns

[Figure: early evolution. Labels: RNA-based genomes; ribosome, proteins; genetic code; DNA-based genomes; membranes; diversity!]

3.4–3.8 BYA – fossils??
3.2 BYA – good fossils
3 BYA – methanogenesis
2.8 BYA – photosynthesis
....
1.7–1.5 BYA – eukaryotes
..
0.55 BYA – Cambrian explosion
0.44 BYA – jawed vertebrates
0.4 BYA – land plants
0.14 BYA – flowering plants
0.10 BYA – mammals

[Figure: eukaryote phylogeny with two major groups, unikonts and bikonts]

Unikonts – one flagellum at some developmental stage
• Fungi
• Animals
• Animal parasites
• Amoebas

Bikonts – ancestrally two flagella
• Green plants
• Red algae
• Ciliates, Plasmodium
• Brown algae
• More amoebae

Strange biology!

A big bang phylogeny: speciations across a short time span? Ambiguity – and not much hope for really resolving it

Vertebrates

Sequenced Genomes phylogeny

Fossil based, large scale phylogeny

[Figure: primate phylogeny. Species: marmoset, macaque, orangutan, chimp, human, baboon, gibbon, gorilla. Percentage labels on the tree: 0.5%, 0.5%, 0.8%, 1.5%, 3%, 9%, 1.2%]

Primates

Yeasts

Genome Size

Human: 2.8 Gb

Fly: 130 Mb

Arabidopsis: 115 Mb

Plasmodium: 22 Mb

S. cerevisiae / S. pombe: 12 Mb

E. coli: 4.6 Mb

Why larger genomes?

• Selfish DNA
– Larger genomes are a result of the proliferation of selfish DNA
– Proliferation stops only when it becomes too deleterious

• Bulk DNA
– Genome content is a consequence of natural selection
– A larger genome is needed to allow a larger cell size, a larger nuclear membrane, etc.

Why smaller genomes?

• Metabolic cost: maybe cells lose excess DNA for energetic efficiency
– But DNA is only 2-5% of the dry mass
– No correlation between genome size and replication time in prokaryotes
– Replication is much faster than transcription (10-20 times in E. coli)

Mutational balance

• Balance between deletions and insertions
– May be different between species
– Different balances may have evolved

• In flies and yeast (laboratory evolution)
– 4-fold more 4 kb spontaneous insertions

• In mammals
– More small deletions than insertions

Mutational hazard

• No loss of function for inert DNA
– But is it truly not functional?

• Gain-of-function mutations are still possible:
– Transcription
– Regulation

Differences in population size may make DNA purging more effective for prokaryotes and small eukaryotes

Differences in regulatory sophistication may make the mutational hazard of DNA less of a problem for metazoans

Genome Structural features: centromeres/telomeres

[Figure: chromosomes of rat (partly acrocentric) and human]

• Centromeres are essential and universally important for proper cell division, but are highly divergent among species

• S. cerevisiae: 100 bp centromere
• S. pombe: repetitive (~50 kb)
• Mouse: one arm is degenerate
• Human: both arms contain genes
• Pericentromeric regions – more repeats

• Telomeres are critical for genome maintenance
• Subtelomeric regions – also repetitive, and rearranged
• May be key to nuclear structure?

Substitution rates and stationary distributions

• A simple Markov chain
• P(X=A; t+1) = P(X=A; t)P(A|A) + P(X=C; t)P(A|C) + P(X=G; t)P(A|G) + P(X=T; t)P(A|T)

We represent the change using the transition probability matrix

Starting from P(C; t=0) = 1 and running for a long time, what would you expect P(A) to be?

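$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 0.991 & 0.001 & 0.001 & 0.007 \\
C & 0.003 & 0.991 & 0.003 & 0.003 \\
G & 0.003 & 0.003 & 0.99 & 0.003 \\
T & 0.007 & 0.001 & 0.001 & 0.991
\end{array}
$$

Fixed point:

$$\sum_{c \in \{C,G,T\}} P(A)\,P(c \mid A) = \sum_{c \in \{C,G,T\}} P(c)\,P(A \mid c)$$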
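The question on the slide can be explored numerically. The following is a minimal sketch, assuming that the rows of the matrix are the current nucleotide and the columns the next one (the slide does not state the orientation) and that each row should be renormalised to sum to exactly 1. The function names `step` and `stationary` are illustrative only.

```python
# Transition matrix as read from the slide, order A, C, G, T.
# Assumption: rows = current nucleotide, columns = next nucleotide.
P = [
    [0.991, 0.001, 0.001, 0.007],
    [0.003, 0.991, 0.003, 0.003],
    [0.003, 0.003, 0.990, 0.003],
    [0.007, 0.001, 0.001, 0.991],
]
# Assumption: renormalise rows so that each one sums to exactly 1.
P = [[p / sum(row) for p in row] for row in P]


def step(dist, P):
    """One step of the chain: new[a] = sum over c of dist[c] * P(a | c)."""
    n = len(dist)
    return [sum(dist[c] * P[c][a] for c in range(n)) for a in range(n)]


def stationary(P, start, tol=1e-12, max_steps=1_000_000):
    """Iterate the chain from `start` until the distribution stops changing."""
    dist = list(start)
    for _ in range(max_steps):
        new = step(dist, P)
        if max(abs(x - y) for x, y in zip(new, dist)) < tol:
            return new
        dist = new
    return dist


# Start from P(C; t=0) = 1, as in the slide.
pi = stationary(P, [0.0, 1.0, 0.0, 0.0])
print(dict(zip("ACGT", pi)))
```

Under these assumptions the result does not depend on the starting point, and A and T end up more frequent than C and G because the matrix moves probability from C/G to A/T faster than in the opposite direction.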

More later in the course (and in the second term)

Differences in substitution rates result in major changes to the stationary distribution

Nucleotide composition: human vs. mouse

[Figure: CpG dynamics]
• CpG islands (mainly at promoters): low methylation (+ selection??), normal mutation, normal #CpGs
• Flanking regions: high methylation, deamination and slow correction (CpG → TpG / CpA), small #CpGs

CpG Islands: %(G+C)>0.5 and %CpG/(%G*%C) > 0.6, for a “long” genomic interval
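As an illustration of the criterion above, here is a minimal sketch that scores a single interval. The thresholds 0.5 and 0.6 come from the slide; the minimum length of 200 bp is an assumption standing in for the slide's “long” interval, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed/expected ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_frac = (c + g) / n
    if c == 0 or g == 0:
        return gc_frac, 0.0
    # %CpG / (%C * %G), all expressed as fractions of the interval length
    ratio = (cpg / n) / ((c / n) * (g / n))
    return gc_frac, ratio


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the slide's thresholds; min_len is an assumed value for "long"."""
    if len(seq) < min_len:
        return False
    gc_frac, ratio = cpg_stats(seq)
    return gc_frac > min_gc and ratio > min_ratio
```

In practice one would slide such a test along the genome and merge adjacent passing windows; that step is left out here.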

K-mer distribution

Specialized proteins can bind DNA in a sequence specific fashion

Genomes can therefore control the level of affinity of each region to a large set of DNA binding proteins

DNA binding sites are typically short (<20bp)

Multiple binding sites at different affinities participate in regulation

The frequency of k-mer DNA words in the genome is called the k-spectrum of the genome

The k-spectrum is complex, due to multiple effects
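Counting the k-spectrum is a single pass over the sequence. The sketch below is one simple way to do it, assuming that windows containing anything other than upper-case A, C, G, T (for example N or soft-masked lower-case bases) should be skipped; `kmer_spectrum` is an illustrative name.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Count every k-mer made only of upper-case A, C, G, T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(ch in "ACGT" for ch in word):
            counts[word] += 1
    return counts


spectrum = kmer_spectrum("ACGTACGT", 4)
total = sum(spectrum.values())
frequencies = {word: n / total for word, n in spectrum.items()}
```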

[Figure: nucleotide frequencies (A, C, G, T) versus distance from insert]

Genomic information: Protein coding genes

Defining and detecting genes

Predictions: using probabilistic models
• HMM based, using landmark features
• Different among prokaryotes/eukaryotes
• Modest success, and only work for “classical genes” (protein coding)

RNA based
• Sequencing RNAs from different tissues
• Mapping to the genome using standard alignment algorithms
• Compared to known protein sequence databases
• Gold standard

EST based
• Sequencing expressed sequence tags (UniGene)
• Clustering and defining criteria for coverage
• Combine with RNA/predictions

Comparative
• Aligning gene models from one species to another based on sequence homology
• Effective for uncharacterized genomes, but of limited accuracy

Databases:

RefSeq: validated transcripts, high confidence, but incomplete
UCSC: knownGene (combining multiple sources), moderately conservative
Ensembl: combining multiple sources
Model organisms: SGD, FlyBase, WormBase

Introns/Exons

Genomic information: the gene repertoire is evolving by duplication and loss

Strand asymmetry

Polak and Arndt

Structure meets information: HOX clusters as an example

Hox genes are important developmental regulators

Present in linear clusters, preserving order

Their expression is frequently coordinated with the gene order

4 HOX clusters are present in the human genome

Additional gene clusters: Protocadherins, Olfactory receptors, MAGE genes, Zinc fingers

Additional smaller groups of related regulators are co-located

miRNA clusters

Repeats: selfish DNA

Class           Copies                           Genome fraction
LINEs           868,000 (only ~100 active!!)     20.4%
SINEs           1,558,000 (70% Alu)              13.1%
LTR elements    443,000                          8.3%
Transposons     294,000                          2.8%

Repetitive elements in the human genome

Retrotransposition via RNA

Repeats: short tandems, satellites

DNA-based transposons do not involve an RNA intermediate, and are quite rare.

Satellite DNA expands by replication slippage, which is enhanced for specific sequences. It is abundant near telomeres and centromeres. Some of these sequences are still a mystery.

Retrotransposition is generally sloppy and noisy – so elements die out quickly

Element proliferation appears in evolutionary bursts.

Pseudogenes

Genes that are becoming inactive due to mutations are called pseudogenes

mRNAs that jump back into the genome are called processed pseudogenes (they therefore lack introns)

Linear correlation

Figure: wikipedia

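$$r_{XY} = \frac{COV(X,Y)}{\sqrt{V(X)\,V(Y)}}$$

For normalized vectors:

$$r_{XY} = X \cdot Y$$

Easy to compute in one pass:

$$r_{XY} = \frac{E(XY) - E(X)E(Y)}{\sqrt{\left(E(X^2) - E(X)^2\right)\left(E(Y^2) - E(Y)^2\right)}}$$

Hard to say if meaningful. For N > 500:

$$P(|r_c| > |r|) = \mathrm{erfc}\left(\frac{|r|\sqrt{N}}{\sqrt{2}}\right)$$

Assuming binormality:

$$t = r\sqrt{\frac{N-2}{1-r^2}}$$

(distributed as Student's t with N-2 degrees of freedom when X and Y are uncorrelated)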
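The one-pass computation and the two significance measures for linear correlation can be sketched as follows. This is an illustration using only the standard library; the function names are hypothetical, and the one-pass formula can lose precision when the means are large relative to the spread of the data.

```python
import math


def pearson_one_pass(pairs):
    """Pearson r from running sums, touching each (x, y) pair once."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    ex, ey = sx / n, sy / n
    cov = sxy / n - ex * ey          # E(XY) - E(X)E(Y)
    vx = sxx / n - ex * ex           # E(X^2) - E(X)^2
    vy = syy / n - ey * ey           # E(Y^2) - E(Y)^2
    return cov / math.sqrt(vx * vy)


def large_n_pvalue(r, n):
    """Large-N approximation: erfc(|r| * sqrt(N) / sqrt(2))."""
    return math.erfc(abs(r) * math.sqrt(n / 2.0))


def t_statistic(r, n):
    """t = r * sqrt((N - 2) / (1 - r^2)), to be compared with Student's t."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


r = pearson_one_pass([(1, 2), (2, 4), (3, 6)])
```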

Spearman correlation

• Linear correlation is biased whenever you observe non-linear behavior
• Linear correlation is extremely sensitive to outliers

• Non-parametric statistics transform all values to their rank statistic (by sorting)
• Ties are broken using “mid-ranking”

• Computing correlation on the rank statistics generates the Spearman correlation
• Spearman values of independent variables are distributed just like linear correlation
• Computing the p-value is done accordingly
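The rank transform with mid-ranking, followed by an ordinary linear correlation on the ranks, can be sketched as below. The helper names `midranks`, `pearson` and `spearman` are illustrative.

```python
import math


def midranks(values):
    """1-based ranks; tied values all receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain linear correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the mid-ranks."""
    return pearson(midranks(xs), midranks(ys))
```

For a monotone but non-linear relation such as y = x², or for data with one extreme outlier that preserves the ordering, `spearman` stays at 1 while `pearson` drops below 1.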

Studying trend lines

• Strong correspondence between variables can be observed without any Spearman or Pearson correlation (can you think of an example?)

• One can try using a parametric transformation to fix such a problem
• In any case, looking at the data carefully and computing trend lines is essential

• Generating statistics on trend lines:

Fixed Span bins

Fixed Size bins

Sliding window

Weighted sliding window
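The binning schemes listed above differ only in how points are grouped along x before averaging y. The sketch below shows three of them under simple assumptions (unweighted means, bins reported by their centre or mean x); the function names are hypothetical and the weighted sliding window is omitted.

```python
def _mean(vals):
    return sum(vals) / len(vals)


def fixed_span_bins(xs, ys, n_bins):
    """Equal-width x intervals; returns (bin centre, mean y) per non-empty bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal x values
    groups = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), n_bins - 1)
        groups[b].append(y)
    return [(lo + (b + 0.5) * width, _mean(g)) for b, g in enumerate(groups) if g]


def fixed_size_bins(xs, ys, size):
    """Equal numbers of points per bin after sorting by x; returns (mean x, mean y)."""
    pts = sorted(zip(xs, ys))
    chunks = [pts[i:i + size] for i in range(0, len(pts), size)]
    return [(_mean([p[0] for p in c]), _mean([p[1] for p in c])) for c in chunks]


def sliding_window(xs, ys, size, step=1):
    """Overlapping windows of `size` consecutive points (sorted by x)."""
    pts = sorted(zip(xs, ys))
    out = []
    for i in range(0, len(pts) - size + 1, step):
        c = pts[i:i + size]
        out.append((_mean([p[0] for p in c]), _mean([p[1] for p in c])))
    return out
```

Fixed-span bins keep the x resolution constant but can be noisy where data are sparse; fixed-size bins keep the noise level constant but vary in span.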

Auto-correlation

• Computing the correlation between point x and point x-d for different values of d
• Providing clues for different scales of correlation in the data

• For example:
– Fragment lengths on arrays
– Nucleosomes

• Related method: Fast Fourier Transform (FFT)
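Auto-correlation at lag d is just the linear correlation between a signal and a copy of itself shifted by d positions. A minimal sketch, with the illustrative name `autocorrelation`:

```python
import math


def autocorrelation(values, d):
    """Pearson correlation between values[i] and values[i - d]."""
    a = values[d:]
    b = values[:len(values) - d]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# A signal with period 2: perfectly anti-correlated at lag 1,
# perfectly correlated at lag 2.
signal = [0, 1] * 20
profile = [autocorrelation(signal, d) for d in range(1, 5)]
```

Peaks in such a profile at regular lags are the kind of clue the slide refers to, for example a periodicity near the nucleosome repeat length.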

Testing difference between samples

Model based:

assume normality, compute p-value

assume binomial distribution

assume Poisson distribution

Direct comparison:

T-test to compare means

ANOVA

Chi-square to compare contingency tables

Kolmogorov Smirnov

Multi-variate cross correlation

• Pairwise correlation matrix
– Plotting – but in which order?

• Multiple testing should be controlled for
– Bonferroni’s union bound
– False discovery rate (FDR)

• Control one/few parameters
– Pros: results are robust if done right
– Cons: fewer stats

• Normalize one/few parameters

• Clustering or model based approaches (see later in the course)
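The two multiple-testing controls mentioned above can be sketched as follows. Bonferroni is the union bound named in the slide; for FDR the sketch uses the Benjamini-Hochberg step-up procedure, which is one common choice and not necessarily the one intended in the lecture. Function names and the default alpha are assumptions.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject only tests with p <= alpha / m (union bound over m tests)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= alpha * k / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


pvals = [0.001, 0.012, 0.025, 0.045, 0.5]
strict = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
```

With many thousands of pairwise correlations, Bonferroni is usually very conservative, which is why an FDR-style control is often preferred for exploratory analyses.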

Preparations:

• Download chromosome 17 of the human genome

• Download the UCSC knownGene table

• Compute the moving average G+C content for bins of 500 bp, and decide on 8 G+C content bins.

Modeling:

• Build a 3rd-order (4-mer based), G+C content dependent Markov model from all of the non-exonic sequences

Analysis:

• Compute the expected frequency of 4-mers in your genomic bins

• Compute the observed/expected ratio for all 256 4-mers on all genomic bins. Compute correlations between them.

• Develop a p-value (using any reasonable method) for the difference between observed and expected k-mer frequencies, and report the most significant p-values you discovered.

Your Task

Preparations:

- Links to the data are in the wiki

- The file format of the knownGene table is described at the UCSC site

- Reasonable G+C content bins: balance the bin span and the bin count (having very few cases in a bin will make your subsequent model unrealistic; having a large span of G+C in a bin will make your model inaccurate)
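One possible way to approach the preparation step is sketched below: compute G+C content in consecutive 500 bp windows, then choose bin edges from quantiles so that each of the 8 bins holds a similar number of windows. The quantile choice, the case-insensitive counting and all function names are assumptions for illustration; the task itself leaves the binning open.

```python
import bisect


def gc_windows(seq, window=500):
    """G+C fraction per non-overlapping window; None if no A/C/G/T present."""
    out = []
    for start in range(0, len(seq) - window + 1, window):
        w = seq[start:start + window].upper()  # assumption: count masked bases too
        acgt = sum(w.count(ch) for ch in "ACGT")
        gc = w.count("G") + w.count("C")
        out.append(gc / acgt if acgt else None)
    return out


def quantile_edges(values, n_bins=8):
    """Inner bin edges chosen so bins hold roughly equal numbers of windows."""
    vals = sorted(v for v in values if v is not None)
    return [vals[(len(vals) * i) // n_bins] for i in range(1, n_bins)]


def assign_bin(gc, edges):
    """Index (0 .. len(edges)) of the G+C bin that a value falls into."""
    return bisect.bisect_right(edges, gc)
```

Equal-count bins address the "very few cases in a bin" concern; if the resulting span of an extreme bin is too wide, fixed-width edges over the central G+C range are a reasonable alternative.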


Modeling:

Count k-mers for each G+C bin and transform the counts into a 3rd-order Markov model:

$$\Pr(c_4 \mid c_1 c_2 c_3, GC) = \frac{n_{c_1 c_2 c_3 c_4,\,GC}}{\sum_{c} n_{c_1 c_2 c_3 c,\,GC}}$$

Use fixed-size, non-overlapping bins of 20 kb for testing the observed/expected statistic.

Ignore repeat-masked sequence (lower-case characters).

If you believe a different bin size would work better, go for it (think of the expected number of each of the 256 4-mers: in a 20 kb bin it is ~100 if there is no masked sequence).
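Turning 4-mer counts into a 3rd-order Markov model amounts to dividing each count by the total count of its 3-letter prefix. The sketch below handles the counts of a single G+C bin; in the task one such table would be built per G+C bin. Function names are hypothetical, and windows containing lower-case (masked) or non-ACGT characters are skipped, as the task suggests.

```python
from collections import Counter


def count_4mers(seq):
    """Count 4-mers made only of upper-case A, C, G, T (masked bases skipped)."""
    counts = Counter()
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if all(ch in "ACGT" for ch in word):
            counts[word] += 1
    return counts


def markov_from_counts(counts):
    """Pr(c4 | c1 c2 c3) = n(c1 c2 c3 c4) / sum over c of n(c1 c2 c3 c)."""
    prefix_totals = Counter()
    for word, n in counts.items():
        prefix_totals[word[:3]] += n
    return {word: n / prefix_totals[word[:3]] for word, n in counts.items()}


model = markov_from_counts(count_4mers("AAAAAC"))
```

4-mers that never occur in the training sequence are simply absent from `model`; adding a small pseudocount before normalising is a common way to avoid zero probabilities.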


Analysis:

The expected number of appearances of a 4-mer given only the G+C content:

$$N_{bin} \cdot \Pr(c_1 c_2 c_3 c_4 \mid GC)$$

The expected number of appearances of a 4-mer given the 3rd-order Markov model and the G+C content:

$$\sum_{GC} N_{c_1 c_2 c_3,\,GC,\,bin} \cdot \Pr(c_4 \mid c_1 c_2 c_3, GC)$$

The obs/exp ratio should be studied in log scale, carefully handling low values.

Computing 256*256/2 correlations of ~1000 values should not be difficult. We will use these correlations later.

The p-value can be computed using a Z-score.
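The analysis step can be sketched as below. Several choices here are assumptions rather than part of the task: the G+C-only probability of a 4-mer is taken as a product of independent per-base probabilities (G and C each gc/2, A and T each (1-gc)/2), low values are handled with a pseudocount of 1, and the Z-score treats the count as approximately Poisson so that its variance equals the expectation. All function names are hypothetical.

```python
import math


def expected_gc_only(word, gc, n_positions):
    """Expected count of `word` among n_positions windows, given only G+C.
    Assumption: bases are independent, G/C each gc/2, A/T each (1-gc)/2."""
    p = 1.0
    for ch in word:
        p *= gc / 2.0 if ch in "GC" else (1.0 - gc) / 2.0
    return n_positions * p


def log_ratio(obs, exp, pseudo=1.0):
    """log2(obs/exp) with a pseudocount to tame low counts (assumed value)."""
    return math.log2((obs + pseudo) / (exp + pseudo))


def z_score(obs, exp):
    """Z-score assuming a Poisson-like count: variance equals the mean."""
    return (obs - exp) / math.sqrt(exp)


def two_sided_p(z):
    """Two-sided p-value of a Z-score under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))


exp_count = expected_gc_only("ACGT", 0.5, 25600)
z = z_score(130, exp_count)
p = two_sided_p(z)
```

Because 256 4-mers are tested across roughly 1000 bins, the resulting p-values should be corrected for multiple testing before the most significant ones are reported.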
