computational analysis of promoters


Upload: gaetan

Post on 24-Feb-2016


TRANSCRIPT

Page 1: Computational  analysis of PromoterS

COMPUTATIONAL ANALYSIS OF PROMOTERS

Page 2: Computational  analysis of PromoterS

Gene regulation
• Genomes usually contain several thousand different genes.
• Some gene products are required by the cell under all growth conditions; the genes that encode them are called housekeeping genes.
  • e.g. genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …
• Many other gene products are required only under specific growth conditions.
  • e.g. enzymes responding to a specific environmental condition such as DNA damage

Page 3: Computational  analysis of PromoterS

Gene regulation
• Housekeeping genes must be expressed at some level all of the time.
  • Frequently, as the cell grows faster, more of the housekeeping gene products are needed.
• The gene products required for specific growth conditions are not needed all of the time.
  • These genes are frequently expressed at extremely low levels, or not at all, when they are not needed, and are switched on when they are needed.
• Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or external conditions.

Page 4: Computational  analysis of PromoterS

Gene regulation
Gene regulation occurs at three main levels:
1. transcriptional regulation
  • transcription of the gene is regulated
  • control of transcription initiation is the most important control mechanism
2. translational regulation
  • translation of the gene is regulated
  • How often the mRNA is translated influences the amount of gene product that is made.
3. post-transcriptional/post-translational regulation
  • regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)

Page 5: Computational  analysis of PromoterS

Transcriptional regulation
• Transcription control has two key features:
  1. protein-binding regulatory DNA sequences (control elements) are associated with genes
  2. specific proteins that bind to these regulatory sequences determine where transcription starts, and either activate or repress it
• The DNA sequence specifying where RNA polymerase binds and initiates transcription of a gene is called a promoter.
• Transcription from a particular promoter is controlled by DNA-binding proteins termed transcription factors.
• The DNA control elements that bind transcription factors may be located very far from the promoter they regulate.

Page 6: Computational  analysis of PromoterS

Three different polymerases
• As a result of this arrangement, transcription from a single promoter may be regulated by binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression.
• RNA polymerase I synthesizes rRNA.
• RNA polymerase II synthesizes mRNA.
• RNA polymerase III synthesizes small RNAs and tRNA.

Page 7: Computational  analysis of PromoterS

source: Molecular Biology of the Cell. 4th edition. Alberts B

Page 8: Computational  analysis of PromoterS

Three parts of promoter
• core promoter
  • responsible for the actual binding of the transcription apparatus
  • very close upstream of the transcription start site (TSS), ~35 bp; may also extend downstream, see later
• proximal promoter
  • contains several regulatory elements
  • a few hundred bases upstream of the TSS
• distal promoter
  • contains enhancers (upstream/downstream) and silencers
  • They are cis-acting: a cis-element regulates a gene on the same DNA molecule. cis-acting sequences are bound by trans-acting (i.e. acting from a different molecule) regulatory proteins.
• However, the distinction between proximal elements and enhancers/silencers is not very clear.

Page 9: Computational  analysis of PromoterS

Core promoter
• Eukaryotic RNAPII is not by itself capable of initiating transcription in vitro.
• It needs to be supplemented by general (basal) transcription factors (GTFs).
  • The factors are named TFIIX, where X is a letter, e.g. TFIIA, TFIIB, …
• RNAPII + GTFs form the pre-initiation complex (PIC). Only then can transcription commence.
• minimal (core) promoter: the DNA sequence sufficient for assembly of the pre-initiation complex.
• Transcription initiated by the core promoter is called basal transcription.

Page 10: Computational  analysis of PromoterS

Core promoter elements
• The core promoter is usually located proximal to or overlapping the TSS.
• It contains several sequence motifs, with which TFs interact in a sequence-specific manner.
• The combination of TF-binding motifs varies depending on the gene.

Page 11: Computational  analysis of PromoterS

Core promoter elements
• TATA box: ~30 bp upstream of the TSS, consensus TATA(A/T)A(A/T)
• Instead of a TATA box, some eukaryotic (TATA-less) genes contain an initiator (Inr), which surrounds the TSS and has an extremely degenerate consensus sequence YYAN(T/A)YY (A = TSS, Y = C or T, N = any nucleotide)
• Promoters with both TATA and Inr also exist.
• DPE (downstream promoter element)
  • Present in some TATA-less, Inr-containing (TATA-, Inr+) promoters, ~30 bp downstream of the TSS.
  • consensus: RGWCGTG (R = A or G, W = A or T)

Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002; 16 (20):2583-92.
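A degenerate consensus such as TATA(A/T)A(A/T) can be searched for by translating it into a regular expression. The following is only a minimal sketch: the helper names (`consensus_to_regex`, `find_motif`) and the toy sequence are made up for illustration, and the consensus is written with the IUPAC code W for (A/T). It scans one strand only and reports exact matches to the consensus.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regex."""
    return re.compile("".join(IUPAC[ch] for ch in consensus))

def find_motif(seq, consensus):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A lookahead lets the regex engine report overlapping occurrences.
    lookahead = re.compile("(?=" + consensus_to_regex(consensus).pattern + ")")
    return [m.start() for m in lookahead.finditer(seq.upper())]

TATA_BOX = "TATAWAW"  # TATA(A/T)A(A/T) written with IUPAC W
DPE = "RGWCGTG"

seq = "GGCTATAAAAGGCCAGTCAGACGTGCC"  # made-up toy sequence, not a real promoter
tata_hits = find_motif(seq, TATA_BOX)
dpe_hits = find_motif(seq, DPE)
print("TATA box at", tata_hits, "DPE at", dpe_hits)
```

Because the motifs are short and degenerate, a scan like this on real genomic DNA returns many hits that are not functional elements.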

Page 12: Computational  analysis of PromoterS
Page 13: Computational  analysis of PromoterS

Promoter proximal elements
• Found within 100 to 200 bp of the TSS.
• CAAT (CCAAT, CAT) box: consensus GGCCAATCT
• GC box: consensus (G/T)(G/A)GGCG(G/T)(G/A)(G/A)(C/T)
  • It is a GC-rich segment.
  • A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.

Page 14: Computational  analysis of PromoterS

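A hypothetical mammalian promoter region

[Figure: schematic of a gene with its control elements. Labels: enhancers, promoter-proximal element, TATA box, +1 (TSS), exon, intron. Position markers: -10 to -50 kb, -200, -30, +10 to +50 kb.]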

Page 15: Computational  analysis of PromoterS

CpG island
• Transcription of genes with TATA/Inr promoters begins at a well-defined site.
• However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long.
  • As a result, such genes give rise to mRNAs with multiple alternative 5’ ends.
  • These are housekeeping genes; they contain neither a TATA box nor an Inr.
• Most genes of this type contain a CG-rich stretch of several hundred nucleotides, a CpG island, within ≈100 base pairs upstream of the TSS.
• CpG islands are typical of vertebrates (including humans). They are not common in lower eukaryotes.

Page 16: Computational  analysis of PromoterS

CpG island

• Computational analysis is based on CG dinucleotide imbalance.
• length min 200 bp, C+G content min 50%, CpG o/e min 0.6
  M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, 261-282.
• length min 500 bp, C+G content min 55%, CpG o/e min 0.65
  D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 99, 3740-45.

[Figure: CpG island, multiple 5’ start sites, mRNA; distance marker ~100 bp.]

Page 17: Computational  analysis of PromoterS

CpG island

• Simple methods based on the frequency of CG perform remarkably well at correctly predicting regions containing TSSs.
• EMBOSS CpGPlot/CpGReport - http://www.ebi.ac.uk/Tools/emboss/cpgplot/
• CpG Island Searcher - http://cpgislands.usc.edu/ (IE only)

Example window: len=251, #C=76, #G=101, #CG=30, CpG o/e = (#CG × len) / (#C × #G) = 0.98
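A window-based CpG scan can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm of CpGPlot or CpG Island Searcher: the function names are made up, the observed/expected ratio is taken as (#CG × length) / (#C × #G), and the default thresholds (200 bp window, C+G at least 50%, o/e at least 0.6) are the values commonly quoted for the Gardiner-Garden and Frommer criteria, exposed here as parameters. Real tools also merge neighbouring windows into islands, which this sketch does not do.

```python
def cpg_stats(window):
    """Return (C+G fraction, CpG observed/expected ratio) for a DNA string."""
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    cg = w.count("CG")  # CG cannot overlap itself, so a plain count is exact
    gc_fraction = (c + g) / n if n else 0.0
    oe = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, oe

def cpg_window_starts(seq, size=200, step=1, min_gc=0.5, min_oe=0.6):
    """0-based starts of all windows that satisfy both thresholds."""
    starts = []
    for i in range(0, len(seq) - size + 1, step):
        gc, oe = cpg_stats(seq[i:i + size])
        if gc >= min_gc and oe >= min_oe:
            starts.append(i)
    return starts

# Synthetic test sequence (not real DNA): a CG-rich block flanked by AT repeats.
toy = "AT" * 150 + "CG" * 100 + "AT" * 150
island_starts = cpg_window_starts(toy)
print("first/last qualifying window start:", island_starts[0], island_starts[-1])
```

On the synthetic sequence, only windows that overlap the CG block by at least half their length pass the C+G threshold.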

Page 18: Computational  analysis of PromoterS

Promoter regions in human genes

Element                  Share of promoters
TATA                     32%
Inr                      85%
GC box                   97%
CAAT box                 64%
located in CpG island    48%

TATA/Inr combination     Share of promoters
TATA+ Inr+               28%
TATA+ Inr-               4%
TATA- Inr+               56%
TATA- Inr-               12%

Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11(5):677-84.

Page 19: Computational  analysis of PromoterS

Computational analysis of promoters

Page 20: Computational  analysis of PromoterS

Introduction
• Regulatory regions typically contain several transcription factor binding sites strung out over a large region.
• Which particular factor binds depends not only on the binding site, but also on which factors are available for binding in a given cell type at a given time.
• Any given gene typically has its own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is transcribed only in the proper cell type(s) and at the proper time during development.

Page 21: Computational  analysis of PromoterS

Introduction
• Transcription factors themselves are subject to similar transcriptional regulation, thereby forming transcriptional cascades and feedback control loops.
• While all this is very nice and interesting from a biologist’s point of view, it spells big trouble for promoter prediction.

Page 22: Computational  analysis of PromoterS

Computational difficulties
• There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.
• Any given sequence element might be recognized by different factors in different cell types.
• Core promoter regulatory elements are short and not completely conserved, so similar elements will be found purely by chance all over the genome.
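How often a short degenerate motif turns up by chance can be estimated with simple arithmetic. The sketch below rests on strong simplifying assumptions that are not from the slides: bases are independent and equally frequent (probability 1/4 each), the genome length of 3 × 10^9 bp is a round figure for a human-sized genome, and the function names are hypothetical.

```python
# Number of bases allowed by each IUPAC code.
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1,
              "R": 2, "Y": 2, "W": 2, "S": 2, "K": 2, "M": 2, "N": 4}

def match_probability(consensus):
    """Probability that a random position matches the consensus (uniform bases)."""
    p = 1.0
    for ch in consensus:
        p *= DEGENERACY[ch] / 4.0
    return p

def expected_hits(consensus, genome_length, both_strands=True):
    """Expected number of chance matches in a random sequence of the given length."""
    positions = genome_length - len(consensus) + 1
    strands = 2 if both_strands else 1
    return positions * match_probability(consensus) * strands

p_tata = match_probability("TATAWAW")           # TATA(A/T)A(A/T)
hits = expected_hits("TATAWAW", 3_000_000_000)  # assumed genome size
print(f"p = 1/{round(1 / p_tata)}, expected chance hits = {hits:,.0f}")
```

Under these assumptions the TATA consensus matches about once every 4096 positions, which gives well over a million chance matches across both strands, far more than the number of genes.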

Page 23: Computational  analysis of PromoterS

What do promoter prediction methods actually predict?
• The first nucleotide copied at the 5’ end of the corresponding mRNA: the transcription start site (TSS)
  • the region around the TSS is often referred to as the core promoter
• Owing to the strong link between the TSS and the core promoter, these terms are often used interchangeably.
• Three distinct types of promoter prediction, based on:
  1. signal features
  2. context features
  3. structure features

Page 24: Computational  analysis of PromoterS

Evaluating predictions
• sensitivity (Se), recall, true positive rate (TPR)
  • proportion of experimental TSSs that are correctly predicted: Se = TP / (TP + FN)
• positive predictive value (PPV), precision
  • proportion of all counted positive predictions that are correct: PPV = TP / (TP + FP)

Page 25: Computational  analysis of PromoterS

Evaluating predictions
• And how to obtain FP, FN, and TP?
• You have a gene sequence for which you know the TSS location, and you make your prediction.
• If the prediction falls within the region [-2000, +2000] relative to the annotated TSS, you have a TP.
• Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs.
• If you predict no promoter for this gene sequence, you have an FN.
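The counting rules above can be turned into a small scoring routine that also computes Se = TP / (TP + FN) and PPV = TP / (TP + FP). This is a sketch with several assumptions of its own: all genes lie on the forward strand, at most one TP is counted per gene, and a gene counts as an FN whenever no prediction lands in the [-2000, +2000] window (which includes the case of no prediction at all). The function names and the toy gene list are invented for the example.

```python
def score_gene(predictions, tss, gene_end, window=2000):
    """Return (TP, FP, FN) counts for one forward-strand gene."""
    hits = [p for p in predictions if -window <= p - tss <= window]
    # Predictions inside the gene body but beyond the window are false positives.
    false_pos = [p for p in predictions if window < p - tss <= gene_end - tss]
    tp = 1 if hits else 0
    fn = 0 if hits else 1
    return tp, len(false_pos), fn

def sensitivity(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def ppv(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

# Toy annotation and predictions (made-up coordinates).
genes = [
    {"tss": 10_000, "end": 60_000, "pred": [9_500]},          # TP
    {"tss": 5_000,  "end": 40_000, "pred": [5_100, 30_000]},  # TP + FP
    {"tss": 20_000, "end": 80_000, "pred": []},               # FN
    {"tss": 1_000,  "end": 25_000, "pred": [15_000]},         # FP + FN
]

TP = FP = FN = 0
for gene in genes:
    tp, fp, fn = score_gene(gene["pred"], gene["tss"], gene["end"])
    TP, FP, FN = TP + tp, FP + fp, FN + fn

se = sensitivity(TP, FN)
precision = ppv(TP, FP)
print(f"TP={TP} FP={FP} FN={FN} Se={se:.2f} PPV={precision:.2f}")
```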

Page 26: Computational  analysis of PromoterS

Signal features
• Recognize “conserved” signals such as the TATA box, Inr, DPE, BRE (TFIIB recognition element), etc.
• Such motifs are highly variable and degenerate, which leads to a high false positive rate.
• Methods based on core promoter elements and other specific TFBSs (e.g. the CAAT box) are far from accurate.
• A much more reliable signal is the CpG island. However, only ≈50% of human genes contain CpG islands.
• CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.

Page 27: Computational  analysis of PromoterS

Context features
• Extracted from the genomic context of promoters
• Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples.
• n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG)
• The n-mer representation encodes contextual information about promoters and has the following advantages:
  • contextual information is independent of any biological signals
  • the distribution of n-mers may have biological significance (TFBS, CpG)
  • n-mers may reveal details of yet unknown promoter regions
• n-mers reduce the FPR while maintaining a relatively high TPR (i.e. Se)
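As an illustration of the n-mer representation, the sketch below turns a sequence into a fixed-length vector of n-mer frequencies that a classifier could take as input. The function names and the toy sequence are assumptions made for this example; it does not reproduce how any particular predictor builds its features.

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_frequency_vector(seq, k):
    """Frequencies of all 4**k possible k-mers, in lexicographic ACGT order.

    k-mers containing other symbols (e.g. N) get no slot in the vector.
    """
    counts = kmer_counts(seq, k)
    total = max(len(seq) - k + 1, 1)
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]

toy = "CGGCGCG"  # made-up GC-rich fragment
counts = kmer_counts(toy, 2)
vec = kmer_frequency_vector(toy, 2)
print(counts.most_common(3), len(vec))
```

For k = 6, as used by some of the tools mentioned in this deck, the vector has 4096 entries.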

Page 28: Computational  analysis of PromoterS

Structure features
• They originate from the DNA 3D structures that characterize proximal promoters.
• DNA encodes in its sequence at least two independent levels of functional information:
  • DNA sequence: encodes proteins and their regulatory elements.
  • Physical and structural properties of the DNA itself.
• Examples:
  • dinucleotide properties: stacking energy, propeller twist
  • trinucleotide properties: bendability, nucleosome position preference
• They have long-range interactions (up to 10 kbp), so they can exhibit properties not visible in the sequence.
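A structural profile is typically obtained by looking up a property value for each dinucleotide (or trinucleotide) step and smoothing along the sequence. The sketch below shows only the mechanics. The lookup table is a deliberately artificial stand-in (the number of A/T bases in each dinucleotide), NOT measured stacking energies or twist values; a published property table would have to be substituted for real use. The function name and window size are also assumptions.

```python
# Stand-in "property" table: count of A/T bases per dinucleotide (0, 1 or 2).
# These are placeholder numbers for illustration, not physical measurements.
TOY_TABLE = {a + b: (a in "AT") + (b in "AT") for a in "ACGT" for b in "ACGT"}

def property_profile(seq, table, window=5):
    """Sliding-window average of a dinucleotide property along the sequence."""
    seq = seq.upper()
    values = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

toy = "GGGGTATAAAGGGG"  # made-up sequence with an AT-rich core
profile = property_profile(toy, TOY_TABLE)
peak = profile.index(max(profile))
print("profile:", profile, "peak at window", peak)
```

A threshold applied to such a profile is the simplest possible predictor built on structure features.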

Page 29: Computational  analysis of PromoterS

Model for cooperative assembly of an activated transcription-initiation complex.

This figure shows why structural features such as flexibility are important.

Molecular Cell Biology. 4th edition. Lodish H, Berk A, Zipursky SL, et al. New York: W. H. Freeman; 2000.
Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003; 17(10):1228-37.

Page 30: Computational  analysis of PromoterS

Software
Signal features (two leading CpG predictors)
• FirstEF: different quadratic discriminant functions for CpG and non-CpG promoters; slightly improves performance by concentrating on regions around the first exon
• Eponine: TATA and G+C-rich domain, Relevance Vector Machine
Context features
• PromoterInspector: IUPAC word groups with wildcards
Structure features
• McPromoter: DNA sequence, bending, DNA twist, artificial neural network (ANN)
• EP3: features from [1]; prediction is based just on a threshold imposed on the structural profile.
[1] Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255

Page 31: Computational  analysis of PromoterS

Integrated approaches
• combine sequence, context and structural features
• ARTS: SVM with sophisticated kernels; combines n-mers with structure features (e.g. twist angle, stacking energies)
  • does not distinguish CpG-related promoters from unrelated ones; it is not clear how it performs on non-CpG promoters
• SCS: sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models, whose outcomes are combined by a decision tree
• CoreBoost: boosting technique with stumps; integrates core promoter signals, DNA flexibility, n-mer frequency, …
  • CoreBoost_HM adds experimental histone modification data

Page 32: Computational  analysis of PromoterS

Boosting, stumps
• Boosting
  • Belongs to the ensemble methods, which produce a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate (i.e. just a bit better than random guessing) rules (weak learners, WLs).
  • Iteratively learn weak classifiers and add them to a final strong classifier.
  • When a WL is added, it is weighted based on its accuracy.
  • After a WL is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight.
  • Thus, future WLs focus more on the examples that previous WLs misclassified.
• Stump
  • One-level decision tree (i.e. it has one root and two terminal nodes)

source: wikipedia
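The boosting loop described above can be written out compactly. The sketch below implements the classic AdaBoost update with decision stumps as weak learners; it is a generic illustration, not the training procedure of CoreBoost, and the one-dimensional toy data set and all function names are assumptions. Labels are +1/-1.

```python
import math

def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity

def adaboost(X, y, rounds=3):
    """Train an ensemble of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # weight of this weak learner
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, rounds=3)
correct = sum(predict(ensemble, row) == t for row, t in zip(X, y))
print(f"{correct}/{len(y)} training examples classified correctly")
```

A single stump misclassifies three of the ten toy examples; the weighted vote of three stumps classifies all of them, which is the point of combining weak learners.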

Page 33: Computational  analysis of PromoterS

Databases
• EPD: Eukaryotic Promoter Database
  • http://epd.vital-it.ch
  • manually annotated non-redundant collection of eukaryotic Pol II promoters
• DBTSS
  • http://dbtss.hgc.jp/
  • putative core promoter: e.g. -100 bp … +50 bp, -250 bp … +50 bp, -200 bp … +200 bp

Page 34: Computational  analysis of PromoterS

Current state of promoter prediction
• CpG island promoters are easier to predict than non-CpG promoters.
  • CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So if a biologist wants to up- or down-regulate expression and you report that the gene has a CpG island promoter, the biologist is usually not happy.
• Non-CpG island promoters correspond to tissue-specific expression, and they are the bottleneck in accurate promoter prediction.
  • The best way to handle them is to use transcription data. Alignment of the 5’ ends of ESTs or full-length cDNAs can indicate the promoter sequence. However, conventional cDNAs often lack the complete 5’ UTR. This is overcome by newer mRNA cap cloning techniques, as used in DBTSS.

Page 35: Computational  analysis of PromoterS

Future directions
• False positives are still the main problem.
  • This is because information about chromatin structure is missing from prediction models.
  • Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and leads to more FPs (and FNs because of the extra noise).
• Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA.

source: http://www.nida.nih.gov/NIDA_notes/NNvol21N4/gene.html