introduction to the eukaryotic promoter database (epd) and signal search analysis (ssa) giovanna...

27
Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence Motif Discovery, November 10th 2006. The Linnaeus Centre for Bioinformatics, SLU- UU, Sweden.

Upload: victoria-leonard

Post on 23-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis

(SSA)

Giovanna Ambrosini Christoph Schmid

Workshop on Regulatory Sequence Motif Discovery,November 10th 2006.The Linnaeus Centre for Bioinformatics, SLU-UU, Sweden.

Page 2: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Wasserman 5, 276-287 (2004)

Distal transcription-factor binding sites (enhancer)

cis-regulatory modules

Components of transcriptional regulation

Page 3: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

EPDThe Eukaryotic Promoter Database

Current Release 88 (SEPT-2006)

• founded in 1986 (Bucher and Trifonov; Nucleic Acids Res, 14, 10009-10026)

• originally exclusively based on literature, carefully maintained and regularly updated

• in recent years started with consideration of mass sequencing data

• aim at high precision of mapping of transcription start site (+/- 5bp)

• promoter sequences of 139 different species, still relatively low coverage (i.e. 1871

human entries)

• format of annotation of TSS:

DR EMBL; ZZ999999.1; HS28BP; [-19, 9]. -15 -10 -5 0 5 ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ‘ a c c c g c c t g c a c c c g a t t c A T G T G A G A A

• one or several alternative transcription start sites per gene

Page 4: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

EPD formatID HS_RPS3 standard; multiple; VRT.XXAC EP74176;XXDT 10-JAN-2003 (Rel. 73, created)DT 13-SEP-2004 (Rel. 80, Last annotation update).XXDE Ribosomal protein S3.OS Homo sapiens (human).XXHG none.AP none.NP none.XXDR GENOME; NT_033927.7; NT_033927; [-5333322, 12577805]. [ ENSEMBL; UCSC HapMap ]DR CLEANEX; HS_RPS3.DR EMBL; AP000744.4; [-90138, 35862]. [ EMBL; GenBank; DDBJ ]DR SWISS-PROT; P23396; RS3_HUMAN.DR RefSeq; NM_001005 [ DBTSS ].DR MIM; 600454.

Page 5: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

10 bp

Fre

quen

cy o

f fu

ll-l

engt

h tr

ansc

ript

s

Genomic positionR84046905-84046987

R84047148-84047231

45 bp

TSS determined by modelling Gaussian distributions (MADAP)

The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Schmid, C.D., Praz, V., Delorenzi, M., Perier, R. and Bucher, P. (2004) Nucleic Acids Res, 32, D82-85.

Page 6: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

[-10;10] [-400;400]EPD 70 0.83 136

RefSeq mRNA 0.32 0.95933

Genome annot. 0.31 0.95890

DBTSSv1 (human) 0.13 0.68933

Eponine 0.12 0.46494

Page 7: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Superior precision of in silico primer extension (ISPE)

Page 8: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

New data sources for EPDChIP-chipKim et al. (2005) Nature, 436, 876-880 GEO: GSE2672 (remapped!)

ENSEMBL chro12:6.8 – 6.94 Mb

virt

ual c

ount

s(2

** lo

g ra

tio)-

1

Page 9: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Distribution of TSS

Genomic position

Fre

qu

en

cy

6831200 6831400 6831600 6831800 6832000

0.0

0.5

1.0

1.5

2.0

2.5

3.0

FP Hs USP5 :+R EU:NC_000012.10 1+ 6831557; 74339.

ChIP-chip data with insufficient resolution

Page 10: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

EPD webserver: http://www.epd.isb-sib.ch/

• find EPD entry(-ies) using gene symbols,...– extraction of promoter sequences in user-defined

ranges – direct transfer to Signal Sequence Analysis (SSA)

• download of complete (reference!) promoter sets http://www.epd.isb-sib.ch/seq_download.html

Page 11: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

SSA

Signal Search AnalysisGiovanna Ambrosini

ISREC Swiss Institute for Experimental Cancer Research

History: Signal Search Analysis is a method developed by P Bucher in the early eighties (Bucher, P. and Bryan B., E.N.; Nucleic Acids Res, v.12(1 Pt 1): 287–305)

Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences.

Signal search analysis programs:

1. CPR: generates a “constraint profile” for the neighborhood of a functional site

2. SList: generates lists of over and under-represented motifs in particular regions relative to a

functional site

3. OProf: generates a “signal occurrence profile” for a particular motif

4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif

Recent events: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites

Page 12: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Locally Over-represented Sequence Motifs

Page 13: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Definition of a Locally Over-represented Sequence Motif

Components of the formal motif description

1. A weight matrix or consensus sequence defining the motif

2. A cut-off value determining which subsequence constitutes a motif match

3. A preferred region of occurrence defined by 5’ and 3’ borders relative to a

functional site, e.g. a transcription initiation site

Concept

A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position

Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites

Page 14: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Locally Over-represented Sequence Motifs

Input Data Structure

Work data

Primary experimental data

(Functional Position Set)

annotated functional positions in DNA

sequences stored in a database

A DNA sequence matrix

a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position

Page 15: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

The Motif Search Problem

Statement

For a given DNA sequence matrix

find locally optimal combination of

using a given quality criterion

Quantitative motif description Cut-off value Region of preferential occurrence

Page 16: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

TATA-box Signal Occurrence Profile for EPD and ENSEMBL Drosophila Promoters

Page 17: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

CCAAT-box Signal Occurrence Profile for Vertebrate and ENSEMBL Drosophila Promoters

Page 18: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

SSA webserver: http://www.isrec.isb-sib.ch/ssa

Provides access to precompiled functional position sets

Collections of transcription initiation sites (promoters) from

eukaryotic species

Collections of translation initiation sites from large variety of

prokaryotic genomes

Provides access to the four signal search analysis programs

Page 19: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Application to a bacterial translational control signal: the Shine-Dalgarno ribosome binding-site motif

Compare the strength and location of the Shine-Dalgarno mRNA-rRNA interaction motif in E. coli and B. subtilis in a qualitative manner.

Result: the Shine-Dalgarno interaction motif is stronger in B. subtilis than in E .coli and centered about two bases further upstream in the former species. More than hundred bacterial genomes are now available to perform this type of analysis.

Page 20: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Studying transcription regulatory processes with specialized bioinformatics resources – and example

Biological question:

Do genes that are generally up-regulated in cancer cells have different types of promoters?

Procedure:

Define cancer up- and down-regulated gene sets using CleanExExtract corresponding promoter regions from EPDAnalyse the signal content of the two promoter sequence sets using SSA

Page 21: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Comparative analysis of cancer up- and down-regulated promoters

Signals considered:

Initiator preferred position approx. frequency

Initiator 0 25% - 50%

TATA-box -30 to -25 ~30%

GC-box -200 to 0 ~50%

CCAAT-box -200 to -50 ~20%

Page 22: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Positional distribution of Initiator motif in cancer up- and down-regulated promoters

Page 23: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Positional distribution of TATA-boxes in cancer up- and down-regulated promoters

Page 24: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Positional distribution of GC-boxes in cancer up- and down-regulated promoters

Page 25: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters

Page 26: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Comparative analysis of cancer up- and down-regulated promoters: Summary of results

Signal content

Initiator Frequency in Frequency incancer-up genes cancer-down genes

Initiator no change no changeTATA-box up downGC-box no change no changeCCAAT-box up down

Next questions:

Are TATA-box and CCAAT-box binding factors disregulated in cancer cells ?Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact withTATA-box and CCAAT-box binding factors?

Page 27: Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA) Giovanna Ambrosini Christoph Schmid Workshop on Regulatory Sequence

Concluding remarks

Signal search analysis has played an instrumental role in the characterization of eukaryotic promoter elements

The method has originally been developed for the analysis of eukaryotic promoters but has a much broader application potential (e.g. Shine-Dalgarno signal analysis)

Rapidly growing collection of complete genomes and high-throughput methods for genomic analysis increase the statistical power to discover new motifs, or better characterize already known control signals

Aligning sequence sets with respect to a well characterized motif might allow the detection of binding sites of cooperating transcription factors positionally correlated with the known motif

Confirm or challenge commonly accepted hypotheses originally derived from small sets