searching for transcription start sites in drosophila wilson leung08/2015

48
Searching for Transcription Start Sites in Drosophila Wilson Leung 08/2015

Upload: shauna-atkins

Post on 14-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Searching for Transcription Start Sites in Drosophila

Wilson Leung08/2015

Outline

Transcription start sites (TSS) annotation goals

Promoter architecture in D. melanogaster

Find the initial transcribed exon

Annotate putative transcription start sitesRNA PolII ChIP-SeqSearch for core promoter motifs

Additional TSS annotation resources

Structure of a typical mRNA

Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10.

Muller F element, heterochromatic, and euchromatic genes show similar expression

levels

Riddle NC, et al. Enrichment of HP1a on Drosophila chromosome 4 genes creates an alternate chromatin structure critical for regulation in this heterochromatic domain. PLoS Genet. 2012 Sep;8(9):e1002954.

Annotate transcription start sites (TSS)

Research goal:Identify motifs that enable Muller F element genes to function within a heterochromatic environment

Annotation goals:Define search regions enriched in regulatory motifs

Define precise location of TSS if possible

Define search regions where TSS could be found

Document the evidence used to support the TSS annotations

Estimated evolutionary distances with respect to D. melanogaster

D. melanogasterD. simulansD. sechelliaD. yakubaD. erectaD. ficusphilaD. eugracilisD. biarmipesD. takahashiiD. elegansD. rhopaloaD. kikkawai

D. ananassaeD. bipectinata

D. pseudoobscuraD. persimilisD. willistoni

D. mojavensisD. virilisD. grimshawi

New species sequenced by modENCODE

Species Substitutions per neutral

site

D. ficusphila 0.80

D. eugracilis 0.76

D. biarmipes 0.70

D. takahashii 0.65

D. elegans 0.72

D. rhopaloa 0.66

D. kikkawai 0.89

D. bipectinata 0.99Data from table 1 of the modENCODE comparative genomics whitepaper

GEP annotation projects

TSS annotation resourcesWalkthroughs:

Annotation of Transcription Start Sites in Drosophila

Reference:TSS Annotation Workflow

GEP Annotation Report:TSS Report Form (sample report for onecut)

Additional evidence tracks:RNA PolII ChIP-Seq (D. biarmipes only)D. mel TranscriptsConservation (Multiz alignments)

TSS annotation workflow1. Identify the ortholog

2. Note the gene structure in D. melanogaster

3. Annotate the coding exons

4. Classify the type of core promoter in D. melanogaster

5. Annotate the initial transcribed exon

6. Identify any core promoter motifs

7. Define TSS positions or TSS search regions

Evidence for TSS annotations(in general order of importance)

1. Experimental dataRNA-SeqRNA Pol II ChIP-Seq (D. biarmipes only)

2. ConservationType of TSS (peaked versus broad) in D. melanogasterSequence similarity to initial exon in D. melanogasterSequence similarity to other Drosophila species (Multiz)

3. Core promoter motifsInr, TATA box, etc.

Challenges with TSS annotations

Fewer constraints on untranslated regions (UTRs)

UTRs evolve more quickly than coding regionsOpen reading frames, compatible phases of donor and acceptor sites do not apply to UTRs

Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs

Most gene finders do not predict UTRs

Lack of experimental dataCannot use RNA-Seq data to precisely define the TSS

RNA-Seq biases introduced by library construction

cDNA fragmentationStrong bias at the 3’ end

RNA fragmentationMore uniform coverage

Miss the 5’ and 3’ ends of the transcript

Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.

Gene Span

RN

A-S

eq

Read

Cou

nt

5’ 3’

Techniques for finding TSS

Identify the 5’ cap at the beginning of the mRNA

Cap Analysis of Gene Expression (CAGE)RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)Cap-trapped Expressed Sequence Tags (5’ ESTs)

More information on these techniques:Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012 786:181-200.

Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36.

modENCODE TSS annotations

TSS predictions based on CAGE, RLM-RACE, and 5’ EST data

Results lifted from Release 5 to the Release 6 assembly

Two sets of modENCODE TSS predictionsTSS (Celniker)

Most recent dataset produced by modENCODE

TSS (Embryonic)Older dataset available from FlyBase GBrowse

Use TSS (Celniker) dataset as the primary evidence

Revised TSS analysis for Release 6 not yet available

Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92

FlyBase release 6 assembly

First change in the D. melanogaster assembly since 2005Substantial changes in the heterochromatic regions

modENCODE RNA-Seq data have been re-analyzed relative to the release 6 assembly

Most of the modENCODE datasets are still in release 5

Gnomon predictions for eight Drosophila species

Based on RNA-Seq data from either the same or closely-related species

D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis

Predictions include untranslated regions and multiple isoforms

Records not yet available through the NCBI RefSeq database

RNA Polymerase II core promoter

Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.

Initiator motif (Inr) contains the TSS

TFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC)

Peaked versus broad promoters

Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.

Peaked promoter(Single strong TSS)

Broad promoter(Multiple weak TSS)

50-300 bp

Promoter architecture in Drosophila

Classify core promoter based on the Shape Index (SI)

Determined by distribution of CAGE and 5’ RLM-RACE reads

Shape index is a continuum

Most promoters in D. melanogaster contain multiple TSS

Median width = 162 bp

~70% of vertebrate genes have broad promoters

Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

Genes with peaked promoters show stronger spatial and tissue specificity

46% of genes with broad promoters are expressed in all stages of embryonic development

19% of genes with peaked promoters are expressed in all stages

Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

Peaked and broad promoters are enriched in different core promoter motifs

Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.

Using modENCODE data to classify the type of core promoter in D.

melanogaster Only a subset of the modENCODE data are available through FlyBase

Evidence tracks on the D. melanogaster GEP UCSC Genome Browser [July 2014 (BDGP R6) assembly]:

FlyBase gene annotations (release 6.06)modENCODE TSS (Celniker) annotationsDNase I hypersensitive sites (DHS)9-state and 16-state chromatin modelsRNA Pol II and CBP ChIP-SeqTranscription factor binding site (TFBS) HOT spots

9-state chromatin model

Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.

DNaseI Hypersensitive Sites (DHS) correspond to accessible regions

Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62.

Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.

Classifying the D. melanogaster core promoter for each unique TSS

PeakedSingle annotated TSS with a single DHS position

IntermediateSingle annotated TSS with multiple DHS positions within a 300bp window

BroadMultiple annotated TSS with multiple DHS positions within a 300bp window

Insufficient evidenceNo TSS annotations or DHS positions

DEMOClassifying the core promoter of D. melanogaster Rad23

Additional DHS data from different stages of embryonic development

Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.

DHS data produced by the BDTNP project

Evidence tracks:Detected DHS Positions (Embryos)

DHS Read Density (Embryos)

Determine the gene structure in D. melanogaster

FlyBase: GBrowse UTR CDS

Gene Record Finder: Transcript Details

Identify the initial transcribed exon using NCBI blastn

Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder

Use placement of the flanking exons to reduce the size of the search region if possible

Increase sensitivity of nucleotide searchesChange Program Selection to blastnChange Word size to 7Change Match/Mismatch Scores to +1, -1Change Gap Costs to Existence: 2, Extension: 1

Optimize alignment parameters based on expected levels of conservation

Derive alignment scores using information theory

Relative entropy of target and background frequencies

Match +2, Mismatch -3 optimized for 90% identity

Match +1, Mismatch -1 optimized for 75% identity

States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods 3:66-70.

DEMOblastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19

RNA PolII ChIP-Seq tracks (D. biarmipes only)

Show regions that are enriched in RNA Polymerase II compared to input DNA

Gene Models

RNA PolII Peaks

RNA PolII Enrichment

RNA-Seq

Use core promoter motifs to support TSS annotationsSome sequence motifs are enriched in the region (~300 bp) surrounding the TSS

Some motifs (e.g., Inr, TATA) are well-characterized

Other motifs are identified based on computational analysis

Presence of core promoter motifs can be used to support the TSS annotations

Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)

Absence of core promoter motifs is a negative result

Most D. melanogaster TSS do not contain the Inr motif

Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs

Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.

TATA box

Initiator (Inr)

Available under “Projects” “Annotation Resources” “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html

Core Promoter Motifs tracks

Show core promoter motif matches for each contig

Separated by strand

Visualize matches to different core promoter motifs

Use UCSC Table Browser (or other means) to export the list of motif matches within the search region

DEMOUse the Inr motif to determine the TSS position of Rad23

Using RNA PolII ChIP-Seq and RNA-Seq data to define the TSS search region

TSS search region

D. mel Transcripts

RNA PolII Peaks

RNA PolII Enrichment

RNA-Seq

TSS annotation for Rad23

TSS position: 28,936Conservation with D. melanogaster

blastn of initial exon and the D. mel Transcripts track

Location of the Inr motif

TSS search region: 28,716-28,936Enrichment of RNA PolII upstream of the TSS positionRNA-Seq read coverage upstream of the TSS positionSearch region defined by the extent of the RNA PolII peak

TSS annotation section in the GEP Annotation Report Form

TSS report formClassify the type of core promoterEvidence that supports or refutes the TSS annotationDistribution of core promoter motifs in the project and in D. melanogaster

Sample TSS report for onecut on the GEP web site

Under the “Beyond Annotation” section

D. biarmipes Muller F element TSS annotation projects

Reconcile coding region annotations submitted by GEP students

Reconciled annotations might be incorrectBased on older FlyBase releasePossible misannotations

See the “Revised gene models report form” section of the Transcription Start Sites Project Report

Additional TSS annotation resources

The D. melanogaster gene annotations are the primary source of evidence

Resources that might be useful if the D. melanogaster evidence is ambiguous

Whole genome alignments of 14 Drosophila speciesPhastCons and PhyloP conservation scores

Genome browsers for nine Drosophila speciesRNA Pol II ChIP-Seq (D. biarmipes only)

RNA-Seq coverage, TopHat junctions, assembled transcripts

Augustus and N-SCAN gene predictions

Conservation tracks on the D. melanogaster GEP UCSC Genome Browser

Whole genome alignments of 14 Drosophila speciesDrosophila Chain/Net composite track

Generate multiple sequence (Multiz) alignments from these pairwise alignments

Identify conserved regions from Multiz alignmentsPhastCons: identify conserved elementsPhyloP: measure level of selection at each nucleotide

Multiz alignment of 27 insect species available on the official UCSC Genome Browser

Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly

Use the conservation tracks to identify regions under selection

PhyloP scores:

Under negative selection

Fast-evolving

Examine the Multiz alignments to identify the orthologous TSS regions

Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS

April 2013 (BCM-HGSC/Dbia_2.0) assembly

Search for orthologous regions in D. elegans

Use RNA-Seq data to predict untranslated regions and putative TSS

TSS predictions available for 9 Drosophila species

N-SCAN+PASA-EST, Augustus, TransDecoder

D. mel Proteins

N-SCANAugustus

TransDecoder

RNA-Seq

DEMOGenome browsers for other Drosophila species

TSS annotation summary

Most of the D. melanogaster core promoters have multiple TSS

Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogasterDefine search regions that might contain TSS

Use multiple lines of evidence to infer the TSS region

Identify initial exonRNA-Seq coverage

blastn (change search parameters)

RNA PolII ChIP-Seq peaksDistribution of core promoter motifs (e.g., Inr)Maintain conservation compared to D. melanogaster

Questions?