bits - comparative genomics: the contra tool
DESCRIPTION
This is the last presentation of the BITS training on 'Comparative genomics'. It reviews the ConTra tool for detecting common transcription factor binding sites in sequences. Thanks to Stefan Broos of the DMBR department of VIB.
TRANSCRIPT
ConTra v2: a tool to identify transcription factor binding sites across species, update 2011
Stefan Broos
Prediction of functional regulatory units in noncoding regions
● Look for a consensus sequence in certain genomic regions
● Example: TATA box consensus sequence TATA(T/A)A(A/T)(A/G)
● >chr1:2375696723757090_hg18_1000_+
TTAGTACTTAATGGAGACGGGTGTCATCATATACACAAGTGTTTAAAAATCGTTTATTATGCAAAATGTTAACTTTTATAAAAAGTTTAATATACATCGCATTGTTACAGAAAGTCAC
● Problem: a consensus sequence does not take into account the nucleotide frequencies at each position
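A consensus search like the one on this slide can be sketched with a regular expression. This is only an illustration: the function name is hypothetical, only the forward strand is scanned, and the FASTA header is left out.

```python
import re

# Consensus from the slide, TATA(T/A)A(A/T)(A/G), written as character classes.
# The lookahead makes overlapping matches visible as well.
CONSENSUS = re.compile(r"(?=(TATA[TA]A[AT][AG]))")

# Example sequence from the slide (header line omitted).
SEQ = "TTAGTACTTAATGGAGACGGGTGTCATCATATACACAAGTGTTTAAAAATCGTTTATTATGCAAAATGTTAACTTTTATAAAAAGTTTAATATACATCGCATTGTTACAGAAAGTCAC"

def find_consensus(seq):
    """Return (0-based start, matched text) for every consensus hit."""
    return [(m.start(), m.group(1)) for m in CONSENSUS.finditer(seq.upper())]

hits = find_consensus(SEQ)
for start, site in hits:
    print(start, site)
```

Every base allowed at a degenerate position counts equally here, which is exactly the limitation named in the last bullet.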
Prediction of functional regulatory units in noncoding regions
● A more advanced (and the most popular) way to represent binding sites is the position weight matrix (PWM)
● 4 × L matrix, with L being the length of the binding site
● Each element of the matrix represents the frequency of a certain nucleotide (one of the 4 rows) at a given position of the binding site (one of the L columns)
Prediction of functional regulatory units in noncoding regions
● Example of the position weight matrix of the TATA box:
A [  61  16 352   3 354 268 360 222 155  56  83  82  82  68  77 ]
C [ 145  46   0  10   0   0   3   2  44 135 147 127 118 107 101 ]
G [ 152  18   2   2   5   0  20  44 157 150 128 128 128 139 140 ]
T [  31 309  35 374  30 121   6 121  33  48  31  52  61  75  71 ]
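As a rough sketch of how such a count matrix is used, the counts can be turned into log-odds weights and a candidate site scored by summing one weight per position. The pseudocount of 1 and the uniform background of 0.25 are assumptions made for this example, not values taken from the slides.

```python
import math

# Count matrix copied from the slide (rows A, C, G, T; 15 positions).
COUNTS = {
    "A": [61, 16, 352, 3, 354, 268, 360, 222, 155, 56, 83, 82, 82, 68, 77],
    "C": [145, 46, 0, 10, 0, 0, 3, 2, 44, 135, 147, 127, 118, 107, 101],
    "G": [152, 18, 2, 2, 5, 0, 20, 44, 157, 150, 128, 128, 128, 139, 140],
    "T": [31, 309, 35, 374, 30, 121, 6, 121, 33, 48, 31, 52, 61, 75, 71],
}

def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2(frequency / background) per base and position."""
    width = len(counts["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            freq = (counts[b][i] + pseudocount) / total
            pwm[b].append(math.log2(freq / background))
    return pwm

def score(pwm, site):
    """Sum the weights of the bases observed at each position of the site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

pwm = log_odds(COUNTS)
# Most frequent base in each column, i.e. the best-scoring site for this matrix.
best_site = "".join(max("ACGT", key=lambda b: COUNTS[b][i]) for i in range(15))
print(best_site, round(score(pwm, best_site), 2))
```

The pseudocount keeps the zero counts in the matrix from producing a logarithm of zero.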
Prediction of functional regulatory units in noncoding regions
● PWMs provide a more natural way to represent and search for binding sites
● Problem: motifs tend to be short and degenerate, and dependencies between positions are not taken into account
● Although this is the most popular method, most of the predicted sites are false positives with no known in vivo function (cf. the futility theorem)
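The false positive problem can be illustrated with a back-of-the-envelope calculation: how often would the short TATA box consensus TATA(T/A)A(A/T)(A/G) match by chance? The sketch below assumes a random sequence with independent, equally frequent bases, which real genomes are not, so the numbers are only indicative.

```python
from fractions import Fraction

# Number of allowed bases at each position of TATA(T/A)A(A/T)(A/G).
ALLOWED = [1, 1, 1, 1, 2, 1, 2, 2]

def match_probability(allowed, alphabet_size=4):
    """Chance that a random window matches the consensus (uniform bases assumed)."""
    p = Fraction(1)
    for n in allowed:
        p *= Fraction(n, alphabet_size)
    return p

def expected_chance_hits(seq_len, allowed):
    """Expected number of matching windows on one strand of a random sequence."""
    windows = max(seq_len - len(allowed) + 1, 0)
    return windows * match_probability(allowed)

print(match_probability(ALLOWED))                   # 1/8192
print(float(expected_chance_hits(1_000_000, ALLOWED)))
```

Under these assumptions a single strand of 1 Mb already yields roughly 122 chance matches, which is why extra evidence such as conservation is needed.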
Prediction of functional regulatory units in noncoding regions
● Solutions:
– Use information of flanking sequences
– Use more complex models (HMMs and biophysical models)
– Use sequence conservation across species (if a site is conserved across species, there is a higher probability the site is functional)
– ...
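The conservation idea can be sketched as a filter over a multiple alignment: keep a predicted site in the reference species only if the same alignment columns also match in enough other species. The toy alignment, the species labels and the use of a consensus regex instead of a PWM are all assumptions made to keep the example short.

```python
import re

MOTIF = re.compile(r"TATA[TA]A[AT][AG]")
WIDTH = 8  # length of the consensus

# Made-up alignment: equal-length rows, '-' marks a gap.
ALIGNMENT = {
    "human": "CCTATAAAAGCCGGTATATAAGTT",
    "mouse": "CCTATAAAAGCTGGTACATAAGTT",
    "dog":   "CCTATATAAGC-GGTGTATAAGTT",
}

def conserved_hits(alignment, reference, min_species=2):
    """Return (alignment column, supporting species) for conserved sites.

    A window that contains a gap never matches, so gapped sites are skipped.
    The reference itself counts as one of the supporting species.
    """
    ref = alignment[reference]
    hits = []
    for col in range(len(ref) - WIDTH + 1):
        if not MOTIF.fullmatch(ref[col:col + WIDTH]):
            continue
        supporting = [name for name, seq in alignment.items()
                      if MOTIF.fullmatch(seq[col:col + WIDTH])]
        if len(supporting) >= min_species:
            hits.append((col, supporting))
    return hits

print(conserved_hits(ALIGNMENT, "human", min_species=1))  # all human hits
print(conserved_hits(ALIGNMENT, "human", min_species=3))  # conserved in all three
```

In this toy case the human sequence has two hits, but only the first is supported by the other species.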
What is ConTra?
● A tool to visualize predicted and conserved transcription factor binding sites in a region of interest
● A tool to explore the regulatory potential of a set of binding sites in a region of interest
● Focus on ease of use
● Free access to the latest and most up-to-date versions of the TRANSFAC and JASPAR PWM libraries
What is ConTra?
First version of ConTra
● Published in 2008 by Hooghe, Hulpiau et al.
● Popular tool, cited 23 times
● Had some limitations
ConTra update
What is new?
● Update of PWM libraries
● More reference species were added
What is new?
● Users are no longer restricted to the promoter region. One can search for binding sites in 5' UTR, 3' UTR, promoter and intron regions
● Users can upload their own matrices (it is as simple as uploading a multi-FASTA file!)
● Users can upload a custom alignment
● Noncoding genes are no longer excluded from the analysis
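To show how a multi-FASTA file of binding sites relates to a matrix, the sketch below counts bases per position over a set of equal-length sites. The site sequences are made up, and the exact file layout ConTra expects is not described in these slides, so treat this as a general illustration rather than the tool's own procedure.

```python
# Hypothetical multi-FASTA content: one known binding site per record.
EXAMPLE_FASTA = """\
>site1
TATAAAAG
>site2
TATATAAG
>site3
TATAAATA
>site4
TATATATA
"""

def parse_fasta(text):
    """Return a list of (name, sequence) tuples from multi-FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def count_matrix(sites):
    """Count A, C, G and T at each position; sites must share one length."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1  # raises KeyError on non-ACGT symbols
    return counts

sites = [seq for _, seq in parse_fasta(EXAMPLE_FASTA)]
matrix = count_matrix(sites)
for base in "ACGT":
    print(base, matrix[base])
```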
PWM libraries
● TRANSFAC version 2010.04
● JASPAR update 2010
● PhyloFACTS 2010
● All protein binding microarrays from Berger et al., Cell, 2008
● These PWM libraries are used in combination with the Match scan tool
Alignments in ConTra
● Alignments generated using MULTIZ
● Downloaded from the UCSC Genome Browser
How does it work?
● The analysis consists of a four-step process
Step 1
– Select type of analysis: visualization or exploration
– Select species
– Select gene of interest using the gene name or symbol, Ensembl gene ID (ENSG), Entrez Gene ID, RefSeq (NM_|NR_) or Ensembl transcript ID (ENST)
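The identifier types accepted in Step 1 can be told apart by their prefixes. The sketch below is a hypothetical helper, not ConTra's own code; it only covers the prefixes named on the slide (ENSG, ENST, NM_, NR_) and treats a purely numeric term as an Entrez Gene ID.

```python
import re

# Patterns are tried in order; an optional ".N" version suffix is tolerated.
PATTERNS = [
    ("Ensembl gene ID", re.compile(r"ENSG\d+(\.\d+)?")),
    ("Ensembl transcript ID", re.compile(r"ENST\d+(\.\d+)?")),
    ("RefSeq ID", re.compile(r"N[MR]_\d+(\.\d+)?")),
    ("Entrez Gene ID", re.compile(r"\d+")),
]

def classify_query(term):
    """Guess which kind of identifier a search term is."""
    term = term.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(term):
            return label
    return "gene name or symbol"

for query in ["ENSG00000141510", "NM_000546", "7157", "TP53"]:
    print(query, "->", classify_query(query))
```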
How does it work?
● The analysis consists of a four-step process
Step 2
– All possible matches with your search term are listed. The search term is highlighted
– Select 1 transcript of your gene of interest
How does it work?
● The analysis consists of a four-step process
Step 3
– Select a genomic region of interest (promoter, 5' UTR, 3' UTR, intronic regions)
How does it work?
● The analysis consists of a four-step process
Step 4
– Select up to 20 PWMs from the TRANSFAC library, JASPAR library, PhyloFACTS or PBM
– Select a cutoff (a stricter cutoff minimizes false positive predictions, a more lenient one minimizes false negative predictions)
– Run ConTra ...
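The effect of the cutoff can be sketched by scanning the TATA box example sequence with the TATA box matrix shown earlier and counting hits at different thresholds. The relative score used here, rescaled between the worst and best possible site, is a common simplification and an assumption of this example; it is not the scoring scheme of the scan tool ConTra uses. Only the forward strand is scanned.

```python
import math

# TATA box count matrix and example sequence from the earlier slides.
COUNTS = {
    "A": [61, 16, 352, 3, 354, 268, 360, 222, 155, 56, 83, 82, 82, 68, 77],
    "C": [145, 46, 0, 10, 0, 0, 3, 2, 44, 135, 147, 127, 118, 107, 101],
    "G": [152, 18, 2, 2, 5, 0, 20, 44, 157, 150, 128, 128, 128, 139, 140],
    "T": [31, 309, 35, 374, 30, 121, 6, 121, 33, 48, 31, 52, 61, 75, 71],
}
SEQ = "TTAGTACTTAATGGAGACGGGTGTCATCATATACACAAGTGTTTAAAAATCGTTTATTATGCAAAATGTTAACTTTTATAAAAAGTTTAATATACATCGCATTGTTACAGAAAGTCAC"

def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Log-odds weights; pseudocount and background are assumed values."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
              for i in range(width)]
    return {b: [math.log2((counts[b][i] + pseudocount) / totals[i] / background)
                for i in range(width)] for b in "ACGT"}

def relative_scores(pwm, seq):
    """Score every window, rescaled so 0 is the worst and 1 the best site."""
    width = len(pwm["A"])
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    scores = []
    for start in range(len(seq) - width + 1):
        raw = sum(pwm[seq[start + i]][i] for i in range(width))
        scores.append((start, (raw - worst) / (best - worst)))
    return scores

def hits_at(scores, cutoff):
    """Keep the windows whose relative score reaches the cutoff."""
    return [(start, rel) for start, rel in scores if rel >= cutoff]

scores = relative_scores(build_pwm(COUNTS), SEQ)
for cutoff in (0.6, 0.7, 0.8, 0.9):
    print(cutoff, len(hits_at(scores, cutoff)))
```

Raising the cutoff can only shrink the hit list, so the choice is between missing weaker true sites and reporting more chance matches.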
Who should use it and where to find it?
● You!
● To get an indication of how your gene is regulated
● To create publication-ready graphics
● To get a quick and easy visualization of some transcription factor binding sites
● http://bioit.dmbr.ugent.be/contrav2/index.php
Questions & Examples
● Analyse gene of interest
● Explore gene of interest
● Download and upload own alignment
● Make your own PWM
● Make beautiful publication graphics using ConTra and Jalview