Computational Analysis of Tissue Specificity: Decoding Promoters Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania Nov. 17, 2004 Department of Physiology Seminar Series University of Kentucky

Computational Analysis of Tissue Specificity: Decoding Promoters

Chris Stoeckert, Ph.D.

Center for Bioinformatics & Dept. of Genetics

University of Pennsylvania

Nov. 17, 2004

Department of Physiology Seminar Series

University of Kentucky

What is the code for determining where (and when) a gene is expressed?

http://molbio.info.nih.gov/molbio/gcode.html

[Figure: promoter diagram with binding sites TFBS1, TFBS2, TFBS3, and TFBS4 in different combinations, labeled "Expression"]

TFBS = transcription factor binding site

Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression

From Wasserman & Sandelin, NRG 2004

A Genomics Unified Schema approach to understanding gene expression

Dave Barkan, Jonathan Crabtree, Shailesh Date, Steve Fischer, Bindu Gajria, Thomas Gan, Greg Grant, Hongxian He, John Iodice, Li Li, Junmin Liu, Matt Mailman, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel

Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics

Stem Cell Gene Anatomy Project

Beta Cell Biology Consortium

Plasmodium Genome Resource

Allgenes (human and mouse DoTS)

GUS

GUS

Schemas: Core, SRES, TESS, RAD, DoTS

Oracle RDBMS

Object Layer for Data Loading

Java Servlets

GUS is an open source project

Sites: Sanger Institute; U. Georgia; Flora Centromere Database; U. Chicago; U. Penn; U. Toronto; Phytophthora sojae genome, Virginia Bioinformatics Institute

GUS (Genomics Unified Schema)

http://www.gusdb.org

Namespace | Domain | Features
Core | Data Provenance | Documentation
Sres | Shared Resources | Ontologies
DoTS | Sequence and annotation | EST clusters and gene models
RAD | Gene Expression | MIAME/MAGE-OM
TESS | Gene Regulation | TFBS organization

RAD EST clustering and assembly

DoTS

Genomic alignment and comparative sequence analysis

Identify shared TF binding sites

TESS

BioMaterial annotation SRES

DoTS integrates sequence annotation including where expressed

kidney, mammary gland, brain, liver, colon, lung, retina, spinal cord, rhabdomyosarcoma cell line

brain, liver, kidney, lung, melanocyte

embryo, fetus, kidney, limb, retina, salivary gland

brain, rhabdomyosarcoma cell line, kidney

Sorbs1: sorbin and SH3 domain containing 1
- GO molecular function: actin binding and protein kinase binding
- GO cellular component: actin cytoskeletal stress fibers

RAD Contains Detailed Expression Experiments Including Tissue Surveys

TESS Allows You to Find Potential TFBS

But there are too many potential sites!

Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy

Jonathan Schug (1), Winfried-Paul Schuller (2), Claudia Kappen (2), J. Michael Salbaum (2), Maja Bucan (3), Christian J. Stoeckert Jr. (1)

1. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA

2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, 68198, USA

3. Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA

What is a Liver-Specific Gene?

*http://expression.gnf.org/

Assessing Tissue Specificity of Genes Using Shannon Entropy

Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to lg T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue-specificity.

To measure specificity to a particular tissue, we combine entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 lg T for every tissue when expression is uniform.

(a) Very specific liver expression: H=1.6 and Qliver = 2.2, 98612_at cytochrome p450

(b) Near uniform expression : H=4.3 and Qliver=10.2, 104391_s_at Clcn7 chloride channel 7
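The two scores can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk. It assumes that relative expression is each tissue's share of the gene's total signal and that Q for a tissue is H minus lg of that share. That form reproduces the boundary values quoted above (Q = 0 for single-tissue expression, Q = 2 lg T for uniform expression). The function names are hypothetical.

```python
import math


def relative_expression(levels):
    """Convert raw expression levels to fractions that sum to 1."""
    total = sum(levels)
    return [x / total for x in levels]


def shannon_entropy(levels):
    """H in bits: 0 for one tissue, lg T for uniform expression over T tissues."""
    p = relative_expression(levels)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def q_specificity(levels, tissue_index):
    """Assumed form Q = H - lg(p_t); small Q means specific to that tissue."""
    p_t = relative_expression(levels)[tissue_index]
    if p_t == 0:
        return math.inf  # gene not expressed in this tissue at all
    return shannon_entropy(levels) - math.log2(p_t)


uniform = [10.0] * 8              # same level in T = 8 tissues
single = [0.0] * 7 + [25.0]       # expressed in the last tissue only
print(shannon_entropy(uniform), q_specificity(uniform, 0))
print(shannon_entropy(single), q_specificity(single, 7))
```

With 8 tissues the uniform gene has H = 3 and Q = 6 in every tissue, while the single-tissue gene has H = 0 and Q = 0 in its one tissue.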

Agreement between Microarrays and ESTs on Tissue Specificity

Specificity Characteristics of Tissues

Tissue | Probe Set ID | H | Q | RefSeq | Description
Amygdala | 96055_at | 3.2 | 5.8 | NM_031161 | cholecystokinin
Amygdala | 93178_at | 2.7 | 5.8 | NM_019867 | neuronal guanine nucleotide exchange factor
Amygdala | 93273_at | 3.7 | 5.8 | NM_009221 | synuclein, alpha
Amygdala | 92943_at | 3.5 | 6.0 | NM_008165 | glutamate receptor, ionotropic, AMPA1 (alpha 1)
Amygdala | 95436_at | 3.3 | 6.1 | NM_009215 | somatostatin
Lymph Node | 98406_at | 2.7 | 4.0 | NM_013653 | chemokine (C-C motif) ligand 5
Lymph Node | 98063_at | 1.6 | 4.1 | - | glycosylation dependent cell adhesion molecule 1
Lymph Node | 99446_at | 2.5 | 4.1 | NM_007641 | membrane-spanning 4-domains, subfamily A, member 1
Lymph Node | 92741_g_at | 3.3 | 4.5 | - | immunoglobulin heavy chain 4 (serum IgG1)
Lymph Node | 102940_at | 2.8 | 4.6 | NM_008518 | lymphotoxin B
Liver | 94777_at | 1.3 | 2.1 | - | albumin 1
Liver | 101287_s_at | 1.6 | 2.2 | NM_010005 | cytochrome P450, 2d10
Liver | 99269_g_at | 1.5 | 2.2 | NM_019911 | tryptophan 2,3-dioxygenase
Liver | 100329_at | 1.4 | 2.3 | NM_009246 | serine protease inhibitor 1-4
Liver | 94318_at | 1.6 | 2.3 | NM_013475 | apolipoprotein H

CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression

[Figure: fraction of promoters with a CpG island (y-axis, 0 to 0.9) versus entropy (x-axis, 0 to 6), plotted for human and mouse]

CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >=0.5
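The CpG island criteria above can be checked on a candidate region with a short Python sketch. The thresholds are the ones on the slide. The observed/expected formula used here (CpG count times length, divided by C count times G count) is the commonly used definition and is an assumption, since the slide does not spell it out. The function names are hypothetical, and scanning a promoter with a sliding window is left out.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")  # CpG dinucleotides on the given strand
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed obs/exp definition: (#CpG * N) / (#C * #G)
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Apply the slide's criteria to one candidate region."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp >= min_obs_exp


print(is_cpg_island("CG" * 100))   # long, GC-rich, CpG-dense
print(is_cpg_island("AT" * 100))   # long but no C or G
```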

Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions

[Figure: base composition panels for CpG+ and CpG- promoters, split into Multi-Tissue (H >= 4.4) and Tissue-Specific (H <= 3.5) genes]

TATA Boxes are Associated with Tissue-Specific Genes

[Figure: % with TATAA box (y-axis, 0 to 90) by Q-value bin (x-axis: 0-2, 2-4, 4-6, 6-8, 8-10, >10), plotted for human and mouse]

Genes with TATAA box: human 18.8%, mouse 22.9%

p-values shown on the plot: p h = 0.13, p m = 0.15; p h = 0.00007, p m = 0.00087; p h = 0.00005, p m = 0.00001

Bar labels as extracted: (7/9), (8/9), (4/8), (8/28), (16/80), (3/8), (10/28), (16/80), (4/31), (2/27), (9/35), (3/27)

Functional relationships of promoter classes based on over-represented GO terms (EASE)

Column headings on the slide: Cellular Component, Biological Process, Human Only, Mouse Only

CGI-/TATA+
Cellular Component: extracellular, extracellular space; microsome, vesicular fraction; intermediate filament (cytoskeleton)
Biological Process: response to stimulus; organismal physiological process, inflammatory response, innate immune response, cell motility, defense response, response to pest/pathogen/parasite, response to wounding, response to biotic stimulus, cell-cell signaling, morphogenesis, digestion, muscle contraction; chemotaxis, taxis, response to chemical substance, response to abiotic stimulus, muscle development

CGI+/TATA-
Cellular Component: cell, cytoplasm, intracellular, mitochondrion; nucleus, ribonucleoprotein complex
Biological Process: nucleobase, nucleoside, nucleotide and nucleic acid metabolism, intracellular transport, metabolism, protein transport, intracellular protein transport, RNA processing, RNA metabolism, cell cycle, mitotic cell cycle

CGI-/TATA-
Cellular Component: (integral to) (plasma) membrane; extracellular, extracellular space
Biological Process: organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus; response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction; complement activation, complement activation (classical pathway), humoral defense mechanism (sensu Vertebrata), humoral immune response

First Clues: TATA Box indicates Tissue Specific; CpG indicates Wide Spread Expression

Additional clues: CpG-/TATA+ indicates high expression and secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

Pattern Analysis of Pancreas Gene Promoters

Guang (Gary) Chen, Jonathan Schug

Identifying TFBMs: Method Pipeline

[Figure: pipeline flowchart with boxes GNF Gene Expression Atlas; Shannon Entropy; Gene Lists with Tissue Specificity; DBTSS; Sequences around Transcription Start Sites; Teiresias; Patterns; Pattern Clustering; Pattern Clusters (PWM); Represent Seqs with PWMs; Gene Clusters; Gene Ontology (GO); GO Category Analysis; Comparative Genome Analysis; Tissue Specific Regulatory Modules Associated with GO Biological Process]

Starting with a gene expression tissue survey, pancreas-specific genes with common TFBS and biological processes are identified.

Methods & Resources (Cont.)

– DBTSS: Database of Transcriptional Start Sites

• Based on 400,225 human and 580,209 mouse full-length cDNA sequences, DBTSS contains the genomic positions of the transcriptional start sites and the adjacent promoters for 8,793 human and 6,875 mouse genes. http://dbtss.hgc.jp/

Yutaka Suzuki, Riu Yamashita, Kenta Nakai and Sumio Sugano (2002). DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res. 30: 328-331.

– Pancreas genes are chosen based on efforts to understand pancreatic development and function (EPConDB)

• 500bp upstream for preliminary study

• 159 human (mouse) pancreas-specific genes (Qislet < 7, positive (p)) & 159 human (mouse) ubiquitous genes (Qislet > 10, negative (n))

– This approach can be applied to any tissue to study the tissue specificity of transcription factor binding motifs (TFBMs) & Modules

Method - Pattern Discovery - Teiresias

Teiresias Patterns

• A Teiresias pattern P is an <L,W> pattern (with L ≤ W) if P contains at least L residues and every subpattern of P containing exactly L residues is at most W symbols long.

Pattern | <L,W>
ACTGGC | <L=?, W=6>, <L=?, W=4>, <L=6, W=6>
A.C.GT | <L=?, W=6>, <L=?, W=4>, <L=4, W=6>

*Rigoutsos, I. and A. Floratos, Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics, 14(1), January 1998.
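The <L,W> condition can be tested directly, as in the Python sketch below. It only checks whether a given pattern satisfies the definition; it does not implement the TEIRESIAS discovery algorithm. It assumes that "." is the wildcard symbol and that the pattern starts and ends with a residue. The function name is hypothetical.

```python
def is_lw_pattern(pattern, L, W):
    """True if the pattern has at least L residues and every run of
    L consecutive residues (wildcards in between allowed) spans <= W symbols."""
    # Positions of residues, i.e. every symbol that is not the '.' wildcard
    residues = [i for i, ch in enumerate(pattern) if ch != "."]
    if len(residues) < L:
        return False
    return all(
        residues[k + L - 1] - residues[k] + 1 <= W
        for k in range(len(residues) - L + 1)
    )


print(is_lw_pattern("ACTGGC", 6, 6))   # six residues in six symbols
print(is_lw_pattern("A.C.GT", 4, 6))   # four residues in six symbols
print(is_lw_pattern("A.C.GT", 4, 4))   # four residues do not fit in four symbols
```

The first two calls agree with the <L=6, W=6> and <L=4, W=6> entries on the slide, and the third shows why the wildcard pattern cannot keep L = 4 when W shrinks to 4.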

Identifying TFBMs - Pattern Distribution

With 117 human pancreas specific genes (Qpancreas <6.5, positive(p)) and 117 human ubiquitous genes (Qpancreas >10, negative (n)), roughly 90,000 patterns were discovered in the 1kb+/200bp- promoter region. Patterns with ∆p-n >20 (in blue box) are more likely to be pancreas specific

Each point represents a pattern with occurrence in positive data set (y-axis) and negative data set (x-axis)

For each pattern (x-axis), the occurrence difference ∆p-n (y-axis) between positive (Q<6.5) and negative (Q>10) data set
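The occurrence difference used to pick candidate patterns can be sketched as follows. This is an illustration under stated assumptions: "occurrence" is taken to mean the number of sequences in a set that contain the pattern at least once, and a Teiresias "." wildcard is treated as a regular-expression ".". The function names and toy sequences are hypothetical. Only the threshold of 20 comes from the slide.

```python
import re


def support(pattern, sequences):
    """Number of sequences containing at least one match of the pattern."""
    rx = re.compile(pattern)  # '.' matches any single base
    return sum(1 for s in sequences if rx.search(s))


def delta_p_n(pattern, positives, negatives):
    """Occurrence in the positive set minus occurrence in the negative set."""
    return support(pattern, positives) - support(pattern, negatives)


def select_patterns(patterns, positives, negatives, threshold=20):
    """Keep patterns whose occurrence difference exceeds the threshold."""
    return [p for p in patterns if delta_p_n(p, positives, negatives) > threshold]


# Toy data, far smaller than the 117 + 117 promoters described above
positives = ["ACGTAC", "AAGTTT", "CCCCCC"]
negatives = ["TTTTTT", "ACGGGG"]
print(delta_p_n("A.GT", positives, negatives))
print(select_patterns(["A.GT", "CC"], positives, negatives, threshold=1))
```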

Method - Pattern Clustering

[Figure: pattern clustering flowchart with boxes Patterns; Smith-Waterman; Distance of pattern pair; Hierarchical; K-Median; Num of Cluster; Pattern Clusters (PWM)]

Results - Pattern Clustering

Clustering Results (human, ∆p-n>20, 72 patterns)

Identifying TFBMs

72 patterns (human, ∆p-n > 20) were clustered into 18 pattern clusters, and 6 of these were identified as known binding sites by searching TRANSFAC.

Identified known binding sites associated with human pancreas genes

AP2ALPHA, MEF2, SRY, NKX62, CAP_01, HOXA3

Identifying TFBM

Comparative genomic analysis shows that some of the discovered TFBMs are conserved between human and mouse pancreas orthologs.

[Figure: conserved sites shown for AP2ALPHA, MEF2, NKX62, CAP_01, and HOXA3]

Gene Clustering - Based on TFBMs

Pancreas-specific genes can be clustered according to the presence or absence of conserved promoter motifs.

Upstream sequences can be characterized by pattern occurrences, which can then be used to calculate pairwise similarities between sequences. For simplicity, we used a boolean model that records only whether each of 7 conserved patterns appears. Centered Pearson correlation was used to calculate similarity, and the 117 pancreas-specific genes (Q < 6.5) were grouped into 10 clusters by hierarchical clustering.

Cluster 6

Pattern columns on the slide: AP2ALPHA, MEF2, NKX62, HOXA3, CCTGTT, CTGCTC, CAP

Refseq | Locus | Name | Description
NM_001504 | 2833 | CXCR3 | chemokine (C-X-C motif) receptor 3
NM_000380 | 7507 | XPA | xeroderma pigmentosum, complementation group A
NM_000278 | 5076 | PAX2 | paired box gene 2
NM_003987 | 5076 | |
NM_003988 | 5076 | |
NM_003989 | 5076 | |
NM_003990 | 5076 | |
NM_002728 | 5553 | PRG2 | proteoglycan 2 (natural killer cell activator)
NM_013230 | 934 | CD24 | CD24 antigen (small cell lung carcinoma cluster 4 antigen)

Gene Clustering – GO Category

Assign Gene Clusters to GO Category

To interpret the clustering results, we used EASE to find the significant biological features of a gene cluster of interest through its GO Biological Process annotations.

cluster GO Biological Process Gene Name Descriptions

c2 Digestion AMY1A amylase, alpha 1A; salivary

CEL carboxyl ester lipase, bile salt-dependent lipase, cholesterol esterase; fetoacinar pancreatic protein

CLPS colipase, pancreatic

CTRB1 chymotrypsinogen B1

CTRL chymotrypsin-like

c4 catabolism CPA2 carboxypeptidase A2 (pancreatic)

DHPS deoxyhypusine synthase

MEPA1 meprin A, alpha (PABA peptide hydrolase)

CPB1 carboxypeptidase B1

ELA3A elastase 3A, pancreatic (protease E)

ELA2A pancreatic elastase IIA

c6 response to stimulus CD24 CD24 antigen (small cell lung carcinoma cluster 4 antigen)

CXCR3 chemokine (C-X-C motif) receptor 3

PAX2 paired box gene 2

PRG2 proteoglycan 2, bone marrow (natural killer cell activator, eosinophil granule major basic protein)

XPA xeroderma pigmentosum, complementation group A

c8 phosphorus metabolism PDGFRA platelet-derived growth factor receptor, alpha polypeptide

PRDX4 thioredoxin peroxidase; thioredoxin peroxidase

PTP4A3 protein tyrosine phosphatase type IVA, member 3

c9 Transport SLC12A3 solute carrier family 12

CACNA1E calcium channel, voltage-dependent, alpha 1E subunit

TCOF1 Treacher Collins-Franceschetti syndrome 1

SLC35A3 solute carrier family 35 (UDP-N-acetylglucosamine (UDP-GlcNAc)

More Clues: Known and novel TFBS found associated with genes expressed in the pancreas

See conservation of sites between human and mouse

Associated with digestion, catabolism, and response to stimulus GO biological processes

Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores

Elisabetta Manduchi, Jonathan Schug

If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes?

[Figure: diagram relating Tissue, Biological Process, and Genes]

For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q.

• To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set.

• The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
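The relationship between the ES and a Kolmogorov-Smirnov statistic can be illustrated with a short Python sketch. It assumes the running-sum form commonly attributed to Mootha et al. (2003): walking down the ranked list, add sqrt((N-G)/G) for a gene in the set and subtract sqrt(G/(N-G)) otherwise, then take the maximum. The exact ES used for these slides is not given, so treat this as one plausible instance. The function names are hypothetical.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum over a list ranked by increasing Q."""
    n = len(ranked_genes)
    g = sum(1 for x in ranked_genes if x in gene_set)
    if g == 0 or g == n:
        return 0.0
    up, down = math.sqrt((n - g) / g), math.sqrt(g / (n - g))
    running = best = 0.0
    for x in ranked_genes:
        running += up if x in gene_set else -down
        best = max(best, running)
    return best


def ks_statistic(ranked_genes, gene_set):
    """One-sided KS statistic: max gap between the cumulative fraction
    of set members and of non-members along the ranked list."""
    n = len(ranked_genes)
    g = sum(1 for x in ranked_genes if x in gene_set)
    if g == 0 or g == n:
        return 0.0
    hits = misses = 0
    best = 0.0
    for x in ranked_genes:
        if x in gene_set:
            hits += 1
        else:
            misses += 1
        best = max(best, hits / g - misses / (n - g))
    return best


ranked = list("abcdefgh")          # most tissue-specific gene first
gene_set = {"a", "b"}              # a GO BP whose genes rank at the top
n, g = len(ranked), len(gene_set)
print(enrichment_score(ranked, gene_set))
print(math.sqrt(g * (n - g)) * ks_statistic(ranked, gene_set))
```

Under this assumed form the two printed values agree, because the running sum equals sqrt(G(N-G)) times the difference of the two cumulative fractions. That factor is the multiplicative constant.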

Application to a Human Tissue Survey

• The following results refer to the application of the methods described above to the GeneNote tissue survey:

– 12 tissues in duplicate on the HGU95 Affymetrix chip set (Av2, B-E).

• We looked at the 2316 GO BPs that we could map to probe sets (using version 1.5.1 of the Bioconductor GO and hgu95XXX metadata R packages).

GO BPs having significantly specific profiles for each tissue can be identified

[Figure: example GO BP profiles, one significant in liver and one significant in heart and skeletal muscle]

Excerpt of cluster of GO BPs based on their tissue-specificity profiles (up in spinal cord/brain)

Focusing on steroid metabolism

A. After mapping probe sets to RefSeqs and retrieving from DBTSS their upstream sequences, we assembled a set of 63 promoter sequences, which was our positive set.

B. We generated 5 negative sets, each consisting of 315 sequences, by randomly scrambling each of the positive set sequences.

C. We ranked each of 666 Transcription Factor Binding Sites (TFBSs) from TRANSFAC -represented by position matrices - in terms of their ability (measured by average ROC area) in discriminating between the positive set and the negative sets.

D. We then selected high ranking TFBSs from (C) and high ranking TFBSs from an independent study focusing on liver specificity and formed all possible pairs between these two sets.

E. These pairs were ranked according to their discriminative ability and on the basis of the distance between their components in the positive hits. Optimal parameters (distance and individual TFBS match scores) were selected for each pair scoring at the top.

F. By assessing the performance over a test set composed of mouse promoter sequences, we found 2 candidate CRMs (involving 3 and, respectively, 4 TFBSs) with an over-representation of steroid metabolism genes.
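Steps B and C above can be sketched in Python. This is an illustration under assumptions: each sequence is reduced to one score per TFBS (for example its best position-matrix match, which is not computed here), scrambling is a random shuffle of the bases, and ROC area is computed as the probability that a positive sequence outscores a negative one, with ties counted as one half. The function names are hypothetical.

```python
import random


def scramble(seq, rng):
    """Shuffle the bases of a sequence, preserving its composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def roc_area(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def average_roc_area(pos_scores, negative_score_sets):
    """Mean ROC area of one TFBS over several negative sets."""
    areas = [roc_area(pos_scores, neg) for neg in negative_score_sets]
    return sum(areas) / len(areas)


rng = random.Random(0)                       # fixed seed for repeatability
print(scramble("ACGTACGTTT", rng))           # one scrambled negative sequence
print(average_roc_area([3.0, 4.0], [[1.0, 2.0], [3.0, 4.0]]))
```

Ranking the TFBSs then amounts to sorting them by `average_roc_area`, highest first.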

Example of production hits to steroid metabolism mouse promoter sequences

No. mouse promoter sequences: 6875. Of these 50 belong to genes mapping to steroid metabolism.

No. production hits: 257. Of these 8 belong to genes mapping to steroid metabolism.
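The counts above can be turned into an expected value and a tail probability with a small Python sketch. The hypergeometric test is our choice of illustration and an assumption; the slides do not say which test, if any, was applied. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favorable / total


population = 6875      # mouse promoter sequences
successes = 50         # of these, steroid metabolism genes
draws = 257            # production hits
observed = 8           # hits that are steroid metabolism genes

expected = draws * successes / population
print(round(expected, 2))                         # hits expected by chance
print(round(observed / expected, 1))              # fold over-representation
print(hypergeom_tail(population, successes, draws, observed))
```

By chance alone one would expect about 1.87 steroid metabolism genes among 257 hits, so the 8 observed are roughly 4.3 times that.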

Production

TFBSs: {FOXD3_01, GKLF_01, HFH1_01, MADSA_Q2}

Parameters:
max distance = 130
FOXD3_01 min score = 9.934705
GKLF_01 min score = 10.815614
HFH1_01 min score = 9.442617
MADSA_Q2 min score = 8.246301

[Figure: hits drawn relative to the TSS; green = forward strand, red = reverse strand, shading indicates strength]
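How such parameters could define a hit is sketched below in Python. This is a guess at the logic, not the actual production grammar: it assumes a hit requires every TFBS to have a match at or above its minimum score, and that "max distance" bounds the span between the first and last site. The match positions and scores in the example are invented. Only the TFBS names, minimum scores, and maximum distance come from the slide.

```python
from itertools import product


def find_module_hits(matches, min_scores, max_distance):
    """matches: list of (tfbs_name, position, score) for one promoter.
    Returns one dict per combination of qualifying sites within range."""
    qualifying = {name: [] for name in min_scores}
    for name, position, score in matches:
        if name in min_scores and score >= min_scores[name]:
            qualifying[name].append(position)
    names = list(min_scores)
    hits = []
    # product() yields nothing if any TFBS has no qualifying site
    for combo in product(*(qualifying[name] for name in names)):
        if max(combo) - min(combo) <= max_distance:
            hits.append(dict(zip(names, combo)))
    return hits


min_scores = {
    "FOXD3_01": 9.934705,
    "GKLF_01": 10.815614,
    "HFH1_01": 9.442617,
    "MADSA_Q2": 8.246301,
}
matches = [  # invented positions and scores for illustration only
    ("FOXD3_01", 100, 10.2),
    ("GKLF_01", 150, 11.0),
    ("HFH1_01", 180, 9.5),
    ("MADSA_Q2", 220, 8.3),
    ("MADSA_Q2", 400, 9.0),
]
print(find_module_hits(matches, min_scores, max_distance=130))
```

Here only the MADSA_Q2 site at position 220 completes a module, because the one at 400 would stretch the span beyond 130.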

More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues

Identified a candidate CRM for steroid metabolism.

Summary

• GUS is a functional genomics database system used by a growing number of sites for genome and expression projects.

• Using expression data in GUS and entropy-based metrics, we can rank genes according to their tissue-specificity, learn promoter properties, and associate functional roles.

• In addition to general properties of tissue-specific promoters, we are beginning to identify combinations of motifs (i.e., regulatory modules) associated with expression in specific tissues.

Future Directions

• Refine analysis from genes to transcripts

• Refine analysis from organs to cells

• Apply approach to splicing

• Apply approach to developmental stage and differentiation state

Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".

http://www.cbil.upenn.edu