Proceedings of the 28th Annual Hawaii International Conference on System Sciences (HICSS '95), 1995. 1060-3425/95 $10.00 © 1995 IEEE

Data representation in the TRRD - a database of transcription regulatory regions of the eukaryotic genomes

O.V. Kel, A.G. Romachenko, A.E. Kel, A.N. Naumochkin, N.A. Kolchanov

Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, pr. Lavrentyeva 10, Novosibirsk, 630090, Russia

Abstract. The TRRD, a database of transcription regulatory regions of the eukaryotic genome, is described. The TRRD database contains information on the structural-functional organization of the transcriptional regulatory regions, protein factors involved in transcription regulation, and transcription signals.

A model is suggested for representing data in the TRRD. The modular structure of transcription regulatory regions is taken into account in the model. This representation model makes it possible to integrate extensive and diverse information on transcription regulatory regions in the eukaryotic genome. The model can be used to establish links between known databases, such as EMBL, EPD, TFD, and TRANSFAC. An entry of the TRRD is given as a descriptive example of the structure of the transcription regulatory regions of the gene for the alpha-subunit of the human glycoprotein hormone. Software for filling the TRRD and searching for any needed information is described.

1. Introduction

Analysis of the regulatory regions of the eukaryotic genome involved in transcription control requires special databases containing exhaustive information about the structural-functional organization of these regions and the relevant transcription factors. The EPD, a database of eukaryotic promoters [1], is specialized in that it contains information about the eukaryotic promoters of RNA polymerase II. The EPD contains only the promoters whose transcription start site has been established experimentally.

Two databases have been developed mainly to describe transcription factors, TFD [2,3] and TRANSFAC [4]. They contain the amino acid sequences of transcription factors and their DNA-binding domains and also information on binding sites in promoters (including nucleotide sequences of these sites and their location).

The TFD also contains information on consensus sequences of binding sites for various transcription factors. TRANSFAC, unlike the TFD, describes the relative disposition of binding sites in the eukaryotic promoters.

Development of a special model of data representation is a major problem in setting up a database for the transcription regulatory regions. This model should make possible a formal description, presentation, and integration of the vast information concerning the structural-functional organization of transcription regulatory regions in the eukaryotic genes. We used this approach to set up the TRRD (Transcription Regulatory Region Database).

A model of data presentation underlying the TRRD is described. In developing the model, the following was taken into account: the modular structure of regulatory regions, the hierarchy of regulatory modules, and the relationships between modules that result from interaction of transcription factors. A description of the data presentation format is given. Integration of the TRRD database with other databases, mainly TRANSFAC and TFD, is discussed.

2. Basic principles of the structural-functional organization of transcriptional regulatory regions in the eukaryotic genomes

Gene-specific transcription regulation is based on complex interaction between the trans-acting protein factors and sets of cis-elements located within the transcription regulatory regions, on cross-coupling between different protein factors and their ability to modulate the activity of the basal transcriptional complex.

2.1 Transcription factors

Based on regulation specificity, all the transcription factors can be classified as 1) ubiquitous or tissue-specific; 2) development-stage or cell-cycle specific; 3) factors that mediate induction by hormones, growth factors and other molecular signals.

According to the structural features of DNA-binding domains, all the transcription factors can be divided into a small number of superfamilies: 1) bZIP, 2) bHLH [5]; 3) homeodomain proteins [6]; 4) ETS [7]; 5) REL [8]; 6) family of nuclear proteins with different zinc finger types: C2H2, C6 [9], C4 [10]; 8) HMG-family, site-specific and non-specific types [11]; 9) WH (Winged Helix) [12]. There are also a number of transcription factor families with DNA-binding domains different from the above types, such as SRF (serum response factor), CTF/NF-1, and others [13].

Each of the above superfamilies can be subdivided into several families. Proteins that are members of different families recognize cis-elements with distinct structures. For example, the AP-1, CREB/ATF, and C/EBP families are distinguished within the bZIP superfamily [5]. The EGR and Sp-1 families, members of the C2H2 zinc finger superfamily, bind different cis-elements [14, 15]. Diversity of transcription factors is necessary to provide complex regulation of differential gene expression in each cell type.

2.2 Hierarchy levels of a transcriptional regulatory region

A fundamental principle of the organization of the regulatory regions of genes is their modular structure [16] with recognized levels of hierarchy. Let us consider the hierarchical levels in ascending order.

Transcription factor binding site (cis-regulatory element). The lowest level in the hierarchy of the structural organization of regulatory regions corresponds to the cis-regulatory element, which binds a definite transcription factor. Binding sites are 5-25 bp long. The elementary function of a site is recognition of a specific transcription factor and binding of it with a certain affinity. It has been shown experimentally that a given site can interact with several transcription factors of a certain family of regulatory proteins and, in some cases, with factors of different protein families [17].

Composite regulatory element. A composite regulatory element contains 2-3 closely lying binding sites for different transcription factors. It acts as a single unit due to specific protein-protein interactions between the respective transcription factors [18]. As a rule, a composite element is formed of adjacent or partially overlapping sites. Composite elements correspond to the second hierarchical level of a transcription regulatory region.

A composite regulatory element is an autonomous functional unit. Its autonomy is supported by the finding of similar elements in different genes. For example, TORUs (TPA and oncogene-responsive units), encompassing the PEA3 and AP-1 binding sites, are found in the promoters of the collagenase gene [19] and of the oncogene c-ets-1 [20], in the polyoma virus enhancer [21], and in the enhancers of the human and murine uPA genes [22, 23].

A good example is the composite element TCF-2 - SRF from the enhancer of the human c-fos gene [24]. In the first step, SRF recognizes the specific DNA site and binds to it. Then, the physical interactions between TCF-2 and SRF provide direct binding of TCF-2 to DNA. Hence, direct protein-protein interactions are important for the synergistic activation by the SRF and TCF-2 proteins.

Protein-protein interactions between transcription factors have been revealed experimentally for many composite elements, for example, for NF-IL6 and the glucocorticoid receptor [25], NF-IL6 and NF-kB [26], the glucocorticoid receptor and AP-1 [27], and NF-kB p65 and AP-1 [28].

Cross-coupling of two different classes of transcription factors at a composite element gives rise to a new pattern of transcription regulation, for example, tissue specificity of hormonal induction or tissue specificity of the immune or acute-phase response [29].

There are examples of combinatorial action of transcription factors at a number of other composite elements that serve to provide the narrowly specific features of transcription regulation of eukaryotic genes [30].

A composite element seems to provide the appropriate stereospecificity for interaction between transcription factors (which are bound to distal elements) and the proteins of the basal transcription complex [31].

To summarize, the second level of the hierarchy of the structural organization of a regulatory region corresponds to the composite regulatory element. Its main functions are: 1) stabilization of the DNA-protein complex due to direct protein-protein interaction between the transcription factors; 2) emergence of strictly specific features of transcription regulation by cross-coupling of proteins of different families; 3) modulation of the activity of the basal transcription complex.

Promoters and enhancers. Promoters and enhancers may be assigned to the next, higher hierarchical level in the structural organization of a transcription regulatory region. They consist of several composite elements and/or individual binding sites for transcription factors. Promoters and enhancers differ in function and in position relative to the transcription start.

The regulatory part of the gene responsible for the basal level of transcription is regarded as the promoter. The promoter includes the region of transcription start and the adjacent upstream region of several tens of nucleotides [32]. Cis-regulatory elements of the promoter provide the following functions: 1) formation of a minimal transcription initiating complex that includes multisubunit RNA polymerase II and transcription factors TFIID, TFIIB, and TFIIF [33]; 2) subsequent formation of the complete transcription complex by addition of the factors TFIIE and TFIIH, which provide its elongation ability [33].

There are promoters with a TATA box, and TATA-less promoters in which certain contextual elements have the same function of TFIID binding [32]; Inr-dependent promoters without or in conjunction with TATA elements also occur [34]. The promoter initiating complex is modulated by protein factors binding to upstream regulatory elements [31, 35].

An enhancer also consists of composite elements and single binding sites. It enhances (modulates) the level of transcription depending on the type of tissue, developmental and cell cycle stage, and induction by hormones or other molecular signals. The position of an enhancer can vary significantly within the gene. Enhancers have been found immediately upstream of the promoter, at a distance of several thousand bp in the 5'-flanking region, within introns, and in the 3'-flanking regions.

The main features of enhancers are: 1) distant action, and 2) orientation-independent action.

Transcription regulatory region. A transcription regulatory region consists of several elements of the preceding levels (separate binding sites of transcription factors, composite elements, enhancers, promoters). It spans lengthy regions of DNA. Obligatory transcription regulatory regions are located in the 5'-flanking regions of the genes and, in some cases, can extend to the noncoding part of the first exon. Other transcription regulatory regions can be located within introns and in the 3'-flanking regions of genes. The regulatory regions of some genes are located more than 5-15 kb upstream of the transcription start site.

The integrity of contextual elements of the regulatory region also provides: 1) selective accessibility of long DNA stretches for binding of transcription factors; 2) specific architectural features important for transcription regulation of certain DNA regions (for example, spatial proximity of distant cis-elements).

It should be emphasized that the individual nucleotide sequence of a regulatory region with a given set of cis-elements can provide several modes of functioning of the transcription machinery. The choice of an appropriate mode depends on the definite set of tissue-specific nuclear proteins interacting with the region. Expression of the corresponding nuclear proteins depends on tissue and cell type, developmental stage and cell cycle, and induction by molecular signals. As a result, there is a specific set of proteins that function as architectural components of chromatin and as trans-acting factors. Different sets of proteins expressed in a definite cell situation can form various gene-specific variants of the transcription complex at the same regulatory region [36, 37]. This makes it possible to alter the mode in which the regulatory region functions, which, in turn, can modify gene expression.

The system of integral regulation of gene transcription. Finally, we reach the highest level of hierarchy, i.e., integrity of all the regulatory regions of the gene.

The main functions of the system of integral regulation of gene transcription are: 1) complete control of all the stages of the transcription process, including initiation, elongation, and termination; 2) involvement in formation of compact DNA structure; 3) provision of interaction with nuclear matrix proteins; 4) provision of coordination between genes.

Thus, the wide variability of gene expression can be ensured by combinatorial sets of regulatory regions.
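The hierarchy described in this section can be pictured as nested record types. The following Python sketch is only an illustration under stated assumptions: the class and attribute names are hypothetical and are not part of the TRRD, and the model is simplified (for instance, a regulatory region here holds only promoters and enhancers, not free-standing sites). The example values reuse the TORU composite element mentioned above; its placement in a 5'-flanking region is assumed for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BindingSite:
    """Level 1: a cis-regulatory element that binds transcription factors."""
    name: str
    factors: List[str] = field(default_factory=list)


@dataclass
class CompositeElement:
    """Level 2: closely lying sites that act as one unit."""
    name: str
    sites: List[BindingSite]


@dataclass
class RegulatoryUnit:
    """Level 3: a promoter or an enhancer."""
    kind: str
    composite_elements: List[CompositeElement] = field(default_factory=list)
    single_sites: List[BindingSite] = field(default_factory=list)

    def all_sites(self) -> List[BindingSite]:
        # Individual sites first, then the sites inside composite elements.
        sites = list(self.single_sites)
        for element in self.composite_elements:
            sites.extend(element.sites)
        return sites


@dataclass
class RegulatoryRegion:
    """Level 4: a region built from units of the preceding levels."""
    location: str
    units: List[RegulatoryUnit] = field(default_factory=list)


@dataclass
class Gene:
    """Level 5: the integral system of all regulatory regions of a gene."""
    name: str
    regions: List[RegulatoryRegion] = field(default_factory=list)

    def factors(self) -> List[str]:
        # Collect every factor named at any site of any region, without repeats.
        found = {
            factor
            for region in self.regions
            for unit in region.units
            for site in unit.all_sites()
            for factor in site.factors
        }
        return sorted(found)


# Illustrative use: a TORU composite element (PEA3 and AP-1 sites) in a promoter.
toru = CompositeElement(
    "TORU", [BindingSite("PEA3", ["PEA3"]), BindingSite("AP-1", ["AP-1"])]
)
promoter = RegulatoryUnit("promoter", composite_elements=[toru])
gene = Gene("collagenase", [RegulatoryRegion("5'-flanking region", [promoter])])
print(gene.factors())
```

Keeping the gene as the top-level object mirrors the principle that one gene corresponds to one description, with every lower level reachable from it.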

3. Description of the data format

As indicated above, one of the advantages of information presentation in the TRRD is that one entry in the database corresponds to a description of one gene. Information in every entry of the database is supplied and changed in the process of revision of these entries. A revision can be made whenever new experimental information about the structure of the regulatory regions of a gene is added. It is important that information can be added to every database entry without major changes in the fields filled in earlier. Any field of the database can contain references to papers, to other databases, or to other sources.

Let us examine the description of a gene in the TRRD, the human gene of the alpha-subunit of glycoprotein hormone (Fig. 1).

ID - identifier of an entry in the database. It contains the brief species name, the brief name of the gene or the eukaryotic virus, and the date of last editing (day/month/year) (Fig. 1).

OS - complete species name.

NC - complete name of the gene or the eukaryotic virus.

CG - classification of each gene promoter in compliance with the classification used in the eukaryotic promoter database EPD [1]. All promoters for each taxon are subdivided into 7 main classes: small nuclear RNA; structural proteins; storage and transport proteins; enzymes; hormones and regulatory proteins; stress proteins and immune response proteins; unclassified promoters. Each class is divided into subclasses. The described gene is assigned to the class of hormones and regulatory proteins.


RE - specificity of gene regulation. Four main situations are distinguished that determine the specificity of gene regulation in eukaryotes.

1. Gene expression depends on the cell cycle stage.

2. Gene expression depends on the stage of organism development. For example: embryo, adult.

3. Tissue-specific regulation (the tissues or cell types in which this gene is expressed are listed). The tissues where repression of expression occurs are listed after the symbol (->).

4. Gene expression depends on the action of external and internal factors, such as heat shock, starvation, chemicals, hormones, interleukins, growth factors, vitamins, etc.

The alpha-subunit of glycoprotein hormone is assigned to tissue-specific genes (see Fig. 1). Its expression is suppressed by steroid and thyroid hormones and enhanced by cAMP.

KW - key words. Additional gene features are listed that make it possible to unite several genes into a group not described in the field CG, for example, the biochemical pathway in which the products of this group of genes take part, close location on the chromosome, and other features.

RG - the gene or the genome region in which the regulatory region of the gene is located (the 5'-flanking region, the 3'-flanking region, exon, intron, long terminal repeat for eukaryotic viruses).

The part of the entry from this field to the next RG field describes the given regulatory region.

PR - description of a promoter, enhancer, or silencer. The following information is recorded in this field: the specific name of the regulatory unit, if mentioned in the paper; the name of the point in the sequence relative to which all the positions are designated (ST - transcription start, SI - start of intron, SS - start of sequence, AltST - alternative start of transcription, given relative to the start chosen as the main one for promoters with several transcription start sites); the location of this regulatory unit; and the list of all sites and composite elements of this promoter, enhancer, or silencer.

The following example gives an idea of how the field is used. The gene for the alpha-subunit of human glycoprotein hormone is expressed at a high level in placenta. This high level of expression is provided by a tissue-specific enhancer and the proteins interacting with it: a placenta-specific protein and the ubiquitous factor CREB [38]. Nevertheless, it is known that steroid hormones suppress this gene via a regulatory region that substantially overlaps the tissue-specific enhancer [39]. Furthermore, a binding site for the T3 receptor exists between the TATA box and the transcription initiation site in the alpha-GH gene promoter, providing a potentially powerful mechanism for transcription inhibition [40]. This information is written in the form of three separate PR fields (Fig. 1).

CE - field for description of composite elements. It records the specific name of the composite element, its location, and the sites making up the composite element.

NM - short name of the site. If several sites of this type have been revealed in this gene, the site number is indicated in brackets. Then, the complete name of the site is written. If the site is located on the complementary strand, the letter (C) is used (Fig. 1).

TF - brief description of the protein or protein complex, if the protein interacting with the site has been identified. If different proteins belonging to one family interact with one site and similarly affect gene expression, these proteins are listed here with the corresponding references. This is exemplified by the steroid responsive element, to which both the androgen receptor and the glucocorticoid receptor can bind (Fig. 1).

AT - in this field the influence of trans-acting factor on gene expression is registered, i.e., decrease or increase in gene expression after binding of transcription factor.

SQ - nucleotide sequence corresponding to the consensus of the site.

PQ - site locations relative to the point established in the field PR.

SF - nucleotide sequence of the region of interaction with protein as demonstrated by footprinting.

PF - positions of the region interacting with protein.

For site CRE(1) these fields are described as shown in Fig. 1. The sequence of the region homologous to the consensus is written in field SQ and its precise positions in field PQ. Fields SF and PF are absent because footprinting has not been done.

AG - types of experiments, such as gel mobility shift assays and cross-competition experiments, footprinting, and plasmid constructions with various deletions and mutations.

HM - in this field, the genes and sites homologous to this site are listed. For site CRE(1) this field is described as shown in Fig. 1.

EF - comparative affinity of this site for transcription factors. Other information on the relative activity of different sites of this type is also included. For site CRE(1), some examples of comparative affinity are described in this field. The affinity of the bovine alpha-GH CRE-like element for CREB is at least 200 times lower than that of the human alpha-GH CRE [38].

After all the signals of this regulatory region have been described, field RG can be repeated for description of other regulatory regions. After description of all the known regulatory regions of the gene, the descriptive fields follow.

SC - comments on structure of sites of the gene, the degree of matching to the consensus, the presence of palindromes, repeats, conserved positions.

CC - comments on the whole gene, data on synergism between distantly located sites, functional interaction of regulatory regions and data on protein-protein contacts. Comments may include other information at reviewer’s discretion.

In the first version of the database, each entry contains the complete nucleotide sequences of regulatory regions extracted from EMBL:

BI - name of bank, identifier; names and number of points in the sequence with reference to which all corresponding regulatory units are positioned; the position of the point in the entry of the nucleotide bank is written.

BM - name of sequence in the nucleotide bank.

BS - nucleotide sequence.

If regulatory regions of the gene are described in two or more sequences, fields BI, BM, BS, BF are repeated as many times as necessary.

At the end of every entry, the full list of references cited in the entry is included. The TRRD can function in conjunction with established databases, such as Current Contents. Linking of the TRRD with the MEDLINE database is to be implemented in the future.
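To make the nesting of fields concrete, the sketch below reads a flat-file entry into a nested structure in which RG opens a regulatory region, PR opens a promoter/enhancer/silencer block inside it, and NM opens a site whose detail fields (TF, AT, SQ, PQ, SF, PF, AG, HM, EF) follow. This is a hedged illustration only: the line syntax (a two-letter code followed by spaces and a value), the function name `parse_entry`, and the sample values are assumptions made for the example and are not copied from Fig. 1 or from the TRRD software.

```python
# Site-level field codes taken from the field descriptions in this section.
SITE_FIELDS = {"TF", "AT", "SQ", "PQ", "SF", "PF", "AG", "HM", "EF"}


def parse_entry(text):
    """Parse one entry written as 'CODE  value' lines (assumed syntax)."""
    entry = {"fields": {}, "regions": []}
    region = unit = site = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "RG":
            # A new regulatory region closes any open unit and site.
            region = {"location": value, "units": []}
            entry["regions"].append(region)
            unit = site = None
        elif code == "PR" and region is not None:
            unit = {"description": value, "composite_elements": [], "sites": []}
            region["units"].append(unit)
            site = None
        elif code == "CE" and unit is not None:
            unit["composite_elements"].append(value)
        elif code == "NM" and unit is not None:
            site = {"name": value}
            unit["sites"].append(site)
        elif code in SITE_FIELDS and site is not None:
            site.setdefault(code, []).append(value)
        else:
            # Entry-level fields such as ID, OS, KW, SC, CC.
            entry["fields"].setdefault(code, []).append(value)
    return entry


# Illustrative sample; the values are placeholders, not real TRRD content.
SAMPLE = """\
ID  EXAMPLE
OS  Homo sapiens
RG  5'-flanking region
PR  enhancer
NM  CRE(1)
TF  CREB
AT  increase
NM  CRE(2)
CC  free-text comment
"""

parsed = parse_entry(SAMPLE)
first_unit = parsed["regions"][0]["units"][0]
print([s["name"] for s in first_unit["sites"]])
```

Because each block is opened by its own code, new fields can be appended to an entry without disturbing the fields filled in earlier, which is the property the format description emphasizes.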

4. Technical description of TRRD software

To work with the database (DB), that is, to edit, view, and search the data, the S-TRRD software was designed. The software is implemented using the CA-Clipper v.5.01 compiler. It runs on IBM PC AT computers and compatibles under DOS 3.0 and later versions. The program requires at least 500 kb of RAM and an EGA/VGA monitor.

The S-TRRD has a friendly interface consisting of a system of windows and nested menus, with a help system for every operating mode of the program. The software makes it possible to create empty shells for importing data from text files and to export individual documents from the database, or the entire database, to text files. Before exporting data, one can select the informational fields to be output to the ASCII file. This makes it possible to design working versions of databases and various subbases, with the selection criteria set by the user. For data storage we use the common dBASE format.
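The field-selective export described above can be sketched in a few lines of Python. This is not the S-TRRD code (which is written in CA-Clipper); the function name `export_entries`, the in-memory representation of an entry as a mapping from field code to a list of values, and the `//` entry terminator are all assumptions made for illustration.

```python
def export_entries(entries, fields):
    """Return ASCII text holding only the selected fields of each entry."""
    lines = []
    for entry in entries:
        for code in fields:
            # Fields absent from an entry are simply skipped.
            for value in entry.get(code, []):
                lines.append(f"{code}  {value}")
        lines.append("//")  # assumed end-of-entry marker
    return "\n".join(lines) + "\n"


# Illustrative entry with placeholder values.
demo_entries = [{"ID": ["E1"], "OS": ["Homo sapiens"], "KW": ["hormone"]}]
exported = export_entries(demo_entries, ["ID", "KW"])
print(exported)
```

Selecting the fields before export is what allows a working subbase to carry only the information a given analysis needs.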

There are keyword databases for some of the fields (ID, OS, KW, etc.), the database AG.dbf with the classification of experiments by type, CG.dbf with the Bucher classification, and the general RF.dbf with literature references. All these DBs are used for editing and for searching data through the database. The content of the keyword DBs can also be exported to ASCII files.

Standard operation with the DB includes viewing documents, search by condition, editing, and input of new data. The software offers different modes of viewing the DB and moving through it. For example, alphabetical sorting makes it possible to view all the fields with the current code (e.g., all OS), to rapidly select the field with the needed information and, with the sorting mode turned off, to look through the entire document containing the selected informational line. It is also possible to call up the list of all the fields; once the required field is selected, the user can move to it in the current entry.

To search for data of interest in the DB, in addition to the available sorting modes, the user can define a complex search condition. Through a special window, the user enters the name of the field to be searched along with the search context and the reference to be viewed. All this can be done manually or by calling the reference DB. If the entry or field name is not set (a blank is entered), the search is performed throughout all the fields in all banks. All the entries found (complete entries or their individual fields) may be output to an ASCII file. Editing may be performed for every individual field and for each structural block RG, PR, CE, and NM.
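The search rule, in which a blank field name means "search every field", can be sketched as follows. The function name `search`, the case-insensitive substring matching, and the representation of entries as mappings from field code to a list of values are assumptions for illustration and do not describe the actual S-TRRD implementation.

```python
def search(entries, context, field_name=""):
    """Return (entry ID, field code, value) for every value containing context.

    An empty field_name searches all fields, mirroring the blank-field rule.
    """
    needle = context.lower()
    hits = []
    for entry in entries:
        for code, values in entry.items():
            if field_name and code != field_name:
                continue
            for value in values:
                if needle in value.lower():
                    hits.append((entry["ID"][0], code, value))
    return hits


# Placeholder entries, not real TRRD content.
search_entries = [
    {"ID": ["E1"], "OS": ["Homo sapiens"], "KW": ["hormone"]},
    {"ID": ["E2"], "OS": ["Mus musculus"], "KW": ["homo-oligomer"]},
]
print(search(search_entries, "homo"))
print(search(search_entries, "homo", "OS"))
```

Returning the field code with each hit lets the caller output either the complete entry or only the matching fields, as the text describes.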

The software can operate on nucleotide sequences of regulatory regions extracted from the EMBL database by the corresponding identifier given in the field BI, which includes loading them into the DB in the required format and subsequently exporting the sequences to an individual file.

A principal option of the software is the construction of samples of nucleotide sequences of cis-elements of a definite type. It is based on the information in the DB on the precise location of such cis-elements within the nucleotide sequence of the given regulatory region. The constructed samples make it possible to investigate the contextual features of functional site structure, with the sequences flanking such regulatory elements included in the analysis.
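Building such a sample amounts to converting site positions given relative to a reference point (for example, the transcription start) into coordinates within the stored sequence, then cutting out the site with its flanks. The sketch below shows one way to do this; it assumes the common convention that relative positions have no zero (+1 is the reference nucleotide itself, -1 the nucleotide before it) and that the position of the reference point in the sequence is 1-based. The function names and the toy sequence are hypothetical.

```python
def to_index(rel, ref_pos):
    """Convert a relative position (no position 0) to a 0-based string index."""
    if rel == 0:
        raise ValueError("relative position 0 does not exist in this convention")
    absolute = ref_pos + rel if rel < 0 else ref_pos + rel - 1  # 1-based
    return absolute - 1


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def extract_site(sequence, ref_pos, start, end, flank=0, complementary=False):
    """Cut out a site (start..end, relative to ref_pos) with optional flanks."""
    lo = max(to_index(start, ref_pos) - flank, 0)
    hi = min(to_index(end, ref_pos) + 1 + flank, len(sequence))
    fragment = sequence[lo:hi]
    # Sites marked (C) lie on the complementary strand.
    return reverse_complement(fragment) if complementary else fragment


# Toy sequence: the reference point (+1) is the last nucleotide, position 19.
toy = "AAAAATGACGTCACCCCCG"
print(extract_site(toy, 19, -13, -6))
print(extract_site(toy, 19, -13, -6, flank=2))
print(extract_site(toy, 19, -13, -6, flank=2, complementary=True))
```

Collecting the fragments returned for all sites of one type across entries would give the kind of sample the text describes, with the flank length chosen by the analyst.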

The described software has proved to be a useful tool for accumulating information in the TRRD and for processing the data.

5. Discussion

The TRRD database is designed primarily for exhaustive description of the contextual features of genomic DNA that determine gene-specific transcription regulation. The main feature distinguishing the TRRD from other databases is its focus on the features of DNA primary structure that provide the molecular-genetic mechanisms of gene-specific transcription regulation. It is noteworthy that the gene is the object of the highest hierarchical level considered within the TRRD. This makes it possible to use the suggested format for detailed description of all the contextual DNA features that support the transcription machinery of the considered gene.

However, it should be emphasized that to describe in detail all the transcription machinery for a definite gene, one should consider two types of objects equally involved in the process of transcription regulation. These are the DNA fragment corresponding to the transcription regulatory region, comprising an appropriate set of cis-acting sites, on the one hand, and a set of trans-acting factors, on the other hand.

In this regard, there are two possible ways for informative system development that can exhaustively describe the gene-specific regulation of transcription. Both rely on the integration of the designed database TRRD containing a detailed description of transcription regulatory sites, with the databases TFD and TRANSFAC, which contain extensive data on transcription factors.

The first possibility is introduction of changes in the database TRRD. The first step here is expansion of the model of data presentation accepted for use in the TRRD and modification of the database format. This modification is effected by introduction of a large number of additional informative fields for detailed description of trans-acting factors interacting with cis-elements of the considered regulatory region. Once the respective exchange interfaces are developed, the needed information for these fields of the TRRD database will be extracted from the databases TFD and TRANSFAC.

However, the second way appears more promising. It is based on mutual integration of the three databases, TRANSFAC, TFD, and TRRD, resulting in a dynamically integrated information system. In this case, each database will have programs that extract common information from the other two databases and incorporate the obtained data in its own format. These programs will make it possible to extract all available information from the different databases, to compare the data, and to check and accumulate them in each individual database.

Special software should be developed to extract specific information from each database. The TFD will be the source of information on trans-acting factors: amino acid sequence, domains, activity, and so on. TRANSFAC will be the source of new data on the relative location of transcription elements in gene regulatory regions and of data on features of transcription factors (relations between factors). The TRRD will be the source of data on the location of binding sites, composite elements, the influence of mutations on the activity of regulatory regions, quantitative data on relative activity, and descriptions of the specificity of regulation of the region.

When information from the TFD, TRANSFAC, TRRD, and EMBL is integrated, special software will be developed to search for and resolve inconsistencies between data derived from these databases.
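The search for inconsistencies can be pictured with a small sketch. It assumes, purely for illustration, that each database exports a binding site as a record keyed by gene and site name. The field names (`start`, `end`, `sequence`), the function name, and the sample records are hypothetical and do not reflect the actual TFD, TRANSFAC, or TRRD formats.

```python
from collections import defaultdict

def find_inconsistencies(records):
    """Group records by (gene, site) and report fields whose values differ between sources."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["gene"], rec["site"])].append(rec)

    conflicts = []
    for key, recs in grouped.items():
        # Compare every descriptive field that at least one source provides.
        fields = set().union(*(r.keys() for r in recs)) - {"gene", "site", "source"}
        for field in sorted(fields):
            values = {r["source"]: r[field] for r in recs if field in r}
            if len(set(values.values())) > 1:
                conflicts.append((key, field, values))
    return conflicts

# Hypothetical records; the second record's end coordinate is altered on purpose
# to trigger a conflict and is not real TRANSFAC content.
records = [
    {"source": "TRRD", "gene": "a-GH", "site": "CRE(1)",
     "start": -142, "end": -135, "sequence": "TGACGTCA"},
    {"source": "TRANSFAC", "gene": "a-GH", "site": "CRE(1)",
     "start": -142, "end": -134, "sequence": "TGACGTCA"},
]

conflicts = find_inconsistencies(records)
for (gene, site), field, values in conflicts:
    print(gene, site, field, values)
```

A report of this kind only flags disagreements. Deciding which source is correct would still require curation or rules about which database is authoritative for a given field.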

The developed software will enrich the databases with new information. New fields will be introduced in each of the three databases to accumulate newly available information. At the next step, all three databases should be combined into a comprehensive database of regulatory genomic regions.

Pointers to the unique entry types of the individual databases (e.g., binding matrices in TRANSFAC, the clones of TFD, contextual features in TRRD) will be provided.

Within the framework of this dynamic integrated information system, provisions should be made for linkage to the following related databases and software:

- sequence databases (EMBL, SwissProt, PIR);

- databases for biological functions (EPD, PROSITE);

- sequence diagnostic software (e.g., Signalscan [41], Consindex, SITEVIDEO [42]).

The final stage of the TRRD development should be its conversion into a comprehensive expert system, which includes:

- development of an object-oriented system for navigation in the databases;

- development of software for analysis of different types of data in the three linked databases and automatic detection of related data. This software will reveal, for example, relationships between the patterns of location of cis-elements and particular features of the regulation of promoter activity;

- development of a knowledge base containing revealed patterns of the data;

- development of a user-friendly interface to provide access to the databases and knowledge base.
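As one illustration of the kind of pattern such analysis software might reveal, the sketch below finds pairs of cis-elements whose positions overlap, using the site coordinates given in the a-GH entry of Figure 1. The function name and the data layout are assumptions made for this example and are not part of the TRRD software.

```python
from itertools import combinations

# Site positions relative to the transcription start, as listed in the
# a-GH entry of Figure 1 (some sites are given from the higher coordinate
# to the lower one, so each pair is sorted before comparison).
SITES = {
    "URE": (-169, -154),
    "CRE(1)": (-142, -135),
    "CRE(2)": (-124, -117),
    "JRE": (-114, -107),
    "GRE(1)": (-137, -151),
    "SRE": (-90, -111),
    "GRE(2)": (-79, -64),
    "CCAAT box": (-83, -87),
    "TATA box": (-29, -23),
    "TRE": (-22, -7),
}

def overlapping_pairs(sites):
    """Return all pairs of sites whose coordinate intervals share at least one position."""
    pairs = []
    for (name_a, pos_a), (name_b, pos_b) in combinations(sites.items(), 2):
        lo_a, hi_a = sorted(pos_a)
        lo_b, hi_b = sorted(pos_b)
        if lo_a <= hi_b and lo_b <= hi_a:
            pairs.append((name_a, name_b))
    return pairs

pairs = overlapping_pairs(SITES)
print(pairs)  # [('CRE(1)', 'GRE(1)'), ('JRE', 'SRE')]
```

The two overlaps found, CRE(1) with GRE(1) and JRE with SRE, agree with the comments in the entry that glucocorticoid receptor binding may interfere with the CRE and that androgen receptor binding may block binding of the JRE factor.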

We intend to develop a relational database structure for TRRD data and software for visualizing information from the database.
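A relational layout for such data might look like the following minimal sketch, which uses Python's built-in `sqlite3` module. The table names, column names, and the choice of two tables are hypothetical and are offered only to make the idea concrete. The inserted rows reuse values from the a-GH entry in Figure 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per gene, one row per cis-element (site).
conn.executescript("""
CREATE TABLE gene (
    gene_id  TEXT PRIMARY KEY,
    species  TEXT NOT NULL,
    name     TEXT NOT NULL
);
CREATE TABLE site (
    site_id   INTEGER PRIMARY KEY,
    gene_id   TEXT NOT NULL REFERENCES gene(gene_id),
    name      TEXT NOT NULL,
    factor    TEXT,
    effect    TEXT,
    sequence  TEXT,
    pos_from  INTEGER,
    pos_to    INTEGER
);
""")

conn.execute(
    "INSERT INTO gene VALUES (?, ?, ?)",
    ("a-GH", "human", "glycoprotein hormone alpha-subunit gene"),
)
conn.executemany(
    "INSERT INTO site (gene_id, name, factor, effect, sequence, pos_from, pos_to) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("a-GH", "CRE(1)", "CREB", "increase", "TGACGTCA", -142, -135),
        ("a-GH", "CRE(2)", "CREB", "increase", "TGACGTCA", -124, -117),
        ("a-GH", "TRE", "erbA-beta TR", "decrease", "GCAGGTGAGGACTTCA", -22, -7),
    ],
)

# Example query: all sites of the gene that increase transcription, ordered by position.
rows = conn.execute(
    "SELECT name, pos_from, pos_to FROM site "
    "WHERE gene_id = ? AND effect = ? ORDER BY pos_from",
    ("a-GH", "increase"),
).fetchall()
print(rows)
```

Keeping sites in their own table lets one gene own any number of cis-elements, which mirrors the gene-centred hierarchy of the entry format.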

Links between TFD, TRANSFAC, and TRRD can support the development of each database and will help to clarify the mechanisms of gene activity regulation in the eukaryotic genome.

Acknowledgments

We are grateful to Dr. E. Wingender (Gesellschaft für Biotechnologische Forschung) for fruitful discussion of the problem of linkage between the databases TRRD, TRANSFAC, and TFD. This research was supported by the Russian National Program “Human Genome”, the Russian Ministry of Science, Education and Technical Policy, the Office of Health and Environmental Research of the Department of Energy, USA (grant #OR00033-93CIS002), and the Fundamental Research Foundation, Russia (grants No 13241-a and 12757-a).

References

1. Bucher, 1989. EPD: Eukaryotic promoter database, current release.
2. Ghosh D., 1993. Nucl. Acids Res., v.21, n.13, p.3117-3118.
3. Ghosh D., 1990. Nucl. Acids Res., v.18, p.1749-1756.
4. Wingender E., 1988. Nucl. Acids Res., v.16, n.5, p.1879.
5. Lamb P. and McKnight S.L., 1991. Trends Biochem. Sci., v.16, p.417-422.
6. Verrijzer C.P., Alkema M.J., van Weperen W.W., van Leeuwen H.C., Strating M.J.J., van der Vliet P.C., 1992. EMBO J., v.11, p.4993-5003.
7. Janknecht R., Nordheim A., 1993. Biochim. Biophys. Acta, v.1155, p.346-356.
8. Blank V., Kourilsky P., and Israel A., 1992. Trends Biochem. Sci., v.17, p.135-140.
9. Vallee B.L., Coleman J.E., and Auld D.S., 1991. PNAS, v.88, p.999-1003.
10. Scearce L.M., Laz T.M., Hazel T.G., Lau L.F., and Taub R., 1993. J. Biol. Chem., v.268, p.8855-8861.
11. Bustin M., Lehn D.A., Landsman D., 1990. Biochim. Biophys. Acta, v.1049, p.231-243.
12. Lai E., Clark K.L., Burley S.K., and Darnell J.E., 1993. PNAS, v.90, p.10421-10423.
13. Sharrocks A.D., Gille H., Shaw P.E., 1993. Mol. Cell. Biol., v.13, n.1, p.123-132.
14. Patwardhan S., Gashler A., Siegel M.G., Chang L.C., Joseph L.J., Shows T.B., Le Beau M.M., Sukhatme V.P., 1991. Oncogene, v.6, p.917-928.
15. Faisst S. and Meyer S., 1992. Nucl. Acids Res., v.20, n.1, p.3-26.
16. Dynan W.S., 1989. Cell, v.58, p.1-4.
17. Wingender E., 1993. Gene Regulation in Eukaryotes. VCH, Germany, 430 p.
18. Miner J.N., Yamamoto K.R., 1992. Genes & Dev., v.6, p.2491-2501.
19. Gutman A. and Wasylyk B., 1990. EMBO J., v.9, n.7, p.2241-2246.
20. Majerus M.-A., Bibollet-Ruche F., Telliez J.-B., Wasylyk B., Bailleul B., 1992. Nucl. Acids Res., v.20, n.11, p.2699-2703.
21. Wasylyk B., Wasylyk C., Flores P., Begue A., Leprince D., and Stehelin D., 1990. Nature, v.346, p.191-193.
22. Rorth P., Nerlov C., Blasi F., and Johnsen M., 1990. Nucl. Acids Res., v.18, p.5009-5017.
23. Nerlov C., De Cesare D., Pergola F., Caracciolo A., Blasi F., Johnsen M., and Verde P., 1992. EMBO J., v.11, p.4573-4582.
24. Shaw P.E., 1992. EMBO J., v.11, p.3011.
25. Nishio Y., Isshiki H., Kishimoto T., Akira S., 1993. Mol. Cell. Biol., v.13, p.1854-1862.
26. Matsusaka T., Fujikawa K., Nishio Y., Mukaida N., Matsushima K., Kishimoto T., Akira S., 1993. Proc. Natl. Acad. Sci. USA, v.90, p.10193-10197.
27. Miner J.N. and Yamamoto K.R., 1991. Trends Biochem. Sci., v.16, p.423-426.
28. Stein B., Baldwin A.S., Ballard D.W., Greene W.C., Angel P., and Herrlich P., 1993. EMBO J., v.12, p.3879-3891.
29. Roberts L.R., Nichols L.A., Holland L.J., 1993. Biochemistry, v.32, p.11627-11637.
30. Menoud P.-A., Matthies R., and others, 1993. Nucl. Acids Res., v.21, p.1845-1852.
31. Lee J.-S., Galvin K.M., and Shi Y., 1993. PNAS, v.90, p.6145-6149.
32. Roeder R.G., 1991. Trends Biochem. Sci., v.16, p.402-407.
33. Buratowski S., 1994. Cell, v.77, p.1-3.
34. Smale and Baltimore, 1989. Cell, v.57, p.103-113.
35. Greene J.M. and Kingston R.E., 1990. Mol. Cell. Biol., v.10, p.1319-1328.
36. Stapleton G., Somma M.P., Lavia P., 1993. Nucl. Acids Res., v.21, p.2465-2471.
37. Cardot P., Chambaz J., Kardassis D., Cladaras C., and Zannis V.I., 1993. Biochemistry, v.32, p.9080-9093.
38. Bokar J.A., Keri R.A., Farmerie T.A., Fenstermaker R.A., Andersen B., Hamernik D.L., Yun J., Wagner T., and Nilson J., 1989. Mol. Cell. Biol., v.9, n.11, p.5113-5122.
39. Akerblom I.E., Slater E.P., Beato M., Baxter J.D., and Mellon P.L., 1988. Science, v.241, p.350-353.
40. Chatterjee V.K.K., Lee J.-K., Rentoumis A., and Jameson J.L., 1989. PNAS, v.86, p.9114-9118.
41. Prestridge D.S. and Stormo G., 1993. Comp. Appl. Biosci., v.9, p.113-115.
42. Kel A.E., Ponomarenko M.P., Likhachev E., Orlov Yu.L., Ischenko I.V., Milanesi L., Kolchanov N.A., 1993. CABIOS, v.9, n.6, p.617-627.


ID  HUM; a-GH; 01/06/94
OS  human
NO  glycoprotein hormone alpha-subunit gene
NG  (chorionic gonadotropin alpha-subunit gene)
CG  6.1.5.3.1 Vertebrate promoters, Chromosomal genes, Hormones,
CG  Pituitary and placental hormones, Glycoprotein hormones.
xx
RE  3(+): placenta, gonadotropes of the anterior pituitary [Bokar J.A. et al., 1989];
RE  4(-): steroid hormones [Akerblom et al. 1988, Clay et al., 1993];
RE  thyroid hormone [Chatterjee et al. 1989];
RE  4(+): cAMP [Bokar J.A. et al., 1989]
xx
RG  5'region
xx
PR  placenta-specific enhancer; ST1; -169 to -100; URE, CRE(1), CRE(2), JRE
PR  [Bokar J.A. et al., 1989, Andersen et al, 1990]
xx
PR  steroid-responsive region; ST1; -168 to +45; GRE(1), SRE, GRE(2)
PR  [Akerblom et al, 1988]
xx
PR  proximal promoter; ST1; -90 to +1; CCAAT box, TATA box, TRE
PR  [Bokar et al., 1989, Chatterjee et al. 1989]
xx
NM  URE; upstream regulatory element [Bokar J.A. et al., 1989]
TF  not identified placenta-specific protein [Bokar J.A. et al., 1989]
AT  increase
SQ  AAGGGTTGAAACAAGA
PQ  -169 to -154
AG  3.4, 6.1, 6.3 [Bokar J.A. et al., 1989]
xx
NM  CRE(1); cAMP response element [Bokar J.A. et al., 1989,
NM  Deutsch P.J. et al., 1988]
TF  CREB; cAMP response element binding protein
AT  increase
SQ  TGACGTCA
PQ  -142 to -135
AG  3.4, 6.1, 6.3 [Bokar J.A. et al., 1989]
AG  3.2.1, 3.3, 6.2+, 6.4+, 6.5+, 7.3 [Deutsch P.J. et al., 1988]
HM  Bovine a-GH, CRE, -139 to -132, TGATGTCA.
HM  Mouse a-GH, CRE, -139 to -132, TGATGTCA.
HM  Rat a-GH, CRE, -136 to -129, TGATGTCA. [Bokar J.A. et al., 1989]
HM  CREs of rat glucagon, bovine PTH, human enkephalin TG-CGTCA,
HM  human VIP, rat somatostatin [Deutsch P.J. et al., 1988].
EF  The affinity of the bovine a-GH CRE element for binding with CREB
EF  is at least 200 times lower than that of the human a-GH CRE
EF  [Bokar J.A. et al., 1989].
EF  The affinity of the a-GH CRE for binding with CREB is indistinguishable
EF  from the affinity of the rat som. CRE and much more than that of the
EF  bov. PTH, rat glucagon and mutant a-GH CRE, containing central C to G
EF  transversion [Deutsch P.J. et al., 1988].
xx
NM  CRE(2); cAMP response element [Bokar J.A. et al., 1989,
NM  Deutsch P.J. et al., 1988]
TF  CREB
AT  increase
SQ  TGACGTCA
PQ  -124 to -117
AG  3.4, 6.1, 6.3 [Bokar J.A. et al., 1989]
AG  3.2.1, 3.3, 6.2+, 6.4+, 6.5+, 7.3 [Deutsch P.J. et al., 1988]
SC  consensus sequence of CRE is a palindrome
xx
NM  JRE; junctional regulatory element [Andersen et al, 1990]
TF  JRF; junctional regulatory factor [Andersen et al, 1990]
AT  increase
SQ  GTAATTAC
PQ  -114 to -107


AG  1.1.1, 1.1.2, 3.4, 4.2, 6.2 [Andersen et al, 1990]
HM  JRE is conserved between human, rat, mouse, cattle
HM  [Andersen et al, 1990]
SC  sequence of the JRE is a palindrome [Andersen et al, 1990]
cc  JRF is placenta-specific 50 kD factor
xx
NM  GRE(1); glucocorticoid responsive element; (C) [Akerblom et al, 1988]
TF  GR; glucocorticoid receptor
AT  decrease
SQ  ACGTCAATTTGATCT
PQ  -137 to -151
AG  1.1.5, 4.2, 6.1.1+, 7.1 [Akerblom et al, 1988]
xx
NM  SRE; steroid-responsive element; (C)
TF  AR; androgen receptor [Clay C.M. et al., 1993],
TF  GR; glucocorticoid receptor [Akerblom et al, 1988]
AT  decrease
SQ  ATTGAAGGGTACTTGGTGTAAT [Clay C.M. et al., 1993]
PQ  -90 to -111
AG  3.2.1, 3.2.2, 3.3, 3.4, 3.5, 7.1 [Clay C.M. et al., 1993]
AG  1.1.5, 4.2, 6.1.1+, 7.1 [Akerblom et al, 1988]
xx
NM  GRE(2); glucocorticoid responsive element [Akerblom et al, 1988]
TF  GR; glucocorticoid receptor
AT  decrease
SQ  ATTTCCTGTTGATCC
PQ  -79 to -64
AG  1.1.5, 4.2, 6.1.1+, 7.1 [Akerblom et al, 1988]
xx
NM  CCAAT box; (C) [Kennedy et al, 1990]
TF  a-CBF; alpha subunit CCAAT-binding factor [Kennedy et al, 1990]
AT  increase
SQ  CCAAT
PQ  -83 to -87
AG  3.3, 3.4, 3.5, 4.2, 6.2+, [Kennedy et al, 1990]
EF  a-CBF binds to the human alpha-gh CCAAT box with at least a 100-fold
EF  greater affinity than to other CCAAT boxes [Kennedy et al, 1990]
cc  Human a-CBF has an apparent molecular mass of 53 kD and distinct from
cc  CTF/NF1, C/EBP, CP1, and NF-Y, aCBF is present in a variety of cell
cc  types. It is likely that the binding of a-CBF to the CCAAT element
cc  serves as a bridge between placenta-specific enhancer and
cc  TATA-binding proteins [Kennedy et al, 1990]
xx
NM  TATA box [Bokar J.A. et al., 1989]
SQ  TATAAAA
PQ  -29 to -23
AG  7.1
xx
NM  TRE; thyroid responsive element [Chatterjee et al. 1989]
TF  erbA-beta TR; erbA-beta form of thyroid hormone receptor
AT  decrease
SQ  GCAGGTGAGGACTTCA
PQ  -22 to -7
AG  3.2.1, 3.5, 6.1.1+, 6.1.2+, 7.2 [Chatterjee et al. 1989]
HM  TRE sites of rGH [Chatterjee et al. 1989]
xx
AL  Bovine a-GH, sequence identity of 85%, -313 to -1
AL  [Bokar et al., 1989].
SC  Each part of 18 bp direct repeat (-146 to -129 and -128 to -111)
SC  contains CRE consensus. [Bokar et al., 1989]
cc  Two different cis-acting sequences, the URE and CRE give the strong
cc  cell-specificity to the enhancer. The CRE mediates placenta-specific
cc  expression only when linked to the URE. It is possible that
cc  URE-binding protein and CREB directly interact with one another.
cc  [Bokar J.A. et al., 1989]
xx


cc  The mutation at the 5'-boundary of CRE, T --> A, diminished both basal
cc  and cAMP stimulated activity by 95 %, activity in the CRE 3' A to T
cc  mutant was below the level of detection.
cc  [Deutsch P.J. et al., 1988]
xx
cc  Glucocorticoids may repress Human alpha-GH gene by binding CRE and
cc  CCAAT box through receptor complex [Akerblom et al, 1988]
cc  The binding of AR may block binding of JRE transcription factor
cc  [Clay C.M. et al., 1993]
xx
BI  GENBANK; HUMGPHAA; ST1:314
BM  HUMAN GLYCOPROTEIN HORMONE ALPHA-SUBUNIT GENE, 5' FLANK
BS  TCTTGTTTAT AGGAAAGTGT CAGCTTTCAG GATGTTATGT GTATGGCTCA ATAAUTTAC
BS  GTACAAAGTG ACAGCGTACT CTCTTTTCAT GGGCTGACCT TGTCGTCACC ATCACCTGAA
BS  MTGGCTCCA AACAAAAATG ACCTAAGGGT TGAAACAAGA TAAGATCAAA TTGACGTCAT
BS  GGTAAAAATT GACGTCATGG TAATTACACC AAGTACCCTT CAATCATTGG ATOGAATTTC
BS  CTGTTGATCC CAGGGCTTAG ATGCAGGTGG AAACACTCTG CTGGTATAAA AGCAGGTGAG
BS  GACTTCATTA ACT
xx
RA  Deutsch P.J., Hoeffler J.P., Jameson J.L., Lin J.C., Habener J.F.
RN  Structural determinants for transcriptional activation by cAMP
RN  responsive DNA elements.
RF  J. Biol. Chem., 1988, v. 263, n. 34, 18466-18472
xx
RA  Akerblom I.E., Slater E.P., Beato M., Baxter J.D., and Mellon P.L.
RN  Negative regulation by glucocorticoids through interference with a
RN  cAMP responsive enhancer.
RF  Science, 1988, 241, 350-353.
xx
RA  Bokar J.A., Keri R.A., Farmerie T.A., Fenstermaker R.A., Andersen B.,
RA  Hamernik D.L., Yun J., Wagner T., and Nilson J.
RN  Expression of the glycoprotein hormone alpha-subunit gene in the
RN  placenta requires a functional cyclic AMP response element,
RN  whereas a different cis-acting element mediates
RN  pituitary-specific expression.
RF  Mol.Cell.Biol., 1989, v.9, No.11, 5113-5122.
xx
RA  Chatterjee V.K.K., Lee J.-K., Rentoumis A., and Jameson J.L.
RN  Negative regulation of the thyroid-stimulating hormone alpha gene by
RN  thyroid hormone: receptor interaction adjacent to the TATA box.
RF  PNAS, 1989, 86, 9114-9118.
xx
RA  Kennedy G.C., Andersen B., and Nilson J.H.
RN  The human alpha subunit glycoprotein hormone gene utilizes a unique
RN  CCAAT binding factor.
RF  J.Biol.Chem., 1990, v.265, n11, p.6279-6285.
xx
RA  Andersen B., Kennedy G.C., and Nilson J.H.
RN  A cis-acting element located between the cAMP response elements and
RN  CCAAT box augments cell-specific expression of the glycoprotein
RN  hormone alpha subunit gene.
RF  J.Biol.Chem., 1990, v.265, n35, p.21874-21880.
xx
RA  Clay C.M., Keri R.A., Finicle A.B., Heckert L.L., Hamernik D.L.,
RA  Marschke K.M., Wilson E.M., French F.S., and Nilson J.H.
RN  Transcriptional repression of the glycoprotein hormone alpha subunit
RN  gene by androgen may involve direct binding of androgen receptor to
RN  the proximal promoter.
RF  J. Biol. Chem., 1993, v. 268, n. 18, 13556-13564.

Figure 1. An example entry from the TRRD: description of the regulatory region of the glycoprotein hormone alpha-subunit gene.
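Because each line of an entry begins with a two-letter field code and blocks are separated by `xx` lines, an entry of this kind can be read with a few lines of code. The following is a minimal sketch, not the TRRD software: the function name and the exact whitespace layout are assumptions, and the sample text is abbreviated from Figure 1 with citations omitted.

```python
def parse_entry(text):
    """Split a flat-file entry into blocks; each block maps a field code to its list of lines."""
    blocks, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(" ")
        if code.lower() == "xx":
            # A separator line closes the current block.
            if current:
                blocks.append(current)
            current = {}
        else:
            # Repeated codes (e.g. two AG lines) accumulate in order.
            current.setdefault(code, []).append(value.strip())
    if current:
        blocks.append(current)
    return blocks

# Abbreviated sample in the assumed layout: code, spaces, value.
SAMPLE = """\
NM  URE; upstream regulatory element
AT  increase
SQ  AAGGGTTGAAACAAGA
PQ  -169 to -154
xx
NM  CRE(1); cAMP response element
TF  CREB; cAMP response element binding protein
AT  increase
SQ  TGACGTCA
PQ  -142 to -135
AG  3.4, 6.1, 6.3
AG  3.2.1, 3.3, 6.2+, 6.4+, 6.5+, 7.3
xx
"""

blocks = parse_entry(SAMPLE)
for block in blocks:
    print(block["NM"][0], "|", block["PQ"][0])
```

Storing each field as a list keeps continuation lines and repeated fields in their original order, so the entry can be written back out without loss.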
