

THE JOURNAL OF BIOLOGICAL CHEMISTRY © 1991 by The American Society for Biochemistry and Molecular Biology, Inc.

Vol. 266, No. 25, Issue of September 5, pp. 16862-16869, 1991 Printed in U.S.A.

Mouse Type II Collagen Gene COMPLETE NUCLEOTIDE SEQUENCE, EXON STRUCTURE, AND ALTERNATIVE SPLICING*

(Received for publication, March 5, 1991)

Marjo Metsäranta, David Toman, Benoit de Crombrugghe, and Eero Vuorio‡

From the Department of Molecular Genetics, The University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030 and the Department of Medical Biochemistry, University of Turku, 20520 Turku, Finland

Several overlapping clones covering the entire mouse type II collagen gene including 10 kilobases (kb) of 5'- and 15 kb of 3'-flanking sequences were isolated from a cosmid library. The overall gene structure was determined by restriction mapping and sequencing. The gene spans 28.9 kb from the start of transcription to the polyadenylation site and contains 54 exons. It codes for a major mRNA species of 4910 bases which translates into a polypeptide of 1419 amino acids. A less abundant RNA species of 5110 bases contains additional sequences corresponding to an alternatively spliced exon 2. Except for the amino-terminal propeptide (N-propeptide) domain the exon-intron organization of the mouse proα1(II) collagen gene is remarkably similar to genes for other fibrillar collagen types. The overall identity of the coding sequences of the mouse and human type II collagen genes is 89% at the nucleotide level, but only 37 amino acid changes occur within the mature α1(II) collagen chains between mouse and man. Intron sizes are also conserved between the mouse and human genes but not with the chick α1(II) gene. The promoter of the mouse type II collagen gene is similar to those of the rat and human genes, containing a TATA box and several G + C-rich elements but no CCAAT box. The 3'-untranslated sequence contains two regions of high homology between chick, mouse, bovine, and human genes preceding the major polyadenylation site. Additional size variation in the mRNA arises from the use of a minor polyadenylation signal. Information on conserved noncoding sequences will help in studies on the regulation of the proα1(II) collagen gene. Detailed knowledge of the gene is also necessary for site-directed mutagenesis and work with transgenic mice.

The collagens form a family of at least 25 genes in humans and higher eukaryotes coding for a minimum of 13 different collagen types with important structural functions in the connective tissues. Similar collagens are also found in Drosophila, sea urchins, and nematodes (1, 2). In vertebrates the collagens are particularly abundant in skeletal tissues, such as bone, cartilage, ligaments, and tendons. Type II collagen, a homotrimer of α1(II) chains, is the predominant protein in cartilage matrix where it forms fibrils conferring tensile strength and providing a scaffolding network for proteoglycans (3, 4). In an adult organism type II collagen is mainly found in the hyaline cartilage of articular surfaces and the nucleus pulposus of the intervertebral discs, tissues known to suffer from degenerative disease processes (3, 5). Earlier in life the growth of long bones, which largely determines the skeletal size, occurs at growth plates by production of type II collagen and other cartilage matrix components by proliferating chondrocytes (3). In addition to these structural roles the cartilaginous matrix serves an informative function during embryonic development; most bones are formed through endochondral ossification, i.e. via a cartilage intermediate. The expression of type II collagen is not, however, completely restricted to cartilage; it also occurs in the eye (primary corneal epithelium, lens, retina, and scleral cartilage) (6), transiently during embryogenesis at various mesenchymal-epithelial interfaces (7, 8), and in the prechondrogenic mesenchyme of limb buds prior to chondrogenesis (9). The functions of type II collagen in tissues other than hyaline cartilage and eye remain, however, obscure. The phenotypes of diseases where linkage to or mutations in type II collagen genes (COL2A1) have been identified are clearly related to disturbed long bone growth (spondylo-epiphyseal dysplasia, achondrogenesis) (10, 11), degeneration of articular surfaces (osteoarthrosis, Stickler syndrome) (12-14), and retinal detachment (Stickler syndrome) (15). Transgenic mice harboring type II collagen genes with specific mutations would clearly help in defining the role of type II collagen in cartilage and other tissues. For this purpose we decided to isolate and characterize the gene for mouse type II collagen.

* This study was financially supported by National Institutes of Health Grant AR 40335, the Medical Research Council of the Finnish Academy and by the Turku University Foundation (to M. M.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) M65161.

‡ To whom correspondence should be addressed: Dept. of Medical Biochemistry, University of Turku, Kiinamyllynkatu 10, SF-20520 Turku, Finland. Tel.: 358-21-633-7349; Fax: 358-21-33-1126.

Although no information was available for the mouse proα1(II) collagen gene, the chick (16), rat (17, 18), bovine (19), and human genes (20-27) have been characterized to various degrees. The exon-intron structure has been determined for approximately 90% of the chick gene and recently for almost the entire human gene. The complete cDNA sequence for the human type II collagen mRNA has also been determined (28, 29). In addition approximately 65% of the bovine α1(II) chain amino acid sequence has been determined (30). Comparison of the overall structure of the mouse gene with the genes for the other fibrillar collagens (types I, III, V, and XI) revealed marked similarities throughout the triple-helical domain and the C-propeptide¹ domain, whereas the N-propeptide domains were more divergent. This paper describes a detailed structural analysis of the entire mouse type II collagen gene and its comparison with other type II collagen genes.

¹ The abbreviations used are: C-propeptide, carboxyl-terminal propeptide; C-telopeptide, carboxyl-terminal telopeptide; N-propeptide, amino-terminal propeptide; N-telopeptide, amino-terminal telopeptide; PCR, polymerase chain reaction; SDS, sodium dodecyl sulfate; bp, base pair(s); kb, kilobase pair(s).

EXPERIMENTAL PROCEDURES

Cosmid Library-A mouse genomic library in cosmid pWE15 (5 X loh clones) was screened with a 32P-labeled 1340-bp DraI-HindIII fragment of human α1(II) collagen cDNA pHCAR3 (28) under both low and high stringency. The hybridization was performed in 5 X SSC (1 X SSC is 0.15 M NaCl, 0.015 M trisodium citrate, pH 7.0), 10 X Denhardt's solution, 0.5% sodium dodecyl sulfate (SDS), and 200 µg/ml denatured herring sperm DNA at 65 °C overnight. After the low stringency washes (2 X SSC, 0.5% SDS at room temperature, and 0.2 X SSC, 0.5% SDS at 63 °C for 1 h with three changes), 57 positive clones were identified. Thereafter a high stringency wash was performed for the same filters in 0.2 X SSC, 0.5% SDS at 66 °C for 3 h. After this wash 10 colonies still exhibited a strong positive signal and were selected for further characterization. Purified genomic clones were characterized and aligned by restriction mapping and Southern hybridization.
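The buffer strengths in the screening protocol all follow from the 1 X SSC definition given above. A minimal Python sketch of that arithmetic follows; the function and constant names are illustrative and not part of the original protocol.

```python
# Salt concentrations for an n x SSC buffer, from the definition quoted
# in the text: 1 x SSC = 0.15 M NaCl, 0.015 M trisodium citrate.
NACL_1X_M = 0.15
CITRATE_1X_M = 0.015


def ssc_molarity(fold):
    """Return (NaCl, trisodium citrate) molarities for `fold` x SSC."""
    return fold * NACL_1X_M, fold * CITRATE_1X_M


if __name__ == "__main__":
    # The three SSC strengths named in the screening protocol.
    steps = [("hybridization", 5.0), ("first wash", 2.0), ("stringent washes", 0.2)]
    for label, fold in steps:
        nacl, citrate = ssc_molarity(fold)
        print(f"{label}: {fold} x SSC = {nacl:.3f} M NaCl, {citrate:.4f} M citrate")
```

Lower salt (0.2 X SSC) combined with a higher wash temperature destabilizes imperfect hybrids, which is why only 10 of the 57 initial positives kept a strong signal.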

DNA Sequencing-For sequencing approximately 35 kb of the gene was first subcloned as EcoRI and XbaI fragments in Bluescript™ SK- vector (Stratagene) and then further as 112 smaller subclones. In addition to oligonucleotide primers corresponding to the T3 and T7 recognition sites of the vector, 25 synthetic oligonucleotides were used as sequencing primers. Sequencing was performed on the double-stranded DNA using the Sequenase™ reagent kit and [32P]dATP. The sequences were stored and analyzed using the GCG software (31).

RNA Extraction, cDNA Synthesis, and Amplification by PCR-Total RNAs were isolated from rib and epiphyseal cartilages of newborn mice and from mouse primary chondrocyte cultures using the guanidinium isothiocyanate method (32). Total RNA (10 µg) was used as the template for cDNA synthesis by Moloney murine leukemia virus reverse transcriptase under conditions suggested by the supplier (New England Biolabs). Oligo(dT), random hexamers, and specific oligonucleotides were all used as primers. Aliquots of cDNA were used for amplification by the polymerase chain reaction (PCR) (GeneAmp™, Perkin-Elmer Cetus Instruments) using specific oligonucleotide primers (locations shown in Fig. 4). The reactions were cycled by denaturing at 94 °C for 1 min, annealing at 57 °C for 2 min, and extension at 72 °C for 3 min. After 30 amplification cycles, aliquots of the reactions were fractionated by electrophoresis on 1.5% agarose gels, and the specific fragments were purified and cloned by blunt-end ligation into the EcoRV site of the Bluescript vector.
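The cycling profile above fixes the programmed run time and an upper bound on amplification. The Python sketch below works these out under two stated assumptions: ramp times between temperatures are ignored, and every cycle doubles the template perfectly, which real reactions do not achieve. All names are illustrative.

```python
# Hold times (minutes) per cycle, as given in the text.
CYCLE_STEPS_MIN = {"denature_94C": 1, "anneal_57C": 2, "extend_72C": 3}
N_CYCLES = 30


def programmed_minutes(steps=CYCLE_STEPS_MIN, cycles=N_CYCLES):
    """Total programmed hold time; temperature ramps are not included."""
    return cycles * sum(steps.values())


def ideal_fold_amplification(cycles=N_CYCLES):
    """Theoretical ceiling assuming perfect doubling in every cycle."""
    return 2 ** cycles


if __name__ == "__main__":
    print("programmed time (min):", programmed_minutes())
    print("ideal fold amplification:", ideal_fold_amplification())
```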

Primer Extension-The start site of transcription of the mouse type II collagen gene was analyzed by primer extension. Total RNA (5 µg) from rib cartilages of 3-week-old mice and from cultured mouse primary chondrocytes served as templates for reverse transcriptase and a specific oligonucleotide (CCGGGTCTCTACCGCTCCCTCATGCAGGAG) labeled at the 5' end as the primer for cDNA synthesis as described earlier (33).
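A primer for extension on mRNA must be the reverse complement of the sense strand. The Python sketch below checks this for the 30-mer quoted above against a short stretch of exon 1. The exon 1 excerpt is transcribed from Fig. 2 with the spacing removed and should be treated as an assumption, since the figure text is noisy. Variable names are illustrative.

```python
# Translation table for Watson-Crick complements (DNA, uppercase).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Primer as printed in the text (line-break hyphen removed).
PRIMER = "CCGGGTCTCTACCGCTCCCTCATGCAGGAG"

# Excerpt of the exon 1 sense strand as transcribed from Fig. 2 (assumed).
EXON1_EXCERPT = "AGGGCCTCCTGCATGAGGGAGCGGTAGAGACCCGGACCCG"

if __name__ == "__main__":
    target = reverse_complement(PRIMER)
    offset = EXON1_EXCERPT.find(target)
    print("primer length:", len(PRIMER))
    print("annealing site on sense strand:", target)
    print("offset within excerpt (0-based, -1 if absent):", offset)
```

The annealing site falls entirely within the excerpt, so the primer is antisense to exon 1 and can prime cDNA synthesis toward the 5' end of the transcript.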

S1 Nuclease Mapping-The polyadenylation sites of the proα1(II) collagen transcripts were determined by S1 nuclease mapping as described (33). Total RNA (30 µg) from newborn mouse rib cartilages was hybridized at 53 °C with a 1140-bp NcoI-BamHI fragment corresponding to the 3'-untranslated sequence of the gene. Prior to digestion the fragment was labeled at its 3' end by filling in at the NcoI site with [32P]dATP and [32P]dCTP using the Klenow enzyme (33). After digestion with different concentrations of S1 nuclease, the reactions were fractionated on a denaturing 6% polyacrylamide gel and the bands detected in dried gels by autoradiography.

RESULTS AND DISCUSSION

Screening of the mouse genomic cosmid library with the cDNA probe for human α1(II) collagen mRNA was performed both under low and high stringency. 57 positive clones were identified in the initial screening at low stringency. After the high stringency wash, 10 colonies still exhibited a strong positive signal and were selected for further characterization. Restriction mapping of these 10 clones showed considerable overlap, indicating that they coded for the same gene (Fig. 1), which was consequently identified as the proα1(II) collagen gene. The gene boundaries within the DNA were determined by Southern hybridization with probes specific for the 3'-untranslated sequence (pHCAR3) and for the 5' end (exon 1) of the human gene. Some of the other 47 positive clones were subsequently identified as clones for mouse proα1(I) and proα1(III) collagen genes.

[Figure 1: restriction map drawn 5' to 3' over a scale of 0 to 35 kilobase pairs. Site labels that survive in this transcript: EcoRI, BamHI, XhoI, SnaBI, XbaI.]

FIG. 1. Restriction map of the mouse type II collagen gene. Top, location of the coding region is shown by the thick line. Middle, restriction sites for eight enzymes with less than 10 recognition sites within the sequence. Bottom, the scale in kilobases.

The gene for mouse type II collagen spans 28.9 kb from transcription start site to the polyadenylation site and contains 54 exons (Figs. 2 and 3). The complete nucleotide sequence of 30.7 kb of the gene was determined. Exons were identified by flanking consensus splice signals and by comparison with the corresponding human gene and cDNA sequences (20-29). The intron-exon boundaries and all exon sequences were confirmed on both strands by sequencing of cDNA clones covering the entire coding sequence of the mRNA using fragments amplified by PCR (Fig. 4).

Exon Structure-The overall organization of the gene (Figs. 2 and 3) shows remarkable conservation of exon structure typical for all genes coding for fibrillar collagens (1, 2). This is particularly evident in the triple-helical and C-propeptide domains. Most divergence in the exon organization between genes for different collagen types is seen in the N-propeptide domain. In the mouse (Fig. 2) and human (25, 26) type II collagen genes, the coding sequence is dispersed into eight exons, whereas only five or six exons code for this domain in the other fibrillar collagen genes (2). Shown in Fig. 2 is the comparison of the exon sequences, as well as the promoter and 3'-untranslated sequences, with the corresponding human sequences (26, 28, 29). The introns in the 5' half of the gene are considerably larger than in the 3' half, making the gene more compact toward the 3' end; the 5' half of the coding region spans 19.9 kb and the 3' half only 8.7 kb of the gene (Fig. 3).

The sequences at the intron-exon boundaries conform well with general splice consensus sequences with the exception of the 3' boundary of the alternatively spliced exon 2. The consensus sequence for the 3' ends of the introns was as follows.

T57 T41 "44 T43 T43 T57 Css N A100 G100/splice

Czs C35 C35 C35 C35 C35 Tzs

The subscript numbers denote the frequency of the most common nucleotides in percent. The consensus sequence for the 5' end of introns was as follows.

A52 A40

Splice/G100 T100 A06 Goa TSo T4s G40
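Consensus lines of this kind report, for each aligned position, the most common nucleotide and its percent frequency. The Python sketch below shows one way to tabulate that from a set of aligned splice sites. The input sequences are hypothetical toy data, not the 53 mouse introns, and the function name is illustrative.

```python
from collections import Counter


def position_frequencies(sites):
    """For each aligned column, return (most common base, percent frequency)."""
    if not sites:
        raise ValueError("need at least one sequence")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sequences must be aligned to the same length")
    summary = []
    for column in zip(*sites):
        base, count = Counter(column).most_common(1)[0]
        summary.append((base, round(100 * count / len(sites))))
    return summary


# Hypothetical donor sites (first six intron bases), for illustration only.
TOY_DONOR_SITES = ["GTAAGT", "GTGAGT", "GTAAGC", "GTAAGT"]

if __name__ == "__main__":
    for position, (base, percent) in enumerate(position_frequencies(TOY_DONOR_SITES), start=1):
        print(f"+{position}: {base}{percent}")
```

Invariant positions, such as the GT that opens every intron here, come out at 100, matching the G100 T100 and A100 G100 entries in the printed consensus.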

Only three of the 54 exons begin with a split codon; these are the alternatively spliced exon 2, and exons 3 and 50.

The overall sequence identity between the mouse and hu-

Mouse Type 11 Collagen Gene exon intron “:o 6 sire

tcaqqc-cac tggqcacatt qqggqcqqqa aqctqqqctc ac-qaaaqqq qcqactqgcc ttqqcaqgtg tqgqctctqq tccgqcctqq gcggqctccq ggqgcggg-g ...... q . . . .a..gq--.q . a . . . . . . . . . . . q t . a . . . c.a..q .... .q.t.c...t .g ........ c....a.... cag.. . . c a . .................. C .

tctcaggtta caqccccgcg gqqgqctaqq qqgcqgcccq cqgtttqgqc cqgtttqcca gcctttqqaq cgaccqgqag caL3Iaactg qaqcctctqa aqggqgaaga gt . . . . . . . . . . . . . . a,.. . .................... ga...c . . . . . . . . cqaa.. 9 . 9 . , . . . C . ........-. .C...g.g.c g...a....A

1 231 bp CGCAGAGCGC CGCTGGGCTG CCGGGTCTCC TGCCTCCTCC TGCTCCT--- AGGGCCTCCT GCATGAGGGA GCGGTAGAGA CCCGGACCCG CTCCGTGCTC -TGCCGCCTC .......... T . . . . . . . . . . . . . . . . . . . C..T.... .. .C..G..CCA . . . . . . . . . . . . . . . . . . . C .................... .G ........ C ..... TT.. GCTGCGCTTC GCCCGGGCCA GGCTCTGCCA GGCCTCGCGG TGAGCCAEA TCCGCCTCGG GGCTCCCCAG TCGCTGGTGC TGCTGACGCT GCTCATCGCC GCGGTCCTAC ........ C. ......... C ..... A,. . . . . . . C... . . . . . . . . . . . . .T .......................................... G..... .. T . . . . . T. GGTGTCAGGG CCAGGATGCC Cgtaaqtcgc cc~ccqcccc tqCCtaCttC . . . . . . . . . . . . . . . . . . T. .

size

3799 bp

2 204 bp qccgcccctc CCaCCCcact tqqtgcaqAG GAGGCTGGCA GCTGTCTGCA GAATGGGCAG AGGTATAAAG ATAAGGATGT ATGGAAGCCC TCATCTTGCC GCATCTGTGT . . . . . . . . . . . . . . . . . G.... .G . . . . . . . . . . . . . . . . T. .......... G ........ G GAGC.C.... .G ........

GTGTGACACT GGGAATGTCC TCTGCGATGA CATTATCTGT GAAGAC---C CAGACTGCCT CAACCCCGAG ATCCCCTTCG GAGAGTGCTG TCCCATCTGC CCAGCTGACC C . . . . . . . . . . . . . C..... . . . . . . . C.. ... A,..... . . . . . . GTGA A,.. . . . . . . . . G...T ....................... C . . . . . . . . . . . . A,..... TCGCCACTGC CAGTGqtcqt aatttattta tttattattc aacataaata 1110 bp . . . . . . . . . . . . . . .

3 17 bp acttctcctt Ctctgtcttc ccttqcaqGA AAATTAGGGC CAAAGqtaaq gcaccccatt tttaatttad tttaattaat 387 bp . . . . . .

4A 33 bp CttCaCCtgt cgatqttttq tqatgcagGG GCAGAAAGGA GAACCTGGAG ATATCAGAGA Tgtaagtaca aattatcccc aCCgtgaCCC .G C..CC...A

120 bp .. A,. ................. .C....AG.. .

4 8 3 3 bp aatqqqctca tqgctctctc tgacacaqAT CATAGGACCC AGAGGACCTC CTGGCCCTCA Gqtaaqaaaq gqaqaqatct Ctttttccat 100 bp

5A 54 bp gtcaacacac C ~ ~ ~ C C C C ~ C accttcagGG ACCTGCAGGT GAACAAGGAC CCAGAGGTGA TCGTGGTGAC AAGGGAGAAA AGgtgagcag acagcaacaa atactgatgt 141 bp . . TG . . . . . . . . .A . . . . . . . . . . . . G..... . . . . . . . . . . . . G . . . . . . . . . . . . . . . . . G.. . . . . . . . . . . . . A..T.... .A

5B 105 bp qataactttt ttttctcttc ctqtcaaqGG TGCGCCTGGA CCCCGTGGCA GRGATGGAGA ACCTGGTACC CCTGGAAATC CTGGCCCCGC TGGCCCTCCA GGTCCCCCTG

6

1

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

21

. . . . . C..... . . . T ....................... G... .......... GTCCCCCTGG CCTTAGTGCA GGAgtaaqtg cccttdqttc tcctttctcc .......... T...G...G . ---

78 bp CccctaaCaC qgatctqttq ctttqcagAA CTTCGCGGCT CAGATGGCTG GAGGGTATGA CGAGAAGGCT GGTGGTGCCC . . . . . T..T..C . . . . . . . . . . . . . . A.T . . . T..A.. . . . . . . . . . C. ...

qattccaqtt ctqtgctqac cqgacggqag 4 5 bp gcqtctcqtt ttttctgttc caatgcagGG CCCCATGGGA CCCCGTGGAC CCCCAGGCCC TGCCGGTGCC CCCgtaaqtq

. . . . . . . . . . . . . . T..A.... .T . . . . . . . . . . . A,. ... T ..T 54 bp aacatqgcgt CCttatttCC ccttttaqGG CCCTCAAGGA TTTCAAGGCA ATCCTGGCGA ACCTGGCGAG CCTGGTGTCT

.. G..... ..................... T.. . . . . . . T..A .......... 54 bp aaccaggtct tgaattctct atttqtagGG TCCCATGGGT CCCCGAGGTC CTCCTGGCCC TGCTGGAAAA CCTGGTGACG

. . . . . . . . . . . . . . . . . T.... . . . . . . . T.. CC . . . . . . . G . . . . . . . . T. 54 bp aaaatqaqat gcctctcttt ctgttcaqGG TGAAGCTGGG AAGCCCGGAA AGTCTGGGGA AAGAGGCCTC CCTGGCCCTC

. . . . . . . . . . . A . . A..T.... .AG....T.. . . . G..T.CG . . . . . T.... 54 bp ctctqaqtac CtCCtCttqq tattqcagGG TGCTCGTGGA TTCCCAGGAA CCCCGGGTCT CCCCGGTGTC AAGGGTCACA

. . . . . . . . . . . T . . . . . . . . . . . . . . A..C.. T..T...... .. A,... . . . 54 bp qtqacqqaac CCtCgtCCtq tttCCCagGG TTACCCAGGC CTCGACGGTG CTAAGGGGGA AGCTGGTGCT CCGGGTGTGA

. . . . . T...... . . G . . . . . . . . . . . . . . A.. G..G...... .. T . . . . . . .

. . . . . . . . C. . . . T.....C .. C. . . . . . . aggctqaccc

AGATGGGAGT CATGCAAGGG CCCATGqtag .. T.... . . . A ........ A ..A,.. ttcagtcttt tcctcttggq gqcqctqtag

CTqtqaqtac cacagqctac cctctcccaq

ACqtgaqtag acccaaqaag ccccagaccc

AGqtaaagct tcatcttcct cgctctcaca .T

GAqtaaqtat catqqqataq gatqtttgqg

AGqtaaaqqq qccacaaqac acacagqgqq

909 bp

1 0 5 bp

619 bp

3 3 6 bp

687 bp

290 bp

94 bp

291 bp

4 5 bp Cactqdccct atatqtctct ttCtCCaqGG TGAGAGTGGT TCCCCTGGTG AGAACGGATC CCCGGGCCCA ATGqtgaqta tqaqagtcac ccctgqgqaa qccaccccac 413 bp . . . . . . . . . . . . . . . . . G.... . . . . . . . . . . T.. . . . . . . . . . .

5 4 bp acttaaaccg cqctggtqtq tqttqcaqGG TCCCCGTGGC CTGCCTGGCG AGAGAGGACG GACTGGCCCT GCTGGTGCTG CTgtgaqtaa CCCCCadaqC CCgqtgCCaC 2613 bp . . . . . T... . . . . . . . . . . . T. .A. ...................... C... . .G

45 bp qqctcctttq tqctattqct qttcacaqGG TGCTCGGGGT AACGATGGCC AGCCAGGCCC CGCTGGACCT CCGqtaagtt qCtgtCCttC ttaaqcaqqt qacagtagct 379 bp . . . . . C..A..C . . . . . . . . T. . . . . . . . . . . . . . A..T... ...

54 bp CaCCCtctqt acccttdctt CtCCCCaqGG TCCTGTGGGT CCCGCAGGTG GTCCTGGCTT CCCTGGTGCT CCTGGTGCCA AGgtgaqtga tctgCtgqtc aagctgagaa 1495 bp . . . . . . . . C.. . . . T..T ............................. A,... . .

99 bp cacttcaqcc cccccactct CttCCCaqGG CGAAGCTGGT CCCACTGGTG CTCGCGGTCC TGAAGGTGCT CAAGGTTCTC GTGGCGAGCC .. T ..... C.. C .......... .C..T..... . . . . . . . . . . . . . . . . C... .C..T..A..

TGGCAATCCT GGGTCCCCTG GGCCTGCAGG TGCTTCTgta aqttcdtctc tttqqcctqq aagqcatgqc atgtqtqqcc 291 bp ... T.C.... . . . . . . . . . . . . . . . . . T.. . . . C..C

4 5 bp ttttatcgct ataccatctt cttgtcaqGG TAACCCAGGG ACTGATGGTA TTCCTGGAGC CAAAGGATCC GCTgtaaqta ttgtacctqq gCtqtqttCC caaggtggct 87 bp . . . . . . . . T..A ..A . . . . . A. . . . . . . . . . . . . . . . . . . . T ...

99 bp cgtqcctctc ttctqttcat qctqacaqGG TGCTCCTGGA ATTGCTGGTG CCCCTGGCTT CCCTGGGCCC CGTGGCCCTC CCGGTCCTCA . . . . . . . . . . . C .......... .T . . . . . . . . . . . . . . . . . A . . G..T.... .T..C.....

AGGTGCAACT GGTCCCCTTG GCCCCAAAGG TCAGGCGqta aqagcccaaa aagattqgca ggtCCtaCCt acaqgctcct 1 4 5 bp . . . . . . . . . . . . . . . T..G. .... G.. . . . . . . . A,.

54 bp tqcqagcata tttctttctt tcacgtagGG TGAACCTGGC ATTGCTGGCT TTAAAGGTGA TCAAGGCCCC AAGGGAGAGA CTgtqagtat CtCCCtCaag acttgttttc 371 bp . . . . . . . . . . . T . . . . . . . . . . .C ........ A,. . . . . . . . . . . . . . . . AC ..

I08 bp atcccacccc tcttqtgctt gtcttcaqGG ACCTGCTGGG CCCCAAGGAG CCCCTGGCCC CGCTGGTGAA GAAGGCAAAC GAGGTGCTCG .. C. ....... C ..... G. . . . . . . . . . . A.. . . . . . . . . . . . . . . . . . . GA ....... C..

AGGAGAGCCG GGTGGTGCTG GACCAATCGG ACCCCCTGGA GAGAGAqtaa gtaqgcaagq aqctccagct ~ a c a g q c c ~ g 360 bp T ........ T . . . . . C.T.. .G..C..... T.. . . . . . . . . . A...

54 bp CatCccCtqC CtCgtCcCtt CtCCCtagGG TGCTCCTGGC AACCGTGGAT TCCCAGGTCA AGATGGTCTG GCAGGTCCCA AGqtgagtqq aggaqqagaq gcctqgtccc 84 bp . . . . . . . . C..A ..... C..T. ................................

99 bp catqqacttc CtqtCtCCtc tgqtatdqGG TGCCCCTGGA GAGCGAGGGC CCAGTGGCTT GGCTGGTCCC AAGGGAGCCA ACGGTGACCC .. A. ......................... TC. T . . . . . C... ....................

GGGTCGTCCT GGAGAACCTG GTCTTCCTGG AGCCAGGqta aggtggatac tacacagacc cccacaccct tCCCaCCtqC 130 bP T..C..... . . . . . . . . . . . .C . . . . . . . . . . . . C..

5 4 bp cttqgtctct tctctctccc taaccaaqGG TCTTACCGGT CGCCCTGGTG ACGCTGGTCC TCAAGGCAAA GTTGGTCCTT CTgtaaqtct attagcctga gtgag9ttCC 465 bp . . . . . C..T..C .......... .T ..................

99 bp tqtcctqaga atqaatccfq tctttcagGG AGCCCCTGGT GAAGACGGTC GCCCTGGACC TCCTGGTCCT . . . . . . . . . . . . . . . . . T.... .T . . . . . . . . . . . A,.....

TGGCGTCATG GGTTTCCCTG GCCCCAAAGG TGCCAACgta ... ...........................

54 bp Ctcdtggtct tctcgtccct CCCttCagGG CGAGCCTGGC AAAGCTGGTG AGAAGGGTCT GGCTGGCGCT .. T..... ..................... A,. .C....T...

54 bp ttacaactcc tCCtCCtCtg ccctgtagGG TCTTCCTGGA AAAGACGGTG AGACGGGAGC CGCAGGACCC . . . . . . . . . . . C . . . . . T... . . . . . A..T.. T . . . . . . . . .

T......

. . . . . . . . . . . . CAGGGAGCTC GTGGGCAGCC ..... .......... aqtgacagtt tqctctctca tctccttcqt ctttccctat 379 bp

CCTGGTCTGA GAqtaaqtqt CCtCCCCaCt acctatgqct 392 bp .......... .G CCCGGCCCCA GTgtqagtac ctqctgctaa aacqagggac 246 bp .. T. .... TG C.

G....

FIG. 2. The nucleotide sequence of the mouse type II collagen gene. The sequence shown covers 200 bp of the promoter, all exons (capital letters) with 30 bp of flanking intron sequences, and the 3'-untranslated region. Top line, the nucleotide sequence of the mouse gene. Bottom line, the nucleotide sequence for the promoter, exons, and 3'-untranslated domain of the corresponding human gene (from Refs. 26-29). Only the bases which differ from the mouse sequence are shown. Gaps (marked by a dash) have been added for maximum alignment. The exon numbers (from Ref. 2) and sizes are shown on the left and the intron sizes on the right. Underlining highlights the TATAA box (in the promoter), the translation start site (in exon 1), the translation stop codon, and the two polyadenylation signals (in exon 52). All exon sequences have been confirmed by sequencing of the corresponding cDNA on both strands.

[FIG. 2, continued. The sequence text of this page did not survive extraction. Surviving exon labels: 32 to 52. Surviving size column (bp), in printed order: 228, 233, 186, 251, 338, 255, 361, 281, 283, 471, 435, 621, 169, 194, 217, 135, 190, 112, 234, 432, 285, 352, 297, 456.]

man exons is 89.1%. The differences occur most frequently in third positions and the overall amino acid similarity between mouse and human proα1(II) collagen is 95.2% (Fig. 5). The conservation of the nucleotide sequence is the highest in the C-telopeptide/C-propeptide domain (90.2%), but the percentages for the triple-helical domain (88.8%) and the N-propeptide domain including the 5'-untranslated sequence (88.7%) are also high. Interestingly the 3'-untranslated sequence from

[Figure 3: schematic of the gene with exons labeled 1, 2, 3, 4A, 4B, 5A, 5B, and 6 through 52; scale bar 1000 bp.]

FIG. 3. A schematic presentation of the exon-intron organization of the mouse type II collagen gene. The figure is based on the sequence presented in Fig. 2. The exons are numbered from the 5' end with the amino-terminal joining exon numbered as exon 6 for alignment with the other fibrillar collagen genes. Bottom, the scale in base pairs.

[Figure 4: schematic of the cDNA with exon numbers, exon boundaries, the stop codon, and the polyadenylation signal, with the amplified fragments and their primer labels drawn beneath; scale bar 1000 bp.]

FIG. 4. The strategy for cDNA amplification by the polymerase chain reaction. Top, the cDNA for mouse proα1(II) collagen mRNA with the exon boundaries (vertical lines), stop codon, and the major polyadenylation signal is shown. Total RNA from primary mouse chondrocyte cultures was used as template and random hexamers as primers for cDNA synthesis by reverse transcriptase. Middle, 15 oligonucleotides (18- to 21-mers) were synthesized based on the gene sequence and used for primers in the amplification of cDNA by PCR. The amplified fragments are shown as lines with the thick labeled bars at each end denoting the primers used. For sequencing each fragment was cloned into Bluescript by blunt-end ligation. Bottom, the scale in base pairs.

stop codon to polyadenylation signal also exhibits an 86.6% identity between mouse and man. The 220 bp upstream from the transcription start site exhibit 78% similarity between the mouse and human genes. The sizes of exons are conserved between mouse and man, except for exons 2 and 5B, which both differ by three nucleotides. The introns vary in size between 84 and 3799 bp. Interestingly, a high degree of conservation is also seen in the sizes of the 53 introns between mouse (Fig. 2) and human (20-27) genes. On the other hand, intron sizes are not conserved between the chick (16) and mammalian type II collagen genes.
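The identity percentages quoted above are the kind of figure that can be computed from a pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned with `-` as the gap character and that a gap opposite a base counts as a mismatch. That convention, the function name, and the toy sequences are illustrative assumptions and are not taken from the paper, whose alignment software may have used a different convention.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of a pairwise alignment.

    Assumption: a gap ('-') opposite a base counts as a mismatch;
    columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # uninformative column
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Hypothetical toy alignment, not real collagen sequence.
toy_mouse = "ATGGCTCCC-GGT"
toy_human = "ATGGCACCCAGGT"
print(round(percent_identity(toy_mouse, toy_human), 1))
```

In the toy alignment, 11 of 13 columns agree; one column is a substitution and one is a gap, so the choice of how to score gaps visibly changes the result.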

Promoter and Transcription Start Site-Primer extension analysis was performed to identify the transcription start site of the gene (Fig. 6). The location is exactly the same as in the rat gene (17) but differs by one nucleotide from the location shown for the human gene (23, 25, 26). Within the first 220 bp of the mouse upstream sequence (Fig. 2), the G + C content exceeds 70%, analogous to the human and rat promoters. The TATAA sequence is located between -28 and -23 from the transcription start site. The promoter contains no obvious

SIGNAL PEPTIDE N-PROPEPTIDE IN MIRLGAPQSLVLLTLLIAAVLRCQG QDA(R/Ql

. . . . . . . . . . . . . . . . v . . . . . . . . ..v . / , ALTERNATIVELY SPLICED SEQUENCE EAGSCLQNGQRYKDKDWIKPSSCRICVCDTGNVLCDDIICEDP-DCLNPEIPFGECCPICPADLATASG . . . . . V.D .... N . . . . . . . E? . . . . . . . . . T..........VK...S.............T.......

30N KLGPKGQKGEPGDIRDIIGPRGPPGPQGPAGEQGPRGDRGDKGEKGAPGPRG~GEPGTPGNPGPAGPEGPPGPP QP ............ K . . ..........................................................

N-TELOPEPTIDE 105N GLSAGNFAA QMAGGYDEKAGGAQMGVUQ . . G- . . . . . . . . . . F . . . . . . . . L....

1 GPMGPMGPRGPPGPAGAPGPQGFQGNPGEPGEPGVSGPMGPRGPPGPAGKEGDDGEAGKPGKSGERGLPGPQGAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P . . . . . . . . . . . . . . A....P.......

76 GFPGTPGLPGVKGHRGYPGLDGAKGEAGAPGVKGESGSPGENGSPGPMGPRGLPGERGRTGPAG~GARGNDGQP

TRIPLE HELIX

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 GPAGPPGPVGPAGGPGFPGAPGAKGEAGPTGARGPEGAQGSRGEPGNPGSPGPAGASGNPGTDGIPGAKGSAGAP

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P . . . . . ............................. 226 GIAGAPGEPGPRGP?GPQGATGPLGPKGQAGEEGIAGFKGDQGPKGETGPAGPQGAPGP~GEEGKRGARGEPGGA

............................. T . . . . . . . . . . .................................. V 301 GPIGPPGERGAPGNRGfPGQDGLAGPKGAPGERGPSGLAGPKGANGDPGRPGE?GLPGARGLTGREGDAGPQGKV

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 GPSGAPGEDGRPGPPGPQGARGQPGVUGFPG?KGANGEPGKAGEKGLAGAPGLRGLPGKDGETGAAGPPGPSGPA

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P ....................... A . . . 451 GERGEQGAPGPSGFQGLPGPPGPPGEGGKQGDQGIPGEAGAPGLVGPRGERGFPGERGS?GAQGLQGPRGLPGTP

............................. P . . . . v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526 GTDGPKGAAGPDGPPGAQGPPGLQGM?GERGAAGIAGPKGDRGDVGEKGPEGAPGKDGGRGLTG?IGPPGPAGAN

. . . . . . . . 5 .. A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601 GEKGEVGPPGPSGSTGARGAPGEPGETGPPGPAGFAGPPG~GQPGAKGDQGEAGQKGDAGAPGPQGPSGAPGPQ

. . . . . . . . . . . A..A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E . . . . . . . . . . . . . . . . . . . . . . . . . 676 GPTGVTGPKGARGAQGPPGATGFPG~GRVGPPGANGNPGPAG?PGPAGKDGPKGVRGDSGPPG~GDPGLQGEA

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . S . . . . . . P.....S.......A...........E....... 751 GAPGEKGE?GDDGPSGLDGPPGPQGLAGQRGIVGLPGQRGERGFPGL?GPSGEPGKQGAPGASGDRGPPGPVGPP

.P . . . . . . . . . . . . . . RE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 6 GLTGPAGEPGREGSPGRDGPPGRDGAAGVKGDRGETGALGAPGAPGPPGSEGPAGPTGKQG~RGEAGAQGPMG?~

901 GPAGARGIAGPQGPRGDKGESGEQGERGLKGHRGFTGLQGLPGPPGPSGDQGASG?AGPSGPRGPPGPVGPSGKD . . . . . . . . Q . . . . . . . . . . . .......................................................

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

976 GSNGIPGPIGPPGPRGRSGETGPVGPPGSPGPPG?PGP? .A ..................... A....N..........

C-TELOPEPTIDE 1C GPGIDMSAFAGLGQREKGPD?MQYMRA DERDSTLRQHDVEVDATLKSLNNQIESIRSPDGSRKNPARTCQDLKL

. . . . . . . . . . . . . P . . . . . . . L..... .Q.AGG . . . . . .................................... 75C CHPEWKSGDYWIDPNQGCTLD~KVFCNMETGETCVYPNPATVPRKNWWSSKSKEKKHIWFGETMNGGFHFSYGD

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N .. ............................... 150C GNLAPNTANVQMTFLRLLSTEGSQNITYHCKNSIAYLDEAAGNLKKALLIQGSNDVEM~EGNSRFTYTALKDGC

D I.................

C-PROPEPTIDE

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225C TKHTGKWGKTVIEYRSQKTSRLPIIDIAPMDIGGAEQEFGVDIGPVCFL

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P . . . . . . . . . . . . . .

FIG. 5. Deduced amino acid sequence of the mouse preproα1(II) collagen. Top line, the amino acid sequence of the different domains of the mouse protein. Bottom line, the corresponding human amino acid sequence with differences from the mouse sequence shown (from Refs. 26-29). The gaps mark the two putative signal peptidase cleavage sites and the N-proteinase and C-proteinase cleavage sites. Amino acid N29 varies depending on the alternative splicing of exon 2.

CCAAT element. Four copies of the hexanucleotide GGGCGG (Sp1 binding site) are located within the 200-bp promoter. An additional CCGCCC sequence is located in the promoter at -225. The 220 bp of the mouse promoter share sequence identities of 78 and 93% with the human and rat promoters, respectively (18, 23, 26). Sequence identity of over 80% between the mouse and rat promoters extends at least 1 kb upstream, whereas only short segments of >70% identity between the mouse and human genes were detected beyond 400 bp upstream. The homologous regions should help identify sequences important for the regulation of this gene. Tissue-specific regulation of the human proα1(II) collagen gene in chimeric mice suggests that the regulatory mechanisms have not diverged much between these species (34).
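The promoter features described above (G + C content and copies of the GGGCGG hexanucleotide or its reverse complement CCGCCC) are simple to tabulate for any sequence. The sketch below shows one way to do so; the function names and the toy sequence are hypothetical, and the mouse promoter sequence itself (Fig. 2) is not reproduced here.

```python
def gc_content(seq: str) -> float:
    """Return the G + C content of a DNA sequence in percent."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def count_overlapping(seq: str, motif: str) -> int:
    """Count occurrences of motif in seq, allowing overlaps."""
    seq, motif = seq.upper(), motif.upper()
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


# Hypothetical toy sequence containing one GGGCGG, one CCGCCC,
# and a TATA-like element; it is not the mouse promoter.
toy_promoter = "TTGGGCGGAACCGCCCTATAAA"

sp1_like = {m: count_overlapping(toy_promoter, m) for m in ("GGGCGG", "CCGCCC")}
print(sp1_like)
print(round(gc_content(toy_promoter), 1))
```

Counting both GGGCGG and CCGCCC on one strand is equivalent to counting GGGCGG on both strands, since the two hexanucleotides are reverse complements of each other.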

N-propeptide Domain-The 5'-untranslated sequence spans 152 nucleotides and exhibits 88-90% identity with the human and rat sequences. The first AUG codon is located 60 bases from the transcription start site; a coding sequence for a tetrapeptide Met-Glu-Glu-Arg is followed by an in-frame stop codon. The second AUG begins the open reading frame coding for the proα1(II) collagen chain. The polypeptide chain begins with a presumptive signal peptide of 23 or 25 amino acids, followed by an N-propeptide of 88 or 90 amino acids (plus the 69 amino acids coded for by the alternatively spliced exon 2). The triple-helical domain of the N-propeptide is interrupted after four Gly-X-Y triplets by four amino acids, followed by another 20 triplets. This makes the triple-helical

FIG. 7. Sequence conservation in the first intron of the type II collagen gene between mouse and man. Top, the homology regions in the intron 1 sequences of the mouse and human genes. Middle, the sizes of the homologous regions (in base pairs) and the sequence identity (in percent). The human nucleotide sequence is from Ref. 26. Bottom, the scale in base pairs.

FIG. 6. Determination of the transcription start site. For primer extension of proα1(II) collagen mRNA, a 30-mer oligonucleotide complementary to nucleotides 53-82 of exon 1 was synthesized and used to prime cDNA synthesis by reverse transcriptase. The reaction products were resolved on a denaturing sequencing gel (lanes 1 and 5, corresponding to 1 and 5 µl of sample) with the corresponding sequencing reaction run on the left (lanes G, A, T, and C).

domain considerably longer than in type I and III procollagens, which have 13-14 uninterrupted Gly-X-Y repeats (2). The proα2(V) chain has a triple-helical domain similar in length to proα1(II) collagen but with two interruptions (35). The sizes of the N-propeptide domains differ slightly between mouse, rat, and man; the non-triple-helical domain is one amino acid shorter in man, and the alternatively spliced sequence codes for 1 residue less in mouse. The rat sequence for the alternatively spliced exon is not known.

The gene structure of the proα1(II) collagen chain differs considerably from the other fibrillar collagens in the N-propeptide domain. These differences include a greater number of exons coding for the domain, the presence of a "complete" triple-helical 54-bp exon coding for six Gly-X-Y triplets, and an alternatively spliced exon (exon 2). For easier alignment of the type II collagen gene with the triple-helical and C-propeptide domains of other fibrillar collagen genes, we have designated the joining exon coding for the N-telopeptide as exon 6 (Figs. 2 and 3). Several of the exons in this domain are short, the shortest one (17 bp) being exon 3.

The first intron of the mouse type II collagen gene spans 3.8 kb. Work on the rat proα1(II) gene has located a tissue-specific enhancer in the first intron, but its sequence has not been published (36). Comparison of the first intron of the mouse gene with the human gene (26) revealed several regions of high homology (Fig. 7). The first one shares an 80% identity, spans 212 bp between nucleotides 764 and 976 of the

FIG. 8. Alternative splicing of exon 2 sequences. A, schematic presentation of the approach used to demonstrate alternative splicing of exon 2 coding for a Cys-rich globular domain of the N-propeptide in mouse chondrocytes. Oligonucleotide S11 (Fig. 4) was used to prime cDNA synthesis by reverse transcriptase using total RNA from mouse primary chondrocyte cultures as template. Primers B5, B8, B7, and A5 were used for amplification of the specific fragments by PCR. The sizes of the expected fragments are shown. Since the amount of the longer (exon 2-containing) transcripts was low, oligonucleotides B8 and B7 corresponding to sequences within exon 2 were also used for specific amplification of the longer transcripts. B, fractionation of DNA fragments generated by PCR: lane 1, using primers B5 and B8; lane 2, using primers B5 and A5 (the middle band corresponds to heteroduplex molecules); and lane 3, using primers B7 and A5. Shown on the left are the sizes of the bands based on molecular weight standards (123-bp ladder on the right).

FIG. 9. Determination of the polyadenylation site. A, a diagram demonstrating the DNA fragment used in the S1 nuclease mapping. A 1140-bp NcoI-BamHI fragment corresponding to the 3'-untranslated sequence of the gene was labeled at its 3' (NcoI) end with Klenow enzyme. The fragment was hybridized at 53 °C with 30 µg of total RNA from newborn mouse rib cartilages, followed by digestion with S1 nuclease. The locations of the two potential polyadenylation sites and the sizes of the expected fragments are shown. B, autoradiogram of the reaction products from S1 nuclease digestion fractionated on a denaturing 6% polyacrylamide gel. Lane 1, labeled NcoI-BamHI fragment without digestion; lane 2, the same fragment hybridized with cartilage RNA and digested with S1 nuclease. The sizes of the fragments are shown on the right.

intron, and contains in the middle an identical viral core enhancer motif (CTGGAAAGT). Two 400-bp sequences beginning at nucleotides 1648 and 2074 of the mouse intron exhibit identities of 82 and 75% with the human sequence.


This homology coincides with the location of the tissue-specific enhancer described in the rat gene (36, 37). Three additional homology regions of 251, 146, and 295 bp with sequence identities of 62-76% were found at the 3' end of intron 1. The first intron of the mouse gene contains seven GGGCGG or CCGCCC hexanucleotides; only two of these are located within the conserved sequences.

During sequencing of the mouse type II collagen gene we identified an additional exon in intron 1 homologous to an alternatively spliced exon in the human type II collagen gene (26). Previous cDNA cloning of proα1(II) collagen mRNA from rat chondrosarcoma (17) and human chondrocytes (29) suggested that type II collagen lacks the Cys-rich globular domain in the N-propeptide which is present in the proα1(I) and proα1(III) collagens. Amplification of cDNA fragments for the 5' end of mouse type II collagen mRNA by PCR also demonstrated the presence of two different transcripts. To confirm this observation, PCR was performed with two oligonucleotide primers within the exon and two in the adjoining exons (Fig. 8). The results show that exon 2 is also alternatively spliced in mouse proα1(II) collagen mRNAs prepared from chondrocytes (Fig. 8). The majority of the mRNA in newborn mouse chondrocytes did not contain exon 2 sequences. Studies on human chondrocytes have shown varying ratios of the two transcripts resulting from the alternative splicing (38). Exon 2 terminates in three possible in-frame splice sequences (Fig. 2). We determined by cDNA cloning and sequencing that the middle one (/GTCGTAA), which differs considerably from the consensus splice signal (/GT(A/T)AGT(A/G)), is used by the splicing machinery. Deviation from the consensus splice sequence has been shown to result in exon skipping in other genes including type I collagen genes (39, 40). It remains to be seen if the exclusion of exon 2 sequences from most proα1(II) collagen transcripts results from an analogous phenomenon. Alternatively, a specific unknown mechanism may regulate the ratio of the two mRNAs. This could be related to the role of the Cys-rich globular domain, which remains currently unknown. The α1 chains of type I, II, and III collagens and the α2(V) chain all contain a homologous domain which does not undergo alternative splicing, whereas the α2(I) chain does not contain this domain. The N-propeptide of type I collagen has been suggested to play a role in feedback regulation of type I and type II collagen synthesis in cultured cells (see Ref. 41). The biological significance of this control remains unknown.
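The statement that the exon 2 donor site (/GTCGTAA) differs considerably from the consensus (/GT(A/T)AGT(A/G)) can be made explicit by comparing the two position by position. The sketch below does this with a plain per-position match; it is an illustration only, not a splice-site scoring method used in the paper, and the function and variable names are hypothetical.

```python
# Consensus donor signal as written in the text: /GT(A/T)AGT(A/G).
# Each entry lists the bases accepted at that intron position.
CONSENSUS_DONOR = ["G", "T", "AT", "A", "G", "T", "AG"]

# Donor sequence reported for the mouse exon 2 splice site.
EXON2_DONOR = "GTCGTAA"


def donor_matches(site: str, consensus=CONSENSUS_DONOR) -> list:
    """Return one True/False flag per position: does the base fit the consensus?"""
    site = site.upper()
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return [base in allowed for base, allowed in zip(site, consensus)]


flags = donor_matches(EXON2_DONOR)
print(f"{sum(flags)} of {len(flags)} positions match the consensus")
```

Under this simple comparison only the invariant GT dinucleotide and the last position agree with the consensus, which is consistent with the description of the site as a considerable deviation.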

Triple-helical Domain-The polypeptide chains in mature type II collagen consist of an N-telopeptide of 19 amino acids, an uninterrupted triple-helical domain of 1014 amino acids, and a C-telopeptide of 27 amino acids. Comparison of the mouse amino acid sequence deduced from genomic and cDNA clones with the corresponding human sequences reveals 37 amino acid differences in the mature type II collagens: 33 within the triple helix, two in the N-telopeptide, and two in the C-telopeptide (Fig. 5). This represents an overall identity of 96.4% at the amino acid level versus the 90.4% identity at the nucleotide level. The lysine residues that participate in covalent intra- and intermolecular cross-linking of type II collagen are all conserved in position 122N of the N-telopeptide, positions 87 and 930 of the triple helix, and position 17C of the C-telopeptide (Refs. 42 and 43; Fig. 2). The lysine residue in the N-telopeptide also participates in intermolecular cross-links between type II collagen and the α2 chain of type IX collagen (43, 44).

The gene structure for the triple helix shows a remarkable conservation of exon sizes. The triple helix is coded for by 44 exons, two of which (the joining exons) also code for the N- and C-telopeptides and parts of the corresponding propeptides. A majority of exons coding for the triple helix have sizes of 54 bp or multiples thereof: 23 are 54 bp, 8 are 108 bp, and 1 is 162 bp long, coding for 6, 12, and 18 complete Gly-X-Y triplets, respectively. Five exons are 45 bp long, and another 5 are 99 bp, coding for 5 and 11 Gly-X-Y triplets. This gene structure is conserved in all the known type II collagen genes (16, 19, 21, 27) and with minor variations in all the genes coding for vertebrate fibrillar collagens (2).
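The exon counts given above can be checked against the stated length of the triple helix with a little arithmetic. The sketch below tallies the exon size classes listed in the text; the final subtraction, which attributes the remaining residues to the two joining exons, is an inference from these numbers and is not a figure stated in the paper.

```python
# Exon size (bp) -> number of exons of that size, as listed in the text.
EXON_CLASSES = {54: 23, 108: 8, 162: 1, 45: 5, 99: 5}

BP_PER_TRIPLET = 9        # one Gly-X-Y triplet = 3 codons = 9 bp
HELIX_RESIDUES = 1014     # triple-helical domain length given in the text
TOTAL_HELIX_EXONS = 44    # includes the two joining exons

# Every listed exon size should encode a whole number of triplets.
triplets_per_exon = {size: size // BP_PER_TRIPLET for size in EXON_CLASSES}
assert all(size % BP_PER_TRIPLET == 0 for size in EXON_CLASSES)

listed_exons = sum(EXON_CLASSES.values())
listed_bp = sum(size * n for size, n in EXON_CLASSES.items())
listed_residues = listed_bp // 3

# Inference (not stated in the paper): the remainder of the helix
# would be encoded by the joining exons.
joining_exons = TOTAL_HELIX_EXONS - listed_exons
remaining_residues = HELIX_RESIDUES - listed_residues

print(triplets_per_exon)
print(listed_exons, listed_bp, listed_residues)
print(joining_exons, remaining_residues)
```

The listed size classes account for 42 exons and 996 residues, leaving 18 residues (six triplets) of the 1014-residue helix for the two joining exons under this reading.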

C-propeptide Domain-This domain coding for 246 amino acids is well conserved from chick to mouse and man. The amino acid identity between mouse and human is 95.2%, between mouse and bovine it is 93.9%, and between mouse and chicken 86.6%. Homologous regions also exist between the type II collagen C-propeptide and other fibrillar collagens, particularly around the carbohydrate attachment site at amino acid 174C. The C-propeptides have an important role in the assembly of a procollagen molecule since the process starts at the C-propeptide.

The high degree of amino acid identity in the C-propeptide is also seen as sequence conservation at the nucleotide level between mouse and human (89.8% identity), bovine (90.6%), and chicken (83.8%) type II collagen genes. The exon sizes are identical between these species in this domain.

3'-Untranslated Domain-S1 nuclease protection experiments were performed to determine the polyadenylation site(s) of the mouse type II collagen transcripts (Fig. 9). The results indicate that the first AATAAA sequence in the 3'-untranslated sequence is rarely used in the mouse. In the human sequence no transcripts terminating at this signal have been detected (28). The predominant polyadenylation signal in the mouse is apparently the ATTAAA sequence 404 nucleotides from the translation stop signal (Figs. 2 and 9). The same variant of the consensus sequence is also used in the bovine and human transcripts, whereas in the chick the sequence is AATAAA. This polyadenylation signal is located at the end of a conserved sequence of 12 nucleotides. Situated 114 bp upstream is another sequence of 79 nucleotides which is identical between mouse and man and also highly conserved in the bovine (96%) and chick (76%) genes. These homologous regions in the 3'-untranslated domain represent the longest stretches of identical sequence between the mouse and human type II collagen genes. The overall sequence similarity between these two species in the 3'-untranslated domain is 86.6%. The distances from the translation stop signal to the predominant polyadenylation signal are also similar: 513 nucleotides in the chick (45), 434 nucleotides in bovine (19), 439 nucleotides in human (28), and 404 nucleotides in the mouse transcripts (Figs. 2 and 9).

In Northern analysis a single but somewhat diffuse proα1(II) collagen mRNA band of approximately 5 kb is consistently observed (data not shown). In addition to the length variation in the poly(A) tail, two different mechanisms probably contribute to the diffuseness of the Northern hybridization signal: the distance between the two alternative polyadenylation signals is 217 nucleotides, and the alternative splicing of exon 2 sequences introduces a variability of 204 nucleotides. Thus the sizes of the mRNAs (assuming the poly(A) tail to be 200 bases long) can be expected to vary between 4690 and 5110 bases. The major species should have a length of 4910 bases. The α3 chain of type XI collagen is an overmodified product of the type II collagen gene (46). It is tempting to speculate that the difference could be related either to the alternative splicing of exon 2 or to the use of the alternative polyadenylation signal.
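The size range quoted above follows from combining the two sources of variability with the length of the major species. The sketch below enumerates the four combinations. It assumes, in line with the text, that the major 4910-base species lacks exon 2 and uses the predominant (downstream) polyadenylation signal, so that the rarely used upstream signal shortens the transcript by 217 bases. The variable names are illustrative.

```python
from itertools import product

MAJOR_SPECIES = 4910    # bases; lacks exon 2, uses the predominant signal
POLYA_SIGNAL_GAP = 217  # bases between the two polyadenylation signals
EXON2_LENGTH = 204      # bases added when exon 2 is included

sizes = {}
for with_exon2, upstream_signal in product((False, True), repeat=2):
    size = MAJOR_SPECIES
    if with_exon2:
        size += EXON2_LENGTH       # alternatively spliced exon retained
    if upstream_signal:
        size -= POLYA_SIGNAL_GAP   # transcript ends at the first signal
    sizes[(with_exon2, upstream_signal)] = size

shortest, longest = min(sizes.values()), max(sizes.values())
print(sizes)
print(round(shortest, -1), round(longest, -1))
```

The extremes are 4693 and 5114 bases, which round to the 4690 and 5110 bases given in the text.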

Acknowledgments-We are grateful to Dr. W. B. Upholt for the human genomic clones for type II collagen, to Dr. Silvio Garofalo for helpful discussions, and to Drs. Linda Sandell and Leena Ala-Kokko for communicating information on the human gene structure prior to publication.

REFERENCES

1. Mayne, R., and Burgeson, R. E. (1987) Structure and Function of Collagen Types, Academic Press, New York
2. Vuorio, E., and de Crombrugghe, B. (1990) Annu. Rev. Biochem. 59, 837-872
3. Stockwell, R. A. (1979) Biology of Cartilage Cells, Cambridge University Press, Cambridge
4. Upholt, W. B. (1989) in Collagen (Olsen, B. R., and Nimni, M. E., eds) Vol. IV, pp. 31-49, CRC Press, Boca Raton, FL
5. Hamerman, D. (1989) N. Engl. J. Med. 320, 1321-1330
6. von der Mark, K., von der Mark, H., Timpl, R., and Trelstad, R. L. (1977) Dev. Biol. 59, 75-85
7. Thorogood, P., Bee, J., and von der Mark, K. (1986) Dev. Biol. 116, 497-509
8. Kosher, R. A., and Solursh, M. (1989) Dev. Biol. 131, 558-566
9. Nah, H.-D., Rodgers, B. J., Kulyk, W. M., Kream, B. E., Kosher, R. A., and Upholt, W. B. (1988) Collagen Relat. Res. 8, 277-294
10. Lee, B., Vissing, H., Ramirez, F., Rogers, D., and Rimoin, D. (1989) Science 244, 978-980
11. Vissing, H., D'Alessio, M., Lee, B., Ramirez, F., Godfrey, M., and Hollister, D. W. (1989) J. Biol. Chem. 264, 18265-18267
12. Palotie, A., Vaisanen, P., Ott, J., Ryhanen, L., Elima, K., Vikkula, M., Cheah, K., Vuorio, E., and Peltonen, L. (1989) Lancet 1, 924-927
13. Knowlton, R. G., Katzenstein, P. L., Moskowitz, R. W., Weaver, E. J., Malemud, C. J., Pathria, M. N., Jimenez, S. A., and Prockop, D. J. (1990) N. Engl. J. Med. 322, 526-530
14. Ala-Kokko, L., Baldwin, C. T., Moskowitz, R. W., and Prockop, D. J. (1990) Proc. Natl. Acad. Sci. U. S. A. 87, 6565-6568
15. Francomano, C. A., Liberfarb, R. M., Hirose, T., Maumenee, I. H., Streeten, E. A., Meyers, D. A., and Pyeritz, R. E. (1987) Genomics 1, 293-296
16. Upholt, W. B., and Sandell, L. J. (1986) Proc. Natl. Acad. Sci. U. S. A. 83, 2325-2329
17. Kohno, K., Martin, G. R., and Yamada, Y. (1984) J. Biol. Chem. 259, 13668-13673
18. Kohno, K., Sullivan, M., and Yamada, Y. (1985) J. Biol. Chem. 260, 4441-4447
19. Sangiorgi, F. O., Benson-Chanda, V., de Wet, W. J., Sobel, M. E., and Ramirez, F. (1985) Nucleic Acids Res. 13, 2815-2826
20. Strom, C. M., and Upholt, W. B. (1984) Nucleic Acids Res. 12, 1025-1038
21. Cheah, K. S. E., Stoker, N. G., Griffin, J. R., Grosveld, F. G., and Solomon, E. (1985) Proc. Natl. Acad. Sci. U. S. A. 82, 2555-2559
22. Sangiorgi, F. O., Benson-Chanda, V., de Wet, W. J., Sobel, M. E., Tsipouras, P., and Ramirez, F. (1985) Nucleic Acids Res. 13, 2207-2225
23. Nunez, A. M., Kohno, K., Martin, G. R., and Yamada, Y. (1986) Gene (Amst.) 44, 11-16
24. Vikkula, M., and Peltonen, L. (1989) FEBS Lett. 250, 171-174
25. Su, M.-W., Benson-Chanda, V., Vissing, H., and Ramirez, F. (1989) Genomics 4, 438-441
26. Ryan, M. C., Sieraski, M., and Sandell, L. J. (1990) Genomics 8, 41-48
27. Ala-Kokko, L., and Prockop, D. J. (1990) Genomics 8, 454-460
28. Elima, K., Vuorio, T., and Vuorio, E. (1987) Nucleic Acids Res. 15, 9499-9504
29. Baldwin, C. T., Reginato, A. M., Smith, C., Jimenez, S. A., and Prockop, D. J. (1989) Biochem. J. 262, 521-528
30. Seyer, J. M., Hasty, K. A., and Kang, A. H. (1989) Eur. J. Biochem. 181, 159-173
31. Devereux, J., Haeberli, P., and Smithies, O. (1984) Nucleic Acids Res. 12, 387-395
32. Chirgwin, J. M., Przybyla, A. E., MacDonald, R. J., and Rutter, W. J. (1979) Biochemistry 18, 5294-5299
33. Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989) Molecular Cloning: A Laboratory Manual, 2nd Ed, pp. 7.58-7.83, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
34. Lovell-Badge, R. H., Bygrave, A., Bradley, A., Robertson, E., Tilly, R., and Cheah, K. S. E. (1987) Proc. Natl. Acad. Sci. U. S. A. 84, 2803-2807
35. Woodbury, D., Benson-Chanda, V., and Ramirez, F. (1989) J. Biol. Chem. 264, 2735-2738
36. Horton, W., Miyashita, T., Kohno, K., Hassell, J. R., and Yamada, Y. (1987) Proc. Natl. Acad. Sci. U. S. A. 84, 8864-8868
37. Savagner, P., Miyashita, T., and Yamada, Y. (1990) J. Biol. Chem. 265, 6669-6674
38. Ryan, M. C., and Sandell, L. J. (1990) J. Biol. Chem. 265, 10334-10339
39. Weil, D., Bernard, M., Combates, N., Wirtz, M. K., Hollister, D. W., Steinmann, B., and Ramirez, F. (1988) J. Biol. Chem. 263, 8561-8564
40. Tromp, G., and Prockop, D. J. (1988) Proc. Natl. Acad. Sci. U. S. A. 85, 5254-5258
41. Bornstein, P., and Sage, H. (1989) Prog. Nucleic Acid Res. Mol. Biol. 37, 67-106
42. Eyre, D. R., Paz, M. A., and Gallop, P. M. (1984) Annu. Rev. Biochem. 53, 717-748
43. Eyre, D. R., Apon, S., Wu, J.-J., Ericsson, L. H., and Walsh, K. A. (1987) FEBS Lett. 220, 337-341
44. van der Rest, M., and Mayne, R. (1987) J. Biol. Chem. 263, 1615-1618
45. Sandell, L. J., Prentice, H. L., Kravis, D., and Upholt, W. B. (1984) J. Biol. Chem. 259, 7826-7834
46. Eyre, D., and Wu, J. J. (1987) in Structure and Function of Collagen Types (Mayne, R., and Burgeson, R. E., eds) pp. 261-281, Academic Press, New York