
Nature Methods | VOL.7 NO.10 | OCTOBER 2010 | 793

News and Views

Simplifying complexity

Nicole Cloonan & Sean M Grimmond

A software tool based on gene model profiling improves analysis of alternative splice events in RNA sequencing (RNA-seq) data.

Nicole Cloonan and Sean M. Grimmond are at Queensland Centre for Medical Genomics, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia. e-mail: [email protected]

Massive-scale short-tag RNA sequencing (RNA-seq) is rapidly becoming the tool of choice for surveying gene expression and transcriptome content in complex organisms. RNA-seq offers many advantages over microarray-based surveys, but the relative infancy of the field is reflected in the paucity of comprehensive analysis tools available. In this issue, Griffith et al. describe alternative expression analysis by sequencing (ALEXA-seq)1, a software package that identifies and quantifies the genomic features of gene expression, providing a substantial step forward for genome-wide surveys of alternative splicing.

Over the last decade, it has become increasingly clear that the complexity of an organism is not reflected in the number of genes encoded in its genome but rather in the number of transcripts present in its transcriptome. Exhaustive full-length cDNA and rapid amplification of cDNA ends (RACE) analyses have shown that, on average, mammalian genes are capable of expressing six to seven different transcripts, derived predominantly from alternative exon splicing and alternative promoter usage2. This increased ‘transcriptional complexity’ expands not only the proteomic output from each locus but the opportunities for gene regulation and RNA-mediated control as well.

The rapid rise in the popularity of RNA-seq has been driven by its ability to overcome limitations of array-based profiling, namely cross-hybridization, limits of sensitivity and dynamic range, and a reliance on suitable and comprehensive probe design. Although massive-scale shotgun sequencing of RNA has provided the first estimates of transcriptome complexity in specific biological states3,4, it is not without its challenges. A complete understanding of transcriptional complexity using RNA-seq will likely require the daunting task of piecing together full-length transcripts from a mass of short sequence reads, typically only 50 to 100 nucleotides long.

Currently, analysis of RNA-seq data falls into three broad categories: reference genome–free transcriptome reconstruction (Fig. 1a), reference genome–assisted reconstruction (Fig. 1b) and gene model–based profiling (Fig. 1c). Direct assembly of overlapping short reads without alignment to a reference genome5,6 is usually used specifically for detecting transcriptional novelty (fusion transcripts, new splice variants or unannotated genes). This approach, however, is often confounded by the redundancy of expressed repetitive elements and can require excessively high coverage of transcripts to overcome sequencing error rates, effectively limiting reconstruction to the most highly abundant transcripts. In contrast, genome-assisted reconstruction uses genomic alignments of short reads to infer the repertoire of possible transcripts and mathematically models the combination and abundance of transcripts that best explain the observed RNA-seq profile7,8.

In the current report1, Marra and colleagues neatly sidestep the challenging issue of transcript reconstruction by using gene model–based profiling to systematically catalog expression on the basis of genomic and transcriptomic features associated with each gene (such as known exons and introns, exon-intron boundaries, intergenic regions, exon-exon junctions and more). The focus on transcriptomic features in a genome-wide context allows experiment-specific noise to be measured, a factor not widely incorporated into RNA-seq analyses. The integration of statistically robust tests from the publicly available Bioconductor suite of programs9 allows for a stepwise feature-based analysis of gene expression1 and can detect multiple types of transcriptomic activities, including expression amounts at a particular locus, alternative splicing and intron-retention events. The approach is equally accessible for both single-read and paired-end-read RNA-seq experiments with tags of almost any length, and is adaptable to a variety of gene models1.
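The gene model–based profiling described in the text can be illustrated with a toy example. The sketch below is not the ALEXA-seq implementation; it only shows the general idea of pooling aligned reads per annotated feature and then comparing the same feature between two samples. The `Feature` class, the coordinates, the read lists and the pseudocount are all hypothetical, and no library-size normalization or statistical testing is attempted.

```python
from collections import Counter
from dataclasses import dataclass
from math import log2


@dataclass(frozen=True)
class Feature:
    """Hypothetical gene-model feature on one chromosome (0-based, half-open)."""
    name: str
    kind: str   # e.g. "exon" or "intron"
    start: int
    end: int


def count_reads(features, reads):
    """Pool all reads overlapping each feature into a single count per feature."""
    counts = Counter({f.name: 0 for f in features})
    for read_start, read_end in reads:
        for f in features:
            # Two half-open intervals overlap if each starts before the other ends.
            if read_start < f.end and f.start < read_end:
                counts[f.name] += 1
    return counts


def log2_ratio(count_a, count_b, pseudocount=1.0):
    """Compare one feature across two samples; the pseudocount avoids division by zero."""
    return log2((count_a + pseudocount) / (count_b + pseudocount))


# Made-up gene model and aligned reads (start, end) for two samples.
features = [
    Feature("exon1", "exon", 100, 200),
    Feature("intron1", "intron", 200, 300),
    Feature("exon2", "exon", 300, 400),
]
sample_a = [(110, 160), (150, 200), (310, 360), (320, 370)]
sample_b = [(110, 160), (210, 260), (220, 270), (230, 280)]

counts_a = count_reads(features, sample_a)
counts_b = count_reads(features, sample_b)
for f in features:
    ratio = log2_ratio(counts_a[f.name], counts_b[f.name])
    print(f.name, f.kind, counts_a[f.name], counts_b[f.name], round(ratio, 2))
```

In this made-up data, reads falling in `intron1` are enriched in the second sample, the kind of pattern that a feature-level comparison could flag as a candidate intron-retention event for proper statistical follow-up.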

Figure 1 | Three models of transcriptome reconstruction. (a–c) Thick bars represent tags (blue) or exons (red); thin lines represent introns. (a) Reference genome–free transcript reconstruction: assemble transcripts from overlapping tags; optional: align to genome to get exon structure. (b) Reference genome–assisted transcript reconstruction: infer possible transcripts and abundance. (c) Gene model–based profiling: use known and/or predicted gene models to examine individual features.
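Panel a of Figure 1 summarizes reference genome–free reconstruction as assembling transcripts from overlapping tags. A minimal sketch of that idea follows, assuming error-free reads and using a simple greedy suffix-prefix merge. This is a stand-in chosen for brevity and is not the algorithm used by the assemblers cited in refs. 5,6; the function names, the `min_len` threshold and the example reads are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (at least min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            # No remaining pair overlaps by min_len or more.
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]


# Three made-up short tags that tile a 12-nucleotide sequence.
tags = ["ATGGCC", "GCCTTA", "TTAGGA"]
print(greedy_assemble(tags))
```

Even this toy version hints at the limitations noted in the text: a repeated sequence produces ambiguous overlaps, and a single sequencing error breaks an exact suffix-prefix match, which is one reason high coverage is needed in practice.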

© 2010 Nature America, Inc. All rights reserved.

Although ALEXA-seq1 does not attempt to reconstruct full-length transcripts, the feature-based approach should give greater statistical power to determine differential expression of individual splicing events as all reads describing a feature are pooled rather than assigned and partitioned to different transcripts. Additionally, because the authors1 are directly comparing the expression of the same feature in different samples, they neatly avoid the issue of library construction and sequencing bias.

The complex nature of transcriptional output from mammalian loci can make interpreting RNA-seq data particularly challenging, and good visualization is of primary importance. ALEXA-seq1 opens RNA-seq analysis to nonspecialist researchers by providing an intuitive graphical interface to browse the results and seamless integration with the University of California, Santa Cruz (UCSC) genome browser to examine the results alongside other genome-wide resources. Notably, this pipeline is complete from raw data to visualization, including the crucial aspects of library and sequencing quality control (such as tag redundancy, bias in transcript positions and tag quality visualization), producing several automated reports to accurately assess the quality of the RNA-seq library before making biological conclusions.

As with any gene model–based profiling, there are still limitations on what can be assayed with ALEXA-seq1. The use of diagnostic features (such as unique exons or splice events that occur in only one transcript from that locus) assumes that our biological knowledge is complete and does not readily allow the investigation or integration of novel expression into or alongside existing gene models. Additionally, although the ability to detect individual transcriptomic events is important, so is understanding their context within a full-length transcript, and deconvolution of overlapping transcripts remains an issue. Finally, one important issue not addressed or incorporated by the ALEXA-seq pipeline (or in many other programs) is the major advantage of RNA-seq, that is, knowing the precise sequence of the expressed genes. Understanding whether the predicted amino acid sequence of an mRNA is mutated or whether a microRNA binding site in the 3′ untranslated region is intact dramatically expands the value of RNA-seq data; these are issues of particular importance, for example, to the large-scale studies of cancer. Despite this, ALEXA-seq takes RNA-seq one step closer to readily replacing array-based profiling.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

1. Griffith, M. et al. Nat. Methods 7, 843–847 (2010).

2. Carninci, P. et al. Science 309, 1559–1563 (2005).

3. Cloonan, N. et al. Nat. Methods 5, 613–619 (2008).

4. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).

5. Birol, I. et al. Bioinformatics 25, 2872–2877 (2009).

6. Zerbino, D.R. & Birney, E. Genome Res. 18, 821–829 (2008).

7. Guttman, M. et al. Nat. Biotechnol. 28, 503–510 (2010).

8. Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).

9. Reimers, M. & Carey, V.J. Methods Enzymol. 411, 119–134 (2006).

A targeted approach to miRNA target identification

Andrew Grimson

A targeted mass spectrometry method, selected reaction monitoring, is applied to validate predicted microRNA targets in Caenorhabditis elegans.

Andrew Grimson is in the Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA. e-mail: [email protected]

Over the past several years, extensive efforts to develop and improve upon both experimental and computational approaches have advanced the identification and prediction of microRNA (miRNA) targets, yet defining the biologically relevant targets of an miRNA remains a challenge. In this issue of Nature Methods, Jovanovic et al. use a targeted mass spectrometry technique called selected reaction monitoring (SRM) to perform a proteomic analysis of miRNA-target interactions and pilot this approach by investigating the targets of two different miRNAs in Caenorhabditis elegans1.

miRNAs are small (~22 nucleotide) regulatory RNAs that repress target mRNAs. miRNA repression results in both mRNA destabilization and inhibition of translation, both of which ultimately reduce the protein levels of targets. Animals such as humans or the nematode worm C. elegans have more than a hundred miRNAs, and many of these miRNAs are each predicted to target ~100 different mRNAs; thus, there is an enormous number of predicted interactions between miRNAs and mRNAs2. We are only in the earliest stages of understanding the biological consequences of such interactions. Nevertheless, it is already clear that a subset of miRNA-mRNA targeting interactions constitute important nodes in a variety of biological pathways, and alterations to some of these interactions can have phenotypic consequences, including disease. As miRNAs are an important element in gene-regulatory pathways, the identification of their targets is of considerable importance: which mRNAs are targeted, and to what degree, largely determines the biological role of an miRNA.

Our understanding of miRNA target sites owes much to comparative genomics studies and transcriptome-wide profiling of miRNA targeting, which together revealed the evolutionary patterns of site usage and the relative efficacy of different sites2. Because the eventual result of miRNA repression is altered protein levels, mass spectrometry has also become an established tool in the study of miRNAs. Mass spectrometry has revealed that miRNA targets repressed exclusively by translational inhibition are rare and confirmed that the rules discovered by earlier transcriptome analysis, though imperfect, reflect the impact of miRNAs on the proteome3,4. In contrast to previous proteomic studies of miRNA targeting, Jovanovic et al.1 use SRM mass spectrometry to focus on only a predefined subset of peptides in a sample, at the cost of reduced coverage across the proteome. The advantages of SRM over previous proteomic assays include the enhanced sensitivity of measurements and potentially the relative
