transcriptome profile of the fathead minnow (pimephales ......a kyoto encyclopedia of genes and...

1
Transcriptome profile of the fathead minnow (Pimephales promelas) Wentworth S. A. 1 , Thede K. 2 , Aravindabose V. 2 , Monroe I. 1 , Thompson A. W. 1 , Garvin J. 2 , Packer R. 1 1. Department of Biological Sciences, Columbian College of Arts and Sciences, The George Washington University, Washington, DC 2. Department of Physiology and Biophysics, School of Medicine, Case Western Reserve University, Cleveland, Ohio Introduction Modern RNASeq technology provides a practical and accurate method for exploring the transcriptome and expressed proteins of non-model organisms. One such organism, the fathead minnow ( Pimephales promelas ), is a widely used toxicology model, however the lack of knowledge about its genetics has hindered further study into the interactions between this organism and its environment. In this study, we present the fully assembled and annotated gill transcriptome of the rosy red strain of P. promelas to establish it as a foundation for further genetic and physiological research. Acknowledgements The authors would like to thank Adam Wong and Brian Haas for their technical support, Colonial One for providing the high-performance computing power required form this work, the Case Western Reserve University Genomics Core for the RNA sequencing, Dr. Tara Scully, and the Harlan Undergraduate Research Program for funding this project. *References available upon request Most Represented KEGG Pathways Number of Pathway Hits 0 25 50 75 100 125 150 175 200 PI3K-Akt signaling MAPK signaling Ras signaling Endocytosis Neuroactive interaction Cytoskeletal regulation Rap Cytokine interaction Focal adhesion Ribosome RNA transport Protein processing in ER Purine metabolism Spliceosome Conclusion and Discussion Overall, the statistics of the assembled transcriptome denote that it is a high quality assembly. The coverage of the reads back to the transcriptome is ~270x, a coverage substantially higher than what is generally accepted (~30x) 11 . Furthermore, the N50 statistic—the contig length at which all longer contigs will comprise at least 50% of the total assembled bases of the transcriptome—at 2,271 bp is also higher than is normally found with other transcriptome profiles (more around ~1,500 bp). It may be noted that the BLAST similarity distribution shows a majority of sequences having less than 50% identity with known gene sequences, however this is not surprising and can be explained. First, the transcriptome was only BLASTed against the UniProt/SwissProt database which, while very extensive and complete, is manually reviewed and annotated unlike the TrEMBL or nr database. This database is less likely to include sequences which are not extensively reviewed and may not provide hits for more recently sequenced and closely related organisms. Additionally, because P. promelas was BLASTed against a database containing a majority of mammalian genes, and the closest relative of the fathead minnow which has been extensively annotated is Danio rerio and, even though it is in the same family of cyprinidae, many regions may simply be evolutionarily homologous, yet not sufficiently identical to provide high similarity scores. Finally, while these similarity scores are not of the best quality, the E- values, even those removed from the above graph, show remarkable significance and quality. 3.7% 5.9% 16.8% 41.3% 30.1% 2.2% <25% 25% to 35% 35% to 50% 50% to 70% 70% to 85% 85% to 100% BLAST Similarity Distribution Future Work In continuation of our findings we would like to explore changes in the transcriptome for fathead minnows acclimatized to the extremes of seasonal water conditions (~5ºC and ~25ºC) to determine what changes in the expression of various genes with relation to changing environmental conditions. Additional explorations into a variety of environmental factors such as salinity or acidity could add to the larger picture, and could even lead to explanations of how the fathead minnow maintains homeostasis as environmental conditions change. 8.6% 5.2% 7.7% 12.6% 32.7% 33.2% 1E-3 to 1 1E-3 to 1E-20 1E-20 to 1E-50 1E-50 to 1E-100 1E-100 to 0 0 BLAST E-Value Distribution Note: ~28,000 hits with an E-value greater than 1 were removed for graph clarity Materials and Methods RNA Extraction and Sequencing: 1. Fathead minnows were acclimatized to 25ºC before tissue extraction. 2. Gill samples were perfused to clear blood prior to RNA extraction. 3. Stranded cDNA libraries were constructed from each RNA sample and sequenced by an Illumina HiSeq 2500. Transcriptome Assembly: 1. Quality control was performed with the fasts toolkit to trim bases that have been shown to be biased in Illumina sequencing 1 . 2. Cleaned Reads were assembled de novo into transcripts using the Broad Institute’s Trinity software package 2,3 using the default parameters. Transcriptome Annotation: 1. The assembled transcripts were annotated using the Broad Institute’s Trinotate workflow 2,3 . 2. The transcripts were BLASTed 4 against the Uniprot/SwissProt 5 database. 3. Further annotation was done using software such as Transdecoder 2,3 , SignalP 6 , tmhmm 7 , and RNAMMER 9 , which annotate other gene and transcript features. 4. Additional annotation using a WEGO level 2 gene ontology classification 9 and a Kyoto Encyclopedia of Genes and Genomes pathway classification 10 determined the categories of gene function and protein pathways represented in the transcriptome. Count 0 10000 20000 30000 40000 50000 200-300 300-500 500-1000 1000-2000 2000-3000 3000-50005000-10000 >10000 183 2,844 8,253 11,145 21,364 25,564 33,780 49,985 Length Distribution of Transcripts in Base Pairs bp Summary of Transcriptome Data Total raw reads 470,813,940 Total cleaned reads 470,812,874 Length of reads(bp) 101 Total length of reads 47.5 Gbp Total length of assembly 152 Mbp Number of Transcripts 153,118 Transcript N50 2,271 Mean length of transcript 996 Median length of transcript 435 Transcripts annotated 150,773 Unique genes represented 72,334 Cellular Component Number of Genes Molecular Function Biological Process 1 10 100 1000 10000 cell cell part envelope extracellular region extracellular region part macromolecular complex membrane-enclosed lumen organelle organelle part synapse synapse part viron viron part symplast antioxidant auxilliary transport protein binding catalytic chemoattractant electron carrier enzyme regulator metallochaperone molecular transducer protein tag structural molecule transcription regulator translation regulator nutrient resevoir chemorepellant transporter anatomical structure formation biological adhesion biological regulation cell killing cellular component biogenesis cellular component organization cellular process death developmental establishment of localization growth immune system process localization metabolic process multiorganism process multicellular organismal process pigmentation reproduction reproductive process response to stimulus rhythmic process viral reproduction locomotion Gene Ontology Classifications Results

Upload: others

Post on 12-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Transcriptome profile of the fathead minnow (Pimephales ......a Kyoto Encyclopedia of Genes and Genomes pathway classification 10 determined the categories of gene function and protein

Transcriptome profile of the fathead minnow (Pimephales promelas) Wentworth S. A.1, Thede K.2, Aravindabose V.2, Monroe I.1, Thompson A. W.1, Garvin J.2, Packer R.1

1. Department of Biological Sciences, Columbian College of Arts and Sciences, The George Washington University, Washington, DC 2. Department of Physiology and Biophysics, School of Medicine, Case Western Reserve University, Cleveland, Ohio

Introduction Modern RNASeq technology provides a practical and accurate method for exploring the transcriptome and expressed proteins of non-model organisms. One such organism, the fathead minnow (Pimephales promelas), is a widely used toxicology model, however the lack of knowledge about its genetics has hindered further study into the interactions between this organism and its environment. In this study, we present the fully assembled and annotated gill transcriptome of the rosy red strain of P. promelas to establish it as a foundation for further genetic and physiological research.

Acknowledgements The authors would like to thank Adam Wong and Brian Haas for their technical support, Colonial One for providing the high-performance computing power required form this work, the Case Western Reserve University Genomics Core for the RNA sequencing, Dr. Tara Scully, and the Harlan Undergraduate Research Program for funding this project. *References available upon request

Most Represented KEGG PathwaysN

umbe

r of P

athw

ay H

its

0

25

50

75

100

125

150

175

200

PI3K

-Akt

sign

alin

gM

APK

signa

ling

Ras s

igna

ling

Endo

cyto

sisN

euro

activ

e in

tera

ctio

nCy

tosk

elet

al re

gula

tion

Rap

Cyto

kine

inte

ract

ion

Foca

l adh

esio

n

Ribo

som

eRN

A tra

nspo

rtPr

otei

n pr

oces

sing

in E

RPu

rine

met

abol

ism

Splic

eoso

me

Conclusion and Discussion Overall, the statistics of the assembled transcriptome denote that it is a high quality assembly. The coverage of the reads back to the transcriptome is ~270x, a coverage substantially higher than what is generally accepted (~30x)11. Furthermore, the N50 statistic—the contig length at which all longer contigs will comprise at least 50% of the total assembled bases of the transcriptome—at 2,271 bp is also higher than is normally found with other transcriptome profiles (more around ~1,500 bp). It may be noted that the BLAST similarity distribution shows a majority of sequences having less than 50% identity with known gene sequences, however this is not surprising and can be explained. First, the transcriptome was only BLASTed against the UniProt/SwissProt database which, while very extensive and complete, is manually reviewed and annotated unlike the TrEMBL or nr database. This database is less likely to include sequences which are not extensively reviewed and may not provide hits for more recently sequenced and closely related organisms. Additionally, because P. promelas was BLASTed against a database containing a majority of mammalian genes, and the closest relative of the fathead minnow which has been extensively annotated is Danio rerio and, even though it is in the same family of cyprinidae, many regions may simply be evolutionarily homologous, yet not sufficiently identical to provide high similarity scores. Finally, while these similarity scores are not of the best quality, the E-values, even those removed from the above graph, show remarkable significance and quality.

3.7%5.9%

16.8%

41.3%

30.1%

2.2%<25%25% to 35%35% to 50%50% to 70%70% to 85%85% to 100%

BLAST Similarity Distribution

Future Work In continuation of our findings we would like to explore changes in the transcriptome for fathead minnows acclimatized to the extremes of seasonal water conditions (~5ºC and ~25ºC) to determine what changes in the expression of various genes with relation to changing environmental conditions. Additional explorations into a variety of environmental factors such as salinity or acidity could add to the larger picture, and could even lead to explanations of how the fathead minnow maintains homeostasis as environmental conditions change.

8.6%

5.2%

7.7%

12.6%

32.7%

33.2%

1E-3 to 11E-3 to 1E-201E-20 to 1E-501E-50 to 1E-1001E-100 to 00

BLAST E-Value Distribution

Note: ~28,000 hits with an E-value greater than 1 were removed for graph clarity

Materials and Methods RNA Extraction and Sequencing:

1. Fathead minnows were acclimatized to 25ºC before tissue extraction.

2. Gill samples were perfused to clear blood prior to RNA extraction.

3. Stranded cDNA libraries were constructed from each RNA sample and

sequenced by an Illumina HiSeq 2500.

Transcriptome Assembly: 1. Quality control was performed with the fasts toolkit to trim bases that have

been shown to be biased in Illumina sequencing1.

2. Cleaned Reads were assembled de novo into transcripts using the Broad

Institute’s Trinity software package2,3 using the default parameters.

Transcriptome Annotation: 1. The assembled transcripts were annotated using the Broad Institute’s Trinotate

workflow2,3.

2. The transcripts were BLASTed4 against the Uniprot/SwissProt5 database.

3. Further annotation was done using software such as Transdecoder2,3, SignalP6,

tmhmm7, and RNAMMER9, which annotate other gene and transcript features.

4. Additional annotation using a WEGO level 2 gene ontology classification9 and

a Kyoto Encyclopedia of Genes and Genomes pathway classification10

determined the categories of gene function and protein pathways

represented in the transcriptome.

Coun

t

0

10000

20000

30000

40000

50000

200-300 300-500 500-1000 1000-2000 2000-3000 3000-50005000-10000 >10000

1832,844

8,25311,145

21,364

25,564

33,780

49,985 Length Distribution of Transcripts in Base Pairs

bp

Summary of Transcriptome DataTotal raw reads 470,813,940Total cleaned reads 470,812,874Length of reads(bp) 101Total length of reads 47.5 GbpTotal length of assembly 152 MbpNumber of Transcripts 153,118Transcript N50 2,271Mean length of transcript 996Median length of transcript 435Transcripts annotated 150,773Unique genes represented 72,334

Cellular Component

Num

ber o

f Gen

es

Molecular Function Biological Process

1

10

100

1000

10000

cell

cell

part

enve

lope

extra

cellu

lar r

egio

nex

trace

llula

r reg

ion

part

mac

rom

olec

ular

com

plex

mem

bran

e-en

clos

ed lu

men

orga

nelle

orga

nelle

par

tsy

naps

esy

naps

e pa

rtvi

ron

viro

n pa

rtsy

mpl

ast

antio

xida

ntau

xilli

ary

trans

port

prot

ein

bind

ing

cata

lytic

chem

oattr

acta

ntel

ectro

n ca

rrier

enzy

me

regu

lato

rm

etal

loch

aper

one

mol

ecul

ar tr

ansd

ucer

prot

ein

tag

stru

ctur

al m

olec

ule

trans

crip

tion

regu

lato

rtra

nsla

tion

regu

lato

rnu

trien

t res

evoi

rch

emor

epel

lant

trans

porte

r

anat

omic

al st

ruct

ure

form

atio

nbi

olog

ical

adh

esio

nbi

olog

ical

regu

latio

nce

ll ki

lling

cellu

lar c

ompo

nent

bio

gene

sis

cellu

lar c

ompo

nent

org

aniza

tion

cellu

lar p

roce

ssde

ath

deve

lopm

enta

l

esta

blish

men

t of l

ocal

izatio

ngr

owth

imm

une

syst

em p

roce

sslo

caliz

atio

nm

etab

olic

pro

cess

mul

tiorg

anism

pro

cess

mul

ticel

lula

r org

anism

al p

roce

sspi

gmen

tatio

nre

prod

uctio

nre

prod

uctiv

e pr

oces

sre

spon

se to

stim

ulus

rhyt

hmic

pro

cess

vira

l rep

rodu

ctio

nlo

com

otio

n

Gene Ontology Classifications

Results