2013 pag-poultry-workshop

C. Titus Brown

Asst Prof, CSE and Microbiology; BEACON NSF STC

Michigan State University

[email protected]

Evaluating and improving the chick genome & transcriptome

Acknowledgements

This is joint work with Hans Cheng (USDA ADOL) and Jerry Dodgson (MSU).

Likit Preeyanon (MSU) and Alexis Black Pyrkosz (ADOL) did the work.

All of the software discussed in this talk is available.

This work was primarily supported by the USDA NIFA through a grant to me.

Simulations show that incomplete gene reference => inaccurate differential expression from mRNAseq

Alexis Black Pyrkosz
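A toy illustration of the claim above (this is not the simulation from the talk; the exon names and read counts are made up): if the reference is missing an exon whose usage changes between conditions, the reads falling in that exon are never counted, and the observed fold change drifts away from the true one.

```python
def gene_count(exon_reads, annotated_exons):
    """Sum reads over the exons that the reference annotation knows about."""
    return sum(n for exon, n in exon_reads.items() if exon in annotated_exons)

# Hypothetical read counts per exon for one gene in two conditions.
control = {"e1": 100, "e2": 100, "e3": 0}
treated = {"e1": 100, "e2": 100, "e3": 200}

complete_ref = {"e1", "e2", "e3"}
incomplete_ref = {"e1", "e2"}  # e3 is missing from the gene model

true_fc = gene_count(treated, complete_ref) / gene_count(control, complete_ref)
observed_fc = gene_count(treated, incomplete_ref) / gene_count(control, incomplete_ref)
```

With the complete model the gene doubles in expression; with the incomplete model it appears unchanged.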

Existing chick gene models lack exons, isoforms

*This gene contains at least 4 isoforms.

Our data

Models

Likit Preeyanon

(Exon detection is pretty good.)

Likit Preeyanon

Different approaches to gene set prediction yield distinct splice junction predictions

> 95% of the assembly-based splice junctions are supported by 4 or more independent reads.

Likit Preeyanon
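A minimal sketch of how a support figure like this could be computed, assuming each spliced read has already been reduced to a (chromosome, donor, acceptor) tuple and that duplicate reads were removed upstream. The function name and the input format are illustrative assumptions, not part of the pipeline described in the talk.

```python
from collections import Counter

def supported_fraction(read_junctions, min_reads=4):
    """Fraction of distinct junctions seen in at least `min_reads` reads.

    read_junctions: iterable of (chrom, donor, acceptor), one entry per spliced read.
    """
    support = Counter(read_junctions)
    if not support:
        return 0.0
    well_supported = sum(1 for n in support.values() if n >= min_reads)
    return well_supported / len(support)

# Hypothetical data: three junctions supported by 5, 4 and 1 reads.
reads = [("chr1", 100, 200)] * 5 + [("chr1", 300, 400)] * 4 + [("chr2", 50, 90)]
fraction = supported_fraction(reads)
```

Here two of the three junctions reach the threshold of 4 reads.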

mRNAseq analysis with a combined de novo and genome-based approach.

Likit Preeyanon

We can produce combined gene models.

Cufflinks (ref based) + de novo assembly + known mRNA

Gene Model Summary (note: spleen mRNAseq)

Method                     Genes     Transcripts
Global Assembly            14,832    32,311
Local Assembly             15,297    23,028
Global + Local Assembly    15,934    46,797

*Number of genes and transcripts might be overestimated due to incomplete assembly and spurious splice junctions.

Cross-validation with technical replicates

Dataset              Single-end mapped      Single-end unmapped    Paired-end mapped      Paired-end unmapped
Line 6 uninfected    18,375,966 (77.93%)    5,203,586 (22.07%)     21,598,218 (64.16%)    12,065,659 (35.84%)
Line 6 infected      17,160,695 (73.18%)    6,288,286 (26.82%)     15,274,638 (63.89%)    8,633,855 (36.11%)
Line 7 uninfected    18,130,072 (75.77%)    5,795,737 (24.22%)     20,961,033 (63.67%)    11,960,299 (36.33%)
Line 7 infected      19,912,046 (78.51%)    5,450,521 (21.49%)     22,485,833 (65.22%)    11,992,002 (34.78%)

Single-end reads were used to generate gene models; paired-end data were used as a technical replicate for cross-validation.
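The percentages in the mapping table follow directly from the read counts. A small sketch of the arithmetic, using the Line 6 uninfected counts from the table (the helper name is just for illustration):

```python
def mapped_percent(mapped, unmapped):
    """Percentage of reads that mapped, rounded to two decimals."""
    return round(100.0 * mapped / (mapped + unmapped), 2)

# Line 6 uninfected, counts taken from the table above.
single_end = mapped_percent(18_375_966, 5_203_586)
paired_end = mapped_percent(21_598_218, 12_065_659)
```

The same call can be used to check the other rows.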

Gene Modeler Pipeline (“gimme”)

Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, & iteratively combine predictions.

Construct gene models

Remove redundant sequences

Predict strands and ORFs

Likit Preeyanon
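To make the "merge, then remove redundancy" idea concrete, here is a deliberately simplified sketch. It is not gimme's actual algorithm: it assumes transcripts are already mapped to the genome as lists of exon intervals, ignores strand, and groups transcripts into one gene model whenever their genomic spans overlap. All names are hypothetical.

```python
def build_gene_models(transcripts):
    """Group mapped transcripts into gene models.

    transcripts: list of (chrom, exons), where exons is a list of (start, end).
    """
    # Remove redundant transcripts: identical chromosome and exon structure.
    unique = sorted({(chrom, tuple(sorted(exons))) for chrom, exons in transcripts})

    genes = []
    for chrom, exons in unique:
        start = exons[0][0]
        end = max(e for _, e in exons)
        last = genes[-1] if genes else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the current gene model: add as another isoform.
            last["transcripts"].append(exons)
            last["end"] = max(last["end"], end)
        else:
            genes.append({"chrom": chrom, "start": start, "end": end,
                          "transcripts": [exons]})
    return genes

# Hypothetical input: one duplicate, two overlapping isoforms, two separate loci.
mapped = [
    ("chr1", [(100, 200), (300, 400)]),
    ("chr1", [(100, 200), (300, 400)]),
    ("chr1", [(150, 200), (500, 600)]),
    ("chr1", [(1000, 1100)]),
    ("chr2", [(100, 200)]),
]
genes = build_gene_models(mapped)
```

Because existing gene predictions can be expressed in the same (chrom, exons) form, they could be fed in alongside assembled transcripts, which is the spirit of the iterative combination described above.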

Next problem: chick reference!

We like using the reference genome to scaffold RNAseq contigs; purely de novo RNAseq assembly is messy.

Genomes are also useful for other things, we hear.

Problems:

Poor sensitivity: the chick genome is missing a substantial number of genes from microchromosomes:
  723 genes from HSA19q missing from chicken galGal4.
  ESTs and RNAseq transcripts for many or most.

Gaps
  9900 gaps on ordered chromosomes
  21k gaps on chr-aligned but low-confidence/unaligned

Over-collapsed tandem dups and under-collapsed het
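Gap counts like the ones above can be tallied from an assembly, assuming (as is the common convention, though not stated in the talk) that gaps are written as runs of N in the sequence. A minimal sketch with a hypothetical helper:

```python
import re

def count_gaps(sequence, min_len=1):
    """Count runs of N (either case) of at least `min_len` bases."""
    return sum(1 for m in re.finditer(r"[Nn]+", sequence) if len(m.group()) >= min_len)

# Hypothetical scaffold with two gaps.
n_gaps = count_gaps("ACGTNNNNACGTnnACGT")
```

Raising `min_len` would skip isolated ambiguous bases and count only longer gaps.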

Sensitivity – where is the problem?

Are microchromosomes hard to sequence or is microchromosomal sequence hard to assemble?

Sequences that simply don’t show up in the data are hard to include in the assembly…
  Unclonable (Sanger)
  Strong GC or AT bias

Sequences with biased (generally low) coverage are often discarded by assemblers.

Can we “even out” coverage? (Digital normalization)

If you have two loci, or two mRNA species, with uneven coverage, can you remove the extra coverage?
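A minimal sketch of the digital normalization idea: stream through the reads, estimate each read's coverage as the median count of its k-mers among reads kept so far, and keep the read only if that estimate is below a cutoff. This toy version uses an exact dictionary and ignores reverse complements; a production implementation would use a memory-efficient probabilistic counter. The defaults for `k` and `cutoff` are illustrative assumptions.

```python
from statistics import median

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = {}
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        estimated_coverage = median([counts.get(km, 0) for km in read_kmers])
        if estimated_coverage < cutoff:
            kept.append(read)
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept

# Hypothetical data: one highly covered sequence, one poorly covered sequence.
reads = ["ACGTACGTAC"] * 50 + ["TTGACCATGG"] * 3
kept = digital_normalization(reads, k=4, cutoff=5)
```

The highly covered sequence is cut down to a few copies while every read from the low-coverage sequence survives, which is the "evening out" asked about above.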

Coverage before digital normalization:

(MD amplified)

Coverage after digital normalization:

Normalizes coverage

Discards redundancy

Eliminates majority of errors

Scales assembly dramatically.

Assembly is 98% identical.

Prelim results from digital normalization

Reassembled chick genome contigs from 70x Illumina -> normalized reads in ~24 hours.

Obtained 40 Mbp of assembled contigs that were not present in galGal4.

Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick spleen mRNAseq.

Bioinformatics remedies may help but are probably not sufficient.

Likit Preeyanon

Can we improve the assembly?

Longer reads!

Long reads can span repeats and heterozygous regions

[Figure labels: Repeat copy 1, Repeat copy 2; Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4]
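Why read length matters here can be put as a simple interval check: a read resolves a repeat only if it covers the whole repeat plus some unique flanking sequence on both sides. A sketch, where the `min_anchor` value and the coordinates are illustrative assumptions:

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=100):
    """True if the read covers the repeat with at least `min_anchor` unique bases on each side."""
    return (read_start <= repeat_start - min_anchor
            and read_end >= repeat_end + min_anchor)

# Hypothetical 7 kbp repeat at positions 2,000-9,000.
long_read_spans = spans_repeat(0, 11_400, 2_000, 9_000)
short_read_spans = spans_repeat(0, 5_000, 2_000, 9_000)
```

A read that ends inside the repeat cannot tell the two copies apart, which is why short reads leave such regions collapsed or fragmented.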

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

PacBio: first results (cod/salmon)

Raw reads


Cod: PacBio results

Mapping to the published genome

[Figure: subreads of 11.4 kbp, 10.6 kbp, and 10.9 kbp]


Need to combine Illumina + PacBio still.

[Figure: 2.7x + 23x; 24 cpus, 4.5 days, 100 Gb RAM]

Alignments of at least 1kb to cod published assembly

[Figure panels: Raw reads; Error-corrected reads]

P_errorCorrection pipeline from

93% of reads recovered


Concluding thoughts/comments

Gene models and reference genome both need work.

This is going to be a continuing process…

Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), and Jerry Dodgson (MSU), proposing to apply PacBio sequencing and digital normalization to improve the chick genome and regularly integrate community improvements; should be a generalizable approach.

Questions? Contact me at: [email protected]
