Kerstin Howe, Mario Caccamo, Ian Sealy
The Zebrafish Genome Sequencing Project
Bioinformatics resources
outline
• clone mapping, sequencing and manual annotation in Vega
• genome assemblies and automated annotation in Ensembl
• integrated ZF-Models data and tools
Clone mapping and sequencing
mapping
• 2 BAC Tuebingen libraries
• 1 BAC and 1 cosmid library from single Tuebingen double-haploid fish
• end sequencing, RH mapping, fingerprinting
• pieced together according to fingerprints, marker mapping, sequence alignment
• currently ~ 2500 ctgs
Clone mapping and sequencing
sequencing pipeline
• select clones based on position in fpc contig
• subcloning
• sequencing
• automatic assembly/pre-finishing (back to sequencing if necessary)
• finishing
• QC
• automated analysis pipeline
• manual annotation
• submission to EMBL
unfinished sequence, finished sequence
automated analysis pipeline
• RepeatMasker
• CpG island prediction
• Genscan
• FGENESH
• halfwise (Pfam)
• EPCR
• Blast (ESTs, cDNAs, proteins)
manual annotation
• gene structures
• remarks (gene names, function, similarities)
• other features
otter
• MySQL database in 'ensembl style'
• AceDB or Apollo front end
• open to users from the 'outside'
EMBL
Manual annotation
annotation policy
• follows guidelines for human annotation (havana team, Sanger Institute)
• no "guesses", annotations solely based on supporting evidence
• annotation of: CDSs and UTRs / transcripts, splice variants, pseudogenes, poly A features, transposons, repeats
• approved nomenclature (SI:clone.number)
• collaboration with ZFIN
existing ZFIN records are reported
ZFIN provides new records for newly found genes
[Figure: annotation view with tracks for DNA, repeats, CpG island, Genscan, FGENESH, proteins, ESTs and mRNAs]
Manual annotation
vega.sanger.ac.uk
Vega: contigview
Vega: geneview
www.sanger.ac.uk/Projects/D_rerio
when to use what
go to vega.sanger.ac.uk if you need
• highly reliable sequence
• highly reliable annotation (with your input)
• ‘your gene’ stable over time (TILLING)
go to www.ensembl.org if you need
• the whole genome
• comparative data
• ZF-Models microarray or insertional mutagenesis data
• complicated searches (BioMart)
Zebrafish Genome Project
[Diagram: two strands feed the assembly release (Zv5). Clone mapping and sequencing: clone libraries → map (fpc ctg, tile path of BACs) → sequencing → (un)finished clones, ~ 8,000 finished clones (~1 Gb). Whole genome shotgun sequencing: WGS reads → WGS assembly (contig, supercontig). Integration, using markers (T51), combines clones+ctgs with the WGS contigs; further labels: finish clone, 1.63 Gb. The release is followed by automatic annotation and manual annotation.]
WGS assembly
Phusion assembler - High Performance Assembly Group (Zemin Ning et al.)
[Diagram: reads → group reads → contigs (A, B, C; assembled with phrap) → supercontigs (contigs ordered by the read-pair tracker). A gap between contigs in a supercontig is written as a run of Ns: NNNNNNNN]
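The supercontig step in the diagram can be illustrated with a minimal sketch. It assumes the contig order and gap sizes are already known (in a real assembly they would be estimated from read pairs); the function name `join_contigs` and the toy sequences are hypothetical and are not part of Phusion.

```python
def join_contigs(contigs, gap_sizes):
    """Join ordered contigs into one supercontig, padding each gap with Ns.

    contigs   -- contig sequences, already ordered and oriented
    gap_sizes -- one gap length per pair of neighbouring contigs
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Toy contigs A, B, C with 8-base gaps, as in the NNNNNNNN label on the slide.
supercontig = join_contigs(["ACGTAC", "GGCATT", "TTAGCA"], [8, 8])
print(supercontig)
```

Keeping the gaps as explicit N runs preserves the estimated distance between contigs, so coordinates on the supercontig stay roughly meaningful even where no sequence is available.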
Read grouping
• k-mer word hashing
continuous base hash - k=12
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
gap hash k=12 (4x3) - dealing with variation
ATGGCGTGCAGTCCATGTTCGGATCA
ATGG CGT GCAG TCC ATGT
TGGC GTG CAGT CCA TGTT
GGCG TGC AGTC CAT GTTC
GCGT GCA GTCC ATG TTCG
• word distribution
[Plot: frequency against k-mer occurrence; annotations: ~7, repeats, seq. errors]
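The two hashing schemes and the word distribution can be sketched in a few lines of Python. This is an illustration only, not the Phusion implementation. It assumes that "4x3" means three sampled blocks of 4 bases separated by 3 skipped bases, which is consistent with the 4-3-4-3-4 chunks in the figure; the function names are hypothetical.

```python
from collections import Counter


def continuous_words(seq, k=12):
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_words(seq, block=4, gap=3, blocks=3):
    """Words built from `blocks` blocks of `block` bases, skipping `gap` bases between them."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap  # 18 bases for the 4x3 layout
    words = []
    for i in range(len(seq) - span + 1):
        words.append("".join(seq[i + j * step:i + j * step + block] for j in range(blocks)))
    return words


def word_occurrence(reads, hasher):
    """Count how often each word occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(hasher(read))
    return counts


seq = "ATGGCGTGCAGTCCATGTTCGGATCA"       # example sequence from the slide
variant = seq[:5] + "A" + seq[6:]         # same sequence with one substituted base

print(continuous_words(seq)[0], continuous_words(variant)[0])  # words differ
print(gapped_words(seq)[0], gapped_words(variant)[0])          # words identical
counts = word_occurrence([seq, seq], continuous_words)
```

The substituted base falls into a skipped position of the first gapped word, so that word is unchanged. This is one way a gapped hash can still group reads that differ by a single base. The `Counter` returned by `word_occurrence` holds the raw data behind a word distribution plot.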
Zebrafish Genome Project
[Diagram: clone mapping and sequencing (clone libraries → map → sequencing → (un)finished clones, ~ 7,000 finished clones (~1 Gb)) and whole genome shotgun sequencing (WGS reads → WGS assembly) are combined by integration, using markers (T51), into the assembly release (Zv5), followed by automatic annotation and manual annotation.]
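Integration
[Diagram: an fpc contig of BACs (BX005049.6, BX005057.8, BX005153, BX005123.6; a second row shows the order BX005153, BX005057.8, BX005049.6, BX005123.6) is shown against a WGS supercontig, with marker, cDNA and bacends placements. Scaffold labels: Zv5 scaffoldn, Zv5 scaffoldn.1, Zv5 scaffoldn.3, Zv5 scaffoldn.5, Zv5 scaffoldn.7]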
Assemblies
assembly                    Zv5                  Zv4                  Zv3                  Zv2
release date                27.05.05             12.07.04             27.11.03             03.04.03
total length [bp]           1,630,306,866        1,592,025,686        1,459,115,486        1,452,210,772
scaffolds                   16,214               21,333               58,339               83,470
finished clones             4,519 (699 Mb)       2,828 (443 Mb)       1,502 (263 Mb)       -
scaffolds in chr 1-25       1,749                1,892                1,490                -
scaffolds in fpc contigs    265 (chrU)           694 (chrU)           1,842                5,677
NA scaffolds                14,676               18,747               54,798               77,793
sum(length) chr 1-25 [bp]   1,200,129,620 (73%)  1,097,507,810 (69%)  718,270,423 (49%)    -
sum(length) ctgs            183,993,739 (11%)    176,222,396 (11%)    365,271,659 (25%)    1,143,459,008
sum(length) NAs             246,183,507 (16%)    318,295,480 (20%)    335,615,307 (23%)    308,751,764
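The percentage columns are shares of the total assembly length. A small sketch of that arithmetic for Zv5, using only the figures from the table; the variable and function names are illustrative, and the slide's own rounding convention is not known.

```python
# Zv5 figures copied from the table above (lengths in bp).
ZV5_TOTAL_BP = 1_630_306_866
ZV5_PARTS_BP = {
    "chr 1-25": 1_200_129_620,
    "ctgs": 183_993_739,
    "NAs": 246_183_507,
}


def shares(parts, total):
    """Fraction of the total length contributed by each category."""
    return {name: length / total for name, length in parts.items()}


zv5_shares = shares(ZV5_PARTS_BP, ZV5_TOTAL_BP)
for name, share in zv5_shares.items():
    print(f"{name}: {share:.1%}")

# Consistency check: the three categories should account for the whole assembly.
print("parts add up to total:", sum(ZV5_PARTS_BP.values()) == ZV5_TOTAL_BP)
```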
Automatic Annotation
• Zebrafish proteins → Genewise → Genewise genes
• Other proteins → Genewise → Genewise genes
• Zebrafish cDNAs → Exonerate → aligned cDNAs
• Genewise genes + aligned cDNAs → Genewise genes with UTRs
• Genewise genes with UTRs + supported ab initio (optional) → Genebuilder → final set
• Zebrafish ESTs → Exonerate → aligned ESTs → ClusterMerge → Ensembl EST genes
Ensembl: Contigview
Ensembl: Geneview
Searching Ensembl: BioMart
• start
• filter
• output
Dos and Don'ts
go elsewhere (Ensembl) if you
• want to know about the whole genome
• need comparative data
• need ZF-Models microarray or insertional mutagenesis data
• need to do complicated searches
go to Vega if you
• need highly reliable sequence
• need highly reliable annotation
• need ‘your gene’ stable over time (TILLING)
DAS (Distributed Annotation System)
[Diagram: a genome browser acting as DAS client combines the reference sequence from local storage with annotation delivered as XML by several DAS servers, each with its own remote storage.]
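What a DAS client does with the XML it receives can be sketched as follows. The sample response is hand-written and only modelled on the DAS 1.x features format (DASGFF, SEGMENT, FEATURE elements); the URL, feature ids and coordinates are hypothetical, and the element names should be checked against the DAS specification before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modelled on a DAS 1.x "features" response (all values hypothetical).
SAMPLE_RESPONSE = """
<DASGFF>
  <GFF version="1.0" href="http://example.org/das/zebrafish/features?segment=1:1000,2000">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feature_1" label="hypothetical gene">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>1800</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
      <FEATURE id="feature_2" label="hypothetical repeat">
        <TYPE id="repeat">repeat</TYPE>
        <START>1500</START>
        <END>1550</END>
        <ORIENTATION>-</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_features(xml_text):
    """Extract (segment, id, type, start, end, strand) records from a DASGFF document."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.findall("FEATURE"):
            records.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "strand": feature.findtext("ORIENTATION"),
            })
    return records


features = parse_features(SAMPLE_RESPONSE)
for record in features:
    print(record)
```

A browser would merge such records from several servers onto the same reference coordinates, which is why each record keeps its segment id alongside start and end.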
SNPs and Indels
Ensembl releases
              Zv5     Zv4     Zv3     Zv2     Human   Fugu    Tetraodon
genes         22,877  23,526  22,409  20,062  24,194  22,339  28,005
transcripts   32,143  32,071  30,783  26,587  35,845  22,102  28,005