Large-scale genome projects
Libraries
Sequencing
Release
Assembly
Annotation
Closure
Strategy
• Sequencing DNA molecules in the Mb size range
• All strategies employ the same underlying principles:
Random Shotgun sequencing
[Figure: genomic DNA -> shearing/sonication -> subclone and sequence -> shotgun reads -> assembly -> contigs -> finishing (finishing reads) -> complete sequence]
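The shotgun principle can be illustrated with a small simulation. This is a minimal sketch and not part of the original slides: the function names, the fixed read length, and the uniform random sampling model are all assumptions made for illustration.

```python
import random

def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample fixed-length reads at uniformly random start positions.

    Assumption: real shearing gives a range of fragment sizes; a fixed
    read length keeps the sketch simple.
    """
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]

def fraction_covered(genome_len, reads):
    """Fraction of genome positions covered by at least one read."""
    covered = [False] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / genome_len

def mean_depth(genome_len, reads):
    """Average read depth: total read bases divided by genome length."""
    return sum(len(seq) for _, seq in reads) / genome_len
```

Even at a mean depth of several-fold, random sampling usually leaves some positions uncovered, which is why the flow ends with directed finishing reads.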
Nucleotide Database Growth
EMBL breakdown by organism
EMBL Release 65
Progress on Large Sequencing Projects
Strategies for sequencing
• How big can you go??
• Large-insert clones
• cosmids 30-40 kb
• BACs/PACs 50 - 100 kb
• Whole chromosomes
• Whole genomes
Genome size and sequencing strategies
[Figure: genome size on a log scale (log Mb, axis 0 to 4) for E. coli (4 Mb), S. cerevisiae (14 Mb), P. falciparum (30 Mb), C. elegans (100 Mb), D. melanogaster (170 Mb) and H. sapiens (3000 Mb), shown against the applicable strategies: whole genome shotgun (WGS), whole chromosome shotgun (WCS) and clone-by-clone]
Whole Genome Shotgun (WGS) with Clone ‘skims’
[Figure: genomic DNA -> shearing/sonication -> subclone and sequence -> shotgun reads -> assembly -> contigs -> finishing (finishing reads) -> complete sequence]
Strategies for sequencing
• Size and GC composition of genome
• Volume of data
• Ease of cloning
• Ease of sequencing
• Genome complexity
• dispersed repetitive sequence
• telomeres & centromeres
• Politics/Funding
Strategies: Clone by Clone
• Simple (0.5 - 2 K reads)
• Few problems with repeats
• Relatively simple informatics
• Scalability
• Quality of physical map
• Fingerprint / STS maps
• End sequencing
Strategies: Whole Chromosome shotgun (WCS)
• Requires chromosome isolation
• Moderate complexity (10’s K reads)
• Problems with repeats
• Complex informatics
• Inefficient in isolation
• Quality of physical map
• Skims of mapped clones
Strategies: Whole Genome shotgun (WGS)
• Moderate to High complexity (10-100’s K reads)
• Problems with repeats
• Complex informatics
• Quality of physical map
• Fingerprint map
• STS markers
• End-sequences
• Skims of mapped clones
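The read counts quoted for the three strategies follow from simple arithmetic: reads needed = coverage x target size / read length. The sketch below is illustrative only. The 8-fold coverage, the 500 bp read length and the 2 Mb chromosome are assumptions, not values from the slides; the cosmid, BAC and P. falciparum sizes are the ones quoted earlier.

```python
import math

def reads_needed(target_bp, coverage=8.0, read_len_bp=500):
    """Reads required so that reads * read_len = coverage * target size.

    coverage and read_len_bp are illustrative assumptions, not values
    taken from the slides.
    """
    return math.ceil(coverage * target_bp / read_len_bp)

# Cosmid, BAC and genome sizes come from the slides; the chromosome
# size is a hypothetical example.
targets_bp = {
    "cosmid (40 kb)": 40_000,
    "BAC (100 kb)": 100_000,
    "hypothetical 2 Mb chromosome": 2_000_000,
    "P. falciparum genome (30 Mb)": 30_000_000,
}

estimates = {name: reads_needed(size) for name, size in targets_bp.items()}
```

Under these assumptions a cosmid or BAC needs roughly 0.6 to 1.6 K reads, in line with the 0.5 - 2 K figure given for clone-by-clone projects, while a whole chromosome runs to tens of thousands of reads.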
Sequencing my genome
[Figure labels: Annotation, Finishing, Production, Politics, TIME, MONEY]
What do you get?
• Sequence
• incomplete vs. complete
• First-pass annotation
• Gene discovery
• Full annotation
• A starting point for research
DATA!!, DATA!!, and more DATA!!
Genome annotation is central to functional genomics
Gene Knockout
Expression Microarray
RNAi phenotypes
ORFeome based functional genomics
Sequencing
• Library construction
• Colony picking
• DNA preparation
• Sequencing reactions
• Electrophoresis
• Tracking/Base calling
Libraries
• Essentially Sub-cloning
• Generation of small insert libraries in a well characterised vector.
• Ease of propagation
• Ease of DNA purification
• e.g. pUC18, M13
Libraries - testing
• Simple concepts
• Insert/Vector ratio
• Real data
• Insert size
• Sequence ….
• Simple analysis
Sequence generation
• Pick colonies
• Template preparation
• Sequence reactions
• Standard terminator chemistry
• pUC libraries sequenced with forward and reverse primers
Sequence generation
• Electrophoresis of products
• Old style - slab gels, 32 > 64 > 96 lanes
• New style - capillary gels, 96 lanes
• Transfer of gel image to UNIX
• Sequencing machines use a slave Mac/PC
• Move data to centralised storage area for processing
Gel image processing
• Light-to-Dye estimation
• Lane tracking
• Lane editing
• Trace extraction
• Trace standardisation
• Mobility correction
• Background subtraction
Pre-processing
• Base calling using Phred
• modifies SCF file
• Quality clipping
• Vector clipping
• Sequencing vector
• Cloning vector
• Screen for contaminants
• Feature mark up (repeats/transposons)
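Phred quality values relate to error probability by Q = -10 log10(P), and quality clipping keeps only the reliable stretch of each read. The sketch below is a simplified stand-in, not Phred's own trimming code: the function names, the Q20 threshold and the maximum-subarray scan are assumptions chosen for illustration.

```python
def phred_to_error(q):
    """Error probability for a Phred quality value: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def quality_clip(quals, min_q=20):
    """Return (start, end) of the best high-quality stretch, end exclusive.

    Each base scores (Q - min_q); the clip is the contiguous segment with
    the highest total score (a maximum-subarray scan). Returns (0, 0) if
    no base scores above zero.
    """
    best_sum, best_start, best_end = 0, 0, 0
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        if run_sum <= 0:
            # A non-positive running total cannot help; restart here.
            run_sum, run_start = 0, i
        run_sum += q - min_q
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end
```

For example, a read whose qualities are low at both ends and high in the middle is clipped to the middle stretch; the bases outside the returned interval would be masked before assembly.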
Finishing
• Assembly: Process of turning raw single-pass reads into contiguous consensus sequence
• Closure: Process of ordering and merging consensus sequences into a single contiguous sequence
• Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb
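The error-rate part of the finished standard can be checked numerically if the consensus carries Phred-scaled quality values. That is an assumption of this sketch, as are the function names; the check covers only the "less than 1 error per 10 kb" criterion, not the strand and clone requirements.

```python
def expected_errors(quals):
    """Expected number of wrong bases: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in quals)

def meets_finished_standard(quals, max_errors_per_10kb=1.0):
    """True if the estimated error rate is below 1 error per 10 kb."""
    if not quals:
        return False
    rate_per_10kb = expected_errors(quals) / len(quals) * 10_000
    return rate_per_10kb < max_errors_per_10kb
```

One error per 10 kb corresponds to an average error probability of 10^-4, that is, a consensus quality of about Q40.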
Genome Assembly
• Pre-assembly
• Assembly
• Automated appraisal
• Manual review
Pre-Assembly
• Convert to CAF format
• flatfile text format
• choice of assembler
• choice of post-assembly modules
• choice of assembly editor
www.sanger.ac.uk/Software/CAF
Assembly
• Assemble using Phrap
• Read fasta & quality scores from CAF file
• Merge existing Phrap .ace file as necessary
• Adjust clipping
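To show what an assembler does in principle, here is a toy greedy overlap-merge routine. It is not Phrap's algorithm and ignores quality scores entirely; it assumes error-free reads on a single strand with no repeats, and all names are hypothetical.

```python
def overlap_len(a, b, min_len):
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Assumptions: error-free reads, all on the same strand, no repeats.
    Returns the contigs left when no overlap >= min_len remains.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Discard sequences wholly contained in another sequence.
        contigs = [c for c in contigs
                   if not any(c != d and c in d for d in contigs)]
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
```

Real assemblers differ mainly in how they tolerate sequencing errors and repeats, which is where the quality scores read from the CAF file come in.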
Assembly appraisal
• auto-edit
• removes 70% of read discrepancies
• Remove cloning vector
• Mark up sequence features
• finish
• Identify low-quality regions
• Cover using ‘re-runs’ and ‘long-runs’
• Compare with current databases
• plate contamination
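Identifying low-quality regions amounts to scanning the consensus quality values for runs below a threshold. The sketch below is a minimal illustration; the function name, the Q30 threshold and the minimum run length are assumptions, not the settings of the finish program.

```python
def low_quality_regions(quals, min_q=30, min_len=1):
    """Return [(start, end), ...] runs where quality < min_q (end exclusive)."""
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q < min_q:
            if start is None:
                start = i  # a new low-quality run begins
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the consensus.
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))
    return regions
```

Each reported interval is a candidate for a re-run or long-run read.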
Manual Assembly appraisal
• Use a sequence editor (GAP/consed)
• Tools to identify internal joins
• Tools to identify and import data from overlapping projects
• Tools to check failed or mis-assembled reads for inclusion in project
Manual editing
• Sanger uses 100% edit strategy
• Where additional data is required:
• Check clipping
• Additional sequencing
• Template / Primer / Chemistry
• Assemble new data into project
• GAP4 Auto-assemble
• Repeat whole process
Manual Quality Checks
• Force annotation tag consistency
• All unedited data is re-assembled using Phrap
• All high-quality discrepancies are reviewed
• Confirm restriction digest (clones)
• Check for inverted repeats
• Manually check:
• Areas of high-density edits
• Areas with no supporting unedited data
• Areas of low read coverage
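Areas of low read coverage can be flagged automatically before the manual check. This sketch assumes read placements are available as (start, end) pairs on the contig, end exclusive; the function names and the minimum depth of 2 are illustrative assumptions.

```python
def depth_profile(contig_len, placements):
    """Per-base read depth from (start, end) read placements, end exclusive."""
    delta = [0] * (contig_len + 1)
    for start, end in placements:
        delta[start] += 1   # a read begins covering here
        delta[end] -= 1     # and stops covering here
    depth, running = [], 0
    for d in delta[:contig_len]:
        running += d
        depth.append(running)
    return depth

def low_coverage_positions(depth, min_depth=2):
    """Positions covered by fewer than min_depth reads."""
    return [i for i, d in enumerate(depth) if d < min_depth]
```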
Gap closure
• Read pairs
• PCR reactions (long-range / combinatorial)
• Small-insert libraries
• Transposon-insertion libraries
Gap closure - contig ordering
• Read pair consistency
• STS mapping
• Physical mapping
• Genetic mapping
• Optical mapping
• Large-insert clone
• skims
• end-sequencing
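Read pair consistency can be used to order contigs: when the forward and reverse reads of one subclone land in different contigs, those contigs are probably neighbours. The sketch below only counts such links; the data layout and names are hypothetical, and orientation and insert-size checks are left out.

```python
from collections import Counter

def contig_links(read_to_contig, pairs):
    """Count read pairs whose two mates were placed in different contigs.

    read_to_contig: dict mapping read name -> contig name
    pairs: iterable of (forward_read, reverse_read) names from one subclone
    Returns a Counter keyed by a sorted (contig, contig) tuple.
    """
    links = Counter()
    for fwd, rev in pairs:
        c1 = read_to_contig.get(fwd)
        c2 = read_to_contig.get(rev)
        if c1 is None or c2 is None or c1 == c2:
            continue  # unplaced mate, or pair falls inside one contig
        links[tuple(sorted((c1, c2)))] += 1
    return links
```

Contig pairs supported by several independent links are the strongest candidates for gap-spanning PCR or for sequencing the bridging subclone.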
Annotation
• DNA features (repeats/similarities)
• Gene finding
• Peptide features
• Initial role assignment
• Others: regulatory regions
Annotation of eukaryotic genomes
[Figure: genomic DNA -> (transcription) -> unprocessed RNA -> (RNA processing) -> mature mRNA with poly-A tail -> (translation) -> nascent polypeptide -> (folding) -> active enzyme with a function (reactant A -> product B). Annotation approaches shown alongside: ab initio gene prediction, comparative gene prediction, functional identification]
Genome analysis overview: C.elegans
DNA features
• Similarity features
• mapping repeats
• simple tandem and inverted
• repeat families
• mapping DNA similarities
• EST/mRNAs in eukaryotes
• Duplications
• RNAs
• mapping peptide similarities
• protein similarities
Gene finding
• ORF finding (simple but messy)
• ab initio prediction
• Measures of codon bias
• Simple statistical frequencies
• Comparative prediction
• Using similarity data
• Using cross-species similarities
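ORF finding, the simplest of the approaches listed, can be sketched in a few lines. This is an illustrative six-frame scan under stated assumptions: ATG as the only start codon, the standard three stop codons, no introns, and a hypothetical minimum length. It also shows why the method is "messy": every qualifying ATG...stop stretch is reported, whether or not it is a real gene.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf_seq) for ATG...stop ORFs in six frames.

    Coordinates are 0-based, end exclusive, on the strand scanned (for "-",
    they refer to the reverse-complemented sequence). min_codons counts the
    start codon but not the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs
```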
Peptide features
• Peptide features
• low-complexity regions
• trans-membrane regions
• structural information (coiled-coil)
• Similarities and alignments
• Protein families (InterPro/COGs)
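Low-complexity regions can be flagged by measuring residue diversity in a sliding window. The sketch below uses Shannon entropy as one possible measure; the function names, the window size and the entropy threshold are illustrative assumptions, not the settings of any particular annotation pipeline.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a string."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def low_complexity_windows(peptide, window=12, max_entropy=2.2):
    """Start positions of windows whose residue entropy is <= max_entropy bits."""
    return [i for i in range(len(peptide) - window + 1)
            if shannon_entropy(peptide[i:i + window]) <= max_entropy]
```

A homopolymer run scores 0 bits, while a window of twelve different residues scores about 3.6 bits, so only the former is reported.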
Initial role assignment
• Simple attempt to describe the functional identity of a peptide
• Uses data from:
• peptide similarities
• protein families
• Vital for data mining
• Large number of predicted genes remain hypothetical or unknown
Other regulatory features
• Ribosomal binding sites
• Promoter regions
Data Release
• DNA release
• Unfinished
• Finished
• Nucleotide databases
• GenBank/EMBL/DDBJ
• Peptide databases
• SWISS-PROT/TrEMBL/GenPept
• Others