Dr. Justin Schonfeld - Bioinformatics Applications

Post on 25-May-2015

Category:

Education

DESCRIPTION

Analysis of typical informatics workflow, extracting data, aligning data, identifying problems and uploading data to BOLD

TRANSCRIPT

Using BOLD Data in Bioinformatics Workflows

Dr. Justin Schonfeld
Biodiversity Institute of Ontario

DNA Barcodes

166 Full Eukaryotic Genomes
2,471 Metazoan Mitochondrial Genomes
1,444,076 Barcodes - ~118,000 species

DNA Barcodes represent an enormous resource for researchers of all types.

Applications

• Species Identification
• Taxonomy
• Building the Reference Library
• Ecology
• Proteomics
• Comparative Genomics
• Teaching
• Music

High level data flow

[Diagram: Museums, Private collections, Regulatory Agencies, Researchers → CCDB → BOLD → GenBank, Mirrors, Educators, Researchers, Regulatory Agencies]

Australian Museum

Typical Informatics Workflow

[Diagram: BOLD → Extract Data → Local Copy → Filter Data → Filtered Data → Align Data → Aligned Data → Identify Problematic Sequences → Cleaned Data → Analyze Data]

Extracting Data: BOLD Public

• Easy to use
• Flexible search tool
– Search by taxonomic name, geographic region, collector, etc.
– Example searches: “Hymenoptera”, “Lepidoptera Canada”

Extracting Data: BOLD Public

• Provides data in TSV, FASTA, and XML formats.

• Can select sequence data, trace files, specimen data, combined data.

Extracting Data: web services

• Provides data in TSV (tab-separated values) and XML formats

• Sequence data or full records

• Can be used to provide a complete dump of all public BOLD data: http://services.boldsystems.org/

Extracting Data: web services

• Working with the raw data allows for custom queries

• Not all fields are available as search terms in BOLD Public

• Requires scripting knowledge, or a lot of patience with Excel

• Example: All plants above 2000 ft, etc.
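A custom query like the elevation example can be sketched in a few lines of Python once a TSV export is available locally. This is a minimal sketch, not BOLD's own tooling: the column names `processid`, `species_name`, and `elev`, the assumption that elevation is stored in metres, and the sample records are all made up for illustration.

```python
import csv
import io

FEET_PER_METRE = 3.28084

def records_above(tsv_text, min_feet, elev_col="elev"):
    """Return rows whose elevation (assumed to be in metres) exceeds min_feet."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in rows:
        value = (row.get(elev_col) or "").strip()
        if not value:
            continue  # no elevation recorded for this specimen
        try:
            metres = float(value)
        except ValueError:
            continue  # elevation field is not a plain number
        if metres * FEET_PER_METRE > min_feet:
            kept.append(row)
    return kept

# Made-up records, for illustration only.
SAMPLE = (
    "processid\tspecies_name\telev\n"
    "EX001\tPlantus alpha\t850\n"
    "EX002\tPlantus beta\t120\n"
    "EX003\tPlantus gamma\t\n"
)

high = records_above(SAMPLE, 2000)
print([row["processid"] for row in high])
```

Records with a missing or non-numeric elevation are skipped rather than treated as zero, so they never pass the threshold by accident.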

Filter Data

• The Barcode data is collected from a wide variety of independent investigations

• High degree of taxonomic bias
• Tentative names
• Variable sequence quality
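A filtering pass over the local copy might look like the sketch below. The record fields (`species`, `sequence`), the length and ambiguity thresholds, and the rule for spotting tentative names are all assumptions chosen for illustration; suitable values depend on the study.

```python
import re

# Open-nomenclature markers ("sp.", "cf.", "aff.", "nr.") or digits in a name
# are treated here as signs of a tentative identification (an assumption).
TENTATIVE = re.compile(r"\b(sp|cf|aff|nr)\.|\d")

def is_tentative(name):
    """True for empty names and names that look provisional."""
    return not name.strip() or bool(TENTATIVE.search(name))

def ambiguous_fraction(seq):
    """Fraction of non-gap characters that are not A, C, G, or T."""
    bases = [c for c in seq.upper() if c != "-"]
    if not bases:
        return 1.0
    return sum(c not in "ACGT" for c in bases) / len(bases)

def keep(record, min_length=500, max_ambiguous=0.01):
    """Apply length, ambiguity, and name filters to one record."""
    seq = record["sequence"].replace("-", "")
    return (
        len(seq) >= min_length
        and ambiguous_fraction(seq) <= max_ambiguous
        and not is_tentative(record["species"])
    )

# Made-up records, for illustration only.
records = [
    {"species": "Apis mellifera", "sequence": "ACGT" * 150},
    {"species": "Apis sp. 1", "sequence": "ACGT" * 150},
    {"species": "Bombus terrestris", "sequence": "ACGT" * 100 + "N" * 200},
    {"species": "Bombus impatiens", "sequence": "ACGT" * 10},
]
filtered = [r for r in records if keep(r)]
print([r["species"] for r in filtered])
```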

Impact of Alignment

[Diagram: Alignment → Build Phylogenetic Trees, Nearest Neighbor Analysis, Clustering, Distance Matrices]
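To see why these downstream analyses depend on the alignment, consider a distance matrix built from aligned sequences. The sketch below uses the uncorrected p-distance and skips any column where either sequence has a gap or an ambiguous base; that choice of distance measure is an assumption, and the sequences are made up.

```python
def p_distance(a, b):
    """Proportion of differing sites among columns where both sequences have A, C, G, or T."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differing += x != y
    return differing / compared if compared else float("nan")

def distance_matrix(seqs):
    """Pairwise p-distances for a list of aligned sequences."""
    n = len(seqs)
    return [[p_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

# Made-up aligned sequences, for illustration only.
aligned = ["ACGTACGT", "ACGTACGA", "AC-TACGA"]
matrix = distance_matrix(aligned)
for row in matrix:
    print(["%.3f" % d for d in row])
```

Because every comparison is column by column, a single misplaced gap shifts which bases are compared and changes the distances, which then propagates into trees, clusters, and nearest-neighbor results.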

Impact of Alignment

Pairwise Sequence Alignment

MUSCLE Multiple Sequence Alignment

Aligning Animal Barcode Data

[Figure labels: CO1 Barcode; Short CO1; 3’ CO1; Full CO1 sequence (5’ to 3’) with the Barcode region marked]

Even a gene as straightforward as CO1 can provide alignment challenges.

Aligning Barcode Data

• Multiple Sequence Alignment
– Accurate
– Slow (a thousand sequences can take hours)
– Trouble with variable sequences
• Pairwise Sequence Alignment
– Fast (thousands of sequences in minutes)
– Inconsistent placement of indels
– Highly dependent on choosing the right reference
• Parameters
– Amino Acid vs Nucleotide
– Gap Penalty
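The pairwise approach can be illustrated with a textbook Needleman-Wunsch global alignment of one query against a reference. This is a minimal sketch with a linear gap penalty; the scoring values are arbitrary assumptions, and production aligners use more refined scoring (affine gaps, substitution matrices).

```python
def global_align(query, reference, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_query, aligned_reference, score).
    """
    n, m = len(query), len(reference)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align the two characters
                score[i - 1][j] + gap,    # gap in the reference
                score[i][j - 1] + gap,    # gap in the query
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_q, out_r = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if query[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_q.append(query[i - 1])
                out_r.append(reference[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_q.append(query[i - 1])
            out_r.append("-")
            i -= 1
        else:
            out_q.append("-")
            out_r.append(reference[j - 1])
            j -= 1
    return "".join(reversed(out_q)), "".join(reversed(out_r)), score[n][m]

aligned_query, aligned_ref, total = global_align("ACGT", "ACT")
print(aligned_query)
print(aligned_ref)
print(total)
```

When several alignments tie for the best score, the traceback order decides where the gap lands, which is one source of the inconsistent indel placement noted above. Changing `gap` or the reference changes the result as well.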

Uploading your alignment to BOLD

• Upload in FASTA format
• Edit sequence permission on the records
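Writing an alignment out as FASTA text takes only a few lines. The sketch below is generic: the identifiers are hypothetical, and which identifier BOLD expects in the header line is not stated here, so treat the header format as an assumption.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text, wrapping sequences at `width`."""
    lines = []
    for identifier, seq in records:
        lines.append(f">{identifier}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Hypothetical identifiers and made-up aligned sequences.
fasta_text = to_fasta([("EX001", "ACGT-ACGT"), ("EX002", "ACGTTACGT")], width=4)
print(fasta_text, end="")
```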

Identifying Problems

• Stop codons
– Automatically annotated for coding regions
– Even stop codons can be tricky
• Frame shifts
• Ambiguous characters
• Chimeric sequences
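One reason stop codons can be tricky is that the set of stop codons depends on the genetic code. The sketch below scans an unaligned sequence in a chosen reading frame under three NCBI translation tables (standard, vertebrate mitochondrial, invertebrate mitochondrial); the function name and the example sequence are made up for illustration.

```python
# Stop codons under three NCBI translation tables.
STOPS = {
    "standard": {"TAA", "TAG", "TGA"},               # table 1
    "vertebrate_mito": {"TAA", "TAG", "AGA", "AGG"},  # table 2
    "invertebrate_mito": {"TAA", "TAG"},              # table 5 (TGA codes for Trp)
}

def stop_positions(seq, frame=0, code="invertebrate_mito"):
    """Return 0-based start positions of in-frame stop codons in an unaligned sequence."""
    seq = seq.upper()
    stops = STOPS[code]
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]

# Made-up sequence: codons ATG TGA GGA TAA in frame 0.
example = "ATGTGAGGATAA"
print(stop_positions(example, code="standard"))           # [3, 9]
print(stop_positions(example, code="invertebrate_mito"))  # [9]
```

The same sequence shows an internal stop under the standard code but not under the invertebrate mitochondrial code, so applying the wrong table can flag a perfectly good insect barcode.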

Identifying Problems: Frame Shifts

• Frame-shifts in the middle of the sequence are disruptive and easy to spot

• Frame-shifts at the ends of the sequence are more challenging
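For frame shifts in the middle of a sequence, one simple heuristic on aligned coding data is to look for internal gap runs whose length is not a multiple of three. This is a sketch under that assumption: it only catches deletions relative to the alignment (insertions show up as gaps in the other sequences), and it deliberately ignores leading and trailing gaps, so the harder end-of-sequence cases are not covered.

```python
import re

def frameshift_gaps(aligned_seq):
    """Return (start, length) for internal gap runs whose length is not a multiple of 3."""
    first = re.search(r"[^-]", aligned_seq)
    if first is None:
        return []  # sequence is all gaps
    core_start = first.start()
    core_end = len(aligned_seq.rstrip("-"))
    hits = []
    for run in re.finditer(r"-+", aligned_seq[core_start:core_end]):
        length = len(run.group())
        if length % 3:
            hits.append((core_start + run.start(), length))
    return hits

# Made-up aligned sequences, for illustration only.
print(frameshift_gaps("---ATGGCC-TTAGGA---"))
print(frameshift_gaps("ATG---GCC"))
```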

Identifying Problems: Chimeric Sequences

• Identify change points
• Split the sequence at the point of discontinuity
• BLAST each part
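The change-point step can be sketched for the simplified case where the query and two candidate parent sequences are already aligned to the same length. The function below picks the split that maximises matches to the first reference on the left and to the second on the right; in practice the parents are not known in advance, so this is an illustration of the idea rather than a complete chimera detector, and all sequences are made up.

```python
def best_split(query, ref_a, ref_b):
    """Find the split maximising matches to ref_a on the left and ref_b on the right.

    Returns (split_index, matches). A split_index of 0 or len(query) means the
    query is explained best by a single reference, with no change point.
    """
    if not (len(query) == len(ref_a) == len(ref_b)):
        raise ValueError("sequences must be aligned to the same length")
    match_a = [q == a for q, a in zip(query, ref_a)]
    match_b = [q == b for q, b in zip(query, ref_b)]
    left = 0               # matches to ref_a in query[:k]
    right = sum(match_b)   # matches to ref_b in query[k:]
    best = (0, right)
    for k in range(1, len(query) + 1):
        left += match_a[k - 1]
        right -= match_b[k - 1]
        if left + right > best[1]:
            best = (k, left + right)
    return best

# Made-up aligned sequences, for illustration only.
ref_a = "AAAAAAAAAA"
ref_b = "CCCCCCCCCC"
query = "AAAAAACCCC"
split, matches = best_split(query, ref_a, ref_b)
left_part, right_part = query[:split], query[split:]
print(split, matches, left_part, right_part)
```

The two parts returned by the split are what would then be searched separately.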

[Figure labels: Hymenoptera; Hymenoptera Lepidoptera Chimera; Lepidoptera]

Cleaning Data: Updating BOLD

• BOLD is curated by the community
– Re-upload sequences
– Delete sequences
– Annotate sequences
– Flag sequences

[Diagram: BOLD → GenBank, Mirrors, Educators, Researchers, Regulatory Agencies]

Example Workflow: Occurrence of Indels

• Download public BOLD Hymenoptera records using web services
• Select sequences with full taxonomy
• Align sequences using MAFFT, MUSCLE, Transalign
• Select one representative per species
• Remove problematic sequences
• Tree
• Map sequences onto phylogeny
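The "one representative per species" step needs a selection rule, which is not specified here. The sketch below assumes the record with the most unambiguous bases is kept; the field names (`id`, `species`, `sequence`) and the records themselves are hypothetical.

```python
def one_per_species(records):
    """Keep, for each species name, the record with the most A/C/G/T bases."""
    def informative(rec):
        return sum(c in "ACGT" for c in rec["sequence"].upper())

    best = {}
    for rec in records:
        name = rec["species"]
        if name not in best or informative(rec) > informative(best[name]):
            best[name] = rec
    return list(best.values())

# Made-up records, for illustration only.
records = [
    {"id": "EX001", "species": "Apis mellifera", "sequence": "ACGTNNNN"},
    {"id": "EX002", "species": "Apis mellifera", "sequence": "ACGTACGT"},
    {"id": "EX003", "species": "Bombus terrestris", "sequence": "ACG-ACGT"},
]
representatives = one_per_species(records)
print([r["id"] for r in representatives])
```

Ties keep the first record seen, so the input order matters when several records are equally complete.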

Example Workflow: Code shifts

• Download public BOLD Hymenoptera records using web services (80,000 sequences)
• Align pairwise
• Scan sequences for code shifts
• Remove problematic sequences
• Analyze results

Acknowledgements

• Paul Hebert
• Sujeevan Ratnasingham
• The BOLD Team
