ECCMID 2015 Meet-the-Expert: Bioinformatics tools

What bioinformatic tools should I use for analysis of high-throughput sequencing data for molecular diagnostics? Nick Loman

Upload: nick-loman

Post on 16-Jul-2015


What bioinformatic tools should I use for analysis of high-throughput sequencing data for molecular diagnostics?

Nick Loman

Reference-based approach

• Read QC: FastQC, Qualimap

• Adaptor/quality trimming: Trimmomatic

• Species ID: BLAST, Metaphlan, MOCAT

• Sample QC: Blobology, Kraken, BLAST

• Alignment: BWA

• Variant calling: Samtools/VarScan, GATK

• SNP extraction & filter: Custom script, snippy, SnpEff, BRESEQ

• Recombination filtering: Gubbins, ClonalFrameML

• Tree building: FastTree, RAxML

• MLST/Antibiogram: SRST2

De novo approach

• Assembly: Velvet, SPAdes

• MLST/Antibiogram: mlst, Abricate

• Annotation: Prokka

• Tree building: Harvest

• Population genomics: BIGSdb, Phyloviz

• Pan-genome: LS-BSR

FastQC

• What: Analyse read-level sequence quality.

• Why: Detect serious problems in read quality that might affect downstream analysis.

• Where: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Qualimap

• What: Analyse insert size distribution

• Why: Determine whether sequencing has been effective, particularly for de novo assembly, and whether adaptor trimming is needed

• Where: http://qualimap.bioinfo.cipf.es/

Trimmomatic

• What: One of several million read trimmers

• Why: To remove sequencing adaptors, which may influence the results of de novo assembly

• Where: http://www.usadellab.org/cms/?page=trimmomatic
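To make the idea of adaptor and quality trimming concrete, here is a deliberately naive sketch. It is not Trimmomatic's algorithm (which tolerates mismatches and handles paired reads); it assumes an exact adaptor match, Phred+33 qualities, and an illustrative quality cut-off of 20. The function names are hypothetical.

```python
def clip_adaptor(seq, qual, adaptor):
    """Cut the read at the first exact occurrence of the adaptor sequence."""
    idx = seq.find(adaptor)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_trailing(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # Example adaptor prefix; substitute the adaptor used in your library prep.
    adaptor = "AGATCGGAAGAGC"
    seq, qual = clip_adaptor("ACGTAGATCGGAAGAGC", "I" * 17, adaptor)
    print(trim_trailing(seq, qual))
```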

Species ID: BLAST

• What: Only the most famous bioinformatics algorithm ever made

• Why: A few random BLAST searches will reveal much important information about your data before you start on a pipeline analysis

• Where: http://ncbi.nlm.nih.gov/BLAST

Species ID: Metaphlan

• What: Designed for metagenomics, this algorithm will find “taxon-defining” genes to identify what species are in a sample

• Why: Check for extent of sample contamination, give an accurate species ID for unknown samples

• Where: https://bitbucket.org/biobakery/metaphlan2

Species ID: Kraken

• What: Similar to Metaphlan but even faster and with a more complete database

• Why: Check for extent of sample contamination, give an accurate species ID for unknown samples

• Where: https://ccb.jhu.edu/software/kraken/
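Kraken classifies reads by looking up exact k-mer matches in a reference database. The toy sketch below illustrates only that lookup-and-vote idea: it ignores Kraken's taxonomy tree and lowest-common-ancestor logic, uses a tiny k, and all names and sequences are made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon with the most taxon-specific k-mer hits, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:  # only count k-mers unique to one taxon
            votes[next(iter(taxa))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical reference sequences, far shorter than real genomes.
REFERENCES = {"taxon_a": "ACGTACGTTTGACCA", "taxon_b": "GGCATTCAGGCTTAA"}

if __name__ == "__main__":
    index = build_index(REFERENCES, k=5)
    print(classify("CGTTTGAC", index, k=5))
```

Reads that receive votes for an unexpected taxon are the ones worth following up when checking for contamination.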

Species ID: MOCAT

• What: Uses a phylogenetic approach to identify novel or divergent species by relying on distances in conserved marker genes

• Why: Sometimes you sequence something completely novel and want to know more about its relationships

• Where: http://vm-lux.embl.de/~kultima/MOCAT/

• Alternatives: Phylosift, rMLST

Sample QC: Blobology

• What: A simple method of plotting de novo assembly contigs by GC, coverage and taxon

• Why: Characterise contamination, plasmids, lytic phage in a sample

• Where: https://github.com/blaxterlab/blobology
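The core of a GC/coverage plot is a small per-contig table. The sketch below builds one and flags contigs whose GC content is far from the length-weighted mean; it assumes coverage values have already been computed elsewhere, and the 0.15 tolerance and all names are illustrative assumptions rather than anything Blobology itself defines.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def blob_table(contigs, coverage):
    """Return rows of (name, length, gc, coverage) for each contig."""
    return [(name, len(seq), gc_fraction(seq), coverage[name])
            for name, seq in contigs.items()]


def gc_outliers(rows, tolerance=0.15):
    """Names of contigs whose GC differs from the length-weighted mean GC."""
    total = sum(length for _, length, _, _ in rows)
    mean_gc = sum(length * gc for _, length, gc, _ in rows) / total
    return [name for name, _, gc, _ in rows if abs(gc - mean_gc) > tolerance]


if __name__ == "__main__":
    contigs = {"contig_1": "ACGT" * 10, "contig_2": "GGCCGGCCGG"}
    rows = blob_table(contigs, {"contig_1": 50.0, "contig_2": 400.0})
    print(rows)
    print(gc_outliers(rows))
```

Contigs that stand apart in both GC and coverage are candidates for contamination, plasmids or phage.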

Reference approach

Alignment: BWA

• What: The standard method for aligning Illumina sequences to a reference; use it in BWA-MEM mode, which works well with most read lengths

• Why: Finds the likely location of each sequence read in a reference genome

• Where: https://github.com/lh3/bwa

• Alternatives: SMALT, Bowtie2 (beware standard insert size parameters)
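A typical BWA-MEM run can be scripted as below. This is a sketch under assumptions: `bwa` and `samtools` (version 1.3 or later, for the `sort -o` syntax) are on the PATH, the reads are paired-end, and the file names and the helper functions themselves are hypothetical.

```python
import subprocess


def alignment_commands(reference, reads_1, reads_2, out_bam, threads=4):
    """Return the command lines for indexing, aligning and sorting."""
    return {
        "index": ["bwa", "index", reference],
        "align": ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2],
        "sort": ["samtools", "sort", "-o", out_bam, "-"],
        "index_bam": ["samtools", "index", out_bam],
    }


def run_alignment(reference, reads_1, reads_2, out_bam, threads=4):
    """Run bwa mem and pipe its SAM output straight into samtools sort."""
    cmds = alignment_commands(reference, reads_1, reads_2, out_bam, threads)
    subprocess.run(cmds["index"], check=True)
    align = subprocess.Popen(cmds["align"], stdout=subprocess.PIPE)
    subprocess.run(cmds["sort"], stdin=align.stdout, check=True)
    align.stdout.close()
    if align.wait() != 0:
        raise RuntimeError("bwa mem exited with an error")
    subprocess.run(cmds["index_bam"], check=True)


if __name__ == "__main__":
    # Print the commands rather than running them; call run_alignment() to execute.
    for step, cmd in alignment_commands("ref.fa", "r1.fq", "r2.fq", "out.bam").items():
        print(step, " ".join(cmd))
```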

Variant calling: samtools & VarScan

• What: A way of calling SNPs against a reference in one or more samples

• Why: VarScan permits easy filtering of SNPs by allele frequency and strand, useful for getting a precise dataset

• Where: http://www.htslib.org/

• http://varscan.sourceforge.net/

• Alternatives: GATK, snippy, Nesoni
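Filtering by allele frequency and strand support can be sketched as follows. The record layout (per-strand read counts in a dict) is a hypothetical simplification, not VarScan's output format, and the thresholds are illustrative defaults that should be tuned for each dataset.

```python
def passes_filter(call, min_freq=0.9, min_reads_per_strand=2):
    """Keep a SNP only if the variant allele dominates and is seen on both strands."""
    alt = call["alt_fwd"] + call["alt_rev"]
    depth = alt + call["ref_fwd"] + call["ref_rev"]
    if depth == 0:
        return False
    return (alt / depth >= min_freq
            and call["alt_fwd"] >= min_reads_per_strand
            and call["alt_rev"] >= min_reads_per_strand)


# Hypothetical calls: a clean SNP, a strand-biased call, and a low-frequency call.
CALLS = [
    {"pos": 1200, "alt_fwd": 20, "alt_rev": 18, "ref_fwd": 1, "ref_rev": 1},
    {"pos": 3400, "alt_fwd": 30, "alt_rev": 0, "ref_fwd": 0, "ref_rev": 0},
    {"pos": 5600, "alt_fwd": 5, "alt_rev": 5, "ref_fwd": 20, "ref_rev": 20},
]

if __name__ == "__main__":
    print([call["pos"] for call in CALLS if passes_filter(call)])
```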

Recombination filtering: Gubbins

• What: Detects regions that have undergone recombination, which would otherwise confound phylogenetic reconstructions that assume clonality

• Why: Important when attempting phylogenetic reconstructions from recombining organisms

• Where: http://sanger-pathogens.github.io/gubbins/

• Alternatives: ClonalFrameML, BRATNextGen
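The intuition behind recombination filtering is that imported DNA tends to show up as clusters of SNPs. The sketch below is a crude density scan over fixed windows, offered only to illustrate that intuition: Gubbins itself works on branches of a phylogeny with a statistical model, and the window size, threshold and function names here are arbitrary assumptions.

```python
def dense_windows(snp_positions, genome_length, window=1000, max_snps=3):
    """Return (start, end, count) for fixed windows holding more than max_snps SNPs."""
    counts = {}
    for pos in snp_positions:  # positions are 0-based
        counts[pos // window] = counts.get(pos // window, 0) + 1
    return [(w * window, min((w + 1) * window, genome_length), n)
            for w, n in sorted(counts.items()) if n > max_snps]


def mask_snps(snp_positions, regions):
    """Drop SNPs that fall inside any flagged region (end-exclusive)."""
    return [pos for pos in snp_positions
            if not any(start <= pos < end for start, end, _ in regions)]


if __name__ == "__main__":
    snps = [100, 5000, 5010, 5020, 5030, 5040, 9000]
    regions = dense_windows(snps, genome_length=10000)
    print(regions)
    print(mask_snps(snps, regions))
```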

Tree building: FastTree

• What: Phylogenetic reconstructions from SNP data

• Why: Tree reconstructions are an effective way of examining evolutionary relationships in isolates and testing if they are from an outbreak

• Note: Ensure you don’t hit the double-precision bug! (http://darlinglab.org/blog/2015/03/23/not-so-fast-fasttree.html)

• Where: http://meta.microbesonline.org/fasttree/

• Alternatives: RAxML (more thorough, slower), REALPHY http://realphy.unibas.ch/fcgi/realphy
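Before building a tree it can be useful to sanity-check the SNP alignment with a pairwise distance matrix. The sketch below counts differing sites between aligned sequences, skipping ambiguous bases; it is not a tree-building method and says nothing about how FastTree or RAxML work internally. The sequences and names are made up.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count sites where both sequences have a called base and they differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b)
               if a != b and a in "ACGT" and b in "ACGT")


def distance_matrix(alignment):
    """Return {(name_1, name_2): distance} for every pair of isolates."""
    return {(n1, n2): snp_distance(alignment[n1], alignment[n2])
            for n1, n2 in combinations(sorted(alignment), 2)}


if __name__ == "__main__":
    alignment = {
        "isolate_A": "ACGTACGT",
        "isolate_B": "ACGTACGA",
        "isolate_C": "TCGTNCGA",
    }
    for pair, dist in distance_matrix(alignment).items():
        print(pair, dist)
```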

MLST & Antibiogram: SRST2

• What: Aligns reads against MLST and antibiotic resistance databases

• Why: Permits MLST typing with genome data and a rough prediction of antibiotic resistance

• Where: http://katholt.github.io/srst2/

De novo approach

De novo assembly: SPAdes

• What: A reliable de novo assembler which works well with multiple data types

• Why: Has in-built error corrector so no need for read trimming, can use multiple values of k so less need for experimentation, consistently performs well in comparisons

• Where: http://bioinf.spbau.ru/spades
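To see why the choice of k matters, the sketch below builds a bare-bones de Bruijn graph from reads at several values of k and reports its size. It is a teaching toy under strong assumptions (error-free reads, no reverse complements, no graph simplification) and does not reflect how SPAdes is implemented; the function names are hypothetical.

```python
def de_bruijn_edges(reads, k):
    """Return one (prefix, suffix) edge per distinct k-mer in the reads."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


def graph_summary(reads, ks):
    """Return {k: (node_count, edge_count)} for each requested k."""
    summary = {}
    for k in ks:
        edges = de_bruijn_edges(reads, k)
        nodes = {node for edge in edges for node in edge}
        summary[k] = (len(nodes), len(edges))
    return summary


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG"]
    print(graph_summary(reads, ks=[3, 4, 5]))
```

Small k gives a well-connected but repeat-tangled graph, while large k resolves repeats at the cost of breaks where coverage is low, which is why combining several values helps.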

De novo assembly: Velvet

• What: The original short-read assembler

• Why: Extremely fast for draft assemblies, particularly if you just want to do MLST or antibiograms

• Where: https://www.ebi.ac.uk/~zerbino/velvet/

• Alternatives: MEGAHIT – even faster!

Annotation: Prokka

• What: Takes de novo assembly contig files and annotates them with coding sequences and non-coding features such as RNAs

• Why: A very sensible set of tools and reference databases in a single package, produces usable output for other software and database submission

• Where: http://www.vicbioinformatics.com/software.prokka.shtml

• Alternatives: xBASE annotation interface

Tree building: Harvest

• What: Takes de novo assembly contigs, performs whole-genome alignment and permits reconstruction of core genome phylogenies

• Why: Scaleable to hundreds of genomes on a laptop and with an excellent viewer

• Where: http://harvest.readthedocs.org/en/latest/index.html

• Alternatives: Mauve

Population genomics: BIGSdb

• What: Takes de novo assembly contigs and applies MLST-like schemes working on hundreds or thousands of core genes

• Why: Scaleable to >1000s of genomes for rapid population-level clustering

• Where: http://pubmlst.org/software/database/bigsdb/

• Alternatives: Bionumerics
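Gene-by-gene clustering boils down to comparing allele profiles. The sketch below counts loci with different allele numbers and lists isolate pairs within a chosen distance; the profile layout, the threshold and the function names are hypothetical and unrelated to BIGSdb's own interfaces.

```python
from itertools import combinations


def profile_distance(profile_a, profile_b):
    """Count loci with different alleles, ignoring loci missing in either profile."""
    distance = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue  # uncalled locus: not counted as a difference
        if allele_a != allele_b:
            distance += 1
    return distance


def pairs_within(profiles, max_diff):
    """Return isolate pairs whose profiles differ at no more than max_diff loci."""
    return [(n1, n2) for n1, n2 in combinations(sorted(profiles), 2)
            if profile_distance(profiles[n1], profiles[n2]) <= max_diff]


if __name__ == "__main__":
    profiles = {
        "iso1": {"gene1": 1, "gene2": 4, "gene3": 7},
        "iso2": {"gene1": 1, "gene2": 5, "gene3": None},
        "iso3": {"gene1": 2, "gene2": 9, "gene3": 8},
    }
    print(pairs_within(profiles, max_diff=1))
```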

Pan/accessory genomes: LS-BSR

• What: Takes de novo assembly contigs or annotations and compares gene content

• Why: To determine differences in gene content between 1 to 1000s of strains

• Where: https://github.com/jasonsahl/LS-BSR

• Alternatives: OrthoMCL
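LS-BSR takes its name from the BLAST score ratio: the bit score of a gene's best hit in a genome divided by the bit score of the gene aligned against itself. The sketch below turns pre-computed bit scores into a presence/absence table; the 0.8 cut-off, the input layout and the function names are assumptions for illustration, not LS-BSR's interface.

```python
def blast_score_ratio(query_bits, self_bits):
    """Ratio of a gene's hit score in a genome to its self-alignment score."""
    if self_bits <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return min(query_bits / self_bits, 1.0)


def presence_matrix(self_scores, genome_scores, threshold=0.8):
    """Return {genome: {gene: bool}}, calling a gene present when BSR >= threshold."""
    matrix = {}
    for genome, hits in genome_scores.items():
        matrix[genome] = {
            gene: blast_score_ratio(hits.get(gene, 0.0), self_bits) >= threshold
            for gene, self_bits in self_scores.items()
        }
    return matrix


if __name__ == "__main__":
    # Hypothetical bit scores; a gene with no hit in a genome is simply absent from its dict.
    self_scores = {"geneA": 500.0, "geneB": 300.0}
    genome_scores = {
        "strain1": {"geneA": 450.0, "geneB": 30.0},
        "strain2": {"geneA": 500.0},
    }
    print(presence_matrix(self_scores, genome_scores))
```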

MLST/Antibiogram: mlst and Abricate

• What: Works on de novo assemblies to give MLST prediction and antibiotic resistance prediction

• Why: A very fast method

• Where: https://github.com/tseemann/mlst

• https://github.com/tseemann/abricate

• Alternatives: SRST2
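Once the allele at each scheme locus has been identified, assigning a sequence type is a table lookup. The sketch below shows that final step only; the locus names, allele numbers and sequence types are invented for illustration and do not come from any real MLST scheme or from the mlst tool's code.

```python
# Hypothetical three-locus scheme; real schemes usually have seven loci.
SCHEME_LOCI = ("locusA", "locusB", "locusC")

# Hypothetical profile table: allele numbers (in locus order) -> sequence type.
PROFILES = {
    (1, 1, 1): 1,
    (1, 2, 1): 5,
    (3, 2, 4): 22,
}


def call_sequence_type(alleles, loci=SCHEME_LOCI, profiles=PROFILES):
    """Return the sequence type for a dict of locus -> allele number, or None."""
    try:
        key = tuple(alleles[locus] for locus in loci)
    except KeyError:
        return None  # at least one locus was not found in the assembly
    return profiles.get(key)  # None means a novel allele combination


if __name__ == "__main__":
    print(call_sequence_type({"locusA": 1, "locusB": 2, "locusC": 1}))
```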

CLoud Infrastructure for Microbial Bioinformatics (CLIMB)

• MRC funded project to develop Cloud Infrastructure for microbial bioinformatics

• £4M of hardware, capable of supporting >1000 individual virtual servers

• Amazon/Google cloud for Academics

Acknowledgements

• Twitter comments:

– Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik, Bill Hanage, Jennifer Gardy, Mick Watson, Esther Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth Massey