GalaxyTrakr: Development of an Accessible Cloud-based Bioinformatics Platform

James Pettengill
Geneticist, Biostatistics and Bioinformatics Staff
Center for Food Safety and Applied Nutrition
US Food and Drug Administration

Food Safety & High-Throughput Sequencing (HTS)
Institute for Food Safety and Health
May 31, 2018

Outline:

1. Galaxy: a user-friendly interface for bioinformatics

• Introduction to Galaxy

• GalaxyTrakr Overview

• GalaxyTrakr Tools

2. CFSAN cgMLST: rapid screen and clustering of isolates

• Internal rapid identification of SNP clusters for outbreak analyses

• Resource for others/industry

What is Galaxy?

Why Galaxy?

- Has a graphical user interface (GUI), so it does not require command-line experience

- Active community of developers and users sharing the tools they have developed or ported to Galaxy*

- Access programs through the Galaxy Tool Shed

Summary of Galaxy on AWS

• Galaxy is distributed under the Academic Free License.

– https://galaxyproject.org/

• Installed on the master node of a CloudFormation cluster.

• Submits jobs to the compute cluster via Grid Engine.

• Compute clusters are elastic, scaling with demand.

• Storage is elastic and accessible from multiple master nodes.

• Two options for installation on AWS:

– https://aws.amazon.com/hpc/cfncluster/ **

– https://galaxyproject.org/cloudman/getting-started/

GalaxyTrakr Tools

• NGS QC and Manipulation – Trimmomatic, FastQC

• NGS Mapping – Bowtie2, Short Read Sequence Typing (SRST2), BWA and BWA-MEM, Neptune Signature Discovery

• NGS Assembly – plasmidSPAdes, SPAdes, QUAST

• NGS Screening and Prediction – SeqSero v1 and v2, SeqSero Batch Paired-End Reads, sistr_cmd, BTyper, MLST, ABRicate

• Data Input

– Direct from NCBI in Pileup, BAM or FASTA/Q format

– Upload from local computer via secure FTP or via the GalaxyTrakr web interface

• Data Output

– Download from the GalaxyTrakr web interface

– Download via FTP

• Reference-based variant detection – CFSAN SNP Pipeline

GalaxyTrakr Stats

• Currently 139 active users across 42 locations worldwide, adding about 15 users per week

• Over 38,000 jobs processed to date, with top users using over 3,500 hours and 11,000 CPU slots

• Cost at the current load is approximately $6,500 a month; the initial target was $10,000 a month

• Custom software for automated monitoring and management; less than one full-time staff member manages IT services

– Detailed custom dashboard: http://dash.galaxytrakr.org/

An example with SeqSero:

• Uses whole genome sequence data to predict serotype.

• Useful tool for QA/QC

• Maps reads to database of antigen alleles using BWA in multiple steps.

• Chooses the alleles to which the most reads mapped.

• Uses BLAST to clear up ambiguities.
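
The allele-selection step described above can be illustrated with a small sketch. This is not SeqSero's actual code: the function name `pick_alleles`, the `min_margin` threshold, and the antigen and allele labels are all hypothetical, and the read-to-allele assignments are assumed to have already been produced by a mapper such as BWA. It only shows the idea of counting mapped reads per allele, taking the best-supported allele, and flagging close calls that a follow-up step (BLAST on the slide) would need to resolve.

```python
from collections import Counter


def pick_alleles(read_hits, min_margin=0.1):
    """Pick the best-supported allele per antigen from mapped-read hits.

    read_hits: iterable of (antigen, allele) pairs, one pair per mapped read.
    min_margin: hypothetical threshold; a call is flagged ambiguous when the
        lead over the runner-up is smaller than this fraction of the top count.
    Returns {antigen: (best_allele, is_ambiguous)}.
    """
    counts = {}
    for antigen, allele in read_hits:
        counts.setdefault(antigen, Counter())[allele] += 1

    calls = {}
    for antigen, tally in counts.items():
        ranked = tally.most_common(2)
        best, best_n = ranked[0]
        runner_n = ranked[1][1] if len(ranked) > 1 else 0
        ambiguous = (best_n - runner_n) < min_margin * best_n
        calls[antigen] = (best, ambiguous)
    return calls


# Made-up read counts for two antigens (labels are illustrative only).
example_hits = (
    [("O", "O-4")] * 40 + [("O", "O-9")] * 3
    + [("H1", "fliC_i")] * 25 + [("H1", "fliC_r")] * 24
)
example_calls = pick_alleles(example_hits)
print(example_calls)
```

In this made-up example the O antigen call is clear-cut, while the H1 call is flagged as ambiguous because the two candidate alleles differ by a single read.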

Galaxy homepage

Upload your data

Choose your data

Run it

Wait…

View and Download the Results

CFSAN SNP Pipeline

cgMLST: core genome multi-locus sequence type

cgMLST creation (workflow diagram)

• Collection of reference genomes

• Annotate with PROKKA to produce annotated references

• All-against-all comparison of genes

• Identify single-copy core genes

• cgMLST database

cgMLST in practice (workflow diagram)

• Genomes of interest

• Annotation of each genome

• Isolate and compare cgMLST loci to determine closest isolates
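
The two halves of the workflow above can be sketched in a few lines. This is an illustration under stated assumptions, not the CFSAN implementation: it assumes annotation and the all-against-all gene comparison have already grouped genes into named families, and every name here (`single_copy_core_genes`, `allele_distance`, `closest_isolates`, the gene and isolate labels, the allele numbers) is hypothetical. Core loci are taken to be families present exactly once in every reference genome, and isolates are ranked by the number of core loci at which their allele calls differ from the query.

```python
def single_copy_core_genes(annotations):
    """Return gene families present exactly once in every reference genome.

    annotations: {genome_id: {gene_family: copy_count}}
    """
    genomes = list(annotations.values())
    if not genomes:
        return []
    return sorted(
        family for family in genomes[0]
        if all(genome.get(family, 0) == 1 for genome in genomes)
    )


def allele_distance(profile_a, profile_b, loci):
    """Count loci whose allele calls differ; loci missing in either are skipped."""
    return sum(
        1 for locus in loci
        if profile_a.get(locus) is not None
        and profile_b.get(locus) is not None
        and profile_a[locus] != profile_b[locus]
    )


def closest_isolates(query, database, loci, top_n=3):
    """Rank database isolates by allele distance to the query profile."""
    scored = [
        (allele_distance(query, profile, loci), name)
        for name, profile in database.items()
    ]
    return sorted(scored)[:top_n]


# cgMLST creation: made-up copy counts per gene family in three references.
references = {
    "refA": {"gyrB": 1, "recA": 1, "tnpA": 2},
    "refB": {"gyrB": 1, "recA": 1},
    "refC": {"gyrB": 1, "recA": 1, "tnpA": 1},
}
core_loci = single_copy_core_genes(references)

# cgMLST in practice: made-up allele numbers at each core locus.
database = {
    "isolate1": {"gyrB": 1, "recA": 4},
    "isolate2": {"gyrB": 2, "recA": 7},
    "isolate3": {"gyrB": 1, "recA": 5},
}
query = {"gyrB": 1, "recA": 4}
nearest = closest_isolates(query, database, core_loci, top_n=2)
print(core_loci, nearest)
```

Skipping loci that are missing in either profile is one possible design choice; a production scheme would need an explicit policy for missing or partial loci.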

cgMLST has the potential to incorporate many approaches…

• SNP detection/calling (reference)

• Open-reading frame annotation (de novo); presence/absence and extended MLST

• Whole-chromosome alignment (de novo)

…providing extreme sensitivity and valuable genotypic and phenotypic information.

cgMLST: rapid screening of GenomeTrakr/Pathogen database to identify similar isolates

Summary:

1. GalaxyTrakr: a user-friendly interface for bioinformatics

• Open source

• Lots of tools

• Active community and support available

• CFSAN’s GalaxyTrakr may not be suitable for industry because it is intended for public data, but industry could stand up its own Galaxy instance in-house.

2. CFSAN cgMLST: rapid screen and clustering of isolates

• Internal rapid identification of SNP clusters for outbreak analyses

• Resource for others/industry – requires some bioinformatic expertise

Acknowledgements

GalaxyTrakr

• James Sanders**

• Charles Strittmatter

• Justin Payne

• Errol Strain

• Hugh Rand

cgMLST

• Arthur Pightling