2015 06-12-beiko-irida-big data

Post on 29-Jul-2015


TRANSCRIPT

1

2

“All of your answers are approximate, you might as well live with it…”

Andrew Rau-Chaplin, 1½ hours ago

Integrated Rapid Infectious Disease Analysis
www.irida.ca

Rob Beiko
Faculty of Computer Science
Dalhousie University
June 12

Microbial genomics for rapid investigation of infectious disease

Image © Kenneth Todar

4

2009 and Influenza A

5

6

7

VIRUS
Influenza A
RNA genome (14,000 nucleotides)
Eight segments
(Image: Tao and Zheng, Science 2012)

BACTERIUM
S. Typhi CT18
DNA genome (~5,100,000 nucleotides)
One chromosome + two plasmids
Science (2001)

8

Outbreak investigation

Similarities: place, time, genetics

fda.gov

2014

2010-2013

Inns et al. (2015)

9

Outbreak investigation in Canada

NATIONAL MICROBIOLOGY LABORATORY

PROVINCIAL PUBLIC HEALTH LABORATORIES

CLINICAL ISOLATES

SENTINEL SURVEILLANCE (FoodNet Canada)

CLINICAL, FOOD, ENVIRONMENTAL

CANADIAN FOOD INSPECTION AGENCY

(Regulatory)

FOOD ISOLATES

LISTERIA - E. COLI O157:H7 - SALMONELLA - SHIGELLA

PFGE/MLVA

PUBLIC HEALTH ACTION

10

Pulsed Field Gel Electrophoresis
Serratia - NICU

Hospital cases
Handwashes
Environmental (doors, etc)
Control (elsewhere in hospital)

Jang et al., J Hosp Infect (2001)

11

DNA Sequencing

1977: Sanger sequencing
< 1 kilobase per run
$2 / run, 1-3 hours (96 in parallel)
Tiny pieces (600 – 1000 bases)

2011: Illumina MiSeq
15 gigabases per run
$1000 - $1500 / run, 1 day
Tinier pieces (150 – 400 bases)
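As a rough worked comparison of the two platforms above, the following sketch computes cost per kilobase and the approximate coverage of a ~5.1 Mb bacterial genome (the S. Typhi CT18 size quoted earlier) from a single MiSeq run. The inputs are the slide's round numbers; the $1250 midpoint price and the function names are assumptions made for illustration.

```python
def cost_per_kilobase(cost_per_run_usd: float, bases_per_run: float) -> float:
    """Cost in USD to sequence 1,000 bases at the given run cost and yield."""
    return cost_per_run_usd / (bases_per_run / 1_000)


def fold_coverage(bases_per_run: float, genome_size: float) -> float:
    """Average number of times each genome position is read in one run."""
    return bases_per_run / genome_size


# Sanger: the slide says "< 1 kilobase per run" at $2, so 1,000 bases is an
# optimistic upper bound on yield (assumption).
sanger = cost_per_kilobase(2.0, 1_000)

# MiSeq: 15 gigabases per run; $1250 is an assumed midpoint of $1000 - $1500.
miseq = cost_per_kilobase(1_250.0, 15e9)

# Coverage of a ~5.1 Mb genome if one isolate used a whole MiSeq run.
coverage = fold_coverage(15e9, 5.1e6)

print(f"Sanger: ${sanger:.2f} per kb")
print(f"MiSeq:  ${miseq:.6f} per kb")
print(f"MiSeq coverage of a 5.1 Mb genome: about {coverage:.0f}x")
```

Under these assumptions the per-base cost differs by more than four orders of magnitude, and one run yields far more coverage than a single bacterial genome needs, which is why a run would normally be shared across many isolates.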

10/10/2013 VanBUG 12

13

MiSeq projects at Dalhousie
• Bedford Basin microbial monitoring
• Pediatric Crohn’s disease samples
• Global microbial air sampling
• Mink genomes
• Sequencing Lactobacillus genomes from the poop of old mice
• Wastewater diversity and function in the Arctic
• Verifying ingredients in dog food
• Exercise and the Microbiome

14

Integrated Rapid Infectious Disease Analysis
www.irida.ca

1.56M, 3-year Genome Canada Large-Scale Applied Platform Grant

SFU / BCCDC / PHAC-NML / Dalhousie

DNA sequencing and downstream applications
• data management / federation
• analysis workflows
• ontologies
• APIs
• 3rd-party applications

Implementation in provincial public health labs
Training

15

Five Pillars of IRIDA

16

Ontologies and data standards
NCBI, MiXS, vegetables

Metadata
Data provenance
Data quality
Environmental information

17

Data sharing!

• BIG challenges – different jurisdictions, “ownership” of epi data. Privacy!
• Health service providers – concerns about privacy and data breach
• Technology outstrips policy
• What digital records could we get TODAY?
• Canada lagging in data sharing

18

Calling isolates based on genetic variation

Traditional:
Pulsed-field
Multi-locus (standards! mlst.net)

Whole genomes:
Lots of information!
Too much information!
Lots of filtering and quality control required
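To make the multi-locus idea concrete: each isolate is reduced to an allele number at each of a fixed set of loci, and two isolates receive the same call only when every locus matches. The sketch below assumes hypothetical locus names and allele numbers; it does not reproduce any real typing scheme or the mlst.net database.

```python
from typing import Dict

Profile = Dict[str, int]  # locus name -> allele number


def allelic_distance(a: Profile, b: Profile) -> int:
    """Number of loci at which two isolates carry different alleles."""
    if a.keys() != b.keys():
        raise ValueError("profiles must be typed at the same set of loci")
    return sum(1 for locus in a if a[locus] != b[locus])


def same_type(a: Profile, b: Profile) -> bool:
    """Two isolates get the same call only if every locus matches."""
    return allelic_distance(a, b) == 0


# Hypothetical seven-locus profiles (locus names and allele numbers invented).
isolate_1 = {"locA": 1, "locB": 4, "locC": 2, "locD": 7,
             "locE": 3, "locF": 1, "locG": 5}
isolate_2 = dict(isolate_1)                       # identical profile
isolate_3 = {**isolate_1, "locD": 9, "locG": 2}   # differs at two loci

print(allelic_distance(isolate_1, isolate_2), same_type(isolate_1, isolate_2))
print(allelic_distance(isolate_1, isolate_3), same_type(isolate_1, isolate_3))
```

A handful of loci gives a coarse but standardized call; whole genomes replace the seven numbers with millions of positions, which is where the filtering and quality control come in.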

19

Workflow management

REST-like API (3rd-party applications)

Security: authentication / authorization

Data models & implementation

Local Storage

Remote APIs

IRIDA’s Federated Design

List Samples
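One way to picture the federated "List Samples" operation is a single call that queries local storage and each remote installation, with every source applying its own authorization. The sketch below uses in-memory stand-ins and makes no network calls; the class names, method names and user names are hypothetical and are not IRIDA's actual API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Set


@dataclass(frozen=True)
class Sample:
    sample_id: str
    organism: str
    source: str  # which installation holds the data


class SampleSource(Protocol):
    def list_samples(self, user: str) -> Iterable[Sample]: ...


class InMemorySource:
    """Stand-in for local storage or a remote API; no network access."""

    def __init__(self, name: str, samples: List[Sample], allowed_users: Set[str]):
        self.name = name
        self._samples = samples
        self._allowed = allowed_users

    def list_samples(self, user: str) -> Iterable[Sample]:
        # Authorization is checked by each source, so data owners keep control.
        if user not in self._allowed:
            return []
        return list(self._samples)


def list_all_samples(user: str, sources: Iterable[SampleSource]) -> List[Sample]:
    """Merge the samples this user may see across every federated source."""
    merged: List[Sample] = []
    for src in sources:
        merged.extend(src.list_samples(user))
    return sorted(merged, key=lambda s: s.sample_id)


local = InMemorySource(
    "local", [Sample("F14231", "Salmonella sp.", "local")], {"alice"})
remote = InMemorySource(
    "remote", [Sample("S1245", "Salmonella sp.", "remote")], {"alice", "bob"})

print([s.sample_id for s in list_all_samples("alice", [local, remote])])
print([s.sample_id for s in list_all_samples("bob", [local, remote])])
```

Keeping the authorization check inside each source, rather than in the merging function, reflects the jurisdiction and data "ownership" concerns raised earlier in the talk.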

20

21

Each pipeline is implemented as a Galaxy workflow

Internal analysis pipelines
Assembly and annotation
Phylogenetics
“Line list” management

3rd-party applications

22

Single-Nucleotide Variant Phylogenetic Pipeline (SNVPhyl)

Sampled genomes
Quality control
Tree generation / visualization
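The quality-control and distance steps that feed tree generation can be illustrated on a toy alignment. This is a minimal sketch, not SNVPhyl's implementation: it assumes the genomes are already aligned to equal length, drops any position where some genome has an ambiguous or missing call, and counts pairwise differences over the remaining positions. Sequences and isolate names are invented.

```python
from itertools import combinations
from typing import Dict, List, Tuple

VALID = set("ACGT")


def informative_columns(alignment: Dict[str, str]) -> List[int]:
    """Quality control: keep positions where every genome has an unambiguous base."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(length) if all(s[i] in VALID for s in seqs)]


def snv_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise count of differing bases over the columns that passed QC."""
    cols = informative_columns(alignment)
    return {
        (a, b): sum(1 for i in cols if alignment[a][i] != alignment[b][i])
        for a, b in combinations(sorted(alignment), 2)
    }


# Toy alignment; "N" and "-" mark low-quality or missing calls.
alignment = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAC",
    "isolate_C": "ACGAACGTNC",
    "isolate_D": "TCGAAC-TAC",
}

for pair, distance in snv_distances(alignment).items():
    print(pair, distance)
```

A distance matrix like this is one possible input to a tree-building method; isolates separated by zero or very few variants would be candidates for the same cluster.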

23

GenGIS

Data from Haiti cholera outbreak, 2010
http://kiwi.cs.dal.ca/GenGIS

24

IslandViewer

http://www.pathogenomics.sfu.ca/islandviewer/browse

25

Interfaces / environment

Personas
Researchers
Epidemiologists
Clinical microbiologists / lab technicians

Workflow design and execution

Full Privileges

Cluster  Line List ID  Patient Name   Prov. Health No.  Age  Sex  Location   Sample ID  Collection Date  Culture Result
A        1             John Smith     4513253244        26   M    Vancouver  F14231     14/03/21         Salmonella sp.
A        2             Sally Smith    4519567458        24   F    Vancouver  F14235     14/03/21         Salmonella sp.
B        3             Tom Jones      4517543216        35   M    Vancouver  M6542      14/03/24         Salmonella sp.
B        4             Helen Jones    9856321124        35   F    Vancouver  S1245      14/03/22         Salmonella sp.
C        5             Jennifer Lee   4516853122        29   F    Vancouver  S5642      14/03/22         Salmonella sp.
C        6             Michael Brown  9456534561        45   M    Victoria   T68954     14/03/25         Salmonella sp.

Phylogenetic Tree

Genetic Distance

Limited Privileges

Cluster  Line List ID  Patient Name   Prov. Health No.  Age  Sex  Location   Sample ID  Collection Date  Culture Result
A        1             John Smith     4513253244        26   M    Vancouver  F14231     14/03/21         Salmonella sp.
A        2             Sally Smith    4519567458        24   F    Vancouver  F14235     14/03/21         Salmonella sp.
B        3             Tom Jones      4517543216        35   M    Vancouver  M6542      14/03/24         Salmonella sp.
B        4             Helen Jones    9856321124        35   F    Vancouver  S1245      14/03/22         Salmonella sp.
C        5             Jennifer Lee   4516853122        29   F    Vancouver  S5642      14/03/22         Salmonella sp.
C        6             Michael Brown  9456534561        45   M    Victoria   T68954     14/03/25         Salmonella sp.

Phylogenetic Tree

Genetic Distance
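The two views above share the same line list; what differs is what a given persona is allowed to see. A minimal sketch of that idea follows. The transcript does not record which fields the limited view hides, so the choice of patient name and provincial health number, the "[hidden]" placeholder, and the function name are all assumptions.

```python
from typing import Dict, List

# Fields assumed to identify a patient directly; the choice is illustrative.
IDENTIFYING_FIELDS = {"patient_name", "prov_health_no"}


def view_line_list(rows: List[Dict[str, str]],
                   full_privileges: bool) -> List[Dict[str, str]]:
    """Return a copy of the line list, masking identifying fields for limited users."""
    if full_privileges:
        return [dict(r) for r in rows]
    return [
        {k: ("[hidden]" if k in IDENTIFYING_FIELDS else v) for k, v in r.items()}
        for r in rows
    ]


line_list = [
    {"cluster": "A", "line_list_id": "1", "patient_name": "John Smith",
     "prov_health_no": "4513253244", "age": "26", "sex": "M",
     "location": "Vancouver", "sample_id": "F14231",
     "collection_date": "14/03/21", "culture_result": "Salmonella sp."},
]

limited = view_line_list(line_list, full_privileges=False)
full = view_line_list(line_list, full_privileges=True)

print(limited[0]["patient_name"], limited[0]["sample_id"])
print(full[0]["patient_name"], full[0]["sample_id"])
```

Because the cluster, sample ID and genetic information stay visible, a limited user could still work with the phylogenetic tree and genetic distances without seeing who the patients are.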

28

Large-scale sequencing initiatives

en.wikipedia.org

29

FDA GenomeTrakr

http://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm

30

Public Health England project (>10,000 Salmonella so far)

• As of 2015, sequencing every sampled Salmonella isolate collected in England
• Over 10,000 sequenced to date
• 8000 already available for download in the public databases

31

Gary Van Domselaar, NML

The Global Microbial Identifier

32

What’s next?

2015: Oxford Nanopore MinION
??? per run
$900 / run, 6 hours
Huge pieces (max so far – 200-300 kilobases)
Can stop / restart using same disposable flowcell

15 cm (-ish)

thehightechsociety.com

33

Quick et al. (2015)

“Using a novel streaming phylogenetic placement method samples can be assigned to a serotype in 40 minutes and determined to be part of the outbreak in less than 2 h.”

34

Ebola monitoring

blogs.biomedcentral.com
Joshua Quick, Nick Loman

35

Example workflow

6 hrs

Change flowcell

Samples evaluated against reference in real time

Positive ID / placement

Load DNA

confidence
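The workflow above (reads evaluated against references as they arrive, with a call made once confidence is high enough) can be sketched with a simple stand-in. This is not the streaming phylogenetic placement method of Quick et al.; it swaps in plain k-mer voting, and the references, reads, thresholds and function names are all invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_reference(read: str, ref_kmers: Dict[str, Set[str]], k: int) -> Optional[str]:
    """Reference sharing the most k-mers with the read, or None on no hit or a tie."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & ks) for name, ks in ref_kmers.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == 0 or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]


def stream_identify(reads: Iterable[str], references: Dict[str, str], k: int = 4,
                    min_reads: int = 3,
                    threshold: float = 0.8) -> Tuple[Optional[str], int]:
    """Consume reads one at a time; stop once one reference holds enough votes."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    votes: Counter = Counter()
    seen = 0
    for read in reads:
        seen += 1
        hit = best_reference(read, ref_kmers, k)
        if hit is not None:
            votes[hit] += 1
        total = sum(votes.values())
        if total >= min_reads:
            name, count = votes.most_common(1)[0]
            if count / total >= threshold:
                return name, seen  # confident call: sequencing could stop here
    return None, seen


# Invented toy references and reads.
references = {"ref_X": "ACGTACGGTCAAGT", "ref_Y": "TTGACCATGGCATA"}
reads = ["ACGTACGG", "GGTCAAGT", "CGTACGGT", "TTGACCAT", "ATGGCATA"]

print(stream_identify(reads, references))
```

The early return is the point of the design: with a device that can be stopped and restarted, a confident identification partway through the run means the remaining reads need not be sequenced.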

36

Challenges

• Sample extraction: getting DNA from stuff
• Clinical-grade evaluation
• Training
• Equipment reliability
• Sequencing errors
• Quality of reference data / attribution algorithms
• Database updates in real time
• Ethics / privacy (Genomes Sequenced While U Wait)

37

The Point

Comprehensive monitoring
Accurate typing
Rapid identification

Real-time decision making

Acknowledgements

PIs
Fiona Brinkman – SFU
Will Hsiao – PHMRL
Gary Van Domselaar – NML
Morag Graham – NML
Rob Beiko – Dalhousie

University of Lisbon
João Carriço

National Microbiology Laboratory (NML)
Franklin Bristow, Aaron Petkau, Thomas Matthews, Josh Adam, Adam Olsen, Tara Lynch, Shaun Tyler, Philip Mabon, Philip Au, Celine Nadon, Matthew Stuart-Edwards, Chrystal Berry, Lorelee Tschetter

Laboratory for Foodborne Zoonoses (LFZ)
Eduardo Taboada, Peter Kruczkiewicz, Chad Laing, Vic Gannon, Matthew Whiteside, Ross Duncan, Steven Mutschall

Simon Fraser University (SFU)
Melanie Courtot, Emma Griffiths, Geoff Winsor, Julie Shay, Matthew Laird, Bhav Dhillon, Raymond Lo

BC Public Health Microbiology & Reference Laboratory (PHMRL) and BC Centre for Disease Control (BCCDC)
Judy Isaac-Renton, Patrick Tang, Natalie Prystajecky, Jennifer Gardy, Damion Dooley, Linda Hoang, Kim MacDonald, Yin Chang, Eleni Galanis, Marsha Taylor, Cletus D’Souza, Ana Paccagnella

University of Maryland
Lynn Schriml

Canadian Food Inspection Agency (CFIA)
Burton Blais, Catherine Carrillo, Dominic Lambert

Dalhousie University
Alex Keddy

McMaster University
Andrew McArthur, Daim Sardar

European Nucleotide Archive
Guy Cochrane, Petra ten Hoopen, Clara Amid

European Food Safety Authority
Leibana Criado Ernesto, Vernazza Francesco, Rizzi Valentina

38

39

Seminar from Will Hsiao, BC Centre for Disease Control

40

Materials to be available on http://bioinformatics.ca/

June 24-26, 2015

41

The Bioinformatics Exam of the Future

tagc.com.au
commons.wikimedia.org/wiki/File:DNA_ahelatest_moodustunud_niit_katsuti_korgil..JPG
http://omicfrontiers.com/2014/06/11/diaryofaminion_part2/

42

2009 was a long time ago

J. Craig Venter Institute

43

Photo credit: Emma Allen-Vercoe
Some slides courtesy of Gary Van Domselaar, NML

FIN
