Bacterial Pathogen Genomics at NCBI

Upload: nist-spin

Post on 16-Jul-2015


TRANSCRIPT

Page 1: Bacterial Pathogen Genomics at NCBI

Bacterial Pathogen Genomics at NCBI

Page 2: Bacterial Pathogen Genomics at NCBI
Page 3: Bacterial Pathogen Genomics at NCBI

FDA, USDA, CDC, State, Local and Foreign Public Health Agencies

Industry/Academia Additional DATA ANALYSIS

DATA ASSEMBLY AND STORAGE and Analysis

DATA ACQUISITION

NCBI, EMBL, DDBJ (INSDC) (Public Access Database)

Our Current Model – Publicly available data

National Network of Sequencers

International Network of Sequencers

Page 4: Bacterial Pathogen Genomics at NCBI

Automated Bacterial Assembly

SRA reads (sample 1)

Trim reads (Ns, adaptor)

Find closest reference genome(s) (reference distance tree)

De novo assembly panel: SOAPdenovo, GS-assembler (newbler), MaSuRCA, Celera Assembler, SPAdes

Argo (reference-assisted assembly)

ArgoCA (combined assembly)

Reads remapped to combined assembly

Outputs: contig FASTA, read placements (BAM), quality profile
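
As a rough illustration of the "Trim reads (Ns, adaptor)" step, here is a minimal Python sketch. It is not the NCBI pipeline's trimmer: the function name `trim_read`, the exact-match adaptor search, and the adaptor string in the usage note are all assumptions made for the example.

```python
def trim_read(seq, qual, adaptor):
    """Return (seq, qual) with a 3' adaptor and flanking Ns removed.

    Hypothetical helper: exact adaptor matching only, no mismatches,
    no partial adaptor at the read end.
    """
    # Cut the read at the first full occurrence of the adaptor, if any.
    if adaptor:
        cut = seq.find(adaptor)
        if cut != -1:
            seq, qual = seq[:cut], qual[:cut]
    # Drop runs of N at both ends, keeping qualities in step with bases.
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    return seq[start:end], qual[start:end]
```

For example, `trim_read("NNACGTACGT" + "AGATCGGAAGAGC", "I" * 23, "AGATCGGAAGAGC")` keeps only the eight bases `ACGTACGT` and their qualities.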

Page 5: Bacterial Pathogen Genomics at NCBI

WGS & Epidemiologically Relevant Distance (ERD)

• WGS allows high resolution genotypic comparison of pathogen isolates

• What is the epidemiological relevance of genotypic distance?

• Many methods to compute – we need some common principles…

Page 6: Bacterial Pathogen Genomics at NCBI

Since all approaches start with sequence reads, we must retain them for independent confirmation

[Figure: FDA-CFSAN, microbial foodborne pathogen research. SRA format bytes per sequenced base versus number of bases (millions) in MiSeq runs. Series: with qualities, without qualities.]

[Figure: Oxford University, Population Genomics of Mycobacterium tuberculosis. SRA format bytes per sequenced base versus number of bases (millions) in MiSeq and HiSeq runs. Series: with qualities, without qualities.]

Storage is manageable…

Page 7: Bacterial Pathogen Genomics at NCBI

Reliable, transparent, high throughput, high resolution ERDs?

Major challenge is to distinguish independent events (SNPs) from single events that generate multiple nucleotide differences, i.e. collapsed repeats and other artifacts, alignment errors (reference-based alignments), sequence quality, & recombination

Page 8: Bacterial Pathogen Genomics at NCBI

Fairly uniform distribution of differences along the two genomes…?

Cumulative count of differences

Page 9: Bacterial Pathogen Genomics at NCBI

Iterative density filtering (Richa Agarwala modification of Science. 2011 Jan 28;331(6016):430-4)
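
The slide does not spell out the filtering algorithm, so the following Python sketch only illustrates the general idea of iterative density filtering: differences packed into a short window are treated as one event and removed, the background rate is re-estimated, and the pass repeats until nothing changes. The function name, the window size, and the `factor` and `min_count` parameters are assumptions for illustration, not the method used at NCBI or in the cited paper.

```python
def density_filter(positions, genome_length, window=1000,
                   factor=5.0, min_count=3):
    """Iteratively drop difference positions that sit in dense clusters.

    Hypothetical parameters: a window is 'dense' when it holds at least
    max(min_count, factor * expected) differences, where 'expected' is
    the background count per window re-estimated on every pass.
    """
    kept = sorted(positions)
    while kept:
        expected = len(kept) * window / genome_length
        limit = max(min_count, factor * expected)
        dense = set()
        left = 0
        for right in range(len(kept)):
            # Shrink the window so it spans fewer than `window` bases.
            while kept[right] - kept[left] >= window:
                left += 1
            if right - left + 1 >= limit:
                dense.update(kept[left:right + 1])
        if not dense:
            break  # no dense window left: filtering has converged
        kept = [p for p in kept if p not in dense]
    return kept
```

With differences at 100, 5000, 5010, 5020, 5030, 5040, 20000 and 40000 on a 50,000-base genome, the five positions near 5000 are removed as a single dense cluster and the three isolated differences remain.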

Page 10: Bacterial Pathogen Genomics at NCBI
Page 11: Bacterial Pathogen Genomics at NCBI

Table: Samples currently processed (as of Sept 5, 2014) in NCBI Pathogen Pipeline, by organism

Center | Listeria | Salmonella | E. coli | Total
CDC | 903 | - | - | 903
FDA + State Partners* | 858 | 6129 | 307 | 7294
100K | - | 565 | 34 | 599
FERA | 14 | - | - | 14
Total | 1775 | 6694 | 341 | 8810

Processing Status

Page 12: Bacterial Pathogen Genomics at NCBI

How to measure the system?

Need the raw data (sequence reads) in unprocessed form, so that any read trimming/filtering, along with the assembly, can be regenerated

Page 13: Bacterial Pathogen Genomics at NCBI

Assembly metrics

map the reads back to the assembly and generate a profile of each position (coverage, alleles, qualities)

compare the assembly against other assemblies of the same organism (genus, species) and check the expected genome size, or similarity to related genomes

annotation metrics such as frameshifted proteins
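
The first metric, a per-position profile built from reads mapped back to the assembly, can be sketched as below. This is a simplified illustration: it assumes ungapped placements given as `(start, sequence)` pairs, whereas real BAM placements carry CIGAR strings and qualities. The function names are hypothetical.

```python
from collections import Counter


def position_profile(assembly_length, placements):
    """Count the alleles observed at each assembly position.

    placements: iterable of (start, read_sequence), 0-based and ungapped
    (an assumption made to keep the example short).
    """
    profile = [Counter() for _ in range(assembly_length)]
    for start, seq in placements:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < assembly_length:
                profile[pos][base] += 1
    return profile


def coverage(profile):
    """Coverage at each position is the total number of observed bases."""
    return [sum(counts.values()) for counts in profile]
```

Positions with low coverage, or with more than one well-supported allele, are the ones a quality profile would flag for review.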

Page 14: Bacterial Pathogen Genomics at NCBI

What is the actual measurement for sequence similarity?

the number of pairwise SNPs between two genomes

What is the threshold?

a pairwise distance (an observationally determined cutoff below which a cluster of 2 or more isolates are considered significantly close enough to warrant further investigation)
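
A minimal Python sketch of the two ideas above: counting pairwise SNPs between aligned genomes, and grouping isolates whose distance falls below a cutoff. The input format (equal-length aligned strings), the single-linkage grouping, and the function names are assumptions for illustration. The cutoff itself has to be determined observationally, as the slide says.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of aligned positions where both genomes have a different base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")


def clusters_below(seqs, cutoff):
    """Single-linkage clusters of 2+ isolates with pairwise distance < cutoff."""
    parent = {name: name for name in seqs}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) < cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)
```

Positions where either genome has a gap or an ambiguous base are skipped here, which is one of several possible conventions.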

Page 15: Bacterial Pathogen Genomics at NCBI

Sensitivity vs. Specificity

sequence clustering

sensitivity – fraction of isolates truly within the epidemiologically relevant distance that are placed in the cluster: true positives / (true positives + false negatives)

specificity – fraction of isolates truly outside the epidemiologically relevant distance that are excluded from the cluster: true negatives / (true negatives + false positives)
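
The two ratios can be computed directly once the true outbreak membership is known, for example from epidemiological investigation. The sketch below is illustrative only: the function name and the set-based inputs are assumptions.

```python
def sensitivity_specificity(all_isolates, clustered, truly_related):
    """Return (sensitivity, specificity) for one cluster call.

    all_isolates:  every isolate that was compared
    clustered:     isolates the sequence clustering put in the cluster
    truly_related: isolates that really belong to the outbreak
    """
    all_isolates = set(all_isolates)
    clustered = set(clustered) & all_isolates
    truly_related = set(truly_related) & all_isolates

    tp = len(clustered & truly_related)          # correctly included
    fn = len(truly_related - clustered)          # missed
    fp = len(clustered - truly_related)          # wrongly included
    tn = len(all_isolates - clustered - truly_related)  # correctly excluded

    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Tightening the distance cutoff generally raises specificity at the cost of sensitivity, and loosening it does the reverse.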

Page 16: Bacterial Pathogen Genomics at NCBI

Organism | Total Samples | Not expected species (1) | Mixed organisms | Less than 5X coverage | Duplicates | PacBio | Poor 2nd read | Failed assembly stage

Listeria 1775 20 2 (?) 1 5 1

Salmonella 6694 35 5 9 12

E. coli 341 8 1

1. not L. monocytogenes, S. enterica, or E. coli

Processing Problems

Page 17: Bacterial Pathogen Genomics at NCBI

PROBLEMS!

Page 18: Bacterial Pathogen Genomics at NCBI

Reference Materials

Page 19: Bacterial Pathogen Genomics at NCBI
Page 20: Bacterial Pathogen Genomics at NCBI

Streptococcus massiliensis 4401825 - CANO - GCA_000341525.1

Streptococcus massiliensis DSM 18628 - ARCE - GCA_000380065.1

Streptococcus intermedius BA1 - ANFT - GCA_000313655.1

Streptococcus intermedius B196 - - GCA_000463355.1

Streptococcus intermedius C270 - - GCA_000463385.1

Streptococcus intermedius F0413 - AFXO - GCA_000234035.1

Streptococcus intermedius SK54 - AJKN - GCA_000258445.1

Streptococcus intermedius JTH08 - - GCA_000306805.1

Streptococcus intermedius ATCC 27335 - ATFK - GCA_000413475.1

Streptococcus intermedius F0395 - AFXN - GCA_000234015.1

Streptococcus sp. AS20 - JANS - GCA_000524255.1

Streptococcus constellatus subsp. constellatus SK53 - AICQ - GCA_000257785.1

Streptococcus constellatus subsp. constellatus SK53 - BASU - GCA_000474075.1

Streptococcus constellatus subsp. pharyngis C1050 - - GCA_000463425.1

Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - AFUP - GCA_000223295.2

Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - BASX - GCA_000474135.1

Streptococcus constellatus subsp. pharyngis C232 - - GCA_000463395.1

Streptococcus constellatus subsp. pharyngis C818 - - GCA_000463445.1

Streptococcus anginosus SK1138 - ALJO - GCA_000287595.1

Streptococcus sp. CM7 - JATP - GCA_000526035.1

Streptococcus sp. OBRC6 - JACR - GCA_000517685.1

Streptococcus anginosus F0211 - AECT - GCA_000184365.2

Streptococcus anginosus 1505 - BASW - GCA_000474115.1

Streptococcus sp. ACC21 - JAQU - GCA_000524375.1

Streptococcus sp. AC15 - JDFJ - GCA_000565055.1

Streptococcus anginosus subsp. whileyi MAS624 - - GCA_000478925.1

Streptococcus anginosus subsp. whileyi CCUG 39159 - AICP - GCA_000257765.1

Streptococcus anginosus C238 - - GCA_000463505.1

Streptococcus anginosus DORA_7 - AZMF - GCA_000508545.1

Streptococcus anginosus 1_2_62CV - ADME - GCA_000186545.1

Streptococcus anginosus C1051 - - GCA_000463465.1

Streptococcus anginosus T5 - BASY - GCA_000474155.1

Streptococcus anginosus SK52 = DSM 20563 - AFIM - GCA_000214555.2

Streptococcus anginosus SK52 = DSM 20563 - AREF - GCA_000373605.1

Streptococcus anginosus SK52 = DSM 20563 - BAST - GCA_000474055.1

Streptococcus intermedius SK54 - BASV - GCA_000474095.1

Scale bar: 0.05

Page 21: Bacterial Pathogen Genomics at NCBI
Page 22: Bacterial Pathogen Genomics at NCBI

Escherichia coli KTE179 - ANYQ - GCA_000326485.1
Escherichia coli KTE229 - ANXK - GCA_000353165.1
Escherichia coli H252 - AEFI - GCA_000190895.1
Escherichia coli HVH 180 (4-3051617) - AVYH - GCA_000458685.1
Escherichia coli HVH 73 (4-2393174) - AVUX - GCA_000457025.1
Escherichia coli HVH 104 (4-6977960) - AVVT - GCA_000457455.1
Escherichia coli HVH 19 (4-7154984) - AVTL - GCA_000456265.1
Escherichia coli 908675 - AXTY - GCA_000488755.1
Escherichia coli HVH 127 (4-7303629) - AVWO - GCA_000457855.1
Escherichia coli HVH 12 (4-7653042) - AVTG - GCA_000494955.1
Escherichia coli KOEGE 32 (66a) - AWAD - GCA_000459635.1
Escherichia coli UMEA 3041-1 - AWAW - GCA_000460015.1
Escherichia coli HVH 148 (4-3192490) - AVXH - GCA_000495015.1
Escherichia coli HVH 59 (4-1119338) - AVUQ - GCA_000456885.1
Escherichia coli HVH 222 (4-2977443) - AVZU - GCA_000459455.1
Escherichia coli UMEA 3140-1 - AWBK - GCA_000460295.1
Escherichia coli HVH 178 (4-3189163) - AVYG - GCA_000495055.1
Escherichia coli KTE4 - ANSO - GCA_000350645.1
Escherichia coli KTE3 - ASTO - GCA_000407685.1
Escherichia coli KTE240 - ASUS - GCA_000408305.1
Escherichia coli BIDMC 49b - JAPT - GCA_000522365.1
Escherichia coli BIDMC 49a - JAPU - GCA_000522385.1
Escherichia coli APEC O1 - - GCA_000014845.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - BAIM - GCA_000613265.1
Escherichia coli JCM 20135 - BAKV - GCA_000614505.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - AGSE - GCA_000690815.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - JMST - GCA_000734955.1
Escherichia coli HVH 214 (4-3062198) - AZJN - GCA_000507665.1
Escherichia coli UMEA 3162-1 - AWBU - GCA_000460475.1
Escherichia coli HVH 191 (3-9341900) - AVYR - GCA_000458875.1
Escherichia coli HVH 170 (4-3026949) - AVYA - GCA_000458555.1
Escherichia coli S88 - - GCA_000026285.1
Escherichia coli UMEA 3893-1 - AWEI - GCA_000461775.1
Escherichia coli HVH 217 (4-1022806) - AVZQ - GCA_000459375.1
Escherichia coli KTE5 - ANSP - GCA_000350665.1
Escherichia coli KTE7 - ASTP - GCA_000407705.1
Escherichia coli HVH 32 (4-3773988) - AVTX - GCA_000456505.1
Escherichia coli UMEA 3206-1 - AWCK - GCA_000460795.1
Escherichia coli UMEA 3203-1 - AWCJ - GCA_000460775.1
Escherichia coli KTE62 - ANUK - GCA_000351605.1
Escherichia coli KTE27 - ASTY - GCA_000407885.1
Escherichia coli cloneA_i1 - AEYT - GCA_000233675.2
Escherichia coli 597 - AYQU - GCA_000503475.1
Escherichia coli HVH 203 (4-3126218) - AVZD - GCA_000459115.1
Escherichia coli UMEA 3702-1 - AWDZ - GCA_000461595.1
Escherichia coli UMEA 3662-1 - AWDU - GCA_000461495.1
Escherichia coli HVH 5 (4-7148410) - AVTB - GCA_000456085.1
Escherichia coli HVH 102 (4-6906788) - AVVR - GCA_000465155.1
Escherichia coli HVH 201 (4-4459431) - AVZB - GCA_000459075.1
Escherichia coli HM605 - AJWU - GCA_000264175.1
Escherichia coli HM605 - CADZ - GCA_000285375.1

Scale bar: 0.01

Page 23: Bacterial Pathogen Genomics at NCBI
Page 24: Bacterial Pathogen Genomics at NCBI
Page 25: Bacterial Pathogen Genomics at NCBI

http://www.ncbi.nlm.nih.gov/assembly/?term=%22anomalous%22[Properties]

Page 26: Bacterial Pathogen Genomics at NCBI
Page 27: Bacterial Pathogen Genomics at NCBI

Contamination (multiple organisms)

Page 28: Bacterial Pathogen Genomics at NCBI
Page 29: Bacterial Pathogen Genomics at NCBI

Assembly for sample SAMN02727350

Type | Number of contigs | Sum of contig lengths
Full assembly | 667 | 5251272
Contigs with Listeria hits | 37 | 3031650
Contigs with Staphylococcus hits | 630 | 2203573
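
One way to turn per-genus contig hits into a mixed-sample flag is to compare each genus's share of the assembly length, as in this Python sketch. The numbers are the ones from the table for SAMN02727350. The function names and the 5% minimum share are assumptions chosen for the example, not a documented pipeline threshold.

```python
def genus_fractions(total_length, hit_lengths):
    """Share of the assembly length attributed to each genus."""
    return {genus: length / total_length
            for genus, length in hit_lengths.items()}


def looks_mixed(fractions, min_fraction=0.05):
    """Flag a sample when two or more genera each hold a sizeable share.

    min_fraction is a hypothetical cutoff chosen for illustration.
    """
    return sum(1 for f in fractions.values() if f >= min_fraction) >= 2


fractions = genus_fractions(
    5251272,
    {"Listeria": 3031650, "Staphylococcus": 2203573},
)
mixed = looks_mixed(fractions)
```

Here Listeria accounts for a little under 58% of the assembled bases and Staphylococcus for about 42%, so the sample is flagged as a mixture of organisms.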

Page 30: Bacterial Pathogen Genomics at NCBI

Contamination (carryover contamination)

Page 31: Bacterial Pathogen Genomics at NCBI
Page 32: Bacterial Pathogen Genomics at NCBI

Contamination (multiple strains)

Page 33: Bacterial Pathogen Genomics at NCBI
Page 34: Bacterial Pathogen Genomics at NCBI

Table: Assembly stats for SAMN02693748

measurement | result
num_input_reads | 4212706
aligned_reads | 4040070
assembly_num_bases | 3180478
assembly_num_contigs | 50
assembly_N50 | 2817733
poor_quality_support_bases | 132321
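
For readers unfamiliar with the metrics in this table, the Python sketch below shows the standard N50 definition and two ratios derived from the reported values. The function name and the choice of ratios are illustrative assumptions, and the slide does not say how the pipeline itself computes these fields.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Ratios derived from the values reported for SAMN02693748.
aligned_fraction = 4040070 / 4212706       # aligned_reads / num_input_reads
poor_support_fraction = 132321 / 3180478   # poor_quality_support_bases / assembly_num_bases
```

About 96% of the input reads align back to the assembly, while roughly 4% of the assembled bases have poor quality support.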

Page 35: Bacterial Pathogen Genomics at NCBI
Page 36: Bacterial Pathogen Genomics at NCBI
Page 37: Bacterial Pathogen Genomics at NCBI
Page 38: Bacterial Pathogen Genomics at NCBI
Page 39: Bacterial Pathogen Genomics at NCBI
Page 40: Bacterial Pathogen Genomics at NCBI

Organism | Biosample | SRA Run | Similarity to:
Listeria monocytogenes IEH-NGS-LIS-00100 | SAMN02567873 | SRR1207486 | Listeria SLCC7179
 | | SRR1220750 | Listeria J0161
Salmonella enterica Enteritidis MDH-2014-00798 | SAMN02741943 | SRR1553852 | Schwarzengrund str. CVM19633
 | | SRR1272871 | Enteritidis str. P125109
Salmonella enterica Fluntern MDH-2013-00153 | SAMN02378158 | SRR1067624 | Javiana and Schwarzengrund
 | | SRR1395304 | Cubana and Agona

Page 41: Bacterial Pathogen Genomics at NCBI
Page 42: Bacterial Pathogen Genomics at NCBI

Proficiency Testing

• Replicate results (phylogeny, SNPs) from published studies

• Resequencing
  - same isolate on multiple platforms
  - same isolate in multiple libraries
  - same isolate in multiple labs

• Blinded submissions
  - already-characterized isolates
  - mixed sample isolates
  - metagenomic isolates

• Corner cases
  - Extreme coverage
  - Duplicates
  - Sample mixups

Page 43: Bacterial Pathogen Genomics at NCBI
Page 44: Bacterial Pathogen Genomics at NCBI
Page 45: Bacterial Pathogen Genomics at NCBI
Page 46: Bacterial Pathogen Genomics at NCBI
Page 47: Bacterial Pathogen Genomics at NCBI

Acknowledgements

National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA

Richa Agarwala, Azat Badretdin, Slava Brover, Joshua Cherry, Vyacheslav Chetvernin, Robert Cohen, Michael DiCuccio, Mike Feldgarden, Dan Haft, William Klimke, Arjun Prasad, Edward Rice, Kirill Rotmistrovskyy, Stephen Sherry, Sergey Shiryev, Martin Shumway, Tatiana Tatusova, Igor Tolstoy, Chunlin Xiao, Leonid Zaslavsky, Alexander Zasypkin, Alejandro A. Schaffer, Lukas Wagner, Aleksandr Morgulis

David Lipman, James Ostell

NCBI

• This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. http://www.ncbi.nlm.nih.gov

CDC

FDA/CFSAN

NHGRI

UC-Davis

USDA

Vendors: PacBio, Illumina, Roche