Bacterial Pathogen Genomics at NCBI
TRANSCRIPT
Our Current Model – Publicly available data
DATA ACQUISITION: National Network of Sequencers, International Network of Sequencers
DATA ASSEMBLY AND STORAGE and Analysis: NCBI, EMBL, DDBJ (INSDC) (Public Access Database)
Additional DATA ANALYSIS: FDA, USDA, CDC; State, Local and Foreign Public Health Agencies; Industry/Academia
Automated Bacterial Assembly
SRA Reads (sample 1)
Trim reads (Ns, adaptor)
Find closest reference genome(s) (Reference Distance tree)
De novo assembly panel: SOAPdenovo, GS-assembler (Newbler), MaSuRCA, Celera Assembler, SPAdes
Argo (Reference assisted assembly)
ArgoCA (Combined Assembly)
Reads remapped to combined assembly
Outputs: Contig fasta, Read placements (bam), Quality profile
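The "Trim reads (Ns, adaptor)" step can be illustrated with a minimal sketch. This is not the pipeline's actual trimmer: the function name `trim_read`, the exact-match adaptor search, and the example adaptor sequence are all assumptions made for illustration.

```python
def trim_read(seq: str, adaptor: str) -> str:
    """Clip a read at the first exact adaptor match, then strip flanking Ns.

    Hypothetical helper: real trimmers also tolerate mismatches and
    partial adaptor matches at the read end, which this sketch does not.
    """
    if adaptor:
        idx = seq.find(adaptor)
        if idx != -1:
            seq = seq[:idx]      # drop the adaptor and everything after it
    return seq.strip("N")        # remove leading and trailing N calls


# Toy read: two leading Ns, 8 bases of insert, an adaptor, two trailing Ns.
trimmed = trim_read("NNACGTACGTAGATCGGAAGAGCNN", "AGATCGGAAGAGC")
print(trimmed)  # ACGTACGT
```

Because trimming choices change downstream results, the untrimmed reads are what should be retained; the trimmed form can always be regenerated from them.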
WGS & Epidemiologically Relevant Distance (ERD)
• WGS allows high resolution genotypic comparison of pathogen isolates
• What is the epidemiological relevance of genotypic distance?
• Many methods to compute – we need some common principles…
Since all approaches start with sequence reads, we must retain them for independent confirmation
Figure: SRA format bytes per sequenced base versus number of bases (millions) in MiSeq runs. Data set: FDA-CFSAN: microbial foodborne pathogen research. Series: With Quality, Without Qualities.
Figure: SRA format bytes per sequenced base versus number of bases (millions) in MiSeq and HiSeq runs. Data set: Oxford University: Population Genomics of Mycobacterium tuberculosis. Series: With Quality, Without Quality.
Storage is manageable…
Reliable, transparent, high throughput, high resolution ERDs?
The major challenge is to distinguish independent events (SNPs) from single events that generate multiple nucleotide differences, i.e., collapsed repeats and other artifacts, alignment errors (reference-based alignments), sequence quality, and recombination.
Table: Samples currently processed (as of Sept 5, 2014) in NCBI Pathogen Pipeline

Center                  Listeria  Salmonella  E. coli  Total
CDC                          903           -        -    903
FDA + State Partners*        858        6129      307   7294
100K                           -         565       34    599
FERA                          14           -        -     14
Total                       1775        6694      341   8810
Processing Status
How to measure the system?
• Need the raw data (sequence reads) in unprocessed form
• Any read trimming/filtering, along with the assembly, can be regenerated
Assembly metrics
• Map the reads back to the assembly and generate a profile of each position (coverage, alleles, qualities)
• Compare the assembly against other assemblies of the same organism (genus, species) and check the expected genome size, or similarity to related genomes
• Annotation metrics such as frameshifted proteins
What is the actual measurement for sequence similarity?
• The number of pairwise SNPs between two genomes
What is the threshold?
• A pairwise distance (an observationally determined cutoff below which a cluster of 2 or more isolates are considered significantly close enough to warrant further investigation)
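The two ideas above, a pairwise SNP count and a distance cutoff that defines clusters, can be sketched together. This is an illustrative sketch only, not the NCBI pipeline: it assumes the genomes are already aligned to equal length, skips gap and N positions, treats "within cutoff" as less than or equal to the cutoff, and uses single-linkage grouping. The names `snp_distance`, `cluster_isolates`, and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two aligned sequences.

    Positions where either sequence has a gap ('-') or an N are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return sum(1 for x, y in zip(a, b)
               if x != y and x not in skip and y not in skip)


def cluster_isolates(seqs: Dict[str, str], cutoff: int) -> List[Set[str]]:
    """Single-linkage clusters of 2+ isolates with pairwise distance <= cutoff."""
    parent = {name: name for name in seqs}

    def find(n: str) -> str:
        # Follow parent links to the cluster representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for p, q in combinations(seqs, 2):
        if snp_distance(seqs[p], seqs[q]) <= cutoff:
            parent[find(p)] = find(q)

    groups: Dict[str, Set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) >= 2]


# Toy alignment: A, B, C differ by one SNP stepwise; D is distant.
isolates = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGTTCGA", "D": "TTTTACGT"}
print(cluster_isolates(isolates, cutoff=1))
```

With single linkage, A and C end up in the same cluster even though they differ by two SNPs, because each is within the cutoff of B. Whether that chaining behavior is desirable is one of the design choices a common set of principles would need to settle.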
Sensitivity vs. Specificity (sequence clustering)
• Sensitivity – measure of isolates which belong to the cluster within epidemiologically relevant distance: true positives / (true positives + false negatives), where false negatives are members not correctly identified
• Specificity – measure of isolates which are excluded from a cluster within epidemiologically relevant distance: true negatives / (true negatives + false positives)
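As a worked example of these two ratios, the sketch below compares a called cluster against a known (for instance, epidemiologically confirmed) cluster. The function name `sensitivity_specificity` and the toy isolate labels are hypothetical, and the sketch assumes both denominators are non-zero.

```python
def sensitivity_specificity(all_isolates, true_cluster, called_cluster):
    """Return (sensitivity, specificity) for one called cluster.

    all_isolates:   every isolate that was considered
    true_cluster:   isolates that really belong to the cluster
    called_cluster: isolates the clustering method placed in the cluster
    """
    all_isolates = set(all_isolates)
    true_cluster = set(true_cluster)
    called_cluster = set(called_cluster)

    tp = len(true_cluster & called_cluster)                 # correctly included
    fn = len(true_cluster - called_cluster)                 # missed members
    fp = len(called_cluster - true_cluster)                 # wrongly included
    tn = len(all_isolates - true_cluster - called_cluster)  # correctly excluded

    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 10 isolates, 4 true members, one missed (D) and one wrongly added (E).
sens, spec = sensitivity_specificity("ABCDEFGHIJ", "ABCD", "ABCE")
print(sens, spec)
```

Here 3 of 4 true members are recovered (sensitivity 0.75) and 5 of 6 non-members are excluded (specificity 5/6). Loosening the distance cutoff generally trades specificity for sensitivity.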
Columns: Organism | Total Samples | Not expected species (1) | Mixed organisms | Less than 5X coverage | Duplicates | PacBio | Poor 2nd read | Failed assembly stage
Listeria 1775 20 2 (?) 1 5 1
Salmonella 6694 35 5 9 12
E. coli 341 8 1
(1) not L. monocytogenes, S. enterica, or E. coli
Processing Problems
Figure: Tree of Streptococcus genome assemblies; scale bar 0.05. Leaf labels:
Streptococcus massiliensis 4401825 - CANO - GCA_000341525.1
Streptococcus massiliensis DSM 18628 - ARCE - GCA_000380065.1
Streptococcus intermedius BA1 - ANFT - GCA_000313655.1
Streptococcus intermedius B196 - - GCA_000463355.1
Streptococcus intermedius C270 - - GCA_000463385.1
Streptococcus intermedius F0413 - AFXO - GCA_000234035.1
Streptococcus intermedius SK54 - AJKN - GCA_000258445.1
Streptococcus intermedius JTH08 - - GCA_000306805.1
Streptococcus intermedius ATCC 27335 - ATFK - GCA_000413475.1
Streptococcus intermedius F0395 - AFXN - GCA_000234015.1
Streptococcus sp. AS20 - JANS - GCA_000524255.1
Streptococcus constellatus subsp. constellatus SK53 - AICQ - GCA_000257785.1
Streptococcus constellatus subsp. constellatus SK53 - BASU - GCA_000474075.1
Streptococcus constellatus subsp. pharyngis C1050 - - GCA_000463425.1
Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - AFUP - GCA_000223295.2
Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - BASX - GCA_000474135.1
Streptococcus constellatus subsp. pharyngis C232 - - GCA_000463395.1
Streptococcus constellatus subsp. pharyngis C818 - - GCA_000463445.1
Streptococcus anginosus SK1138 - ALJO - GCA_000287595.1
Streptococcus sp. CM7 - JATP - GCA_000526035.1
Streptococcus sp. OBRC6 - JACR - GCA_000517685.1
Streptococcus anginosus F0211 - AECT - GCA_000184365.2
Streptococcus anginosus 1505 - BASW - GCA_000474115.1
Streptococcus sp. ACC21 - JAQU - GCA_000524375.1
Streptococcus sp. AC15 - JDFJ - GCA_000565055.1
Streptococcus anginosus subsp. whileyi MAS624 - - GCA_000478925.1
Streptococcus anginosus subsp. whileyi CCUG 39159 - AICP - GCA_000257765.1
Streptococcus anginosus C238 - - GCA_000463505.1
Streptococcus anginosus DORA_7 - AZMF - GCA_000508545.1
Streptococcus anginosus 1_2_62CV - ADME - GCA_000186545.1
Streptococcus anginosus C1051 - - GCA_000463465.1
Streptococcus anginosus T5 - BASY - GCA_000474155.1
Streptococcus anginosus SK52 = DSM 20563 - AFIM - GCA_000214555.2
Streptococcus anginosus SK52 = DSM 20563 - AREF - GCA_000373605.1
Streptococcus anginosus SK52 = DSM 20563 - BAST - GCA_000474055.1
Streptococcus intermedius SK54 - BASV - GCA_000474095.1
Figure: Tree of Escherichia coli genome assemblies; scale bar 0.01. Leaf labels:
Escherichia coli KTE179 - ANYQ - GCA_000326485.1
Escherichia coli KTE229 - ANXK - GCA_000353165.1
Escherichia coli H252 - AEFI - GCA_000190895.1
Escherichia coli HVH 180 (4-3051617) - AVYH - GCA_000458685.1
Escherichia coli HVH 73 (4-2393174) - AVUX - GCA_000457025.1
Escherichia coli HVH 104 (4-6977960) - AVVT - GCA_000457455.1
Escherichia coli HVH 19 (4-7154984) - AVTL - GCA_000456265.1
Escherichia coli 908675 - AXTY - GCA_000488755.1
Escherichia coli HVH 127 (4-7303629) - AVWO - GCA_000457855.1
Escherichia coli HVH 12 (4-7653042) - AVTG - GCA_000494955.1
Escherichia coli KOEGE 32 (66a) - AWAD - GCA_000459635.1
Escherichia coli UMEA 3041-1 - AWAW - GCA_000460015.1
Escherichia coli HVH 148 (4-3192490) - AVXH - GCA_000495015.1
Escherichia coli HVH 59 (4-1119338) - AVUQ - GCA_000456885.1
Escherichia coli HVH 222 (4-2977443) - AVZU - GCA_000459455.1
Escherichia coli UMEA 3140-1 - AWBK - GCA_000460295.1
Escherichia coli HVH 178 (4-3189163) - AVYG - GCA_000495055.1
Escherichia coli KTE4 - ANSO - GCA_000350645.1
Escherichia coli KTE3 - ASTO - GCA_000407685.1
Escherichia coli KTE240 - ASUS - GCA_000408305.1
Escherichia coli BIDMC 49b - JAPT - GCA_000522365.1
Escherichia coli BIDMC 49a - JAPU - GCA_000522385.1
Escherichia coli APEC O1 - - GCA_000014845.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - BAIM - GCA_000613265.1
Escherichia coli JCM 20135 - BAKV - GCA_000614505.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - AGSE - GCA_000690815.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - JMST - GCA_000734955.1
Escherichia coli HVH 214 (4-3062198) - AZJN - GCA_000507665.1
Escherichia coli UMEA 3162-1 - AWBU - GCA_000460475.1
Escherichia coli HVH 191 (3-9341900) - AVYR - GCA_000458875.1
Escherichia coli HVH 170 (4-3026949) - AVYA - GCA_000458555.1
Escherichia coli S88 - - GCA_000026285.1
Escherichia coli UMEA 3893-1 - AWEI - GCA_000461775.1
Escherichia coli HVH 217 (4-1022806) - AVZQ - GCA_000459375.1
Escherichia coli KTE5 - ANSP - GCA_000350665.1
Escherichia coli KTE7 - ASTP - GCA_000407705.1
Escherichia coli HVH 32 (4-3773988) - AVTX - GCA_000456505.1
Escherichia coli UMEA 3206-1 - AWCK - GCA_000460795.1
Escherichia coli UMEA 3203-1 - AWCJ - GCA_000460775.1
Escherichia coli KTE62 - ANUK - GCA_000351605.1
Escherichia coli KTE27 - ASTY - GCA_000407885.1
Escherichia coli cloneA_i1 - AEYT - GCA_000233675.2
Escherichia coli 597 - AYQU - GCA_000503475.1
Escherichia coli HVH 203 (4-3126218) - AVZD - GCA_000459115.1
Escherichia coli UMEA 3702-1 - AWDZ - GCA_000461595.1
Escherichia coli UMEA 3662-1 - AWDU - GCA_000461495.1
Escherichia coli HVH 5 (4-7148410) - AVTB - GCA_000456085.1
Escherichia coli HVH 102 (4-6906788) - AVVR - GCA_000465155.1
Escherichia coli HVH 201 (4-4459431) - AVZB - GCA_000459075.1
Escherichia coli HM605 - AJWU - GCA_000264175.1
Escherichia coli HM605 - CADZ - GCA_000285375.1
Assembly for sample SAMN02727350
Type                               Number of contigs  Sum of contig lengths
Full assembly                                    667                5251272
contigs with Listeria hits                        37                3031650
contigs with Staphylococcus hits                 630                2203573
Table: Assembly stats for SAMN02693748

measurement                  result
num_input_reads              4212706
aligned_reads                4040070
assembly_num_bases           3180478
assembly_num_contigs         50
assembly_N50                 2817733
poor_quality_support_bases   132321
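For readers unfamiliar with the metrics in this table, here is a small sketch of how two of them can be derived. It uses the standard definition of N50 (the contig length at which the running total of contig lengths, sorted longest first, reaches half of the assembly) and a simple aligned-read fraction. The function names and the toy contig lengths are hypothetical; only the read counts come from the table.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (standard definition)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:   # reached half of the assembled bases
            return length
    raise ValueError("no contigs supplied")


def aligned_fraction(aligned_reads, num_input_reads):
    """Fraction of input reads that map back to the assembly."""
    return aligned_reads / num_input_reads


# Toy contig lengths (not from the slide): total 290, half is 145.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70

# Read counts taken from the SAMN02693748 table.
print(round(aligned_fraction(4040070, 4212706), 3))  # 0.959
```

An N50 close to the total assembly size, as in this table, means a single very long contig accounts for most of the assembled bases.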
Organism                                         Biosample     SRA Run     Similarity to:
Listeria monocytogenes IEH-NGS-LIS-00100         SAMN02567873  SRR1207486  Listeria SLCC7179
                                                               SRR1220750  Listeria J0161
Salmonella enterica Enteritidis MDH-2014-00798   SAMN02741943  SRR1553852  Schwarzengrund str. CVM19633
                                                               SRR1272871  Enteritidis str. P125109
Salmonella enterica Fluntern MDH-2013-00153      SAMN02378158  SRR1067624  Javiana and Schwarzengrund
                                                               SRR1395304  Cubana and Agona
Proficiency Testing
• Replicate results (phylogeny, SNPs) from published studies
• Resequencing
  ✓ same isolate on multiple platforms
  ✓ same isolate in multiple libraries
  ✓ same isolate in multiple labs
• Blinded submissions
  ✓ already-characterized isolates
  ✓ mixed sample isolates
  ✓ metagenomic isolates
• Corner cases
  ✓ Extreme coverage
  ✓ Duplicates
  ✓ Sample mixups
Acknowledgements
National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA
Richa Agarwala, Azat Badretdin, Slava Brover, Joshua Cherry, Vyacheslav Chetvernin, Robert Cohen, Michael DiCuccio, Mike Feldgarden, Dan Haft, William Klimke, Arjun Prasad, Edward Rice, Kirill Rotmistrovskyy, Stephen Sherry, Sergey Shiryev, Martin Shumway, Tatiana Tatusova, Igor Tolstoy, Chunlin Xiao, Leonid Zaslavsky, Alexander Zasypkin, Alejandro A. Schaffer, Lukas Wagner, Aleksandr Morgulis, David Lipman, James Ostell
NCBI
• This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. http://www.ncbi.nlm.nih.gov
CDC FDA/CFSAN NIHGRI UC-Davis USDA Vendors: PacBio, Illumina, Roche