Bacterial Pathogen Genomics at NCBI

Upload: nathan-olson

Post on 17-Jul-2015



Our Current Model: Publicly available data

• DATA ACQUISITION: National Network of Sequencers, International Network of Sequencers
• DATA ASSEMBLY AND STORAGE and Analysis: NCBI, EMBL, DDBJ (INSDC) (Public Access Database)
• Additional DATA ANALYSIS: FDA, USDA, CDC; State, Local and Foreign Public Health Agencies; Industry/Academia

Automated Bacterial Assembly

• SRA reads (sample 1)
• Trim reads (Ns, adaptor)
• Find closest reference genome(s) (reference distance tree)
• De novo assembly panel: SOAPdenovo, GS-assembler (Newbler), MaSuRCA, Celera Assembler, SPAdes
• Argo (reference-assisted assembly)
• ArgoCA (combined assembly)
• Reads remapped to combined assembly
• Outputs: contig FASTA, read placements (BAM), quality profile
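The read-trimming step (Ns, adaptor) can be illustrated with a minimal Python sketch. This is not the pipeline's actual trimmer: the function name `trim_read`, the `min_overlap` parameter, and the exact-match adapter search are all assumptions made for illustration, and the adapter string in the usage note is only an example value.

```python
def trim_read(seq, qual, adapter, min_overlap=5):
    """Trim a 3' adapter and flanking Ns from one read (illustrative only).

    seq and qual are equal-length strings; adapter is the adapter sequence.
    Only exact matches are considered, which real trimmers relax.
    """
    cut = seq.find(adapter)  # full adapter anywhere in the read
    if cut == -1:
        # Otherwise look for a partial adapter (a prefix of it) at the read end.
        max_k = min(len(adapter) - 1, len(seq))
        for k in range(max_k, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut != -1:
        seq, qual = seq[:cut], qual[:cut]

    # Strip leading and trailing N bases, keeping qualities in step.
    start = 0
    while start < len(seq) and seq[start] in "Nn":
        start += 1
    end = len(seq)
    while end > start and seq[end - 1] in "Nn":
        end -= 1
    return seq[start:end], qual[start:end]
```

For example, `trim_read("NNACGTACGTAGATCGGAAG", "I" * 20, "AGATCGGAAGAGC")` removes the ten-base partial adapter at the 3' end and the two leading Ns. Production trimmers also tolerate mismatches and use base qualities, which this sketch ignores.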

WGS & Epidemiologically Relevant Distance (ERD)

• WGS allows high resolution genotypic comparison of pathogen isolates

• What is the epidemiological relevance of genotypic distance?

• Many methods to compute; we need some common principles…

Since all approaches start with sequence reads, we must retain them for independent confirmation

[Figure: SRA format bytes per sequenced base versus number of bases (millions) in MiSeq runs, with and without qualities. FDA-CFSAN: microbial foodborne pathogen research]

[Figure: SRA format bytes per sequenced base versus number of bases (millions) in MiSeq and HiSeq runs, with and without qualities. Oxford University: Population Genomics of Mycobacterium tuberculosis]

Storage is manageable…

Reliable, transparent, high throughput, high resolution ERDs?

The major challenge is to distinguish independent events (SNPs) from single events that generate multiple nucleotide differences, such as collapsed repeats and other artifacts, alignment errors (in reference-based alignments), poor sequence quality, and recombination

[Figure: cumulative count of differences along the genome] Fairly uniform distribution of differences along the two genomes…?

Iterative density filtering (Richa Agarwala modification of Science. 2011 Jan 28;331(6016):430-4)
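The slide names iterative density filtering without describing it, so the following is only a generic sketch of the idea, not Richa Agarwala's modification and not the method of the cited Science paper. It assumes that differences packed far more densely than a uniform background rate would predict come from single events, and it removes them until nothing dense remains. The function name and the `window`, `factor`, and `min_count` parameters are hypothetical.

```python
def density_filter(positions, genome_length, window=1000, factor=5.0, min_count=3):
    """Iteratively drop differences that sit in unusually dense windows.

    positions: genome coordinates of pairwise differences.
    A window is "dense" when it holds at least max(min_count,
    factor * expected) differences, where expected is the count a uniform
    distribution of the remaining differences would put in one window.
    The cutoff is recomputed after each round because removing dense
    clusters lowers the background rate.
    """
    kept = sorted(set(positions))
    while kept:
        expected = len(kept) * window / genome_length
        cutoff = max(min_count, factor * expected)
        dense = set()
        left = 0
        for right in range(len(kept)):
            # Shrink the window until it spans fewer than `window` bases.
            while kept[right] - kept[left] >= window:
                left += 1
            if right - left + 1 >= cutoff:
                dense.update(kept[left:right + 1])
        if not dense:
            break
        kept = [p for p in kept if p not in dense]
    return kept
```

With differences at 100, 50000, 120000, 400000 and a tight cluster at 200000 to 200030 on a 500 kb genome, the cluster is removed and the four scattered differences are kept as candidate independent SNPs.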

Processing Status

Table: Samples currently processed in the NCBI Pathogen Pipeline (as of Sept 5, 2014)

Center                   Listeria   Salmonella   E. coli   Total
CDC                           903                            903
FDA + State Partners*         858         6129       307    7294
100K                                       565        34     599
FERA                           14                             14
Total                        1775         6694       341    8810

How to measure the system?
• We need the raw data (sequence reads) in unprocessed form
• Any read trimming/filtering, along with the assembly, can then be regenerated

Assembly metrics
• Map the reads back to the assembly and generate a profile of each position (coverage, alleles, qualities)
• Compare the assembly against other assemblies of the same organism (genus, species) and check the expected genome size, or similarity to related genomes
• Annotation metrics such as frameshifted proteins

What is the actual measurement for sequence similarity?
• The number of pairwise SNPs between two genomes

What is the threshold?
• A pairwise distance: an observationally determined cutoff below which a cluster of 2 or more isolates is considered close enough to warrant further investigation
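The measurement and threshold described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the genomes are assumed to be already aligned to equal length, positions with ambiguous bases or gaps are skipped, isolates are grouped by single linkage, and a pair is linked when its distance is at or below the cutoff. All function names are hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count aligned positions where both sequences have A/C/G/T and differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in valid and y in valid and x != y
    )


def cluster_by_threshold(aligned, threshold):
    """Single-linkage clusters of isolates whose pairwise SNP distance
    is at or below `threshold`. `aligned` maps isolate name to sequence.
    Only clusters of 2 or more isolates are returned."""
    parent = {name: name for name in aligned}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(aligned, 2):
        if snp_distance(aligned[a], aligned[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in aligned:
        clusters.setdefault(find(name), set()).add(name)
    return [members for members in clusters.values() if len(members) >= 2]
```

Single linkage means two isolates can share a cluster while being further apart than the cutoff, as long as a chain of closer isolates connects them. Whether that is desirable is exactly the kind of question the threshold discussion raises.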

Sensitivity vs. Specificity (sequence clustering)
• Sensitivity: the proportion of isolates that belong to the cluster (within the epidemiologically relevant distance) that are correctly placed in it = true positives / (true positives + false negatives)
• Specificity: the proportion of isolates that do not belong to the cluster that are correctly excluded from it = true negatives / (true negatives + false positives)
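As a worked illustration of the two ratios, the sketch below compares a predicted cluster against a known outbreak set. The function name and the isolate labels are hypothetical, and the "truth" set stands in for epidemiological evidence that would be gathered independently.

```python
def sensitivity_specificity(predicted, truth, all_isolates):
    """Return (sensitivity, specificity) for one predicted cluster.

    predicted: isolates placed in the cluster by sequence distance.
    truth: isolates that really belong to the cluster.
    all_isolates: every isolate that was compared.
    """
    tp = len(predicted & truth)                 # correctly included
    fn = len(truth - predicted)                 # missed members
    fp = len(predicted - truth)                 # wrongly included
    tn = len(all_isolates - predicted - truth)  # correctly excluded
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


isolates = set("abcdefghij")
sens, spec = sensitivity_specificity({"a", "b", "c", "e"}, {"a", "b", "c", "d"}, isolates)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Here three of four true members are recovered (sensitivity 0.75) and five of six non-members are excluded (specificity about 0.83). Lowering the distance cutoff generally trades sensitivity for specificity.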

Processing Problems

Columns: Organism, Total samples, Not expected species (1), Mixed organisms, Less than 5X coverage, Duplicates, PacBio, Poor 2nd read, Failed assembly stage

Listeria     1775   20   2 (?)   1   5   1
Salmonella   6694   35   5   9   12
E. coli      341    8    1

(1) Not L. monocytogenes, S. enterica, or E. coli

PROBLEMS!  

Reference  Materials  

[Figure: Streptococcus tree, scale bar 0.05. Leaf labels:]

Streptococcus massiliensis 4401825 - CANO - GCA_000341525.1
Streptococcus massiliensis DSM 18628 - ARCE - GCA_000380065.1
Streptococcus intermedius BA1 - ANFT - GCA_000313655.1
Streptococcus intermedius B196 - - GCA_000463355.1
Streptococcus intermedius C270 - - GCA_000463385.1
Streptococcus intermedius F0413 - AFXO - GCA_000234035.1
Streptococcus intermedius SK54 - AJKN - GCA_000258445.1
Streptococcus intermedius JTH08 - - GCA_000306805.1
Streptococcus intermedius ATCC 27335 - ATFK - GCA_000413475.1
Streptococcus intermedius F0395 - AFXN - GCA_000234015.1
Streptococcus sp. AS20 - JANS - GCA_000524255.1
Streptococcus constellatus subsp. constellatus SK53 - AICQ - GCA_000257785.1
Streptococcus constellatus subsp. constellatus SK53 - BASU - GCA_000474075.1
Streptococcus constellatus subsp. pharyngis C1050 - - GCA_000463425.1
Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - AFUP - GCA_000223295.2
Streptococcus constellatus subsp. pharyngis SK1060 = CCUG 46377 - BASX - GCA_000474135.1
Streptococcus constellatus subsp. pharyngis C232 - - GCA_000463395.1
Streptococcus constellatus subsp. pharyngis C818 - - GCA_000463445.1
Streptococcus anginosus SK1138 - ALJO - GCA_000287595.1
Streptococcus sp. CM7 - JATP - GCA_000526035.1
Streptococcus sp. OBRC6 - JACR - GCA_000517685.1
Streptococcus anginosus F0211 - AECT - GCA_000184365.2
Streptococcus anginosus 1505 - BASW - GCA_000474115.1
Streptococcus sp. ACC21 - JAQU - GCA_000524375.1
Streptococcus sp. AC15 - JDFJ - GCA_000565055.1
Streptococcus anginosus subsp. whileyi MAS624 - - GCA_000478925.1
Streptococcus anginosus subsp. whileyi CCUG 39159 - AICP - GCA_000257765.1
Streptococcus anginosus C238 - - GCA_000463505.1
Streptococcus anginosus DORA_7 - AZMF - GCA_000508545.1
Streptococcus anginosus 1_2_62CV - ADME - GCA_000186545.1
Streptococcus anginosus C1051 - - GCA_000463465.1
Streptococcus anginosus T5 - BASY - GCA_000474155.1
Streptococcus anginosus SK52 = DSM 20563 - AFIM - GCA_000214555.2
Streptococcus anginosus SK52 = DSM 20563 - AREF - GCA_000373605.1
Streptococcus anginosus SK52 = DSM 20563 - BAST - GCA_000474055.1
Streptococcus intermedius SK54 - BASV - GCA_000474095.1

[Figure: Escherichia coli tree, scale bar 0.01. Leaf labels:]

Escherichia coli KTE179 - ANYQ - GCA_000326485.1
Escherichia coli KTE229 - ANXK - GCA_000353165.1
Escherichia coli H252 - AEFI - GCA_000190895.1
Escherichia coli HVH 180 (4-3051617) - AVYH - GCA_000458685.1
Escherichia coli HVH 73 (4-2393174) - AVUX - GCA_000457025.1
Escherichia coli HVH 104 (4-6977960) - AVVT - GCA_000457455.1
Escherichia coli HVH 19 (4-7154984) - AVTL - GCA_000456265.1
Escherichia coli 908675 - AXTY - GCA_000488755.1
Escherichia coli HVH 127 (4-7303629) - AVWO - GCA_000457855.1
Escherichia coli HVH 12 (4-7653042) - AVTG - GCA_000494955.1
Escherichia coli KOEGE 32 (66a) - AWAD - GCA_000459635.1
Escherichia coli UMEA 3041-1 - AWAW - GCA_000460015.1
Escherichia coli HVH 148 (4-3192490) - AVXH - GCA_000495015.1
Escherichia coli HVH 59 (4-1119338) - AVUQ - GCA_000456885.1
Escherichia coli HVH 222 (4-2977443) - AVZU - GCA_000459455.1
Escherichia coli UMEA 3140-1 - AWBK - GCA_000460295.1
Escherichia coli HVH 178 (4-3189163) - AVYG - GCA_000495055.1
Escherichia coli KTE4 - ANSO - GCA_000350645.1
Escherichia coli KTE3 - ASTO - GCA_000407685.1
Escherichia coli KTE240 - ASUS - GCA_000408305.1
Escherichia coli BIDMC 49b - JAPT - GCA_000522365.1
Escherichia coli BIDMC 49a - JAPU - GCA_000522385.1
Escherichia coli APEC O1 - - GCA_000014845.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - BAIM - GCA_000613265.1
Escherichia coli JCM 20135 - BAKV - GCA_000614505.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - AGSE - GCA_000690815.1
Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 - JMST - GCA_000734955.1
Escherichia coli HVH 214 (4-3062198) - AZJN - GCA_000507665.1
Escherichia coli UMEA 3162-1 - AWBU - GCA_000460475.1
Escherichia coli HVH 191 (3-9341900) - AVYR - GCA_000458875.1
Escherichia coli HVH 170 (4-3026949) - AVYA - GCA_000458555.1
Escherichia coli S88 - - GCA_000026285.1
Escherichia coli UMEA 3893-1 - AWEI - GCA_000461775.1
Escherichia coli HVH 217 (4-1022806) - AVZQ - GCA_000459375.1
Escherichia coli KTE5 - ANSP - GCA_000350665.1
Escherichia coli KTE7 - ASTP - GCA_000407705.1
Escherichia coli HVH 32 (4-3773988) - AVTX - GCA_000456505.1
Escherichia coli UMEA 3206-1 - AWCK - GCA_000460795.1
Escherichia coli UMEA 3203-1 - AWCJ - GCA_000460775.1
Escherichia coli KTE62 - ANUK - GCA_000351605.1
Escherichia coli KTE27 - ASTY - GCA_000407885.1
Escherichia coli cloneA_i1 - AEYT - GCA_000233675.2
Escherichia coli 597 - AYQU - GCA_000503475.1
Escherichia coli HVH 203 (4-3126218) - AVZD - GCA_000459115.1
Escherichia coli UMEA 3702-1 - AWDZ - GCA_000461595.1
Escherichia coli UMEA 3662-1 - AWDU - GCA_000461495.1
Escherichia coli HVH 5 (4-7148410) - AVTB - GCA_000456085.1
Escherichia coli HVH 102 (4-6906788) - AVVR - GCA_000465155.1
Escherichia coli HVH 201 (4-4459431) - AVZB - GCA_000459075.1
Escherichia coli HM605 - AJWU - GCA_000264175.1
Escherichia coli HM605 - CADZ - GCA_000285375.1

http://www.ncbi.nlm.nih.gov/assembly/?term=%22anomalous%22[Properties]

Contamination (multiple organisms)

Assembly for sample SAMN02727350

Type                               Number of contigs   Sum of contig lengths
Full assembly                                    667                 5251272
contigs with Listeria hits                        37                 3031650
contigs with Staphylococcus hits                 630                 2203573

Contamination (carryover contamination)

Contamination (multiple strains)

Table: Assembly stats for SAMN02693748

measurement                   result
num_input_reads               4212706
aligned_reads                 4040070
assembly_num_bases            3180478
assembly_num_contigs          50
assembly_N50                  2817733
poor_quality_support_bases    132321
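Two of the ratios implied by the SAMN02693748 stats, and the usual definition of N50, can be sketched as follows. The counts are copied from the table above; the contig lengths passed to `n50` in the usage line are made up for illustration, since the slide does not list them, and the function name is hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Values reported for SAMN02693748.
num_input_reads = 4212706
aligned_reads = 4040070
assembly_num_bases = 3180478
poor_quality_support_bases = 132321

aligned_fraction = aligned_reads / num_input_reads
poor_fraction = poor_quality_support_bases / assembly_num_bases

print(f"reads aligned back to assembly: {aligned_fraction:.1%}")
print(f"bases with poor quality support: {poor_fraction:.1%}")
print("N50 of toy contigs:", n50([2, 3, 4, 5, 6]))
```

About 96% of reads map back to the assembly, while roughly 4% of assembled bases have poor quality support. The slide presents this sample under the contamination headings; what fraction should count as a warning sign is not stated there.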

Organism                                          Biosample       SRA Run      Similarity to:
Listeria monocytogenes IEH-NGS-LIS-00100          SAMN02567873    SRR1207486   Listeria SLCC7179
                                                                  SRR1220750   Listeria J0161
Salmonella enterica Enteritidis MDH-2014-00798    SAMN02741943    SRR1553852   Schwarzengrund str. CVM19633
                                                                  SRR1272871   Enteritidis str. P125109
Salmonella enterica Fluntern MDH-2013-00153       SAMN02378158    SRR1067624   Javiana and Schwarzengrund
                                                                  SRR1395304   Cubana and Agona

Proficiency Testing

• Replicate results (phylogeny, SNPs) from published studies
• Resequencing
  ✓ same isolate on multiple platforms
  ✓ same isolate in multiple libraries
  ✓ same isolate in multiple labs
• Blinded submissions
  ✓ already-characterized isolates
  ✓ mixed sample isolates
  ✓ metagenomic isolates
• Corner cases
  ✓ Extreme coverage
  ✓ Duplicates
  ✓ Sample mixups

 

Acknowledgements  

National Center for Biotechnology Information – National Library of Medicine – Bethesda MD 20892 USA

Richa Agarwala, Azat Badretdin, Slava Brover, Joshua Cherry, Vyacheslav Chetvernin, Robert Cohen, Michael DiCuccio, Mike Feldgarden, Dan Haft, William Klimke, Arjun Prasad, Edward Rice, Kirill Rotmistrovskyy, Stephen Sherry, Sergey Shiryev, Martin Shumway, Tatiana Tatusova, Igor Tolstoy, Chunlin Xiao, Leonid Zaslavsky, Alexander Zasypkin, Alejandro A. Schaffer, Lukas Wagner, Aleksandr Morgulis, David Lipman, James Ostell

NCBI

•  This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. http://www.ncbi.nlm.nih.gov

CDC, FDA/CFSAN, NHGRI, UC-Davis, USDA. Vendors: PacBio, Illumina, Roche