Sept 9, 2018
A metagenomic tool for cheese ecosystems
Anne-Laure Abraham, Quentin Cavaillé, Thibaut Guirimand, Sandra Dérozier, Charlie Pauvert, Mahendra Mariadassou, Bedis Dridi, Valentin Loux, Pierre Renault
Jouy en Josas – France
RCAM 2018 2
Cheesemaking
Evolution of the ecosystem during cheese making
Inoculated microorganisms
House microbiota
Starters
Microorganisms from: animal milk,
water flows, air flows
Microorganisms from salt
Ripening cultures
Microorganisms from shelves, cellar
Microorganisms: bacteria, yeasts, fungi, phages
Properties of cheese microorganisms
Starters
Microorganisms from: animal milk,
water flows, air flows
Microorganisms from salt
Ripening cultures
Microorganisms from shelves, cellar
Organoleptic properties
Acid flavor
Fruity flavor
Formation of bubbles
Production of lactic acid, carbon dioxide, alcohol, aldehydes, ketones …
Coat texture
Coat color
Knowledge of cheese microorganisms
Starters
Microorganisms from: animal milk,
water flows, air flows
Microorganisms from salt
Ripening cultures
Microorganisms from shelves, cellar
Defined starter cultures
Undefined complex starters
Not completely known
more vulnerable to bacteriophage attack
Known
“domesticated cultures”
Inoculated microorganisms
House microbiota
Not completely known
Why study cheese ecosystems?
Protect functional properties of strains
Identify origin of organoleptic properties of strains
Quality control
Follow ecosystem during cheese manufacturing
Compare production lines
Study strain diversity
Major reduction in the diversity of microorganisms due to sanitary pressure & intensification of production
Food microbiomes project
• Project with academic and dairy-industry partners
• Use metagenomics to achieve a better understanding of cheese ecosystems
• Develop a user-friendly tool to analyze cheese samples
• Characteristics of cheese ecosystems
• Few species (a few dozen)
• More than 4000 sequenced dairy genomes: ≥ 1 genome for most species
• Needs
• Precise taxonomic assignment (strain level)
• Identification of low-abundance species
• Identification of genes (and their functions)
• A user-friendly interface for non-bioinformaticians
• A database with dairy genomes
• Results easy to understand
• Public / private genomes & metagenomes
Shotgun metagenomic taxonomic assignment
Methods based on k-mers or the Burrows–Wheeler transform
• Kraken (Wood, 2014)
• CLARK (Ounit, 2015)
• Kaiju (Menzel, 2015)
• Centrifuge (Kim, 2016)
Methods based on genomes/contigs mapping
• Sigma (Ahn, Bioinformatics, 2015)
• MicrobeGPS (Lindner, PLoS ONE, 2015)
• DESMAN (Quince, Genome Biol, 2017)
• MetaSNV (Costea, PLoS ONE, 2017)
Methods based on marker genes
• ConStrains (Luo, Nat Biotech, 2015)
• metaMLST (Zolfo, NAR, 2017)
• StrainPhlAn (Truong, Genome Research, 2017)
Limited taxonomic assignment precision vs. precise taxonomic assignment
Fast, large database vs. slow, limited database
Identification of strain-level variation
Metagenomic alignment
[Diagram: the ecosystem is sequenced and reads are aligned to reference genomes]
Mismatches
Unaligned reads
Sequencing errors & absence of a good reference genome
Choice of alignment parameters
Metagenomic alignment
[Diagram: the ecosystem is sequenced and reads are aligned to reference genomes]
Regions with high read coverage
Repeated regions
Transposable elements
Conserved regions
Heterogeneous sequencing depth
Low abundance / High abundance
Choice of cleaning applied to alignment results
Coverage of genomes
Close strain – intermediate abundance
Absent strain
Very close strain – high abundance
Characteristics of alignment
Software: Bowtie (Langmead, Genome Biology 2009)
• 3 mismatches allowed (-v)
• If several best hits, choose one randomly (-a --best --strata -M 1)
[Diagram: CDS along a genome, some marked as filtered]
Select reads that align on CDS
Filter some CDS:
• Annotated: integrase, transposases, IS, phage
• Length < 300 nt
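The CDS filter described on this slide can be sketched as follows. This is a minimal illustration, not the tool's actual code: the `CDS` record, the function names and the idea of matching keywords against the product annotation are all assumptions; only the keyword list and the 300 nt threshold come from the slide.

```python
import re
from dataclasses import dataclass

MIN_CDS_LENGTH = 300  # nt, threshold given on the slide

# Keywords from the slide. Matching them against the product text is an assumption.
MOBILE_KEYWORDS = re.compile(r"integrase|transposase|phage", re.IGNORECASE)
# "IS" (insertion sequence) is matched case-sensitively to avoid hitting the word "is".
INSERTION_SEQUENCE = re.compile(r"\bIS\d*\b")


@dataclass
class CDS:
    """Hypothetical minimal CDS record (coordinates are 1-based, inclusive, as in GFF)."""
    name: str
    start: int
    end: int
    product: str

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def keep_cds(cds: CDS) -> bool:
    """Return True if the CDS passes both the length and the annotation filter."""
    if cds.length < MIN_CDS_LENGTH:
        return False
    if MOBILE_KEYWORDS.search(cds.product) or INSERTION_SEQUENCE.search(cds.product):
        return False
    return True


def filter_cds(records):
    """Keep only the CDS on which aligned reads will be counted."""
    return [cds for cds in records if keep_cds(cds)]


example = [
    CDS("dnaA", 1, 1350, "chromosomal replication initiator protein DnaA"),
    CDS("tnp", 2000, 3200, "IS30 family transposase"),
    CDS("short", 4000, 4200, "hypothetical protein"),
    CDS("int", 5000, 6100, "site-specific integrase"),
    CDS("cap", 7000, 8000, "phage capsid protein"),
    CDS("lacZ", 9000, 12000, "beta-galactosidase"),
]
kept_names = [cds.name for cds in filter_cds(example)]
```

Filtering on annotation text is fragile (it depends on how products were named in the GFF), so a real pipeline would likely need a curated keyword list.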
Characteristics of mapping
Samtools & bedtools:
• Identify variant positions
• VCF file
Compute expected coverage
• Fraction of genome that should be covered by at least one read if the genome is present
• Lander & Waterman statistics
$$C = 1 - \exp\left(-\frac{\text{ReadNumber} \times \text{ReadLength}}{\text{GenomeLength}}\right)$$
[Plot: observed distribution vs. expected distribution]
htslib.org
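As a worked example of the Lander & Waterman expectation, the fraction of a genome expected to be covered by at least one read can be computed directly from the three quantities named on the slide. The function name is hypothetical, and the sketch assumes reads fall uniformly at random on the genome.

```python
import math


def expected_covered_fraction(read_number: int, read_length: int, genome_length: int) -> float:
    """Expected fraction of positions covered by at least one read.

    Implements C = 1 - exp(-ReadNumber * ReadLength / GenomeLength),
    assuming reads are placed uniformly at random on the genome.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    depth = read_number * read_length / genome_length
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small depths.
    return -math.expm1(-depth)


# 10,000 reads of 100 nt on a 2,000,000 nt genome: mean depth 0.5
fraction = expected_covered_fraction(10_000, 100, 2_000_000)
```

With a mean depth of 0.5, roughly 39% of positions are expected to be covered. An observed fraction far below this suggests that the reads pile up on a few regions rather than coming from the whole genome.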
Schema
Reference genomes database (GenBank) => Reference creation => Genome indexes + Gene annotations (GFF)
Metagenome (FASTQ) + Genome indexes => Alignment (bowtie) => Reads alignment (BAM)
Reads alignment (BAM) + Gene annotations (GFF) => Summary (Samtools – Bedtools) => Summary for each genome (CSV) + Reads for each CDS (GFF)
Software output
Summary for each genome (CSV)
• Genome name
• CDS number
• % CDS with at least 1 read
• Mean, median, sd coverage
• Number of variant positions
• % positions covered by reads
• Expected % positions covered by reads (Lander & Waterman)
Reads for each CDS (GFF)
• CDS localization
• CDS name & product
• CDS length, length covered by reads & number of positions with mismatches
• CDS coverage
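To make the per-genome summary concrete, here is a minimal sketch of how the coverage columns could be derived from a per-position depth vector (for example, one depth value per retained position). The function name, the dictionary keys and the use of the population standard deviation are assumptions, not the tool's actual output format.

```python
import math
import statistics


def summarize_genome(depths, read_number, read_length):
    """Compute coverage statistics for one genome from per-position read depths."""
    if not depths:
        raise ValueError("depths must not be empty")
    genome_length = len(depths)
    covered = sum(1 for depth in depths if depth > 0)
    # Lander & Waterman expectation: 1 - exp(-ReadNumber * ReadLength / GenomeLength)
    expected = -math.expm1(-read_number * read_length / genome_length)
    return {
        "mean_coverage": statistics.fmean(depths),
        "median_coverage": statistics.median(depths),
        "sd_coverage": statistics.pstdev(depths),  # population sd (assumption)
        "pct_positions_covered": 100.0 * covered / genome_length,
        "expected_pct_positions_covered": 100.0 * expected,
    }


# Toy genome of 10 positions hit by 4 reads of 4 nt
summary = summarize_genome([0, 0, 2, 2, 2, 2, 0, 0, 4, 4], read_number=4, read_length=4)
```

In this toy case 60% of positions are covered while about 80% would be expected, the kind of gap the observed/expected comparison is meant to expose.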
A dedicated dairy database
• Based on organisms known to be in dairy products
• Database enrichment: sequencing and assembly of new species isolated from dairy products - 150 bacterial species & 15 filamentous fungi and yeasts
• 4000 genomes, manually selected
• Work in progress:
• Use text mining to:
• Identify dairy species in the literature
• Identify habitat of species found in metagenomics (for example: sea for salt bacteria)
• Annotation enrichment: genes of technological interest
(Almeida et al. 2014 BMC Genomics)
Collab. C. Nedellec team, MaIAGE
Web interface & server
Quentin Cavaillé, Thibaut Guirimand, Sandra Dérozier, Pierre Renault, Valentin Loux
• User-friendly interface
• Public/private genomes and samples
• Personalized analyses
Tchapalo ecosystem
Racha ZAARIR
• Tchapalo: traditional beer in Côte d’Ivoire
• Mean production: 38,000 t/year
• Daily family consumption
• Income-generating economic activity
• Production process:
• Sorghum malt goes through a double fermentation:
• Natural lactic fermentation => sour wort
• Alcoholic fermentation => Tchapalo
Tchapalo ecosystem analysis
[Figure: metagenomic analysis and microbiology analysis panels; values shown: 72.3%, 25.1%, 80.2%, 15.9%]
Tchapalo ecosystem: abundant species
genome                                              % CDS covered  meanCoverage  % coverage  Expected % coverage
Lactobacillus fermentum S6                          100            54.979        99.215      100
Lactobacillus delbrueckii subsp. lactis KCCM 34717  95.503         150.326       91.717      100
Lactobacillus delbrueckii subsp. jakobsenii         99.669         164.759       99.119      100
The strain Lactobacillus fermentum S6 is very close to the strain of the ecosystem
Lactobacillus delbrueckii subsp. jakobsenii is closer to the strain of the ecosystem than Lactobacillus delbrueckii subsp. lactis KCCM 34717
Tchapalo ecosystem: low-abundance species
genome                              % CDS covered  meanCoverage  % coverage  Expected % coverage  # reads
Saccharomyces cerevisiae YJM326     90.727         0.094         8.145       8.908                21405
Pediococcus acidilactici DSM 20284  81.706         0.577         9.418       46.005               28706
The strain Saccharomyces cerevisiae YJM326 is very close to the strain of the ecosystem
Pediococcus acidilactici DSM 20284 is absent from the ecosystem (reads come from other Lactobacillaceae)
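The reasoning behind these two calls can be sketched by comparing the observed % coverage with the expected % coverage. The function names and the `min_ratio` cut-off of 0.8 are hypothetical and chosen only for illustration; the slides do not state a threshold.

```python
def coverage_ratio(observed_pct: float, expected_pct: float) -> float:
    """Ratio of observed to expected % of positions covered by reads."""
    if expected_pct <= 0:
        raise ValueError("expected_pct must be positive")
    return observed_pct / expected_pct


def looks_present(observed_pct: float, expected_pct: float, min_ratio: float = 0.8) -> bool:
    """Flag a genome as plausibly present when reads spread out as expected.

    min_ratio is a hypothetical cut-off, not a value from the slides.
    """
    return coverage_ratio(observed_pct, expected_pct) >= min_ratio


# Values taken from the table above
yeast_present = looks_present(8.145, 8.908)         # Saccharomyces cerevisiae YJM326
pediococcus_present = looks_present(9.418, 46.005)  # Pediococcus acidilactici DSM 20284
pediococcus_ratio = coverage_ratio(9.418, 46.005)
```

For S. cerevisiae the observed coverage is close to the expectation, so the few reads are spread over the genome. For P. acidilactici the ratio is about 0.2: the reads concentrate on a small part of the genome, which is consistent with reads from related species aligning on conserved regions.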
Conclusion
• Will be publicly available for research purposes
• An account on the INRA Migale platform is required
• Software and database development is still ongoing
Reference genomes database (GenBank) => Reference creation => Genome indexes + Gene annotations (GFF)
Metagenome (FASTQ) + Genome indexes => Alignment (bowtie) => Reads alignment (BAM)
Reads alignment (BAM) + Gene annotations (GFF) => Summary (Samtools – Bedtools) => Summary for each genome (CSV) + Reads for each gene (GFF)
Reference genome database
Metagenomic software
Web interface
Providing a user-friendly tool for metagenomic analysis
Perspectives
Genome pre-selection using a faster method (k-mer or Burrows–Wheeler transform) to speed up computation
Allow comparison of metagenome analyses
Apply it to the MetaPDOCheese project (next slide)
Application to other ecosystems with enough reference genomes (for example: fermented food, animal digestive ecosystems…)
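The first perspective, genome pre-selection with a faster k-mer method, could look like the sketch below. It is only an illustration of the general idea, not the planned implementation: the function names, the exact-matching set approach and the `min_fraction` cut-off are assumptions, and a real tool would use a dedicated k-mer index rather than Python sets.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(sequence: str, k: int) -> set:
    """Return the set of canonical (strand-independent) k-mers of a sequence."""
    sequence = sequence.upper()
    result = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous bases such as N
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        result.add(min(kmer, reverse_complement))
    return result


def preselect_genomes(reads, genomes, k=21, min_fraction=0.5):
    """Keep genomes whose k-mers are sufficiently represented in the reads.

    genomes maps a genome name to its sequence. min_fraction is a hypothetical
    cut-off on the fraction of genome k-mers seen in the reads; it would need
    to be lowered for low-abundance species.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= canonical_kmers(read, k)
    scores = {}
    for name, sequence in genomes.items():
        genome_kmers = canonical_kmers(sequence, k)
        if genome_kmers:
            scores[name] = len(genome_kmers & read_kmers) / len(genome_kmers)
    kept = [name for name, score in scores.items() if score >= min_fraction]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


toy_genomes = {
    "A": "ACGTTGCAAGGCTTAACCGGTTAGC",
    "B": "TTTTTTTTTTTTGGGGGGGGGGGG",
}
toy_reads = ["ACGTTGCAAGGC", "GCTTAACCGGTTAGC"]
selected = preselect_genomes(toy_reads, toy_genomes, k=5)
```

Only the pre-selected genomes would then go through the slower alignment step.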
MetaPDO Cheese Project
INRA: MaIAGE (S. Dérozier, V. Loux, M. Mariadassou, C. Nedellec, Q. Cavaillé), Micalis (P. Renault, T. Guirimand, B. Dridi, C. Pauvert), GMPA (F. Irlinger), URF (C. Delbès), CNIEL
Compare ecosystems of the same PDO area
Follow the ecosystem over the time scale of cheesemaking
What are the structural and functional diversities of cheese ecosystems?
What are the evolutionary mechanisms of microbial populations?
• 44 Protected Designation of Origin French cheeses
• 1200 samples: 16S & ITS sequencing
• Some samples with shotgun sequencing
• Sequencing of 100 new genomes
Thanks to
StatInfOmics and Bibliome teams
Migale platform
Robert Bossy
Quentin Cavaillé
Estelle Chaix
Hélène Chiapello
Louise Deleger
Sandra Dérozier
Valentin Loux
Mahendra Mariadassou
Claire Nédellec
Pierre Nicolas
Sophie Schbath
Micalis
Pierre Renault
Charlie Pauvert
Thibaut Guirimand
Bédis Dridi
Racha Zaarir
[Plot axes: % ID; Pos covered 100 nt / Pos covered 35 nt]
[Plot axes: % ID; Nb Reads 100 nt / Nb Reads 35 nt]
Cheese ecosystems
From: Irlinger et al. FEMS Microbiol Lett. 2014
Challenges of taxonomic assignment (or not? also challenges for functions?)
We don’t have reference genomes for each strain of the ecosystem
Some genera with many reference genomes, others without a reference genome
Impossible to sequence every strain (non-cultivable species, cost of DNA extraction, sequencing and storage…)
Computational challenge: impossible to compare reads to every sequenced genome
November 2017: 124,481 prokaryotic genomes
A metagenome: millions of reads per sample
[Figure: tree of life, reference genomes, ecosystem strains]
GeDI method
Sequencing bias & repeated regions
Heterogeneous genome coverage
Artefact
Alignment on close genome
Very close strain – high abundance
Strains present in different proportions
[Plot axes: % coverage; gene position / genome position]
Knowledge of cheese microorganisms
Inoculated microorganisms
House microbiota
Starters
Microorganisms from: animal milk,
water flows, air flows
Microorganisms from salt
Ripening cultures
Microorganisms from shelves, cellar
Not completely known
Defined starter cultures
Undefined complex starters
Not completely known
more vulnerable to bacteriophage attack
Known
“domesticated cultures”