![Page 1: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/1.jpg)
Analysis of high-throughput sequencing data using Galaxy
(ChIP-seq and RNA-seq)
Denis Puthier, Claire Rioualen & Jacques van Helden
Aix-Marseille Univ, INSERM, TAGC lab, Marseille, France
Talleres Internacionales de Bioinformática (TIB), Cuernavaca, 2017
http://congresos.nnb.unam.mx/TIB2017/
1
![Page 2: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/2.jpg)
Goals of the workshop
● Target audience
  ○ Biologists involved in NGS projects.
  ○ No prior experience of NGS bioinformatics.
● Approach
  ○ Practice-driven.
  ○ Elements of theory interspersed in the tutorials.
● Scope
  ○ Study cases from ChIP-seq and RNA-seq.
  ○ However, many of the concepts and tools also apply to other applications.
● Software environment
  ○ Mainly Galaxy
  ○ Visualisation with IGV
  ○ Web sites for specific resources.
  ○ R under the convivial RStudio environment? To be discussed ...
2
![Page 3: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/3.jpg)
Schedule
● Days 1-2: ChIP-seq analysis
  ○ NGS technologies
  ○ ChIP-seq analysis: intro
  ○ Short read file formats
  ○ Quality control of the reads
  ○ Trimming
  ○ Read mapping
  ○ Data visualization (IGV)
  ○ Coverage normalisation
  ○ Peak calling
  ○ Peak annotation
  ○ Motif analysis
● Days 3-4: RNA-seq
  ○ RNA-seq method intro
  ○ Preprocessing (quality control, trimming)
  ○ Splice-aware alignment
  ○ Transcript discovery
  ○ Data visualization
  ○ Quantification
  ○ Differential analysis
  ○ Functional annotation
  ○ Motif analysis (continued)
● Day 5: tutorship and/or R?
  ○ Customized analytic flow charts + playing with your own data.
  ○ Optional: first steps with R.
3
![Page 4: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/4.jpg)
Presentation of the teachers
● Denis Puthier
  ○ Bioinformatics analysis of high-throughput data.
  ○ Teaching domains: bioinformatics, genomics, programming, statistics.
● Claire Rioualen
  ○ Bioinformatics analysis of high-throughput data.
  ○ Development of workflows for NGS data (ChIP-seq, RNA-seq).
● Jacques van Helden
  ○ Development of bioinformatics tools for the analysis of regulatory sequences and networks (http://rsat.eu/).
  ○ Teaching domains: bioinformatics, statistics, genomics.
4
![Page 5: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/5.jpg)
Presentation of the participants
Participants introduce themselves in 4 sentences.
1. Name and affiliation
2. Background in biology/bioinformatics
3. Research project involving NGS / interest in NGS
4. Prior experience with NGS bioinformatics?
5
![Page 6: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/6.jpg)
Resources used during the training

| Resource | Description | URL | To install locally |
|---|---|---|---|
| TIB 2017 homepage | | http://congresos.nnb.unam.mx/ | |
| TIB 2017 Galaxy | | http://congresos.nnb.unam.mx/TIB2017/galaxy/ | |
| Galaxy server | Galaxy server for the TIB2017 training | http://132.248.220.36/ | |
| IGV | Integrative Genomics Viewer | http://software.broadinstitute.org/software/igv/ | X |
| R | R statistical package | https://www.r-project.org/ | X |
| RStudio | An environment to manage R programming and projects | https://www.rstudio.com/ | X |
| GEO | Gene Expression Omnibus | https://www.ncbi.nlm.nih.gov/geo/ | |
| ArrayExpress | Gene expression database | https://www.ebi.ac.uk/arrayexpress/ | |
| RSAT | Regulatory Sequence Analysis Tools | http://rsat.eu/ | |

6
![Page 7: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/7.jpg)
High-throughput sequencing
7
![Page 8: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/8.jpg)
Breakthrough in DNA sequencing
● 1977-1990, 500 bp, manual analysis
● 1990-2000, 500 bp, computer-assisted analysis (1D capillary sequencers)
● 2005-2014, 20-1000 bp (2D sequencers, “Next Generation Sequencing”)
8
![Page 9: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/9.jpg)
Cost per megabase (1 million bases)
9
![Page 10: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/10.jpg)
Cost per human genome
10
![Page 11: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/11.jpg)
NGS: a simplified view
NB: most of the methods rely on fragmented DNA/RNA material.
11
![Page 12: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/12.jpg)
Important things to consider
● Sequencer throughput
  ○ Some applications require good coverage
    ■ High dynamic range, sensitivity
    ■ e.g. transcriptome analysis, ChIP-seq
  ○ May offer multiplexing
● Read length produced
  ○ May be important to resolve low-complexity regions
  ○ e.g. a word of size 20 is more ambiguous than a word of size 500
12
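The link between read length and ambiguity can be illustrated with a minimal sketch. The helper `ambiguous_fraction` and the toy sequence are hypothetical, introduced only for this example: the function reports the fraction of words of length k that occur more than once in a sequence, which is a rough proxy for how often a read of that length could not be placed uniquely.

```python
from collections import Counter


def ambiguous_fraction(sequence: str, k: int) -> float:
    """Fraction of k-letter words in `sequence` that occur more than once."""
    n_words = len(sequence) - k + 1
    if n_words <= 0:
        raise ValueError("k must not exceed the sequence length")
    # Enumerate every overlapping word of length k.
    words = [sequence[i:i + k] for i in range(n_words)]
    counts = Counter(words)
    repeated = sum(1 for word in words if counts[word] > 1)
    return repeated / n_words


# Toy sequence (made up): a 4-letter unit repeated twice.
toy = "ACGTACGT"
print(ambiguous_fraction(toy, 4))  # ACGT occurs twice among the 5 words
print(ambiguous_fraction(toy, 5))  # every 5-letter word is unique
```

On a real genome the same idea applies at a much larger scale: longer words are less likely to fall entirely inside a repeated region.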
![Page 13: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/13.jpg)
Important things to consider (continued)
● Fidelity
  ○ Some sequencers may be error-prone
  ○ Fidelity may be important for variant calling (...)
● With current technologies:
  ○ The longer the reads (i.e. several kb), the lower the fidelity and coverage
13
![Page 14: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/14.jpg)
Sequencing is continuously evolving
● Technologies are subject to rapid changes!
● Of the technologies in this 2011 table, only a few survived in 2016.
14
![Page 15: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/15.jpg)
Illumina sequencers
15
![Page 16: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/16.jpg)
Illumina sequencers
16
![Page 17: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/17.jpg)
https://www.illumina.com/systems/sequencing.html
NextSeq 500 Illumina sequencer
17
![Page 18: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/18.jpg)
HiSeq X 10: a factory-scale sequencer
● Illumina
● A set of 10 sequencers.
  ○ Each producing 1.8 terabases / 3 days
● 18,000 genomes / year
  ○ Factory-scale sequencing technology
18
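The order of magnitude of the genomes-per-year figure can be checked with a back-of-the-envelope sketch. The throughput numbers come from the slide; the genome size (3.2 Gb), the 30x coverage and the absence of downtime are assumptions made for this example, and `genomes_per_year` is a hypothetical helper.

```python
def genomes_per_year(n_sequencers, terabases_per_run, days_per_run,
                     genome_size_gb=3.2, coverage=30):
    """Upper bound on genomes sequenced per year, assuming no downtime.

    genome_size_gb and coverage are assumptions, not values from the slide.
    """
    runs_per_year = 365 / days_per_run
    bases_per_year = n_sequencers * terabases_per_run * 1e12 * runs_per_year
    bases_per_genome = genome_size_gb * 1e9 * coverage
    return bases_per_year / bases_per_genome


estimate = genomes_per_year(n_sequencers=10, terabases_per_run=1.8, days_per_run=3)
print(f"about {estimate:,.0f} genomes per year at most")
```

Under these assumptions the estimate is of the same order of magnitude as the 18,000 genomes per year quoted above; a real installation would presumably fall below this bound because of maintenance and failed runs.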
![Page 19: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/19.jpg)
Some computing issues
● 18,000 / year ~ 340 / week
● 30-50 TB of storage / week
  ○ Cost of long-term storage?
● 518 core hours / genome
● 175,000 core hours per week
But the $1000 genome is coming true...
http://glennklockwood.blogspot.nl/
19
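The weekly figures on the computing slide follow from simple arithmetic, sketched below. The two input values are taken from the slide; the 52-week year and the conversion to cores running around the clock are added here for illustration. The slide rounds the intermediate results (to about 340 genomes and 175,000 core hours per week), so the printed values differ slightly.

```python
GENOMES_PER_YEAR = 18_000      # from the slide
CORE_HOURS_PER_GENOME = 518    # from the slide
WEEKS_PER_YEAR = 52            # assumption: no idle weeks

genomes_per_week = GENOMES_PER_YEAR / WEEKS_PER_YEAR
core_hours_per_week = genomes_per_week * CORE_HOURS_PER_GENOME
# Number of cores that must run around the clock (168 hours per week) to keep up.
cores_needed = core_hours_per_week / (7 * 24)

print(f"{genomes_per_week:.0f} genomes / week")
print(f"{core_hours_per_week:,.0f} core hours / week")
print(f"{cores_needed:.0f} cores busy full time")
```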
![Page 20: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/20.jpg)
Is the $1000 genome for real?
20
![Page 21: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/21.jpg)
Genetic variation ongoing project
21
![Page 22: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/22.jpg)
An overview of Illumina technology: sequencing by synthesis
22
![Page 23: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/23.jpg)
http://www.illumina.com/company/video-hub/HMyCqWhwB8E.html
Illumina sequencing: general principle
23
![Page 24: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/24.jpg)
Illumina sequencing: general principle
24
![Page 25: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/25.jpg)
Starting with a fragment
● Terminology:
  ○ “Fragment”: a piece of DNA
  ○ “Read”: the sequence(s) associated with this fragment
25
![Page 26: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/26.jpg)
First-strand synthesis
Figure panels: annealing; reverse-strand synthesis; denaturation; fragment released.
26
![Page 27: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/27.jpg)
Bridge-PCR
Figure panels: annealing; reverse-strand synthesis; denaturation.
27
![Page 28: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/28.jpg)
Bridge-PCR cycles
Figure panels: annealing; 2 copies; N copies (cluster, also called polymerase colonies or “polonies”).
28
![Page 29: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/29.jpg)
A population of DNA colonies
29
![Page 30: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/30.jpg)
Getting single-stranded colonies
Figure panel: reverse-strand cleavage.
30
![Page 31: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/31.jpg)
First-end sequencing
Figure panels: primer annealing; synthesis/extension, with the color recorded at each step.
This is a parallelized process!
31
![Page 32: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/32.jpg)
Barcode analysis
Figure panels: release of the read; read the barcode (for subsequent de-multiplexing).
32
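De-multiplexing, mentioned on the barcode slide, can be sketched as follows. This is a simplified illustration and not the procedure used by any particular sequencer software: it assumes exact barcode matches, and the barcodes, sample names and reads are made up.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact barcode match.

    reads: iterable of (barcode, sequence) pairs.
    barcode_to_sample: dict mapping a barcode to a sample name.
    Reads with an unknown barcode go to the "undetermined" bin.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Made-up barcodes, sample names and reads, for illustration only.
sample_sheet = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = [("ACGT", "GATTACA"), ("TTAG", "CCGGAAT"),
         ("ACGT", "TTTGGCA"), ("GGGG", "ACACACA")]
result = demultiplex(reads, sample_sheet)
print(result)
```

Production tools usually tolerate a small number of sequencing errors in the barcode; the exact-match lookup here is kept deliberately simple.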
![Page 33: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/33.jpg)
Paired-end sequencing
Figure panels: bridged annealing; reverse-strand synthesis; cleavage of the forward strand.
33
![Page 34: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/34.jpg)
Paired-end sequencing
Figure panels: bridged annealing; parallel sequencing; denaturation.
34
![Page 35: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/35.jpg)
Single-end vs paired-end
● Paired-end sequencing: sequence both ends of a fragment
  ○ Facilitates alignment
  ○ Facilitates gene fusion detection
  ○ Better for reconstructing transcript models from RNA-seq
35
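The relation between a fragment and its two reads can be made concrete with a small sketch. It assumes error-free reads of fixed length; the fragment is made up, and `paired_end_reads` is a hypothetical helper. Single-end sequencing would yield only the first of the two reads.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int):
    """Return (read_1, read_2) for a fragment.

    read_1 is the start of the forward strand; read_2 is the start of the
    reverse strand, i.e. the reverse complement of the fragment's end.
    """
    if read_length > len(fragment):
        raise ValueError("read_length must not exceed the fragment length")
    read_1 = fragment[:read_length]
    read_2 = reverse_complement(fragment[-read_length:])
    return read_1, read_2


fragment = "AACCGGTTAGCTAGGA"  # made-up fragment
read_1, read_2 = paired_end_reads(fragment, 5)
print(read_1, read_2)
```

Because both reads come from the same fragment, an aligner can use their expected distance and orientation as an additional constraint, which is one reason paired-end data facilitates alignment.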
![Page 36: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/36.jpg)
Other technologies
36
![Page 37: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/37.jpg)
The MinION portable sequencer
● Alpha-hemolysin
  ○ A nanopore from bacteria that causes lysis of red blood cells
● Molecules that enter the nanopore cause a characteristic disruption of the current.
● “Potentially offers read lengths of tens of kilobases (kb) limited only by the length of DNA molecules presented to it.”
● ~1 Gb to 2 Gb of sequence per MinION.
● Detects DNA modifications.
https://nanoporetech.com/science-technology/how-it-works
37
![Page 38: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/38.jpg)
Example application of MinION
38
![Page 39: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/39.jpg)
And now the SmidgION...
39
![Page 40: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/40.jpg)
Single-molecule real-time (SMRT) sequencing from Pacific Biosciences (PacBio)
● Zero-mode waveguides (ZMW)
  ○ Each ZMW well is several nanometres in diameter
  ○ The size of each well does not allow for light propagation
  ○ The fluorophores bound to bases can only be visualized through the glass substrate in the bottom-most portion of the well, a volume in the zeptolitre range
  ○ Polymerase is fixed to the bottom of the well.
  ○ dNTP incorporation on each single-molecule template per well is continuously analyzed by a laser
  ○ The polymerase cleaves the dNTP-bound fluorophore during incorporation, allowing it to diffuse away
  ○ High error rate, high cost per base
40
![Page 41: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/41.jpg)
Applications of high-throughput sequencing
41
![Page 42: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/42.jpg)
High-throughput sequencing: so many applications...
http://tinyurl.com/znrb9jc
42
![Page 43: sequencing data using Galaxy Analysis of high …pedagogix-tagc.univ-mrs.fr/courses/ngs_galaxy/pdf_files/...Analysis of high-throughput sequencing data using Galaxy (ChIP-seq and RNA-seq)](https://reader030.vdocuments.us/reader030/viewer/2022041020/5ecfbc9585fee802e977941d/html5/thumbnails/43.jpg)
Thank you
43