Lecture: Bioinformatics
ENS Saclay, 2018
Lecture Overview
Mathias Weller (mathias.weller@u-pem.fr)
Laurent Bulteau (laurent.bulteau@u-pem.fr)
- Introduction
  - Molecular Biology
  - Sequencing
  - Assembly
  - Scaffolding
- Phylogenetics
. . .
- Genome Rearrangements
. . .
- Scaffold Filling
Molecular Biology Basics
DNA
- double strand
- nucleotides paired: A–T, C–G
- inside nucleus (eukaryotes)
Polymerase
single strand → double strand
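The base pairing above (A–T, C–G) fixes one strand once the other is known. A minimal Python sketch of that rule follows; the function names and the example strand are illustrative assumptions, not part of the lecture.

```python
# Pairing rule from the slide: A pairs with T, C pairs with G.
_PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base partner strand, read in the same direction."""
    return strand.translate(_PAIRING)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction (reversed)."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    example = "CCTGGACG"  # hypothetical example strand
    print(complement(example))
    print(reverse_complement(example))
```

The reversed variant matters because the two strands of the double helix are read in opposite directions.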
Molecular Biology Basics
Chromosomes
- haploid = 1 set of chromosomes
- diploid = 2 sets of chromosomes
(usually one from each parent)
- prokaryotes: one circular chromosome, haploid
- eukaryotes: set of chromosomes, diploid
[Figure: example organisms with chromosome counts 44, 90, 16, 46, 46]
Mutation
Single-Nucleotide Polymorphism
- DNA damage
- caused by radioactivity, UV light, . . .
- insertion, deletion, substitution
Replication Error
- DNA rearrangement
- caused by errors in Meiosis/Mitosis
- duplication, deletion (loss), translocation, inversion, crossover
Example (insertion): the two sequences differ by one inserted G
..AATCGCTAA..
..AATCCTAA..
Example (deletion): the two sequences differ by one deleted G
..AATCCTAA..
..AATCGCTAA..
Example (substitution): the two sequences differ at one position (G vs. A)
..AATCGCTAA..
..AATCACTAA..
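The three point mutations named on the Mutation slide (insertion, deletion, substitution) can be sketched as string operations. This is a toy model: the function names and 0-based positions are assumptions made for illustration, and the sequences are the fragments shown on the slides.

```python
def insert_base(seq: str, pos: int, base: str) -> str:
    """Insert `base` before 0-based position `pos`."""
    return seq[:pos] + base + seq[pos:]


def delete_base(seq: str, pos: int) -> str:
    """Remove the base at 0-based position `pos`."""
    return seq[:pos] + seq[pos + 1:]


def substitute_base(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position `pos` by `base`."""
    return seq[:pos] + base + seq[pos + 1:]


if __name__ == "__main__":
    print(insert_base("AATCCTAA", 4, "G"))       # AATCGCTAA
    print(delete_base("AATCGCTAA", 4))           # AATCCTAA
    print(substitute_base("AATCGCTAA", 4, "A"))  # AATCACTAA
```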
mRNA, Introns, Exons, Genes
mRNA
- single strand
- transported outside nucleus
- translated into actual proteins
- Thymine (T) → Uracil (U)
Introns & Exons
parts of the transcribed sequence are cut out when forming mRNA (“splicing”)
- removed → “intron”
- not removed → “exon”
Gene
Gene = START ... STOP (including introns) (10^3-10^5 bp)
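Splicing and the T → U replacement can be sketched on strings as follows. The gene sequence, the exon coordinates, and the function name are hypothetical, and the model simply copies the exons of the given strand; it is not a description of the actual cellular machinery.

```python
def splice_and_transcribe(gene: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (0-based, end-exclusive) and replace T by U."""
    spliced = "".join(gene[start:end] for start, end in exons)
    return spliced.replace("T", "U")


if __name__ == "__main__":
    # Hypothetical toy gene: exon, intron, exon.
    exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GATTGA"
    gene = exon1 + intron + exon2
    exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]
    print(splice_and_transcribe(gene, exons))  # AUGGCCGAUUGA
```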
Sanger Sequencing [Sanger et al. ’77]
CCTGGACGGGTCAGACATGACAGTGGCCCCAAGATTCACAAGATCGTATCTCAATACAGTAAACGAGCAAT
GGACCTGCCCAGTCTGTACTGTCACCGGGGTTCTAAGTGTTCTAGCATAGAGTTATGTCATTTGCTCGTTA
GGA*
GGACCTGCCCA*
GGACCTGCCCAGTCTGTA*
Sanger Sequencing
1. split helix of thousands of copies (“amplified genome”)
2. add polymerase & floating bases: A C G T
3. add a special base: A* (polymerase cannot extend, colored)
4. stir & let polymerase act
5. measure the length of each fragment
→ each length is the position of a T in the template
Problem
- frequency of longer reads decreases drastically
- length-estimate unreliable after a couple hundred bp
  → chop DNA into pieces and read those
- repeated bases unreliable
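The chain-termination idea in the steps above can be sketched as a small simulation. This is a simplified model under stated assumptions: every possible terminated fragment is produced exactly once, strand direction is ignored (as in the slide's picture), and the function names are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def terminated_fragments(template: str, terminator: str = "A") -> list[str]:
    """Fragments obtained when synthesis stops at each occurrence of the special base."""
    synthesized = "".join(COMPLEMENT[base] for base in template)
    return [
        synthesized[: i + 1] + "*"
        for i, base in enumerate(synthesized)
        if base == terminator
    ]


def template_positions(fragments: list[str]) -> list[int]:
    """1-based template positions read off from the fragment lengths."""
    return sorted(len(fragment) - 1 for fragment in fragments)  # drop the '*'


if __name__ == "__main__":
    template = "CCTGGACGGGTCAGACAT"  # first 18 bases of the slide's template
    fragments = terminated_fragments(template)
    print(fragments)
    print(template_positions(fragments))
```

With the terminator A*, each fragment length is a position of T in the template; repeating the experiment with C*, G*, and T* would cover the remaining bases.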
Next Generation Sequencing
[Figure: flow cell with anchor places; fragment TGGTACTCA......ACCTCTCAG and its reverse complement CTGAGAGGT......TGAGTACCA]
1. chop DNA into smaller pieces (approximate size known)
2. add anchors to each end of each piece
3. “flow cell” containing anchor places
4. strand anchors its two ends to two anchor places
5. polymerase completes the strand into double-strand with “tagged” bases → light emission
6. double strand is cut into single strands
7. rinse, repeat (last 3 steps) until flow cell is “full”
→ Paired-End reads (distance between reads = “insert size”)
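A paired-end read can be sketched as two short reads taken from the two ends of one fragment. In the sketch below, the fragment's middle part, the read length, and the function names are hypothetical; the "insert size" follows the slide's definition (distance between the two reads).

```python
_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base (A-T, C-G) and reverse the result."""
    return seq.translate(_PAIRING)[::-1]


def paired_end_reads(fragment: str, read_len: int) -> tuple[str, str, int]:
    """Return (read from the left end, read from the right end, insert size)."""
    if len(fragment) < 2 * read_len:
        raise ValueError("fragment too short for two non-overlapping reads")
    read1 = fragment[:read_len]
    # The second read comes from the opposite strand, hence the reverse complement.
    read2 = reverse_complement(fragment[-read_len:])
    insert_size = len(fragment) - 2 * read_len
    return read1, read2, insert_size


if __name__ == "__main__":
    # The ends are taken from the slide; the middle part is made up.
    fragment = "TGGTACTCA" + "GATTACAGATTACA" + "ACCTCTCAG"
    print(paired_end_reads(fragment, read_len=9))
```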
Third-Generation Sequencing
Single Molecule Real Time Sequencing
1. fix a polymerase enzyme under a microscope
2. attach fluorescent molecule to each nucleotide
3. polymerase clips off fluorescent molecule when attaching a base
4. observe change in fluorescence → identify base
Nanopores
1. “pore” of diameter ≈ 1 nm
2. only unfolded DNA molecule can pass through
3. detect fluctuation in electric field as bases pass through pore
Conclusion: Sequencing
Sanger
cost: 20ct/kbp
reads: 10²-10³ bp
errors: 0.01-1%
2nd Gen
cost: 4ct/kbp
reads: 50-500 PE
errors: 0.1-1%
3rd Gen
cost: 0.4ct/kbp
reads: 10⁴-10⁶
errors: 5-10%
10 /27
![Page 45: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/45.jpg)
Sequence Assembly: Overview
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTT CGACACTCCTTGGGTTTT CTAGGCCATTGATTGCGGGTC
ACTTCGC GGTTCTCT GGTCCAGGTGCTGTCAACGACA
TCGCTAGGGTTCTCTAACGA TTTACGTCGCGG CGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGC CGACACTCCTTGGGTTTT GGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGGTTCTCTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 2: parts of the sequence might not be covered by reads
→ sequence with “high coverage”
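As a rough illustration of why high coverage helps, the sketch below computes the average coverage c = N · L / G and the expected fraction of uncovered positions. It assumes an idealized model that is not on the slide: reads land uniformly at random, so the number of reads over a position is approximately Poisson with mean c. The function names and the example numbers are hypothetical.

```python
import math

def expected_coverage(num_reads, read_length, genome_length):
    """Average number of reads covering a position: c = N * L / G."""
    return num_reads * read_length / genome_length

def expected_uncovered_fraction(coverage):
    """Poisson approximation (assumption): P(no read covers a position) = exp(-c)."""
    return math.exp(-coverage)

# Hypothetical example: 300,000 reads of 100 bp from a 1 Mbp sequence.
c = expected_coverage(num_reads=300_000, read_length=100, genome_length=1_000_000)
print(f"coverage: {c:.1f}x, expected uncovered fraction: {expected_uncovered_fraction(c):.2e}")
```

Under this model the uncovered fraction shrinks exponentially with coverage, which is the motivation for sequencing each position many times over.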
11 / 27
![Page 49: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/49.jpg)
Sequence Assembly: Overview
Goal: reconstruct sequence
Problem 1: only have (small) reads
Idea: overlap reads to form complete sequence
Problem 3: Shortest Common Superstring is NP-hard
Heuristic Assembly:
- Overlap-Layout-Consensus
- DeBruijn-Graph
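Since the exact Shortest Common Superstring is NP-hard, assemblers fall back on heuristics. The sketch below is a simple greedy merge, shown only to make the idea of a heuristic concrete: it repeatedly merges the two reads with the largest exact overlap. It is not the Overlap-Layout-Consensus or De Bruijn method named above, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    pool = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    # Reads fully contained in another read add nothing to the superstring.
    pool = [r for r in pool if not any(r != s and r in s for s in pool)]
    while len(pool) > 1:
        best_len, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_len:]
        pool = [r for idx, r in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""

# Three reads taken from the left end of the example sequence on this slide.
print(greedy_superstring(["TTTGCCCCTGAACTT", "ACTTCGC", "TCGCTAGGGTTCTCTAACGA"]))
```

The greedy choice gives no optimality guarantee, and repeats in the genome can make it merge the wrong reads.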
11 / 27
![Page 50: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/50.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
10⁸ reads → 10¹⁶ read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
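The slide does not say which heuristic is used to exclude read-pairs. One common idea, sketched here purely as an assumption, is to index every k-mer and run the overlap computation only on reads that share at least one k-mer. The function name and the choice of k are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["AGGAGTC", "GAGTCCA", "TTTTTTT"]
print(candidate_pairs(reads, k=4))  # only reads 0 and 1 share a 4-mer
```

A filter of this kind cannot find overlaps shorter than k, and sequencing errors inside every shared k-mer would hide a true overlap, so k trades sensitivity against the number of candidate pairs.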
12 /27
![Page 53: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/53.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Naive Overlap
AGGAGTC + GAGTCCA: slide the second read along the first and compare at each offset (longest overlap: GAGTC)
O(#reads² · read-length) time worst-case
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
10⁸ reads → 10¹⁶ read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
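The naive overlap step on this slide can be sketched as follows: for each ordered pair of reads, try every offset of the second read along the first and keep the first (longest) exact match. This is a minimal illustration with hypothetical function names, not the lecture's own code.

```python
def longest_overlap(a, b):
    """Longest suffix of a that equals a prefix of b, found by trying every offset."""
    # start = position in a where b would begin; the smallest start gives the longest overlap
    for start in range(max(0, len(a) - len(b)), len(a)):
        length = len(a) - start
        if a[start:] == b[:length]:
            return length
    return 0

def all_overlaps(reads):
    """Overlap length for every ordered pair of distinct reads."""
    return {
        (i, j): longest_overlap(a, b)
        for i, a in enumerate(reads)
        for j, b in enumerate(reads)
        if i != j
    }

print(all_overlaps(["AGGAGTC", "GAGTCCA"]))
```

For the slide's two reads, the overlap of AGGAGTC followed by GAGTCCA is GAGTC (length 5), while the reverse order only shares the single base A.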
12 /27
![Page 63: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/63.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
example: BANANA$
[Figure: suffix tree of BANANA$; the suffixes are numbered by start position: 0 BANANA$, 1 ANANA$, 2 NANA$, 3 ANA$, 4 NA$, 5 A$]
O(read-length²) time & space
can be improved to linear time & space
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
10⁸ reads → 10¹⁶ read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
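To make the BANANA$ example concrete, here is a minimal sketch that inserts every suffix into an uncompressed trie built from nested dictionaries. This is a simplification: a real suffix tree merges chains of single-child nodes into one edge, and the linear-time constructions mentioned on the slide are considerably more involved. The quadratic cost of this version matches the O(read-length²) bound above; all names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (quadratic time and space)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # leaf marker: start position of this suffix
    return root

def contains(trie, pattern):
    """A pattern occurs in the text iff it is a prefix of some suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

def suffix_start(trie, suffix):
    """Start position of a complete suffix (including the terminator), or None."""
    node = trie
    for ch in suffix:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(None)

trie = build_suffix_trie("BANANA$")
print(contains(trie, "NAN"), suffix_start(trie, "ANA$"))
```

The terminator $ guarantees that no suffix is a prefix of another, so every suffix ends in its own leaf.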
12 /27
![Page 64: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/64.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Suffix Trees
AGGAGTC$GAGTCCA.
Exercise Time
[Figure: generalized suffix tree of the two reads, AGGAGTC terminated by “$” and GAGTCCA terminated by “.”; each leaf is labeled with the start position of its suffix within its read]
pairwise overlaps in O(#reads · read-length + #reads²)
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
10⁸ reads → 10¹⁶ read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
![Page 74: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/74.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
AGGAGTC + GGTCTCA: inserting an A turns the prefix GGTC into GAGTC, a suffix of AGGAGTC (1 edit)
edit distance = minimum # of insertions, deletions, and substitutions
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
10⁸ reads → 10¹⁶ read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
![Page 78: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/78.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of X_i… & Y_j… (the suffixes of X = AGGAGTC and Y = GGTCTCA starting at positions i and j)
= min{ ↓ +1, → +1, ↘ + [X_i ≠ Y_j] }, where [X_i ≠ Y_j] is 1 if the bases differ and 0 otherwise
Y\X  A G G A G T C ε
G    4 3 4 4 4 5 6 7
G    5 4 3 4 3 4 5 6
T    6 5 4 3 3 3 4 5
C    6 5 4 3 2 2 3 4
T    6 5 4 3 2 1 2 3
C    6 5 4 4 3 2 1 2
A    6 5 4 3 3 2 1 1
ε    7 6 5 4 3 2 1 0
(ε = empty suffix)
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads² · read-length²)
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
10⁸ reads → 10¹⁶ read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
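The edit-distance table on this slide can be reproduced with a short sketch. It fills the table from the bottom-right corner, so that entry [r][c] is the edit distance between the suffix of Y starting at row r and the suffix of X starting at column c. The function name and the row/column layout are choices made for this illustration.

```python
def edit_distance_table(x, y):
    """table[r][c] = edit distance between the suffixes y[r:] and x[c:]."""
    rows, cols = len(y), len(x)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for c in range(cols + 1):
        table[rows][c] = cols - c  # y exhausted: the rest of x must be edited away
    for r in range(rows + 1):
        table[r][cols] = rows - r  # x exhausted: the rest of y must be edited away
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            table[r][c] = min(
                table[r + 1][c] + 1,                     # step down: skip one base of y
                table[r][c + 1] + 1,                     # step right: skip one base of x
                table[r + 1][c + 1] + (x[c] != y[r]),    # diagonal: match or substitution
            )
    return table

for row in edit_distance_table("AGGAGTC", "GGTCTCA"):
    print(*row)
```

The top-left entry is the edit distance between the two complete reads, which is 4 for AGGAGTC and GGTCTCA.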
12 /27
![Page 79: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/79.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
3 2 1 1 1 2 1
0
G
4 3 2 1 0 1 1
0
T
5 4 3 2 1 0 1
0
C
5 4 3 2 1 1 0
0
T
5 4 3 2 1 0 1
0
C
6 5 4 3 2 1 0
0
A
6 5 4 3 3 2 1
0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)
Exercise
Time
2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
![Page 80: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/80.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG
3 2 1 1 1 2
1 0
G
4 3 2 1 0 1
1 0
T
5 4 3 2 1 0
1 0
C
5 4 3 2 1 1
0 0
T
5 4 3 2 1 0
1 0
C
6 5 4 3 2 1
0 0
A 6 5 4 3 3 2 1 0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
![Page 81: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/81.jpg)
Overlap-Layout-Consensus Assembly
Overlap-Layout-Consensus assemblers
1. produce pairwise overlaps
Fuzzy Overlap – Edit Distance
dynamic programming where
[i,j] = edit distance of Xi. . . & Yj. . .
= min{ ↓ +1, → +1, → + idXi,Yj}
A G G A G T CG 3 2 1 1 1 2 1 0
G 4 3 2 1 0 1 1 0
T 5 4 3 2 1 0 1 0
C 5 4 3 2 1 1 0 0
T 5 4 3 2 1 0 1 0
C 6 5 4 3 2 1 0 0
A 6 5 4 3 3 2 1 0
7 6 5 4 3 2 1 0
modification: any suffix of GGTCTCA & prefix of AGGAGTC for free
best overlaps with k errors in O(#reads2 · read-length2)2. layout the reads according to the overlaps
3. for each position, compute consensus base
Problems
- overlap step too slow in practice:
108 reads 1016 read-pairs
heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
12 /27
modification: skipping any suffix of GGTCTCA & any prefix of AGGAGTC is free
        A  G  G  A  G  T  C
    G   3  2  1  1  1  2  1  0
    G   4  3  2  1  0  1  1  0
    T   5  4  3  2  1  0  1  0
    C   5  4  3  2  1  1  0  0
    T   5  4  3  2  1  0  1  0
    C   6  5  4  3  2  1  0  0
    A   6  5  4  3  3  2  1  0
        7  6  5  4  3  2  1  0
best overlaps with k errors in O(#reads² · read-length²)
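The recurrence and the free-overlap modification can be sketched in Python as follows. This is a minimal sketch, not the lecture's code: the function name `suffix_edit_table` and the flag `free_overlap` are made up for illustration. It fills the table from the bottom-right corner, exactly as [i,j] is defined on suffixes.

```python
def suffix_edit_table(x, y, free_overlap=False):
    """d[i][j] = edit distance between the suffixes x[i:] and y[j:].

    With free_overlap=True the last column is 0, so the unmatched tail of x
    costs nothing; reading the top row then also lets a prefix of y be skipped.
    (Name and flag are illustrative assumptions.)
    """
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):      # x exhausted: the rest of y must be inserted
        d[n][j] = m - j
    for i in range(n + 1):      # y exhausted: rest of x is deleted, or skipped for free
        d[i][m] = 0 if free_overlap else n - i
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            d[i][j] = min(
                d[i + 1][j] + 1,                   # down
                d[i][j + 1] + 1,                   # right
                d[i + 1][j + 1] + (x[i] != y[j]),  # diagonal, +1 on mismatch
            )
    return d


x, y = "GGTCTCA", "AGGAGTC"
plain = suffix_edit_table(x, y)
overlap = suffix_edit_table(x, y, free_overlap=True)
print(plain[0][0])   # edit distance of the two full strings
print(overlap[0])    # top row: cost of matching each suffix of y to a prefix of x
```

In the modified table, the smallest entries of the top row point at the best candidate overlaps; one table costs O(read-length²), which is where the second factor of the running time comes from.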
2. layout the reads according to the overlaps
Overlap Graph
reads = vertices
directed edges = overlaps
reads: GCTAGTGGCTAG, TGGCTAGGGTC, CTAGGGTCCGGA, AGGGTCCGGAATTA
sequence ACTAGTAGTAGCCT with reads: ACTAG, AGTAG, TAGTAG, TAGCCT
transitive reduction
overlap graph non-linear due to repeats
only return non-branching parts (“contigs”): ACTAGTAG & TAGCCT
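A small sketch of the overlap graph on the first four reads of the slide, using exact suffix-prefix overlaps instead of the fuzzy ones. The helper names, the minimum overlap length of 3, and the simplified reduction (only shortcuts over a two-step path are dropped, which is not a full transitive reduction in general) are assumptions made for illustration.

```python
def overlap_len(a, b, min_len=3):
    """Longest proper suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_len=3):
    """Edges (a, b) -> overlap length, over all ordered pairs of distinct reads."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                length = overlap_len(a, b, min_len)
                if length:
                    edges[(a, b)] = length
    return edges


def drop_two_step_shortcuts(edges):
    """Remove a->c whenever some b with a->b and b->c exists (simplified reduction)."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    return {
        (a, c): w
        for (a, c), w in edges.items()
        if not any(c in succ.get(b, ()) for b in succ[a] if b != c)
    }


reads = ["GCTAGTGGCTAG", "TGGCTAGGGTC", "CTAGGGTCCGGA", "AGGGTCCGGAATTA"]
edges = overlap_graph(reads)
reduced = drop_two_step_shortcuts(edges)
for (a, b), length in reduced.items():
    print(a, "->", b, "overlap", length)
```

On these reads the reduced graph is a single chain, so the layout is unambiguous; with repeats, as in the ACTAGTAGTAGCCT example, vertices get several successors and only the non-branching parts can be reported.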
3. for each position, compute consensus base
[figure: the reads stacked according to the layout, with the consensus sequences below]
consensus: TTTGCCCCTGAACTTCGCTAGGGTTCTCTAACGACACTCCTTGGGTTTTTACGTCGCGG
           CTAGGCCATTGATTGCGGGTCCAGGTGCTGTCAACGACACTCCTTGGGTTTTTAC
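The consensus step can be sketched as a column-wise majority vote, which is one common way to pick the consensus base. This sketch assumes the layout is already given as (offset, read) pairs. The reads TTTGCCCCTGAACTC and ACTTCGC appear on the slide; the third read and all offsets are made up for the example.

```python
from collections import Counter


def consensus(layout):
    """layout: list of (offset, read) pairs on a common coordinate axis."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for k, base in enumerate(read):
            columns[offset + k][base] += 1
    # majority base per column; "N" marks columns that no read covers
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


layout = [
    (0, "TTTGCCCCTGAACTC"),   # last base disagrees with the other two reads
    (8, "TGAACTTCG"),         # hypothetical read
    (11, "ACTTCGC"),
]
print(consensus(layout))
```

The disagreeing base in the first read is outvoted by the two reads that cover the same column, which is why sufficient coverage matters for this step.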
Problems
- overlap step too slow in practice:
  10^8 reads → 10^16 read-pairs
  → heuristics exclude most of the read-pairs before overlap
- fragmented genome due to repeats
DeBruijn-Graph-Based Assembly
1. chop all reads into “k-mers”
   real genomes: k = 30-50
2. build “DeBruijn graph”:
   for each k-mer add arc from left to right (k-1)-mer
3. find path using all overlaps (Eulerian walk)
   linear time with greedy
Example (k=5)
reads: ..GAACTTCGCT.., ..CCTTGG.., ..CCTTC.., ..ACTTG..
k-mers of GAACTTCGCT: GAACT, AACTT, ACTTC, CTTCG, TTCGC, TCGCT
k-mers of CCTTGG: CCTTG, CTTGG
vertices: GAAC, AACT, ACTT, CTTC, TTCG, TCGC, CGCT, CCTT, CTTG, TTGG
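The three steps can be sketched as follows. The walk is computed with Hierholzer's algorithm, a standard linear-time method named here as a stand-in for the slide's "greedy". The function names and the two short reads in the usage example are made up, and the sketch assumes the graph actually has an Eulerian walk.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """One arc (left (k-1)-mer -> right (k-1)-mer) per distinct k-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_walk(graph):
    """Hierholzer's algorithm; assumes an Eulerian walk exists."""
    out = {v: list(ws) for v, ws in graph.items()}
    indeg = defaultdict(int)
    for ws in out.values():
        for w in ws:
            indeg[w] += 1
    # start where out-degree exceeds in-degree by one, else anywhere
    start = next((v for v in out if len(out[v]) - indeg[v] == 1), next(iter(out)))
    stack, walk = [start], []
    while stack:
        v = stack[-1]
        if out.get(v):
            stack.append(out[v].pop())   # follow an unused arc
        else:
            walk.append(stack.pop())     # dead end: emit vertex
    return walk[::-1]


def spell(walk):
    """Concatenate the walk: first vertex plus the last base of every later one."""
    return walk[0] + "".join(v[-1] for v in walk[1:])


graph = de_bruijn(["GAACTTCG", "CTTCGCT"], k=5)
print(spell(eulerian_walk(graph)))

slide_graph = de_bruijn(["GAACTTCGCT", "CCTTGG", "CCTTC", "ACTTG"], k=5)
slide_vertices = set(slide_graph) | {w for ws in slide_graph.values() for w in ws}
print(sorted(slide_vertices))
```

The second call rebuilds the vertex set of the slide's k=5 example; that graph branches at ACTT and CCTT, so it has no single Eulerian walk and the walk function is not applied to it.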
Running Time
#k-mers = O(#reads · read-length)
1. O(1) per k-mer
2. O(1) per k-mer
3. O(size of graph) = O(#k-mers)
Note: edges can be weighted by #occurrences
Problems
- choose k well
  - k too small → small repeats become problems
  - k too big → miss smaller overlaps
- Eulerian walk not necessarily unique
- some paths in DeBruijn graph inconsistent with reads
- read-errors problematic
  → error-correction step before assembling
- same problem with repeats as OLC
Correcting Read Errors in Suffix Trees
Example
CAACTTACCAACTCAACAACTTACCTACTTAC
[figure: suffix tree of the example, each node labeled with a base and an occurrence count]
Idea: low freq. w/ high freq. neighbor → ignore branch
Correcting Read Errors in DBG
Idea
“faulty” k-mers occur less often than correct ones
→ build k-mer count histogram
De Bruijn: have to treat errors before building the graph
k=30 & 1% error rate → about 1/4 of the k-mers are faulty
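The 1/4 figure can be checked with a short calculation, assuming each base is wrong independently with probability p = 0.01 (the independence assumption is not stated on the slide):

```latex
\[
  \Pr[\text{a $k$-mer contains at least one error}]
  = 1 - (1-p)^k
  = 1 - 0.99^{30}
  \approx 1 - 0.74
  = 0.26 \approx \tfrac{1}{4}.
\]
```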
Correcting k-mer Errors
Example
suppose: avg. k-mer count = 10 → each k-mer occurs about 10x

read:   GCGTATTACGCGTCTGGCCT    GCGTATTACTCGTCTGGCCT
        CGTATT   8x             CGTATT   8x
        GTATTA   9x             GTATTA   9x
        TATTAC   7x             TATTAC   7x
        ATTACG  12x             ATTACT   1x
        TTACGC   9x             TTACTC   2x
        TACGCG   9x             TACTCG   2x
        ACGCGT  10x             ACTCGT   1x
        CGCGTC  11x             CTCGTC   1x
        GCGTCT  10x             TCGTCT   1x
        CGTCTG   9x             CGTCTG   9x
        GTCTGG  10x             GTCTGG  10x
        TCTGGC  10x             TCTGGC  10x
        CTGGCC  11x             CTGGCC  11x
        TGGCCT   9x             TGGCCT   9x
Problem
now we have an idea where an error is, but how to fix it?
Idea
errors turn frequent k-mers into infrequent ones
→ correction should turn infrequent k-mers into frequent ones
→ replace infrequent k-mer by “frequent neighbor”
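One possible reading of "frequent neighbor" is a k-mer at Hamming distance 1 whose count is above a threshold. The sketch below follows that reading; the function names and the threshold of 5 are assumptions, and the counts are a subset of the k=6 example above.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer over all reads (the data behind the count histogram)."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def frequent_neighbor(kmer, counts, threshold):
    """Most frequent k-mer at Hamming distance 1 with count >= threshold, or None."""
    best = None
    for pos in range(len(kmer)):
        for base in "ACGT":
            if base == kmer[pos]:
                continue
            cand = kmer[:pos] + base + kmer[pos + 1:]
            if counts[cand] >= threshold and (best is None or counts[cand] > counts[best]):
                best = cand
    return best


# counts taken from the example table (subset)
counts = Counter({"TATTAC": 7, "ATTACG": 12, "TTACGC": 9, "ATTACT": 1, "TTACTC": 2})
for kmer in ("ATTACT", "TTACTC"):
    print(kmer, "->", frequent_neighbor(kmer, counts, threshold=5))
```

Both infrequent k-mers map back to their frequent counterparts by undoing the same single substitution, which matches the idea that one read error spoils up to k consecutive k-mers.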
Intro: Genome Scaffolding
Recall: repeats (common in DNA) make assembly ambiguous
→ end product is a set of “contiguous regions” (contigs)
Problem: “contig soup” not very useful
But: with NGS, we have paired-end information!
→ Scaffolding + Filling
Scaffolding
Goal: order & orient contigs
Idea: use pairing information on reads to “link” contigs together
Graph-Based Scaffolding
[figure: example contigs with mapped reads and the resulting weighted scaffold graph]
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
   each path corresponds to a chromosome
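Steps 2 and 3 can be illustrated with a deliberately simplified sketch. It assumes step 1 already produced, for every read pair, the two contigs its reads map to. Contig orientation is ignored, and a greedy heaviest-link-first selection stands in for the alternating-path cover named in the strategy. All names and the toy input are hypothetical.

```python
from collections import Counter


def scaffold_links(pair_hits):
    """pair_hits: (contig of read 1, contig of read 2), one entry per read pair."""
    links = Counter()
    for a, b in pair_hits:
        if a != b:                        # pairs inside one contig link nothing
            links[frozenset((a, b))] += 1
    return links


def greedy_paths(links):
    """Take heavy links first; keep degree <= 2 and avoid cycles, so paths result."""
    degree, parent, chosen = Counter(), {}, []

    def find(v):                          # union-find root, without path compression
        while parent.setdefault(v, v) != v:
            v = parent[v]
        return v

    for link, weight in sorted(links.items(), key=lambda kv: -kv[1]):
        a, b = tuple(link)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            chosen.append((a, b, weight))
    return chosen


pair_hits = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c1", "c3"), ("c1", "c1")]
links = scaffold_links(pair_hits)
chosen = greedy_paths(links)
print(chosen)
```

Here the weakest link c1-c3 is rejected because it would close a cycle, leaving the single path c1, c2, c3 as the scaffold.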
![Page 118: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/118.jpg)
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGG
GTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
20/27
![Page 119: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/119.jpg)
Graph-Based Scaffolding
AACGACAC
TCCTTGGG
TTTTTACG
TCGCGG
CTAGGCCATTGATTGCGGGTCCAGGTGCTG
GTTAATGTCCGAGCATAAAACTCTGGTTGGC
GTACTGAACTTGGGTTCCATAGGACCCAGA
AGAGCTTGACAGTAACACATTTAGGAGCACGCG
CGTCGCGG
ACTTGGGGTTTTTAC
CTACTGA
2
5
10
9
14
3
613
10
14
3
6
Strategy
1. map reads into contigs
2. pair contigs according to read-pairing (weighted)
3. cover “scaffold graph” with (heavy) alternating paths
each path corresponds to a chromosome
Scaffolding
Input: Graph G, perfect matching M, weights ω, k, σ_p, σ_c ∈ ℕ
Question: Can M be covered by ≤ σ_p alternating paths & ≤ σ_c alternating cycles of total weight ≥ k?
Exact Scaffolding
Input: Graph G, perfect matching M, weights ω, k, σ_p, σ_c ∈ ℕ
Question: Can M be covered by exactly σ_p alternating paths & exactly σ_c alternating cycles of total weight ≥ k?
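To make the two problem statements concrete, here is a small sketch of a checker for a candidate solution. It assumes (this is not stated on the slides) that a solution is given as a list of weighted non-matching edges, that only these edges contribute to the total weight, and that a matching edge left untouched counts as a path of its own. The name `check_cover` and its parameters are hypothetical.

```python
def check_cover(matching, chosen, k, sigma_p, sigma_c, exact=False):
    """Return True if `chosen` turns M into alternating paths and cycles
    that satisfy the (Exact) Scaffolding constraints.

    matching: list of (u, v) edges of the perfect matching M
    chosen:   list of (u, v, weight) non-matching edges
    """
    parent = {}
    for u, v in matching:
        parent[u] = u
        parent[v] = u  # both ends of a matching edge share one component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    m_edges = {frozenset(e) for e in matching}
    used = set()  # vertices already incident to a chosen edge
    paths, cycles, total = len(matching), 0, 0
    for u, v, w in chosen:
        if u == v or frozenset((u, v)) in m_edges or u in used or v in used:
            return False  # the result would not be alternating
        used.update((u, v))
        ru, rv = find(u), find(v)
        if ru == rv:
            cycles += 1      # joins the two ends of one path: a cycle
        else:
            parent[ru] = rv  # joins two different paths into one
        paths -= 1
        total += w
    if exact:
        counts_ok = paths == sigma_p and cycles == sigma_c
    else:
        counts_ok = paths <= sigma_p and cycles <= sigma_c
    return counts_ok and total >= k

M = [("a1", "a2"), ("b1", "b2")]
one_path = [("a2", "b1", 5)]
print(check_cover(M, one_path, k=5, sigma_p=1, sigma_c=1))
print(check_cover(M, one_path, k=5, sigma_p=1, sigma_c=1, exact=True))
```

The only difference between the two variants is the comparison on the component counts: "at most" for Scaffolding, equality for Exact Scaffolding.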
20/27
![Page 128: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/128.jpg)
[Figure, three frames: weighted example graphs of increasing size; only the numeric edge labels survived extraction.]
21/27
![Page 133: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/133.jpg)
Hardness Warm up: Hamiltonian Path
Recall: Scaffolding
Input: Graph G, perfect matching M, weights ω, k, σ_p, σ_c ∈ ℕ
Question: Can M be covered by ≤ σ_p alternating paths & ≤ σ_c alternating cycles of total weight ≥ k?
Construction
Given a directed graph D:
1. make a copy of D
2. duplicate all vertices → M (each vertex is matched to its duplicate)
3. “slide” down all arrow tips & ignore directions
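The construction can be sketched in code. This is one possible reading of the slide, assuming that each vertex v is split into a copy (v, "high") that keeps the arrow tails and a copy (v, "low") that receives the arrow tips, so that an arc (u, v) becomes the undirected edge between u's high copy and v's low copy. The encoding and the name `hamiltonian_to_scaffolding` are hypothetical.

```python
def hamiltonian_to_scaffolding(vertices, arcs):
    """Build the undirected graph G and the perfect matching M from a
    directed graph D = (vertices, arcs).

    Returns (edges, matching); every vertex of G is a pair (v, "low") or
    (v, "high").
    """
    # Step 2: duplicate every vertex; a vertex and its duplicate form a
    # matching edge.
    matching = [((v, "low"), (v, "high")) for v in vertices]
    # Step 3: slide every arrow tip down to the "low" copy and forget the
    # direction; self-loops cannot be part of a Hamiltonian path.
    edges = [((u, "high"), (v, "low")) for u, v in arcs if u != v]
    return edges, matching

edges, matching = hamiltonian_to_scaffolding(
    ["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]
)
for e in edges:
    print(e)
```

By construction every non-matching edge joins a "high" copy to a "low" copy, so G is bipartite.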
Lemma
D admits a directed Hamiltonian path ⇔ M can be covered with a single alternating path in G
“⇒”: replace each v in the Hamiltonian path by v_low → v_high
→ the result is alternating ✓ and covers M ✓
“⇐”: contract each matching edge in the covering alternating path
→ the result hits all vertices exactly once ✓ and is a valid directed path ✓
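Both directions of the proof are simple list transformations. The sketch below assumes the hypothetical encoding in which the two copies of a vertex v are written (v, "low") and (v, "high"), and that the alternating path is traversed from a low copy to a high copy; the function names are illustrative only.

```python
def expand_hamiltonian_path(path):
    """'=>': replace each vertex v of the Hamiltonian path by v_low, v_high."""
    expanded = []
    for v in path:
        expanded.append((v, "low"))
        expanded.append((v, "high"))
    return expanded

def contract_alternating_path(alternating_path):
    """'<=': contract each matching edge (v_low, v_high) back to v."""
    return [v for v, side in alternating_path if side == "low"]

ham_path = ["a", "b", "c"]
alt_path = expand_hamiltonian_path(ham_path)
print(alt_path)
print(contract_alternating_path(alt_path))
```

In the expanded path, matching edges (v_low, v_high) and non-matching edges (u_high, v_low) alternate, and every matching edge appears exactly once.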
Theorem
Scaffolding is NP-hard, even restricted to
• bipartite graphs
• (σ_p, σ_c) ∈ {(0, 1), (1, 0)} and
• ω : E → {0}
The same holds for Scaffolding and Exact Scaffolding restricted to
• supergraphs of bipartite graphs
• (σ_p, σ_c) ∈ {(0, 1), (1, 0)} and
• ω : E → {0, 1}
Corollary
Scaffolding and Exact Scaffolding with 2 weights are NP-hard in any sufficiently dense graph class.
22/27
![Page 143: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/143.jpg)
3-Approximation in Dense Graphs
σ_p = 1, σ_c = 1?
Approximate Scaffolding
1. sort all edges by weight
2. repeatedly take the heaviest edge, if possible
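The greedy procedure can be sketched as follows. The slide does not spell out what "if possible" means, so this sketch makes an assumption: an edge may be taken if neither endpoint already carries a chosen non-matching edge and, when it would close a cycle, fewer than `max_cycles` cycles have been closed so far. The path bound σ_p is not enforced here, and the function name and input format are hypothetical.

```python
def greedy_scaffold(matching, weighted_edges, max_cycles):
    """Greedy heaviest-edge-first selection of non-matching edges.

    matching:       list of (u, v) edges of the perfect matching M
    weighted_edges: list of (weight, u, v) non-matching edges
    max_cycles:     cycle budget (sigma_c)
    Returns the chosen edges as (u, v, weight) in the order taken.
    """
    parent = {}
    for u, v in matching:
        parent[u] = u
        parent[v] = u  # each matching edge starts as its own path

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    used = set()  # vertices already incident to a chosen edge
    chosen, cycles = [], 0
    # Step 1: sort all edges by weight, heaviest first.
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0], reverse=True):
        # Step 2: take the edge if that is still possible.
        if u in used or v in used:
            continue
        ru, rv = find(u), find(v)
        if ru == rv:
            # u and v are the two free ends of one path: closes a cycle
            if cycles >= max_cycles:
                continue
            cycles += 1
        else:
            parent[ru] = rv  # merges two paths
        used.update((u, v))
        chosen.append((u, v, w))
    return chosen

M = [("a1", "a2"), ("b1", "b2")]
E = [(5, "a2", "b1"), (3, "a1", "b2"), (4, "a2", "b2")]
print(greedy_scaffold(M, E, max_cycles=1))
```

Sorting dominates the running time; with O(|V|^2) edges in a dense graph it costs O(|V|^2 log |V|).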
Proof
Result S* is a valid solution ✓
Note: taking an edge forbids ≤ 3 OPT edges
→ mark the ≤ 3 OPT-edges when taking an edge e; e is heaviest among them
→ 3ω(S*) ≥ OPT
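The marking argument can be written as a short chain of inequalities. The notation is an assumption made for this illustration: S_OPT denotes the edge set of an optimal solution and mark(e) the set of at most 3 OPT-edges marked when the greedy takes e. The chain also assumes non-negative weights and that every OPT-edge is marked at least once, either because the greedy takes it or because an earlier, heavier edge forbids it.

```latex
\[
\mathrm{OPT}
  \;=\; \sum_{f \in S_{\mathrm{OPT}}} \omega(f)
  \;\le\; \sum_{e \in S^*} \; \sum_{f \in \mathrm{mark}(e)} \omega(f)
  \;\le\; \sum_{e \in S^*} 3\,\omega(e)
  \;=\; 3\,\omega(S^*).
\]
```

The second inequality uses |mark(e)| ≤ 3 together with ω(f) ≤ ω(e) for every f ∈ mark(e), since e is the heaviest among the edges it marks.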
Theorem
Scaffolding in complete (bipartite) graphs can be 3-approximated in O(|V|^2 log |V|) time.
Remark
For Exact Scaffolding, we have to keep an eye on the number of components too.
23/27
![Page 164: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/164.jpg)
2-Approximation in Dense Graphs
σ_p = 1, σ_c = 1?
Approximate Exact Scaffolding
1. compute max.-weight perfect matching S → S ∪ M is a collection of cycles
2. “fix” all but the lightest edge per cycle
3. repeatedly flip any lightest non-fix 4-cycle intersecting 2 cycles until at most σ_c + σ_p cycles remain
4. repeatedly remove the lightest non-fix cycle-edge until at most σ_c cycles remain
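Step 2 can be illustrated with a small sketch. It assumes that the matching S from Step 1 is already given (for instance from a library routine for maximum-weight matching), that S is disjoint from M, and that "lightest edge per cycle" refers to the edges of S. The name `fix_heavy_edges` and the data layout are hypothetical.

```python
def fix_heavy_edges(matching, s_edges):
    """Decompose S ∪ M into alternating cycles and fix all but the
    lightest S-edge of each cycle.

    matching: list of (u, v) edges of M
    s_edges:  list of (u, v, weight), a perfect matching disjoint from M
    Returns (cycles, fixed): the S-edges grouped per cycle, and the
    S-edges that are fixed.
    """
    m_partner, s_partner = {}, {}
    for u, v in matching:
        m_partner[u], m_partner[v] = v, u
    for u, v, w in s_edges:
        s_partner[u], s_partner[v] = (v, w), (u, w)

    seen, cycles = set(), []
    for start in m_partner:
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            y = m_partner[x]     # follow the matching edge
            seen.add(y)
            z, w = s_partner[y]  # then follow the S-edge
            cycle.append((y, z, w))
            x = z
        cycles.append(cycle)

    fixed = []
    for cycle in cycles:
        lightest = min(range(len(cycle)), key=lambda i: cycle[i][2])
        fixed.extend(e for i, e in enumerate(cycle) if i != lightest)
    return cycles, fixed

M = [("a1", "a2"), ("b1", "b2")]
S = [("a2", "b1", 5), ("b2", "a1", 3)]
cycles, fixed = fix_heavy_edges(M, S)
print(cycles)
print(fixed)
```

Because S is disjoint from M, every cycle contains at least two S-edges, so dropping the lightest one keeps at least half of the cycle's S-weight.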
24/27
![Page 165: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/165.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 166: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/166.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 167: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/167.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 168: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/168.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 169: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/169.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 170: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/170.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 171: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/171.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding1. compute max.-weight perfect
matching S S ∪M is collection of cycles
2. “fix” all but lightest edge per cycle
3. repeatedly flip any lightest non-fix
4-cycle intersecting 2 cycles
until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix
cycle-edge
until at most σc cycles remain
24/27
![Page 172: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/172.jpg)
2-Approximation in Dense Graphs
σp = 1, σc = 1?
Approximate Exact Scaffolding
1. compute max.-weight perfect matching S
   (S ∪ M is a collection of cycles)
2. “fix” all but the lightest edge per cycle
3. repeatedly flip any lightest non-fix 4-cycle intersecting 2 cycles
   until at most σc + σp cycles remain
4. repeatedly remove lightest non-fix cycle-edge
   until at most σc cycles remain
Proof
Result S∗ is a valid solution ✓
ω(S∗) ≥ ω(fix) ≥ ω(S)/2 ≥ OPT/2
(each cycle of S ∪ M contains at least two edges of S, so fixing all but the lightest keeps at least half of ω(S))
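Steps 1 and 2 and the bound ω(fix) ≥ ω(S)/2 can be checked on a toy instance. The following is only a minimal sketch, not the lecture's implementation: it finds the matching by brute force (so it does not reach the O(|V|^2.5) bound), it assumes that S may not reuse contig edges, and all function names, the three contigs and the weight formula are made up for illustration.

```python
def perfect_matchings(vertices):
    """Yield every perfect matching of `vertices` as a list of pairs (brute force)."""
    if not vertices:
        yield []
        return
    u, rest = vertices[0], vertices[1:]
    for i, v in enumerate(rest):
        for tail in perfect_matchings(rest[:i] + rest[i + 1:]):
            yield [(u, v)] + tail


def max_weight_perfect_matching(vertices, weight, contigs):
    """Step 1: heaviest perfect matching S that uses no contig edge (assumption)."""
    forbidden = {frozenset(c) for c in contigs}
    candidates = (
        m for m in perfect_matchings(vertices)
        if not any(frozenset(e) in forbidden for e in m)
    )
    return max(candidates, key=lambda m: sum(weight[frozenset(e)] for e in m))


def cycles_of(contigs, S):
    """Group the edges of S by the alternating cycle of S ∪ M they lie on."""
    via_M, via_S = {}, {}
    for u, v in contigs:
        via_M[u], via_M[v] = v, u
    for u, v in S:
        via_S[u], via_S[v] = v, u
    seen, cycles = set(), []
    for start in via_S:
        u, cycle = start, []
        while u not in seen:
            v = via_S[u]                 # follow the S-edge u-v ...
            seen.update((u, v))
            cycle.append(frozenset((u, v)))
            u = via_M[v]                 # ... then the contig edge leaving v
        if cycle:
            cycles.append(cycle)
    return cycles


def fix_all_but_lightest(cycles, weight):
    """Step 2: per cycle, keep ("fix") every S-edge except one lightest edge."""
    fixed = []
    for cycle in cycles:
        lightest = min(cycle, key=lambda e: weight[e])
        fixed.extend(e for e in cycle if e != lightest)
    return fixed


# Hypothetical toy instance: three contigs, arbitrary positive weights.
contigs = [(0, 1), (2, 3), (4, 5)]
vertices = [0, 1, 2, 3, 4, 5]
weight = {
    frozenset((u, v)): (3 * u + 5 * v) % 7 + 1
    for u in vertices for v in vertices if u < v
}

S = max_weight_perfect_matching(vertices, weight, contigs)
cycles = cycles_of(contigs, S)
fixed = fix_all_but_lightest(cycles, weight)
w_S = sum(weight[frozenset(e)] for e in S)
w_fix = sum(weight[e] for e in fixed)
print("omega(S) =", w_S, " omega(fix) =", w_fix)
```

Steps 3 and 4 (flipping 4-cycles and removing cycle-edges) are left out on purpose; they only ever touch non-fixed edges, which is why ω(S∗) ≥ ω(fix) holds for the final result.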
24/27
![Page 175: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/175.jpg)
2-Approximation in Dense Graphs
Theorem
Exact Scaffolding in complete (bipartite) graphs can be
2-approximated in O(|V|^2.5) time.
Remark
For Scaffolding, replace Step 3 by either merging cycles or
removing the lightest edge, whichever loses less weight.
24/27
![Page 177: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/177.jpg)
Scaffolding with Multiplicities
Recall: most eucaryotes are diploid!
GGTGCGAGAGAGGTCATGGATTGCAACGA
GGTGCGAGAGGCCACTCCAATTGCAACGA
[Figure labels: ×2, ×1, ×1, ×2]
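One plausible reading of the labels ×2, ×1, ×1, ×2 is that the two haplotypes share their flanking regions (multiplicity 2) and differ in the middle (multiplicity 1 each). The sketch below checks this on the two sequences from the slide; the function name `split_haplotypes` and the prefix/suffix splitting are assumptions made for illustration, not a method given in the lecture.

```python
def split_haplotypes(a, b):
    """Split two haplotypes into shared prefix, two variant parts, shared suffix."""
    limit = min(len(a), len(b))
    p = 0
    while p < limit and a[p] == b[p]:                     # longest common prefix
        p += 1
    s = 0
    while s < limit - p and a[-1 - s] == b[-1 - s]:       # longest common suffix
        s += 1
    return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]


hap1 = "GGTGCGAGAGAGGTCATGGATTGCAACGA"
hap2 = "GGTGCGAGAGGCCACTCCAATTGCAACGA"

prefix, var1, var2, suffix = split_haplotypes(hap1, hap2)

# Multiplicity of a segment = number of haplotypes that contain it.
multiplicities = [
    sum(segment in hap for hap in (hap1, hap2))
    for segment in (prefix, var1, var2, suffix)
]
print(prefix, var1, var2, suffix, multiplicities)
```

In a scaffolding instance, the shared flanks would then appear as contigs that must be traversed twice, which is what the multiplicity function m models.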
25/27
![Page 180: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/180.jpg)
Scaffolding with Multiplicities
[Figure: graph drawing with numeric labels (values not reproduced); one label reads ×2]
26/27
![Page 182: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/182.jpg)
Linearization of Solutions
Problem
no unique chromosome-configuration explaining solution
Proof
27/27
![Page 187: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/187.jpg)
Linearization of Solutions
Theorem
(G, M, m) uniquely linearizable ⇔ no “ambiguous paths”
(= alt. path of uniform multiplicity µ & each end incident to a non-contig edge of multiplicity < µ)
Proof
“⇒”: contraposition; let p = ambiguous path
[Figure: ambiguous path example; multiplicity labels not reproduced]
hence (G, M, m) is not uniquely linearizable
27/27
![Page 191: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/191.jpg)
Linearization of Solutions
Proof (continued)
“⇐”: let (G, M, m) be free of ambiguous paths
Reduction (does not decrease number of linearizations):
[Figure: reduction step; multiplicity labels not reproduced]
result is a collection of alternating paths & cycles
27/27
![Page 197: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/197.jpg)
Linearization of Solutions
⇒ must destroy ambiguous paths
Idea: remove non-matching edges at their endpoints; strategy?
Proposals
1. decide arbitrarily: misassembly
2. isolate each ambiguous path: information loss
3. cut as few ends as possible: as hard as Vertex Cover
4. cut as few multiplicities as possible: as hard as Trans. Del. (∆-free)
27/27
![Page 204: Lecture: Bioinformaticsigm.univ-mlv.fr › ~mweller › bioinfo2019 › lec1_basics.pdfLecture Overview Mathias Weller mathias.weller@u-pem.fr Laurent Bulteau laurent.bulteau@u-pem.fr-IntroductionI](https://reader033.vdocuments.us/reader033/viewer/2022060406/5f0f54b57e708231d443a0ec/html5/thumbnails/204.jpg)
Linearization of Solutions
[Figure annotation] Multiplicities: one & #non-matching adj. to contig
27/27