seriation, spectral clustering and de novo genome assemblyrecanati/slides/symbiose18.pdfspectral...

42
Seriation, Spectral Clustering and de novo genome assembly Antoine Recanati, CNRS & ENS with Alexandre d’Aspremont, Thomas Kerdreux, Thomas Br¨ uls, CNRS - ENS Paris & Genoscope. A. Recanati Symbiose Seminar, Juin 2018, 1/29

Upload: others

Post on 03-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Seriation, Spectral Clustering and de novo genomeassembly

Antoine Recanati, CNRS & ENS

with Alexandre d’Aspremont, Thomas Kerdreux, Thomas Bruls,CNRS - ENS Paris & Genoscope.

A. Recanati Symbiose Seminar, Juin 2018, 1/29

Page 2: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Seriation

The Seriation Problem.

� Pairwise similarity information Aij on n variables.

� Suppose the data has a serial structure, i.e. there is an order π such that

Aπ(i)π(j) decreases with |i− j| (R-matrix)

Recover π?

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

16020 40 60 80 100 120 140 160

20

40

60

80

100

120

140

160

Similarity matrix Input Reconstructed

A. Recanati Symbiose Seminar, Juin 2018, 2/29

Page 3: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Genome Assembly

Seriation has direct applications in (de novo) genome assembly.

� Genomes are cloned multiple times and randomly cut into shorter reads(∼ 400bp to 100kbp), which are fully sequenced.

� Reorder the reads to recover the genome.

A. Recanati Symbiose Seminar, Juin 2018, 3/29

Page 4: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Genome Assembly

Overlap Layout Consensus (OLC). Three stages.

� Compute overlap between all read pairs.

� Reorder overlap matrix to recover read order.

� Average the read values to create a consensus sequence.

The read reordering problem is a seriation problem.

A. Recanati Symbiose Seminar, Juin 2018, 4/29

Page 5: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Genome Assembly in Practice

Noise. In the noiseless case, the overlap matrix is a R-matrix. In practice. . .

� There are base calling errors in the reads, typically 2% to 20% depending onthe process.

� Entire parts of the genome are repeated, which breaks the serial structure.

Sequencing technologies

� Next generation : short reads (∼ 400bp), few errors (∼ 2%). Repeats arechallenging

� Third generation : long reads (∼ 10kbp), more errors (∼ 15%). Can resolvesome repeats, but not all of them + noise can be challenging

A. Recanati Symbiose Seminar, Juin 2018, 5/29

Page 6: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Genome Assembly in Practice

Current assemblers.

� With short accurate reads, the reordering problem is solved bycombinatorial methods using the topology of the assembly graph andadditional pairing information.

� With long noisy reads, reads are corrected before assembly (hybrid correctionor self-mapping).

� Layout and consensus not clearly separated, many heuristics . . .

� minimap+miniasm : first long raw reads straight assembler (but consensussequence is as noisy as raw reads).

A. Recanati Symbiose Seminar, Juin 2018, 6/29

Page 7: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Outline

� Introduction

� Spectral relaxation of Seriation (Spectral Ordering)

� Multi-dimensional Spectral Ordering

� Results (Application to genome assembly)

A. Recanati Symbiose Seminar, Juin 2018, 7/29

Page 8: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

2-SUM and the Graph Laplacian

The 2-SUM Combinatorial Problem.

� The 2-SUM problem is written

minπ∈P

n∑i,j=1

Aπ(i)π(j)(i− j)2

or alternatively,

minπ∈P

n∑i,j=1

Aij(π(i)− π(j))2

� optimal permutation π∗ : high values of A ⇔ low |π(i)− π(j)|, i.e., i and jlay close to each other.

A. Recanati Symbiose Seminar, Juin 2018, 8/29

Page 9: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

2-SUM and the Graph Laplacian

Graph Laplacian

� A : adjacency matrix of a undirected weighted graph (Aij > 0 iff. there is anedge between nodes i and j).

� Node i has degree di =∑j Aij. Degree matrix D = diag(A1) = diag(d).

� Laplacian matrix L = D −A.

� The Laplacian can be viewed as a quadratic form,

fTLf =1

2

n∑i,j=1

Aij(fi − fj)2

A. Recanati Symbiose Seminar, Juin 2018, 9/29

Page 10: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

2-SUM and the Graph Laplacian

Mathematical reminder

� For a vector f = (f1, . . . , fn)T ∈ Rn and a matrix M ∈ Rn×n, we have,fTMf =

∑ni,j=1Mijfifj

� (λ ∈ R, u ∈ Rn) is a eigenvalue-eigenvector couple of L ∈ Rn×n iff Lu = λu

A. Recanati Symbiose Seminar, Juin 2018, 10/29

Page 11: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

2-SUM and the Graph Laplacian

The Laplacian can be viewed as a quadratic form,

fTLf =1

2

n∑i,j=1

Aij(fi − fj)2

Indeed for any f ∈ Rn,

fTLf = fTDf − fTAf=

∑ni=1 f

2iDii −

∑ni,j=1Aijfifj

=∑ni=1 f

2i (∑nj=1Aij)−

∑ni,j=1Aijfifj

=∑ni,j=1Aij(f

2i − fifj)

= 12

∑ni,j=1Aij(f

2j + f2i − 2fifj)

= 12

∑ni,j=1Aij(fi − fj)2

A. Recanati Symbiose Seminar, Juin 2018, 11/29

Page 12: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

2-SUM and the Graph Laplacian

The Laplacian can be viewed as a quadratic form,

fTLf =1

2

n∑i,j=1

Aij(fi − fj)2

� L is symmetric and positive semi-definite.

� L has n non-negative, real-valued eigenvalues, 0 = λ1 ≤ λ2 ≤ . . . ≤ λn.

� 1 = (1, . . . , 1)T is eigenvector associated to eigenvalue 0.

� If A has K connected components, the eigenvalue 0 has multiplicity K + 1,with eigenvectors being indicators of the connected components.

� If f ∈ {−1,+1}n, objective of min-cut (clustering).

A. Recanati Symbiose Seminar, Juin 2018, 12/29

Page 13: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

2-SUM and the Graph Laplacian

� The 2-SUM problem is written

minπ∈P∑ni,j=1Aπ(i)π(j)(i− j)2

or alternatively,minπ∈P

∑ni,j=1Aij(π(i)− π(j))2

i.e.,minπ∈P πTLπ

� For certain matrices A, 2-SUM ⇐⇒ seriation. ([Fogel et al., 2013])

� NP-Complete for generic matrices A.

� Constraints π ∈ P ?

A. Recanati Symbiose Seminar, Juin 2018, 13/29

Page 14: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Spectral relaxation

minπ∈P

πTLAπ (2SUM)

Set of permutation vectors :

π(i) ∈ {1, ..., n}, ∀1 ≤ i ≤ n

πT1 = n(n+ 1)/2

‖π‖22 = n(n+ 1)(2n+ 1)/6

� Since L1 = 0, (2SUM) is invariant by π ← π − (n+1)2 1, so enforce πT1 = 0.

� Up to a dilatation, we can chose ‖π‖22 = 1.

� Relax the integer constraints and let π(i) ∈ R.

A. Recanati Symbiose Seminar, Juin 2018, 14/29

Page 15: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Spectral relaxation

Spectral Seriation. Define the Laplacian of A as L = diag(A1)−A. TheFiedler vector of A is written

f = argmin1Tx=0,‖x‖2=1

xTLAx.

and is the second smallest eigenvector of the Laplacian.

The Fiedler vector reorders a R-matrix in the noiseless case.

Theorem [Atkins, Boman, and Hendrickson, 1998]

Spectral seriation. Suppose A ∈ Sn is a pre-R matrix, with a simple Fiedler valuewhose Fiedler vector f has no repeated values. Suppose that Π ∈ P is such thatthe permuted Fielder vector Πv is monotonic, then ΠAΠT is an R-matrix.

A. Recanati Symbiose Seminar, Juin 2018, 15/29

Page 16: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Spectral Ordering Algorithm

The Algorithm.

Input: Connected similarity matrix A ∈ Rn×n1: Compute Laplacian L = diag(A1)−A2: Compute second smallest eigenvector of L, x∗

3: Sort the values of x∗

Output: Permutation π : x∗π(1) ≤ x∗π(2) ≤ ... ≤ x∗π(n)

0 20 40 60 800

20

40

60

80

0 20 40 60 80 100

0.15

0.10

0.05

0.00

0.05

0.10

0.15

Similarity matrix Fiedler vector

A. Recanati Symbiose Seminar, Juin 2018, 16/29

Page 17: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Spectral Solution

� Spectral solution easy to compute and scales well (polynomial time)

� But sensitive and not flexible (hard to include additional structural constraints)

� Other (convex) relaxations can handle structural constraints and solve morerobust objectives than 2SUM

Genome assembly pipeline

� Overlap : computed from k-mers, yielding a similarity matrix A

� Layout : A is thresholded to remove noise-induced overlaps, and reorderedwith spectral ordering algorithm. Layout fine-grained with overlapinformation.

� Consensus : Genome sliced in windows

A. Recanati Symbiose Seminar, Juin 2018, 17/29

Page 18: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Spectral Solution vs Noisy Synthetic data

0 100 200 300 400 5000

100

200

300

400

500 0 100 200 300 400 500

0.075

0.050

0.025

0.000

0.025

0.050

0.075

Similarity matrix Fiedler vector

� Gaussian noise over perfect R-matrix.

A. Recanati Symbiose Seminar, Juin 2018, 18/29

Page 19: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Spectral Solution vs Real DNA data

0 2500 5000 7500 10000 12500 15000 17500

0.020

0.015

0.010

0.005

0.000

0.005

Similarity matrix Fiedler vector

� Repeats are a more structured noise that makes the method fail.

A. Recanati Symbiose Seminar, Juin 2018, 19/29

Page 20: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Outline

� Introduction

� Spectral relaxation of Seriation (Spectral Ordering)

� Multi-dimensional Spectral Ordering

� Results (Application to genome assembly)

A. Recanati Symbiose Seminar, Juin 2018, 20/29

Page 21: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Multi-dimensional Spectral Embedding

(Spoiler Alert!)

There is information in the rest of the eigenvectors of L

3d scatter plot of the 3 first non-zero eigenvectors of L

A. Recanati Symbiose Seminar, Juin 2018, 21/29

Page 22: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Multi-Dim 2-SUM and the Graph Laplacian

Generalize the quadratic expression involving the Laplacian,

Tr(

ΦTLAΦ)

=1

2

n∑i,j=1

Aij‖yi − yj‖22

� Let 0 = λ0 < λ1 ≤ . . . ≤ λn−1, Λ , diag (λ0, . . . , λn−1),Φ =

(1, f(1), . . . , f(n−1)

), be the eigendecomposition of L = ΦΛΦT .

� For any K < n, Φ(K) ,(f(1), . . . , f(K)

)defines a K-dimensional embedding

yi =(f(1)(i), f(2)(i), . . . , f(K)(i)

)T ∈ RK, for i = 1, . . . , n. (K-LE)

which solves the following embedding problem,

minimize∑ni,j=1Aij‖yi − yj‖22

such that Φ =(yT1 , . . . ,y

Tn

)T ∈ Rn×K , ΦT Φ = IK , ΦT1n = 0K(Lap-Emb)

A. Recanati Symbiose Seminar, Juin 2018, 22/29

Page 23: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Intermission : Spectral Clustering

Spectral Clustering usually leverages the first few eigenvectors of L. To partitiondata in K clusters,

� Compute the K lowest non-zero eigenvectors of L,Φ(K) =

(f(1), . . . , f(K)

)∈ Rn×K.

� Run the K-means algorithm on this K-dimensional embedding.

A. Recanati Symbiose Seminar, Juin 2018, 23/29

Page 24: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Multi-Dimensional Spectral Ordering (MDSO)

How to extract ordering from multidimensional embedding ?

� Construct new similarity matrix S

� For each point u, take k-NN in the embedding, fit by a line, use projection onthe line to define similarity Sij between points i, j ∈ kNN(u).

� Run basic Spectral Ordering on S.

� If S is not connected, reorder each connected component, and use A to mergethe ends.

A. Recanati Symbiose Seminar, Juin 2018, 24/29

Page 25: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Multi-Dimensional Spectral Ordering (MDSO)

� Simple generalization of Spectral Ordering

� Acts like a pre-preocessing on the similarity matrix

� Improves robustness to noise

� Handles circular orderings (with 2D embedding in a circle)

0.075 0.050 0.025 0.000 0.025 0.050 0.075 0.100

0.04

0.02

0.00

0.02

0.04

initial laplacian embedding of Hic

2D spectral embedding from similarity between single-cell Hi-C contact matrices

A. Recanati Symbiose Seminar, Juin 2018, 25/29

Page 26: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Outline

� Introduction

� Spectral relaxation of Seriation (Spectral Ordering)

� Multi-dimensional Spectral Ordering

� Results (Application to genome assembly)

A. Recanati Symbiose Seminar, Juin 2018, 26/29

Page 27: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Application to genome assembly

Bacterial genome.

� Escherichia coli reads sequenced by Loman et al. [2015]. ∼ 4Mbp

� Oxford Nanopore Technology MinION’s device (third generation long reads).

� minimap2 used to compute overlap-based similarity between reads.

A. Recanati Symbiose Seminar, Juin 2018, 27/29

Page 28: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Application to genome assembly

Layout.

� MDSO new similarity matrix S is disconnected.

� Connected components can be merged by looking at the similarity betweentheir ends from the original matrix A.

� Kendall-Tau score with reference ordering : 99.5%

� Full assembly pipeline yields ∼ 99% avg. identity (using MSA in slidingwindow)

0 2500 5000 7500 10000 12500 15000 17500ordering found

0

1000000

2000000

3000000

4000000

true

posit

ion

0 2500 5000 7500 10000 12500 15000 17500ordering found

0

1000000

2000000

3000000

4000000

true

posit

ion

Order in connected components Merged ordering

A. Recanati Symbiose Seminar, Juin 2018, 28/29

Page 29: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Conclusion

� Equivalence 2-SUM ⇐⇒ seriation.

� Spectral ordering : simple relaxation of 2-SUM using spectrum of theLaplacian. Related to widespread Spectral Clustering algorithm.

� Spectral ordering is sensitive to repeats.

� Multi-dimensional Spectral Ordering can overcome this issue (and solvecircular seriation).

� Straightforward assembly pipeline with MSA to perform consensus.

A. Recanati Symbiose Seminar, Juin 2018, 29/29

Page 30: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

*

References

Jonathan E Atkins, Erik G Boman, and Bruce Hendrickson. A spectral algorithmfor seriation and the consecutive ones problem. SIAM Journal on Computing,28(1):297–310, 1998.

Fajwel Fogel, Rodolphe Jenatton, Francis Bach, and Alexandre d’Aspremont.Convex relaxations for permutation problems. In Advances in NeuralInformation Processing Systems, pages 1016–1024, 2013.

Nicholas J Loman, Joshua Quick, and Jared T Simpson. A complete bacterialgenome assembled de novo using only nanopore sequencing data. Naturemethods, 12(8):733, 2015.

Antoine Recanati, Thomas Bruls, and Alexandre dAspremont. A spectralalgorithm for fast de novo layout of uncorrected long nanopore reads.Bioinformatics, 33(20):3188–3194, 2017.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing,17(4):395–416, 2007.

A. Recanati Symbiose Seminar, Juin 2018, 30/29

Page 31: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Application to genome assembly

Eukaryotic genome : S. Cerevisiae

� 16 chromosomes

� Many repeats

� Higher threshold on similarity matrix ⇒ many connected components

true ordering (from BWA)

recovere

d (

spectr

al ord

ering)

A. Recanati Symbiose Seminar, Juin 2018, 31/29

Page 32: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Conclusion

Straightforward assembly pipeline.

� Equivalence 2-SUM ⇐⇒ seriation.

� Layout correctly found by spectral relaxation for bacterial genomes (withlimited number of repeats)

� Consensus computed by MSA in sliding windows ⇒∼ 99% avg. identity withreference

Future work.

� Additional information could help assemble more complex genomes (e.g.with topological constraints on the similarity graph, or chromosomeassignment...)

� Other problems involving Seriation ?

� Convex relaxations can also handle constraints (e.g. |π(i)− π(j)| ≤ k) fordifferent problems

A. Recanati Symbiose Seminar, Juin 2018, 32/29

Page 33: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Consensus

� Once layout is computed and fined-grained, slicing in windows

� Multiple Sequence Alignment using Partial Order Graphs (POA) in windows

� Windows merging

window 1

window 2

window 3

POA in windows

consensus 1

consensus 2

consensus 3

consensus (1+2)

consensus ((1+2) +3)

A. Recanati Symbiose Seminar, Juin 2018, 33/29

Page 34: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Seriation

Combinatorial problems.

� The 2-SUM problem is written

minπ∈P

n∑i,j=1

Aπ(i)π(j)(i− j)2 or equivalently minπ∈P

πTLAπ

where LA is the Laplacian of A.

� NP-Complete for generic matrices A.

A. Recanati Symbiose Seminar, Juin 2018, 34/29

Page 35: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Convex Relaxation

Seriation as an optimization problem.

minπ∈P

n∑i,j=1

Aπ(i)π(j)(i− j)2

What’s the point?

� Gives a spectral (hence polynomial) solution for 2-SUM on some R-matrices.

� Write a convex relaxation for 2-SUM and seriation.

◦ Spectral solution scales very well (cf. Pagerank, spectral clustering, etc.)

◦ Not very robust. . .

◦ Not flexible. . . Hard to include additional structural constraints.

A. Recanati Symbiose Seminar, Juin 2018, 35/29

Page 36: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Convex Relaxation

� Let Dn the set of doubly stochastic matrices, where

Dn = {X ∈ Rn×n : X > 0, X1 = 1, XT1 = 1}

is the convex hull of the set of permutation matrices.

� Notice that P = D ∩O, i.e. Π permutation matrix if and only Π is bothdoubly stochastic and orthogonal.

� Solveminimize Tr(Y TΠTLAΠY )− µ‖PΠ‖2Fsubject to eT1 Πg + 1 ≤ eTnΠg,

Π1 = 1, ΠT1 = 1,Π ≥ 0,

(1)

in the variable Π ∈ Rn×n, where P = I− 1n11

T and Y ∈ Rn×p is a matrixwhose columns are small perturbations of g = (1, . . . , n)T .

A. Recanati Symbiose Seminar, Juin 2018, 36/29

Page 37: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Convex Relaxation

Objective. Tr(Y TΠTLAΠY )− µ‖PΠ‖2F

� 2-SUM term Tr(Y TΠTLAΠY ) =∑pi=1 y

Ti ΠTLAΠyi where yi are small

perturbations of the vector g = (1, . . . , n)T .

� Orthogonalization penalty −µ‖PΠ‖2F , where P = I− 1n11

T .

◦ Among all DS matrices, rotations (hence permutations) have the highestFrobenius norm.

◦ Setting µ ≤ λ2(LA)λ1(Y YT ), keeps the problem a convex QP.

Constraints.

� eT1 Πg + 1 ≤ eTnΠg breaks degeneracies by imposing π(1) ≤ π(n). Without it,both monotonic solutions are optimal and this degeneracy can significantlydeteriorate relaxation performance.

� Π1 = 1, ΠT1 = 1 and Π ≥ 0, keep Π doubly stochastic.

A. Recanati Symbiose Seminar, Juin 2018, 37/29

Page 38: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Convex Relaxation

Other relaxations.

� Relaxations for orthogonality constraints, e.g. SDPs in [???]. Simple idea:QTQ = I is a quadratic constraint on Q, lift it. This yields a O(

√n)

approximation ratio.

� O(√

log n) approximation bounds for Minimum Linear Arrangement[??????].

� All these relaxations form extremely large SDPs.

Our simplest relaxation is a QP. No approximation bounds at this point however.

A. Recanati Symbiose Seminar, Juin 2018, 38/29

Page 39: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Semi-Supervised Seriation

Convex Relaxation.

� Semi-Supervised Seriation. We can add structural constraints to therelaxation, where

a ≤ π(i)− π(j) ≤ b is written a ≤ eTi Πg − eTj Πg ≤ b.

which are linear constraints in Π.

� Sampling permutations. We can generate permutations from a doublystochastic matrix D

◦ Sample monotonic random vectors u.

◦ Recover a permutation by reordering Du.

� Algorithms. Large QP, projecting on doubly stochastic matrices can be donevery efficiently, using block coordinate descent on the dual. Extendedformulations by [?] can reduce the dimension of the problem to O(n log n) [?].

A. Recanati Symbiose Seminar, Juin 2018, 39/29

Page 40: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Numerical results: nanopores

Nanopores DNA data. New sequencing hardware.

Oxford nanopores MinION.

A. Recanati Symbiose Seminar, Juin 2018, 40/29

Page 41: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Numerical results: nanopores

Nanopores.

A. Recanati Symbiose Seminar, Juin 2018, 41/29

Page 42: Seriation, Spectral Clustering and de novo genome assemblyrecanati/slides/Symbiose18.pdfSpectral relaxation min ˇ2P ... There is information in the rest of the eigenvectors of L 3d

Numerical results

Nanopores DNA data.

� Longer reads. Average 10k base pairs in early experiments. Compared with∼ 100 base pairs for existing technologies.

� High error rate. About 20% compared with a few percents for existingtechnologies.

� Real-time data. Sequencing data flows continuously.

A. Recanati Symbiose Seminar, Juin 2018, 42/29