Persistence landscapes toolbox: a tool for topological statistics.
Paweł Dłotko
Geometrica Group, Inria Saclay
IST-Austria,
7 July 2015.
Possible characterization of data.
[Diagram: space of data (high dimension) mapped to scalar characteristics of data (dimension 1 or 2).]
Vorticity/dynamics in turbulent flow.
Dynamics of a flow.
Average enstrophy
Discretization
To discover.
Ayasdi (M. Nicolau et al.).
Why do people like low dimensions?
1. Because they can see them.
2. Because they can understand them.
3. Because they can use standard statistics for the obtained observations.
4. Because there are no tools to operate on higher dimensional data.
Persistence comes into play.
[Diagram: space of data (high dimension), space of persistence diagrams, scalar characteristics of data (dimension 1 or 2); arrow label: stable projection.]
Persistence, state of the art.
1. Persistent homology is a dimension reduction technique.
2. There are standard metrics to compare persistence diagrams.
3. Early attempts to define Frechet mean of a diagram.
Problem with Frechet mean.
Persistence, what people are doing.
[Diagram: space of data (high dimension), space of persistence diagrams, scalar characteristics of data (dimension 1 or 2); arrow label: stable projection.]
Persistence, what is worth considering.
[Diagram: space of data (high dimension), space of persistence diagrams, scalar characteristics of data (dimension 1 or 2); arrow label: stable projection.]
Lifting persistence to larger space.
1. Persistence diagrams as distributions:
1.1 Frechet means are unique, but hard to compute (K. Turner, Y. Mileyko, S. Mukherjee, J. Harer).
1.2 For image classification (J. Reininghaus, S. Huber, U. Bauer, R. Kwitt).
2. Persistence landscapes (P. Bubenik) / Size functions (M. Ferri, P. Frosini, C. Landi, B. di Fabio, A. Cerri, ...).
Persistence landscapes.
Persistence landscapes λ1.
Persistence landscapes λ2.
Persistence landscapes λ3.
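Formal definition.
1. The persistence landscape of a multiset of persistence barcodes $\{(b_i, d_i)\}_{i=1}^{n}$ is a set of functions $\lambda_k : \mathbb{R} \to \mathbb{R}$ such that $\lambda_k(x)$ is the k-th largest value of $\{f_{(b_i, d_i)}(x)\}_{i=1}^{n}$, where
2.
$$
f_{(b,d)}(x) =
\begin{cases}
0 & \text{if } x \notin (b, d) \\
x - b & \text{if } x \in \left(b, \frac{b+d}{2}\right] \\
-x + d & \text{if } x \in \left(\frac{b+d}{2}, d\right)
\end{cases}
\qquad (1)
$$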
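The definition can be evaluated directly at a point. The following is a minimal Python sketch, not the toolbox's code; the names `tent` and `landscape` and the sample barcode are made up for illustration.

```python
def tent(b, d, x):
    """f_(b,d)(x) from equation (1): zero outside (b, d), a tent inside."""
    if x <= b or x >= d:
        return 0.0
    mid = (b + d) / 2.0
    return x - b if x <= mid else d - x


def landscape(bars, k, x):
    """lambda_k(x): the k-th largest tent value at x (k starts at 1).

    Returns 0 when the barcode has fewer than k bars.
    """
    values = sorted((tent(b, d, x) for b, d in bars), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0


# Hypothetical barcode used only for this example.
bars = [(1.0, 5.0), (2.0, 8.0), (3.0, 4.0)]
for k in (1, 2, 3):
    print(k, [landscape(bars, k, x) for x in (2.0, 3.5, 5.0, 6.0)])
```

Sorting all n tent values at every point is the naive approach; it is only meant to mirror the definition.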
Landscapes as size function.
Persistence landscapes.
1. One-to-one representation of persistence.
2. Vector space operations on functions (+, −, multiplication by a scalar) are well defined.
3. The average of two functions f, g in function space is just (f+g)/2.
4. Standard Lp norms and distances are well defined. Landscapes are stable with respect to those norms.
5. PL-functions → easy to compute.
6. Represented as a set of critical points. In between them, one needs to use linear interpolation.
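A minimal Python sketch of the critical-point representation, assuming a landscape function is stored as a sorted list of (x, y) pairs with distinct x values. The helper names and the hand-written example are illustrative, not the toolbox's data structures. Because the function is piecewise linear and nonnegative, the trapezoid rule on the critical points gives its L1 norm exactly.

```python
def evaluate(points, x):
    """Value at x of the piecewise linear function through the critical points."""
    if x <= points[0][0] or x >= points[-1][0]:
        return 0.0  # a landscape vanishes outside its support
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            # Linear interpolation between two consecutive critical points.
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return 0.0


def integral(points):
    """Exact integral: the trapezoid rule is exact on each linear piece."""
    return sum((y0 + y1) / 2.0 * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# lambda_1 of the barcode {(0, 4), (2, 6)}, written by hand as critical points.
lam1 = [(0.0, 0.0), (2.0, 2.0), (3.0, 1.0), (4.0, 2.0), (6.0, 0.0)]
print(evaluate(lam1, 2.5), integral(lam1))
```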
How to compute a persistence landscape?
Swapping line approach (M. Kerber)
Time complexity of construction.
1. In both cases, O(n log n + Kn), where K is the number of nonzero landscapes.
2. Pessimistically, this is O(n^2).
3. We believe that the swapping line algorithm may perform better in practice. Tests underway.
4. Here we are talking about exact computations. The R package TDA does that over a grid (CMU TopStat Group).
Lp distance, sum of distances between levels.
Averages.
Sum.
1/2 Sum.
Averages (in a grid).
1. Note that the landscape of an average does not come from a persistence diagram.
2. The slope of averages is between −1 and 1.
3. Typically it is much lower.
4. Error bounds over a grid may be heavily overestimated.
5. Exact representation gives more flexibility.
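A small Python sketch of averaging over a grid, with two made-up barcodes `A` and `B` and a hypothetical helper `lambda1`. It averages the first landscape function of each barcode and inspects the slopes of the result. A slope such as 0.5 cannot occur in a landscape of a single diagram, where every slope is −1, 0 or 1.

```python
def lambda1(bars, x):
    """First landscape function: the largest tent value at x, or 0."""
    return max([0.0] + [min(x - b, d - x) for b, d in bars])


step = 0.5
grid = [i * step for i in range(21)]  # 0, 0.5, ..., 10

# Two hypothetical barcodes.
A = [(0.0, 4.0), (3.0, 9.0)]
B = [(1.0, 3.0), (5.0, 10.0)]

# Pointwise average of the two landscapes on the grid.
avg = [(lambda1(A, x) + lambda1(B, x)) / 2.0 for x in grid]

# Slopes between consecutive grid points.
slopes = [(y1 - y0) / step for y0, y1 in zip(avg, avg[1:])]
print(min(slopes), max(slopes))
```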
Permutation tests.
Group 1: 1, 5, 2, 6, 7, 9 (average 5).
Group 2: 21, 25, 27, 23, 22, 29 (average 24.5).
Observed difference of averages: |5 − 24.5| = 19.5.
Merge: 1, 5, 2, 6, 7, 9, 21, 25, 27, 23, 22, 29.
Shuffle: 21, 23, 7, 27, 6, 29, 5, 25, 1, 22, 2, 9.
Divide: 21, 23, 7, 27, 6, 29 (average 18.83) and 5, 25, 1, 22, 2, 9 (average 10.67).
Difference of averages after shuffling: |18.83 − 10.67| = 8.17.
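The merge, shuffle and divide steps can be sketched in Python on the numbers from the slide. This is an illustration on plain numbers, not the toolbox's PermutationTest program. With only 12 values, the sketch enumerates all 924 ways to divide the merged set instead of drawing random shuffles. For landscapes one would typically replace the numbers by landscapes, the absolute difference by a distance between averages, and the enumeration by random shuffles.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b):
    """Return the observed difference of averages and the exact p-value."""
    observed = abs(mean(group_a) - mean(group_b))
    merged = group_a + group_b
    indices = range(len(merged))
    total = extreme = 0
    # Every way to divide the merged data into groups of the original sizes.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        first = [merged[i] for i in chosen]
        second = [merged[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mean(first) - mean(second)) >= observed - 1e-12:
            extreme += 1
    return observed, extreme / total


group_a = [1, 5, 2, 6, 7, 9]
group_b = [21, 25, 27, 23, 22, 29]
observed, p_value = permutation_test(group_a, group_b)
print(observed, p_value)
```

Only the original division and its mirror image reach a difference of 19.5, so the p-value is 2/924.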
Simple classifiers.
[Diagram: Group 1, Group 2, ..., Group k, each consisting of Persistence 1, Persistence 2, ..., Persistence n. An average is computed for each group: Av(group 1), Av(group 2), ..., Av(group k). A new persistence diagram is compared to each average by distance.]
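A minimal Python sketch of such a classifier, assuming every landscape has already been sampled on a common grid and flattened into a list of numbers. The vectors, group labels and function names are made up for illustration, and the L2 distance is one possible choice.

```python
import math


def average(vectors):
    """Coordinate-wise average of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(u, v):
    """L2 distance between two sampled landscapes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(groups, new_item):
    """Return the label of the group whose average is closest to new_item."""
    averages = {label: average(items) for label, items in groups.items()}
    return min(averages, key=lambda label: distance(averages[label], new_item))


# Hypothetical training data: two groups of sampled landscapes.
groups = {
    "group 1": [[0.0, 1.0, 0.0], [0.0, 1.2, 0.0]],
    "group 2": [[0.0, 3.0, 1.0], [0.0, 2.8, 1.2]],
}
print(classify(groups, [0.0, 2.5, 0.9]))
```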
End-user programs to compute various statistics on persistence landscapes.
1. Computations of distance matrix.
2. Computation of average landscapes.
3. Standard deviation.
4. Inner products of landscapes.
5. Computations of integrals.
6. Moments computations.
7. Permutation test.
8. T-test, ANOVA.
9. Classifiers.
10. Normalization of barcodes.
11. Plots.
Why bother with L1 distances when there is 1-Wasserstein?
[Plot: curves W1 and L1; horizontal axis from 100 to 800, logarithmic vertical axis from 0.001 to 10000.]
Why bother with L2 distances when there is 2-Wasserstein?
[Plot: curves W2 and L2; horizontal axis from 100 to 800, logarithmic vertical axis from 0.001 to 10000.]
Why bother with L∞ distances when there is the bottleneck distance?
[Plot: curves Bottleneck and L∞; horizontal axis from 100 to 800, logarithmic vertical axis from 0.001 to 10.]
Persistence, what is worth considering.
[Diagram: space of data (high dimension), space of persistence diagrams, scalar characteristics of data (dimension 1 or 2); arrow label: stable projection.]
Latest idea.
1. Suppose we have snapshots of a process S1, ..., Sn.
2. They give a collection D1, ..., Dn of diagrams.
3. Processes are characterized by real numbers G1, ..., Gn.
4. Can we recover G1, ..., Gn from D1, ..., Dn?
5. Is there a function f such that f(D1), ..., f(Dn) correlate with G1, ..., Gn?
6. In this case, a scientist usually tries to find f by looking at diagrams.
7. Which is frustrating and time consuming.
8. The new algorithm in PLT uses brute force to find a function of the diagram that correlates most with the data.
Latest idea.
[Diagram: process S1, ..., Sn → (persistence) → persistence diagrams D1, ..., Dn → (f) → characteristics G1 < ... < Gn.]
Latest idea, more details.
1. Suppose that D1, ..., Dn are sorted according to G1, ..., Gn.
2. Φ – collection of scalar valued functions defined on a diagram.
3. For every f ∈ Φ, compute f(D1), ..., f(Dn).
4. Compute the Kendall tau distance of the sequence [f(D1), ..., f(Dn)] from the sorted version of [f(D1), ..., f(Dn)].
5. Best function in Φ – the one with the smallest tau distance.
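The search can be sketched in Python as follows. The three candidate functions in `candidates` and the toy diagrams are assumptions made for this example; the slides do not list the functions that PLT actually tries. The tau distance is computed here as the fraction of out-of-order pairs, and ties are not counted as out of order.

```python
def tau_distance(values):
    """Fraction of pairs (i < j) that are out of increasing order."""
    n = len(values)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if values[i] > values[j])
    return inversions / (n * (n - 1) / 2)


# Hypothetical collection Phi of scalar valued functions of a diagram.
candidates = {
    "number of bars": len,
    "longest bar": lambda D: max(d - b for b, d in D),
    "total persistence": lambda D: sum(d - b for b, d in D),
}


def best_function(diagrams):
    """Try every candidate and return the one with the smallest tau distance."""
    scores = {name: tau_distance([f(D) for D in diagrams])
              for name, f in candidates.items()}
    return min(scores, key=scores.get), scores


# Toy diagrams, assumed to be already sorted according to G1 < ... < Gn.
diagrams = [
    [(0, 1), (0, 2), (1, 2)],
    [(0, 3), (1, 2)],
    [(0, 4), (0, 1), (2, 3), (1, 2)],
    [(0, 5)],
]
name, scores = best_function(diagrams)
print(name, scores)
```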
Kendall tau distance.
4 3 2 1
4 3 1 2
4 1 3 2
1 4 3 2
1 4 2 3
1 2 4 3
1 2 3 4
1. Six transpositions are needed.
2. Number of all possible transpositions: 4·3/2 = 6.
3. Kendall tau distance is defined as the number of transpositions needed divided by the number of transpositions needed to reverse a sequence.
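The example can be reproduced with a bubble sort that swaps adjacent entries and records every intermediate sequence. This is a Python sketch; the function name is made up for illustration.

```python
def transpositions_to_sort(sequence):
    """Bubble sort that records the sequence after every adjacent swap."""
    a = list(sequence)
    steps = [tuple(a)]
    n = len(a)
    for i in range(n):
        # Bubble the smallest remaining entry to position i, from the right.
        for j in range(n - 1, i, -1):
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
                steps.append(tuple(a))
    return steps


steps = transpositions_to_sort([4, 3, 2, 1])
swaps = len(steps) - 1
n = 4
tau = swaps / (n * (n - 1) / 2)
for step in steps:
    print(*step)
print(swaps, tau)
```

For the fully reversed sequence the distance is 1, the largest possible value.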
Latest idea, a variation.
[Diagram: Collection 1: D1, ..., Dn; Collection 2: E1, ..., Ek; a function f (to be found) maps diagrams to R.]
Applications overview.
1. Patterns from numerical analysis (Cahn-Hilliard-Cook, Diblock-Copolymer equations).
2. Dimensions of spheres.
3. Efficient distance matrix computations (granular media analysis).
Classification example – Cahn Hilliard Cook patterns.
Topological process.
1. Sequence of time varying data, each of which has its own persistence.
2. Distances are computed between the corresponding time steps.
3. Averaging over time steps (to get averaged topological process).
[Diagram axes: time, process #.]
Topological classifier.
[Diagram: Dynamics f(proportion of mass, time) → Patterns → Persistent homology of patterns; parameters p1, p2; simple nearest neighbor classifier.]
Dimensions of spheres.
1. This is a proof-of-concept experiment.
2. Suppose we are given a collection of point clouds sampled from the (round) sphere S^d for d ∈ {1, ..., 9}.
3. Suppose that the point clouds were normalized so that the average distance between points is 1.
4. Then, zero and one dimensional persistence identify the dimension (permutation test).
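A Python sketch of how such a point cloud could be generated and normalized. The sampling method (normalized Gaussian vectors), the sample size and the function names are assumptions for this example; the slides do not say how the data were produced.

```python
import math
import random


def sample_sphere(d, count, rng):
    """Sample points uniformly from the unit sphere S^d in R^(d+1)."""
    points = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(d + 1)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append([c / norm for c in v])
    return points


def average_distance(points):
    """Average Euclidean distance over all pairs of points."""
    total, pairs = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


def normalize(points):
    """Rescale the cloud so that the average pairwise distance is 1."""
    scale = average_distance(points)
    return [[c / scale for c in p] for p in points]


rng = random.Random(0)
cloud = normalize(sample_sphere(2, 100, rng))
print(average_distance(cloud))
```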
Average landscapes in dimension 1.
Granular media.
1. Granular media – large conglomerations of discrete macroscopic particles.
2. They behave differently from solids, liquids, or gases.
3. Konstantin Mischaikow and Miro Kramar are using persistence to characterize force networks inside media.
4. A lot of distances need to be computed.
More to come soon.
I hope...
How to obtain?
1. Go to http://www.math.upenn.edu/~dlotko/persistenceLandscape.html.
2. Linux, Windows and OS X executables which can perform most typical tasks are provided.
3. Source code (still a bit messy) for advanced users. A lot of comments are provided in the code.
4. The R-package TDA can be obtained from here: http://cran.r-project.org/web/packages/TDA/index.html.
5. Will be happy to do a code demonstration for you!
Joint work with:
Peter Bubenik, Takashi Ishihara, Michael Kerber, Konstantin Mischaikow, Thomas Wanner.
Thank you for your time!
Pawel Dlotko
Inria, Saclay, [email protected]
pawel dlotko @ skype
pdlotko @ gmail
Let’s check out the library!
1. Dataset: Let us sample 11 times 50n points from a wedge of n circles, iid with some error.
2. Compute the Rips complex and persistence of each of the point clouds.
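One way to generate such a dataset, as a Python sketch. The embedding is an assumption made for this example: circle k is a unit circle in the plane spanned by the coordinate axes 0 and k of R^(n+1), so that all circles meet only at the origin. The noise model (Gaussian, added to every coordinate) and the value n = 3 are also assumptions.

```python
import math
import random


def sample_wedge(n, count, noise, rng):
    """Sample points from a wedge of n unit circles in R^(n+1), plus noise."""
    points = []
    for _ in range(count):
        k = rng.randrange(1, n + 1)           # pick one of the n circles
        t = rng.uniform(0.0, 2.0 * math.pi)   # position on that circle
        p = [0.0] * (n + 1)
        p[0] = math.sin(t)
        p[k] = 1.0 + math.cos(t)              # centre is the unit vector e_k
        points.append([c + rng.gauss(0.0, noise) for c in p])
    return points


rng = random.Random(0)
n = 3
# 11 point clouds with 50n points each.
clouds = [sample_wedge(n, 50 * n, 0.05, rng) for _ in range(11)]
print(len(clouds), len(clouds[0]), len(clouds[0][0]))
```

The Rips complex and persistence of each cloud would then be computed with separate persistent homology software.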
What do you need first?
1. You need persistence intervals in the form of a file:
1 2
4 5
9 22
2. They can be obtained with various programs to compute persistent homology.
3. Dionysus, JPlex, Perseus, PHAT, Plex.
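A Python sketch for writing and reading such a file, assuming one interval per line with birth and death separated by whitespace, as in the example above. The function names and the file name are made up, and the toolbox may accept other details (for example infinite intervals) that are not covered here.

```python
import os
import tempfile


def write_barcode(path, bars):
    """Write one 'birth death' pair per line."""
    with open(path, "w") as handle:
        for birth, death in bars:
            handle.write(f"{birth} {death}\n")


def read_barcode(path):
    """Read 'birth death' pairs, skipping blank or malformed lines."""
    bars = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue
            bars.append((float(fields[0]), float(fields[1])))
    return bars


# Round trip through a temporary file with the intervals from the slide.
path = os.path.join(tempfile.mkdtemp(), "barcode.txt")
write_barcode(path, [(1, 2), (4, 5), (9, 22)])
bars = read_barcode(path)
print(bars)
```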
Distance matrix
1. Go to http://www.math.upenn.edu/~dlotko/persistenceLandscape.html.
2. Linux, Windows and OS X executables which can perform most typical tasks are provided.
3. Source code (still a bit messy) for advanced users. A lot of comments are provided in the code.
4. Construct two files with paths to the barcodes.
5. Call DistanceMatrix program.
Others...
1. Let us try standard deviation (StandardDeviation),
2. Permutation test (PermutationTest),
3. Computations of averages (ComputeAverage),
4. Plotting subroutines (PlotsOfLandscapesViaScripts).
5. Classification (in dimension 1) (ClassifierBasedOnSingleDimension).