Statistical techniques in topological data analysis

Andrew J. Blumberg

June 17th, 2014

Overview

Goal of talk

Give an overview of work on integrating statistical methodology with topological data analysis (persistent homology).

This integration is essential to using TDA for data analysis:

Provides ways to understand what the results of TDA mean.

Provides methodology for handling noise.

Provides approaches to taming computational burden of TDA.

Particularly salient when considering “big data” applications.

Recollections about TDA methodology

Pipeline

Finite metric space → filtered space → barcode

$(X, \partial_X) \mapsto \{Z_k\} \mapsto B$.
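As a concrete instance of this pipeline, here is a minimal sketch restricted to degree 0, where the barcode of the Vietoris-Rips filtration can be read off from a union-find pass over the sorted edge lengths. The function name `h0_barcode`, the Euclidean metric, and the convention that a bar dies at the full edge length (rather than half of it) are assumptions made for illustration; they are not taken from the talk.

```python
from itertools import combinations
import math


def h0_barcode(points):
    """Return the degree-0 Vietoris-Rips barcode as (birth, death) pairs."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges of the complete graph sorted by length: this is the filtration order.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj               # two components merge, so one bar dies
            bars.append((0.0, length))
    bars.append((0.0, math.inf))          # one component never dies
    return bars


# Finite metric space: four points in the plane with the Euclidean distance.
cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0)]
barcode = h0_barcode(cloud)
```

Higher-degree barcodes need a full persistent homology computation, which this sketch does not attempt.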

Question

What does the barcode mean?

[Slides 10 to 12 contain only pictures, which are not reproduced in this transcript.]

Reflections on what those pictures tell us

Data set might not come from an object with obvious geometric structure.

Data might have meaningful features at varying scales.

Output of TDA might be more useful as a signature or descriptor than something to explicitly extract feature information from.

Even small amounts of maliciously placed corruption can change results substantially.

Refinement of TDA pipeline to integrate with statistics

Incorporate sampling: start with a metric measure space $(X, \partial_X, \mu_X)$. Example: Riemannian manifold.

Refined pipeline

mm-space → finite metric space → filtered complex → barcode

$(X, \partial_X, \mu_X) \mapsto (X', \partial'_X) \mapsto \{Z_k\} \mapsto B$

where $X' \subset X$.

This perspective leads to serious consideration of distributions of barcodes $\{B_k\}$, associated to blocks of samples $\{X'_k\}$.

Barcode space

Barcode space is a metric space:

Definition

The bottleneck distance between barcodes $B_1 = \{I_\alpha\}$ and $B_2 = \{J_\beta\}$ is

$$d_B(B_1, B_2) = \inf_{C} \sup_{(I, J) \in C} d_\infty(I, J),$$

where $C$ varies over all matchings between $B_1$ and $B_2$.

Roughly speaking, $d_B(B_1, B_2) \le \varepsilon$ means that there is a matching in which the discrepancy between matched bars is $< \varepsilon$ and all unmatched bars have length $< \varepsilon$.

(There are also a variety of other metrics.)
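To make the definition concrete, the following brute-force sketch evaluates the bottleneck distance for very small barcodes by trying every matching. It assumes finite bars stored as `(birth, death)` pairs and charges an unmatched bar half its length, which is the usual persistence-diagram convention (the slide's "length < ε" is a rough statement of this up to a constant). The names `d_inf` and `bottleneck_bruteforce` are hypothetical, and the factorial running time makes this an illustration only, not a practical algorithm.

```python
from itertools import permutations


def d_inf(I, J):
    """Sup-norm distance between two bars, each viewed as a point (birth, death)."""
    return max(abs(I[0] - J[0]), abs(I[1] - J[1]))


def bottleneck_bruteforce(B1, B2):
    """Bottleneck distance by exhaustive search over all matchings."""
    # None is a placeholder meaning "this bar is left unmatched".
    left = list(B1) + [None] * len(B2)
    right = list(B2) + [None] * len(B1)

    def cost(I, J):
        if I is None and J is None:
            return 0.0
        if I is None:
            return (J[1] - J[0]) / 2      # J is unmatched
        if J is None:
            return (I[1] - I[0]) / 2      # I is unmatched
        return d_inf(I, J)

    # inf over matchings of the sup over matched pairs.
    return min(
        max((cost(I, J) for I, J in zip(left, perm)), default=0.0)
        for perm in permutations(right)
    )


B1 = [(0.0, 4.0), (1.0, 1.5)]
B2 = [(0.5, 4.5)]
distance = bottleneck_bruteforce(B1, B2)
```

Here the best matching pairs the two long bars (discrepancy 0.5) and leaves the short bar unmatched (cost 0.25), so the distance is 0.5.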

Properties of barcode space

What kind of metric space is barcode space?

Good news: Barcode space is amenable to probability theory. (Formally, a suitable subset is a complete metric space with a countable dense subset. [Mileyko, Mukherjee, Harer])

In particular, the standard metrics on distributions that metrize weak convergence are available.

Bad news: Barcode space is positively curved.

In particular, shortest paths between points are not unique, and so centroids are hard to define.

Summarizing distributions of barcodes

Question

How do you summarize a distribution of barcodes? (What are the moments?)

Two approaches:

1. Although centroids don't make sense, one can define a Fréchet mean. Unfortunately, it is very hard to compute, sometimes has counterintuitive properties, and is not stable. [Turner, Mileyko, Mukherjee, Harer]

2. Produce some kind of distribution on R (or at least on something nicer than barcode space) and then use traditional summary statistics for that.
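For reference, the Fréchet mean of a finite sample of barcodes $B_1, \dots, B_m$ can be written as follows. This is the standard general definition for a metric space; the symbol $d$ stands for whichever metric on barcode space is chosen, which the slide does not specify.

```latex
\operatorname{Fr}(B_1, \dots, B_m)
  \;=\; \operatorname*{arg\,min}_{B} \; \sum_{i=1}^{m} d(B, B_i)^2
```

The minimizer need not be unique, so the Fréchet mean is in general a set of barcodes rather than a single one.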

Projections to distributions on more tractable spaces

A reasonable map from barcode space to either R or some kind of function space will allow us to turn distributions of barcodes into more tractable objects.

1. Many possible maps:

Length of the kth-longest bar.

Ratio of the endpoints of the kth-longest bar.

Difference between the sizes of the kth and (k+1)th bars, or a vector consisting of these values for all k.

2. "Persistence landscapes": the idea is to define $\lambda_k(t)$ to be the largest size of an interval centered at t that is contained in at least k intervals of the barcode. [Bubenik]

Taking a robust statistic (e.g., the median) of such a distribution immediately provides a robust statistic of the barcode distribution.
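Two of these maps can be sketched in a few lines. The sketch assumes finite bars stored as `(birth, death)` pairs and measures the "size" of a centered interval by its half-width, so that the value of $\lambda_k(t)$ is the kth-largest value of $\max(0, \min(t - b, d - t))$ over the bars $(b, d)$. The function names are hypothetical and the sample barcodes are made up for illustration.

```python
import statistics


def kth_longest_length(barcode, k):
    """Length of the kth-longest bar (k starts at 1); 0.0 if there are fewer bars."""
    lengths = sorted((d - b for b, d in barcode), reverse=True)
    return lengths[k - 1] if k <= len(lengths) else 0.0


def landscape(barcode, k, t):
    """Persistence landscape value lambda_k(t).

    For each bar (b, d), min(t - b, d - t) is the largest half-width of an
    interval centered at t that still fits inside the bar; lambda_k(t) is
    the kth-largest of these values, clipped below at zero.
    """
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in barcode), reverse=True
    )
    return tents[k - 1] if k <= len(tents) else 0.0


bars = [(0.0, 4.0), (1.0, 3.0), (2.0, 2.5)]
second_longest = kth_longest_length(bars, 2)
lam1 = landscape(bars, 1, 2.0)
lam2 = landscape(bars, 2, 2.0)

# A robust summary of a (made-up) sample of barcodes: the median of a real-valued map.
sample = [bars, [(0.0, 3.0)], [(0.0, 5.0), (1.0, 2.0)]]
median_longest = statistics.median(kth_longest_length(b, 1) for b in sample)
```

The median is used in the last line because, unlike the mean, it is insensitive to a small fraction of wildly different barcodes in the sample.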

Sampling at a fixed finite rate as an invariant

Claim

The distribution of barcodes produced by the collection of all samples of a fixed size n is a good basic invariant. [Blumberg, Gal, Mandell, Pancia]

1. Easy to approximate in practice: take the empirical approximation by repeated sampling, or use resampling if getting new samples is problematic.

2. Efficient: use samples that are as large as the computation will bear; very parallelizable.

3. Robust to noise: low-density noise doesn't affect many samples, depending on the balance of sampling rate and total number of points.

4. (Related to the last point: stable to changes in the distribution, in a precise sense.)
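The empirical approximation in point 1 can be sketched as follows. This is an illustration under stated assumptions, not the construction from the cited work: the barcode of each subsample is reduced to a single number, the longest finite degree-0 bar (which equals the longest edge of the Euclidean minimum spanning tree), and the synthetic data set, the sample size, and all function names are made up for the example.

```python
import math
import random
import statistics


def longest_finite_h0_bar(points):
    """Longest edge of the Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    best = [math.inf] * n          # cheapest known connection of each point to the tree
    if n:
        best[0] = 0.0
    in_tree = [False] * n
    longest = 0.0
    for _ in range(n):
        i = min((j for j in range(n) if not in_tree[j]), key=best.__getitem__)
        in_tree[i] = True
        longest = max(longest, best[i])
        for j in range(n):
            if not in_tree[j]:
                best[j] = min(best[j], math.dist(points[i], points[j]))
    return longest


def sampled_statistics(points, n, num_samples, seed=0):
    """Empirical distribution of the statistic over random subsamples of size n."""
    rng = random.Random(seed)
    return [
        longest_finite_h0_bar(rng.sample(points, n)) for _ in range(num_samples)
    ]


# Synthetic data: two tight clusters about 5 apart, plus one far-away outlier.
rng = random.Random(1)
cloud = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(100)]
cloud += [(rng.gauss(5, 0.1), rng.gauss(0, 0.1)) for _ in range(100)]
cloud.append((0.0, 50.0))

values = sampled_statistics(cloud, n=20, num_samples=200)
summary = statistics.median(values)
```

Because the outlier lands in only a small fraction of the subsamples, the median of `values` reflects the gap between the two clusters rather than the distance to the outlier, which is the robustness described in point 3. Each subsample is processed independently, so the loop in `sampled_statistics` could be distributed across workers.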

Efficiency

Computing persistence barcodes for large data sets naively is impossible (combinatorial explosion).

Approximation methods are an active area of research, but many depend on a priori knowledge of feature scale.

The perspective that the distribution is the invariant allows us to assemble information from many small samples.

Easy to compute in parallel.

Necessary for potential applications to "big data".

Hypothesis testing and confidence intervals

Can try to perform statistical inference either directly in barcode space or on distributions in R.

In barcode space:

1 Hypothesis testing is difficult:

Hard to specify null hypotheses.

Don’t have analytic formulas for asymptotics of null hypothesis distributions. (Despite excellent work of many people on homology for reasonable classes of null hypotheses. [Adler, Bobrowski, Kahle, Meckes, Weinberger])

2 Specify hypotheses in terms of sampling procedure and then use Monte Carlo sampling.

3 Confidence intervals are easier: can be obtained for “underlying” persistent homology. [Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, Singh]
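Point 2 can be illustrated with a small Monte Carlo test. This is a sketch under stated assumptions rather than a method taken from the talk: the null hypothesis is "30 points drawn uniformly from the unit square", the test statistic is the longest finite H0 bar (the longest minimum-spanning-tree edge), and the observed data set is a made-up two-cluster example. All function names are hypothetical.

```python
import math
import random


def longest_h0_bar(points):
    """Longest finite H0 bar = longest edge of a minimum spanning tree (Prim)."""
    n = len(points)
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    longest = 0.0
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        longest = max(longest, best[j])
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[j], points[i]))
    return longest


def monte_carlo_p_value(observed, null_sampler, statistic, num_draws, rng):
    """One-sided p-value: how often the null gives a statistic >= the observed one."""
    obs = statistic(observed)
    exceed = sum(statistic(null_sampler(rng)) >= obs for _ in range(num_draws))
    # The +1 terms count the observed data set as one draw from the null.
    return (1 + exceed) / (1 + num_draws)


def uniform_square(rng, n=30):
    """Null hypothesis stated as a sampling procedure: n uniform points in [0, 1]^2."""
    return [(rng.random(), rng.random()) for _ in range(n)]


rng = random.Random(0)
# Hypothetical observation: two tight clusters in opposite corners of the square.
observed = [(rng.uniform(0.0, 0.1), rng.uniform(0.0, 0.1)) for _ in range(15)]
observed += [(rng.uniform(0.9, 1.0), rng.uniform(0.9, 1.0)) for _ in range(15)]

p = monte_carlo_p_value(observed, uniform_square, longest_h0_bar, num_draws=199, rng=rng)
```

No analytic null distribution is needed: the sampler itself is the hypothesis, and the p-value resolution is limited only by `num_draws`.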

Hypothesis testing and confidence intervals

As above, can first project to distributions on R or R^n (or the like):

1 Similar issues, although easier to specify hypotheses.

2 Confidence intervals for various statistics, in some cases analytic.

Resampling methods (e.g., the bootstrap) are very relevant (and asymptotically consistent).
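Once each barcode has been projected to a real number, a bootstrap confidence interval takes only a few lines. The sketch below uses the percentile bootstrap for the mean. The input values are synthetic stand-ins for a barcode summary such as the longest bar per sample, and the choice of the percentile variant is an assumption of this example, not something specified in the talk.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    replicates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(num_resamples)
    )
    lo = replicates[int((alpha / 2) * num_resamples)]
    hi = replicates[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi


# Hypothetical input: one real-valued summary per sampled barcode.
rng = random.Random(42)
summaries = [abs(rng.gauss(1.0, 0.2)) for _ in range(60)]
lo, hi = bootstrap_ci(summaries)
```

Passing a different `stat` (for example `statistics.median`) reuses the same resampling loop for other summaries.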

Euler characteristic

Even though the field is very new, I’ve left out a lot of good stuff. One topic to mention in passing:

Observation

The Euler characteristic χ of a simplicial complex is both much more robust and much easier to compute than homology.

Theoretical foundations are more tractable here: Euler characteristics of Gaussian random fields [Adler-Taylor].

Weaker by itself, but as an ensemble: Euler characteristic transform for slices through 3D solids [Turner, Mukherjee].
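The ease of computation is visible in code: χ is the alternating sum of simplex counts, so it needs only counting, with no matrix reduction. A minimal sketch, assuming the complex is given by its maximal simplices as vertex tuples (a representation chosen for this example):

```python
from itertools import combinations


def euler_characteristic(maximal_simplices):
    """Alternating sum of simplex counts: chi = sum_k (-1)^k * (# k-simplices)."""
    faces = set()
    for simplex in maximal_simplices:
        vertices = sorted(simplex)
        # Every nonempty subset of a simplex is a face of the complex.
        for size in range(1, len(vertices) + 1):
            faces.update(combinations(vertices, size))
    # A face with m vertices has dimension m - 1.
    return sum((-1) ** (len(face) - 1) for face in faces)


circle = [(0, 1), (1, 2), (0, 2)]                      # hollow triangle
sphere = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # boundary of a tetrahedron
disk = [(0, 1, 2)]                                     # filled triangle
```

The same number equals the alternating sum of Betti numbers, which is why it is a weaker invariant than homology: the circle and the sphere are told apart (0 versus 2), but distinct Betti numbers can cancel.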

Questions for the future:

Outstanding issue

So far, limited experience using these tools on very large data sets.

(Except “Mapper”, which I’ll talk about this afternoon.)

1 What about guided sampling?

2 How can we best handle data sets that have “wild” geometric structure (or none at all)?

3 More and better “topological” summaries and hypothesis specification tools.
