Statistical techniques in topological data analysis

Andrew J. Blumberg (blumberg@math.utexas.edu)

June 17th, 2014

Andrew J. Blumberg (blumberg@math.utexas.edu) Statistical techniques in topological data analysis

Overview

Goal of talk

Give an overview of work on integrating statistical methodology with topological data analysis (persistent homology).

This integration is essential to using TDA for data analysis:

Provides ways to understand what the results of TDA mean.

Provides methodology for handling noise.

Provides approaches to taming computational burden of TDA.

Particularly salient when considering “big data” applications.

Recollections about TDA methodology

Pipeline

Finite metric space → filtered space → barcode

$(X, \partial_X) \mapsto \{Z_k\} \mapsto B$.

Question

What does the barcode mean?

Reflections on what those pictures tell us

Data set might not come from an object with obvious geometric structure.

Data might have meaningful features at varying scales.

Output of TDA might be more useful as a signature or descriptor than something to explicitly extract feature information from.

Even small amounts of maliciously placed corruption can change results substantially.

Refinement of TDA pipeline to integrate with statistics

Incorporate sampling: start with a metric measure space $(X, \partial_X, \mu_X)$. Example: Riemannian manifold.

Refined pipeline

mm-space → finite metric space → filtered complex → barcode

$(X, \partial_X, \mu_X) \mapsto (X', \partial'_X) \mapsto \{Z_k\} \mapsto B$

where $X' \subset X$.

This perspective leads to serious consideration of distributions of barcodes $\{B_k\}$, associated to blocks of samples $\{X'_k\}$.
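The refined pipeline can be sketched in a few lines. The following is only an illustration under stated assumptions, not the method of the talk: the point cloud (random points on a unit circle) is hypothetical, only degree-0 Vietoris-Rips persistence is computed (where a bar dies exactly when two components merge), and the names `h0_barcode` and `barcode_samples` are invented for this example.

```python
import math
import random


def h0_barcode(points):
    """Degree-0 Vietoris-Rips barcode of a finite point set in Euclidean space.

    Every point is born at scale 0. A component dies at the length of the
    edge that merges it into another component (Kruskal with union-find).
    Returns the sorted finite death times; one extra bar [0, inf) is implied.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


def barcode_samples(cloud, n, blocks, rng):
    """Draw `blocks` subsamples X'_k of size n and return their barcodes B_k."""
    return [h0_barcode(rng.sample(cloud, n)) for _ in range(blocks)]


rng = random.Random(0)
# Hypothetical data: 200 points on the unit circle stand in for the mm-space.
cloud = [
    (math.cos(angle), math.sin(angle))
    for angle in (rng.uniform(0, 2 * math.pi) for _ in range(200))
]
barcodes = barcode_samples(cloud, n=30, blocks=50, rng=rng)
```

Each entry of `barcodes` is one draw from the barcode distribution attached to samples of size 30. Higher-degree homology would need a full persistence computation, which this sketch does not attempt.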

Barcode space

Barcode space is a metric space:

Definition

Bottleneck distance between barcodes $B_1 = \{I_\alpha\}$ and $B_2 = \{J_\beta\}$ is

$$d_B(B_1, B_2) = \inf_{C} \sup_{(I,J) \in C} d_\infty(I, J),$$

where $C$ varies over all matchings between $B_1$ and $B_2$.

Roughly speaking, $d_B(B_1, B_2) \le \varepsilon$ means that there is a matching in which the discrepancy between matched bars is $< \varepsilon$ and all unmatched bars have length $< \varepsilon$.

(There are also a variety of other metrics.)
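For small barcodes the bottleneck distance can be computed directly from the definition. The sketch below is an illustration, not an optimized algorithm: bars are assumed to be finite `(birth, death)` pairs, $d_\infty$ between two bars is taken as the larger of the two endpoint differences, and an unmatched bar is charged half its length (its $\ell_\infty$ distance to the diagonal), which is one common convention. The function name is hypothetical.

```python
def bottleneck_distance(bars1, bars2):
    """Bottleneck distance between two finite barcodes of (birth, death) pairs."""
    m, n = len(bars1), len(bars2)
    size = m + n
    if size == 0:
        return 0.0

    # Rows: bars of bars1, then n diagonal slots.
    # Columns: bars of bars2, then m diagonal slots.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < m and j < n:
                (b1, d1), (b2, d2) = bars1[i], bars2[j]
                cost[i][j] = max(abs(b1 - b2), abs(d1 - d2))
            elif i < m:
                cost[i][j] = (bars1[i][1] - bars1[i][0]) / 2.0
            elif j < n:
                cost[i][j] = (bars2[j][1] - bars2[j][0]) / 2.0
            # diagonal matched to diagonal costs nothing

    def has_perfect_matching(eps):
        """Augmenting-path bipartite matching using only edges of cost <= eps."""
        row_of_col = [-1] * size

        def augment(i, seen):
            for j in range(size):
                if cost[i][j] <= eps and not seen[j]:
                    seen[j] = True
                    if row_of_col[j] == -1 or augment(row_of_col[j], seen):
                        row_of_col[j] = i
                        return True
            return False

        return all(augment(i, [False] * size) for i in range(size))

    # The answer is one of the entries of the cost matrix; binary search
    # for the smallest threshold that admits a perfect matching.
    candidates = sorted({c for row in cost for c in row})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_perfect_matching(candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]
```

For example, `bottleneck_distance([(0, 10), (2, 3)], [(0, 10)])` matches the two long bars at cost 0 and sends the short bar to the diagonal at cost 0.5.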

Properties of barcode space

What kind of metric space is barcode space?

Good news: Barcode space is amenable to probability theory. (Formally, a suitable subset is a complete metric space with a countable dense subset. [Mileyko, Mukherjee, Harer])

In particular, this means that the standard metrics on distributions that metrize weak convergence are available.

Bad news: Barcode space is positively curved.

In particular, this means that shortest paths between points are not unique, and so centroids are hard to define.

Summarizing distributions of barcodes

Question

How do you summarize a distribution of barcodes? (What are the moments?)

Two approaches:

1. Although centroids don’t make sense, one can define a Fréchet mean. Unfortunately, it is very hard to compute, sometimes has counterintuitive properties, and is not stable. [Turner, Mileyko, Mukherjee, Harer]

2. Produce some kind of distribution on $\mathbb{R}$ (or at least something nicer than barcode space) and then use traditional summary statistics for that.

Projections to distributions on more tractable spaces

A reasonable map from barcode space to either $\mathbb{R}$ or some kind of function space will allow us to turn distributions of barcodes into more tractable objects.

1. Many possible maps:

Length of the kth-longest bar.

Ratio of endpoints of the kth-longest bar.

Difference between the sizes of the kth and (k+1)th bars, or a vector consisting of these values for all k.

2. “Persistence landscapes”: the idea is to define $\lambda_k(t)$ to be the largest radius of an interval centered at t that is contained in at least k bars of the barcode. [Bubenik]

Taking a robust statistic (e.g., the median) of such a distribution immediately provides a robust statistic of the barcode distribution.
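Two of these projections are short enough to write out. This is a minimal sketch assuming bars are finite `(birth, death)` pairs; the function names and the small family of barcodes are hypothetical. The landscape uses the equivalent "tent function" form: the largest interval centered at t inside a bar `(b, d)` has radius `min(t - b, d - t)`, and $\lambda_k(t)$ is the kth-largest such radius.

```python
import statistics


def kth_longest_length(bars, k):
    """Length of the k-th longest bar (k = 1 is the longest), or 0 if absent."""
    lengths = sorted((death - birth for birth, death in bars), reverse=True)
    return lengths[k - 1] if k <= len(lengths) else 0.0


def landscape(bars, k, t):
    """Persistence landscape value lambda_k(t).

    For each bar, compute the radius of the largest interval centered at t
    that fits inside it (0 if t lies outside the bar); return the k-th
    largest of these radii.
    """
    radii = sorted(
        (max(0.0, min(t - birth, death - t)) for birth, death in bars),
        reverse=True,
    )
    return radii[k - 1] if k <= len(radii) else 0.0


# Hypothetical family of barcodes, e.g. one per subsample.
family = [
    [(0, 4), (1, 3)],
    [(0, 5)],
    [(0, 1), (2, 8)],
]

# Project each barcode to a real number, then take a robust summary.
longest_bars = [kth_longest_length(bars, 1) for bars in family]
median_longest = statistics.median(longest_bars)
```

Here the projected values are 4, 5 and 6, so the median longest bar is 5.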

Sampling at a fixed finite rate as an invariant

Claim

The distribution of barcodes produced by the collection of all samples of a fixed size n is a good basic invariant. [Blumberg, Gal, Mandell, Pancia]

1. Easy to approximate in practice: take the empirical approximation by repeated sampling, or use resampling if getting new samples is problematic.

2. Efficient: use samples that are as large as the computation will bear; very parallelizable.

3. Robust to noise: low-density noise doesn’t affect many samples, depending on the balance of sampling rate and total number of points.

4. (Related to the last point: stable to changes in the distribution, in a precise sense.)
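The robustness point can be made concrete with a small calculation. This is a simplified model that is not taken from the talk: assume the data set has `total` points, of which `noise` are outliers, and that each sample of size `n` is drawn uniformly without replacement. The chance that a sample avoids every outlier is then $\binom{\text{total} - \text{noise}}{n} / \binom{\text{total}}{n}$.

```python
from math import comb


def clean_sample_probability(total, noise, n):
    """Probability that a uniform size-n sample (without replacement)
    from `total` points contains none of the `noise` outlier points."""
    return comb(total - noise, n) / comb(total, n)


# Hypothetical numbers: 1000 points, 10 of them noise.
p_small = clean_sample_probability(1000, 10, 50)   # small samples
p_large = clean_sample_probability(1000, 10, 500)  # large samples
```

Under these assumptions most samples of size 50 contain no noise point at all, while almost every sample of size 500 contains at least one. That is the balance between sampling rate and total number of points.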

Efficiency

Computing persistence barcodes for large data sets naively is impossible (combinatorial explosion).

Approximation methods are an active area of research, but many depend on a priori knowledge of feature scale.

Perspective that the distribution is the invariant allows us to assemble information from many small samples.

Easy to compute in parallel.

Necessary for potential applications to “big data”.

Hypothesis testing and confidence intervals

Can try to perform statistical inference either directly in barcode space or on distributions in $\mathbb{R}$.

In barcode space:

1. Hypothesis testing is difficult:

Hard to specify null hypotheses.

Don’t have analytic formulas for asymptotics of null hypothesis distributions. (Despite excellent work of many people on homology for reasonable classes of null hypotheses. [Adler, Bobrowski, Kahle, Meckes, Weinberger])

2. Specify hypotheses in terms of sampling procedure and then use Monte Carlo sampling.

3. Confidence intervals are easier: can be obtained for “underlying” persistent homology. [Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, Singh]
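A Monte Carlo test of the kind described in item 2 might look as follows. Everything specific here is an assumption made for illustration: the null hypothesis (points drawn uniformly from the unit square), the test statistic (the longest finite degree-0 Rips bar, which equals the longest edge of a minimum spanning tree), the observed data (two tight clusters), and all function names.

```python
import math
import random


def longest_h0_bar(points):
    """Longest finite degree-0 Rips bar: the longest edge of a minimum
    spanning tree, found with Prim's algorithm."""
    n = len(points)
    best = [math.inf] * n  # distance from each point to the growing tree
    if n:
        best[0] = 0.0
    in_tree = [False] * n
    longest = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        longest = max(longest, best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return longest


def monte_carlo_p_value(observed, simulate_null, draws, rng):
    """Estimate P(statistic >= observed) under the null by simulation.

    The +1 terms count the observed value as one of the draws, so the
    estimate is never exactly zero.
    """
    exceed = sum(simulate_null(rng) >= observed for _ in range(draws))
    return (1 + exceed) / (1 + draws)


def null_statistic(rng, n=20):
    """Hypothetical null: n points drawn uniformly from the unit square."""
    return longest_h0_bar([(rng.random(), rng.random()) for _ in range(n)])


def cluster(rng, center, count, spread=0.02):
    """Hypothetical helper: `count` points jittered around `center`."""
    return [
        (center[0] + rng.uniform(-spread, spread),
         center[1] + rng.uniform(-spread, spread))
        for _ in range(count)
    ]


rng = random.Random(1)
# Hypothetical observation: two tight clusters near opposite corners.
observed_points = cluster(rng, (0.1, 0.1), 10) + cluster(rng, (0.9, 0.9), 10)
observed = longest_h0_bar(observed_points)
p_value = monte_carlo_p_value(observed, null_statistic, draws=199, rng=rng)
```

The hypothesis is specified entirely by the sampling procedure `null_statistic`, so no analytic null distribution is needed. The price is one persistence computation per simulated draw.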

Hypothesis testing and confidence intervals

As above, can first project to distributions on $\mathbb{R}$ or $\mathbb{R}^n$ (or the like):

1. Similar issues, although easier to specify hypotheses.

2. Confidence intervals for various statistics, in some cases analytic.

Resampling methods (i.e., the bootstrap) are very relevant (and asymptotically consistent).
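Once barcodes have been projected to real numbers, a percentile bootstrap gives a confidence interval for a summary statistic. The sketch below is generic and makes no claim about which bootstrap variant the cited work uses. The input values are synthetic stand-ins for, say, the longest bar length from each of 40 subsamples, and the function name is hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, stat, level, resamples, rng):
    """Percentile bootstrap confidence interval for stat(values).

    Resample the values with replacement, recompute the statistic each
    time, and read off the empirical quantiles of the results.
    """
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = estimates[int(alpha * resamples)]
    upper = estimates[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return lower, upper


rng = random.Random(0)
# Synthetic stand-in for one projected statistic per subsample.
longest_bars = [rng.gauss(1.0, 0.1) for _ in range(40)]

low, high = bootstrap_ci(
    longest_bars, statistics.median, level=0.95, resamples=1000, rng=rng
)
```

The same function works for any real-valued projection and any statistic passed as `stat`.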

Euler characteristic

Even though the field is very new, I’ve left out a lot of good stuff. One topic to mention in passing:

Observation

The Euler characteristic χ of a simplicial complex is both much more robust and much easier to compute than homology.

Theoretical foundations are more tractable here: Euler characteristics of Gaussian random fields [Adler-Taylor].

Weaker by itself, but as an ensemble: Euler characteristic transform for slices through 3D solids [Turner, Mukherjee].
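
To make the "easier to compute" point concrete: χ is the alternating sum of the number of simplices in each dimension, so it needs only counting, with no boundary matrices or linear algebra. A small sketch, assuming the complex is given by its maximal simplices as tuples of vertex labels (the function names are illustrative):

```python
from itertools import combinations


def closure(maximal_simplices):
    """All nonempty faces of the given simplices, as sorted vertex tuples."""
    simplices = set()
    for simplex in maximal_simplices:
        verts = tuple(sorted(set(simplex)))
        for k in range(1, len(verts) + 1):
            simplices.update(combinations(verts, k))
    return simplices


def euler_characteristic(maximal_simplices):
    """chi = sum over simplices of (-1)^dim, where dim = (#vertices - 1)."""
    return sum((-1) ** (len(s) - 1) for s in closure(maximal_simplices))


if __name__ == "__main__":
    hollow_triangle = [(0, 1), (1, 2), (0, 2)]  # a circle
    filled_triangle = [(0, 1, 2)]  # a disk
    tetrahedron_boundary = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # a sphere
    print(euler_characteristic(hollow_triangle))
    print(euler_characteristic(filled_triangle))
    print(euler_characteristic(tetrahedron_boundary))
```

By the Euler-Poincaré formula the same number equals the alternating sum of the Betti numbers, which is why χ carries less information than homology on its own.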

Questions for the future:

Outstanding issue

So far, limited experience using these tools on very large data sets.

(Except “Mapper”, which I’ll talk about this afternoon.)

1 What about guided sampling?

2 How can we best handle data sets that have “wild” geometric structure (or none at all)?

3 More and better “topological” summaries and hypothesis specification tools.
