Studying the Shape of Data Using Topology
Michael Lesnick, Institute for Mathematics and its Applications, USA
INFOTEC, June 17, 2014
Topological Data Analysis (TDA)
TDA is a branch of statistics.
Goal: Apply topology to develop tools for studying qualitative features of data.
Two Data Types
Data type 1: A finite set of points in R^n
[We call such data point cloud data.]
Data type 2: A function f : X → R, X any space.
(We also study functions f : X → R^m, m > 1.)
Informally, qualitative features = “coarse-scale, global geometric features.”
Examples of qualitative features of point cloud data (PCD) in 2-D:
Clusters
Cycles
Tendrils/Flares
“Graph Structure”
Qualitative Features of Functions
Modes
“Craters”
In TDA, we seek to develop:
• Formal definitions of such features
• Computational tools for detecting and visualizing such features
• (When data is random) methodology for quantifying the statistical significance of such features.
We focus on tools suitable for high-dimensional PCD.
Why Study Qualitative Features of Data?
Key Premise: Insight into the shape of scientific data has a good chance of giving insight into the science itself.
An example:
• Statistics of natural images (persistent homology)
Statistics of Natural Images
Carlsson et al. studied a set of 5000 3×3-pixel patches sampled from natural images.
• After normalization of intensity and contrast, each patch lies on a 7-D sphere.
• Discovery: The densest regions of the data set concentrate around a Klein bottle.
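Why a 7-D sphere? A 3×3 patch is a vector of 9 pixel values. Subtracting the mean intensity confines it to an 8-dimensional subspace, and rescaling to unit contrast puts it on the unit sphere of that subspace, which is 7-dimensional. The sketch below illustrates this under a simplifying assumption: contrast is measured by the plain Euclidean norm, which may differ from the contrast measure used in the study. The function name `normalize_patch` is hypothetical.

```python
import numpy as np

def normalize_patch(patch):
    """Map a 3x3 patch to a point on a unit 7-sphere (simplified sketch)."""
    v = np.asarray(patch, dtype=float).reshape(9)
    v = v - v.mean()              # intensity normalization: entries now sum to 0
    norm = np.linalg.norm(v)      # ASSUMPTION: Euclidean norm as the contrast measure
    if norm == 0:
        raise ValueError("a constant patch has no contrast to normalize")
    return v / norm               # contrast normalization: unit length

# A patch with a horizontal intensity gradient.
patch = np.array([[10, 20, 30],
                  [10, 20, 30],
                  [10, 20, 30]])
p = normalize_patch(patch)
print(p.sum(), np.linalg.norm(p))
```

Constant patches carry no contrast information and cannot be normalized this way, so the sketch rejects them.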
Klein Bottle in Space of 3×3 Patches
[Source: Carlsson, Perea 2014]
Application: Texture classification [Perea, Carlsson 2013].
Other Applications of TDA
• biophysics of proteins
• genomics + evolutionary biology
• astronomy
• coverage detection in wireless sensor networks
• shape segmentation
• shape comparison/shape matching
• basketball analytics
Introduction to Algebraic Topology
What is Algebraic Topology?
Informally, the branch of math concerned with properties of geometric objects that are invariant under “continuous deformations.”
Continuous deformations:
• bending
• twisting
• stretching
• (but not tearing)
Classic Example
Algebraic Topology + Holes
Primary example of a property invariant under continuous deformations: Presence of holes.
Algebraic topology is largely concerned with:
1. formalizing the notion of a “hole” in a geometric object,
2. calculating numbers of holes of different types,
3. understanding mathematical implications of the presence of holes.
Types of holes
In algebraic topology, we define i-dimensional holes for each i ≥ 0.
0-D holes are connected components
The pair of ovals has two 0-D holes.
1-D holes in 3-D objects are “holes you can see through.”
The donut has one 1-D hole.
2-D holes in 3-D objects are hollow spaces.
A balloon has one 2-D hole.
Counting Holes: Betti numbers
For a geometric object X, we define Bi(X), the ith Betti number of X, to be the number of i-dimensional holes in X.
Examples
B0(X) = 2; B1(X) = 0; B2(X) = 0.
Examples
B0(X) = 1; B1(X) = 2; B2(X) = 0.
Computing Betti Numbers
For discretely represented geometric objects, Bi(X) is easily computable via linear algebra.
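To make the linear algebra concrete, here is a minimal sketch. It assumes the discrete representation is a simplicial complex (vertices, edges, triangles, ...) and that holes are counted with rational coefficients. Under those assumptions, Bi equals (number of i-simplices) − rank(∂i) − rank(∂i+1), where ∂i is the boundary matrix from i-simplices to (i−1)-simplices. The helper names are hypothetical.

```python
from itertools import combinations
import numpy as np

def boundary_matrix(k_simplices, km1_simplices):
    """Matrix of the boundary map from k-simplices to (k-1)-simplices."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    D = np.zeros((len(km1_simplices), len(k_simplices)))
    for j, s in enumerate(k_simplices):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]          # drop the i-th vertex
            D[index[face], j] = (-1) ** i     # alternating sign
    return D

def rank(D):
    # Guard against empty matrices before calling NumPy.
    return 0 if D.size == 0 else int(np.linalg.matrix_rank(D))

def betti_numbers(simplices, max_dim=2):
    """Betti numbers B0..B_max_dim of a simplicial complex.

    `simplices` must list every simplex, including all faces,
    as a sorted tuple of vertex labels.
    """
    by_dim = {d: sorted(s for s in simplices if len(s) == d + 1)
              for d in range(max_dim + 2)}
    ranks = {0: 0}                            # the boundary of a vertex is zero
    for d in range(1, max_dim + 2):
        ranks[d] = rank(boundary_matrix(by_dim[d], by_dim[d - 1]))
    return [len(by_dim[d]) - ranks[d] - ranks[d + 1]
            for d in range(max_dim + 1)]

# Two isolated points: two connected components, nothing else.
two_points = [(0,), (1,)]
# Hollow triangle: a single loop.
hollow_triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
# Hollow tetrahedron (a triangulated balloon): all faces of dimension <= 2.
hollow_tetrahedron = [c for k in (1, 2, 3) for c in combinations(range(4), k)]

for name, cx in [("two points", two_points),
                 ("hollow triangle", hollow_triangle),
                 ("hollow tetrahedron", hollow_tetrahedron)]:
    print(name, betti_numbers(cx))
```

Floating-point rank computation is adequate for small examples like these; larger complexes are usually handled with exact arithmetic over a finite field.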
Persistent Homology
Topology of PCD?
How can we use the hole-detection formalism of topology to develop robust computational methods for studying qualitative features of data?
One approach: Persistent Homology.
• Introduced in 2000
• Widely studied and applied
Persistent Homology
Produces simple descriptors of qualitative features of data called barcodes.
A barcode is a set of closed intervals in R.
Model Example
[Figure: the point cloud X]
How can we detect the cycle in X?
Naive Idea
Choose r > 0. Let U(X, r) be the union of balls of radius r centered at the points of X.
Idea: Consider B1(U(X, r)) for some choice of r.
Example
[Figure: X and U(X, r)]
B0(U(X, r)) = 1; B1(U(X, r)) = 1; B2(U(X, r)) = 0.
When X is nice enough, for a good choice of r, B1(U(X, r)) detects the cycle in X.
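The naive idea can be tried on a toy point cloud. The sketch below samples 12 evenly spaced points on the unit circle and counts holes at three radii. One loud assumption: instead of the union of balls U(X, r) itself, it uses the Vietoris-Rips complex at scale 2r (an edge whenever two balls of radius r meet, a triangle whenever all three of its edges are present). This is a common computational stand-in, not the same object in general. The function name `rips_betti` is hypothetical.

```python
from itertools import combinations
import numpy as np

def matrix_rank(m):
    # Guard against empty matrices before calling NumPy.
    return 0 if m.size == 0 else int(np.linalg.matrix_rank(m))

def rips_betti(points, r):
    """(B0, B1) of the Vietoris-Rips complex at scale 2r.

    ASSUMPTION: the Rips complex stands in for the union of balls U(X, r).
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Two balls of radius r intersect exactly when their centers are within 2r.
    edges = [e for e in combinations(range(n), 2) if dist[e] <= 2 * r]
    edge_set = set(edges)
    triangles = [t for t in combinations(range(n), 3)
                 if all(e in edge_set for e in combinations(t, 2))]

    # Boundary matrix: edges -> vertices.
    d1 = np.zeros((n, len(edges)))
    for j, (a, b) in enumerate(edges):
        d1[a, j], d1[b, j] = -1, 1
    # Boundary matrix: triangles -> edges.
    edge_index = {e: i for i, e in enumerate(edges)}
    d2 = np.zeros((len(edges), len(triangles)))
    for j, (a, b, c) in enumerate(triangles):
        d2[edge_index[(b, c)], j] = 1
        d2[edge_index[(a, c)], j] = -1
        d2[edge_index[(a, b)], j] = 1

    r1, r2 = matrix_rank(d1), matrix_rank(d2)
    return n - r1, len(edges) - r1 - r2

# 12 evenly spaced points on the unit circle.
angles = 2 * np.pi * np.arange(12) / 12
X = np.column_stack([np.cos(angles), np.sin(angles)])

for r in (0.2, 0.3, 1.1):
    print(r, rips_betti(X, r))
```

Only the middle radius sees the cycle: at the smallest radius the points are still disconnected, and at the largest the cycle has already been filled in.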
Problems with this Descriptor
1. No clear way to choose r.
2. Invariant is unstable with respect to perturbation of data or small changes in r.
3. Doesn’t distinguish small holes from big ones.
4. Invariant is very sensitive to outliers.
Example: No Good Choice of r
Example: Sensitivity to Outliers
B1(U(X, r)) = 7.
Problems with this Descriptor
1. No canonical choice of r.
2. Invariant is unstable with respect to perturbation of data or small changes in r.
3. Doesn’t distinguish small holes from big ones.
4. Invariant is very sensitive to outliers.
Let’s deal with problems 1-3 first.
A Solution
Consider not a single choice of radius r, but all choices of r at once.
This gives us a filtration, that is, a 1-parameter family of geometric objects:
F(X) = {U(X, r)}_{r ∈ [0, ∞)}
Example
Key Mathematical Observation
Not only can we count holes in each space in a filtration, we can track holes in a consistent way across the whole filtration at once.
The formalization of this idea is persistent homology.
Barcodes
For each i ≥ 0, we can define a barcode Bi(X), a set of closed intervals in R.
Each interval represents an i-D cycle in the filtration.
It also records the radii at which that cycle forms and closes up.
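For i = 0 the barcode can be computed by hand, which makes a useful first example. Every point starts its own connected component at r = 0, and two components of U(X, r) merge at the first radius where a ball from one touches a ball from the other, that is, at half the distance between the two centers. The sketch below tracks these merges with a union-find structure. The function name `h0_barcode` is hypothetical, and intervals are written as (birth, death) pairs.

```python
from itertools import combinations
import math

def h0_barcode(points):
    """0-dimensional barcode of the union-of-balls filtration U(X, r)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the representative of i's component (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two balls of radius r meet exactly when r >= (distance between centers) / 2.
    merges = sorted((math.dist(points[i], points[j]) / 2, i, j)
                    for i, j in combinations(range(n), 2))
    bars = []
    for r, i, j in merges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # two distinct components meet at radius r
            parent[ri] = rj
            bars.append((0.0, r))         # one of them stops being a separate component
    bars.append((0.0, math.inf))          # the last component never disappears
    return bars

# Two well-separated pairs of points on a line.
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = h0_barcode(pts)
print(bars)
```

In this example the two short intervals reflect the small gaps inside each pair, while the one long finite interval reflects the large gap between the two clusters.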
Properties of a Barcode
• Allows us to distinguish significant features from insignificant features.
• Records the size/scale of the feature.
• Is stable w.r.t. perturbations of the data.
• Is computable in practice (using a variant of Gaussian elimination).
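The last point can be illustrated with the standard column-reduction scheme for persistence, sketched here under stated assumptions: the filtration is given as a list of simplices ordered by appearance time, and coefficients are taken mod 2, so adding columns is a symmetric difference of sets. Columns are reduced left to right, as in Gaussian elimination, until each nonzero column has a distinct lowest entry. Each such (lowest row, column) pair is one bar. The function name `persistence_pairs` and the input format are hypothetical.

```python
import math

def persistence_pairs(filtration):
    """Barcode of a filtered simplicial complex via column reduction over Z/2.

    `filtration` is a list of (time, simplex) pairs in order of appearance.
    Each simplex is a sorted tuple of vertices, and every face must appear
    before the simplex itself. Returns (dimension, birth, death) triples.
    """
    index = {s: i for i, (_, s) in enumerate(filtration)}
    # Column j holds the indices of the codimension-1 faces of simplex j.
    columns = []
    for _, s in filtration:
        if len(s) > 1:
            columns.append({index[s[:k] + s[k + 1:]] for k in range(len(s))})
        else:
            columns.append(set())             # vertices have empty boundary

    pivot_owner = {}                          # lowest row index -> owning column
    bars = []
    for j, col in enumerate(columns):
        # Add earlier columns (mod 2) until the lowest entry is new or the column is zero.
        while col and max(col) in pivot_owner:
            col ^= columns[pivot_owner[max(col)]]
        if col:
            i = max(col)
            pivot_owner[i] = j                # simplex i creates a class, simplex j kills it
            bars.append((len(filtration[i][1]) - 1,
                         filtration[i][0], filtration[j][0]))
    killed = set(pivot_owner)
    killers = set(pivot_owner.values())
    for i, (t, s) in enumerate(filtration):
        if i not in killed and i not in killers:
            bars.append((len(s) - 1, t, math.inf))   # class that never dies
    return bars

# A triangle assembled step by step: vertices, then edges, then the filling.
triangle = [
    (0, (0,)), (0, (1,)), (0, (2,)),
    (1, (0, 1)), (2, (1, 2)), (3, (0, 2)),
    (4, (0, 1, 2)),
]
bars = persistence_pairs(triangle)
print(sorted(bars))
```

In the example, the loop is created when the third edge arrives at time 3 and closes up when the triangle is filled at time 4, which gives the single 1-dimensional bar.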
Stability
Once we have barcodes, we can do further processing to find geometric representations of the significant holes.
This framework for building descriptors of data via barcodes is very flexible.
Example: We can build filtrations from point cloud data whose barcodes detect flares or clusters.
Can also be adapted to detect qualitative features of functions.
Advertisement
Do you have data that might have interesting shape?
Come talk to us!
Thanks!