TRANSCRIPT
Uses of Information Theory in Medical Imaging
Norbert Schuff, Ph.D., Center for Imaging of Neurodegenerative Diseases
Measurement & Information
Objective
Learn how to quantify information
Topics
I. Measurement of information
II. Image registration using information theory
III. Imaging feature selection using information theory
IV. Image classification based on information theoretic measures
Information and Uncertainty
Random generator: which character comes next?
Decrease in uncertainty:
$\log(1) = 0$
$\log(1/3) = -\log(3)$
$\log(1/2) = -\log(2)$
Both combined: $-\log(3) - \log(2) = -\log(6)$
Note: here we assume each symbol appears with equal probability.
More On Uncertainty
Random generator of M symbols: which character comes next? Some symbols may appear more frequently than others.
Shannon entropy:
$$H_{Shannon} = -\sum_{i=1}^{M} p_i \log_2 p_i, \qquad \sum_{i=1}^{M} p_i = 1$$
Example: Entropy Of mRNA
…ACGTAACACCGCACCTG…
Assume “A” occurs in 50% of all instances, “C” = 25%, “G” and “T” each = 12.5%:
$$p_A = \frac{1}{2};\quad p_C = \frac{1}{4};\quad p_G = \frac{1}{8};\quad p_T = \frac{1}{8}$$
Log-odds $-\log_2(p_i)$: $p_A\!: 1;\; p_C\!: 2;\; p_G\!: 3;\; p_T\!: 3$
$$H = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75 \text{ bits}$$
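As a concrete illustration, a minimal Python sketch (NumPy assumed) of the Shannon entropy computation; the distribution below reproduces the 1.75-bit result of the mRNA example:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete probability distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

# mRNA example: p_A = 1/2, p_C = 1/4, p_G = p_T = 1/8
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))   # -> 1.75
```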
Concept Of Entropy
• Shannon entropy was formulated by Claude Shannon (1916-2001)
– American mathematician
– Founded information theory with one landmark paper in 1948
– Worked with Alan Turing on cryptography during WWII
• History of entropy
– 1854: Rudolf Clausius, German physicist, developed the theory of heat
– 1876: Willard Gibbs, American physicist, used it for the theory of energy
– 1877: Ludwig Boltzmann, Austrian physicist, formulated the theory of thermodynamics
– 1879: Gibbs re-formulated entropy in terms of statistical mechanics
– 1948: Shannon
Three Interpretations of Entropy
• The uncertainty in the outcome of an event
– Systems with frequent events have less entropy than systems with infrequent events.
• The amount of surprise (information)
– An infrequently occurring event is a greater surprise, i.e. carries more information and thus has higher entropy, than a frequently occurring event.
• The dispersion in the probability distribution
– A uniform image has a less disperse histogram and thus lower entropy than a heterogeneous image.
Generalized Entropy
• The following generating function can be used as an abstract definition of entropy:
$$H(P) = h\!\left(\frac{\sum_{i=1}^{M} v_i\,\varphi_1(p_i)}{\sum_{i=1}^{M} v_i\,\varphi_2(p_i)}\right)$$
• Various definitions of these parameters provide different definitions of entropy.
– Over 20 definitions of entropy have been found.
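For instance, the Rényi entropy is one well-known member of this family; a brief sketch (the distribution is illustrative) showing that it recovers the Shannon entropy as α → 1:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):          # the alpha -> 1 limit is Shannon entropy
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
for alpha in (0.5, 1.0, 2.0):
    print(alpha, renyi_entropy(p, alpha))
```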
Various Formulations Of Entropy
Names: Shannon, Renyi, Aczel, Varma, Kapur, Hadra, Arimoto, Sharma

Various Formulations Of Entropy II
Names: Taneja, Sharma, Ferreri, Sant'Anna, Picard
Entropy In Medical Imaging
[Figure: entropy maps and intensity histograms under averaging. N=1: H = 6.0 bits; N=40: H = 4.0 bits; histogram of brain only: H = 3.9 bits; co-variance map: H = 10.1 bits]
Entropy Of Noisy Images
[Figure: entropy maps at decreasing signal-to-noise levels: SNR = 20, 2, 0.2]
Image Registration Using Information Theory
Objective
Learn how to use information theory for image registration and similar procedures
Image Registration
• Define a transform T that maps one image onto another image such that some measure of overlap is maximized.
– Discuss information theory as a means for generating measures to be maximized over sets of transforms.
[Figure: MRI and CT images before and after registration]
Entropy In Image Registration
• Define an estimate of the joint probability distribution of images A and B:
– a 2-D histogram where each axis spans the possible intensity values of the corresponding image
– each histogram cell is incremented each time a pair (I_1(x,y), I_2(x,y)) occurs in the pair of images (“co-occurrence”)
– if the images are perfectly aligned, the joint 2-D histogram is highly focused; as the images become mis-aligned, the dispersion grows
– recall that one interpretation of entropy is as a measure of histogram dispersion
Joint Entropy
• Joint entropy (entropy of the 2-D histogram):
$$H(X,Y) = -\sum_{x_i \in X,\; y_i \in Y} p(x_i, y_i)\,\log_2\!\left[p(x_i, y_i)\right]$$
• H(X,Y) = H(X) + H(Y) only if X and Y are completely independent.
• Image registration can be guided by minimizing the joint entropy H(X,Y), i.e. the dispersion in the joint histogram of the images is minimized.
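A minimal sketch of this estimate, assuming two equally sized NumPy image arrays and an arbitrary choice of 32 histogram bins:

```python
import numpy as np

def joint_entropy(img1, img2, bins=32):
    """Joint entropy H(X,Y) in bits, estimated from the 2-D histogram
    of co-occurring intensities in two equally sized images."""
    hist, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    p = hist / hist.sum()               # joint probability estimate
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
```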
Example
[Figure: joint entropy of the 2-D histogram for rotation of an image with respect to itself by 0, 2, 5, and 10 degrees, and for the entire rotation series]
Mutual Information
$$MI(X,Y) = \sum_{x_i \in X}\sum_{y_i \in Y} p(x_i, y_i)\,\log_2\!\left(\frac{p(x_i, y_i)}{p(x_i)\,p(y_i)}\right)$$
• Measures the dependence of the two distributions.
• MI(X,Y) is maximized when the images are co-registered.
• In feature selection, choose the features that minimize MI(X,Y) to ensure they are not related.
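A sketch of MI(X,Y) estimated from the same joint 2-D histogram (the bin count is again an arbitrary choice); equivalently, one could compute H(X) + H(Y) − H(X,Y) from the entropy helpers above:

```python
import numpy as np

def mutual_information(img1, img2, bins=32):
    """MI(X,Y) in bits from the joint 2-D intensity histogram of two images."""
    hist, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    pxy = hist / hist.sum()                    # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)        # marginal of image 1
    py = pxy.sum(axis=0, keepdims=True)        # marginal of image 2
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))
```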
Use Of Mutual Information for Image Registration
• Alternative definition(s) of mutual information MI:
MI(X,Y) = H(Y) − H(Y|X) = H(X) − H(X|Y)
– the amount by which the uncertainty of X is reduced if Y is known, or vice versa.
MI(X,Y) = H(X) + H(Y) − H(X,Y)
– maximizing MI(X,Y) is equivalent to minimizing the joint entropy H(X,Y).
• The advantage of using mutual information over joint entropy is that MI includes the entropy of each distribution separately.
• Mutual information works better than joint entropy alone in regions with low contrast: there the joint entropy is high, but this is offset by high individual entropies as well, so the overall mutual information will be low.
• Mutual information is maximized for registered images.
Properties of Mutual Information
• MI is symmetric: MI(X,Y) = MI(Y,X)
• MI(X,X) = H(X)
• MI(X,Y) ≤ H(X)
– The information each image contains about the other image cannot be greater than the total information in each image.
• MI(X,Y) ≥ 0
– Knowing Y cannot increase the uncertainty in X.
• MI(X,Y) = 0 only if X and Y are independent.
Processing Flow for Image Registration Using Mutual Information
Input images → pre-processing → probability density estimation → mutual information → optimization scheme → image transformation → output image
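A toy sketch of this flow for a single rotation parameter, assuming SciPy is available and reusing the mutual_information helper sketched above; a real pipeline would replace the brute-force loop with a proper optimization scheme:

```python
import numpy as np
from scipy.ndimage import rotate

def register_rotation(fixed, moving, angles=np.arange(-10.0, 10.5, 0.5)):
    """Pick the rotation of `moving` that maximizes MI with `fixed`."""
    best_angle, best_mi = 0.0, -np.inf
    for angle in angles:
        candidate = rotate(moving, angle, reshape=False, order=1)
        mi = mutual_information(fixed, candidate)   # helper sketched above
        if mi > best_mi:
            best_angle, best_mi = angle, mi
    return best_angle, best_mi
```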
Relative Entropy
• Assume we know X; we then seek the remaining entropy of Y:
$$H(Y|X) = -\sum_{x_i \in X} p(x_i) \sum_{y_i \in Y} p(y_i | x_i)\,\log_2\!\left[p(y_i | x_i)\right]$$
$$= -\sum_{x_i \in X,\; y_i \in Y} p(x_i, y_i)\,\log_2\!\left[\frac{p(x_i, y_i)}{p(x_i)}\right]$$
Also known as Kullback-Leibler divergence; an index of information gain.
Measurement Of Similarity Between Distributions
• Question: how close (in bits) is a distribution X to a model distribution Ω?
Relative entropy (Kullback-Leibler divergence):
$$D_{KL}(X \,\|\, \Omega) = \sum_{x_i \in \Omega} p(x_i)\,\log_2\!\left(\frac{p(x_i)}{q(x_i)}\right)$$
• D_KL = 0 only if X and Ω are identical; otherwise D_KL > 0.
• D_KL(X||Ω) is NOT equal to D_KL(Ω||X).
• Symmetrical form:
$$D_{sKL}(X \,\|\, \Omega) = \frac{1}{2}\left(D_{KL}(X \,\|\, \Omega) + D_{KL}(\Omega \,\|\, X)\right)$$
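A direct sketch of both quantities for two discrete distributions; it assumes q > 0 wherever p > 0 so that the logarithm is defined:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def symmetric_kl(p, q):
    """Symmetrized KL: 0.5 * (D_KL(p||q) + D_KL(q||p))."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```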
Uses Of The Symmetrical KL
How much more noisy is image A compared to image B?
[Figure: entropy maps and histograms at increasing noise levels; sKL distance of B from A]
Uses Of The Symmetrical KL
Detect an outlier in an image series.
[Figure: image series #1 through #13 with sKL plotted against series number]
Image Feature Selection Using Information Theory
Objective
Learn how to use information theory for determining imaging features
Mutual Information based Feature Selection Method
• We test the ability to separate two classes (features).
• Let X be the feature vector, e.g. co-occurrences of intensities.
• Y is the classification, e.g. particular intensities.
• How do we maximize the separability of the classification?
Mutual Information based Feature Selection Method
• MI tests a feature’s ability to separate two classes.
– Based on definition 3) of mutual information:
$$MI(X,Y) = \sum_{x_i \in X}\sum_{y_i \in Y} p(x_i, y_i)\,\log_2\!\left(\frac{p(x_i, y_i)}{p(x_i)\,p(y_i)}\right)$$
– Here X is the feature vector and Y is the classification.
• Note that X is continuous while Y is discrete.
– By maximizing MI(X,Y), we maximize the separability of the feature.
• Note this method only tests each feature individually.
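A sketch of scoring a single continuous feature against discrete class labels; the bin count used to discretize the feature is an arbitrary choice:

```python
import numpy as np

def mi_feature_score(x, y, bins=16):
    """MI in bits between a continuous feature x and discrete labels y,
    after discretizing x into histogram bins."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    mi = 0.0
    for xv in np.unique(x_binned):
        for yv in np.unique(y):
            pxy = np.mean((x_binned == xv) & (y == yv))
            px, py = np.mean(x_binned == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# Rank the features of a matrix X (samples x features) by separability:
# scores = [mi_feature_score(X[:, j], y) for j in range(X.shape[1])]
```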
Joint Mutual Information based Feature Selection Method
• Joint MI tests a feature’s independence from all other features.
• Two implementations (see the sketch below):
– 1) Compute all individual MIs and sort from high to low.
– 2) Test the joint MI of the current feature while keeping all others.
• Keep the features with the lowest JMI (implies independence).
• Implement by selecting features that maximize:
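One plausible reading of these two steps, sketched below with mi_feature_score from the previous sketch: rank features by individual MI with the labels, then admit a feature only if its MI with the already selected features is low (pairwise MI standing in for the joint-MI independence test; the threshold is illustrative):

```python
import numpy as np

def select_features(X, y, k=5, bins=16, redundancy_max=0.5):
    """Greedy sketch: sort features by MI with the labels (high to low),
    then keep a feature only if it shares little MI with those already kept."""
    scores = [mi_feature_score(X[:, j], y, bins) for j in range(X.shape[1])]
    selected = []
    for j in np.argsort(scores)[::-1]:
        redundancy = [
            mi_feature_score(
                X[:, j],
                np.digitize(X[:, s], np.histogram_bin_edges(X[:, s], bins=bins)),
                bins)
            for s in selected
        ]
        if not redundancy or max(redundancy) < redundancy_max:
            selected.append(j)
        if len(selected) == k:
            break
    return selected
```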
Information Of Time Series
Objective
Learn how to use information theory for quantifying functional brain signals
How Much Information Do fMRI Fluctuations Carry?
[Figure: fMRI time series a(t), b(t), c(t) from brain regions A, B, and C]
Conventional analysis relies on correlations, e.g.:
$$corr(A,B) = \frac{1}{M-1}\sum_{i=1}^{M}\frac{\left(a_i - \langle a \rangle\right)\cdot\left(b_i - \langle b \rangle\right)}{std(a)\cdot std(b)}$$
Correlation does not measure information, i.e. uncertainty in bits.
Quantification of Uncertainty in fMRI using Entropy
[Figure: a sliding block of length L over the fMRI time series A, B, C; probability distribution Pr(s^L) over block patterns; H(L) plotted against L in bits]
Block entropy H(L): uncertainty in identifying fluctuation patterns.
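A sketch of the block entropy H(L) for a single time series, assuming the signal is first symbolized into a small alphabet by quantile binning (the number of levels is an arbitrary choice):

```python
import numpy as np

def block_entropy(signal, L, n_levels=4):
    """Block entropy H(L) in bits: symbolize the signal, slide a window
    of length L, and compute the entropy of the observed block patterns."""
    edges = np.quantile(signal, np.linspace(0, 1, n_levels + 1)[1:-1])
    symbols = np.digitize(signal, edges)           # symbolized series
    blocks = np.array([symbols[i:i + L] for i in range(len(symbols) - L + 1)])
    _, counts = np.unique(blocks, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# H(L) for increasing block lengths, e.g. for a noisy sinusoid:
# t = np.linspace(0, 20, 500); sig = np.sin(t) + 0.3 * np.random.randn(500)
# print([block_entropy(sig, L) for L in range(1, 6)])
```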
Measuring Difficulty in Overcoming Uncertainty
[Figure: a sliding block of length L over the fMRI time series A, B, C; distribution Pr(s^L); H(L) plotted against L with EE and TI indicated]
• Block entropy H(L): uncertainty in identifying fluctuation patterns (e.g. patterns of block size L).
• Excess entropy EE: intrinsic pattern memory.
• Transient information TI, in bits of information:
$$TI = \sum_{L}\left[EE - H(L)\right]$$
Transient Information in fMRI
Image Classification Based On Information Theoretic Measures
Objective
Learn how to use information theory for classification of complex image patterns
Complexity In Medical Imaging
• In imaging, local correlations may introduce uncertainty in predicting intensity patterns.
• Complexity describes the inherent difficulty in reducing uncertainty in predicting patterns.
• We need a metric to quantify complexity, i.e. to reduce uncertainty in the distribution of observed patterns.
Example: Patterns Of Cortical Thinning
[Figure: brain MRI and histological cut showing the cortical ribbon]

Patterns Of Cortical Thinning In Dementia
[Figure: cortical thinning maps in Alzheimer’s disease and frontotemporal dementia; warmer colors represent more thinning]
Information Theoretic Measures of Image Complexity
Parse a window of size L over an image segment and apply information theoretic quantification.
Three states: $Z_i$
Entropy:
$$H = -\sum_{Z,\, Z'} P(Z, Z')\,\log_2 P(Z, Z')$$
Statistical complexity:
$$SC = -\sum_{Z}\left(\sum_{Z'} P(Z' \mid Z)\,\log_2 P(Z' \mid Z)\right)$$
Excess entropy: $EE$, measured from the convergence of the block entropy $H(L)$ over window sizes $L$.
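A sketch of the first of these measures, assuming a 2-D NumPy image, three quantile-derived states, and an LxL window parsed over the image:

```python
import numpy as np

def pattern_entropy(image, L=3, n_states=3):
    """Entropy in bits of the distribution of LxL window patterns,
    after discretizing the image into n_states intensity levels."""
    edges = np.quantile(image, np.linspace(0, 1, n_states + 1)[1:-1])
    states = np.digitize(image, edges)
    rows, cols = states.shape
    patterns = np.array([states[r:r + L, c:c + L].ravel()
                         for r in range(rows - L + 1)
                         for c in range(cols - L + 1)])
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```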
Summary Complexity Measures
• Entropy (H) – measures the number and uniformity of distributions over observed patterns (joint entropy). For example, a higher H represents increasing spatial correlations in image regions.
• Statistical complexity (SC) – quantifies the information, i.e. uncertainty, contained in the distribution of observed patterns. For example, a higher SC represents an increase in locally correlated patterns.
• Excess entropy (EE) – measures the convergence rate of entropy. A higher EE represents an increase in long-range correlations across regions.
Complexity Based Detection Of Cortical Thinning (Simulations)
H, SC, and EE automatically identify cortical thinning and do a good job of separating the groups for which the cortex was thinned from the group for which there was no thinning.
Detection Of Cortical Thinning Pattern In Dementias Using Complexity Measures
[Figure: RGB complexity maps for Control, AD, and FTD groups]
RGB representation: H; SC; EE
• More saturated red means more spatial correlations.
• More saturated blue means more locally correlated patterns.
• More saturated green means more long-range spatial correlations.
• A simultaneous increase/decrease of H, EE, and SC results in brighter/darker levels of gray.
Summary
Metric – Description
Entropy – amount of uncertainty in observations
Conditional entropy – uncertainty in observations given the distribution of other observations
Joint entropy – uncertainty in observations across multiple distributions
Relative entropy, aka Kullback-Leibler divergence – similarity between distributions
Mutual information – the amount of overlap between distributions
Statistical complexity – uncertainty in linked conditional distributions
Excess entropy – convergence of entropy
Transient information – difficulty in reducing uncertainty
Literature
• Author: James P. Sethna. Publisher: Oxford: Oxford University Press, 2006.
• Author: H. Haken. Publisher: Berlin; New York: Springer, ©2006.