![Page 1: Introduction to Information Theory - Mickey Atwalatwallab.cshl.edu/teaching/Information_Theory.pdfModel of General Communication System Shannon’s Source Coding theorem There exists](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e85dc1a68b809176d2e0e18/html5/thumbnails/1.jpg)
Introduction to Information Theory
Gurinder Singh “Mickey” Atwal
Center for Quantitative Biology
Summary
• Shannon’s coding theorems
• Entropy
• Mutual Information
• Multi-information
• Kullback-Leibler Divergence
Role of Information Theory in Biology
i) Mathematical modeling of biological phenomena, e.g. optimization of early neural processing in the brain; bacterial population strategies
ii) Extraction of biological information from large data sets, e.g. gene expression analyses; GWAS (genome-wide association studies)
Mathematical Theory of Communication
• Claude Shannon (1948), Bell Syst. Tech. J., Vol. 27, 379-423, 623-656
• How to encode information?
• How to transmit messages reliably?
Model of General Communication System
Information source → (message) → Channel → Destination
Examples:
• Visual Image → Retina → Visual Cortex
• Morphogen Concentration → Gene Pathway → Differentiation Genes
• Computer File → Fiber Optic Cable → Another Computer
Model of General Communication System
Information source → (message) → Transmitter → (signal) → Channel, with noise → Receiver → (message) → Destination
• Transmitter: MESSAGE ENCODED
• Receiver: MESSAGE DECODED
Model of General Communication System
1) Shannon’s source coding theorem: there exists a fundamental lower bound on the size to which a message can be compressed without losing information
[Diagram: Information source → Transmitter → Channel (with noise) → Receiver → Destination]
Model of General Communication System
[Diagram: Information source → Transmitter → Channel (with noise) → Receiver → Destination]
2) Shannon’s channel coding theorem: information can be transmitted, with negligible error, at rates no faster than the channel capacity
Information Theory
• Information content of a message (random variable)?
• How much uncertainty is there in the outcome of an event?
[Figure: bar charts of nucleotide frequencies (A, T, G, C), vertical axis from 0 to 0.5]
• Homo sapiens: p(A)=p(T)=p(G)=p(C)=0.25, high information content
• Plasmodium falciparum: p(A)=p(T)=0.4, p(G)=p(C)=0.1, low information content
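Measure of Uncertainty H({p_i})
Suppose we have a set of N possible events with probabilities p_1, p_2, …, p_N
General requirements of H
• Continuous in p_i
• If all p_i are equal then H should be monotonically increasing with N
• H should be consistent: breaking a choice down into successive choices must not change the total H
[Figure: a three-way choice with probabilities 1/2, 1/3, 1/6 is equivalent (=) to a two-way choice with probabilities 1/2, 1/2 followed by a second two-way choice with probabilities 2/3, 1/3]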
Entropy as a measure of uncertainty
Unique answer provided by Shannon, for a random variable B with N elements b:
$$H[B] = -\sum_{b \in B} p(b) \log_2 p(b) \quad \text{(discrete states, logarithm in base 2)}$$
$$H[B] = -\int p(b) \log_2 p(b)\, db \quad \text{(continuous states)}$$
• Similar to Gibbs entropy in statistical mechanics
• Maximum when all probabilities are equal, p(b)=1/N:
$$H_{\max}[B] = \log_2 N \quad \text{(Boltzmann entropy)}$$
• Units are measured in bits (binary digits)
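The discrete definition and two of the stated properties can be checked numerically. The following is a minimal sketch, not part of the original slides; the helper name `entropy_bits` and the choice N = 4 are illustrative assumptions. It checks the maximum, log2 N, and the consistency requirement using the 1/2, 1/3, 1/6 example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum entropy: N equally likely states give log2(N) bits.
N = 4  # illustrative choice
h_uniform = entropy_bits([1 / N] * N)

# Consistency requirement: a direct three-way choice must carry the same
# uncertainty as a two-way choice followed, half of the time, by a second
# two-way choice with probabilities 2/3 and 1/3.
h_direct = entropy_bits([1 / 2, 1 / 3, 1 / 6])
h_staged = entropy_bits([1 / 2, 1 / 2]) + 0.5 * entropy_bits([2 / 3, 1 / 3])

print(h_uniform, math.log2(N))
print(h_direct, h_staged)
```

The second stage is weighted by 1/2 because it is only reached on one of the two first-stage branches.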
Interpretations of entropy H
• Average length of shortest code to transmit a message (Shannon’s source coding theorem)
• Captures variability of a variable without making any model assumptions
• Average number of yes/no questions needed to determine the outcome of a random event
[Figure: bar charts of nucleotide frequencies (A, T, G, C), vertical axis from 0 to 0.5]
• Homo sapiens: p(A)=p(T)=p(G)=p(C)=0.25, H = 2 bits
• Plasmodium falciparum: p(A)=p(T)=0.4, p(G)=p(C)=0.1, H ≈ 1.72 bits
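As a worked check, the entropy formula can be evaluated directly for the two stated nucleotide distributions. The arithmetic below uses only the probabilities given above:

```latex
\begin{aligned}
H_{\text{uniform}} &= -4 \times 0.25 \log_2 0.25 = 2 \text{ bits} \\
H_{\text{skewed}}  &= -2 \times 0.4 \log_2 0.4 \; - \; 2 \times 0.1 \log_2 0.1 \\
                   &\approx 1.058 + 0.664 \approx 1.72 \text{ bits}
\end{aligned}
```

For these stated frequencies the skewed distribution is therefore about 0.28 bits per nucleotide below the 2-bit maximum.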
Entropy as average length of shortest code
| Symbol | Probability of symbol, P(x) | Optimal code length, −log2 P(x) | Optimal code |
|---|---|---|---|
| A | 1/2 | 1 | 0 |
| C | 1/4 | 2 | 10 |
| T | 1/8 | 3 | 110 |
| G | 1/8 | 3 | 111 |
Note that the average length of the optimal code is equal to the entropy of the distribution:
$$\text{Avg length} = -\langle \log_2 P(x) \rangle_{P(x)} \equiv -\sum_x P(x) \log_2 P(x) \equiv H[x] = 1.75 \text{ bits}$$
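The table can be verified in a few lines. This is a minimal sketch using the probabilities and codewords from the table; the helper `is_prefix_free` is an illustrative addition that checks no codeword is a prefix of another, so a bit stream can be decoded unambiguously.

```python
import math

probs = {"A": 1 / 2, "C": 1 / 4, "T": 1 / 8, "G": 1 / 8}
code = {"A": "0", "C": "10", "T": "110", "G": "111"}

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)

# Average number of bits per symbol when symbols are drawn from probs.
avg_length = sum(p * len(code[s]) for s, p in probs.items())

# Entropy of the symbol distribution in bits.
entropy = -sum(p * math.log2(p) for p in probs.values())

print(avg_length, entropy, is_prefix_free(code.values()))
```

The two numbers coincide here because every probability is a power of 1/2, so each optimal length −log2 P(x) is a whole number of bits.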
Example: Binding sequence conservation
• Sequence conservation
$$R_{seq} = H_{\max} - H_{obs} = \log_2 N - \left( -\sum_{n=1}^{N} p_n \log_2 p_n \right)$$
• CAP (Catabolite Activator Protein) acts as a transcription promoter at more than 100 sites within the E. coli genome
• Sequence conservation reveals the CAP binding site
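One way to apply the conservation formula is column by column across aligned binding sites. The sketch below is illustrative only: the four short sequences are made-up toy data, not real CAP sites, the function name `column_conservation` is an assumption, and no small-sample correction is applied.

```python
import math
from collections import Counter

def column_conservation(column, alphabet="ACGT"):
    """R_seq = log2(N) - H_obs for one alignment column, in bits."""
    counts = Counter(column)
    total = len(column)
    h_obs = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(len(alphabet)) - h_obs

# Hypothetical toy alignment of four binding sites (not real CAP data).
sites = ["TGTGA", "TGTGA", "TGCGA", "TGAGT"]

# zip(*sites) iterates over alignment columns rather than sequences.
profile = [column_conservation(col) for col in zip(*sites)]
print([round(r, 3) for r in profile])
```

Fully conserved columns reach the maximum of 2 bits, while variable columns score lower. This per-position profile is the quantity plotted in a sequence logo.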
Two random variables?
• Joint entropy
$$H[X,Y] = -\sum_{x \in X,\, y \in Y} p(x,y) \log_2 p(x,y)$$
• If variables are independent, p(x,y)=p(x)p(y), then H[X,Y]=H[X]+H[Y]
• Difference measures total amount of correlation between two variables: Mutual Information, I(X;Y)
$$I[X;Y] = H[X] + H[Y] - H[X,Y] = \sum_{x \in X,\, y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$
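The two equivalent forms of the mutual information can be compared numerically. This is a minimal sketch; the function names and the three 2×2 joint tables are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mi_from_entropies(joint):
    """I[X;Y] = H[X] + H[Y] - H[X,Y] for a table joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    hxy = entropy_bits(p for row in joint for p in row)
    return entropy_bits(px) + entropy_bits(py) - hxy

def mi_from_ratio(joint):
    """I[X;Y] as the sum of p(x,y) log2 [p(x,y) / (p(x) p(y))]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x,y) = p(x) p(y)
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X
noisy = [[0.4, 0.1], [0.1, 0.4]]             # hypothetical partial correlation

for table in (independent, identical, noisy):
    print(mi_from_entropies(table), mi_from_ratio(table))
```

The independent table gives 0 bits, the identical table gives the full 1 bit, and the noisy table falls in between.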
Mutual Information, I(X;Y)
[Figure: Venn diagram of H[X] and H[Y]; the union is H[X,Y], the overlap is I[X;Y], and the non-overlapping parts are H[X|Y] and H[Y|X]]
$$I(X;Y) = H(X) - H(X|Y)$$
• I(X;Y) quantifies how much uncertainty of X is reduced if we know Y
• If X and Y are independent, then I(X;Y)=0
• Model independent
• Captures all non-linear correlations (c.f. Pearson’s correlation)
• Independent of measurement scale
• Units (bits) have physical meaning
Mutual information captures non-linear relationships
[Figure: two scatter plots of y against x, panels A and B]
• Panel A: R² = 0.487 ± 0.019, I = 0.72 ± 0.08, MIC = 0.48 ± 0.02
• Panel B: R² = 0.001 ± 0.002, I = 0.70 ± 0.09, MIC = 0.40 ± 0.02
Kinney and Atwal, PNAS 2014
Responsiveness to “complicated” relations
[Figure: two scatter plots of gene-B expression level against gene-A expression level]
• First panel: MI ~ 1 bit; Corr. ~ 0.9
• Second panel: MI ~ 1.3 bits; Corr. ~ 0
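A small discrete example shows how correlation can vanish while mutual information stays high. The sketch below is a made-up toy case, not the gene expression data in the figure: X is uniform on {−1, 0, 1} and Y = X², so Y is fully determined by X yet has zero covariance with it.

```python
import math
from collections import defaultdict

# Joint distribution p(x, y) for X uniform on {-1, 0, 1} and Y = X**2.
joint = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}

# Marginals p(x) and p(y).
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# Mutual information in bits.
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# Covariance, the numerator of Pearson's correlation.
mean_x = sum(x * p for x, p in px.items())
mean_y = sum(y * p for y, p in py.items())
cov = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

print(mi, cov)
```

Here the mutual information equals H[Y], about 0.92 bits, because knowing X removes all uncertainty about Y, while the covariance is zero by symmetry.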
Data processing inequality
• Suppose we have a sequence of processes, e.g. a signal transduction pathway (Markov process): A → B → C
• Physical statement: in any physical process the information about A gets continually degraded along the sequence of processes
• Mathematical statement:
$$I(A;C) \le I(A;B), \qquad I(A;C) \le I(B;C)$$
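The inequality can be illustrated with a toy Markov chain. In this sketch, A is a fair bit passed through two noisy binary channels in sequence; the flip probabilities 0.1 and 0.2 and all function names are assumptions chosen for illustration.

```python
import math
from itertools import product

def mutual_information(joint):
    """I in bits from a dict {(u, v): p(u, v)}."""
    pu, pv = {}, {}
    for (u, v), p in joint.items():
        pu[u] = pu.get(u, 0.0) + p
        pv[v] = pv.get(v, 0.0) + p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

def flip(bit_in, bit_out, eps):
    """Transition probability of a binary symmetric channel with error eps."""
    return eps if bit_in != bit_out else 1 - eps

eps_ab, eps_bc = 0.1, 0.2  # hypothetical noise levels

# Markov chain A -> B -> C: p(a, b, c) = p(a) p(b|a) p(c|b).
p_abc = {(a, b, c): 0.5 * flip(a, b, eps_ab) * flip(b, c, eps_bc)
         for a, b, c in product((0, 1), repeat=3)}

def marginal(keep):
    """Marginal over the two variable positions listed in keep."""
    out = {}
    for state, p in p_abc.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

i_ab = mutual_information(marginal((0, 1)))
i_bc = mutual_information(marginal((1, 2)))
i_ac = mutual_information(marginal((0, 2)))
print(i_ab, i_bc, i_ac)
```

With these noise levels I(A;C) comes out smaller than both I(A;B) and I(B;C), as the inequality requires.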
Multi-Entropy, H(x1x2…xn)
$$H[X_1 X_2 \ldots X_n] = -\sum_{x_1 x_2 \ldots x_n} p(x_1 x_2 \ldots x_n) \log_2 p(x_1 x_2 \ldots x_n)$$
Multi-Information, I(x1x2…xn)
• Measures total correlation in n variables
$$I[X_1 X_2 \ldots X_n] = \sum_{x_1 x_2 \ldots x_n} p(x_1 x_2 \ldots x_n) \log_2 \frac{p(x_1 x_2 \ldots x_n)}{p(x_1)\,p(x_2) \cdots p(x_n)}$$
Generalised correlation between more than two elements
• Multi-information is a natural extension of Shannon’s mutual information to an arbitrary number of random variables
• Provides a general measure of non-independence among multiple variables in a network
• Captures higher-order interactions, beyond simple pairwise interactions
$$I(\{X_1, X_2, \ldots, X_N\}) = \sum_{i=1}^{N} H(X_i) - H(\{X_1, X_2, \ldots, X_N\})$$
Capturing more than pairwise relations
[Figure: two plots of expression against experiment index]
• gene-A/gene-B expression: MI ~ 0 bits; Corr. ~ 0
• gene-A/gene-B/gene-C expression: Multi-information ~ 1.0 bits
Multi-allelic associations
Phenotype P is the XOR of allele A and allele B:
| A | B | P |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
I(A;B)=I(A;P)=I(B;P)=0
I(A;B;P)=1 bit
Multi-locus associations can be completely masked by single-locus studies!
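The XOR values can be reproduced by computing multi-information as the sum of single-variable entropies minus the joint entropy. This is a minimal sketch that assumes the two alleles are independent and equally likely, which the truth table implies but does not state; the function names are illustrative.

```python
import math
from itertools import product

def entropy_bits(dist):
    """Entropy in bits of a dict {state: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginal distribution over the variable positions in keep."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def multi_information(joint, variables):
    """Sum of single-variable entropies minus the joint entropy, in bits."""
    singles = sum(entropy_bits(marginal(joint, (i,))) for i in variables)
    return singles - entropy_bits(marginal(joint, tuple(variables)))

# Assumed: alleles A and B are independent fair bits; phenotype P = A XOR B.
joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

# Positions 0, 1, 2 correspond to A, B, P.
pairwise = {pair: multi_information(joint, pair)
            for pair in [(0, 1), (0, 2), (1, 2)]}
total = multi_information(joint, (0, 1, 2))
print(pairwise, total)
```

Every pairwise value is 0 bits while the three-variable value is 1 bit, so the dependence is only visible when all three variables are considered together.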
Synergy and Redundancy
$$S = I(X;Y;Z) - I(X;Y) - I(X;Z) - I(Y;Z) = I(\{X,Y\};Z) - [I(X;Z) + I(Y;Z)]$$
• S compares the information that X and Y together provide about Z with the information that these two variables provide separately
• If S < 0 then X and Y are redundant in providing information about Z
• If S > 0 then there is synergy between X and Y
Motivating example
• X: SNP 1
• Y: SNP 2
• Z: phenotype (apoptosis level)
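Two extreme toy cases make the sign of S concrete. In this sketch (all names and both distributions are illustrative assumptions, not SNP data), Z = X XOR Y gives pure synergy, while X = Y = Z gives pure redundancy.

```python
import math
from itertools import product

def marginal_entropy(joint, keep):
    """Entropy in bits of the marginal over the variable positions in keep."""
    marg = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def mutual_info(joint, left, right):
    """I(left; right) = H(left) + H(right) - H(left, right)."""
    return (marginal_entropy(joint, left) + marginal_entropy(joint, right)
            - marginal_entropy(joint, left + right))

def synergy(joint):
    """S = I({X,Y};Z) - [I(X;Z) + I(Y;Z)] for states ordered (x, y, z)."""
    return (mutual_info(joint, (0, 1), (2,))
            - mutual_info(joint, (0,), (2,))
            - mutual_info(joint, (1,), (2,)))

# Synergy: Z is the XOR of two independent fair bits.
xor_joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

# Redundancy: X, Y and Z are all copies of one fair bit.
copy_joint = {(x, x, x): 0.5 for x in (0, 1)}

s_xor = synergy(xor_joint)
s_copy = synergy(copy_joint)
print(s_xor, s_copy)
```

The XOR case gives S = +1 bit because neither variable alone says anything about Z. The copy case gives S = −1 bit because each variable alone already provides the full bit.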
How do we quantify distance between distributions?
Kullback-Leibler Divergence (DKL)
• Also known as relative entropy
• Quantifies difference between two distributions: P(x) and Q(x)
$$D_{KL}(P\|Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)} \quad \text{(discrete)}$$
$$D_{KL}(P\|Q) = \int P(x) \ln \frac{P(x)}{Q(x)}\, dx \quad \text{(continuous)}$$
• Non-symmetric measure
• DKL(P||Q)≥0, DKL(P||Q)=0 if and only if P=Q
• Invariant to reparameterization of x
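The discrete form and its listed properties can be checked on a small example. This is a minimal sketch that reuses the two nucleotide distributions from the earlier slides as P and Q; the function name `kl_divergence` is an assumption, and the code assumes Q(x) > 0 wherever P(x) > 0.

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.4, 0.4, 0.1, 0.1]      # skewed nucleotide frequencies
q = [0.25, 0.25, 0.25, 0.25]  # uniform nucleotide frequencies

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)
print(d_pq, d_qp, d_pp)
```

The two directions give different values, about 0.19 and 0.22 nats, which is why DKL is a divergence rather than a true distance.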
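Kullback-Leibler Divergence DKL≥0
Proof, use Jensen’s inequality: for a concave function f(x), $\langle f(x) \rangle \le f(\langle x \rangle)$, e.g. $\langle \ln x \rangle \le \ln \langle x \rangle$
[Figure: plot of ln(x); for a concave function, every chord lies below the function]
$$\begin{aligned}
D_{KL}(P\|Q) &= \sum_x P(x) \ln \frac{P(x)}{Q(x)} = -\sum_x P(x) \ln \frac{Q(x)}{P(x)} = -\left\langle \ln \frac{Q(x)}{P(x)} \right\rangle_{P(x)} \\
&\ge -\ln \left\langle \frac{Q(x)}{P(x)} \right\rangle_{P(x)} = -\ln \sum_x P(x) \frac{Q(x)}{P(x)} = -\ln \sum_x Q(x) = -\ln 1 = 0
\end{aligned}$$
$$\therefore D_{KL}(P\|Q) \ge 0$$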
Kullback-Leibler Divergence Motivation 1: Counting Statistics
• Flip a fair coin N times, i.e., qH=qT=0.5
• E.g. N=50, observe 27 heads and 23 tails
• What is the probability of observing this?
[Figure: bar charts of heads and tails probabilities, vertical axis from 0 to 0.6]
• Observed distribution: P(x)={pH; pT}={0.54; 0.46}
• Actual distribution: Q(x)={qH; qT}={0.50; 0.50}
Kullback-Leibler Divergence Motivation 1: Counting Statistics
$$\begin{aligned}
P(n_H, n_T) &= \frac{N!}{n_H!\, n_T!}\, q_H^{n_H} q_T^{n_T} && \text{(binomial distribution)} \\
&\approx \exp\left(-N p_H \ln \frac{p_H}{q_H} - N p_T \ln \frac{p_T}{q_T}\right) && \text{(for large } N\text{)} \\
&= \exp\left(-N D_{KL}[P\|Q]\right)
\end{aligned}$$
• Probability of observing counts depends on i) N and ii) how much the observed distribution differs from the true distribution
• DKL emerges from the large N limit of a binomial (multinomial) distribution
• DKL quantifies how much the observed distribution diverges from the true underlying distribution
• If DKL>1/N then the distributions are “very” different
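The coin example can be evaluated directly. The sketch below compares the exact binomial probability with exp(−N·DKL). The large-N approximation holds to exponential order, so the code also includes the Stirling prefactor 1/√(2πN·pH·pT); that prefactor is an addition for illustration and is not written on the slide. `math.comb` requires Python 3.8 or later.

```python
import math

N, n_heads = 50, 27
n_tails = N - n_heads
q = (0.5, 0.5)                  # true (fair coin) distribution
p = (n_heads / N, n_tails / N)  # observed distribution {0.54, 0.46}

# Exact binomial probability of the observed counts.
exact = math.comb(N, n_heads) * q[0] ** n_heads * q[1] ** n_tails

# D_KL(P||Q) in nats and the leading exponential term.
d_kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
leading = math.exp(-N * d_kl)

# Sub-exponential Stirling prefactor (an addition, not on the slide).
prefactor = 1 / math.sqrt(2 * math.pi * N * p[0] * p[1])

print(exact, leading, prefactor * leading)
```

Here N·DKL is about 0.16, well below 1, so the observed counts are entirely typical for a fair coin. The exact probability is about 0.096 and agrees with the prefactor-corrected estimate to within roughly 1%.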
Kullback-Leibler Divergence Motivation 2: Information Theory
• How many extra bits, on average, do we need to code samples from P(x) using a code optimized for Q(x)?
$$\begin{aligned}
D_{KL}(P\|Q) &= \text{avg no. of bits using bad code} - \text{avg no. of bits using optimal code} \\
&= \left( -\sum_x P(x) \log_2 Q(x) \right) - \left( -\sum_x P(x) \log_2 P(x) \right) \\
&= \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}
\end{aligned}$$
Kullback-Leibler Divergence Motivation 2: Information Theory
| Symbol | Probability of symbol, P(x) | Bad code, but optimal for Q(x) | Optimal code for P(x) |
|---|---|---|---|
| A | 1/2 | 00 | 0 |
| C | 1/4 | 01 | 10 |
| T | 1/8 | 10 | 110 |
| G | 1/8 | 11 | 111 |
P(x)={1/2,1/4,1/8,1/8}, Q(x)={1/4,1/4,1/4,1/4}
• Entropy of symbol distribution: $-\sum_x p(x) \log_2 p(x) = 1.75$ bits
• Avg length of bad code = 2 bits
• Avg length of optimal code = 1.75 bits. This is equal to the entropy and thus is optimal
DKL(P||Q)=2−1.75=0.25 bits, i.e. there is an additional overhead of 0.25 bits per symbol if we use the bad code {A=00;C=01;T=10;G=11} instead of the optimal code.
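The 0.25-bit overhead can be confirmed by computing both average code lengths and the divergence in base 2. This is a minimal sketch using the codes and probabilities from the table; the variable names are illustrative.

```python
import math

p = {"A": 1 / 2, "C": 1 / 4, "T": 1 / 8, "G": 1 / 8}
q = {s: 1 / 4 for s in p}  # distribution the bad code is optimal for

bad_code = {"A": "00", "C": "01", "T": "10", "G": "11"}
optimal_code = {"A": "0", "C": "10", "T": "110", "G": "111"}

def average_length(code, probs):
    """Expected codeword length in bits when symbols follow probs."""
    return sum(probs[s] * len(code[s]) for s in probs)

avg_bad = average_length(bad_code, p)
avg_opt = average_length(optimal_code, p)

# D_KL(P||Q) in bits (base-2 logarithm).
d_kl_bits = sum(p[s] * math.log2(p[s] / q[s]) for s in p)

print(avg_bad, avg_opt, avg_bad - avg_opt, d_kl_bits)
```

The difference in average length matches DKL(P||Q) exactly in this example because both codes have lengths equal to −log2 of the distribution they were designed for.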