NLP Language Models 1
• Information theory, IT
• Entropy
• Mutual Information
• Use in NLP
Some basic concepts of Information Theory and Entropy
Entropy
• Related to coding theory: a more efficient code assigns fewer bits to more frequent messages, at the cost of more bits for the less frequent ones
EXAMPLE: Every five minutes you have to send a message about the two occupants of a house

• Equal probability: four situations, so a fixed-length code needs 2 bits per message

Situation         Probability   Code
no occupants      .25           00
first occupant    .25           01
second occupant   .25           10
both occupants    .25           11

• Different probability: frequent situations get shorter codes

Situation         Probability   Code
no occupants      .5            0
first occupant    .125          110
second occupant   .125          111
both occupants    .25           10
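The saving from the variable-length code can be checked directly (a quick sketch using the probabilities and code lengths from the table above):

```python
# Average code length: each situation's probability times its code length.
probs_and_lengths = [
    (0.5,   1),  # no occupants    -> "0"
    (0.125, 3),  # first occupant  -> "110"
    (0.125, 3),  # second occupant -> "111"
    (0.25,  2),  # both occupants  -> "10"
]
avg_bits = sum(p * length for p, length in probs_and_lengths)
print(avg_bits)  # 1.75 bits per message, vs. 2 bits for the fixed-length code
```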
• Let X be a random variable taking values x1, x2, ..., xn from a domain according to a probability distribution
• We can define the expected value of X, E(X), as the sum of the possible values weighted by their probabilities
• E(X) = p(x1)x1 + p(x2)x2 + ... + p(xn)xn
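A minimal sketch of this definition, reusing the occupancy probabilities from the earlier example (the pairing of values with those probabilities is my own illustration):

```python
# Expected value E(X) = sum_i p(x_i) * x_i for a discrete random variable.
values = [0, 1, 2, 3]                # e.g. an index for each situation
probs  = [0.5, 0.125, 0.125, 0.25]   # probabilities summing to 1
expected = sum(p * x for p, x in zip(probs, values))
print(expected)  # 1.125
```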
Entropy
• A message can be thought of as a random variable W that can take one of several values V(W) according to a probability distribution P.
• Is there a lower bound on the number of bits needed to encode a message? Yes: the entropy.
• It is possible to get close to this minimum (lower bound).
• It is also a measure of our uncertainty about what the message says (many bits: uncertain; few bits: certain).
• Given an event, we want to obtain its information content (I)
• From Shannon in the 1940s. Two constraints:
• Significance: the less probable an event is, the more information it contains
• P(x1) > P(x2) => I(x2) > I(x1)
• Additivity: if two events are independent,
• I(x1x2) = I(x1) + I(x2)
• I(m) = 1/p(m) satisfies the first requirement but not the second
• I(x) = -log p(x) satisfies both, so we define I(x) = -log p(x)
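Both constraints can be verified numerically for the logarithmic definition (a sketch; the probabilities are arbitrary examples):

```python
import math

def info(p):
    """Information content I(x) = -log2 p(x), in bits."""
    return -math.log2(p)

# Additivity: for independent events, I(x1 x2) = I(x1) + I(x2).
p1, p2 = 0.5, 0.25
assert math.isclose(info(p1 * p2), info(p1) + info(p2))

# Significance: the less probable event carries more information.
assert info(0.125) > info(0.5)

# The rejected candidate I(m) = 1/p(m) fails additivity:
# 1/(p1*p2) = 8, but 1/p1 + 1/p2 = 6.
```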
• Let X be a random variable, described by p(X), with information content I
• Entropy is the expected value of I: H(X) = E(I) = -Σx p(x) log2 p(x)
• Entropy measures the information content of a random variable. We can consider it as the average length of the message needed to transmit a value of this variable using an optimal coding.
• Entropy measures the degree of disorder (uncertainty) of the random variable.
• Uniform distribution of a variable X: each possible value xi ∈ X, with |X| = M, has the same probability pi = 1/M
• To codify a value xi in binary we need log2 M bits of information
• Non-uniform distribution, by analogy:
• Each value xi has a different probability pi
• Treating xi as if it were one of Mi = 1/pi equiprobable values, we need log2 Mi = log2 (1/pi) = -log2 pi bits of information
[Decision tree: ask "X = a?"; if no, "X = b?"; if no, "X = c?"; each "yes" identifies the value, and a final "no" means d]

Average number of questions: 1.75
Let X ={a, b, c, d} with pa = 1/2; pb = 1/4; pc = 1/8; pd = 1/8
entropy(X) = E(I)=-1/2 log2 (1/2) -1/4 log2 (1/4) -1/8 log2 (1/8) -1/8 log2 (1/8) = 7/4 = 1.75 bits
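This calculation can be sketched in code; the entropy matches the average number of yes/no questions in the tree above:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X = {a, b, c, d} with probabilities 1/2, 1/4, 1/8, 1/8
h = entropy([0.5, 0.25, 0.125, 0.125])
print(h)  # 1.75 bits
```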
Let X have a Bernoulli distribution:
X = 0 with probability p
X = 1 with probability (1-p)

H(X) = -p log2 (p) - (1-p) log2 (1-p)

p = 0   => 1-p = 1    H(X) = 0
p = 1   => 1-p = 0    H(X) = 0
p = 1/2 => 1-p = 1/2  H(X) = 1
[Plot: H(X) as a function of p, rising from 0 at p = 0 to a maximum of 1 bit at p = 1/2 and falling back to 0 at p = 1]
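The binary entropy function and its three reference points can be sketched as:

```python
import math

def binary_entropy(p):
    """H(X) for a two-valued variable: -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Extremes carry no uncertainty; p = 1/2 maximises the entropy at 1 bit.
print(binary_entropy(0.0), binary_entropy(1.0), binary_entropy(0.5))
```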
• The joint entropy of two random variables X, Y is the average information content needed to specify both variables: H(X,Y) = -Σx Σy p(x,y) log2 p(x,y)
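A minimal sketch of the joint-entropy sum, on a toy joint distribution of my own:

```python
import math

# H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y), here for a uniform 2x2 joint.
joint = {('a', 0): 0.25, ('a', 1): 0.25,
         ('b', 0): 0.25, ('b', 1): 0.25}

h_xy = -sum(p * math.log2(p) for p in joint.values() if p > 0)
print(h_xy)  # 2.0 bits: four equally likely (x, y) pairs
```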
• The conditional entropy of a random variable Y given another random variable X describes how much information is needed on average to communicate Y when the reader already knows X: H(Y|X) = -Σx Σy p(x,y) log2 p(y|x)
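A sketch of the conditional-entropy sum, on a made-up joint distribution where Y is fully determined by x1 but uncertain given x2:

```python
import math

# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x).
joint = {('x1', 'y1'): 0.5,
         ('x2', 'y1'): 0.25, ('x2', 'y2'): 0.25}
px = {'x1': 0.5, 'x2': 0.5}  # marginal p(x)

h_y_given_x = -sum(p * math.log2(p / px[x])
                   for (x, _), p in joint.items() if p > 0)
print(h_y_given_x)  # 0.5 bits: knowing X = x1 leaves no uncertainty about Y
```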
Chain rule for probabilities

P(A,B) = P(A|B)P(B) = P(B|A)P(A)
P(A,B,C,D,...) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)...
Chain rule for entropies

H(X,Y) = H(X) + H(Y|X)
H(X1,...,Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1,...,Xn-1)
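The two-variable chain rule can be checked numerically on a small joint distribution (my own toy example):

```python
import math

# Verify H(X,Y) = H(X) + H(Y|X) on a toy distribution.
joint = {('x1', 'y1'): 0.5,
         ('x2', 'y1'): 0.25, ('x2', 'y2'): 0.25}
px = {'x1': 0.5, 'x2': 0.5}

h_xy = -sum(p * math.log2(p) for p in joint.values())
h_x = -sum(p * math.log2(p) for p in px.values())
h_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())

assert math.isclose(h_xy, h_x + h_y_given_x)  # 1.5 = 1.0 + 0.5
```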
Mutual Information

I(X,Y) is the mutual information between X and Y: I(X,Y) = H(X) - H(X|Y)
• I(X,Y) measures the reduction in uncertainty about X when Y is known
• It also measures the amount of information X carries about Y (or Y about X)
• I(X,Y) = 0 only when X and Y are independent: H(X|Y) = H(X)
• H(X) = H(X) - H(X|X) = I(X,X)
• Entropy is thus the self-information: the mutual information between X and itself
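The definitions above can be sketched via the equivalent sum I(X,Y) = Σ p(x,y) log2[p(x,y)/(p(x)p(y))], again on a toy joint distribution of my own:

```python
import math

joint = {('x1', 'y1'): 0.5,
         ('x2', 'y1'): 0.25, ('x2', 'y2'): 0.25}
px = {'x1': 0.5, 'x2': 0.5}    # marginals consistent with `joint`
py = {'y1': 0.75, 'y2': 0.25}

# I(X,Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
mi = sum(p * math.log2(p / (px[x] * py[y]))
         for (x, y), p in joint.items() if p > 0)
print(mi)  # positive: knowing Y reduces uncertainty about X

# For a product (independent) distribution the mutual information is 0.
indep = {(x, y): px[x] * py[y] for x in px for y in py}
mi0 = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in indep.items())
print(mi0)
```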
Pointwise Mutual Information

• The PMI of a pair of outcomes x and y belonging to discrete random variables quantifies the discrepancy between the probability of their coincidence given their joint distribution and the probability of their coincidence given only their individual distributions, assuming independence: PMI(x,y) = log2 [ p(x,y) / (p(x)p(y)) ]
• The mutual information of X and Y is the expected value of the pointwise (specific) mutual information over all possible outcomes.
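A small sketch of PMI in the collocation-detection spirit it is used for in NLP; the corpus probabilities are made up for illustration:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2( p(x,y) / (p(x) p(y)) )."""
    return math.log2(p_xy / (p_x * p_y))

# Two words each with probability 0.01. If they co-occur ten times more
# often than independence predicts, PMI is strongly positive.
print(pmi(0.001, 0.01, 0.01))   # log2(10): strong association
print(pmi(0.0001, 0.01, 0.01))  # log2(1) = 0: independent
```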
• H: entropy of a language L
• We do not know p(X)
• Let q(X) be a language model (LM)
• How good is q(X) as an estimate of p(X)?
Cross Entropy

Measures the "surprise" of a model q when it describes events that follow a distribution p:

H(p, q) = -Σx p(x) log2 q(x)
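A sketch comparing the true distribution from the earlier example against a uniform model (the uniform choice of q is my own illustration):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x); equals H(p) only when q = p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]  # "true" distribution
q = [0.25, 0.25, 0.25, 0.25]   # uniform model

print(cross_entropy(p, p))  # 1.75 bits: the entropy of p itself
print(cross_entropy(p, q))  # 2.0 bits: the mismatched model is more "surprised"
```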
Relative Entropy or Kullback-Leibler (KL) divergence

Measures the difference between two probability distributions:

D(p || q) = Σx p(x) log2 [ p(x) / q(x) ]
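A sketch reusing the distributions from the cross-entropy example; the KL divergence is exactly the extra bits paid for modelling p with q, i.e. H(p, q) - H(p):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2( p(x) / q(x) ); >= 0, and 0 iff p = q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

print(kl_divergence(p, p))  # 0.0: no divergence from itself
print(kl_divergence(p, q))  # 0.25 bits = H(p, q) - H(p) = 2.0 - 1.75
```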