information theory and security. lecture motivation up to this point we have seen: –classical...
Post on 21-Dec-2015
220 views
TRANSCRIPT
Information Theory and SecurityInformation Theory and Security
Lecture MotivationLecture Motivation
Up to this point we have seen:– Classical Crypto
– Symmetric Crypto
– Asymmetric Crypto
These systems have focused on issues of confidentiality: Ensuring that an adversary cannot infer the original plaintext message, or cannot learn any information about the original plaintext from the ciphertext.
In today’s lecture we will put a more formal framework around the notion of what information is, and use this to provide a definition of security from an information-theoretic point of view.
Lecture OutlineLecture Outline
Probability Review: Conditional Probability and Bayes
Entropy: – Desired properties and definition
– Chain Rule and conditioning
Coding and Information Theory– Huffman codes
– General source coding results
Secrecy and Information Theory– Probabilistic definitions of a cryptosystem
– Perfect Secrecy
The Basic IdeaThe Basic Idea
Suppose we roll a 6-sided dice.– Let A be the event that the number of dots is odd.
– Let B be the event that the number of dots is at least 3.
A = {1, 3, 5}
B = {3, 4, 5, 6}
I tell you: the roll belongs to both A and B then you know there are only two possibilities: {3, 5}
In this sense tells you more than just A or just B.
That is, there is less uncertainty in than in A or B.
Information is closely linked with this idea of uncertainty: Information increases when uncertainty decreases.
BA
BA
Probability Review, pg. 1Probability Review, pg. 1 A random variable (event) is an experiment whose outcomes
are mapped to real numbers.
For our discussion we will deal with discrete-valued random variables.
Probability: We denote pX(x) = Pr(X = x).
For a subset A,
Joint Probability: Sometimes we want to consider more than two events at the same time, in which we case we lump them together into a joint random variable, e.g. Z = (X,Y).
Independence: We say that two events are independent if
yY,xXPry,xY,Xp Y,X
Ax
X xp)A(p
ypxpy,xY,Xp YXY,X
Probability Review, pg. 2Probability Review, pg. 2 Conditional Probability: We will often ask questions about
the probability of events Y given that we have observed X=x. In particular, we define the conditional probability of Y=y given X=x by
Independence: We immediately get
Bayes’s Theorem: If pX(x)>0 and pY(y)>0 then
)y(p)x|y(p YY
)x(p
)y,x(p)x|y(p
X
XYY
)y(p
)x|y(p)x(p)y|x(p
Y
YXX
Entropy and UncertaintyEntropy and Uncertainty We are concerned with how much uncertainty a random event
has, but how do we define or measure uncertainty?
We want our measure to have the following properties:
1. To each set of nonnegative numbers with , we define the uncertainty by .
2. should be a continuous function: A slight change in p should not drastically change
3. for all n>0. Uncertainty increases when there are more outcomes.
4. If 0<q<1, then
n21 p,,p,pp
1n1
1n1
n1
n1 ,,H,,H
q1,qHpp,,pHp,,p)q1(,qp,,pH jn1njj1
)p(H1ppp n21
)p(H)p(H
Entropy, pg. 2Entropy, pg. 2 We define the entropy of a random variable by
Example: Consider a fair coin toss. There are two outcomes, with probability ½ each. The entropy is
Example: Consider a non-fair coin toss X with probability p of getting heads and 1-p of getting tails. The entropy is
The entropy is maximum when p= ½.
bit12
1log
2
1
2
1log
2
122
x
2 )x(plogxpXH
p1logp1plogpXH 22
Entropy, pg. 3Entropy, pg. 3 Entropy may be thought of as the number of yes-no questions
needed to accurately determine the outcome of a random event.
Example: Flip two coins, and let X be the number of heads. The possibilities are {0,1,2} and the probabilities are {1/4, 1/2, 1/4}. The Entropy is
So how can we relate this to questions? First, ask “Is there exactly one head?” You will half the time
get the right answer… Next, ask “Are there two heads?” Half the time you needed one question, half you needed two
bits2
3
4
1log
4
1
2
1log
2
1
4
1log
4
1222
Entropy, pg. 4Entropy, pg. 4 Suppose we have two random variables X and Y, the joint
entropy H(X,Y) is given by
Conditional Entropy: In security, we ask questions of whether an observation reduces the uncertainty in something else. In particular, we want a notion of conditional entropy. Given that we observe event X, how much uncertainty is left in Y?
x y
XY2XY )y,x(plogy,xpY,XH
x yY2XY
x yY2YX
xX
)x|y(plog)y,x(p
)x|y(plog)x|y(p)x(p
)xX|Y(H)x(pX|YH
Entropy, pg. 5Entropy, pg. 5 Chain Rule: The Chain Rule allows us to relate joint entropy
to conditional entropy via H(X,Y) = H(Y|X)+H(X).
(Remaining details will be provided on the white board)
Meaning: Uncertainty in (X,Y) is the uncertainty of X plus whatever uncertainty remains in Y given we observe X.
)X|Y(H)X(H
)X|Y(H)x(plog)x(p
)x|y(p)x(plog)y,x(p
)y,x(plog)y,x(pY,XH
xX2X
x yYX2XY
x yXY2XY
Entropy, pg. 6Entropy, pg. 6
Main Theorem:
1. Entropy is non-negative.
2. where denotes the number of elements in the sample space of X.
3.
4. (Conditioning reduces entropy)
with equality if and only if X and Y are independent.
0)X(H
)Y(H)X(HY,XH
2log)X(H
)Y(H)X|Y(H
Entropy and Source Coding TheoryEntropy and Source Coding Theory There is a close relationship between entropy and representing
information.
Entropy captures the notion of how many “Yes-No” questions are needed to accurately identify a piece of information… that is, how many bits are needed!
One of the main focus areas in the field of information theory is on the issue of source-coding:– How to efficiently (“Compress”) information into as few bits as
possible.
We will talk about one such technique, Huffman Coding.
Huffman coding is for a simple scenario, where the source is a stationary stochastic process with independence between successive source symbols
Huffman Coding, pg. 1Huffman Coding, pg. 1
Suppose we have an alphabet with four letters A, B, C, D with frequencies:
We could represent this with A=00, B=01, C=10, D=11. This would mean we use an average of 2 bits per letter.
On the other hand, we could use the following representation: A=1, B=01, C=001, D=000. Then the average number of bits per letter becomes
(0.5)*1+(0.3)*2+(0.1)*3+(0.1)*3 = 1.7
Hence, this representation, on average, is more efficient.
A B C D0.5 0.3 0.1 0.1
Huffman Coding, pg. 2Huffman Coding, pg. 2
Huffman Coding is an algorithm that produces a representation for a source.
The Algorithm:– List all outputs and their
probabilities– Assign a 1 and 0 to smallest two,
and combine to form an output with probability equal to the sum
– Sort List according to probabilities and repeat the process
– The binary strings are then obtained by reading backwards through the procedure
A
B
C
D
0.5
0.3
0.1
0.1
1
0
0.2
0.5
1
0
1
0
1.0
Symbol RepresentationsA: 1B: 01C: 001D: 000
Huffman Coding, pg. 3Huffman Coding, pg. 3
In the previous example, we used probabilities. We may directly use event counts.
Example: Consider 8 symbols, and suppose we have counted how many times they have occurred in an output sample.
We may derive the Huffman Tree (Exercise will be done on whiteboard)
The corresponding length vector is (2,2,3,3,3,4,5,5)
The average codelength is 2.83. If we had used a full-balanced tree representation (i.e. the straight-forward representation) we would have had an average codelength of 3.
S1 S2 S3 S4 S5 S6 S7 S828 25 20 16 15 8 7 5
Huffman Coding, pg. 4Huffman Coding, pg. 4
We would like to quantify the average amount of bits needed in terms of entropy.
Theorem: Let L be the average number of bits per output for Huffman encoding of a random variable X, then
Here, lx =length of codeword assigned to symbol x.
Example: Let’s look back at the 4 symbol example
Our average codelength was 1.7 bits.
685.1)1.0(log1.)1.0(log1.)3.0(log3.)5.0(log5.XH 2222
x
xxlpL,1XHLXH
Huffman Coding, pg. 5Huffman Coding, pg. 5
An interesting and useful question is: What if I use the wrong distribution when calculating the code? How badly will my code perform?
Suppose the true distribution is px, and you used another distribution to find the lengths lx. Define the auxiliary distribution .
Theorem: If we code the source X with lx instead of the correct Huffman code, then the resulting average codelength will satisfy:
where the Kullback-Leibler Divergence D(p||q) is
1qpDXHlpqpDXHx
xx
xlx 2q
x x
x2x q
plogpqpD
Another way to look at cryptography, pg. 1Another way to look at cryptography, pg. 1
So far in class, we have looked at the security problem from an algorithm point-of-view (DES, RC4, RSA,…).
But why build these algorithms? How can we say we are doing a good job?
Enter information theory and its relationship to ciphers…
Suppose we have a cipher with possible plaintexts P, ciphertexts C, and keys K. – Suppose that a plaintext P is chosen according to a probability
law.
– Suppose the key K is chosen independent of P
– The resulting ciphertexts have various probabilities depending on the probabilities for P and K.
Another way to look at cryptography, pg. 2Another way to look at cryptography, pg. 2
Now, enter Eve… She sees the ciphertext C and several security questions arise:– Does she learn anything about P from seeing C?
– Does she learn anything about the key K from seeing C?
Thus, our questions are associated with H(P|C) and H(K|C).
Ideally, we would like for the uncertainty to not decrease, i.e.
H(P | C) = H(P)
H(K | C) = H(K)
Another way to look at cryptography, pg. 3Another way to look at cryptography, pg. 3
Example: Suppose we have three plaintexts {a,b,c} with probabilities {0.5, 0.3, 0.2}. Suppose we have two keys k1 and k2 with probabilities 0.5 and 0.5. Suppose there are three ciphertexts U,V,W.
We may calculate probabilities of the ciphertexts
Similarly we get pC(V)=0.25 and pC(W)=0.25
50.0)a(p)k(p)a(p)k(p)U(p P2kP1KC
Ek1(a)=U Ek1(b)=V Ek1(c)=W Ek2(a)=U Ek2(b)=W Ek2(c)=V
Another way to look at cryptography, pg. 4Another way to look at cryptography, pg. 4
Suppose Eve observes the ciphertext U, then she knows the plaintext was “a”.
We may calculate the conditional probabilities:
Similarly we get pP(c|V)=0.4 and pP(a|V)=0. Also pP(a|W)=0 , pP(b|W)=0.6 , pP(c|W)=0.4.
What does this tell us? Remember, the original plaintexts probabilities were 0.5, 0.3, and 0.2. So, if we see a ciphertext, then we may revise the probabilities… Something is “learned”
6.0
25.
)5.0)(3.0(
)V(p
k,bp
)V(p
V,bp)V|b(p
C
1C,P
C
C,PP
Another way to look at cryptography, pg. 5Another way to look at cryptography, pg. 5
We use entropy to quantify the amount of information that is learned about the plaintext given the ciphertext is observed.
The conditional entropy of P given C is
Thus an entire bit of information is revealed just by observing the ciphertext!
)2.0(log2.0)3.0(log3.0)5.0(log5.0)P(H 222
}c,b,a{x
P2P}W,V,U{cC 485.0)y|x(plog)c|x(p)c(p)C|P(H
Perfect Secrecy and EntropyPerfect Secrecy and Entropy
The previous example gives us the motivation for the information-theoretic definition of security (or “secrecy”)
Definition: A cryptosystem has perfect secrecy if H(P|C)=H(P).
Theorem: The one-time pad has perfect secrecy.
Proof: See the book for the details. Basic idea is to show each ciphertext will result with equal likelihood. We then use manipulations like:
Equating these two as equal and using H(K)=H(C) gives the result.
)C|P(H)C(H)C,P(H)C,K,P(H
)K(H)P(H)K,P(H)C,K,P(H Why?