TRANSCRIPT
PROBABILITY THEORY CMPUT 296
Martha White
Winter, 2020
QUICK CHECK ON PRE-REQ KNOWLEDGE
• You should know how to take derivatives
• Will rarely, if ever, integrate
• I will teach you about Partial Derivatives
• I assume you know about vectors and dot products, hopefully also about matrices
• Need to have learned about probability; I will cover
  • Expected Value (Mean)
  • Variance
  • Random Variables
  • Probability Density Functions…
WHY DO WE NEED PROBABILITIES?
• We could just assume a deterministic world
• I see an input x, I can produce the output y = f(x)
• Example: in a game (e.g., chess), you take an action, and the outcome is deterministic
• But, even in a deterministic world, we have a problem: partial observability
• Outcomes look random because we don’t have enough information
• Example: Imagine a (high-tech) gumball machine, where f(x = has candy, battery charged) = output candy
• You can only see if it has candy
WHY DO WE NEED PROBABILITIES?
• But, even in a deterministic world, we have a problem: partial observability
• Outcomes look random because we don’t have enough information
• Example: Imagine a (high-tech) gumball machine, where f(x = has candy, battery charged) = output candy
• You can only see if it has candy
• From your perspective, when x = has candy, sometimes candy is outputted and sometimes not
• Looks stochastic (dependent on hidden battery charge)
SPACE OF OUTCOMES AND EVENTS
SAMPLE SPACES
e.g., F = {∅, {1, 2}, {3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}
e.g., F = {∅, [0, 0.5], (0.5, 1.0], [0, 1]}
e.g., Outcome of Dice Roll
A FEW COMMENTS ON TERMINOLOGY
• A few new terms, including countable, closure
  • only a small amount of terminology is used; you can google these terms and learn on your own
  • notation sheet in notes
• Countable: integers, {0.1, 2.0, 3.6}, …
• Uncountable: real numbers, intervals, …
• Interchangeably I use (though it's somewhat loose)
  • discrete and countable
  • continuous and uncountable
(MEASURABLE) SPACE OF OUTCOMES AND EVENTS
an event space
WHY IS THIS THE DEFINITION?
Intuitively,
1. A collection of outcomes is an event (e.g., either a 1 or 6 was rolled)
2. If we can measure two events separately, then their union should also be a measurable event
3. If we can measure an event, then we should be able to measure that that event did not occur (the complement)
QUICK CHECK ON BACKGROUND
• The complement of a set
• The union of sets
• A set of sets
• Any other confusing notation?
SAMPLE SPACES
e.g., F = {∅, {1, 2}, {3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}
e.g., F = {∅, [0, 0.5], (0.5, 1.0], [0, 1]}
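As a sanity check (a sketch, not from the slides), we can verify that the first example event space is closed under complement and union, matching the intuition on the next slide:

```python
# Check (not from the slides) that the example event space
# F = {∅, {1,2}, {3,4,5,6}, Ω} is closed under complement and union.
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4, 5, 6}), omega}

# Complement closure: for every event A in F, Ω \ A is also in F.
assert all(omega - A in F for A in F)

# Union closure: for every pair of events in F, their union is also in F.
assert all(A | B in F for A, B in combinations(F, 2))
print("valid event space")
```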
AXIOMS OF PROBABILITY
1. (unit measure) P(Ω) = 1
2. (σ-additivity) Any countable sequence of disjoint events A_1, A_2, … ∈ F satisfies P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i)
FINDING PROBABILITY DISTRIBUTIONS
Do you recognize this distribution?
PROBABILITY MASS FUNCTIONS
ARBITRARY PMFS
e.g., PMF for a fair die (table of values)
Ω = {1, 2, 3, 4, 5, 6}
p(ω) = 1/6 ∀ω ∈ Ω
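A minimal sketch of this table-of-values pmf, with the validity checks (nonnegative, sums to 1) made explicit:

```python
# Fair-die pmf as a table of values over Ω = {1, ..., 6}.
p = {outcome: 1 / 6 for outcome in range(1, 7)}

# A valid pmf is nonnegative and sums to 1 over Ω.
assert all(prob >= 0 for prob in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

# The probability of an event is the sum over its outcomes,
# e.g. the event "rolled a 1 or a 6":
print(sum(p[w] for w in {1, 6}))  # 1/6 + 1/6 = 1/3
```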
EXERCISE: EXAMPLES OF EVENTS
• What is a possible event? What is its probability?
• What is the event space?
Ω = {1, 2, 3, 4, 5, 6}
p(ω) = 1/6 ∀ω ∈ Ω
EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?
• Histograms!
• Imagine you recorded your commute times for a year, in minutes (365 recorded times)
• How do you get p(t)?
• How is p(t) useful?
EXERCISE: HOW ARE PMFS USEFUL AS A MODEL?
• Histograms!
• Imagine you recorded your commute times for a year, in minutes
• Get p(t): count the number of times t = 1, 2, 3, … occurs, then normalize the counts by the number of samples
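The count-and-normalize step can be sketched as follows; the commute times here are made-up stand-ins for the 365 recorded values:

```python
from collections import Counter

# Hypothetical recorded commute times, in minutes.
commute_minutes = [22, 25, 22, 30, 25, 22, 28, 25, 22, 30]

counts = Counter(commute_minutes)          # histogram: count each value of t
n = len(commute_minutes)
p = {t: c / n for t, c in counts.items()}  # normalize counts by # samples

assert abs(sum(p.values()) - 1.0) < 1e-12  # a valid empirical pmf
print(p[22])  # 22 minutes occurred 4 times out of 10 -> 0.4
```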
USEFUL PMFS
REMINDERS: JANUARY 9, 2020
• Assignment 1 is available on the website
• https://marthawhite.github.io/mlbasics/
• You should start reading the notes
• Chapters 1, 2 and 3 (about 30 pages)
• The notes are in-progress
• Some sections still say “Coming soon…”
• Avoid printing the full set of notes; at most, print the sections you are currently reading
A FEW OTHER NOTES ON POLICY
• If you have an issue (sickness, injury, family problem, etc.), and need an extension on an assignment, contact me about it before the assignment deadline.
• I cannot give extensions after.
• You cannot submit Thought Questions late
• only Assignments have a late policy that allows for late submission
• if you submit 1 hour late, we won’t penalize you
USEFUL PMFS
e.g., amount of mail received in a day; number of calls received by a call center in an hour
1. Can we use a table for this?
2. How do we know this is a valid pmf? How can you check?
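One way to check: the Poisson support {0, 1, 2, …} is countably infinite, so no finite table can list it, but we can still verify numerically that the pmf sums to 1. A sketch, with λ = 5 as an assumed rate:

```python
import math

# Poisson pmf: p(k) = λ^k e^{-λ} / k!, here with an assumed rate λ = 5.
lam = 5.0

def poisson_pmf(k: int) -> float:
    return lam**k * math.exp(-lam) / math.factorial(k)

# Truncate the (negligible) tail to check the pmf sums to 1.
total = sum(poisson_pmf(k) for k in range(100))
print(abs(total - 1.0) < 1e-9)  # True
```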
EXERCISE: CAN WE USE A POISSON FOR COMMUTE TIMES?
• Used a probability table (histogram) for minutes: count the number of times t = 1, 2, 3, … occurs, then normalize the counts by the number of samples
• Can we use a Poisson?
Why Poisson? Why not just a histogram?
PROBABILITY DENSITY FUNCTIONS
Who has never seen an integral?
e.g. normal distribution (Gaussian)
PMFS VS. PDFS
Example:
- Stopping time of a car, in interval [3, 15]. What is the probability of seeing a stopping time of exactly 3.141596? (How much mass or space does it take up in [3, 15]?)
- More reasonable to ask the probability of stopping between 3 and 3.5 seconds. How do we get that probability?
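We get that probability by integrating the density over the interval. A sketch, assuming (not from the slides) that the stopping time is uniform on [3, 15], so the density is 1/12 everywhere on that interval:

```python
# Assumed density: stopping time T uniform on [3, 15].
def stopping_pdf(t: float) -> float:
    return 1 / 12 if 3 <= t <= 15 else 0.0

# A single point like 3.141596 has zero width, hence probability zero.
# An interval gets the area under the density: here P(3 <= T <= 3.5),
# approximated by a simple Riemann sum.
n = 100_000
width = 0.5 / n
prob = sum(stopping_pdf(3 + (i + 0.5) * width) * width for i in range(n))
print(round(prob, 6))  # 0.5 * (1/12) ≈ 0.041667
```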
USEFUL PDFS
WHY CAN THE DENSITY BE ABOVE 1?
• p(x) can be big, because Δx is small
• P(A) MUST be less than or equal to 1
• p(x) can be bigger than 1
RANDOM VARIABLES
Age: 35 Likes sports: Yes Height: 1.85m Smokes: No Weight: 75kg Marital st.: Single IQ: 104 Occupation: Musician
Age: 26 Likes sports: Yes Height: 1.75m Smokes: No Weight: 79kg Marital st.: Divorced IQ: 103 Occupation: Athlete
Musician is a random variable (a function)
A is a new event, let's call it 1; Not-A is 0
Ω is {0, 1}
Can ask P(M = 0) and P(M = 1)
WE INSTINCTIVELY CREATE THIS TRANSFORMATION
Assume Ω is a set of people.
Compute the probability that a randomly selected person ω ∈ Ω has a cold.
Define event A = {ω ∈ Ω : Disease(ω) = cold}.
Disease is our new random variable, P(Disease = cold)
Disease is a function that maps the outcome space to a new outcome space {cold, not cold}
Disease is a function, which is neither a variable nor random. BUT, this term is still a good one, since we treat Disease as a variable and assume it can take on different values (randomly, according to some distribution)
RANDOM VARIABLE: FORMAL DEFINITION
Example: X : Ω → [0, ∞)
Ω is the set of (measured) people in a population,
with associated measurements such as height and weight
X(ω) = height
A = interval = [5′10″, 5′20″]
P(X ∈ A) = P(5′10″ ≤ X ≤ 5′20″) = P({ω : X(ω) ∈ A})
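The preimage idea above can be sketched directly; the people, heights (in inches), and uniform distribution below are all made up for illustration:

```python
# A random variable is a function X: Ω → ℝ; P(X ∈ A) is the probability
# of the preimage {ω : X(ω) ∈ A}. All data here are hypothetical.
heights = {"ann": 68, "bob": 71, "cam": 62, "dee": 70}
P = {person: 0.25 for person in heights}  # uniform distribution over Ω

def prob_height_in(lo: float, hi: float) -> float:
    preimage = {w for w, h in heights.items() if lo <= h <= hi}
    return sum(P[w] for w in preimage)

print(prob_height_in(68, 72))  # ann, bob, dee qualify -> 0.75
```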
3 MINUTE BREAK AND EXERCISE
• Let X be a random variable that corresponds to the ratio of hard-to-easy problems on an assignment. Assume it takes values in {0.1, 0.25, 0.7}.
• Is this discrete or continuous? Does it have a PMF or PDF?
• Further, where could the variability come from? i.e., why is this a random variable?
• Let X be the stopping time of a car, taking values in [3, 5] ∪ [7, 9]. Is this discrete or continuous?
• Think of an example of a discrete random variable (RV) and a continuous RV
WHAT IF WE HAVE MORE THAN TWO VARIABLES…
• So far, we have considered scalar random variables
• Axioms of probability defined abstractly, apply to vector random variables
Ω = ℝ², e.g., ω = [−0.5, 10]
Ω = [0, 1] × [2, 5], e.g., ω = [0.2, 3.5]
But, when defining probabilities, we will want to consider how the variables interact
TWO DISCRETE RANDOM VARIABLES
Random variables X and Y; outcome spaces 𝒳 and 𝒴
answer is to actually take the expected value (mean) of this gamma distribution, which we define below in Section 1.4. □
1.3 Multivariate random variables
Much of the above development extends to multivariate random variables (a vector of random variables), because the definition of outcome spaces and probabilities is general. The examples so far, however, have dealt with scalar random variables, because for multivariate random variables, we need to understand how variables interact. In this section, we discuss several new notions that only arise when there are multiple random variables, including joint distributions, conditional distributions, marginals and dependence between variables.
Let us start with a simpler example, with two discrete random variables X and Y with outcome spaces 𝒳 and 𝒴. There is a joint probability mass function p : 𝒳 × 𝒴 → [0, 1], and a corresponding joint probability distribution P, such that

p(x, y) = P(X = x, Y = y)

where the pmf needs to satisfy

∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) = 1.
For example, if 𝒳 = {young, old} and 𝒴 = {no arthritis, arthritis}, then the pmf could be the table of probabilities

          Y = 0    Y = 1
  X = 0    1/2     1/100
  X = 1    1/10    39/100
This fits within the definition of probability spaces, because Ω = 𝒳 × 𝒴 is a valid space, and ∑_{ω∈Ω} p(ω) = ∑_{(x,y)∈Ω} p(x, y) = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y). We consider the random variable Z = (X, Y) to be a multivariate random variable, of two dimensions.
Looking at the joint probabilities in the table, we can definitely see that the two random variables interact. The joint probability is small for young and having arthritis, for example. Further, there seems to be more magnitude in the rows corresponding to young, suggesting that the probabilities are also influenced by the proportion of people in the population that are old or young. In fact, one might ask if we can figure out this proportion just from this table.
The answer is a resounding yes, and leads us to marginal distributions and why we might care about marginal distributions. Given a joint distribution over random variables, one would hope that we could extract more specific probabilities, like the distribution over just one of those variables, which is called the marginal distribution. The marginal can be simply computed, by summing up over all values of the other variable
P(X = young) = p(young, no arthritis) + p(young, arthritis) = 51/100.
A young person either does or does not have arthritis, so summing up over these two possible cases factors out that variable. Therefore, using data collected for random variable
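The marginal computation can be sketched numerically using the joint table above (the 0/1 coding for young/old and no-arthritis/arthritis follows the table):

```python
# Joint pmf from the table (X: 0 = young, 1 = old;
# Y: 0 = no arthritis, 1 = arthritis).
joint = {
    (0, 0): 1 / 2,  (0, 1): 1 / 100,
    (1, 0): 1 / 10, (1, 1): 39 / 100,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # a valid joint pmf

# Marginal over X: sum out Y.
p_young = sum(joint[(0, y)] for y in (0, 1))
print(p_young)  # 1/2 + 1/100 = 0.51
```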
21
answer is to actually take the expected value (mean) of this gamma distribution, which wedefine below in Section 1.4. ⇤
1.3 Multivariate random variablesMuch of the above development extends to multivariate random variables—a vector of ran-dom variables—because the definition of outcome spaces and probabilities is general. Theexamples so far, however, have dealt with scalar random variables, because for multivariaterandom variables, we need to understand how variables interact. In this section, we discussseveral new notions that only arise when there are multiple random variables, includingjoint distributions, conditional distributions, marginals and dependence between variables.
Let us start with a simpler example, with two discrete random variables X and Y withoutcome spaces X and Y. There is a joint probability mass function p : X ◊ Y æ [0, 1], andcorresponding joint probability distribution P , such that
p(x, y) = P (X = x, Y = y)
where the pmf needs to satisfyÿ
xœX
ÿ
yœY
p(x, y) = 1.
For example, if X = {young, old} and Y = {no arthritis, arthritis}, then the pmf could bethe table of probabilities
Y0 1
X0 1/2 1/1001 1/10 39/100
This fits within the definition of probability spaces, because � = X ◊ Y is a valid space,and
qÊœ� p(Ê) =
q(x,y)œ� p(x, y) =
qxœX
qyœY
p(x, y). We consider the random variableZ = (X, Y ) to be a multivariate random variable, of two dimensions.
Looking at the joint probabilities in the table, we can definitely see that the two randomvariables interact. The joint probability is small for young and having arthritis, for example.Further, there seems to be more magnitude in the rows corresponding to young, suggestingthat the probabilities are also influenced by the proportion of people in the population thatare old or young. In fact, one might ask if we can figure out this proportion just from thistable.
The answer is a resounding yes, and leads us to marginal distributions and why wemight care about marginal distributions. Given a joint distribution over random variables,one would hope that we could extract more specific probabilities, like the distribution overjust one of those variables, which is called the marginal distribution. The marginal can besimply computed, by summing up over all values of the other variable
P(X = young) = p(young, no arthritis) + p(young, arthritis) = 51/100.
A young person either does or does not have arthritis, so summing up over these two possible cases factors out that variable. Therefore, using data collected for random variable
* these numbers are completely made up
about your commute time tomorrow. For this setting, your random variable X corresponds to the commute time, and you need to define probabilities for this random variable. This data could be considered to be discrete, taking values, in minutes, {4, 5, 6, . . . , 26}. You could then create histograms of this data (table of probability values), as shown in Figure 1.4, to reflect the likelihood of commute times.

The commute time, however, is not actually discrete, and so you would like to model it as a continuous RV. One reasonable choice is a gamma distribution. How, though, does one take the recorded data and determine the parameters α, β of the gamma distribution? Estimating these parameters is actually quite straightforward, though not as immediately obvious as estimating tables of probability values; we discuss how to do so in Chapter 3. The learned gamma distribution is also depicted in Figure 1.4.

Given the gamma distribution, one could now ask the question: what is the most likely commute time today? This corresponds to max_ω p(ω), which is called the mode of the distribution. Another natural question is the average or expected commute time. To obtain this, you need the expected value (mean) of this gamma distribution, which we define below in Section 1.4. □
SOME QUESTIONS WE MIGHT ASK NOW THAT WE HAVE TWO RANDOM VARIABLES
Are these two variables related? Or do they change completely independently of each other?
Given this joint distribution, can we determine just the distribution over arthritis? i.e., P(Y = 1)? (Marginal distribution)
If we knew something about one of the variables, say that the person is young, do we know the distribution over Y? (Conditional distribution)
EXAMPLE: MARGINAL AND CONDITIONAL DISTRIBUTION
P(Y = 1) = P(Y = 1, X = 0) + P(Y = 1, X = 1) = 40/100 What is P(Y = 0)?
P(X = 1) = 49/100. So what is P(X = 0)?
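These marginal computations can be checked with a short script (a sketch; the table encoding and `Fraction` arithmetic are our own, not from the slides):

```python
from fractions import Fraction as F

# Joint table: keys (x, y) with X in {0: young, 1: old}, Y in {0: no arthritis, 1: arthritis}
p = {(0, 0): F(1, 2), (0, 1): F(1, 100), (1, 0): F(1, 10), (1, 1): F(39, 100)}

# Marginalize by summing out the other variable
p_y = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}
p_x = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}

print(p_y[1])  # 2/5  (i.e., 40/100)
print(p_y[0])  # 3/5  (i.e., 60/100)
print(p_x[0])  # 51/100
```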
P(Y = 1 | X = 0) = ? Is it 1/100, the value the table gives for P(Y = 1, X = 0)?
No
CONDITIONAL DISTRIBUTIONS
p(y|x) = p(x, y) / p(x)
If p(x,y) is small, does this imply that p(y|x) is small?
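No: a joint probability can be small simply because p(x) is small, as the book example later in these slides illustrates. A toy numeric sketch (all numbers invented for illustration):

```python
from fractions import Fraction as F

# Hypothetical: fiction books (X = 0) make up only 5% of the shelf
p_x0 = F(5, 100)
# so the joint probability of (fiction, y pages) is small for any y ...
p_joint = F(2, 100)
# ... yet conditioned on the book being fiction, this y is quite likely:
# p(y | X = 0) = p(X = 0, y) / p(X = 0)
p_cond = p_joint / p_x0
print(p_cond)  # 2/5
```

A joint of 0.02 looks negligible, but dividing by the small marginal 0.05 gives a conditional probability of 0.4.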
EXERCISE: CONDITIONAL DISTRIBUTION
P(Y = 1 | X = 0) = ?
What is P(Y = 0 | X = 0)? Should P(Y = 1 | X = 0) + P(Y = 0 | X = 0) = 1?
p(y|x) = p(x, y) / p(x)
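For the arthritis table, the exercise works out as follows (a sketch; we encode the table ourselves and check that the conditional distribution normalizes):

```python
from fractions import Fraction as F

# Joint table from the slides: X in {0: young, 1: old}, Y in {0: no arthritis, 1: arthritis}
p = {(0, 0): F(1, 2), (0, 1): F(1, 100), (1, 0): F(1, 10), (1, 1): F(39, 100)}

p_x0 = p[(0, 0)] + p[(0, 1)]          # P(X = 0) = 51/100
p_y1_given_x0 = p[(0, 1)] / p_x0      # (1/100) / (51/100) = 1/51
p_y0_given_x0 = p[(0, 0)] / p_x0      # (1/2) / (51/100) = 50/51

print(p_y1_given_x0, p_y0_given_x0)   # 1/51 50/51
# A conditional pmf is itself a pmf, so it must sum to 1
print(p_y1_given_x0 + p_y0_given_x0)  # 1
```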
JOINT DISTRIBUTIONS FOR MANY VARIABLES

Z = (X, Y), we can determine the proportion of the population that is young and the proportion that is old.
In general, we can consider a d-dimensional random variable X = (X_1, X_2, ..., X_d) with vector-valued outcomes x = (x_1, x_2, ..., x_d), such that each x_i is chosen from some 𝒳_i. Then, for the discrete case, any function p : 𝒳_1 × 𝒳_2 × ... × 𝒳_d → [0, 1] is called a multidimensional probability mass function if

∑_{x_1 ∈ 𝒳_1} ∑_{x_2 ∈ 𝒳_2} ... ∑_{x_d ∈ 𝒳_d} p(x_1, x_2, ..., x_d) = 1,
or, for the continuous case, p : 𝒳_1 × 𝒳_2 × ... × 𝒳_d → [0, ∞) is a multidimensional probability density function if

∫_{𝒳_1} ∫_{𝒳_2} ... ∫_{𝒳_d} p(x_1, x_2, ..., x_d) dx_1 dx_2 ... dx_d = 1.
A marginal distribution is defined for a subset of X = (X_1, X_2, ..., X_d) by summing or integrating over the remaining variables. For the discrete case, the marginal distribution p(x_i) is defined as

p(x_i) = ∑_{x_1 ∈ 𝒳_1} ... ∑_{x_{i-1} ∈ 𝒳_{i-1}} ∑_{x_{i+1} ∈ 𝒳_{i+1}} ... ∑_{x_d ∈ 𝒳_d} p(x_1, ..., x_{i-1}, x_i, x_{i+1}, ..., x_d),
where the variable x_i is fixed to some value and we sum over all possible values of the other variables. Similarly, for the continuous case, the marginal distribution p(x_i) is defined as

p(x_i) = ∫_{𝒳_1} ... ∫_{𝒳_{i-1}} ∫_{𝒳_{i+1}} ... ∫_{𝒳_d} p(x_1, ..., x_{i-1}, x_i, x_{i+1}, ..., x_d) dx_1 ... dx_{i-1} dx_{i+1} ... dx_d.
Notice that we use p to define the density over x, but then we overload this terminology and also use p for the density only over x_i. To be more precise, we should define two separate functions (pdfs), say p_x for the density over the multivariate random variable and p_{x_i} for the marginal. It is common, however, to simply use p, and infer the random variable from context. In most cases, it is clear; if it is not, we will explicitly highlight the pdfs with additional subscripts.
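For discrete variables, the marginal sums above translate directly into code. A sketch for d = 3 binary variables, with a made-up joint pmf (the weighting scheme is invented purely so the table is non-uniform):

```python
from itertools import product
from fractions import Fraction as F

d = 3
outcomes = list(product((0, 1), repeat=d))  # the outcome space X_1 x X_2 x X_3

# Made-up joint pmf: weight each outcome by (1 + number of ones), then normalize
weights = {x: F(1 + sum(x)) for x in outcomes}
Z = sum(weights.values())
p = {x: w / Z for x, w in weights.items()}
assert sum(p.values()) == 1  # valid pmf

def marginal(p, i, xi):
    # p(x_i): fix coordinate i to xi and sum over all values of the other variables
    return sum(pr for x, pr in p.items() if x[i] == xi)

print(marginal(p, 0, 0) + marginal(p, 0, 1))  # 1, since a marginal is itself a pmf
```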
We can define common multivariate pmfs and pdfs that are extensions of the scalar pmfs and pdfs. The most useful extensions include tables of probability values, uniform distributions and Gaussian distributions. We define these concretely in the final section of this chapter, Section 1.5, for reference. Before we provide these definitions, however, it will be useful to understand how multiple variables interact, because this will influence the extension from univariate to multivariate. In particular, it will be useful to understand conditional distributions and dependence, which we discuss next.
1.3.1 Conditional distributions

Conditional probabilities define probabilities of one random variable, given information about the value of another. More formally, the conditional probability p(y|x) for two random variables X and Y is defined as

p(y|x) = p(x, y) / p(x)    (1.1)
MARGINAL DISTRIBUTIONS
Natural question: Why do you use p for p(xi) and for p(x1, …., xd)? They have different domains, they can’t be the same function!
DROPPING SUBSCRIPTS
Instead of: p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x)

We will write: p(y|x) = p(x, y) / p(x)
ANOTHER EXAMPLE FOR CONDITIONAL DISTRIBUTIONS
• Let X be a Bernoulli random variable (i.e., takes value 1 with probability alpha, and 0 otherwise)
• Let Y be a random variable in {10, 11, …, 1000}
• p(y | X = 0) and p(y | X = 1) are different distributions
• Two types of books: fiction (X=0) and non-fiction (X=1)
• Let Y correspond to number of pages
• Distribution over number of pages different for fiction and non-fiction books (e.g., average different)
EXAMPLE CONTINUED
• Two types of books: fiction (X=0) and non-fiction (X=1)
• Y corresponds to number of pages
• If most books are non-fiction, p(X = 0, y) is small even if y is a likely number of pages for a fiction book
• p(X = 0) accounts for the fact that the joint probability is small if p(X = 0) is small
• p(y | X = 0) = p(X = 0, y)/p(X = 0)
• p(X = 0, y) = probability that a book is fiction and has y pages (imagine randomly sampling a book)
• p(X = 0) = probability that a book is fiction
ANOTHER EXAMPLE
• Two types of books: fiction (X=0) and non-fiction (X=1)
• Let Y be a random variable over the reals, which corresponds to amount of money made
• p(y | X = 0) and p(y | X = 1) are different distributions
• e.g., even if both p(y | X = 0) and p(y | X = 1) are Gaussian, they likely have different means and variances
THINK-PAIR-SHARE (5 MINUTES)

• How might you use a given Poisson distribution that models commute times? (Hint: recall modes)
• How might you pick lambda for a Poisson distribution, to model commute times? (Hint: the mean of a Poisson is lambda)
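As a nudge toward both questions, the Poisson pmf p(k) = λ^k e^{−λ} / k! is easy to evaluate directly, and its mode (the most likely value) sits at floor(λ) for non-integer λ. A sketch with λ = 10.5 standing in for an average commute of 10.5 minutes (the value is invented for illustration):

```python
import math

lam = 10.5  # hypothetical average commute time in minutes (invented value)

def poisson_pmf(k, lam):
    # p(k) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# The most likely commute time is the mode of the distribution;
# for non-integer lam, the Poisson mode is floor(lam)
mode = max(range(60), key=lambda k: poisson_pmf(k, lam))
print(mode)  # 10
```

Since the mean of a Poisson is λ, a natural way to pick λ is to use the average of the recorded commute times.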
Review so far
• PMFs (discrete) and PDFs (continuous) • Joint probabilities • Marginals • Conditional probabilities
CHAIN RULE
Two variable example:

p(x, y) = p(x|y) p(y) = p(y|x) p(x)
p(x_1, ..., x_k) = p(x_k) ∏_{i=1}^{k-1} p(x_i | x_{i+1}, ..., x_k)
                 = p(x_1) ∏_{i=2}^{k} p(x_i | x_1, ..., x_{i-1})
CHAIN RULE
Three variable example
p(x, y, z) = p(y|x, z) p(x, z) = p(y|x, z) p(x|z) p(z)
           = p(x|y, z) p(y|z) p(z)
           = p(x|y, z) p(z|y) p(y)
. . .
p(x1, . . . , xk) = p(xk)k�1Y
i=1
p(xi|xi+1, . . . , xk)<latexit sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit><latexit sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit><latexit sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit><latexit 
sha1_base64="LMJGMgMfsBXQ42t5fd2istmQ/is=">AAACNXicbVDLSgMxFM3UV62vqks3wSJUrGUigm4KRTcuXFSwD2jrkMmkbWhmMiQZaRnnp9z4H6504UIRt/6C6WNhqwcCh3PO5eYeN+RMadt+tVILi0vLK+nVzNr6xuZWdnunpkQkCa0SwYVsuFhRzgJa1Uxz2gglxb7Lad3tX478+j2ViongVg9D2vZxN2AdRrA2kpO9DvMDBxVgi3tCqwIcOP1DWIIj1ZBWKIXnxKyEkru4f4ySscHgg8nF7Agls4NONmcX7THgX4KmJAemqDjZ55YnSOTTQBOOlWoiO9TtGEvNCKdJphUpGmLSx13aNDTAPlXteHx1Ag+M4sGOkOYFGo7V3xMx9pUa+q5J+lj31Lw3Ev/zmpHunLdjFoSRpgGZLOpEHGoBRxVCj0lKNB8agolk5q+Q9LDERJuiM6YENH/yX1I7KSK7iG5Oc+WLaR1psAf2QR4gcAbK4ApUQBUQ8AhewDv4sJ6sN+vT+ppEU9Z0ZhfMwPr+Af93qJM=</latexit>
HOW DO WE GET BAYES RULE?
Recall chain rule: p(x, y) = p(x|y)p(y) = p(y|x)p(x)
Bayes rule:

p(y|x) = p(x|y) p(y) / p(x)
EXAMPLE: DRUG TEST
• Imagine p(DT = True | User = True) = 0.99
• Imagine p(User = True) = 0.005
• Imagine p(DT = True | User = False) = 0.01
• What is p(User = True | DT = True)?
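The answer follows directly from Bayes rule, expanding the denominator p(DT = True) with the law of total probability. A minimal sketch using the slide's numbers (the code itself is illustrative, not part of the deck):

```python
# Bayes rule for the drug-test example on the slide.
p_dt_given_user = 0.99      # p(DT = True | User = True)
p_user = 0.005              # p(User = True)
p_dt_given_nonuser = 0.01   # p(DT = True | User = False)

# Law of total probability gives the denominator p(DT = True).
p_dt = p_dt_given_user * p_user + p_dt_given_nonuser * (1 - p_user)

p_user_given_dt = p_dt_given_user * p_user / p_dt
print(round(p_user_given_dt, 3))  # 0.332
```

Despite the 99% test accuracy, a positive test means only about a 33% chance of being a user, because users are rare.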
REMINDERS: JANUARY 14, 2020
• Thought Questions 1 due on January 23 • Chapters 1-3 • You should be reading now • if you read the material before I lecture, it will (a)
help you understand a lot better and (b) be easier to write coherent thought questions
• Assignment 1 due on January 30 • Any questions?
INDEPENDENCE OF RANDOM VARIABLES
X and Y are independent if p(x, y) = p(x) p(y) (we drop subscripts, writing p(x, y) rather than pX,Y(x, y))

X and Y are conditionally independent given Z if p(x, y|z) = p(x|z) p(y|z)
CONDITIONAL INDEPENDENCE EXAMPLES EXAMPLE 7 IN THE NOTES
• Imagine you have a biased coin (does not flip 50% heads and 50% tails, but skewed towards one)
• Let Z = bias of a coin (say outcomes are 0.3, 0.5, 0.8 with associated probabilities 0.7, 0.2, 0.1)
• what other outcome space could we consider? • what kinds of distributions?
• Let X and Y be consecutive flips of the coin • Are X and Y independent? • Are X and Y conditionally independent, given Z?
**(Basic example about an important issue in ML: hidden variables)
CONDITIONAL INDEPENDENCE IS RELATIVE TO DISTRIBUTION YOU PICK, NOT OBJECTIVE FOR ALL DISTRIBUTIONS
• Explained on whiteboard • What are p(X) and p(X | Z)?
CONDITIONAL INDEPENDENCE EXAMPLES EXAMPLE 7 IN THE NOTES
• Are X and Y independent? (don’t know Z) • Are X and Y conditionally independent, given Z?
**(Basic example about an important issue in ML: hidden variables)
[Diagram: Z (the coin's bias) with arrows to X and Y, the two flips, each taking value 0 or 1]

z:    0.3  0.5  0.8    (bias)
p(z): 0.7  0.2  0.1    (probability of that bias)
Imagine you don't know Z and flip two 0s. Does that tell you anything about Z?
p(X, Y | Z) = p(X|Z) p(Y|Z)?

p(X, Y) = p(X) p(Y)?
EXPECTED VALUE (MEAN, AVERAGE)
E[X] = Σ_{x∈𝒳} x p(x)        X : discrete
E[X] = ∫_𝒳 x p(x) dx         X : continuous
Exercise: Biased coin with p(Heads) = 0.8
EXPECTATIONS WITH FUNCTIONS

1.2.5 Expectations and moments

Expectations of functions are defined as sums (or integrals) of function values weighted according to the probability mass (or density) function. Given a probability space (𝒳, B(𝒳), P_X), we consider a function f : 𝒳 → ℂ and define its expectation function as

E[f(X)] = Σ_{x∈𝒳} f(x) p(x)       X : discrete
E[f(X)] = ∫_𝒳 f(x) p(x) dx        X : continuous

Note that we use a capital X for the random variable (with f(X) the random variable transformed by f) and lower case x when it is an instance (e.g., p(x) is the probability of a specific outcome). The capital X in E[f(X)] also specifies that the summation (integration) is defined over p(x); this will become more important later when we consider multiple random variables. It can happen that E[f(X)] = ±∞; in such cases we say that the expectation does not exist or is not well-defined.⁴ For f(x) = x, we have a standard expectation E[X] = Σ x p(x), or the mean value of X. Using f(x) = x^k results in the k-th moment, f(x) = log 1/p(x) gives the well-known entropy function H(X), or differential entropy for continuous random variables, and f(x) = (x − E[X])² provides the variance of a random variable X, denoted by V[X]. Interestingly, the probability of some event A ⊆ 𝒳 ⁵ can also be expressed in the form of expectation; i.e.,

P(A) = E[1(X ∈ A)],

where

1(t) = 1 if t is true, 0 if t is false        (1.5)

is an indicator function. With this, it is possible to express the cumulative distribution function as F_X(t) = E[1(X ∈ (−∞, t])].

Function f(x) inside the expectation can also be complex-valued. For example, φ_X(t) = E[e^{itX}], where i is the imaginary unit, defines the characteristic function of X. The characteristic function is closely related to the inverse Fourier transform of p(x) and is useful in many forms of statistical inference. Several expectation functions are summarized in Table 1.1.

Given two random variables X and Y and a specific value x assigned to X, we define the conditional expectation as

E[f(Y)|x] = Σ_{y∈𝒴} f(y) p(y|x)       Y : discrete
E[f(Y)|x] = ∫_𝒴 f(y) p(y|x) dy        Y : continuous

⁴ There is sometimes disagreement on terminology, and some definitions allow the expected value to be infinite, which, for example, still allows the strong law of large numbers. In that setting, an expectation is not well-defined only if both left and right improper integrals are infinite. For our purposes, this is splitting hairs.
⁵ This notation is a bit loose; we should say A ∈ B(𝒳) instead of A ⊆ 𝒳. We used it to re-emphasize (through this footnote) that some subsets of continuous sets are not measurable.
On the slide, f : 𝒳 → ℝ and E[f(X)] = Σ_{x∈𝒳} f(x) p(x)

Exercise: Imagine you get 10 dollars if a Heads is flipped, lose 3 if Tails.
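With the payoff f(Heads) = 10, f(Tails) = −3 and the biased coin p(Heads) = 0.8 from the earlier exercise, E[f(X)] is a two-term sum; a quick sketch (illustrative, not from the slides):

```python
# Expected payoff: win $10 on heads, lose $3 on tails, with p(Heads) = 0.8.
p = {"H": 0.8, "T": 0.2}
f = {"H": 10.0, "T": -3.0}

# E[f(X)] = sum over outcomes of f(x) p(x).
expected_payoff = sum(p[x] * f[x] for x in p)
print(expected_payoff)  # 7.4
```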
CONDITIONAL EXPECTATIONS
E[Y | X = x] = Σ_{y∈𝒴} y p(y|x)       Y : discrete
E[Y | X = x] = ∫_𝒴 y p(y|x) dy        Y : continuous
Different expected value, depending on which x is observed
3 MINUTE BREAK
E[Y | X = x] = Σ_{y∈𝒴} y p(y|x)       Y : discrete
E[Y | X = x] = ∫_𝒴 y p(y|x) dy        Y : continuous
What is the E[Y | x] for the following?
- p(y | x) = Gaussian with N(mu = x, sigma^2 = 10)
- p(y | x) = Gaussian with N(mu = f(x), sigma^2 = 0.1)
- p(y | x) = Bernoulli with alpha = x
Turn to your neighbour and discuss a question that you have If you have nothing, consider the questions below
PROPERTIES OF EXPECTATIONS
• E[cX] = c E[X], for a constant c • E[X + Y] = E[X] + E[Y] (linearity of expectation) • If X and Y independent, then E[XY] = E[X] E[Y] • E[Y] = E[E[Y | X]], where outer expectation over X
• called Law of Total Expectation • (prove these on the whiteboard)
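A quick numeric check of the first three properties; the two independent fair dice here are an illustrative choice, not from the slides:

```python
# Linearity of expectation and the independence product rule, checked exactly
# for two independent fair six-sided dice X and Y.
vals = range(1, 7)
p = 1 / 6

e_x = sum(v * p for v in vals)  # E[X] = E[Y] = 3.5

# E[X + Y] and E[XY], summing over the 36 equally likely pairs.
e_sum = sum((a + b) * p * p for a in vals for b in vals)
e_prod = sum(a * b * p * p for a in vals for b in vals)

print(round(e_sum, 9))   # 7.0  = E[X] + E[Y]
print(round(e_prod, 9))  # 12.25 = E[X] * E[Y]
```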
VARIANCE
Variance(X) = E[(X − E[X])²]
            = E[X²] − E[X]²
Why? See if you can get this formula
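The identity can be checked numerically (a check, not the derivation the slide asks for) on the biased coin with p(Heads) = 0.8, X = 1 for heads:

```python
# Both forms of the variance for the biased coin p(Heads) = 0.8.
p = 0.8
e_x  = 1 * p + 0 * (1 - p)        # E[X]
e_x2 = 1 * p + 0 * (1 - p)        # E[X^2] (same here, since 1^2 = 1)

var_definition = p * (1 - e_x) ** 2 + (1 - p) * (0 - e_x) ** 2  # E[(X - E[X])^2]
var_shortcut   = e_x2 - e_x ** 2                                # E[X^2] - E[X]^2

print(round(var_definition, 9))  # 0.16
print(round(var_shortcut, 9))    # 0.16
```

Both agree with the Bernoulli formula p(1 − p) = 0.16.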
COVARIANCE
where f : 𝒴 → ℂ is some function. Again, using f(y) = y results in E[Y|x] = Σ y p(y|x) or E[Y|x] = ∫ y p(y|x) dy. We shall see later that under some conditions E[Y|x] is referred to as the regression function. These types of integrals are often seen and evaluated in Bayesian statistics.

For two random variables X and Y we also define

E[f(X, Y)] = Σ_{x∈𝒳} Σ_{y∈𝒴} f(x, y) p(x, y)        X, Y : discrete
E[f(X, Y)] = ∫_𝒳 ∫_𝒴 f(x, y) p(x, y) dx dy          X, Y : continuous

Expectations can also be defined over a single variable

E[f(X, y)] = Σ_{x∈𝒳} f(x, y) p(x)       X : discrete
E[f(X, y)] = ∫_𝒳 f(x, y) p(x) dx        X : continuous

where E[f(X, y)] is now a function of y. We define the covariance function as

Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
          = E[XY] − E[X] E[Y],

with Cov[X, X] = V[X] being the variance of the random variable X. Similarly, we define a correlation function as

Corr[X, Y] = Cov[X, Y] / √(V[X] · V[Y]),

which is simply a covariance function normalized by the product of standard deviations. Both covariance and correlation functions have wide applicability in statistics, machine learning, signal processing and many other disciplines. Several important expectations for two random variables are listed in Table 1.2.

Example 6: Three tosses of a fair coin (yet again). Consider two random variables from Examples 3 and 5, and calculate the expectation and variance for both X and Y. Then calculate E[Y | X = 0].

We start by calculating E[X] = 0 · pX(0) + 1 · pX(1) = 1/2. Similarly,

E[Y] = Σ_{y=0}^{3} y · pY(y)
     = pY(1) + 2 pY(2) + 3 pY(3)
     = 3/2

The conditional expectation can be found as

E[Y | X = 0] = Σ_{y=0}^{3} y · pY|X(y|0)
             = pY|X(1|0) + 2 pY|X(2|0) + 3 pY|X(3|0)
             = 1
PROPERTIES OF VARIANCES
• V[c] = 0 for a constant c • V[c X] = c^2 V[X] • V[X + Y] = V[X] + V[Y] + 2 Cov[X,Y] • If X and Y are independent, V[X + Y] = V[X] + V[Y]
• i.e., Cov[X,Y] = 0
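These properties can be checked exactly on small discrete distributions; the fair dice below are an illustrative choice, not from the slides:

```python
# Checking V[cX] = c^2 V[X] and, for independent X and Y, V[X + Y] = V[X] + V[Y],
# exactly on two independent fair six-sided dice.
vals = list(range(1, 7))
p = 1 / 6

def var(values, probs):
    """Variance of a discrete distribution given outcomes and probabilities."""
    mean = sum(v * q for v, q in zip(values, probs))
    return sum(q * (v - mean) ** 2 for v, q in zip(values, probs))

v_x  = var(vals, [p] * 6)                       # V[X] = 35/12
v_3x = var([3 * v for v in vals], [p] * 6)      # V[3X]

# Distribution of the sum of two independent dice: 36 equally likely pairs.
sums = [a + b for a in vals for b in vals]
v_sum = var(sums, [p * p] * 36)

print(abs(v_3x - 9 * v_x) < 1e-9)    # True
print(abs(v_sum - 2 * v_x) < 1e-9)   # True
```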
INDEPENDENCE AND DECORRELATION
• Independent RVs have zero correlation • How can we tell? • Hint: use Cov[X,Y] = E[XY] - E[X]E[Y]
• Uncorrelated RVs (zero correlation) might be dependent
• Correlation (Pearson’s correlation) shows linear relationships; can miss nonlinear ones
• Example: X normal RV, Y = X^2 (whiteboard)
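The whiteboard example uses a normal X; the sketch below substitutes a symmetric three-point X so the check is exact (an illustrative substitution, any symmetric X works):

```python
# Zero correlation without independence: X uniform on {-1, 0, 1} and Y = X^2.
xs = [-1, 0, 1]
p = 1 / 3

e_x  = sum(p * x for x in xs)         # E[X]  = 0 by symmetry
e_y  = sum(p * x ** 2 for x in xs)    # E[Y]  = 2/3
e_xy = sum(p * x ** 3 for x in xs)    # E[XY] = E[X^3] = 0 by symmetry

cov = e_xy - e_x * e_y
print(cov)  # 0.0 -> uncorrelated

# Yet Y is a deterministic function of X, so they are clearly dependent:
# p(Y = 1 | X = 1) = 1 while p(Y = 1) = 2/3.
```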
ALTERNATIVES: MUTUAL INFORMATION (USING KL)

• KL-divergence measures how different two distributions are
KL(p||q) = Σ_{x∈𝒳} p(x) log [p(x)/q(x)]
         = Σ_{x∈𝒳} p(x) [log p(x) − log q(x)]
         = Σ_{x∈𝒳} p(x) log p(x) − Σ_{x∈𝒳} p(x) log q(x)

or

KL(p||q) = ∫_𝒳 p(x) log [p(x)/q(x)] dx
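A minimal sketch of the discrete formula, evaluated on two Bernoulli distributions (an illustrative choice, not from the slides):

```python
from math import log

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.2]   # Bernoulli(0.8)
q = [0.5, 0.5]   # Bernoulli(0.5)

print(kl(p, p))            # 0.0 -- KL is zero when the distributions match
print(round(kl(p, q), 3))  # 0.193
```

Note that kl(p, q) and kl(q, p) differ: KL is not symmetric, so it is not a distance in the metric sense.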
ALTERNATIVES: MUTUAL INFORMATION

• KL-divergence measures how different two distributions are
• Mutual Information between X and Y: • I(X; Y) = KL(p_{xy} || p_x p_y) • only zero when X and Y are independent • measure of price for encoding (X,Y) as independent
RVs, even when they are not
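Tying this back to Example 7: the two flips of the random-bias coin are marginally dependent, so their mutual information is positive. A sketch (illustrative; the joint is computed by marginalizing out Z):

```python
from math import log

# Mutual information I(X; Y) = KL(p_xy || p_x p_y) for the two coin flips
# of Example 7, whose shared bias Z makes them dependent.
biases, p_z = [0.3, 0.5, 0.8], [0.7, 0.2, 0.1]

# Joint distribution over (x, y), marginalizing out the bias Z.
joint = {(x, y): sum(pz * (z if x else 1 - z) * (z if y else 1 - z)
                     for z, pz in zip(biases, p_z))
         for x in [0, 1] for y in [0, 1]}
p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in [0, 1]}

mi = sum(pxy * log(pxy / (p_x[x] * p_x[y]))
         for (x, y), pxy in joint.items())
print(mi > 0)  # True -- X and Y are not independent
```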
EXERCISE: MODELLING COMMUTE TIMES
• Let’s imagine we have 365 samples of commute times • Say you wanted to model commute time C as a
continuous RV (takes 7.6 minutes to get to work) • This means we have to specify (or learn) the pdf; how? • One option: pick distribution type (e.g., Gaussian), and
find the “best” parameters that match the data • What are the parameters to learn? • Is a Gaussian a good choice?
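One way the "pick a Gaussian and fit it" option looks in code: the maximum likelihood parameters are just the sample mean and sample variance. The commute-time data below is synthetic (drawn from a gamma, echoing the next slide), purely for illustration:

```python
import random

# Maximum likelihood Gaussian fit to 365 synthetic commute times (minutes).
random.seed(0)
commutes = [random.gammavariate(20, 0.4) for _ in range(365)]  # made-up data

# MLE for a Gaussian: sample mean and (uncorrected) sample variance.
mu = sum(commutes) / len(commutes)
sigma2 = sum((c - mu) ** 2 for c in commutes) / len(commutes)

print(mu > 0 and sigma2 > 0)  # True
```

So there are exactly two parameters to learn. Whether a Gaussian is a good choice is a separate question: it puts mass on negative commute times, which is one reason the next slide prefers a gamma.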
EXERCISE: MODELLING COMMUTE TIMES (CONT.)
• Note a better choice is actually a Gamma dist.
• A Gaussian distribution (or gamma) for commute times extrapolates between the recorded times in minutes
EXERCISE: CONDITIONAL PROBABILITIES
• Using conditional probabilities, we can incorporate other external information (features)
• Let y be the commute time, x the day of the year • Array of conditional probability values —> p(y | x)
• y = 1, 2, … and x = 1, 2, …, 365 • What other x could we use?
EXERCISE: ADDING IN AUXILIARY INFORMATION (1)
• Mean, variance for p(y | x) could depend on value of x • Example: X = 1 if slippery out, and X = 0 else
Gaussian denoted by N

p(y | X = 0) = N(μ0, σ0²)
p(y | X = 1) = N(μ1, σ1²)
EXERCISE: ADDING IN AUXILIARY INFORMATION (2)
• Can incorporate external information (features) by modeling parameter = function(features)
μ = Σ_{j=1}^{d} w_j x_j

p(y | x) = N(μ = Σ_{j=1}^{d} w_j x_j, σ²)

Gaussian denoted by N
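A sketch of sampling from such a conditional model, where the mean is a linear function of the features; the weights and feature values are made up for illustration, not from the lecture:

```python
import random

# Conditional model p(y | x) = N(mu = w . x, sigma^2) with hypothetical weights.
random.seed(1)
w = [2.0, 0.5]     # made-up "learned" weights
sigma = 1.0

def sample_y(x):
    """Draw y ~ N(dot(w, x), sigma^2) for feature vector x."""
    mu = sum(wj * xj for wj, xj in zip(w, x))
    return random.gauss(mu, sigma)

x = [1.0, 4.0]     # e.g. a bias feature and one external measurement
ys = [sample_y(x) for _ in range(10000)]
print(abs(sum(ys) / len(ys) - 4.0) < 0.1)  # sample mean is near mu = 4.0
```

Different x shift the mean of y, which is exactly the "parameter = function(features)" idea on the slide.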
MIXTURES OF DISTRIBUTIONS
MIXTURES OF GAUSSIANS
* Image from https://people.ucsc.edu/~ealdrich/Teaching/Econ114/LectureNotes/kde.html
MIXTURES CAN PRODUCE COMPLEX DISTRIBUTIONS
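A two-component Gaussian mixture density can be written directly from the definitions; the weights and parameters below are made up to produce a bimodal shape like the figure's:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of Gaussian densities: a mixture distribution."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# 30% of the mass around 0 and 70% around 5 gives a bimodal density:
w, mus, sigmas = [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]
print(mixture_pdf(0.0, w, mus, sigmas) > mixture_pdf(2.5, w, mus, sigmas))  # True
```

The density is high near each component mean and dips in between, a shape no single Gaussian can produce.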
THINK-PAIR-SHARE (5 MINUTES)

• Notice that moving to continuous RVs puts more restrictions on the distributions we can define
• For discrete RVs, distributions are tables of probabilities and so are highly flexible
• we can define any possible distribution
• So, why not just discretize our variables? • Example: imagine you have an RV in range [-10,10] • You decide to discretize into chunks of size 0.01 • How many variables do you need to define the PMF? • What if you had instead modelled it as a Gaussian? Or a mixture of two Gaussians?
[Figure: histograms of observed commute times and number of red lights per trip; y-axis: occurrences]
THINK-PAIR-SHARE
• Example: imagine you have an RV in range [-10,10] • You decide to discretize into chunks of size 0.01 • How many variables do you need to define the PMF? • What if you had instead modelled it as a Gaussian? Or a mixture of two Gaussians?
• Additional question if you have time: imagine you have a 2-dim. RV (in [-10, 10] x [-10,10]). Now imagine you discretize to the same level. How many variables in this PMF? (i.e., this multidimensional array?)
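One way to count (this sketches an answer to the exercise, so skip it if you want to work it out first):

```python
# Parameters needed to discretize [-10, 10] into chunks of size 0.01,
# versus fitting parametric distributions.
bins_1d = round((10 - (-10)) / 0.01)
print(bins_1d)   # 2000 table entries (1999 free parameters, since they sum to 1)

gaussian_params = 2          # mean and variance
mixture_params = 2 * 2 + 1   # two means, two variances, one mixture weight

# The 2-d version squares the table size:
bins_2d = bins_1d ** 2
print(bins_2d)   # 4000000 entries in the 2-d PMF table
```

The table grows exponentially with dimension, while the parametric models stay tiny: the trade-off the exercise is pointing at.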
EXERCISE: SAMPLE AVERAGE IS AN UNBIASED ESTIMATOR
Obtain instances x1, ..., xn. What can we say about the sample average?

This sample is random, so we consider i.i.d. random variables X1, ..., Xn. This reflects that we could have seen a different set of instances xi.
E[(1/n) Σ_{i=1}^{n} X_i] = (1/n) Σ_{i=1}^{n} E[X_i]
                         = (1/n) Σ_{i=1}^{n} μ
                         = μ
For any one sample x1, ..., xn, it is unlikely that (1/n) Σ_{i=1}^{n} x_i = μ
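The derivation can also be illustrated by simulation: any single sample average fluctuates around μ, but averaging many independent sample averages recovers μ. The Bernoulli mean 0.39 below is an arbitrary illustrative choice:

```python
import random

# Unbiasedness of the sample average, by simulation.
random.seed(42)
mu, n = 0.39, 50   # true mean of a Bernoulli variable, and the sample size

def sample_average():
    """Average of n i.i.d. Bernoulli(mu) draws: one realization of the estimator."""
    return sum(1 if random.random() < mu else 0 for _ in range(n)) / n

one_avg = sample_average()                       # usually not exactly mu
many = [sample_average() for _ in range(20000)]  # many independent repetitions
grand_mean = sum(many) / len(many)

print(abs(grand_mean - mu) < 0.005)  # True: on average the estimator hits mu
```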