linear algebra and probability theory

8/3/2019 Linear Algebra and Probability Theory

http://slidepdf.com/reader/full/linear-algebra-and-probability-theory 1/47

Quick Tour of Basic Linear Algebra and Probability Theory

Quick Tour of Basic Linear Algebra and

Probability Theory

CS246: Mining Massive Data Sets

Winter 2011

http://goforward/

http://find/

http://goback/




Basic Linear Algebra

Outline

1 Basic Linear Algebra

2 Basic Probability Theory

http://goforward/

http://find/

http://goback/





Matrices and Vectors

Matrix: A rectangular array of numbers, e.g., A ∈ Rm ×n :

A =

a 11 a 12 . . . a 1n a 21 a 22 . . . a 2n

... ... ... ...

a m 1 a m 2 . . . a mn

http://goforward/

http://find/

http://goback/





Matrices and Vectors

Matrix: A rectangular array of numbers, e.g., A ∈ Rm ×n :

A =

a 11 a 12 . . . a 1n a 21 a 22 . . . a 2n

... ... ... ...

a m 1 a m 2 . . . a mn

Vector: A matrix consisting of only one column (default) or

one row, e.g., x ∈Rn

x =

x 1x 2...

x n

http://goforward/

http://find/

http://goback/





Matrix Multiplication

If A ∈ Rm ×n , B ∈ Rn ×p , C = AB , then C ∈ Rm ×p :

C ij =

n k =1

Aik B kj

http://goforward/

http://find/

http://goback/





Matrix Multiplication

If A ∈ Rm ×n , B ∈ Rn ×p , C = AB , then C ∈ Rm ×p :

C ij =

n k =1

Aik B kj

Special cases: Matrix-vector product, inner product of two

vectors. e.g., with x , y

∈Rn :

x T y =n

i =1

x i y i ∈ R

Q i k T f B i Li Al b d P b bili Th

http://goforward/

http://find/

http://goback/





Properties of Matrix Multiplication

Associative: (AB )C = A(BC )

Q i k T f B i Li Al b d P b bilit Th

http://goforward/

http://find/

http://goback/







Distributive: A(B + C ) = AB + AC


http://goforward/

http://find/

http://goback/








Non-commutative: AB = BA


http://goforward/

http://find/

http://goback/









Block multiplication: If A = [Aik ], B = [B kj ], where Aik ’s and

B kj ’s are matrix blocks, and the number of columns in Aik is

equal to the number of rows in B kj , then C = AB = [C ij ]where C ij =

k Aik B kj


http://goforward/

http://find/

http://goback/









Block multiplication: If A = [Aik ], B = [B kj ], where Aik ’s and

B kj ’s are matrix blocks, and the number of columns in Aik is

equal to the number of rows in B kj , then C = AB = [C ij ]where C ij =

k Aik B kj

Example: If−→x ∈ Rn and A = [

−→a 1| −→a 2| . . . | −→a n ] ∈ Rm ×n ,

B = [−→b 1| −→b 2| . . . | −→b p ] ∈ Rn ×p

:

A−→x =

n i =1

x i −→a i

AB = [A−→b 1|A−→b 2| . . . |A−→b p ]


http://goforward/

http://find/

http://goback/





Operators and properties

Transpose: A ∈ Rm ×n , then AT ∈ Rn ×m : (AT )ij = A ji


http://goforward/

http://find/

http://goback/






Transpose: A ∈ Rm ×n , then AT ∈ Rn ×m : (AT )ij = A ji Properties:

(AT

)T

= A(AB )T = B T AT

(A + B )T = AT + B T


http://goforward/

http://find/

http://goback/



g y y




(AT

)T

= A(AB )T = B T AT

(A + B )T = AT + B T

Trace: A ∈ Rn ×n , then: tr (A) =n

i =1 Aii


http://goforward/

http://find/

http://goback/



g y y




(A

T

)

T

= A(AB )T = B T AT

(A + B )T = AT + B T

Trace: A ∈ Rn ×n , then: tr (A) =n

i =1 Aii

Properties:

tr (A) = tr (AT )tr (A + B ) = tr (A) + tr (B )tr (λA) = λtr (A)If AB is a square matrix, tr (AB ) = tr (BA)


http://goforward/

http://find/

http://goback/




Special types of matrices

Identity matrix: I = I n ∈ Rn ×n :

I ij =

1 i=j,

0 otherwise.

∀A ∈ Rm ×n : AI n = I m A = A


http://goforward/

http://find/

http://goback/






I ij =

1 i=j,

0 otherwise.

∀A ∈ Rm ×n : AI n = I m A = A

Diagonal matrix: D = diag (d 1,d 2, . . . , d n ):

D ij =d i j=i,

0 otherwise.


http://goforward/

http://find/

http://goback/






I ij =

1 i=j,

0 otherwise.

∀A ∈ Rm ×n : AI n = I m A = A


D ij =d i j=i,

0 otherwise.

Symmetric matrices: A ∈ Rn ×n is symmetric if A = AT .


http://goforward/

http://find/

http://goback/






I ij =

1 i=j,

0 otherwise.

∀A ∈ Rm ×n : AI n = I m A = A


D ij =d i j=i,

0 otherwise.

Symmetric matrices: A ∈ Rn ×n is symmetric if A = AT .

Orthogonal matrices: U

∈Rn ×n is orthogonal if

UU T = I = U T U


http://goforward/

http://find/

http://goback/




Linear Independence and Rank

A set of vectors x 1, . . . , x n is linearly independent if

α1, . . . , αn

: n

i =1 αi x i = 0


http://goforward/

http://find/

http://goback/






α1, . . . , αn

: n

i =1 αi x i = 0

Rank: A ∈ Rm ×n , then rank (A) is the maximum number of

linearly independent columns (or equivalently, rows)


http://goforward/

http://find/

http://goback/






α1, . . . , αn

: n

i =1 αi x i = 0

Rank: A ∈ Rm ×n , then rank (A) is the maximum number of

linearly independent columns (or equivalently, rows)

Properties:

rank (A) ≤ minm ,n rank (A) = rank (A

T

)rank (AB ) ≤ minrank (A), rank (B )rank (A + B ) ≤ rank (A) + rank (B )


http://goforward/

http://find/

http://goback/




Matrix Inversion

If A∈Rn ×n , rank (A) = n , then the inverse of A, denoted

A−1 is the matrix that: AA−1 = A−1A = I

Properties:

(A−1)−1 = A(AB )−1 = B −1A−1

(A

−1

)

T

= (A

T

)

−1


B i Li Al b

http://goforward/

http://find/

http://goback/




Range and Nullspace of a Matrix

Span: span (x 1, . . . , x n ) =

n i =1 αi x i |αi ∈ R


B i Li Al b

http://goforward/

http://find/

http://goback/





Span: span (x 1, . . . , x n ) =


Projection:Proj (y ; x i 1≤i ≤n ) = argmin v ∈span (x i 1≤i ≤n )||y − v ||2



http://goforward/

http://find/

http://goback/





Span: span (x 1, . . . , x n ) =


Projection:Proj (y ; x i 1≤i ≤n ) = argmin v ∈span (x i 1≤i ≤n )||y − v ||2Range: A ∈ Rm ×n , then R(A) = Ax | x ∈ R n is the span

of the columns of A



http://goforward/

http://find/

http://goback/





Span: span (x 1, . . . , x n ) =



of the columns of A

Proj (y ,A) = A(A

T

A)

−1

A

T

y



http://goforward/

http://find/

http://goback/





Span: span (x 1, . . . , x n ) =



of the columns of A

Proj (y ,A

) =A

(AT A

)

−1AT y

Nullspace: null (A) = x ∈ Rn |Ax = 0



http://goforward/

http://find/

http://goback/




Determinant

A ∈ Rn ×n , a 1, . . . , a n the rows of A,

S = n i =1 αi a i | 0 ≤ αi ≤ 1, then det (A) is the volume of

S .Properties:

det (I ) = 1det (λA) = λdet (A)det (AT ) = det (A)

det (AB ) = det (A)det (B )det (A) = 0 if and only if A is invertible.If A invertible, then det (A−1) = det (A)



http://goforward/

http://find/

http://goback/



as c ea geb a

Quadratic Forms and Positive Semidefinite Matrices

A ∈ Rn ×n , x ∈ Rn , x T Ax is called a quadratic form:

x

T

Ax =

1≤i , j ≤n Aij x i x j



http://goforward/

http://find/

http://goback/



g

Quadratic Forms and Positive Semidefinite Matrices

A ∈ Rn ×n , x ∈ Rn , x T Ax is called a quadratic form:

x

T

Ax =

1≤i , j ≤n Aij x i x j

A is positive definite if ∀ x ∈ Rn : x T Ax > 0

A is positive semidefinite if∀x

∈Rn : x T Ax

≥0

A is negative definite if ∀ x ∈ Rn : x T Ax < 0

A is negative semidefinite if ∀ x ∈ Rn : x T Ax ≤ 0



http://goforward/

http://find/

http://goback/



Eigenvalues and Eigenvectors

A ∈ Rn ×n , λ ∈ C is an eigenvalue of A with the

corresponding eigenvector x ∈ Cn (x = 0) if:

Ax = λx

eigenvalues: the n possibly complex roots of the

polynomial equation det (A− λI ) = 0, and denoted as

λ1, . . . , λn

Properties:tr (A) =

n i =1 λi

det (A) =n

i =1 λi rank (A) = |1 ≤ i ≤ n |λi = 0|



http://goforward/

http://find/

http://goback/



Matrix Eigendecomposition

A ∈ Rn ×n , λ1, . . . , λn the eigenvalues, and x 1, . . . , x n the

eigenvectors. X = [x 1|x 2| . . . |x n ], Λ = diag (λ1, . . . , λn ),

then AX = X Λ.

A called diagonalizable if X invertible: A = X ΛX −1

If A symmetric, then all eigenvalues real, and X orthogonal

(hence denoted by U = [u 1|u 2| . . . |u n ]):

A = U ΛU T =

n i =1

λi u i u T i

A special case of Signular Value Decomposition


Basic Probability Theory

http://goforward/

http://find/

http://goback/



Outline

1 Basic Linear Algebra

2 Basic Probability Theory



http://goforward/

http://find/

http://goback/



Elements of Probability

Sample Space Ω: Set of all possible outcomes

Event Space F : A family of subsets of Ω

Probability Measure: Function P : F → R with properties:

1 P (A) ≥ 0 (∀A ∈ F )2 P (Ω) = 13 A

i ’s disjoint, then P

(

i Ai ) =

i P

(Ai )



http://goforward/

http://find/

http://goback/



Conditional Probability and Independence

For events A,B :

P (A|B ) =P (A

B )

P (B )

A, B independent if P (A|B ) = P (A) or equivalently:

P (AB ) = P (A)P (B )



http://goforward/

http://find/

http://goback/



Random Variables and Distributions

A random variable X is a function X : Ω → RExample: Number of heads in 20 tosses of a coin

Probabilities of events associated with random variables

defined based on the original probability function. e.g.,

P (X = k ) = P (ω ∈ Ω|X (ω) = k )

Cumulative Distribution Function (CDF) F X : R → [0, 1]:F X (x ) = P (X

≤x )

Probability Mass Function (pmf): X discrete thenp X (x ) = P (X = x )

Probability Density Function (pdf): f X (x ) = dF X (x )/dx



http://goforward/

http://find/

http://goback/



Properties of Distribution Functions

CDF:

0 ≤ F X (x ) ≤ 1F X monotone increasing, with lim x →−∞F X (x ) = 0,

lim x →∞

F X (x ) = 1pmf:

0 ≤ p X (x ) ≤ 1x p X (x ) = 1

x ∈A p X (x ) = p X (A)

pdf:f X (x ) ≥ 0 ∞

−∞f X (x )dx = 1

x ∈Af X (x )dx = P (X ∈ A)



http://goforward/

http://find/

http://goback/



Expectation and Variance

Assume random variable X has pdf f X (x ), and g : R → R.

Then

E [g (X )] = ∞

−∞

g (x )f X (x )dx

for discrete X , E [g (X )] =

x g (x )p X (x )

Properties:

for any constant a ∈ R, E [a ] = a E

[ag

(X

)] =aE

[g

(X

)]Linearity of Expectation:E [g (X ) + h (X )] = E [g (X )] + E [h (X )]

Var [X ] = E [(X − E [X ])2]



http://goforward/

http://find/

http://goback/



Some Common Random Variables

X ∼ Bernoulli (p ) (0 ≤ p ≤ 1):

p X (x ) =

p x=1,

1 − p x=0.

X ∼ Geometric (p ) (0 ≤ p ≤ 1): p X (x ) = p (1 − p )x −1

X ∼ Uniform (a ,b ) (a < b ):

f X (x ) = 1

b −a a ≤ x ≤ b ,

0 otherwise.

X ∼ Normal (µ, σ2):

f X (x ) =1√2πσ

e − 1

2σ2 (x −µ)2



http://goforward/

http://find/

http://goback/



Multiple Random Variables and Joint Distributions

X 1, . . . ,X n random variables

Joint CDF: F X 1,...,X n (x 1, . . . , x n ) = P (X 1 ≤ x 1, . . . ,X n ≤ x n )

Joint pdf: f X 1,...,X n (x 1, . . . , x n ) =∂ n F X 1,...,X n (x 1,...,x n )

∂ x 1...∂ x n

Marginalization:f X 1 (x 1) =

∞−∞ . . .

∞−∞ f X 1,...,X n (x 1, . . . , x n )dx 2 . . .dx n

Conditioning: f X 1|X 2,...,X n (x 1|x 2, . . . , x n ) =f X 1,...,X n (x 1,...,x n )

f X 2,...,X n (x 2,...,x n )

Chain Rule: f (x 1, . . . , x n ) = f (x 1)n

i =2 f (x i |x 1, . . . , x i −1)Independence: f (x 1, . . . , x n ) =

n i =1 f (x i ).

More generally, events A1, . . . ,An independent if

P (

i ∈S Ai ) =

i ∈S P (Ai ) (∀S ⊆ 1, . . . , n ).



http://goforward/

http://find/

http://goback/



Random Vectors

X 1, . . . ,X n random variables. X = [X 1X 2 . . .X n ]T random vector.

If g : Rn → R, then

E [g (X )] = Rn g (x 1, . . . ,x n )f X 1,...,X n (x 1, . . . ,x n )dx 1 . . .dx n

if g : Rn → Rm , g = [g 1 . . . g m ]T , then

E [g (X )] =E [g 1(X )] . . .E [g m (X )]

T Covariance Matrix:

Σ = Cov (X ) = E

(X − E [X ])(X − E [X ])T

Properties of Covariance Matrix:

Σij = Cov [X i ,X j ] = E

(X i − E [X i ])(X j − E [X j ])

Σ symmetric, positive semidefinite



http://goforward/

http://find/

http://goback/



Multivariate Gaussian Distribution

µ ∈ Rn , Σ ∈ Rn ×n symmetric, positive semidefinite

X ∼ N (µ,Σ) n -dimensional Gaussian distribution:

f X (x ) =1

(2π)n /2det (Σ)1/2exp

− 1

2(x − µ)T Σ−1(x − µ)

E [X ] = µCov (X ) = Σ



http://goforward/

http://find/

http://goback/



Parameter Estimation: Maximum Likelihood

Parametrized distribution f X (x ; θ) with parameter(s) θ unknown.IID samples x 1, . . . ,x n observed.

Goal: Estimate θ

MLE: θ = argmax θf (x 1, . . . , x n ; θ)



http://goforward/

http://find/

http://goback/



MLE Example

X ∼ Gaussian (µ, σ2). θ = (µ, σ2) unknown. Samples x 1, . . . , x n .Then:

f (x 1, . . . , x n ;µ, σ2) = ( 12πσ2 )n /2 exp

− n

i =1(x i − µ)2

2σ2

Setting: ∂ log f

∂µ = 0 and ∂ log f ∂σ = 0

Gives:

µMLE =

n i =1 x i n

, σ2MLE =

n i =1(x i − µ)2

n

If not possible to find the optimal point in closed form, iterative

methods such as gradient decent can be used.



http://goforward/

http://find/

http://goback/



Some Useful Inequalities

Markov’s Inequality: X random variable, and a > 0. Then:

P (|X | ≥ a ) ≤ E [|X |]a

Chebyshev’s Inequality: If E [X ] = µ, Var (X ) = σ2, k > 0,

then:

Pr (|X − µ| ≥ k σ) ≤ 1

k 2

Chernoff bound: X 1, . . . ,X n iid random variables, with

E [X i ] = µ, X i

∈ 0,1

(

∀1

≤i

≤n ). Then:

P (|1

n

n i =1

X i − µ| ≥ ) ≤ 2 exp(−2n 2)

Multiple variants of Chernoff-type bounds exist, which can

be useful in different settings



http://goforward/

http://find/

http://goback/



References

1 CS229 notes on basic linear algebra and probability theory

2 Wikipedia!

http://goforward/

http://find/

http://goback/

linear algebra and probability theory

Documents