3
Analytic Geometry
In Chapter 2, we studied vectors, vector spaces, and linear mappings at a general but abstract level. In this chapter, we will add some geometric interpretation and intuition to all of these concepts. In particular, we will look at geometric vectors and compute their lengths and the distances or angles between two vectors. To be able to do this, we equip the vector space with an inner product that induces the geometry of the vector space. Inner products and their corresponding norms and metrics capture the intuitive notions of similarity and distance, which we use to develop the Support Vector Machine in Chapter 12. We will then use the concepts of lengths and angles between vectors to discuss orthogonal projections, which will play a central role when we discuss principal component analysis in Chapter 10 and regression via maximum likelihood estimation in Chapter 9. Figure 3.1 gives an overview of how the concepts in this chapter are related and how they are connected to other chapters of the book.
Figure 3.1 A mindmap of the concepts introduced in this chapter, along with when they are used in other parts of the book. The mindmap connects inner products (which induce norms) with lengths, angles, orthogonal projections, and rotations, and links them to Chapter 4 (matrix decomposition), Chapter 9 (regression), Chapter 10 (dimensionality reduction), and Chapter 12 (classification).
Figure 3.3 For different norms, the red lines indicate the set of vectors with norm 1. Left: Manhattan distance (‖x‖1 = 1); right: Euclidean distance (‖x‖2 = 1).
3.1 Norms

When we think of geometric vectors, i.e., directed line segments that start at the origin, then intuitively the length of a vector is the distance of the "end" of this directed line segment from the origin. In the following, we will discuss the notion of the length of vectors using the concept of a norm.
Definition 3.1 (Norm). A norm on a vector space V is a function

‖ · ‖ : V → R ,   (3.1)
x ↦ ‖x‖ ,   (3.2)

which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

• Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
• Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
• Positive definite: ‖x‖ ≥ 0 and ‖x‖ = 0 ⟺ x = 0.
Figure 3.2 Triangle inequality: c ≤ a + b.

In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration.
Recall that for a vector x ∈ R^n we denote the elements of the vector using a subscript, that is, xi is the ith element of the vector x.

Example (Manhattan Distance)
The Manhattan distance on R^n is defined for x ∈ R^n as

‖x‖1 := ∑_{i=1}^n |xi| ,   (3.3)

where | · | is the absolute value. The left panel of Figure 3.3 indicates all vectors x ∈ R^2 with ‖x‖1 = 1. The Manhattan distance is also called the ℓ1 norm.
Example (Euclidean Norm)
The length of a vector x ∈ R^n is given by

‖x‖2 := √(∑_{i=1}^n xi^2) = √(x^⊤ x) ,   (3.4)

which computes the Euclidean distance of x from the origin. This norm is called the Euclidean norm. The right panel of Figure 3.3 shows all vectors x ∈ R^2 with ‖x‖2 = 1. The Euclidean norm is also called the ℓ2 norm.

Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦
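As a quick numerical illustration (not part of the book; it assumes NumPy is available), both norms can be computed directly:

```python
import numpy as np

x = np.array([1.0, 1.0])

# Manhattan (l1) norm: sum of absolute values, Eq. (3.3)
l1 = np.linalg.norm(x, ord=1)   # 2.0

# Euclidean (l2) norm: square root of the sum of squares, Eq. (3.4)
l2 = np.linalg.norm(x, ord=2)   # sqrt(2) ≈ 1.414

print(l1, l2)
```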
Remark (Inner Products and Norms). Every inner product induces a norm, but there are norms (like the ℓ1 norm) without a corresponding inner product. For an inner product vector space (V, 〈·, ·〉), the induced norm ‖ · ‖ satisfies the Cauchy-Schwarz inequality

|〈x, y〉| ≤ ‖x‖ ‖y‖ .   (3.5)
♦

3.2 Inner Products

Inner products allow for the introduction of intuitive geometrical concepts, such as the length of a vector and the angle or distance between two vectors. A major purpose of inner products is to determine whether vectors are orthogonal to each other.

3.2.1 Dot Product

We may already be familiar with a particular type of inner product, the scalar product/dot product in R^n, which is given by

x^⊤ y = ∑_{i=1}^n xi yi .   (3.6)
We will refer to the particular inner product above as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.

3.2.2 General Inner Products

Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V, then it holds for all x, y, z ∈ V, λ ∈ R that

Ω(λx + y, z) = λΩ(x, z) + Ω(y, z) ,   (3.7)
Ω(x, λy + z) = λΩ(x, y) + Ω(x, z) .   (3.8)
Equation (3.7) asserts that Ω is linear in the first argument, and equation (3.8) asserts that Ω is linear in the second argument.

Definition 3.2. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

• Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V, i.e., the order of the arguments does not matter.
• Ω is called positive definite if

∀x ∈ V \ {0} : Ω(x, x) > 0 ,   Ω(0, 0) = 0 .   (3.9)

Definition 3.3. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

• A positive definite, symmetric bilinear mapping Ω : V × V → R is called an inner product on V. We typically write 〈x, y〉 instead of Ω(x, y).
• The pair (V, 〈·, ·〉) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.6), we call (V, 〈·, ·〉) a Euclidean vector space.

We will refer to the spaces above as inner product spaces in this book.
Example (Inner Product That Is Not the Dot Product)
Consider V = R^2. If we define

〈x, y〉 := x1y1 − (x1y2 + x2y1) + 2x2y2 ,   (3.10)

then 〈·, ·〉 is an inner product, but it is different from the dot product. The proof is left as an exercise (Exercise 3.1).

3.2.3 Symmetric, Positive Definite Matrices

Symmetric, positive definite (SPD) matrices play an important role in machine learning, and they are defined via the inner product.
Consider an n-dimensional vector space V with an inner product 〈·, ·〉 : V × V → R (see Definition 3.3) and an ordered basis B = (b1, . . . , bn) of V. Recall the definition of a basis (Section 2.6.1), which says that any vector can be written as a linear combination of the basis vectors. That is, for any vectors x ∈ V and y ∈ V, we can write them as x = ∑_{i=1}^n ψi bi ∈ V
and y = ∑_{j=1}^n λj bj ∈ V, for ψi, λj ∈ R. Due to the bilinearity of the inner product, it holds for all x and y that

〈x, y〉 = 〈∑_{i=1}^n ψi bi, ∑_{j=1}^n λj bj〉 = ∑_{i=1}^n ∑_{j=1}^n ψi 〈bi, bj〉 λj = x^⊤ A y ,   (3.11)

where Aij := 〈bi, bj〉 and x, y on the right-hand side are the coordinates of x and y with respect to the basis B. This implies that the inner product 〈·, ·〉 is uniquely determined through A. The symmetry of the inner product also means that A is symmetric. Furthermore, the positive definiteness of the inner product implies that

∀x ∈ V \ {0} : x^⊤ A x > 0 .   (3.12)

A symmetric matrix A that satisfies (3.12) is called positive definite.
Example (SPD Matrices)
Consider the following matrices:

A1 = [9 6; 6 5] ,   A2 = [9 6; 6 3] .   (3.13)

Then, A1 is positive definite because it is symmetric and

x^⊤ A1 x = [x1 x2] [9 6; 6 5] [x1; x2]   (3.14)
         = 9x1^2 + 12x1x2 + 5x2^2 = (3x1 + 2x2)^2 + x2^2 > 0   (3.15)

for all x ∈ V \ {0}. However, A2 is symmetric but not positive definite because x^⊤ A2 x = 9x1^2 + 12x1x2 + 3x2^2 = (3x1 + 2x2)^2 − x2^2 can be smaller than 0, e.g., for x = [2, −3]^⊤.
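These claims can be checked numerically (an illustration, not part of the book; NumPy assumed): a symmetric matrix is positive definite exactly when all of its eigenvalues are positive.

```python
import numpy as np

A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])

def is_spd(A):
    """Check symmetry and positive definiteness via the eigenvalues."""
    symmetric = np.allclose(A, A.T)
    eigenvalues = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
    return symmetric and bool(np.all(eigenvalues > 0))

print(is_spd(A1))   # True
print(is_spd(A2))   # False

# The quadratic form x^T A2 x can indeed become negative:
x = np.array([2.0, -3.0])
print(x @ A2 @ x)   # -9.0
```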
If A ∈ R^{n×n} is symmetric, positive definite, then

〈x, y〉 = x^⊤ A y   (3.16)

defines an inner product with respect to an ordered basis B, where x and y are the coordinate representations of x, y ∈ V with respect to B.

Theorem 3.4. For a real-valued, finite-dimensional vector space V and an ordered basis B of V, it holds that 〈·, ·〉 : V × V → R is an inner product if and only if there exists a symmetric, positive definite matrix A ∈ R^{n×n} with

〈x, y〉 = x^⊤ A y .   (3.17)

The following properties hold if A ∈ R^{n×n} is symmetric and positive definite:

• The null space (kernel) of A consists only of 0 because x^⊤ A x > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
• The diagonal elements of A are positive because aii = ei^⊤ A ei > 0, where ei is the ith vector of the standard basis in R^n.

In Section 4.3, we will return to symmetric, positive definite matrices in the context of matrix decompositions.

3.3 Lengths and Distances

In Section 3.1, we already discussed norms that we can use to compute the length of a vector. Inner products and norms are closely related in the sense that any inner product induces a norm

‖x‖ := √〈x, x〉   (3.18)

in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan distance (3.3) is an example of a norm that is not induced by an inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Example (Lengths of Vectors Using Inner Products)
In geometry, we are often interested in the lengths of vectors. We can now use an inner product to compute them using (3.18). Let us take x = [1, 1]^⊤ ∈ R^2. If we use the dot product as the inner product, with (3.18) we obtain

‖x‖ = √(x^⊤ x) = √(1^2 + 1^2) = √2   (3.19)

as the length of x. Let us now choose a different inner product:

〈x, y〉 := x^⊤ [1 −1/2; −1/2 1] y = x1y1 − (1/2)(x1y2 + x2y1) + x2y2 .   (3.20)

If we compute the norm of a vector, then this inner product returns smaller values than the dot product if x1 and x2 have the same sign (and x1x2 > 0); otherwise, it returns greater values than the dot product. With this inner product we obtain

〈x, x〉 = x1^2 − x1x2 + x2^2 = 1 − 1 + 1 = 1 ⟹ ‖x‖ = √1 = 1 ,   (3.21)

such that x is "shorter" with this inner product than with the dot product.
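A short numerical sketch of this example (not part of the book; NumPy assumed). The second inner product corresponds to the matrix A in (3.20), and the induced norm is √(x^⊤ A x):

```python
import numpy as np

x = np.array([1.0, 1.0])

# Dot product as the inner product: ||x|| = sqrt(x^T x)
print(np.sqrt(x @ x))         # 1.414... = sqrt(2)

# Inner product from Eq. (3.20): <x, y> = x^T A y
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])
print(np.sqrt(x @ A @ x))     # 1.0
```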
Definition 3.5 (Distance and Metric). Consider an inner product space (V, 〈·, ·〉). Then

d(x, y) := ‖x − y‖ = √〈x − y, x − y〉   (3.22)

is called the distance of x, y ∈ V. If we use the dot product as the inner product, then the distance is called the Euclidean distance. The mapping
d : V × V → R   (3.23)
(x, y) ↦ d(x, y)   (3.24)

is called a metric.

Remark. Similar to the length of a vector, the distance between vectors does not require an inner product: a norm is sufficient. If we have a norm induced by an inner product, the distance may vary depending on the choice of the inner product. ♦

A metric d satisfies:

1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

3.4 Angles and Orthogonality
Figure 3.4 When restricted to [0, π], f(ω) = cos(ω) returns a unique number in the interval [−1, 1].

The Cauchy-Schwarz inequality (3.5) allows us to define angles ω in inner product spaces between two vectors x, y. Assume that x ≠ 0, y ≠ 0. Then

−1 ≤ 〈x, y〉 / (‖x‖ ‖y‖) ≤ 1 .   (3.25)

Therefore, there exists a unique ω ∈ [0, π] with

cos ω = 〈x, y〉 / (‖x‖ ‖y‖) ;   (3.26)

see Figure 3.4 for an illustration. The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: their orientation is the same.

Figure 3.5 The angle ω between two vectors x, y is computed using the inner product.
Example (Angle Between Vectors)
Let us compute the angle between x = [1, 1]^⊤ ∈ R^2 and y = [1, 2]^⊤ ∈ R^2 (see Figure 3.5), where we use the dot product as the inner product. Then we get

cos ω = 〈x, y〉 / √(〈x, x〉〈y, y〉) = x^⊤ y / √(x^⊤ x y^⊤ y) = 3/√10 ,   (3.27)

and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.
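A numerical version of this computation (illustration only; NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

# cos(omega) = <x, y> / (||x|| ||y||), Eq. (3.26), with the dot product
cos_omega = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
omega = np.arccos(cos_omega)

print(omega)               # ≈ 0.3217 rad
print(np.degrees(omega))   # ≈ 18.43 degrees
```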
The inner product also allows us to characterize vectors that are most dissimilar, i.e., orthogonal.
Definition 3.6 (Orthogonality). Two vectors x and y are orthogonal if and only if 〈x, y〉 = 0, and we write x ⊥ y.

An implication of this definition is that the 0-vector is orthogonal to every vector in the vector space.

Remark. Orthogonality is the generalization of the concept of perpendicularity to bilinear forms that do not have to be the dot product. In our context, geometrically, we can think of orthogonal vectors as having a right angle with respect to a specific inner product. ♦
Example (Orthogonal Vectors)

Figure 3.6 The angle ω between two vectors x, y can change depending on the inner product.

Consider two vectors x = [1, 1]^⊤, y = [−1, 1]^⊤ ∈ R^2; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as the inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product

〈x, y〉 = x^⊤ [2 0; 0 1] y ,   (3.28)

we get that the angle ω between x and y is given by

cos ω = 〈x, y〉 / (‖x‖ ‖y‖) = −1/3 ⟹ ω ≈ 1.91 rad ≈ 109.5° ,   (3.29)

and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
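The same computation can be sketched numerically by swapping the matrix that defines the inner product (illustration only; NumPy assumed, and the helper name is hypothetical):

```python
import numpy as np

def angle(x, y, A):
    """Angle between x and y w.r.t. the inner product <x, y> = x^T A y."""
    inner = x @ A @ y
    norm_x = np.sqrt(x @ A @ x)
    norm_y = np.sqrt(y @ A @ y)
    return np.arccos(inner / (norm_x * norm_y))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])

print(np.degrees(angle(x, y, np.eye(2))))   # 90.0 with the dot product
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
print(np.degrees(angle(x, y, A)))           # ≈ 109.47 with Eq. (3.28)
```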
3.5 Inner Products of Functions

Thus far, we looked at properties of inner products to compute lengths, angles, and distances. We focused on inner products of finite-dimensional vectors. In the following, we will look at an example of inner products of a different type of vectors: inner products of functions.
The inner products we discussed so far were defined for vectors with a finite number of entries. We can think of these vectors as discrete functions with a finite number of function values. The concept of an inner product can be generalized to vectors with an infinite number of entries (countably infinite) and also to continuous-valued functions (uncountably infinite). Then, the sum over individual components of vectors (see (3.6), for example) turns into an integral.
An inner product of two functions u : R → R and v : R → R can be defined as the definite integral

〈u, v〉 := ∫_a^b u(x) v(x) dx   (3.30)

for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.30) evaluates to 0, the functions u and v are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). Some careful definitions need to be observed, which requires a foray into real and functional analysis that we do not cover in this book.
Example (Inner Product of Functions)
If we choose u = sin(x) and v = cos(x), the integrand f(x) = u(x)v(x) of (3.30) is shown in Figure 3.7. We see that this function is odd, i.e., f(−x) = −f(x). Therefore, the integral with limits a = −π, b = π of this product evaluates to 0. Therefore, sin and cos are orthogonal functions.

Figure 3.7 f(x) = sin(x) cos(x).
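A quick numerical check of this integral (illustration only; it assumes NumPy and SciPy are available):

```python
import numpy as np
from scipy.integrate import quad

# Inner product of u = sin and v = cos on [-pi, pi], Eq. (3.30)
inner, _ = quad(lambda x: np.sin(x) * np.cos(x), -np.pi, np.pi)
print(inner)   # ≈ 0 (up to numerical error): sin and cos are orthogonal
```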
Remark. It also holds that the collection of functions

1, cos(x), cos(2x), cos(3x), . . .   (3.31)

is orthogonal if we integrate from −π to π, i.e., any pair of functions in this collection is orthogonal. ♦

In Chapter 6, we will have a look at a second type of unconventional inner products: the inner product of random variables.

3.6 Orthogonal Projections
Figure 3.8 Orthogonal projection of a two-dimensional data set onto a one-dimensional subspace. (a) Original data set. (b) Original data points (black) and their corresponding orthogonal projections (red) onto a lower-dimensional subspace (straight line).

Projections are an important class of linear transformations (besides rotations and reflections). Projections play an important role in graphics, coding theory, statistics, and machine learning. In machine learning, we often deal with data that is high-dimensional. High-dimensional data is often hard to analyze or visualize. However, high-dimensional data quite often possesses the property that only a few dimensions contain most information, and most other dimensions are not essential to describe key properties of the data. When we compress or visualize high-dimensional data, we will lose information. To minimize this compression loss, we ideally find the most informative dimensions in the data. Then, we can project the original high-dimensional data onto a lower-dimensional feature space ("feature" is a commonly used word for "data representation") and work in this lower-dimensional space to learn more about the data set and extract patterns. For example, machine learning algorithms, such as Principal Component Analysis (PCA) by Pearson (1901) and Hotelling (1933) and Deep Neural Networks (e.g., deep auto-encoders; Deng et al. (2010)), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.8.
Before we detail how to obtain these projections, let us define what a projection actually is.

Definition 3.7 (Projection). Let V be a vector space and W ⊆ V a subspace of V. A linear mapping π : V → W is called a projection if π² = π ∘ π = π.

Remark (Projection matrix). Since linear mappings can be expressed by transformation matrices, the definition above applies equally to a special
kind of transformation matrices, the projection matrices P_π, which exhibit the property that P_π² = P_π.
Since P_π² = P_π, it follows that all eigenvalues of P_π are either 0 or 1. The corresponding eigenspaces are the kernel and image of the projection, respectively. (A good illustration is given at http://tinyurl.com/p5jn5ws.) More details about eigenvalues and eigenvectors are provided in Chapter 4. ♦

In the following, we will derive orthogonal projections of vectors in the inner product space (R^n, 〈·, ·〉) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product 〈x, y〉 = x^⊤ y as the inner product.

3.6.1 Projection onto 1-Dimensional Subspaces (Lines)

Figure 3.9 Projection of x ∈ R^2 onto a subspace U with basis vector b.

Figure 3.10 Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.

Assume we are given a line (1-dimensional subspace) through the origin with basis vector b ∈ R^n. The line is a one-dimensional subspace U ⊆ R^n spanned by b. When we project x ∈ R^n onto U, we want to find the point πU(x) ∈ U that is closest to x. Using geometric arguments, let us characterize some properties of the projection πU(x) (Figure 3.9 serves as an illustration):
• The projection πU(x) is closest to x, where "closest" implies that the distance ‖x − πU(x)‖ is minimal. It follows that the segment πU(x) − x from πU(x) to x is orthogonal to U and, therefore, to the basis vector b of U. The orthogonality condition yields 〈πU(x) − x, b〉 = 0 since angles between vectors are defined by means of the inner product.
• The projection πU(x) of x onto U must be an element of U and, therefore, a multiple of the basis vector b that spans U. Hence, πU(x) = λb for some λ ∈ R; λ is then the coordinate of πU(x) with respect to b.

In the following three steps, we determine the coordinate λ, the projection πU(x) ∈ U, and the projection matrix P_π that maps an arbitrary x ∈ R^n onto U.
1. Finding the coordinate λ. The orthogonality condition yields

〈x − πU(x), b〉 = 0   (3.32)
⟺ 〈x − λb, b〉 = 0 ,   (3.33)

where we used πU(x) = λb. We can now exploit the bilinearity of the inner product and arrive at

〈x, b〉 − λ〈b, b〉 = 0   (3.34)
⟺ λ = 〈x, b〉 / 〈b, b〉 = 〈x, b〉 / ‖b‖² .   (3.35)

(With a general inner product, we get λ = 〈x, b〉 if ‖b‖ = 1.) If we choose 〈·, ·〉 to be the dot product, we obtain

λ = b^⊤ x / (b^⊤ b) = b^⊤ x / ‖b‖² .   (3.36)

If ‖b‖ = 1, then the coordinate λ of the projection is given by b^⊤ x.

2. Finding the projection point πU(x) ∈ U. Since πU(x) = λb, we immediately obtain with (3.36) that

πU(x) = λb = (〈x, b〉 / ‖b‖²) b = (b^⊤ x / ‖b‖²) b ,   (3.37)

where the last equality holds for the dot product only. We can also compute the length of πU(x) by means of Definition 3.1 as

‖πU(x)‖ = ‖λb‖ = |λ| ‖b‖ .   (3.38)

This means that our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU(x) with respect to the basis vector b that spans our one-dimensional subspace U.
If we use the dot product as the inner product, we get

‖πU(x)‖ = (|b^⊤ x| / ‖b‖²) ‖b‖ = |cos ω| ‖x‖ ‖b‖ ‖b‖ / ‖b‖² = |cos ω| ‖x‖ ,   (3.39)

where the first equality uses (3.37) and the second uses (3.26).
Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ‖x‖ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b (a one-dimensional subspace) is exactly cos ω, and the length of the corresponding vector πU(x) is |cos ω|. An illustration is given in Figure 3.10.

3. Finding the projection matrix P_π. We know that a projection is a linear mapping (see Definition 3.7). Therefore, there exists a projection matrix P_π such that πU(x) = P_π x. With the dot product as the inner product and

πU(x) = λb = bλ = b (b^⊤ x / ‖b‖²) = (b b^⊤ / ‖b‖²) x ,   (3.40)

we immediately see that

P_π = b b^⊤ / ‖b‖² .   (3.41)

Note that b b^⊤ is a symmetric matrix (of rank 1) and ‖b‖² = 〈b, b〉 is a scalar. Projection matrices are always symmetric.

The projection matrix P_π projects any vector x ∈ R^n onto the line through the origin with direction b (equivalently, the subspace U spanned by b).

Remark. The projection πU(x) ∈ R^n is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection; a single coordinate, λ, suffices if we express it with respect to the basis vector b that spans the subspace U. ♦
Example (Projection onto a Line)
Find the projection matrix P_π onto the line through the origin spanned by b = [1, 2, 2]^⊤. Here, b is a direction and a basis of the one-dimensional subspace (a line through the origin).
With (3.41), we obtain

P_π = b b^⊤ / (b^⊤ b) = (1/9) [1; 2; 2] [1 2 2] = (1/9) [1 2 2; 2 4 4; 2 4 4] .   (3.42)

Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]^⊤, the projected point is

πU(x) = P_π x = (1/9) [1 2 2; 2 4 4; 2 4 4] [1; 1; 1] = (1/9) [5; 10; 10] ∈ span[[1, 2, 2]^⊤] .   (3.43)

Note that the application of P_π to πU(x) does not change anything, i.e., P_π πU(x) = πU(x). This is expected because, according to Definition 3.7, we know that a projection matrix P_π satisfies P_π² x = P_π x. Therefore, πU(x) is also an eigenvector of P_π, and the corresponding eigenvalue is 1.

Figure 3.11 Projection onto a two-dimensional subspace U with basis b1, b2. The projection πU(x) of x ∈ R^3 onto U can be expressed as a linear combination of b1, b2, and the displacement vector x − πU(x) is orthogonal to both b1 and b2.
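A minimal numerical sketch of the line-projection example above (not part of the book; NumPy assumed):

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])

# Projection matrix onto the line spanned by b, Eq. (3.41)
P = np.outer(b, b) / (b @ b)

x = np.array([1.0, 1.0, 1.0])
p = P @ x
print(p)                        # [5/9, 10/9, 10/9]

# Applying P again does not change the projection (P is idempotent)
print(np.allclose(P @ p, p))    # True
```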
3.6.2 Projection onto General Subspaces

In the following, we look at orthogonal projections of vectors x ∈ R^n onto higher-dimensional subspaces U ⊆ R^n with dim(U) = m ≥ 1. An illustration is given in Figure 3.11.
Assume that (b1, . . . , bm) is an ordered basis of U. (If U is given by a set of spanning vectors that are not a basis, determine a basis b1, . . . , bm before proceeding.) Any projection πU(x) onto U is necessarily an element of U. Therefore, it can be represented as a linear combination of the basis vectors b1, . . . , bm of U, such that πU(x) = ∑_{i=1}^m λi bi. The basis vectors form the columns of B ∈ R^{n×m}, where B = [b1, . . . , bm].
As in the 1D case, we follow a three-step procedure to find the projection πU(x) and the projection matrix P_π:

1. Find the coordinates λ1, . . . , λm of the projection (with respect to the basis of U), such that the linear combination

πU(x) = ∑_{i=1}^m λi bi = Bλ ,   (3.44)
B = [b1, . . . , bm] ∈ R^{n×m} ,   λ = [λ1, . . . , λm]^⊤ ∈ R^m ,   (3.45)

is closest to x ∈ R^n. As in the 1D case, "closest" means "minimum distance", which implies that the vector connecting πU(x) ∈ U and x ∈ R^n must be orthogonal to all basis vectors of U. Therefore, we obtain m simultaneous conditions (assuming the dot product as the inner product)

〈b1, x − πU(x)〉 = b1^⊤ (x − πU(x)) = 0   (3.46)
⋮   (3.47)
〈bm, x − πU(x)〉 = bm^⊤ (x − πU(x)) = 0 ,   (3.48)

which, with πU(x) = Bλ, can be written as

b1^⊤ (x − Bλ) = 0   (3.49)
⋮   (3.50)
bm^⊤ (x − Bλ) = 0 ,   (3.51)

such that we obtain a homogeneous linear equation system

[b1^⊤; . . . ; bm^⊤] (x − Bλ) = 0 ⟺ B^⊤(x − Bλ) = 0   (3.52)
⟺ B^⊤ B λ = B^⊤ x .   (3.53)

The last expression is called the normal equation. Since b1, . . . , bm are a basis of U and, therefore, linearly independent, B^⊤ B ∈ R^{m×m} is regular and can be inverted. This allows us to solve for the coefficients/coordinates

λ = (B^⊤ B)^{−1} B^⊤ x .   (3.54)

The matrix (B^⊤ B)^{−1} B^⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires B^⊤ B to be positive definite, which is the case if B has full rank. (In practical applications, e.g., linear regression, we often add a "jitter term" εI to B^⊤ B to guarantee numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference; see Chapter 9 for details.)

2. Find the projection πU(x) ∈ U. We already established that πU(x) = Bλ. Therefore, with (3.54),

πU(x) = B(B^⊤ B)^{−1} B^⊤ x .   (3.55)

3. Find the projection matrix P_π. From (3.55), we can immediately see that the projection matrix that solves P_π x = πU(x) must be

P_π = B(B^⊤ B)^{−1} B^⊤ .   (3.56)

Remark. Comparing the solutions for projecting onto a one-dimensional subspace and the general case, we see that the general case includes the 1D case as a special case: If dim(U) = 1, then B^⊤ B ∈ R is a scalar and we can rewrite the projection matrix in (3.56) as P_π = B B^⊤ / (B^⊤ B), which is exactly the projection matrix in (3.41). ♦
Example (Projection onto a Two-dimensional Subspace)
For a subspace U = span[[1, 1, 1]^⊤, [0, 1, 2]^⊤] ⊆ R^3 and x = [6, 0, 0]^⊤ ∈ R^3, find the coordinates λ of x in terms of the subspace U, the projection point πU(x), and the projection matrix P_π.
First, we see that the generating set of U is a basis (linear independence) and write the basis vectors of U into a matrix

B = [1 0; 1 1; 1 2] .

Second, we compute the matrix B^⊤ B and the vector B^⊤ x as

B^⊤ B = [1 1 1; 0 1 2] [1 0; 1 1; 1 2] = [3 3; 3 5] ,   B^⊤ x = [1 1 1; 0 1 2] [6; 0; 0] = [6; 0] .   (3.57)

Third, we solve the normal equation B^⊤ B λ = B^⊤ x to find λ:

[3 3; 3 5] [λ1; λ2] = [6; 0] ⟺ λ = [5; −3] .   (3.58)

Fourth, the projection πU(x) of x onto U, i.e., into the column space of B, can be directly computed via

πU(x) = Bλ = [5; 2; −1] .   (3.59)

The corresponding projection error (also called the reconstruction error) is the norm of the difference between the original vector and its projection onto U, i.e.,

‖x − πU(x)‖ = ‖[1, −2, 1]^⊤‖ = √6 .   (3.60)

Fifth, the projection matrix (for any x ∈ R^3) is given by

P_π = B(B^⊤ B)^{−1} B^⊤ = (1/6) [5 2 −1; 2 2 2; −1 2 5] .   (3.61)

To verify the results, we can (a) check whether the displacement vector πU(x) − x is orthogonal to all basis vectors of U, and (b) verify that P_π = P_π² (see Definition 3.7).
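A numerical sketch of these five steps (an illustration, not part of the book; NumPy assumed):

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Solve the normal equation B^T B lam = B^T x, Eq. (3.53)
lam = np.linalg.solve(B.T @ B, B.T @ x)
print(lam)                          # [ 5. -3.]

# Projection point and projection matrix, Eqs. (3.55) and (3.56)
p = B @ lam
P = B @ np.linalg.inv(B.T @ B) @ B.T
print(p)                            # [ 5.  2. -1.]
print(np.linalg.norm(x - p))        # sqrt(6) ≈ 2.449

# Sanity checks: displacement orthogonal to the columns of B, P idempotent
print(B.T @ (x - p))                # ≈ [0. 0.]
print(np.allclose(P @ P, P))        # True
```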
Remark. The projections πU(x) are still vectors in R^n although they lie in an m-dimensional subspace U ⊆ R^n. However, to represent a projected vector we only need the m coordinates λ1, . . . , λm with respect to the basis vectors b1, . . . , bm of U. ♦

Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product. ♦

Projections allow us to look at situations where we have a linear system Ax = b without a solution; in this way, we can find approximate solutions to unsolvable linear equation systems using projections. Recall that having no solution means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Chapter 9.

3.6.3 Projection onto Affine Subspaces

Thus far, we discussed how to project a vector onto a lower-dimensional subspace U. In the following, we provide a solution to projecting a vector onto an affine subspace.

Figure 3.12 Projection onto an affine space. (a) The original setting; (b) the setting is shifted by −x0, so that x − x0 can be projected onto the direction space U; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).

Consider the setting in Figure 3.12(a). We are given an affine space L = x0 + U, where b1, b2 are basis vectors of U. To determine the orthogonal projection πL(x) of x onto L, we transform the problem into a problem that we know how to solve: the projection onto a vector subspace. In order to get there, we subtract the support point x0 from x and from L, so that L − x0 = U is exactly the vector subspace U. We can now use the orthogonal projections onto a subspace we discussed in Section 3.6.2 and obtain the projection πU(x − x0), which is illustrated in Figure 3.12(b). This projection can now be translated back into L by adding x0, such that
we obtain the orthogonal projection onto the affine space L as

πL(x) = x0 + πU(x − x0) ,   (3.62)

where πU(·) is the orthogonal projection onto the subspace U, i.e., the direction space of L; see Figure 3.12(c).
From Figure 3.12, it is also evident that the distance of x from the affine space L is identical to the distance of x − x0 from U, i.e.,

d(x, L) = ‖x − πL(x)‖ = ‖x − (x0 + πU(x − x0))‖   (3.63)
= d(x − x0, πU(x − x0)) .   (3.64)
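A short sketch of (3.62)-(3.64) with arbitrarily chosen (hypothetical) values for x0, B, and x (illustration only; NumPy assumed):

```python
import numpy as np

def project_onto_subspace(B, v):
    """Orthogonal projection of v onto the column space of B, Eq. (3.55)."""
    return B @ np.linalg.solve(B.T @ B, B.T @ v)

# Affine space L = x0 + U, where U is spanned by the columns of B
x0 = np.array([1.0, 0.0, 0.0])          # support point (not in U)
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Eq. (3.62): shift by -x0, project onto U, shift back
pi_L = x0 + project_onto_subspace(B, x - x0)
print(pi_L)
print(np.linalg.norm(x - pi_L))          # distance d(x, L), Eq. (3.63)
```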
3.7 Orthonormal Basis

In Section 2.6.1, we characterized properties of basis vectors and found that in an n-dimensional vector space, we need n basis vectors, i.e., n vectors that are linearly independent. In Sections 3.3 and 3.4, we used inner products to compute the length of vectors and the angle between vectors. In the following, we will discuss the special case where the basis vectors are orthogonal to each other and where the length of each basis vector is 1. We will then call this basis an orthonormal basis.
Let us introduce this more formally.

Definition 3.8 (Orthonormal Basis). Consider an n-dimensional vector space V and a basis b1, . . . , bn of V. If

〈bi, bj〉 = 0 for i ≠ j ,   (3.65)
〈bi, bi〉 = 1   (3.66)

for all i, j = 1, . . . , n, then the basis is called an orthonormal basis (ONB). If only (3.65) is satisfied, then the basis is called an orthogonal basis.

Note that (3.66) implies that every basis vector has length/norm 1. The Gram-Schmidt process (Strang, 2003) is a constructive way to iteratively build an orthonormal basis b1, . . . , bn given a set of non-orthogonal and unnormalized basis vectors.

Example (Orthonormal Basis)
The canonical/standard basis for a Euclidean vector space R^n is an orthonormal basis, where the inner product is the dot product of vectors.
In R^2, the vectors

b1 = (1/√2) [1; 1] ,   b2 = (1/√2) [1; −1]   (3.67)

form an orthonormal basis since b1^⊤ b2 = 0 and ‖b1‖ = 1 = ‖b2‖.
In Section 3.6, we derived projections of vectors x onto a subspace U with basis vectors b1, . . . , bk. If this basis is an ONB, i.e., (3.65)–(3.66) are satisfied, the projection equation (3.55) simplifies greatly to

πU(x) = B B^⊤ x ,   (3.68)

since B^⊤ B = I. This means that we no longer have to compute the tedious inverse from (3.55), which saves us much computation time.
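As a small sketch (not part of the book; NumPy assumed), we can orthonormalize the basis from the earlier two-dimensional projection example, e.g., with a QR decomposition, which yields an orthonormal basis of the column space, and check that the projection then reduces to B B^⊤ x:

```python
import numpy as np

# Non-orthonormal basis of a two-dimensional subspace of R^3
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Orthonormalize the columns; Q spans the same subspace as B
Q, _ = np.linalg.qr(B)
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q^T Q = I

# With an ONB, Eq. (3.68) gives the same projection as Eq. (3.55)
p_onb = Q @ Q.T @ x
p_general = B @ np.linalg.solve(B.T @ B, B.T @ x)
print(np.allclose(p_onb, p_general))     # True
```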
We will exploit the concept of an orthonormal basis in Chapter 12 and Chapter 10 when we discuss Support Vector Machines and Principal Component Analysis.

3.8 Rotations

An important category of linear mappings are rotations. A rotation is a linear mapping that rotates an object (counterclockwise) by an angle θ about the origin. Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object; see Figure 3.13.

Figure 3.13 The robotic arm needs to rotate its joints in order to pick up objects or to place them correctly. Figure taken from Deisenroth et al. (2015).
3.8.1 Rotations in the Plane

Consider the standard basis of R^2, e1 = [1; 0] and e2 = [0; 1], which defines the standard coordinate system in R^2. Assume we want to rotate this coordinate system by an angle θ as illustrated in Figure 3.14. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R^2. This means that the rotation performs a basis change.
Figure 3.14 Rotation of the standard basis in R^2 by an angle θ: Φ(e1) = [cos θ, sin θ]^⊤ and Φ(e2) = [−sin θ, cos θ]^⊤.

Since rotations Φ are linear mappings, we can express them by a rotation matrix R(θ). Trigonometry allows us to determine the coordinates of the
rotated axes with respect to the standard basis in R^2. We obtain

Φ(e1) = [cos θ; sin θ] ,   Φ(e2) = [−sin θ; cos θ] .   (3.69)

Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as

R(θ) = [cos θ −sin θ; sin θ cos θ] .   (3.70)
Generally, in R^2, rotations of any vector x ∈ R^2 by an angle θ are given by

x′ = Φ(x) = R(θ)x = [cos θ −sin θ; sin θ cos θ] [x1; x2] = [x1 cos θ − x2 sin θ; x1 sin θ + x2 cos θ] .   (3.71)
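A minimal sketch of the rotation matrix in action (illustration only; NumPy assumed):

```python
import numpy as np

def rotation_2d(theta):
    """Planar rotation matrix R(theta) from Eq. (3.70)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.pi / 2                          # 90 degrees
R = rotation_2d(theta)
print(R @ np.array([1.0, 0.0]))            # ≈ [0, 1]: e1 is rotated onto e2

# Rotations preserve lengths
x = np.array([3.0, 4.0])
print(np.linalg.norm(R @ x))               # 5.0, same as ||x||
```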
3.8.2 Rotations in Three Dimensions

In three dimensions, we have to define what "counterclockwise" means: We will go by the convention that a "counterclockwise" (planar) rotation about an axis refers to a rotation about an axis when we look at the axis "from the end toward the origin". In R^3, there are three (planar) rotations about the three standard basis vectors (see Figure 3.15):

• Counterclockwise rotation about the e1-axis:

R1(θ) = [Φ(e1) Φ(e2) Φ(e3)] = [1 0 0; 0 cos θ −sin θ; 0 sin θ cos θ] .   (3.72)

Here, the e1 coordinate is fixed, and the counterclockwise rotation is performed in the e2e3-plane.
• Counterclockwise rotation about the e2-axis:

R2(θ) = [cos θ 0 sin θ; 0 1 0; −sin θ 0 cos θ] .   (3.73)

If we rotate the e1e3-plane about the e2-axis, we need to look at the e2-axis from its "tip" toward the origin.

Figure 3.15 Rotation of a vector (gray) in R^3 by an angle θ about the e3-axis. The rotated vector is shown in blue.
• Counterclockwise rotation about the e3-axis:

R3(θ) = [cos θ −sin θ 0; sin θ cos θ 0; 0 0 1] .   (3.74)

Figure 3.15 illustrates this.

3.8.3 Rotations in n Dimensions

The generalization of rotations from 2D and 3D to n-dimensional Euclidean vector spaces can be intuitively described as fixing n − 2 dimensions and restricting the rotation to a two-dimensional plane in the n-dimensional space.

Definition 3.9 (Givens Rotation). Let V be an n-dimensional Euclidean vector space and Φ : V → V an automorphism with transformation matrix

Rij(θ) := [I_{i−1} 0 ··· ··· 0; 0 cos θ 0 −sin θ 0; 0 0 I_{j−i−1} 0 0; 0 sin θ 0 cos θ 0; 0 ··· ··· 0 I_{n−j}] ∈ R^{n×n} ,   θ ∈ R ,   (3.75)

for 1 ≤ i < j ≤ n. Then Rij(θ) is called a Givens rotation. Essentially, Rij(θ) is the identity matrix In with

rii = cos θ ,   rij = −sin θ ,   rji = sin θ ,   rjj = cos θ .   (3.76)
In two dimensions (i.e., n = 2), we obtain (3.70) as a special case.
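A small sketch (not part of the book; NumPy assumed) that builds Rij(θ) by modifying the identity matrix as in (3.76):

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, Eq. (3.76); here i < j are 0-based."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[i, j] = -s
    R[j, i] = s
    R[j, j] = c
    return R

# For n = 2, we recover the planar rotation matrix (3.70)
print(givens(2, 0, 1, np.pi / 4))

# A rotation in the plane spanned by e1 and e3 in R^4; the other dimensions stay fixed
print(givens(4, 0, 2, np.pi / 2))
```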
3.8.4 Properties of Rotations

Rotations exhibit some useful properties (a numerical check of the first two follows after this list):

• The composition of rotations can be expressed as a single rotation, where the angles are summed up:

R(φ)R(θ) = R(φ + θ) .   (3.77)

• Rotations preserve lengths, i.e., ‖x‖ = ‖R(φ)x‖. The original vector and the rotated vector have the same length. This also implies that det(R(φ)) = 1.
• Rotations preserve distances, i.e., ‖x − y‖ = ‖R(θ)x − R(θ)y‖. In other words, rotations leave the distance between any two points unchanged after the transformation.
• Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotations are applied is important, even if they rotate about the same point. Only in two dimensions are vector rotations commutative, such that R(φ)R(θ) = R(θ)R(φ) for all φ, θ ∈ [0, 2π), and they form an Abelian group (with multiplication) if they rotate about the same point (e.g., the origin).
• Rotations in the plane have no real eigenvalues, except when we rotate about nπ for n ∈ Z.
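A quick numerical check of the composition and length-preservation properties (illustration only; NumPy assumed):

```python
import numpy as np

def R(theta):
    """Planar rotation matrix, Eq. (3.70)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

phi, theta = 0.3, 1.1
x = np.array([2.0, -1.0])

# Composition: R(phi) R(theta) = R(phi + theta), Eq. (3.77)
print(np.allclose(R(phi) @ R(theta), R(phi + theta)))               # True

# Length preservation and determinant 1
print(np.isclose(np.linalg.norm(R(phi) @ x), np.linalg.norm(x)))    # True
print(np.isclose(np.linalg.det(R(phi)), 1.0))                       # True
```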
3.9 Further Reading

In this chapter, we gave a brief overview of some of the important concepts of analytic geometry, which we will use in later chapters of the book. For a broader and more in-depth overview of some of the concepts we presented, we refer, for example, to the excellent book by Boyd and Vandenberghe (2018).
Inner products allow us to determine specific bases of vector (sub)spaces, where each vector is orthogonal to all others (orthogonal bases), using the Gram-Schmidt method. These bases are important in optimization and numerical algorithms for solving linear equation systems. For instance, Krylov subspace methods, such as Conjugate Gradients or GMRES, minimize residual errors that are orthogonal to each other (Stoer and Burlirsch, 2002).
In machine learning, inner products are important in the context of kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many linear algorithms can be expressed purely by inner product computations. Then, the "kernel trick" allows us to compute these inner products implicitly in a (potentially infinite-dimensional) feature space, without even knowing this feature space explicitly. This allowed the "non-linearization" of many algorithms used in machine learning, such as kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic regression (fitting curves to data points). The idea of kernels is explored further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shadows. In optimization, orthogonal projections are often used to (iteratively) minimize residual errors. This also has applications in machine learning, e.g., in linear regression, where we want to find a (linear) function that minimizes the residual errors, i.e., the lengths of the orthogonal projections of the data onto the linear function (Bishop, 2006). We will investigate this further in Chapter 9. PCA (Hotelling, 1933; Pearson, 1901) also uses projections to reduce the dimensionality of high-dimensional data. We will discuss this in more detail in Chapter 10.

Exercises
3.1 Show that 〈·, ·〉, defined for all x = (x1, x2) and y = (y1, y2) in R^2 by

〈x, y〉 := x1y1 − (x1y2 + x2y1) + 2(x2y2) ,

is an inner product.

3.2 Consider R^2 with 〈·, ·〉 defined for all x and y in R^2 as

〈x, y〉 := x^⊤ A y ,   A := [2 0; 1 2] .

Is 〈·, ·〉 an inner product?
3.3 Consider the Euclidean vector space R^5 with the dot product. A subspace U ⊆ R^5 and x ∈ R^5 are given by

U = span[ [0; −1; 2; 0; 2], [1; −3; 1; −1; 2], [−3; 4; 1; 2; 1], [−1; −3; 5; 0; 7] ] ,   x = [−1; −9; −1; 4; 1] .

1. Determine the orthogonal projection πU(x) of x onto U.
2. Determine the distance d(x, U).
3.4 Consider R^3 with the inner product

〈x, y〉 := x^⊤ [2 1 0; 1 2 −1; 0 −1 2] y .

Furthermore, we define e1, e2, e3 as the standard/canonical basis in R^3.

1. Determine the orthogonal projection πU(e2) of e2 onto U = span[e1, e3].
   Hint: Orthogonality is defined through the inner product.
2. Compute the distance d(e2, U).