3
Analytic Geometry
In Chapter 2, we studied vectors, vector spaces, and linear mappings at a general but abstract level. In this chapter, we will add some geometric interpretation and intuition to all of these concepts. In particular, we will look at geometric vectors and compute their lengths and the distances or angles between two vectors. To be able to do this, we equip the vector space with an inner product that induces the geometry of the vector space. Inner products and their corresponding norms and metrics capture the intuitive notions of similarity and distance, which we use to develop the Support Vector Machine in Chapter 12. We will then use the concepts of lengths and angles between vectors to discuss orthogonal projections, which will play a central role when we discuss principal component analysis in Chapter 10 and regression via maximum likelihood estimation in Chapter 9. Figure 3.1 gives an overview of how the concepts in this chapter are related and how they are connected to other chapters of the book.
Figure 3.1 A mindmap of the concepts introduced in this chapter, along with when they are used in other parts of the book. The mindmap connects inner products (which induce norms) with lengths, angles, orthogonal projections, and rotations, and links them to Chapter 4 (matrix decomposition), Chapter 9 (regression), Chapter 10 (dimensionality reduction), and Chapter 12 (classification).
Figure 3.3 For different norms, the red lines indicate the set of vectors with norm 1. Left: Manhattan distance (‖x‖1 = 1); right: Euclidean distance (‖x‖2 = 1).
3.1 Norms

When we think of geometric vectors, i.e., directed line segments that start at the origin, then intuitively the length of a vector is the distance of the "end" of this directed line segment from the origin. In the following, we will discuss the notion of the length of vectors using the concept of a norm.
Definition 3.1 (Norm). A norm on a vector space V is a function

‖ · ‖ : V → R ,   (3.1)
x ↦ ‖x‖ ,   (3.2)

which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

• Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
• Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
• Positive definite: ‖x‖ ≥ 0 and ‖x‖ = 0 ⟺ x = 0.
Figure 3.2 Triangle inequality: c ≤ a + b.

In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration.
Recall that for a vector x ∈ R^n we denote the elements of the vector using a subscript, that is, xi is the ith element of the vector x.

Example (Manhattan Distance)
The Manhattan distance on R^n is defined for x ∈ R^n as

‖x‖1 := ∑_{i=1}^n |xi| ,   (3.3)

where | · | is the absolute value. The left panel of Figure 3.3 indicates all vectors x ∈ R^2 with ‖x‖1 = 1. The Manhattan distance is also called the ℓ1 norm.
Example (Euclidean Norm)
The length of a vector x ∈ R^n is given by

‖x‖2 := √(∑_{i=1}^n xi^2) = √(x^⊤ x) ,   (3.4)

which computes the Euclidean distance of x from the origin. This norm is called the Euclidean norm. The right panel of Figure 3.3 shows all vectors x ∈ R^2 with ‖x‖2 = 1. The Euclidean norm is also called the ℓ2 norm.

Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♦
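As a quick numerical illustration (not part of the book; it assumes NumPy is available), both norms can be computed directly:

```python
import numpy as np

x = np.array([1.0, 1.0])

# Manhattan (l1) norm: sum of absolute values, Eq. (3.3)
l1 = np.linalg.norm(x, ord=1)   # 2.0

# Euclidean (l2) norm: square root of the sum of squares, Eq. (3.4)
l2 = np.linalg.norm(x, ord=2)   # sqrt(2) ≈ 1.414

print(l1, l2)
```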
Remark (Inner Products and Norms). Every inner product induces a norm, but there are norms (like the ℓ1 norm) without a corresponding inner product. For an inner product vector space (V, 〈·, ·〉), the induced norm ‖ · ‖ satisfies the Cauchy-Schwarz inequality

|〈x, y〉| ≤ ‖x‖ ‖y‖ .   (3.5)
♦

3.2 Inner Products

Inner products allow for the introduction of intuitive geometrical concepts, such as the length of a vector and the angle or distance between two vectors. A major purpose of inner products is to determine whether vectors are orthogonal to each other.

3.2.1 Dot Product

We may already be familiar with a particular type of inner product, the scalar product/dot product in R^n, which is given by

x^⊤ y = ∑_{i=1}^n xi yi .   (3.6)
We will refer to the particular inner product above as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.

3.2.2 General Inner Products

Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V, then it holds for all x, y, z ∈ V, λ ∈ R that

Ω(λx + y, z) = λΩ(x, z) + Ω(y, z) ,   (3.7)
Ω(x, λy + z) = λΩ(x, y) + Ω(x, z) .   (3.8)
Equation (3.7) asserts that Ω is linear in the first argument, and equation (3.8) asserts that Ω is linear in the second argument.

Definition 3.2. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

• Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V, i.e., the order of the arguments does not matter.
• Ω is called positive definite if

∀x ∈ V \ {0} : Ω(x, x) > 0 ,   Ω(0, 0) = 0 .   (3.9)

Definition 3.3. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

• A positive definite, symmetric bilinear mapping Ω : V × V → R is called an inner product on V. We typically write 〈x, y〉 instead of Ω(x, y).
• The pair (V, 〈·, ·〉) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.6), we call (V, 〈·, ·〉) a Euclidean vector space.

We will refer to the spaces above as inner product spaces in this book.
Example (Inner Product That Is Not the Dot Product)
Consider V = R^2. If we define

〈x, y〉 := x1y1 − (x1y2 + x2y1) + 2x2y2 ,   (3.10)

then 〈·, ·〉 is an inner product, but it is different from the dot product. The proof is left as an exercise (Exercise 3.1).

3.2.3 Symmetric, Positive Definite Matrices

Symmetric, positive definite (SPD) matrices play an important role in machine learning, and they are defined via the inner product.
Consider an n-dimensional vector space V with an inner product 〈·, ·〉 : V × V → R (see Definition 3.3) and an ordered basis B = (b1, . . . , bn) of V. Recall the definition of a basis (Section 2.6.1), which says that any vector can be written as a linear combination of the basis vectors. That is, for any vectors x ∈ V and y ∈ V, we can write them as x = ∑_{i=1}^n ψi bi ∈ V
and y = ∑_{j=1}^n λj bj ∈ V, for ψi, λj ∈ R. Due to the bilinearity of the inner product, it holds for all x and y that

〈x, y〉 = 〈∑_{i=1}^n ψi bi, ∑_{j=1}^n λj bj〉 = ∑_{i=1}^n ∑_{j=1}^n ψi 〈bi, bj〉 λj = x^⊤ A y ,   (3.11)

where Aij := 〈bi, bj〉 and x, y on the right-hand side are the coordinates of x and y with respect to the basis B. This implies that the inner product 〈·, ·〉 is uniquely determined through A. The symmetry of the inner product also means that A is symmetric. Furthermore, the positive definiteness of the inner product implies that

∀x ∈ V \ {0} : x^⊤ A x > 0 .   (3.12)

A symmetric matrix A that satisfies (3.12) is called positive definite.
Example (SPD Matrices)
Consider the following matrices:

A1 = [9 6; 6 5] ,   A2 = [9 6; 6 3] .   (3.13)

Then, A1 is positive definite because it is symmetric and

x^⊤ A1 x = [x1 x2] [9 6; 6 5] [x1; x2]   (3.14)
         = 9x1^2 + 12x1x2 + 5x2^2 = (3x1 + 2x2)^2 + x2^2 > 0   (3.15)

for all x ∈ V \ {0}. However, A2 is symmetric but not positive definite because x^⊤ A2 x = 9x1^2 + 12x1x2 + 3x2^2 = (3x1 + 2x2)^2 − x2^2 can be smaller than 0, e.g., for x = [2, −3]^⊤.
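These claims can be checked numerically (an illustration, not part of the book; NumPy assumed): a symmetric matrix is positive definite exactly when all of its eigenvalues are positive.

```python
import numpy as np

A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])

def is_spd(A):
    """Check symmetry and positive definiteness via the eigenvalues."""
    symmetric = np.allclose(A, A.T)
    eigenvalues = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix
    return symmetric and bool(np.all(eigenvalues > 0))

print(is_spd(A1))   # True
print(is_spd(A2))   # False

# The quadratic form x^T A2 x can indeed become negative:
x = np.array([2.0, -3.0])
print(x @ A2 @ x)   # -9.0
```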
If A ∈ R^{n×n} is symmetric, positive definite, then

〈x, y〉 = x^⊤ A y   (3.16)

defines an inner product with respect to an ordered basis B, where x and y are the coordinate representations of x, y ∈ V with respect to B.

Theorem 3.4. For a real-valued, finite-dimensional vector space V and an ordered basis B of V, it holds that 〈·, ·〉 : V × V → R is an inner product if and only if there exists a symmetric, positive definite matrix A ∈ R^{n×n} with

〈x, y〉 = x^⊤ A y .   (3.17)

The following properties hold if A ∈ R^{n×n} is symmetric and positive definite:

• The null space (kernel) of A consists only of 0 because x^⊤ A x > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
• The diagonal elements of A are positive because aii = ei^⊤ A ei > 0, where ei is the ith vector of the standard basis in R^n.

In Section 4.3, we will return to symmetric, positive definite matrices in the context of matrix decompositions.

3.3 Lengths and Distances

In Section 3.1, we already discussed norms that we can use to compute the length of a vector. Inner products and norms are closely related in the sense that any inner product induces a norm

‖x‖ := √〈x, x〉   (3.18)

in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan distance (3.3) is an example of a norm that is not induced by an inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Example (Lengths of Vectors Using Inner Products)
In geometry, we are often interested in the lengths of vectors. We can now use an inner product to compute them using (3.18). Let us take x = [1, 1]^⊤ ∈ R^2. If we use the dot product as the inner product, with (3.18) we obtain

‖x‖ = √(x^⊤ x) = √(1^2 + 1^2) = √2   (3.19)

as the length of x. Let us now choose a different inner product:

〈x, y〉 := x^⊤ [1 −1/2; −1/2 1] y = x1y1 − (1/2)(x1y2 + x2y1) + x2y2 .   (3.20)

If we compute the norm of a vector, then this inner product returns smaller values than the dot product if x1 and x2 have the same sign (and x1x2 > 0); otherwise, it returns greater values than the dot product. With this inner product we obtain

〈x, x〉 = x1^2 − x1x2 + x2^2 = 1 − 1 + 1 = 1 ⟹ ‖x‖ = √1 = 1 ,   (3.21)

such that x is "shorter" with this inner product than with the dot product.
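A short numerical sketch of this example (not part of the book; NumPy assumed). The second inner product corresponds to the matrix A in (3.20), and the induced norm is √(x^⊤ A x):

```python
import numpy as np

x = np.array([1.0, 1.0])

# Dot product as the inner product: ||x|| = sqrt(x^T x)
print(np.sqrt(x @ x))         # 1.414... = sqrt(2)

# Inner product from Eq. (3.20): <x, y> = x^T A y
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])
print(np.sqrt(x @ A @ x))     # 1.0
```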
Definition 3.5 (Distance and Metric). Consider an inner product space (V, 〈·, ·〉). Then

d(x, y) := ‖x − y‖ = √〈x − y, x − y〉   (3.22)

is called the distance of x, y ∈ V. If we use the dot product as the inner product, then the distance is called the Euclidean distance. The mapping
d : V × V → R   (3.23)
(x, y) ↦ d(x, y)   (3.24)

is called a metric.

Remark. Similar to the length of a vector, the distance between vectors does not require an inner product: a norm is sufficient. If we have a norm induced by an inner product, the distance may vary depending on the choice of the inner product. ♦

A metric d satisfies:

1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

3.4 Angles and Orthogonality
Figure 3.4 When restricted to [0, π], f(ω) = cos(ω) returns a unique number in the interval [−1, 1].

The Cauchy-Schwarz inequality (3.5) allows us to define angles ω in inner product spaces between two vectors x, y. Assume that x ≠ 0, y ≠ 0. Then

−1 ≤ 〈x, y〉 / (‖x‖ ‖y‖) ≤ 1 .   (3.25)

Therefore, there exists a unique ω ∈ [0, π] with

cos ω = 〈x, y〉 / (‖x‖ ‖y‖) ;   (3.26)

see Figure 3.4 for an illustration. The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: their orientation is the same.

Figure 3.5 The angle ω between two vectors x, y is computed using the inner product.
Example (Angle Between Vectors)
Let us compute the angle between x = [1, 1]^⊤ ∈ R^2 and y = [1, 2]^⊤ ∈ R^2 (see Figure 3.5), where we use the dot product as the inner product. Then we get

cos ω = 〈x, y〉 / √(〈x, x〉〈y, y〉) = x^⊤ y / √(x^⊤ x y^⊤ y) = 3/√10 ,   (3.27)

and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.
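A numerical version of this computation (illustration only; NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

# cos(omega) = <x, y> / (||x|| ||y||), Eq. (3.26), with the dot product
cos_omega = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
omega = np.arccos(cos_omega)

print(omega)               # ≈ 0.3217 rad
print(np.degrees(omega))   # ≈ 18.43 degrees
```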
The inner product also allows us to characterize vectors that are most dissimilar, i.e., orthogonal.
Definition 3.6 (Orthogonality). Two vectors x and y are orthogonal if and only if 〈x, y〉 = 0, and we write x ⊥ y.

An implication of this definition is that the 0-vector is orthogonal to every vector in the vector space.

Remark. Orthogonality is the generalization of the concept of perpendicularity to bilinear forms that do not have to be the dot product. In our context, geometrically, we can think of orthogonal vectors as having a right angle with respect to a specific inner product. ♦
Example (Orthogonal Vectors)

Figure 3.6 The angle ω between two vectors x, y can change depending on the inner product.

Consider two vectors x = [1, 1]^⊤, y = [−1, 1]^⊤ ∈ R^2; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as the inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product

〈x, y〉 = x^⊤ [2 0; 0 1] y ,   (3.28)

we get that the angle ω between x and y is given by

cos ω = 〈x, y〉 / (‖x‖ ‖y‖) = −1/3 ⟹ ω ≈ 1.91 rad ≈ 109.5° ,   (3.29)

and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
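The same computation can be sketched numerically by swapping the matrix that defines the inner product (illustration only; NumPy assumed, and the helper name is hypothetical):

```python
import numpy as np

def angle(x, y, A):
    """Angle between x and y w.r.t. the inner product <x, y> = x^T A y."""
    inner = x @ A @ y
    norm_x = np.sqrt(x @ A @ x)
    norm_y = np.sqrt(y @ A @ y)
    return np.arccos(inner / (norm_x * norm_y))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])

print(np.degrees(angle(x, y, np.eye(2))))   # 90.0 with the dot product
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
print(np.degrees(angle(x, y, A)))           # ≈ 109.47 with Eq. (3.28)
```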
3.5 Inner Products of Functions

Thus far, we looked at properties of inner products to compute lengths, angles, and distances. We focused on inner products of finite-dimensional vectors. In the following, we will look at an example of inner products of a different type of vectors: inner products of functions.
The inner products we discussed so far were defined for vectors with a finite number of entries. We can think of these vectors as discrete functions with a finite number of function values. The concept of an inner product can be generalized to vectors with an infinite number of entries (countably infinite) and also to continuous-valued functions (uncountably infinite). Then, the sum over individual components of vectors (see (3.6), for example) turns into an integral.
An inner product of two functions u : R → R and v : R → R can be defined as the definite integral

〈u, v〉 := ∫_a^b u(x) v(x) dx   (3.30)

for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.30) evaluates to 0, the functions u and v are orthogonal. To make the above inner product mathematically precise, we need to take care of measures and the definition of integrals. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). Some careful definitions need to be observed, which requires a foray into real and functional analysis that we do not cover in this book.
Example (Inner Product of Functions)
If we choose u = sin(x) and v = cos(x), the integrand f(x) = u(x)v(x) of (3.30) is shown in Figure 3.7. We see that this function is odd, i.e., f(−x) = −f(x). Therefore, the integral with limits a = −π, b = π of this product evaluates to 0. Therefore, sin and cos are orthogonal functions.

Figure 3.7 f(x) = sin(x) cos(x).
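A quick numerical check of this integral (illustration only; it assumes NumPy and SciPy are available):

```python
import numpy as np
from scipy.integrate import quad

# Inner product of u = sin and v = cos on [-pi, pi], Eq. (3.30)
inner, _ = quad(lambda x: np.sin(x) * np.cos(x), -np.pi, np.pi)
print(inner)   # ≈ 0 (up to numerical error): sin and cos are orthogonal
```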
Remark. It also holds that the collection of functions

1, cos(x), cos(2x), cos(3x), . . .   (3.31)

is orthogonal if we integrate from −π to π, i.e., any pair of functions in this collection is orthogonal. ♦

In Chapter 6, we will have a look at a second type of unconventional inner products: the inner product of random variables.

3.6 Orthogonal Projections
Figure 3.8 Orthogonal projection of a two-dimensional data set onto a one-dimensional subspace. (a) Original data set. (b) Original data points (black) and their corresponding orthogonal projections (red) onto a lower-dimensional subspace (straight line).

Projections are an important class of linear transformations (besides rotations and reflections). Projections play an important role in graphics, coding theory, statistics, and machine learning. In machine learning, we often deal with data that is high-dimensional. High-dimensional data is often hard to analyze or visualize. However, high-dimensional data quite often possesses the property that only a few dimensions contain most information, and most other dimensions are not essential to describe key properties of the data. When we compress or visualize high-dimensional data, we will lose information. To minimize this compression loss, we ideally find the most informative dimensions in the data. Then, we can project the original high-dimensional data onto a lower-dimensional feature space ("feature" is a commonly used word for "data representation") and work in this lower-dimensional space to learn more about the data set and extract patterns. For example, machine learning algorithms, such as Principal Component Analysis (PCA) by Pearson (1901) and Hotelling (1933) and Deep Neural Networks (e.g., deep auto-encoders; Deng et al. (2010)), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.8.
Before we detail how to obtain these projections, let us define what a projection actually is.

Definition 3.7 (Projection). Let V be a vector space and W ⊆ V a subspace of V. A linear mapping π : V → W is called a projection if π² = π ∘ π = π.

Remark (Projection matrix). Since linear mappings can be expressed by transformation matrices, the definition above applies equally to a special
kind of transformation matrices, the projection matrices P_π, which exhibit the property that P_π² = P_π.
Since P_π² = P_π, it follows that all eigenvalues of P_π are either 0 or 1. The corresponding eigenspaces are the kernel and image of the projection, respectively. (A good illustration is given at http://tinyurl.com/p5jn5ws.) More details about eigenvalues and eigenvectors are provided in Chapter 4. ♦

In the following, we will derive orthogonal projections of vectors in the inner product space (R^n, 〈·, ·〉) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product 〈x, y〉 = x^⊤ y as the inner product.

3.6.1 Projection onto 1-Dimensional Subspaces (Lines)

Figure 3.9 Projection of x ∈ R^2 onto a subspace U with basis vector b.

Figure 3.10 Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.

Assume we are given a line (1-dimensional subspace) through the origin with basis vector b ∈ R^n. The line is a one-dimensional subspace U ⊆ R^n spanned by b. When we project x ∈ R^n onto U, we want to find the point πU(x) ∈ U that is closest to x. Using geometric arguments, let us characterize some properties of the projection πU(x) (Figure 3.9 serves as an illustration):
• The projection πU(x) is closest to x, where "closest" implies that the distance ‖x − πU(x)‖ is minimal. It follows that the segment πU(x) − x from πU(x) to x is orthogonal to U and, therefore, to the basis vector b of U. The orthogonality condition yields 〈πU(x) − x, b〉 = 0 since angles between vectors are defined by means of the inner product.
• The projection πU(x) of x onto U must be an element of U and, therefore, a multiple of the basis vector b that spans U. Hence, πU(x) = λb for some λ ∈ R; λ is then the coordinate of πU(x) with respect to b.

In the following three steps, we determine the coordinate λ, the projection πU(x) ∈ U, and the projection matrix P_π that maps an arbitrary x ∈ R^n onto U.
1. Finding the coordinate λ. The orthogonality condition yields

〈x − πU(x), b〉 = 0   (3.32)
⟺ 〈x − λb, b〉 = 0 ,   (3.33)

where we used πU(x) = λb. We can now exploit the bilinearity of the inner product and arrive at

〈x, b〉 − λ〈b, b〉 = 0   (3.34)
⟺ λ = 〈x, b〉 / 〈b, b〉 = 〈x, b〉 / ‖b‖² .   (3.35)

(With a general inner product, we get λ = 〈x, b〉 if ‖b‖ = 1.) If we choose 〈·, ·〉 to be the dot product, we obtain

λ = b^⊤ x / (b^⊤ b) = b^⊤ x / ‖b‖² .   (3.36)

If ‖b‖ = 1, then the coordinate λ of the projection is given by b^⊤ x.

2. Finding the projection point πU(x) ∈ U. Since πU(x) = λb, we immediately obtain with (3.36) that

πU(x) = λb = (〈x, b〉 / ‖b‖²) b = (b^⊤ x / ‖b‖²) b ,   (3.37)

where the last equality holds for the dot product only. We can also compute the length of πU(x) by means of Definition 3.1 as

‖πU(x)‖ = ‖λb‖ = |λ| ‖b‖ .   (3.38)

This means that our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU(x) with respect to the basis vector b that spans our one-dimensional subspace U.
If we use the dot product as the inner product, we get

‖πU(x)‖ = (|b^⊤ x| / ‖b‖²) ‖b‖ = |cos ω| ‖x‖ ‖b‖ ‖b‖ / ‖b‖² = |cos ω| ‖x‖ ,   (3.39)

where the first equality uses (3.37) and the second uses (3.26).
Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ‖x‖ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b (a one-dimensional subspace) is exactly cos ω, and the length of the corresponding vector πU(x) is |cos ω|. An illustration is given in Figure 3.10.

3. Finding the projection matrix P_π. We know that a projection is a linear mapping (see Definition 3.7). Therefore, there exists a projection matrix P_π such that πU(x) = P_π x. With the dot product as the inner product and

πU(x) = λb = bλ = b (b^⊤ x / ‖b‖²) = (b b^⊤ / ‖b‖²) x ,   (3.40)

we immediately see that

P_π = b b^⊤ / ‖b‖² .   (3.41)

Note that b b^⊤ is a symmetric matrix (of rank 1) and ‖b‖² = 〈b, b〉 is a scalar. Projection matrices are always symmetric.

The projection matrix P_π projects any vector x ∈ R^n onto the line through the origin with direction b (equivalently, the subspace U spanned by b).

Remark. The projection πU(x) ∈ R^n is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection; a single coordinate, λ, suffices if we express it with respect to the basis vector b that spans the subspace U. ♦
Example (Projection onto a Line)
Find the projection matrix P_π onto the line through the origin spanned by b = [1, 2, 2]^⊤. Here, b is a direction and a basis of the one-dimensional subspace (a line through the origin).
With (3.41), we obtain

P_π = b b^⊤ / (b^⊤ b) = (1/9) [1; 2; 2] [1 2 2] = (1/9) [1 2 2; 2 4 4; 2 4 4] .   (3.42)

Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]^⊤, the projected point is

πU(x) = P_π x = (1/9) [1 2 2; 2 4 4; 2 4 4] [1; 1; 1] = (1/9) [5; 10; 10] ∈ span[[1, 2, 2]^⊤] .   (3.43)

Note that the application of P_π to πU(x) does not change anything, i.e., P_π πU(x) = πU(x). This is expected because, according to Definition 3.7, we know that a projection matrix P_π satisfies P_π² x = P_π x. Therefore, πU(x) is also an eigenvector of P_π, and the corresponding eigenvalue is 1.

Figure 3.11 Projection onto a two-dimensional subspace U with basis b1, b2. The projection πU(x) of x ∈ R^3 onto U can be expressed as a linear combination of b1, b2, and the displacement vector x − πU(x) is orthogonal to both b1 and b2.
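A minimal numerical sketch of the line-projection example above (not part of the book; NumPy assumed):

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])

# Projection matrix onto the line spanned by b, Eq. (3.41)
P = np.outer(b, b) / (b @ b)

x = np.array([1.0, 1.0, 1.0])
p = P @ x
print(p)                        # [5/9, 10/9, 10/9]

# Applying P again does not change the projection (P is idempotent)
print(np.allclose(P @ p, p))    # True
```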
3.6.2 Projection onto General Subspaces

In the following, we look at orthogonal projections of vectors x ∈ R^n onto higher-dimensional subspaces U ⊆ R^n with dim(U) = m ≥ 1. An illustration is given in Figure 3.11.
Assume that (b1, . . . , bm) is an ordered basis of U. (If U is given by a set of spanning vectors that are not a basis, determine a basis b1, . . . , bm before proceeding.) Any projection πU(x) onto U is necessarily an element of U. Therefore, it can be represented as a linear combination of the basis vectors b1, . . . , bm of U, such that πU(x) = ∑_{i=1}^m λi bi. The basis vectors form the columns of B ∈ R^{n×m}, where B = [b1, . . . , bm].
As in the 1D case, we follow a three-step procedure to find the projection πU(x) and the projection matrix P_π:

1. Find the coordinates λ1, . . . , λm of the projection (with respect to the basis of U), such that the linear combination

πU(x) = ∑_{i=1}^m λi bi = Bλ ,   (3.44)
B = [b1, . . . , bm] ∈ R^{n×m} ,   λ = [λ1, . . . , λm]^⊤ ∈ R^m ,   (3.45)

is closest to x ∈ R^n. As in the 1D case, "closest" means "minimum distance", which implies that the vector connecting πU(x) ∈ U and x ∈ R^n must be orthogonal to all basis vectors of U. Therefore, we obtain m simultaneous conditions (assuming the dot product as the inner product)

〈b1, x − πU(x)〉 = b1^⊤ (x − πU(x)) = 0   (3.46)
⋮   (3.47)
〈bm, x − πU(x)〉 = bm^⊤ (x − πU(x)) = 0 ,   (3.48)

which, with πU(x) = Bλ, can be written as

b1^⊤ (x − Bλ) = 0   (3.49)
⋮   (3.50)
bm^⊤ (x − Bλ) = 0 ,   (3.51)

such that we obtain a homogeneous linear equation system

[b1^⊤; . . . ; bm^⊤] (x − Bλ) = 0 ⟺ B^⊤(x − Bλ) = 0   (3.52)
⟺ B^⊤ B λ = B^⊤ x .   (3.53)

The last expression is called the normal equation. Since b1, . . . , bm are a basis of U and, therefore, linearly independent, B^⊤ B ∈ R^{m×m} is regular and can be inverted. This allows us to solve for the coefficients/coordinates

λ = (B^⊤ B)^{−1} B^⊤ x .   (3.54)

The matrix (B^⊤ B)^{−1} B^⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires B^⊤ B to be positive definite, which is the case if B has full rank. (In practical applications, e.g., linear regression, we often add a "jitter term" εI to B^⊤ B to guarantee numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference; see Chapter 9 for details.)

2. Find the projection πU(x) ∈ U. We already established that πU(x) = Bλ. Therefore, with (3.54),

πU(x) = B(B^⊤ B)^{−1} B^⊤ x .   (3.55)

3. Find the projection matrix P_π. From (3.55), we can immediately see that the projection matrix that solves P_π x = πU(x) must be

P_π = B(B^⊤ B)^{−1} B^⊤ .   (3.56)

Remark. Comparing the solutions for projecting onto a one-dimensional subspace and the general case, we see that the general case includes the 1D case as a special case: If dim(U) = 1, then B^⊤ B ∈ R is a scalar and we can rewrite the projection matrix in (3.56) as P_π = B B^⊤ / (B^⊤ B), which is exactly the projection matrix in (3.41). ♦
Example (Projection onto a Two-dimensional Subspace)
For a subspace U = span[[1, 1, 1]^⊤, [0, 1, 2]^⊤] ⊆ R^3 and x = [6, 0, 0]^⊤ ∈ R^3, find the coordinates λ of x in terms of the subspace U, the projection point πU(x), and the projection matrix P_π.
First, we see that the generating set of U is a basis (linear independence) and write the basis vectors of U into a matrix

B = [1 0; 1 1; 1 2] .

Second, we compute the matrix B^⊤ B and the vector B^⊤ x as

B^⊤ B = [1 1 1; 0 1 2] [1 0; 1 1; 1 2] = [3 3; 3 5] ,   B^⊤ x = [1 1 1; 0 1 2] [6; 0; 0] = [6; 0] .   (3.57)

Third, we solve the normal equation B^⊤ B λ = B^⊤ x to find λ:

[3 3; 3 5] [λ1; λ2] = [6; 0] ⟺ λ = [5; −3] .   (3.58)

Fourth, the projection πU(x) of x onto U, i.e., into the column space of B, can be directly computed via

πU(x) = Bλ = [5; 2; −1] .   (3.59)

The corresponding projection error (also called the reconstruction error) is the norm of the difference between the original vector and its projection onto U, i.e.,

‖x − πU(x)‖ = ‖[1, −2, 1]^⊤‖ = √6 .   (3.60)

Fifth, the projection matrix (for any x ∈ R^3) is given by

P_π = B(B^⊤ B)^{−1} B^⊤ = (1/6) [5 2 −1; 2 2 2; −1 2 5] .   (3.61)

To verify the results, we can (a) check whether the displacement vector πU(x) − x is orthogonal to all basis vectors of U, and (b) verify that P_π = P_π² (see Definition 3.7).
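A numerical sketch of these five steps (an illustration, not part of the book; NumPy assumed):

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Solve the normal equation B^T B lam = B^T x, Eq. (3.53)
lam = np.linalg.solve(B.T @ B, B.T @ x)
print(lam)                          # [ 5. -3.]

# Projection point and projection matrix, Eqs. (3.55) and (3.56)
p = B @ lam
P = B @ np.linalg.inv(B.T @ B) @ B.T
print(p)                            # [ 5.  2. -1.]
print(np.linalg.norm(x - p))        # sqrt(6) ≈ 2.449

# Sanity checks: displacement orthogonal to the columns of B, P idempotent
print(B.T @ (x - p))                # ≈ [0. 0.]
print(np.allclose(P @ P, P))        # True
```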
Remark. The projections πU(x) are still vectors in R^n although they lie in an m-dimensional subspace U ⊆ R^n. However, to represent a projected vector we only need the m coordinates λ1, . . . , λm with respect to the basis vectors b1, . . . , bm of U. ♦

Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product. ♦

Projections allow us to look at situations where we have a linear system Ax = b without a solution; in this way, we can find approximate solutions to unsolvable linear equation systems using projections. Recall that having no solution means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Chapter 9.

3.6.3 Projection onto Affine Subspaces

Thus far, we discussed how to project a vector onto a lower-dimensional subspace U. In the following, we provide a solution to projecting a vector onto an affine subspace.

Figure 3.12 Projection onto an affine space. (a) The original setting; (b) the setting is shifted by −x0, so that x − x0 can be projected onto the direction space U; (c) the projection is translated back to x0 + πU(x − x0), which gives the final orthogonal projection πL(x).

Consider the setting in Figure 3.12(a). We are given an affine space L = x0 + U, where b1, b2 are basis vectors of U. To determine the orthogonal projection πL(x) of x onto L, we transform the problem into a problem that we know how to solve: the projection onto a vector subspace. In order to get there, we subtract the support point x0 from x and from L, so that L − x0 = U is exactly the vector subspace U. We can now use the orthogonal projections onto a subspace we discussed in Section 3.6.2 and obtain the projection πU(x − x0), which is illustrated in Figure 3.12(b). This projection can now be translated back into L by adding x0, such that
we obtain the orthogonal projection onto the affine space L as

πL(x) = x0 + πU(x − x0) ,   (3.62)

where πU(·) is the orthogonal projection onto the subspace U, i.e., the direction space of L; see Figure 3.12(c).
From Figure 3.12, it is also evident that the distance of x from the affine space L is identical to the distance of x − x0 from U, i.e.,

d(x, L) = ‖x − πL(x)‖ = ‖x − (x0 + πU(x − x0))‖   (3.63)
= d(x − x0, πU(x − x0)) .   (3.64)
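A short sketch of (3.62)-(3.64) with arbitrarily chosen (hypothetical) values for x0, B, and x (illustration only; NumPy assumed):

```python
import numpy as np

def project_onto_subspace(B, v):
    """Orthogonal projection of v onto the column space of B, Eq. (3.55)."""
    return B @ np.linalg.solve(B.T @ B, B.T @ v)

# Affine space L = x0 + U, where U is spanned by the columns of B
x0 = np.array([1.0, 0.0, 0.0])          # support point (not in U)
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Eq. (3.62): shift by -x0, project onto U, shift back
pi_L = x0 + project_onto_subspace(B, x - x0)
print(pi_L)
print(np.linalg.norm(x - pi_L))          # distance d(x, L), Eq. (3.63)
```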
3.7 Orthonormal Basis

In Section 2.6.1, we characterized properties of basis vectors and found that in an n-dimensional vector space, we need n basis vectors, i.e., n vectors that are linearly independent. In Sections 3.3 and 3.4, we used inner products to compute the length of vectors and the angle between vectors. In the following, we will discuss the special case where the basis vectors are orthogonal to each other and where the length of each basis vector is 1. We will then call this basis an orthonormal basis.
Let us introduce this more formally.

Definition 3.8 (Orthonormal Basis). Consider an n-dimensional vector space V and a basis b1, . . . , bn of V. If

〈bi, bj〉 = 0 for i ≠ j ,   (3.65)
〈bi, bi〉 = 1   (3.66)

for all i, j = 1, . . . , n, then the basis is called an orthonormal basis (ONB). If only (3.65) is satisfied, then the basis is called an orthogonal basis.

Note that (3.66) implies that every basis vector has length/norm 1. The Gram-Schmidt process (Strang, 2003) is a constructive way to iteratively build an orthonormal basis b1, . . . , bn given a set of non-orthogonal and unnormalized basis vectors.

Example (Orthonormal Basis)
The canonical/standard basis for a Euclidean vector space R^n is an orthonormal basis, where the inner product is the dot product of vectors.
In R^2, the vectors

b1 = (1/√2) [1; 1] ,   b2 = (1/√2) [1; −1]   (3.67)

form an orthonormal basis since b1^⊤ b2 = 0 and ‖b1‖ = 1 = ‖b2‖.
In Section 3.6, we derived projections of vectors x onto a subspace U with basis vectors b1, . . . , bk. If this basis is an ONB, i.e., (3.65)–(3.66) are satisfied, the projection equation (3.55) simplifies greatly to

πU(x) = B B^⊤ x ,   (3.68)

since B^⊤ B = I. This means that we no longer have to compute the tedious inverse from (3.55), which saves us much computation time.
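As a small sketch (not part of the book; NumPy assumed), we can orthonormalize the basis from the earlier two-dimensional projection example, e.g., with a QR decomposition, which yields an orthonormal basis of the column space, and check that the projection then reduces to B B^⊤ x:

```python
import numpy as np

# Non-orthonormal basis of a two-dimensional subspace of R^3
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

# Orthonormalize the columns; Q spans the same subspace as B
Q, _ = np.linalg.qr(B)
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q^T Q = I

# With an ONB, Eq. (3.68) gives the same projection as Eq. (3.55)
p_onb = Q @ Q.T @ x
p_general = B @ np.linalg.solve(B.T @ B, B.T @ x)
print(np.allclose(p_onb, p_general))     # True
```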
We will exploit the concept of an orthonormal basis in Chapter 12 and Chapter 10 when we discuss Support Vector Machines and Principal Component Analysis.

3.8 Rotations

An important category of linear mappings are rotations. A rotation is a linear mapping that rotates an object (counterclockwise) by an angle θ about the origin. Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object; see Figure 3.13.

Figure 3.13 The robotic arm needs to rotate its joints in order to pick up objects or to place them correctly. Figure taken from Deisenroth et al. (2015).
3.8.1 Rotations in the Plane

Consider the standard basis of R^2, e1 = [1; 0] and e2 = [0; 1], which defines the standard coordinate system in R^2. Assume we want to rotate this coordinate system by an angle θ as illustrated in Figure 3.14. Note that the rotated vectors are still linearly independent and, therefore, are a basis of R^2. This means that the rotation performs a basis change.
Figure 3.14 Rotation of the standard basis in R^2 by an angle θ: Φ(e1) = [cos θ, sin θ]^⊤ and Φ(e2) = [−sin θ, cos θ]^⊤.

Since rotations Φ are linear mappings, we can express them by a rotation matrix R(θ). Trigonometry allows us to determine the coordinates of the
rotated axes with respect to the standard basis in R^2. We obtain

Φ(e1) = [cos θ; sin θ] ,   Φ(e2) = [−sin θ; cos θ] .   (3.69)

Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as

R(θ) = [cos θ −sin θ; sin θ cos θ] .   (3.70)
Generally, in R^2, rotations of any vector x ∈ R^2 by an angle θ are given by

x′ = Φ(x) = R(θ)x = [cos θ −sin θ; sin θ cos θ] [x1; x2] = [x1 cos θ − x2 sin θ; x1 sin θ + x2 cos θ] .   (3.71)
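A minimal sketch of the rotation matrix in action (illustration only; NumPy assumed):

```python
import numpy as np

def rotation_2d(theta):
    """Planar rotation matrix R(theta) from Eq. (3.70)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.pi / 2                          # 90 degrees
R = rotation_2d(theta)
print(R @ np.array([1.0, 0.0]))            # ≈ [0, 1]: e1 is rotated onto e2

# Rotations preserve lengths
x = np.array([3.0, 4.0])
print(np.linalg.norm(R @ x))               # 5.0, same as ||x||
```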
3.8.2 Rotations in Three Dimensions

In three dimensions, we have to define what "counterclockwise" means: We will go by the convention that a "counterclockwise" (planar) rotation about an axis refers to a rotation about an axis when we look at the axis "from the end toward the origin". In R^3, there are three (planar) rotations about the three standard basis vectors (see Figure 3.15):

• Counterclockwise rotation about the e1-axis:

R1(θ) = [Φ(e1) Φ(e2) Φ(e3)] = [1 0 0; 0 cos θ −sin θ; 0 sin θ cos θ] .   (3.72)

Here, the e1 coordinate is fixed, and the counterclockwise rotation is performed in the e2e3-plane.
• Counterclockwise rotation about the e2-axis:

R2(θ) = [cos θ 0 sin θ; 0 1 0; −sin θ 0 cos θ] .   (3.73)

If we rotate the e1e3-plane about the e2-axis, we need to look at the e2-axis from its "tip" toward the origin.

Figure 3.15 Rotation of a vector (gray) in R^3 by an angle θ about the e3-axis. The rotated vector is shown in blue.
• Counterclockwise rotation about the e3-axis:

R3(θ) = [cos θ −sin θ 0; sin θ cos θ 0; 0 0 1] .   (3.74)

Figure 3.15 illustrates this.

3.8.3 Rotations in n Dimensions

The generalization of rotations from 2D and 3D to n-dimensional Euclidean vector spaces can be intuitively described as fixing n − 2 dimensions and restricting the rotation to a two-dimensional plane in the n-dimensional space.

Definition 3.9 (Givens Rotation). Let V be an n-dimensional Euclidean vector space and Φ : V → V an automorphism with transformation matrix

Rij(θ) := [I_{i−1} 0 ··· ··· 0; 0 cos θ 0 −sin θ 0; 0 0 I_{j−i−1} 0 0; 0 sin θ 0 cos θ 0; 0 ··· ··· 0 I_{n−j}] ∈ R^{n×n} ,   θ ∈ R ,   (3.75)

for 1 ≤ i < j ≤ n. Then Rij(θ) is called a Givens rotation. Essentially, Rij(θ) is the identity matrix In with

rii = cos θ ,   rij = −sin θ ,   rji = sin θ ,   rjj = cos θ .   (3.76)
In two dimensions (i.e., n = 2), we obtain (3.70) as a special case.
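A small sketch (not part of the book; NumPy assumed) that builds Rij(θ) by modifying the identity matrix as in (3.76):

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, Eq. (3.76); here i < j are 0-based."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[i, j] = -s
    R[j, i] = s
    R[j, j] = c
    return R

# For n = 2, we recover the planar rotation matrix (3.70)
print(givens(2, 0, 1, np.pi / 4))

# A rotation in the plane spanned by e1 and e3 in R^4; the other dimensions stay fixed
print(givens(4, 0, 2, np.pi / 2))
```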
3.8.4 Properties of Rotations

Rotations exhibit some useful properties (a numerical check of the first two follows after this list):

• The composition of rotations can be expressed as a single rotation, where the angles are summed up:

R(φ)R(θ) = R(φ + θ) .   (3.77)

• Rotations preserve lengths, i.e., ‖x‖ = ‖R(φ)x‖. The original vector and the rotated vector have the same length. This also implies that det(R(φ)) = 1.
• Rotations preserve distances, i.e., ‖x − y‖ = ‖R(θ)x − R(θ)y‖. In other words, rotations leave the distance between any two points unchanged after the transformation.
• Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotations are applied is important, even if they rotate about the same point. Only in two dimensions are vector rotations commutative, such that R(φ)R(θ) = R(θ)R(φ) for all φ, θ ∈ [0, 2π), and they form an Abelian group (with multiplication) if they rotate about the same point (e.g., the origin).
• Rotations in the plane have no real eigenvalues, except when we rotate about nπ for n ∈ Z.
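A quick numerical check of the composition and length-preservation properties (illustration only; NumPy assumed):

```python
import numpy as np

def R(theta):
    """Planar rotation matrix, Eq. (3.70)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

phi, theta = 0.3, 1.1
x = np.array([2.0, -1.0])

# Composition: R(phi) R(theta) = R(phi + theta), Eq. (3.77)
print(np.allclose(R(phi) @ R(theta), R(phi + theta)))               # True

# Length preservation and determinant 1
print(np.isclose(np.linalg.norm(R(phi) @ x), np.linalg.norm(x)))    # True
print(np.isclose(np.linalg.det(R(phi)), 1.0))                       # True
```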
3.9 Further Reading

In this chapter, we gave a brief overview of some of the important concepts of analytic geometry, which we will use in later chapters of the book. For a broader and more in-depth overview of some of the concepts we presented, we refer, for example, to the excellent book by Boyd and Vandenberghe (2018).
Inner products allow us to determine specific bases of vector (sub)spaces, where each vector is orthogonal to all others (orthogonal bases), using the Gram-Schmidt method. These bases are important in optimization and numerical algorithms for solving linear equation systems. For instance, Krylov subspace methods, such as Conjugate Gradients or GMRES, minimize residual errors that are orthogonal to each other (Stoer and Burlirsch, 2002).
In machine learning, inner products are important in the context of kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many linear algorithms can be expressed purely by inner product computations. Then, the "kernel trick" allows us to compute these inner products implicitly in a (potentially infinite-dimensional) feature space, without even knowing this feature space explicitly. This allowed the "non-linearization" of many algorithms used in machine learning, such as kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic regression (fitting curves to data points). The idea of kernels is explored further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shadows. In optimization, orthogonal projections are often used to (iteratively) minimize residual errors. This also has applications in machine learning, e.g., in linear regression, where we want to find a (linear) function that minimizes the residual errors, i.e., the lengths of the orthogonal projections of the data onto the linear function (Bishop, 2006). We will investigate this further in Chapter 9. PCA (Hotelling, 1933; Pearson, 1901) also uses projections to reduce the dimensionality of high-dimensional data. We will discuss this in more detail in Chapter 10.

Exercises
3.1 Show that 〈·, ·〉, defined for all x = (x1, x2) and y = (y1, y2) in R^2 by

〈x, y〉 := x1y1 − (x1y2 + x2y1) + 2(x2y2) ,

is an inner product.

3.2 Consider R^2 with 〈·, ·〉 defined for all x and y in R^2 as

〈x, y〉 := x^⊤ A y ,   A := [2 0; 1 2] .

Is 〈·, ·〉 an inner product?
3.3 Consider the Euclidean vector space R^5 with the dot product. A subspace U ⊆ R^5 and x ∈ R^5 are given by

U = span[ [0; −1; 2; 0; 2], [1; −3; 1; −1; 2], [−3; 4; 1; 2; 1], [−1; −3; 5; 0; 7] ] ,   x = [−1; −9; −1; 4; 1] .

1. Determine the orthogonal projection πU(x) of x onto U.
2. Determine the distance d(x, U).
3.4 Consider R^3 with the inner product

〈x, y〉 := x^⊤ [2 1 0; 1 2 −1; 0 −1 2] y .

Furthermore, we define e1, e2, e3 as the standard/canonical basis in R^3.

1. Determine the orthogonal projection πU(e2) of e2 onto U = span[e1, e3].
   Hint: Orthogonality is defined through the inner product.
2. Compute the distance d(e2, U).