# Visual Explanation of the Conjugate Gradient Algorithm

So assume now that we have a set of \(n\) A-orthogonal (or conjugate) directions along which we want to take our descent. Now this is a big assumption: How do we find such a set of vectors? This is not clear at all right now, but we will face that problem later. For now, we imagine a good fairy handing us a set of vectors \[\{d_0,\ldots, d_{n-1}\}\] which are conjugate (or A-orthogonal), i.e. for any \(i \neq j\):
\[d_i^T A d_j = 0.\]

For example, in the setting of the optimization problem and with conjugate vectors from figure 7 in \(\mathbb R^2\), any pair of vectors depicted there will do.
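To make the definition concrete, here is a minimal sketch that tests two pairs of vectors for conjugacy. The matrix `A` is an assumption (it is not stated in this section); the two pairs are the direction sets used in the numerical examples further down. The helper names `dot`, `matvec` and `is_conjugate` are made up for this illustration.

```python
def dot(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]

def is_conjugate(A, u, v, tol=1e-12):
    """True if u and v are A-orthogonal, i.e. u^T A v is numerically zero."""
    return abs(dot(u, matvec(A, v))) <= tol

# Assumed example matrix (symmetric positive definite); not given in this section.
A = [[3.0, 2.0], [2.0, 6.0]]

pair_1 = ([12.0, 8.0], [18.0, -13.0])
pair_2 = ([0.0, 1.0], [3.0, -1.0])

for d0, d1 in (pair_1, pair_2):
    print(d0, d1,
          "conjugate:", is_conjugate(A, d0, d1),
          "orthogonal:", abs(dot(d0, d1)) <= 1e-12)
```

Under this assumption both pairs come out conjugate although neither pair is orthogonal in the usual sense.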

Then we start in some \(x_0\in \mathbb R^n\) and define \[\begin{aligned}&x_1 &&= x_0 + \alpha_0 \cdot d_0\\ &x_2 &&= x_1 + \alpha_1 \cdot d_1\\ &&&\quad\vdots \\ &x_{i+1} &&= x_i + \alpha_i \cdot d_i \\ &&&\quad\vdots \\ &x_{n} &&= x_{n-1} + \alpha_{n-1}\cdot d_{n-1}\end{aligned}\]

Also, we are being really greedy by demanding that \(x_n = x^\star,\) i.e. we want the exact solution to be found in at most \(n\) steps. Note that the only unknowns in this sequence of equations are the step sizes \(\alpha_i;\) everything else is either one of the directions \(d_i\) or the result of the preceding calculation.

How do we set those step sizes \(\alpha_i\) such that we achieve our goal of \(x_n = x^\star\)? The requirement is that \(e_{i+1}\) is conjugate to \(d_i\). Intuitively, this means that we will never go in direction of \(d_i\) again. It is simple, really: If we force the algorithm to make do with \(n\) steps, each of which points in a specific direction \(d_i,\) it can only go in direction of \(d_i\) in this step (or it will need an additional step along \(d_i\) later, which we forbid).

(There is a big caveat to this “never” that you might have spotted: Gradient descent had the problem that we lost optimality along directions which we had already optimized along. How do we make sure something like this does not happen here? This is a really important point which will be resolved later. For now, do not worry about it.)

So, this means that we need \[d_i^T Ae_{i+1} = 0,\]
i.e.
\[ \begin{aligned}0 &= d_i^T A e_{i+1} = d_i^TA(e_i + \alpha_i d_i)\\
&= d_i^T (-r_i + \alpha_i Ad_i),\end{aligned}\]
which is equivalent to
\[\alpha_i = \frac{d_i^T r_i}{d_i^T A d_i}.\]
In contrast to the analogous condition for the “orthogonal directions” idea, this is actually computable: The \(d_i\) are fixed (again, we still have to talk about how on earth we can get those vectors, but that is for later), and the residual \(r_i\) is just \(b-Ax_i,\) which is easily obtained.

At the same time, this conjugacy condition has another interpretation (note that \(0 = d_i^T A e_{i+1}\) means that \(d_i\) and \(e_{i+1}\) are conjugate): Choosing \(\alpha_i\) is equivalent to finding the minimal point along the search direction \(d_i\):
\[\begin{aligned} 0 &= \frac{\mathrm d}{\mathrm d \alpha} f(x_{i+1})\\
&= f'(x_{i+1})^T\frac{\mathrm d}{\mathrm d \alpha} x_{i+1} \\
&= -r_{i+1}^T d_i \\
&= d_i^TAe_{i+1}.\end{aligned}\]
This is nice: The global minimization procedure is reduced to a sequence of minimization procedures along fixed, conjugate directions. Contrast this with our failed idea of orthogonal directional descent, where we would have been forced to climb up the hill first before going along the next direction, even with the impracticality of orthogonal directions aside.
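The line-minimization reading can be checked numerically. The sketch below assumes the quadratic objective \(f(x) = \tfrac12 x^TAx - b^Tx\) with the example values `A` and `b` given in the code (assumptions, not stated in this section) and the start point \(x_0 = (-2,-2)\). It computes the step size from the conjugacy condition and compares \(f\) at that step with \(f\) slightly before and after it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def f(A, b, x):
    """Quadratic objective 1/2 x^T A x - b^T x (constant term dropped)."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)

A = [[3.0, 2.0], [2.0, 6.0]]   # assumed example matrix
b = [2.0, -8.0]                # assumed right-hand side
x0 = [-2.0, -2.0]
d0 = [0.0, 1.0]                # some fixed search direction

r0 = [bi - yi for bi, yi in zip(b, matvec(A, x0))]     # residual b - A x0
alpha0 = dot(d0, r0) / dot(d0, matvec(A, d0))          # step from conjugacy

def f_along(t):
    """Objective restricted to the line x0 + t * d0."""
    return f(A, b, [xi + t * di for xi, di in zip(x0, d0)])

print("alpha_0 =", alpha0)
for t in (alpha0 - 0.1, alpha0, alpha0 + 0.1):
    print("t =", t, " f =", f_along(t))
```

The middle value of `f` is the smallest of the three, as expected for the minimizer of a convex parabola.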

Now we claim that the sequence of computations
\[\begin{aligned}&x_1 &&= x_0 + \alpha_0 \cdot d_0\\ &x_2 &&= x_1 + \alpha_1 \cdot d_1\\ &&&\quad\vdots \\ &x_{i+1} &&= x_i + \alpha_i \cdot d_i \\ &&&\quad\vdots \\ &x_{n} &&= x_{n-1} + \alpha_{n-1}\cdot d_{n-1}\end{aligned}\]
with
\[\alpha_i = \frac{d_i^T r_i}{d_i^T A d_i}\]
finds us the optimum in at most \(n\) steps, i.e. \(x_n = x^\star.\)

Before we go into the details of why this is true, let's try to find numerical validation first: We take our minimization problem from before, we start at \(x_0 = (-2,-2)\) and we take the following set of conjugate directions: \[d_0 = \begin{pmatrix} 12 \\ 8\end{pmatrix}, d_1 = \begin{pmatrix} 18 \\ -13\end{pmatrix}.\]
For now we do not worry about how we got this particular set of conjugate directions. Incidentally (not really, of course), we have chosen the first vector to be pointing along the direction of steepest descent at the initial value \(x_0.\) The iterations are calculated here:

We have chosen \(d_0 = r_0\) so that we actually start off in direction of steepest descent. This is a nice but (virtually) arbitrary choice.
\[\begin{aligned}
x_1 &= x_0 + \alpha_0 d_0 = \begin{pmatrix}0.08\\-0.61\overline 3\end{pmatrix}\\
r_1 &= b - Ax_1 = \begin{pmatrix} 2.98\overline 6\\-4.48\end{pmatrix}\\
x_2 &= x_1 + \alpha_1 d_1 = \begin{pmatrix}2\\-2\end{pmatrix},\end{aligned} \]
and \(x_2= x^\star\) indeed.

Plotting those two iterations yields figure 7. Note how the conjugate directions look perpendicular in the stretched space on the right-hand side.

It seems that the algorithm indeed terminates after two steps. That might be a coincidence, so we try another pair of conjugate directions, this time not picking the steepest descent as our first vector:
\[d_0 = \begin{pmatrix} 0 \\ 1\end{pmatrix}, d_1 = \begin{pmatrix} 3 \\ -1\end{pmatrix}.\]
The math is here:

\[\begin{aligned}
x_1 &= x_0 + \alpha_0 d_0 = \begin{pmatrix}-2\\-2/3\end{pmatrix}\\
r_1 &= b - Ax_1 = \begin{pmatrix} 28/3\\0\end{pmatrix}\\
x_2 &= x_1 + \alpha_1 d_1 = \begin{pmatrix}2\\-2\end{pmatrix},\end{aligned} \]
and \(x_2= x^\star\) again.

Once again, the conjugate directions algorithm terminates in two steps.
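The two runs can be reproduced with a short sketch. The values of `A` and `b` are assumptions (they are not stated in this section, but they are consistent with the iterates listed in both examples), and `conjugate_directions` is a hypothetical helper name. The step size is \(\alpha_i = d_i^Tr_i / d_i^TAd_i\) as derived earlier.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_directions(A, b, x0, directions):
    """Take one exact line-search step along each given direction, in order."""
    x = list(x0)
    iterates = [x]
    for d in directions:
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # residual b - A x
        alpha = dot(d, r) / dot(d, matvec(A, d))           # step size alpha_i
        x = [xi + alpha * di for xi, di in zip(x, d)]
        iterates.append(x)
    return iterates

A = [[3.0, 2.0], [2.0, 6.0]]   # assumed example matrix
b = [2.0, -8.0]                # assumed right-hand side
x0 = [-2.0, -2.0]

run_1 = conjugate_directions(A, b, x0, [[12.0, 8.0], [18.0, -13.0]])
run_2 = conjugate_directions(A, b, x0, [[0.0, 1.0], [3.0, -1.0]])

for name, run in (("first pair", run_1), ("second pair", run_2)):
    print(name, "x_1 =", run[1], "x_2 =", run[2])
```

Both runs should end at \((2,-2)\) up to rounding, with the intermediate points \(x_1\) matching the values listed above.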

In three dimensions, the conjugate directions algorithm takes three steps.

So, if you accept those three positive examples as evidence for the effectiveness of the conjugate directions optimization method, you may wonder how and why it works. The next lemma proves this.

Lemma 1: Convergence of the conjugate directions optimization method in \(n\) steps.

Given a basis of conjugate vectors \(\{d_i\}_{i=0}^{n-1}\), the method of conjugate directions arrives at the optimum in at most \(n\) steps.

Proof:

The proof for this is actually quite short and really illuminating. We start by considering the initial error \(e_0,\) i.e. the (unknown) difference \(x_0-x^\star.\) This we can write as
\[e_0 = \sum_{j=0}^{n-1}\delta_j \cdot d_j.\]
Why is that possible? The set \(\{d_i\}_i\) constitutes a basis of \(\mathbb R^n\) and \(e_0\) is an element of that vector space, hence it can be obtained as a linear combination of the basis.

How do we obtain the \(\delta_j\)? We can multiply this characterization in turn by each \(d_k^TA\) from the left. Then every time, by conjugacy, all but one term on the right-hand side vanishes:
\[d_k^TAe_0 = \sum_{j=0}^{n-1}\delta_j \cdot d_k^TAd_j = \delta_k \cdot d_k^TAd_k.\]
We solve for \(\delta_k\) and obtain \[\delta_k = \frac{d_k^TAe_0}{d_k^TAd_k}.\]
The next step is “adding 0”: We can include extra terms next to \(e_0\) which vanish together with the term \(d_k^TA\) in front:
\[\delta_k = \frac{d_k^TA\left({\color{blue}e_0 + \sum_{i=0}^{k-1}\alpha_i d_i}\right)}{d_k^TAd_k}.\]
Now recall that our iterations are given by
\[x_{i+1} = x_i + \alpha_i d_i\]
and thus
\[e_{i+1} = e_i + \alpha_i d_i\]
and, iteratively,
\[e_k = {\color{blue}e_0 +\sum_{i=0}^{k-1}\alpha_i d_i},\]
so, all in all, replacing the blue term:
\[\delta_k = \frac{d_k^TAe_k}{d_k^TAd_k} = -\frac{d_k^Tr_k}{d_k^TAd_k}.\]
This is exactly the negative of our step size:
\[\alpha_k = -\delta_k.\]
What does that mean? There are two ways of thinking about the conjugate directions method:

The initial position \(x_0\) is updated step by step in direction of the conjugate directions \(d_i\) and is built up along those components.

Alternatively, we can think of \(e_0\) starting with error contributions from all directions, with each step along a conjugate direction \(d_i\) “biting away” the corresponding component in this error term until the last step takes away the remaining error:

\[\begin{aligned}e_i &= e_0 + \sum_{j=0}^{i-1}\alpha_j \cdot d_j\\
&= \sum_{j=0}^{n-1}\delta_j\cdot d_j + \sum_{j=0}^{i-1}(-\delta_j) \cdot d_j\\
&= \sum_{j=i}^{n-1}\delta_j \cdot d_j\end{aligned} \]

Note that this sum contains fewer and fewer terms with each iteration \(i\mapsto i+1\) and loses its last term at \(n-1\mapsto n,\) which is exactly the last iteration.
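As a sanity check of the proof, the following sketch computes the coefficients \(\delta_k\) of the initial error and compares them with the step sizes \(\alpha_k\) that the method actually takes. It reuses the first pair of directions from the numerical example; `A` and `b` are assumed example values, and \(x^\star = (2,-2)\) is used only to form \(e_0\) (in practice it is unknown).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

A = [[3.0, 2.0], [2.0, 6.0]]   # assumed example matrix
b = [2.0, -8.0]                # assumed right-hand side
x0 = [-2.0, -2.0]
x_star = [2.0, -2.0]           # solution; only used to form the error e_0
directions = [[12.0, 8.0], [18.0, -13.0]]

# Coefficients of e_0 in the conjugate basis: delta_k = d_k^T A e_0 / d_k^T A d_k
e0 = [a - s for a, s in zip(x0, x_star)]
Ae0 = matvec(A, e0)
deltas = [dot(d, Ae0) / dot(d, matvec(A, d)) for d in directions]

# Step sizes taken by the method: alpha_k = d_k^T r_k / d_k^T A d_k
x = list(x0)
alphas = []
for d in directions:
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    alpha = dot(d, r) / dot(d, matvec(A, d))
    alphas.append(alpha)
    x = [xi + alpha * di for xi, di in zip(x, d)]

# Rebuild e_0 from its coefficients to confirm the basis expansion
rebuilt_e0 = [sum(dl * d[c] for dl, d in zip(deltas, directions)) for c in range(2)]

print("deltas:", deltas)
print("alphas:", alphas)
print("e_0   :", e0, "rebuilt:", rebuilt_e0)
```

Each `alphas[k]` should equal `-deltas[k]` up to rounding, which is the statement \(\alpha_k = -\delta_k\).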

Let's take another closer look at the Conjugate Directions method, i.e. somebody gave us a nice set of \(n\) conjugate (A-orthogonal) directions \(\{d_i\}_{i=0}^{n-1}\).

We make another observation: Conjugate Directions finds at each step the best solution within the bounds of where it has been allowed to explore.

What does that mean? We define \[\mathcal D_i := \operatorname{span}\{d_0,d_1,\ldots,d_{i-1}\}.\] Then from the equation
\[e_i = e_0 + \sum_{j=0}^{i-1}\alpha_j d_j,\]
we get that \(e_i \in e_0 + \mathcal D_i.\) We claim that Conjugate Directions works such that \(e_i\) is “optimal within the bounds of where it has been allowed to explore”. Where has the iteration been allowed to explore so far? That is exactly \(x_0 + \mathcal D_i.\) Equivalently, we can say that the error has been “allowed to explore” \(e_0 + \mathcal D_i.\) In which sense is this optimal?
The claim is the following lemma:

Lemma 2: Optimality of the error term

We define the \(A\)-norm by
\[\|x\|_A^2 := x^TAx.\]
Then
\[ e_i = \operatorname{argmin} \{\|\epsilon\|_A:~ \epsilon\in e_0 + \mathcal D_i\}.\]
I.e., \(e_i\) is the element in \(e_0 + \mathcal D_i\) with minimal \(A\)-norm.

Proof:

We can decompose (see the proof of lemma 1)
\[ e_i = \sum_{j=i}^{n-1}\delta_j\cdot d_j\]
and thus
\[\|e_i\|_A^2 = \left\|\sum_{j=i}^{n-1}\delta_j\cdot d_j\right\|_A^2.\]
On the other hand, an arbitrary element \(\epsilon \in e_0 + \mathcal D_i\) has the following form:
\[\epsilon = \underbrace{\sum_{j=0}^{n-1}\delta_j\cdot d_j}_{=e_0} + \underbrace{\sum_{j=0}^{i-1} \kappa_j\cdot d_j}_{\in \mathcal D_i} = \sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j + \sum_{j=i}^{n-1} \delta_j\cdot d_j.\]
The A-norm of \(\epsilon\) is given by
\[\begin{aligned} \|\epsilon\|_A^2 &= \left\| \sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j + \sum_{j=i}^{n-1} \delta_j\cdot d_j\right\|_A^2\\
&= \left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right)^TA \left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right) \\
&+ \sout{\left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right)^TA \left( \sum_{j=i}^{n-1} \delta_j\cdot d_j\right)}\\
&+ \sout{\left(\sum_{j=i}^{n-1} \delta_j\cdot d_j\right)^TA \left(\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right)}\\
&+ \left(\sum_{j=i}^{n-1} \delta_j\cdot d_j\right)^TA \left(\sum_{j=i}^{n-1} \delta_j\cdot d_j\right)\\
&= \left\|\sum_{j=0}^{i-1} (\delta_j + \kappa_j)\cdot d_j\right\|_A^2 + \left\|\sum_{j=i}^{n-1} \delta_j\cdot d_j\right\|_A^2 \\
&\geq \left\|\sum_{j=i}^{n-1} \delta_j\cdot d_j\right\|_A^2 \\
&= \|e_i\|_A^2\end{aligned}\]
Note that the two terms in the third and fourth line vanish due to A-orthogonality of the \(d_j\). In the end we just drop a nonnegative term and obtain exactly the A-norm of \(e_i\). This means that for any \(\epsilon \in e_0 + \mathcal D_i,\) we have
\[\|\epsilon\|_A^2 \geq \|e_i\|_A^2,\]
which is exactly optimality of \(e_i.\)
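Lemma 2 can be probed numerically for \(i=1\) in the two-dimensional example. The sketch below assumes the example matrix `A` and the initial error \(e_0 = x_0 - x^\star = (-4, 0)\), and compares \(\|e_1\|_A^2\) with the A-norm of other elements \(e_0 + \kappa d_0\) on a grid of \(\kappa\) values. The grid is only a spot check, not a proof.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def a_norm_sq(A, v):
    """Squared A-norm v^T A v."""
    return dot(v, matvec(A, v))

A = [[3.0, 2.0], [2.0, 6.0]]   # assumed example matrix
e0 = [-4.0, 0.0]               # x_0 - x_star for x_0 = (-2,-2), x_star = (2,-2)
d0 = [12.0, 8.0]

# Step of Conjugate Directions expressed via the error: alpha_0 = -delta_0
alpha0 = -dot(d0, matvec(A, e0)) / dot(d0, matvec(A, d0))
e1 = [ei + alpha0 * di for ei, di in zip(e0, d0)]
best = a_norm_sq(A, e1)

# Other elements of e_0 + D_1, parametrized by kappa in [-1, 1]
kappas = [k / 100.0 for k in range(-100, 101)]
candidates = [a_norm_sq(A, [ei + k * di for ei, di in zip(e0, d0)]) for k in kappas]

print("||e_1||_A^2        =", best)
print("smallest on grid   =", min(candidates))
```

No grid value falls below `best`, in line with the lemma.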

Let's talk about intuition next. What does this sense of optimality mean?

If we start with some \(x_0\), then we pick \(x_1\) such that it is the minimal point of \(x_0 + \mathcal D_1.\) This is just how we defined Conjugate Directions Descent (and exactly how Gradient Descent worked as well).

The surprising part is that optimality is conserved for all following iterations as well. This does not work in Gradient Descent, as we have already seen:

Consider again figure 2 (shown again below for convenience): Starting at \(x_0,\) Gradient Descent optimizes along search direction \(r_0\), which yields \(x_1.\) From there, it optimizes along search direction \(r_1\), yielding \(x_2.\) After that, the next search direction \(r_2\) points along \(r_0\) again, i.e. it is the same search direction as in step 1. But now, we are not optimal in \(x_0 + \mathcal D_2\) anymore! We have to optimize along direction \(r_0\) again. And then again along \(r_1,\) because we have lost optimality in that direction as well, and so on…
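The zig-zag behaviour is easy to reproduce. The sketch below runs steepest descent with exact line search on the assumed example problem (`A`, `b` and the start point are assumptions, chosen to match the numerical examples in this section) and checks that the third search direction is parallel to the first while \(x_2\) is still away from the optimum.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

A = [[3.0, 2.0], [2.0, 6.0]]   # assumed example matrix
b = [2.0, -8.0]                # assumed right-hand side
x = [-2.0, -2.0]

iterates = [x]
residuals = []
for _ in range(3):
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # steepest descent direction
    alpha = dot(r, r) / dot(r, matvec(A, r))           # exact line search along r
    x = [xi + alpha * ri for xi, ri in zip(x, r)]
    residuals.append(r)
    iterates.append(x)

r0, r1, r2 = residuals
# The 2D cross product vanishes exactly when r0 and r2 are parallel
cross = r0[0] * r2[1] - r0[1] * r2[0]
print("r_0 . r_1 =", dot(r0, r1))
print("r_0 x r_2 =", cross)
print("x_2       =", iterates[2])
```

In two dimensions consecutive residuals are orthogonal, so every other search direction repeats; \(x_2\) lands near \((1.004, -2)\) rather than at \((2,-2)\).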

So, why should we indeed expect that “optimality with respect to search direction \(d_0\)” is conserved after making another step along \(d_1\) and stopping at \(x_2\)? There are again two possible explanations for that (take whichever suits you better):

• Analytically: At point \(x_2,\) our remaining error term is \(e_2 = \sum_{j=2}^{n-1} \delta_j \cdot d_j\) (this is lemma 1 and its proof), i.e. there is no contribution along direction \(d_0\) left. (There is no contribution from direction \(d_1\) either, but that is less surprising because \(x_2\) was obtained by cutting away the \(d_1\) contribution.) This way, we cannot decrease the A-norm anymore by going along \(d_0.\) We are still optimal along direction \(d_0!\)
• By stretching: Look at figure 11b: \(x_1\) was obtained by selecting a point on \(x_0 + \mathcal D_1\) such that \(e_1\) was A-orthogonal to \(d_0\). In this stretched figure, this means that \(e_1\) appears perpendicular to \(d_0=r_0.\) Arrived at \(x_2,\) we see that this has not changed: Now \(d_0\) and \(e_2\) appear perpendicular, i.e. they are in fact A-orthogonal. By going even a short distance away from \(x_2\) in direction of \(d_0\) (shown dashed), we would leave the smaller sphere depicting a set of constant A-norm, thereby increasing our A-norm. Hence, \(x_2\) is still optimal in direction of \(d_0.\) Note that we are of course also optimal with respect to the more recent search direction \(d_1,\) but that is to be expected. The non-losing of optimality along previous search directions is more surprising.

A good mental picture for the way figure 11b works is to imagine yourself standing at the solution point \(x^\star,\) pulling a string connected to a bead that is constrained to lie in \(x_0 + \mathcal D_i.\) Each time the expanding subspace \(\mathcal D\) is enlarged by a dimension, the bead becomes free to move a little closer to you.

The next property of Conjugate Directions Descent will be of paramount importance:

Lemma 3: Orthogonality of the residual

In the \(i\)th iteration of Conjugate Directions Descent, the residual is orthogonal to all previous search directions:
\[r_{i+1} ~\bot~ d_0, d_1, \ldots, d_i.\]

Proof:

Recall that the error of the \((i+1)\)th iterate is given by
\[e_{i+1} = \sum_{l=i+1}^{n-1}\delta_l\cdot d_l.\]
Then we premultiply this equation on both sides with \(-d_j^TA\) for some \(j \leq i,\) which yields
\[\begin{aligned}-d_j^T\underbrace{Ae_{i+1}}_{=-r_{i+1}} &= - \sum_{l=i+1}^{n-1}\delta_l \cdot \underbrace{d_j^TAd_l}_{=0~ (j<l)}\\
d_j^Tr_{i+1} &= 0.\end{aligned}\]

Note that an equivalent proposition does not hold for Gradient Descent: The new descent direction \(r_{i+1}\) is indeed orthogonal to the previous descent direction \(r_i\), but not orthogonal to \(r_{i-1}\) (which can be seen by looking at the characteristic zig-zag pattern).
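Lemma 3 only becomes interesting beyond two dimensions, so here is a small three-dimensional sketch. Everything in it is an assumption made for illustration: the matrix `A`, the vector `b`, and the way the conjugate directions are produced (Gram-Schmidt conjugation of the standard basis, in the hypothetical helper `conjugate_basis`). It records the largest inner product between a new residual and any earlier search direction.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_basis(A):
    """Gram-Schmidt conjugation of the standard basis: returns A-orthogonal directions."""
    n = len(A)
    directions = []
    for k in range(n):
        u = [1.0 if c == k else 0.0 for c in range(n)]
        d = list(u)
        for dj in directions:
            coeff = dot(u, matvec(A, dj)) / dot(dj, matvec(A, dj))
            d = [dc - coeff * djc for dc, djc in zip(d, dj)]
        directions.append(d)
    return directions

# Assumed symmetric positive definite test problem
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = [0.0, 0.0, 0.0]

directions = conjugate_basis(A)
worst = 0.0   # largest |r_{i+1}^T d_j| over all j <= i
for i, d in enumerate(directions):
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    alpha = dot(d, r) / dot(d, matvec(A, d))
    x = [xi + alpha * di for xi, di in zip(x, d)]
    r_next = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    for dj in directions[: i + 1]:
        worst = max(worst, abs(dot(r_next, dj)))

final_residual = [bi - yi for bi, yi in zip(b, matvec(A, x))]
print("largest |r_{i+1}^T d_j| :", worst)
print("final residual          :", final_residual)
```

Both printed quantities should be zero up to rounding: the residual stays orthogonal to every direction already used, and after three steps the system is solved.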

Here’s a recap of this (long) section’s main ideas:

• The conjugate directions optimization method works really well.
• We arrive at the optimum in at most \(n\) steps.
• The iteration of “local optimization procedures” (along the current search directions \(d_i\)) does not lose its permanence as gradient descent does: Once we are optimal along some direction, we will always be (and going along an old search direction will necessarily deteriorate the optimization).
• But… How do we get a set of conjugate directions to start with?

## The Conjugate Gradient (CG) algorithm

Some last tricks and a full description

Let's summarize again what we learned so far:

Given a set of conjugate directions \(\{d_i\}_{i=0}^{n-1},\) the algorithm of Conjugate Directions Descent terminates in at most \(n\) steps at the minimum. Our remaining problem is to obtain this set of conjugate directions. We learned that generating such a set in advance (from an initial set of proposal directions \(\{u_i\}_{i=0}^{n-1}\) that we turn into a conjugate set \(\{d_i\}_{i=0}^{n-1}\)) is too expensive. Then we argued that by a clever choice of “proposal directions” \(u_i\) on the fly (interspersed with Conjugate Directions Descent iterations), we might be able to cut down on computational complexity. We will see in this section that a really good choice for proposal directions are the residuals/directions of steepest descent, i.e.
\[ u_i = r_i.\]

The key will be to show that this proposal direction \(r_i\) is already conjugate to all previous search directions, with the exception of the last one, \(d_{i-1}.\) Then we can construct the new search direction \(d_i\) out of the “proposal direction” \(r_i\) just by making it conjugate to \(d_{i-1}.\)

With this method we can generate a set of conjugate directions \(d_i\) on the fly and simultaneously do Conjugate Directions Descent. This combination is called Conjugate Gradients (because we take the gradients and make them conjugate to previous search directions).

The last remaining key statement to prove is thus the following:

Lemma 4:

Consider the setting of Conjugate Directions. If the set of “proposal directions” is the set of residuals, i.e. \(u_i = r_i,\) then \(r_{i+1}\) is conjugate to all previous search directions with the exception of the last one, i.e.
\[r_{i+1} ~\bot_A ~ d_0, d_1, \ldots, d_{i-1},\]
for \(i=1,\ldots, {n-1}.\)

Also,
\[r_{i+1} ~\bot~ r_0, r_1, \ldots, r_i.\]

Proof:

Recall the definition \(\mathcal D_i := \operatorname{span}\{d_0,d_1,\ldots, d_{i-1}\}.\) The \(i\)th search direction \(d_i\) is constructed from the \(i\)th residual and all previous search directions, hence
\(d_i \in \operatorname{span}\{r_i, \mathcal D_{i}\}.\)
This means (note that \(d_0 = r_0\)):

• \(\operatorname{span}\{r_0\} = \operatorname{span}\{d_0\} = \color{green}{\mathcal D_1}\)
• \(\operatorname{span}\{r_0, r_1\} = \operatorname{span}\{{\color{green}{\mathcal D_1}}, r_1\}= \operatorname{span}\{\mathcal D_1, d_1\} = {\color{blue}{\mathcal D_2}}\)
• \(\operatorname{span}\{r_0, r_1, r_2\} = \operatorname{span}\{{\color{blue}{\mathcal D_2}}, r_2\}= \operatorname{span}\{\mathcal D_2, d_2\} = \mathcal D_3\)

and thus we can equivalently write
\[\mathcal D_i = \operatorname{span}\{r_0,r_1,\ldots, r_{i-1}\}.\]
We recall that by lemma 3, \(r_{i+1}\bot \mathcal D_{i+1},\) and this proves the claim
\[r_{i+1} ~\bot~ r_0, r_1, \ldots, r_i.\]
Now note that \(r_{i} = -Ae_{i} = -A(e_{i-1} + \alpha_{i-1}d_{i-1}) = r_{i-1} - \alpha_{i-1}Ad_{i-1}.\)

Thus, \(r_{i} \in\operatorname{span}\{\underbrace{r_{i-1}}_{\in \mathcal D_i}, \underbrace{Ad_{i-1}}_{\in A\mathcal D_i}\}. \)
This means that \(\mathcal D_{i+1} = \operatorname{span}\{\mathcal D_i, r_i\} = \operatorname{span}\{\mathcal D_i, A\mathcal D_i\},\) and recursively,
\[\mathcal D_{i+1} = \operatorname{span}\{d_0, Ad_0, A^2d_0, \ldots, A^{i}d_0\}.\]

Now we can prove the main statement: By lemma 3 and using the characterization of \(\mathcal D_{i+1}\) from above, \(r_{i+1} \bot \mathcal D_{i+1}=\operatorname{span}\{\mathcal D_i, A\mathcal D_i\}.\)
This means that, in particular, \(r_{i+1}\bot A\mathcal D_{i}.\)
In other words,
\[r_{i+1}~\bot_A \mathcal D_i,\]
or
\[r_{i+1} ~\bot_A ~ d_0,d_1,\ldots, d_{i-1}.\]
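Both statements of lemma 4 can be observed on a small system. The sketch below is an illustration under assumptions: the \(4\times 4\) matrix `A` and vector `b` are made up, and the update \(d_{i+1} = r_{i+1} + \beta_{i+1} d_i\) with \(\beta_{i+1} = r_{i+1}^Tr_{i+1}/r_i^Tr_i\) is the usual Conjugate Gradient rule, taken for granted here. It stores all residuals and search directions and measures the inner products the lemma claims to vanish.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

# Assumed symmetric positive definite test problem
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
n = len(b)

x = [0.0] * n
r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
d = list(r)                       # d_0 = r_0
residuals, directions = [r], [d]

for _ in range(n - 1):
    Ad = matvec(A, d)
    alpha = dot(d, r) / dot(d, Ad)
    x = [xi + alpha * di for xi, di in zip(x, d)]
    r_new = [ri - alpha * adi for ri, adi in zip(r, Ad)]    # r_{i+1} = r_i - alpha_i A d_i
    beta = dot(r_new, r_new) / dot(r, r)
    d = [rn + beta * di for rn, di in zip(r_new, d)]
    r = r_new
    residuals.append(r)
    directions.append(d)

# r_{i+1} is A-orthogonal to d_0, ..., d_{i-1}
max_conj = max(abs(dot(residuals[i + 1], matvec(A, directions[j])))
               for i in range(1, n - 1) for j in range(i))
# r_{i+1} is orthogonal to r_0, ..., r_i
max_orth = max(abs(dot(residuals[i + 1], residuals[j]))
               for i in range(n - 1) for j in range(i + 1))
# The excluded case: r_2 against the most recent direction d_1
last_dir = dot(residuals[2], matvec(A, directions[1]))

print("max |r_{i+1}^T A d_j|, j < i  :", max_conj)
print("max |r_{i+1}^T r_j|,   j <= i :", max_orth)
print("r_2^T A d_1 (not claimed zero):", last_dir)
```

The first two numbers should be at rounding level, while the last one is clearly nonzero, which is why one more conjugation step against \(d_i\) is still needed.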

So this means that the residual \(r_i\) is a really good proposal direction in point \(x_i.\) We will now derive the formula for generating a search direction \(d_i\) from it which is also conjugate to \(d_{i-1},\) in addition to being conjugate to all search directions before that, \(d_0, \ldots, d_{i-2}.\) This is done by Gram-Schmidt conjugation and basically consists of purely symbolic manipulation without additional big ideas (and you can safely skip it if you are not interested in the details).

Lemma 5:

Given a current position \(x_{i+1}\), its residual \(r_{i+1}\) being a proposal direction \(u_{i+1} = r_{i+1}\) which is conjugate to almost all previous search directions, i.e.
\[ u_{i+1} ~ \bot_A ~ d_0,d_1,\ldots, d_{i-1},\]
then by setting
\[d_{i+1} := r_{i+1} + \beta_{i+1}\cdot d_i,\quad \beta_{i+1} := \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i},\]
this new search direction is now conjugate to all previous search directions, i.e.
\[ d_{i+1} ~ \bot_A ~ d_0,d_1,\ldots, d_{i}.\]
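A single step of this construction can be checked on the two-dimensional example. The sketch assumes the example values `A`, `b` and \(x_0 = (-2,-2)\), starts with \(d_0 = r_0\), and compares the closed-form \(\beta_1\) from the lemma with the coefficient obtained directly from the conjugacy requirement \(d_0^TAd_1 = 0\), namely \(-r_1^TAd_0 / d_0^TAd_0\).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

A = [[3.0, 2.0], [2.0, 6.0]]   # assumed example matrix
b = [2.0, -8.0]                # assumed right-hand side
x0 = [-2.0, -2.0]

r0 = [bi - yi for bi, yi in zip(b, matvec(A, x0))]
d0 = list(r0)                                   # first direction: steepest descent
Ad0 = matvec(A, d0)
alpha0 = dot(d0, r0) / dot(d0, Ad0)
r1 = [ri - alpha0 * a for ri, a in zip(r0, Ad0)]   # r_1 = r_0 - alpha_0 A d_0

beta_conjugacy = -dot(r1, Ad0) / dot(d0, Ad0)   # from requiring d_0^T A d_1 = 0
beta_closed = dot(r1, r1) / dot(r0, r0)         # closed form from the lemma

d1 = [ri + beta_closed * di for ri, di in zip(r1, d0)]
# 2D cross product with (18, -13): zero means d_1 is parallel to that vector
cross = d1[0] * (-13.0) - d1[1] * 18.0

print("beta (conjugacy)  :", beta_conjugacy)
print("beta (closed form):", beta_closed)
print("d_0^T A d_1       :", dot(d1, Ad0))
print("d_1               :", d1)
```

The two values of \(\beta_1\) agree, and the resulting \(d_1\) is a multiple of \((18, -13)\), the second direction used in the first numerical example above.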

Proof:

We have to modify \(u_{i+1}\) into a vector \(d_{i+1}\) such that \(d_{i+1}\) is conjugate to \(d_i.\)

The Gram-Schmidt conjugation procedure yields
\[d_{i+1} := r_{i+1} + \beta_{i+1}\cdot d_i,\quad \beta_{i+1} = -\frac{r_{i+1}^TAd_i}{d_i^TAd_i}.\]
Note that this indeed yields a vector which is conjugate to all search directions, even the older ones. The remaining proof consists of deriving the explicit formula for \(\beta_{i+1}.\)

We recall (from the proof of lemma 4)
\[r_{i+1} = r_i - \alpha_i Ad_i.\]
Multiplying this with \(r_{i+1}^T\) from the left gives
\[r_{i+1}^Tr_{i+1} = r_{i+1}^Tr_i - \alpha_i r_{i+1}^TAd_i.\]
But \(r_i^Tr_{i+1}=0\) by the second statement of lemma 4 and so