Posted: Apr 7, 2021. Last updated: Jun 28, 2021.
Tags:

While trying to learn about the linear quadratic regulator (LQR) controller, I came across UC Berkeley’s course on deep reinforcement learning. Sadly, their lecture slides on model-based planning (Lec. 10 in the 2020 offering of CS285) are riddled with typos, equations cutoff from the slides, and dense notation. This post presents my own derivations of the LQR controller for discrete-time finite-horizon time-varying systems.

## Contents

$\newcommand{\Sym}{\mathbb{S}} % symmetric real matrices \newcommand{\zero}{\mathbf{0}} % zeros vector \DeclareMathOperator{\tr}{tr} % trace \newcommand{\sa}[1][]{\begin{bmatrix} s_{#1} \\ a_{#1} \end{bmatrix}} \newcommand{\sat}{[s^\top \ a^\top]}$

## Notation

Let $$\Sym^n$$ denote the set of symmetric matrices in $$\R^{n \times n}$$. Also, superscripts are interpreted after subscripts: $$M_x^\top$$ denotes $$(M_x)^\top$$, and $$M_x^{-1}$$ denotes $$(M_x)^{-1}$$.

## Discrete-time Time-Varying Finite-Horizon Linear System

Consider a discrete-time time-varying finite-horizon linear system:

• finite-horizon: $$t = 1, \dotsc, T$$
• states: $$s_t \in \R^n$$
• actions: $$a_t \in \R^m$$
• random noise: $$w_t \in \R^n$$, $$\E[w_t] = 0$$, each $$w_t$$ is independent
• The only assumption on the random vector $$w_t$$ is that it has zero mean. No other assumptions are made about its distribution.
• $$w_t$$ and $$w_k$$ need not be drawn from the same distribution, for $$t \neq k$$.
• dynamics: $$s_{t+1} = A_t s_t + B_t a_t + w_t$$ for some given matrices $$A_t \in \R^{n \times n}$$ and $$B_t \in \R^{n \times m}$$
• An equivalent formulation is $$s_{t+1} = F_t x_t + w_t$$ where $$F_t = \begin{bmatrix} A_t & B_t \end{bmatrix}$$ and $$x_t = \sa[t]$$.
• instantaneous cost function $$c_t(s,a): \R^n \times \R^m \to \R$$
• The cost functions $$c_t$$ for $$t = 1, \dotsc, T-1$$ are known as the stage cost.
• The cost $$c_T$$ at the final time step is called the terminal cost.

The overall goal, given a starting state $$s_0$$, is to find a sequence of actions $$a_t$$ to minimize the total expected cost

$J = \sum_{t=0}^T \E_{w_t} \left[ c_t(s_t, a_t) \right].$

For each time step $$t$$, the $$q$$-function and value function are defined as

\begin{aligned} q_t(s,a) &= \begin{cases} c_T(s,a) & t = T \\ \begin{aligned} & c_t(s,a) + \E_{w_t} \left[ \min_{a_{t+1}} q_{t+1}(s_{t+1}, a_{t+1}) \right] \\ & = c_t(s,a) + \E_{w_t} \left[ v_{t+1}(s_{t+1}) \right] \end{aligned} & t < T \\ \end{cases} \\ v_t(s) &= \min_a q_t(s,a) \end{aligned}

The value function is also known as the optimal cost-to-go, sometimes denoted as $$J_t^*(s)$$.

Thus, the minimum expected cost is

\begin{aligned} J^* &:= \min_{a_t:\, t=0,\dotsc,T} \sum_{t=0}^T \E_{w_t} \left[ c_t(s_t, a_t) \right] \\ &= \min_a q_0(s_0, a) \\ &= v_0^*(s_0). \end{aligned}

## LQR

LQR, short for “linear quadratic regulator,” refers to the optimal controller for a linear system with quadratic costs. This post analyzes the discrete-time finite-horizon case, although similar results hold for continuous-time systems and infinite time horizons as well.

The standard (canonical) LQR cost function is $$c_t(s,a) = s^\top Q_t s + a^\top R_t a$$ for some given matrices $$Q_t \in \Sym^n$$ and $$R_t \in \Sym^m$$ with $$Q_t \succeq 0$$ and $$R_t \succ 0$$.

In standard LQR, the minimum instantaneous cost is achieved by $$s, a = 0$$. Furthermore, this is a fixed point in the linear system dynamics. Therefore, the optimization objective is essentially to find a series of actions that moves the state closer and closer to the origin.

A more general case is

$c_t(s,a) = x^\top D_t x + d_t^\top x + \delta_t$

given $$\delta_t \in \R$$, $$D_t = \begin{bmatrix} Q_t & S_t^\top \\ S_t & R_t \end{bmatrix} \in \Sym^{n+m}$$ where $$D_t, Q_t \succeq 0$$ and $$R_t \succ 0$$, and $$d_t = \begin{bmatrix} d_{t,s} \\ d_{t,a} \end{bmatrix} \in \R^{n+m}$$.

Setting $$S_t = 0$$ and $$d_t = 0$$ yields the standard LQR cost function.

Here, $$A_t$$, $$B_t$$, $$D_t$$, and $$d_t$$ are assumed to be known for all $$t = 0, \dotsc, T$$. When these quantities are not known in real-world settings, they can usually be estimated by first collecting some state transitions from an arbitrary policy.

## Deriving the Optimal LQR Controller

The derivation of the optimal LQR controller is inductive. It relies on the following lemma, which shows that in a linear system with a quadratic cost function, the optimal action $$a^*$$ is a linear function of the state $$s$$, and the optimal cost function $$c^*$$ is a quadratic function of $$s$$.

Lemma 1: Consider the function $$c: \R^n \times \R^m \to \R$$ given by

$c(s,a) = \sat D \sa + d^\top \sa + \delta$

where $$\delta \in \R$$, $$D = \begin{bmatrix} Q & S^\top \\ S & R \end{bmatrix} \in \Sym^{n+m}$$ with $$D, Q \succeq 0$$ and $$R \succ 0$$, and $$d = \begin{bmatrix} d_s \\ d_a \end{bmatrix} \in \R^{n+m}$$. Then $$c(s,a)$$ is minimized by setting

$a^* = Ks + k$

where $$K = -R^{-1} S$$ and $$k = -\frac{1}{2} R^{-1} d_a$$. Plugging $$a^*$$ into the cost function yields the optimal cost function

$c^*(s) :=c(s, a^*) = s^\top Z s + z^\top s + \text{const}$

where $$Z = Q + S^\top K$$ is positive semidefinite, and $$z = 2 S^\top k + d_s$$. The term $$\text{const} = \delta - \frac{1}{4} d_a^\top R^{-1} d_a$$ is constant with respect to the state $$s$$ and action $$a$$.

Proof \begin{aligned} c(s,a) &= \sat D \sa + d^\top \sa + \delta \\ &= s^\top Q s + s^\top S^\top a + a^\top S s + a^\top R a + d_s^\top s + d_a^\top a + \delta \\ &= s^\top Q s + a^\top (S + S) s + a^\top R a + d_s^\top s + d_a^\top a + \delta \\ &= s^\top Q s + 2 a^\top S s + a^\top R a + d_s^\top s + d_a^\top a + \delta \\ \nabla_a c(s,a) &= 2 S s + 2R a + d_a \\ \nabla_a^2 c(s,a) &= 2 R \end{aligned}

Since $$R \succ 0$$, $$c(s,a)$$ is strictly convex in $$a$$ and is minimized by the unique value $$a^*$$ at which the gradient is 0. Setting $$\nabla_a c(s,a) = 0$$ and solving for $$a$$ yields

$a^* = -R^{-1} S s - \frac{1}{2} R^{-1} d_a = Ks + k$

where $$K = -R^{-1} S$$ and $$k = -\frac{1}{2} R^{-1} d_a$$. Plugging this into the cost function yields

\begin{aligned} c^*(s) &:= c(s, a^*) \\ &= s^\top Q s + 2 (Ks+k)^\top S s + (Ks+k)^\top R (Ks+k) + d_s^\top s + d_a^\top (Ks+k) + \delta \\ &= s^\top (Q + 2 K^\top S + K^\top R K) s + (2 S^\top k + 2 K^\top R k + K^\top d_a + d_s)^\top s + k^\top R k + d_a^\top k + \delta \\ &= s^\top Z s + z^\top s + \underbrace{k^\top R k + d_a^\top k + \delta}_{\text{constant w.r.t. s and a}} \end{aligned}

where

\begin{aligned} Z &= Q + 2 K^\top S + K^\top R K \\ &= Q + 2 K^\top S - K^\top R (R^{-1} S) \\ &= Q + 2 K^\top S - K^\top S \\ &= Q + K^\top S = Q - S^\top R^{-1} S \\ z &= 2 S^\top k + 2 K^\top R k + K^\top d_a + d_s \\ &= 2 S^\top k + (R^{-1} S)^\top R R^{-1} d_a - (R^{-1} S)^\top d_a + d_s \\ &= 2 S^\top k + S^\top R^{-1} d_a - S^\top R^{-1} d_a + d_s \\ &= 2 S^\top k + d_s. \end{aligned}

Note that $$Z \in \Sym^n$$ is symmetric because $$Q$$ and $$R^{-1}$$ are both symmetric. Furthermore, $$Z \succeq 0$$ because $$Z$$ is the Schur complement of $$R$$ with $$Q \succeq 0$$ and $$R \succ 0$$. (For more on Schur complements, see this blog post.)

The constant term can be simplified as

\begin{aligned} \text{const} &= \delta + k^\top R k + d_a^\top k \\ &= \delta + \frac{1}{4} (R^{-1} d_a)^\top R R^{-1} d_a - \frac{1}{2} d_a^\top R^{-1} d_a \\ &= \delta + \frac{1}{4} d_a^\top R^{-1} d_a - \frac{1}{2} d_a^\top R^{-1} d_a \\ &= \delta - \frac{1}{4} d_a^\top R^{-1} d_a. \end{aligned}

### Inductive Step

The inductive derivation of the LQR works backwards from the final time step $$t=T$$ down to $$t=0$$. Consider the inductive hypothesis that $$q_t$$ has the form

$q_t(s,a) = x^\top P_t x + p_t^\top x + \lambda_t$

for some vector $$p_t$$ and symmetric matrix $$P_t = \begin{bmatrix} P_{t,s} & P_{t,as}^\top \\ P_{t,as} & P_{t,a} \end{bmatrix}$$ with $$P_t, P_{t,s} \succeq 0$$ and $$P_{t,a} \succ 0$$. The scalar $$\lambda_t$$ is constant with respect to the state $$s$$ and action $$a$$.

Notation: For the rest of this section, the subscript $$t$$ is dropped from variables to reduce notational clutter. Every variable that is not written with a subscript is understood to have a subscript $$t$$. Variables with subscripts other than $$t$$ are written with the appropriate subscript.

The base case $$q_T(s,a) = c_T(s,a) = x^\top D_T x + d_T^\top x + \delta_T$$ clearly satisfies the inductive hypothesis with $$P_T = D_T$$, $$p_T = d_T$$, and $$\lambda_T = \delta_T$$.

For the inductive step, suppose that the hypothesis holds for some time step $$t+1$$. By Lemma 1, the optimal value function at time $$t+1$$ is

$v_{t+1}(s) = s^\top Z_{t+1} s + z_{t+1}^\top s + \text{const}$

where $$Z_{t+1} \succeq 0$$ and $$z_{t+1}$$ are functions of $$P_{t+1}$$ and $$p_{t+1}$$. The constant term is $$\text{const} = \lambda_{t+1} - \frac{1}{4} d_{t+1,a}^\top (P_{t+1,a})^{-1} d_{t+1,a}$$.

If $$s_{t+1} = A s + B a + w = F x + w$$, then

\begin{aligned} v_{t+1}(s_{t+1}) &= (F x + w)^\top Z_{t+1} (F x + w) + z_{t+1}^\top (F x + w) + \text{const} \\ &= x^\top F^\top Z_{t+1} F x + 2 w^\top Z_{t+1} F x + w^\top Z_{t+1} w + z_{t+1}^\top F x + w^\top z_{t+1} + \text{const} \\ \E_w\left[ v_{t+1}(s') \right] &= x^\top F^\top Z_{t+1} F x + \E_w[ w^\top Z_{t+1} w ] + z_{t+1}^\top F x + \text{const} \\ &= x^\top F^\top Z_{t+1} F x + z_{t+1}^\top F x + \text{const}_w \\ q_t(s,a) &= c_t(s,a) + \E_w \left[ \min_{a'} q_{t+1}(s', a') \right] \\ &= x^\top D x + d^\top x + \delta + \E_w \left[ v_{t+1}(s') \right] \\ &= x^\top (D + F^\top Z_{t+1} F) x + (d + F^\top z_{t+1})^\top x + \delta + \text{const}_w \\ &= x^\top P x + p^\top x + \lambda \end{aligned}

where $$P = D + F^\top Z_{t+1} F$$ and $$p = d + F^\top z_{t+1}$$. The first equation for $$\E_w[v_{t+1}(s')]$$ uses the assumption that $$\E[w] = 0$$; the second equality absorbs $$\E_w[w^\top Z_{t+1} w]$$ into $$\text{const}_w$$ because it is constant with respect to states and actions. The constant term is therefore

$\lambda = \delta + \text{const}_w = \delta + \E_w[ w^\top Z_{t+1} w ] + \lambda_{t+1} - \frac{1}{4} d_{t+1,a}^\top (P_{t+1,a})^{-1} d_{t+1,a}.$

Since $$D, Z_{t+1} \succeq 0$$, it follows that $$P \succeq 0$$. The block matrix form of $$P$$ is

\begin{aligned} P &= D + F^\top Z_{t+1} F \\ &= \begin{bmatrix} Q + A^\top Z_{t+1} A & S^\top + A^\top Z_{t+1} B \\ S + B^\top Z_{t+1} A & R + B^\top Z_{t+1} B \end{bmatrix} \\ &= \begin{bmatrix} P_s & P_{as}^\top \\ P_{as} & P_a \end{bmatrix} \end{aligned}

which implies $$P_a = R + B^\top Z_{t+1} B \succ 0$$ because $$R \succ 0$$ and $$Z_{t+1} \succeq 0$$.

This shows that $$q_t(s,a)$$ satisfies the inductive hypothesis, which gives a backwards recursion:

• For the final time step $$t = T$$:

\begin{aligned} P_T &= D_T & p_T &= d_T \\ K_T &= -R_T^{-1} S_T, & k_T &= -\frac{1}{2} R_T^{-1} d_{T,a} \\ Z_T &= Q_T + S_T^\top K_T, & z_T &= 2 S_T^\top k_T + d_{T,s} \end{aligned}
• For $$t = T-1, \dotsc, 0$$:

\begin{aligned} P &= D + F^\top Z_{t+1} F, & p &= d + F^\top z_{t+1} \\ K &= -P_a^{-1} P_{as}, & k &= -\frac{1}{2} P_a^{-1} p_a \\ Z &= P_s + P_{as}^\top K, & z &= 2 P_{as}^\top k + p_s \end{aligned}

This can be simplified further by defining $$Z_{T+1} = \zero_{n \times n}$$ and $$z_{T+1} = \zero_n$$. Then the equations above hold for all $$t = T, \dotsc, 0$$.

Additionally, it is common to expand out the recursion equations (which removes the need for explicitly constructing $$P$$ and $$p$$). In particular, the expression for $$Z$$ is known as the discrete-time Ricatti equation.

\begin{aligned} K &= -(R + B^\top Z_{t+1} B)^{-1} (S + B^\top Z_{t+1} A) \\ k &= -\frac{1}{2} (R + B^\top Z_{t+1} B)^{-1} (d_a + B^\top z_{t+1}) \\ Z &= Q + A^\top Z_{t+1} A - (S + B^\top Z_{t+1} A)^\top (R + B^\top Z_{t+1} B)^{-1} (S + B^\top Z_{t+1} A) \\ z &= -(S + B^\top Z_{t+1} A)^\top (R + B^\top Z_{t+1} B)^{-1} (d_a + B^\top z_{t+1}) + d_s + A^\top z_{t+1} \end{aligned}

Finally, to solve for actions $$a_t$$, apply forward recursion. For $$t = 0, \dotsc, T$$:

\begin{aligned} a_t &= K_t s_t + k_t \\ s_{t+1} &= A_t s_t + B_t a_t + w_t \end{aligned}

The optimal controller defined by the matrices $$K_t$$ and vectors $$k_t$$ is known as the linear quadratic regulator (LQR). Note that the optimal controller $$(K_t, k_t)$$ does not depend on the actual observed state $$s_t$$. As long as the future noise variables $$w_t$$ are independently distributed with mean 0, then the sequence of controllers $$\{(K_t, k_t)\}_{t=0}^T$$ is always optimal.

### Standard LQR

Consider the standard LQR setting where $$S = 0$$, $$d = 0$$, and $$\delta = 0$$, so the cost function at each step is simply

$c(s,a) = s^\top Q s + a^\top R a.$

(As in the previous section, the subscript $$t$$ is dropped to reduce notational clutter. All variables that are written without a subscript are understood to implicitly have a subscript $$t$$.)

For the inductive hypothesis, suppose that $$z_{t+1} = 0$$. The base case is satisfied with $$z_{T+1} = 0$$. Then the recursion vectors all become 0:

\begin{aligned} p &= d + F^\top z_{t+1} = 0 + F^\top 0 = 0 \\ k &= -\frac{1}{2} P_a^{-1} p_a = -\frac{1}{2} P_a^{-1} 0 = 0 \\ z &= 2 P_{as}^\top k + p_s = 2 P_{as}^\top 0 + 0 = 0 \end{aligned}

The recursion matrices simplify to

\begin{aligned} K &= -(R + B^\top Z_{t+1} B)^{-1} B^\top Z_{t+1} A \\ Z &= Q + A^\top Z_{t+1} A - (B^\top Z_{t+1} A)^\top (R + B^\top Z_{t+1} B)^{-1} B^\top Z_{t+1} A \end{aligned}

and the optimal action from any state $$s_t$$ at time $$t$$ is $$a_t = K_t s_t$$. In this case, the linear quadratic regulator is the set of matrices $$\{K_t\}_{t=0}^T$$.

Furthermore, if we assume $$\E[w w^\top] = \Sigma$$ (for example, $$w \sim \mathcal{N}(0, \Sigma)$$), then

$\E_w\left[ w^\top Z_{t+1} w \right] = \E_w\left[ \tr( w w^\top Z_{t+1} ) \right] = \tr( \E_w[ w w^\top ] Z_{t+1} ) = \tr(\Sigma Z_{t+1})$

so $$\lambda = \tr(\Sigma Z_{t+1}) + \lambda_{t+1}$$. Applying Lemma 1 to $$q_t(s,a)$$, the value function at each time step is precisely

$v_t(s) = s^\top Z s + \tr(\Sigma Z_{t+1}) + \lambda_{t+1},$

with base case $$\lambda_T = 0$$ and $$v_T(s) = c_T^*(s) = s^\top Q_T s$$.

If there is no randomness in the state transitions (i.e., $$w_t = 0$$ always), then $$\E_w\left[ w^\top Z_{t+1} w \right] = 0$$ so $$v_t(s) = s^\top Z s$$.

### Non-Symmetric Cost Matrices

In the setup above, it was assumed that $$D_t, Q_t \succeq 0$$ and $$R_t \succ 0$$. However, this requirement can in fact be generalized a bit. By Lemma 2 below, it suffices that

$D_t = \begin{bmatrix} Q_t & D_{sa} \\ D_{as} & R_t \end{bmatrix}$

satisfy $$D_t + D_t^\top \succeq 0$$, $$Q_t + Q_t^\top \succeq 0$$, and $$R_t + R_t^\top \succ 0$$.

Lemma 2: For any square matrix $$M \in \R^{n \times n}$$ and vector $$x \in \R^n$$,

$x^\top M x = x^\top \left[ \frac{1}{2} (M + M^\top) \right] x.$

In other words, $$x^\top M x$$ can always be replaced with an equivalent expression $$x^\top \hat{M} x$$ where $$\hat{M} = \frac{1}{2} (M + M^\top)$$ is symmetric.

Proof

Note that

$x^\top M x = (x^\top M x)^\top = x^\top (x^\top M)^\top = x^\top M^\top x.$

Therefore,

$x^\top \left[ \frac{1}{2} (M + M^\top) \right] x = \frac{1}{2} (x^\top M x + x^\top M^\top x) = x^\top M x.$

## Summary

Consider a discrete-time finite-horizon linear dynamical system

\begin{aligned} s_{t+1} &= A_t s_t + B_t a_t + w_t \\ \E[w_t] &= 0 \end{aligned}

where the goal is to choose actions $$a_t$$ ($$t=0, \dotsc, T$$) that minimize quadratic costs

$c_t(s,a) = \sat D_t \sa + d_t^\top \sa + \text{const}.$

Define the $$q$$-functions and value functions

\begin{aligned} q_t(s,a) &= \begin{cases} c_T(s,a) & t = T \\ \begin{aligned} & c_t(s,a) + \E_{w_t} \left[ \min_{a_{t+1}} q_{t+1}(s_{t+1}, a_{t+1}) \right] \\ & = c_t(s,a) + \E_{w_t} \left[ v_{t+1}(s_{t+1}) \right] \end{aligned} & t < T \\ \end{cases} \\ v_t(s) &= \min_a q_t(s,a). \end{aligned}

In the inductive step, I showed that if

$q_t(s,a) = \sat P_t \sa + p_t^\top \sa + \lambda_t$

for some matrix $$P_t$$, vector $$p_t$$, and scalar $$\lambda_t$$, then

$v_t(s) = s^\top Z_t s + z_t^\top s + \text{const}.$

In addition, $$q_{t-1}$$ and $$v_{t-1}$$ have the same form as $$q_t$$ and $$v_t$$, and I derived recursive formulas for $$P_t, p_t, Z_t, z_t$$ in terms of the given cost matrices for all $$t = T, \dotsc, 0$$. In particular, the equation for $$Z_t$$ is known as the Ricatti equation.

Finally, given $$P_t, p_t$$, I showed how to compute $$K_t, k_t$$ such that the optimal action from any state $$s_t$$ is

$a_t = K_t s_t + k_t.$

The set of controllers $$\{(K_t, k_t)\}_{t=0}^T$$ form the linear quadratic regulator (LQR). Importantly, the LQR is optimal from any state $$s$$ and does not depend on the noise distribution for the $$w_t$$. The optimal policy is always the same, as long as $$w_t$$ has zero mean!

## Discussion

### Linear Dynamics with Constant

In some settings, the system dynamics are expressed as

$s_{t+1} = A s_t + B a_t + C$

with an additional constant vector $$C \in \R^n$$. This can actually be re-written as a standard linear dynamical system as

$\begin{bmatrix} s_{t+1} \\ 1 \end{bmatrix} = \begin{bmatrix} A & C \\ \zero_n^\top & 1 \end{bmatrix} \begin{bmatrix} s_t \\ 1 \end{bmatrix} + \begin{bmatrix} B \\ \zero_m^\top \end{bmatrix} a_t.$

This reformulation is commonly used when linearizing nonlinear dynamical systems.

### Closed-Loop vs. Open-Loop

This post presented the discrete-time finite-horizon unconstrained LQR problem and provided a recursive (a.k.a. dynamic programming) approach to finding the optimal controller. Because the recursive controller

$a_t = K_t s_t + k_t$

depends on the actual state $$s_t$$ to decide the next action $$a_t$$, this controller is a “closed-loop” controller; the state influences the next action, creating a feedback loop.

However, it is also possible to formulate this unconstrained LQR problem as one-shot quadratic optimization problem, sometimes known as the “batch” or “least-squares” approach. This results in an “open-loop” controller, because the control actions $$a_t$$ for all time steps are treated as optimization variables and solved for simultaneously.

Sources

### Constrained LQR

The setting presented above assumes that all states $$s_t \in \R^n$$ and $$a_t \in \R^m$$ are allowed. However, many real-world control problems have constraints on the states and/or actions:

$\forall t:\ s_t \in \mathcal{S},\ a_t \in \mathcal{A}.$

With state and input constraints, Lemma 1 no longer applies, so the recursion equations no longer hold.

If the state and input constraints are polytopes, then the constrained LQR problem can still be formulated as a quadratic optimization problem. However, the dynamic programming approach is more involved.

Sources

Linearization and Differential Dynamic Programming (DDP)

Let $$s_{t+1} = f_t(s,a)$$ express the system dynamics. LQR works when $$f$$ is a linear function. If $$f$$ is a differentiable and continuous nonlinear function, it can be approximated by a Taylor series expansion. Linearization requires choosing a point $$(s,a)$$ around which to linearize $$f_t$$.

In some problems, if the initial state and a desired ending state are close, then choosing either of those points (or something in between) may suffice. For example, in the standard LQR setting, the desired end state is typically the origin, and often the objective is to simply keep the system centered around the origin.

In general, though, it may be unclear what point to choose for linearization. One strategy is simply to choose random points $$(s_t^{(0)}, a_t^{(0)})$$ around which to linearize each $$f_t$$. After solving the LQR problem, then the optimal points $$(s_t^{(1)}, a_t^{(1)})$$ from the LQR solution can be used as the new points for linearization. This process can be repeated until some stopping criterion (e.g., a maximum number of iterations, or convergence).

However, linearization generally does not result in cost nor stability guarantees.

LQR can be further generalized to the partially observable Markov decision process (POMDP) setting. Instead of observing the states $$s_t$$, suppose that we only observe a random vector $$y_t$$ whose conditional distribution depends on $$s_t$$. Specifically, the problem of minimizing a quadratic cost function

$J = \E\left[ \sum_{t=0}^T x_t^\top Q_t x_t + a_t^\top R_t a_t \right]$

with the system described by

\begin{aligned} s_{t+1} &= A_t s_t + B_t a_t + w_t & w_t &\sim \mathcal{N}(0, W_t) \\ y_t &= C_t s_t + v_t & v_t &\sim \mathcal{N}(0, V_t) \end{aligned}

is called the linear-quadratic-Gaussian (LQG) control problem. In this problem, only the $$y_t$$ variables are observed, even though the underlying system dynamics and cost function are in terms of the unobserved state vectors $$s_t$$.

As with vanilla LQR, there exists a recursive derivation of the optimal LQG controller, not too different from the standard LQR controller.

Continuous and infinite-horizon

Even though I only provided derivation for the finite-horizon discrete-time setting, the results are similar for continuous-time and infinite-horizon settings. Essentially, the sum in the cost function is replaced by an integral, and the discrete-time algebraic Ricatti equation is replaced by a continuous-time Ricatti differential equation.

## References

• UC Berkeley CS285 (2020), Lecture 10
• the original slides (sadly riddled with many typos) which inspired this blog post
• presents the general LQR setting without noise
• Stanford CS229 (2018) Lecture Notes
• covers standard LQR with Gaussian noise, linearization of dynamics, DDP, and a simplified version of the linear quadratic Gaussian problem
• Stanford EE363 (2008-09) Lecture Notes
• Lecture 1 presents dynamic programming and least-squares solutions to the standard LQR problem without noise
• Lecture 5 presents the standard LQR problem with Gaussian noise
• Caltech CS159 (2020-21), Lecture 2
• presents the standard LQR setting without noise
• includes batch approach to solving constrained LQR and discussion on linearization of nonlinear systems
• Predictive Control for Linear and Hybrid Systems by Borelli et al. (2016). Available here.
• all-around good reference
• covers the standard LQR setting with and without noise and goes in-depth into both batch and recursive controllers for constrained and nonlinear problems