Gradient of a real-valued function of a symmetric matrix

Wednesday, July 22, 2020 • edited Tuesday, September 1, 2020

This page represents my first attempt to use the typesetting library named $\KaTeX$. I followed some advice found here.


I was intrigued by the details of the question "What is the derivative of $\log(\det(X))$ when $X$ is symmetric?" on math.stackexchange.com. The OP cites two different answers that do not agree.

One of the answers links to a recent preprint, entitled What is the gradient of a scalar function of a symmetric matrix?, by Shriram Srinivasan and Nishant Panda, post-doctoral researchers at Los Alamos National Lab.

I like this preprint. The authors ultimately prove that if $G$ is the value of the gradient of a real-valued function of an $n$-by-$n$ real matrix (evaluated at a symmetric matrix), then

$$ \textrm{sym}(G) = \frac{1}{2}(G + G^{\textsf{T}}) $$

is the gradient of the same function in the space of symmetric real matrices. They also show how to find the equivalent expressions when the matrix argument $X$ is replaced with its vectorization $\textrm{vec}(X)$. They even suggest a source of confusion that could have given rise to an incorrect but published formula ($G + G^{\textsf{T}} - G\odot I$, where $\odot$ is the Hadamard product) for the gradient on the symmetric subspace.

Basics

The appropriate inner product and norm are the Frobenius inner product and Frobenius norm, respectively:

  • $\left<A,B\right>_F = \textrm{tr}(A^{\mathsf{T}}B) = \sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}b_{ij}$
  • $||A||_F = \sqrt{\left<A,A\right>_F}$
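As a quick sanity check (my own NumPy sketch, not from the preprint), the trace form of the Frobenius inner product agrees with the elementwise double sum, and the induced norm matches NumPy's built-in `'fro'` norm:

```python
import numpy as np

# Frobenius inner product <A, B>_F = tr(A^T B) = sum_{i,j} a_ij * b_ij
def frob_inner(A, B):
    return np.trace(A.T @ B)

def frob_norm(A):
    return np.sqrt(frob_inner(A, A))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

assert np.isclose(frob_inner(A, B), np.sum(A * B))          # trace form == double sum
assert np.isclose(frob_norm(A), np.linalg.norm(A, 'fro'))   # matches NumPy's 'fro'
```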

If $\phi:\mathbb{R}^{n\times n}\to\mathbb{R}$ has a Fréchet derivative $D\phi(X)$ (with respect to the Frobenius norm) at $X$, where

$$ \lim_{||H||_F\to 0}\frac{||\phi(X + H) - \phi(X) - D\phi(X)[H]||_F}{||H||_F} = 0, $$

then the Riesz Representation Theorem shows that there is a gradient $\nabla\phi(X)$ at $X$ such that $\left<\nabla\phi(X),H\right>_F = D\phi(X)[H]$.

Note: Symmetry of $X$ does not guarantee symmetry of $\nabla\phi(X)$.
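A minimal illustration of that note (my own example): take $\phi(X) = x_{12}$, whose gradient over all of $\mathbb{R}^{n\times n}$ is the non-symmetric unit matrix $E_{12}$, even when evaluated at a symmetric $X$. A finite-difference estimate over the full matrix space confirms this:

```python
import numpy as np

# phi(X) = x_12: defined on all of R^{n x n}; its gradient there is E_12,
# which is not symmetric, even at a symmetric argument X.
def phi(X):
    return X[0, 1]

n = 3
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])   # symmetric

# finite-difference gradient over the full (unconstrained) matrix space
G = np.zeros((n, n))
h = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = h
        G[i, j] = (phi(X + E) - phi(X)) / h

assert not np.allclose(G, G.T)   # the gradient is not symmetric
```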

Restriction to symmetric matrices

Let $\phi_{\textrm{sym}} = \left.\phi\right|_{\mathbb{S}^{n\times n}}$, where $\mathbb{S}^{n\times n}$ is the space of real symmetric $n\times n$ matrices. If $\phi$ has a Fréchet derivative at $X\in\mathbb{S}^{n\times n}$, then $\phi_{\textrm{sym}}$ does, too. The Fréchet derivative of $\phi$ at $X$ ensures the existence of a gradient $\nabla\phi(X)$. Consider

$$ \left<\nabla\phi(X),H\right> = D\phi(X)[H] $$

for $H\in\mathbb{S}^{n\times n}$. Since $H$ is symmetric, the skew-symmetric part of $\nabla\phi(X)$ is orthogonal to $H$ in the Frobenius inner product, so

$$ \left<\nabla\phi(X),H\right> = \left<\textrm{sym}(\nabla\phi(X)),H\right>. $$

This shows that

$$ \nabla\phi_{\textrm{sym}} = \textrm{sym}(\nabla\phi). $$

In particular, the gradient of a real-valued function on $\mathbb{S}^{n\times n}$ is symmetric.

Vectors in place of matrices

The most interesting and novel parts of the preprint explain the rôles of the $\textrm{vec}$ operator, the map from a $\textrm{vec}$ value to a non-redundant vector for symmetric matrices, and the inverse of that map. The authors show that the map from a vector-valued gradient to the equivalent $\mathbb{S}^{n\times n}$-valued gradient has a non-obvious form.

Let's consider explicitly the $3\times 3$ case of the map from symmetric matrix $X\in\mathbb{S}^{3\times 3}$ to a vector $\textrm{vec}(X)\in\mathbb{R}^9$ with some redundant entries:

$$ \left(\begin{matrix} x_{11} & x_{12} & x_{13}\cr \textcolor{red}{x_{21}} & x_{22} & x_{23}\cr \textcolor{red}{x_{31}} & \textcolor{red}{x_{32}} & x_{33} \end{matrix}\right) \mapsto \left(\begin{matrix} x_{11}\cr x_{12}\cr x_{13}\cr \textcolor{red}{x_{21}}\cr x_{22}\cr x_{23}\cr \textcolor{red}{x_{31}}\cr \textcolor{red}{x_{32}}\cr x_{33} \end{matrix}\right) $$
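In code (my own NumPy sketch), this $\textrm{vec}$ is a row-major flattening; for a symmetric $X$ the row-major and column-major conventions coincide, since $X = X^{\textsf{T}}$:

```python
import numpy as np

# vec stacks the entries of X into a vector, matching the display above.
def vec(X):
    return X.reshape(-1)        # row-major flattening

X = np.array([[11., 12., 13.],
              [12., 22., 23.],
              [13., 23., 33.]])  # symmetric, so row-major == column-major vec

v = vec(X)
assert np.allclose(v, [11, 12, 13, 12, 22, 23, 13, 23, 33])
```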

The map that removes redundant entries from $\textrm{vec}(X)$ is $\mathsf{P}$, the so-called elimination map. $\mathsf{P}$ maps $\textrm{vec}(\mathbb{S}^{n\times n})$ onto $\mathbb{R}^{n(n+1)/2}$. In the $3\times 3$ case, $\mathsf{P}$ acts as follows:

$$ \left(\begin{matrix} x_{11}\cr x_{12}\cr x_{13}\cr \textcolor{red}{x_{21}}\cr x_{22}\cr x_{23}\cr \textcolor{red}{x_{31}}\cr \textcolor{red}{x_{32}}\cr x_{33} \end{matrix}\right) \mapsto \left(\begin{matrix} x_{11}\cr x_{12}\cr x_{13}\cr x_{22}\cr x_{23}\cr x_{33} \end{matrix}\right) $$
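Concretely, $\mathsf{P}$ can be realized as a $6\times 9$ selection matrix that keeps the upper-triangular entries of $\textrm{vec}(X)$ (a sketch under my own indexing conventions):

```python
import numpy as np

# Build the 6 x 9 elimination matrix P for n = 3: it keeps
# x_11, x_12, x_13, x_22, x_23, x_33 and drops the duplicates.
n = 3
keep = [i * n + j for i in range(n) for j in range(i, n)]  # upper-triangular positions in row-major vec
P = np.zeros((n * (n + 1) // 2, n * n))
for row, col in enumerate(keep):
    P[row, col] = 1.0

X = np.array([[11., 12., 13.],
              [12., 22., 23.],
              [13., 23., 33.]])
assert np.allclose(P @ X.reshape(-1), [11, 12, 13, 22, 23, 33])
```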

Restricted to $\textrm{vec}(\mathbb{S}^{n\times n})$, $\mathsf{P}$ is a bijection onto $\mathbb{R}^{n(n+1)/2}$, so it has an inverse. That inverse, $\mathsf{D}:\mathbb{R}^{n(n+1)/2}\to\textrm{vec}(\mathbb{S}^{n\times n})$, is called the duplication map.

Let's see $\mathsf{D}$ in action in the $3\times 3$ case. If $\mathbf{u}\in\mathbb{R}^6$, then $\mathsf{D}\mathbf{u}\in\textrm{vec}(\mathbb{S}^{3\times 3})$:

$$ \left(\begin{matrix}u_1\cr u_2\cr u_3\cr u_4\cr u_5\cr u_6\end{matrix}\right) \mapsto \left(\begin{matrix} u_{1}\cr u_{2}\cr u_{3}\cr \textcolor{red}{u_{2}}\cr u_{4}\cr u_{5}\cr \textcolor{red}{u_{3}}\cr \textcolor{red}{u_{5}}\cr u_{6} \end{matrix}\right) $$

$\mathsf{D}\mathbf{u}$ can then be mapped with $\textrm{mat}$ (the inverse of $\textrm{vec}$) to $\textrm{mat}(\mathsf{D}\mathbf{u})\in\mathbb{S}^{3\times 3}$:

$$ \left(\begin{matrix} u_{1}\cr u_{2}\cr u_{3}\cr \textcolor{red}{u_{2}}\cr u_{4}\cr u_{5}\cr \textcolor{red}{u_{3}}\cr \textcolor{red}{u_{5}}\cr u_{6} \end{matrix}\right) \mapsto \left(\begin{matrix} u_{1} & u_{2} & u_{3}\cr \textcolor{red}{u_{2}} & u_{4} & u_{5}\cr \textcolor{red}{u_{3}} & \textcolor{red}{u_{5}} & u_{6} \end{matrix}\right) $$

In the $3\times 3$ case, $\mathsf{D}:\mathbb{R}^6\to\textrm{vec}(\mathbb{S}^{3\times 3})$ is

$$ \mathsf{D} = \left(\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0\cr 0 & 1 & 0 & 0 & 0 & 0\cr 0 & 0 & 1 & 0 & 0 & 0\cr 0 & \textcolor{red}{1} & 0 & 0 & 0 & 0\cr 0 & 0 & 0 & 1 & 0 & 0\cr 0 & 0 & 0 & 0 & 1 & 0\cr 0 & 0 & \textcolor{red}{1} & 0 & 0 & 0\cr 0 & 0 & 0 & 0 & \textcolor{red}{1} & 0\cr 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}\right). $$
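The same matrix can be built programmatically (my own sketch; the helper `reduced_index` maps an entry $(i,j)$ to its position in the non-redundant vector):

```python
import numpy as np

# Build the 9 x 6 duplication matrix D for n = 3.
n = 3
m = n * (n + 1) // 2

def reduced_index(i, j):
    i, j = min(i, j), max(i, j)          # use the upper-triangular copy
    return i * n + j - i * (i + 1) // 2  # position in the reduced vector

D = np.zeros((n * n, m))
for i in range(n):
    for j in range(n):
        D[i * n + j, reduced_index(i, j)] = 1.0

# D duplicates u_2, u_3, u_5 exactly as in the display above
u = np.array([1., 2., 3., 4., 5., 6.])
assert np.allclose(D @ u, [1, 2, 3, 2, 4, 5, 3, 5, 6])
```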

While $\mathsf{D}$ can map a vector in $\mathbb{R}^{n(n+1)/2}$ to a symmetric $n\times n$ matrix, it is not the proper way to map the gradient of an $\mathbb{R}$-valued function on $\mathbb{R}^{n(n+1)/2}$ to the matrix-valued gradient of an $\mathbb{R}$-valued function on $\mathbb{S}^{n\times n}$.

WARNING: If one has a function $f:\mathbb{S}^{n\times n}\to\mathbb{R}$ and another function $\tilde{f}:\mathbb{R}^{n(n+1)/2}\to\mathbb{R}$ such that

$\tilde{f}(\mathsf{P}\textrm{vec}(X)) = f(X)$,

then the gradient $\nabla\tilde{f}$ can be represented as a vector in $\mathbb{R}^{n(n+1)/2}$, but the map

$\nabla\tilde{f}(\mathsf{P}\textrm{vec}(X))\mapsto\mathsf{D}\nabla\tilde{f}(\mathsf{P}\textrm{vec}(X))\mapsto\textrm{mat}(\mathsf{D}\nabla\tilde{f}(\mathsf{P}\textrm{vec}(X)))$

does not yield the derivative of $f$ with respect to its $\mathbb{S}^{n\times n}$ argument. That is one of the primary points of the preprint. The proper map is more complicated and appears below.
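To see the warning in action, here is my own worked example with $f(X) = \textrm{tr}(X^2)$ on $\mathbb{S}^{3\times 3}$. The true gradient (the symmetric $G$ with $\left<G,H\right>_F = Df(X)[H]$ for symmetric $H$) is $2X$, but naively duplicating the reduced gradient doubles the off-diagonal entries:

```python
import numpy as np

# f(X) = tr(X^2); on symmetric matrices the reduced function is
# f~(x) = x11^2 + x22^2 + x33^2 + 2(x12^2 + x13^2 + x23^2).
X = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 4.]])
x = np.array([X[0, 0], X[0, 1], X[0, 2], X[1, 1], X[1, 2], X[2, 2]])

# gradient of f~, computed by hand from the formula above
grad = np.array([2 * x[0], 4 * x[1], 4 * x[2], 2 * x[3], 4 * x[4], 2 * x[5]])

# naive map: duplicate the reduced gradient and reshape to a matrix
naive = np.array([[grad[0], grad[1], grad[2]],
                  [grad[1], grad[3], grad[4]],
                  [grad[2], grad[4], grad[5]]])

true_grad = 2 * X                          # satisfies <2X, H>_F = Df(X)[H]
assert not np.allclose(naive, true_grad)   # off-diagonals come out doubled
```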

Restriction to symmetric matrices, again

If $\phi_{\textrm{sym}} = \left.\phi\right|_{\mathbb{S}^{n\times n}}$, then there is a $\widehat{\phi}:\mathbb{R}^{n(n+1)/2}\to\mathbb{R}$ that "agrees" with $\phi_{\textrm{sym}}$. If $\mathbf{x}\in\mathbb{R}^{n(n + 1)/2}$, then $\textrm{mat}(\mathsf{D}\mathbf{x})\in\mathbb{S}^{n\times n}$, and we can define

$$ \widehat{\phi}(\mathbf{x}) = \phi_{\textrm{sym}}\left(\textrm{mat}(\mathsf{D}\mathbf{x})\right). $$

The big theorem is this: if $\nabla\widehat{\phi}(\mathbf{x})$ is the gradient of $\widehat{\phi}$ at $\mathbf{x}\in\mathbb{R}^{n(n+1)/2}$, and we want to find a symmetric matrix $G$ such that

$$ \left<G,H\right>_F = \left<\nabla\widehat{\phi}(\mathbf{x}),\mathsf{P}(\textrm{vec}(H))\right>_{\mathbb{R}^{n(n+1)/2}} $$

for all $H\in\mathbb{S}^{n\times n}$, then the symmetric matrix that satisfies this, i.e. the true gradient, is

$$ G = \textrm{mat}\left(\mathsf{D}(\mathsf{D}^{\mathsf{T}}\mathsf{D})^{-1}\nabla\hat{\phi}(\mathbf{x})\right). $$
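Continuing my $f(X) = \textrm{tr}(X^2)$ example from above (a sketch, not the preprint's code): the corrected map $\textrm{mat}(\mathsf{D}(\mathsf{D}^{\mathsf{T}}\mathsf{D})^{-1}\nabla\widehat{\phi})$ recovers the true gradient $2X$, because $\mathsf{D}^{\mathsf{T}}\mathsf{D}$ is diagonal with a $2$ in each off-diagonal slot:

```python
import numpy as np

n, m = 3, 6

def reduced_index(i, j):
    i, j = min(i, j), max(i, j)
    return i * n + j - i * (i + 1) // 2

# duplication matrix D (9 x 6)
D = np.zeros((n * n, m))
for i in range(n):
    for j in range(n):
        D[i * n + j, reduced_index(i, j)] = 1.0

X = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 4.]])
x = np.array([X[0, 0], X[0, 1], X[0, 2], X[1, 1], X[1, 2], X[2, 2]])

# gradient of phi_hat(x) = x11^2 + x22^2 + x33^2 + 2(x12^2 + x13^2 + x23^2)
grad = np.array([2 * x[0], 4 * x[1], 4 * x[2], 2 * x[3], 4 * x[4], 2 * x[5]])

# G = mat( D (D^T D)^{-1} grad ) -- the theorem's corrected map
G = (D @ np.linalg.solve(D.T @ D, grad)).reshape(n, n)
assert np.allclose(G, 2 * X)   # the true gradient on S^{3x3}
```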

Deriving $ \textrm{mat}(\mathsf{D}(\mathsf{D}^{\mathsf{T}}\mathsf{D})^{-1}\nabla\hat{\phi}(\mathbf{x}))$

How do we get to that formula? The definition of $\textrm{vec}$ ensures that

$$ \left<G,H\right>_F = \left<\textrm{vec}(G),\textrm{vec}(H)\right>_{\mathbb{R}^{n^2}}, $$

but it is crucial to understand that

$$ \left<\textrm{vec}(G),\textrm{vec}(H)\right>_{\mathbb{R}^{n^2}} \neq \left<\mathsf{P}(\textrm{vec}(G)),\mathsf{P}(\textrm{vec}(H))\right>_{\mathbb{R}^{n(n+1)/2}}. $$

Rather, for $G, H\in\mathbb{S}^{n\times n}$,

$$ \left<\mathsf{D}\mathsf{P}(\textrm{vec}(G)),\mathsf{D}\mathsf{P}(\textrm{vec}(H))\right>_{\mathbb{R}^{n^2}} = \left<\textrm{vec}(G),\textrm{vec}(H)\right>_{\mathbb{R}^{n^2}}. $$
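Both facts are easy to verify numerically (my own check, using the $\mathsf{P}$ and $\mathsf{D}$ constructions from earlier): for symmetric $G, H$, composing $\mathsf{D}\mathsf{P}$ leaves $\textrm{vec}$ values unchanged, while applying $\mathsf{P}$ alone drops the duplicate off-diagonal terms from the inner product.

```python
import numpy as np

n, m = 3, 6

# elimination matrix P (6 x 9)
keep = [i * n + j for i in range(n) for j in range(i, n)]
P = np.zeros((m, n * n))
for row, col in enumerate(keep):
    P[row, col] = 1.0

# duplication matrix D (9 x 6)
def reduced_index(i, j):
    i, j = min(i, j), max(i, j)
    return i * n + j - i * (i + 1) // 2

D = np.zeros((n * n, m))
for i in range(n):
    for j in range(n):
        D[i * n + j, reduced_index(i, j)] = 1.0

rng = np.random.default_rng(0)
A = rng.normal(size=(n, n)); G = A + A.T   # random symmetric matrices
B = rng.normal(size=(n, n)); H = B + B.T
g, h = G.reshape(-1), H.reshape(-1)

assert np.allclose((D @ P @ g) @ (D @ P @ h), g @ h)   # equality holds with D P
assert not np.allclose((P @ g) @ (P @ h), g @ h)       # P alone loses the duplicates
```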

Recall that we seek a $G\in\mathbb{S}^{n\times n}$ such that

$$ \left<G,H\right>_F = \left<\nabla\widehat{\phi}(\mathbf{x}),\mathsf{P}(\textrm{vec}(H))\right>_{\mathbb{R}^{n(n+1)/2}} $$

On the other hand,

$$\begin{matrix} \left<G,H\right>_F &=& \left<\mathsf{D}\mathsf{P}(\textrm{vec}(G)),\mathsf{D}\mathsf{P}(\textrm{vec}(H))\right>_{\mathbb{R}^{n^2}}\cr &=& \left<\mathsf{D}^{\textsf{T}}\mathsf{D}\mathsf{P}(\textrm{vec}(G)),\mathsf{P}(\textrm{vec}(H))\right>_{\mathbb{R}^{n(n+1)/2}} \end{matrix} $$

This shows that we want

$$ \mathsf{D}^{\textsf{T}}\mathsf{D}\mathsf{P}(\textrm{vec}(G)) = \nabla\widehat{\phi}(\mathbf{x}), $$

so, assuming that $G$ is symmetric,

$$ \begin{matrix} \mathsf{P}(\textrm{vec}(G)) &=& (\mathsf{D}^{\textsf{T}}\mathsf{D})^{-1}\nabla\widehat{\phi}(\mathbf{x})\cr \mathsf{D}\mathsf{P}(\textrm{vec}(G)) &=& \mathsf{D}(\mathsf{D}^{\textsf{T}}\mathsf{D})^{-1}\nabla\widehat{\phi}(\mathbf{x})\cr \textrm{vec}(G) &=& \mathsf{D}(\mathsf{D}^{\textsf{T}}\mathsf{D})^{-1}\nabla\widehat{\phi}(\mathbf{x})\cr \textrm{mat}(\textrm{vec}(G)) &=& \textrm{mat}(\mathsf{D}(\mathsf{D}^{\textsf{T}}\mathsf{D})^{-1}\nabla\widehat{\phi}(\mathbf{x}))\cr G&=& \textrm{mat}(\mathsf{D}(\mathsf{D}^{\textsf{T}}\mathsf{D})^{-1}\nabla\widehat{\phi}(\mathbf{x})) \end{matrix} $$


The bibliography cites several related questions on StackExchange.
