Solving Linear Regression Problems Using Fréchet Derivatives

Posted: 2022-10-07 (Updated: 2022-10-09)
Suppose we have a linear regression problem where the input vectors are in \(\mathbb{R}^n\) and the output vectors are in \(\mathbb{R}^m\). Let \(N\) be the size of the dataset. For each sample \((x^i,\,y^i)\), we write \begin{gather*} x^i = \begin{pmatrix} x^i_1 \\ \vdots \\ x^i_n \end{pmatrix}, \quad y^i = \begin{pmatrix} y^i_1 \\ \vdots \\ y^i_m \end{pmatrix}, \end{gather*} and define \(X\) and \(Y\) as \begin{gather*} X = \begin{pmatrix} x^1 & \cdots & x^N \end{pmatrix} = \begin{pmatrix} x^1_1 & \cdots & x^N_1 \\ \vdots & \ddots & \vdots \\ x^1_n & \cdots & x^N_n \end{pmatrix}, \\ Y = \begin{pmatrix} y^1 & \cdots & y^N \end{pmatrix} = \begin{pmatrix} y^1_1 & \cdots & y^N_1 \\ \vdots & \ddots & \vdots \\ y^1_m & \cdots & y^N_m \end{pmatrix}. \end{gather*} Let \(W \in \mathbb{R}^{m \times n}\) be the matrix of coefficients: \begin{gather*} W = \begin{pmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mn} \end{pmatrix}. \end{gather*} The squared error \(E\) is \begin{align*} E(W) = \frac{1}{2} \| WX - Y \|^2, \end{align*} where \(\|\cdot\|\) is the Frobenius norm.
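Just to pin down the shapes, here is a minimal NumPy sketch (the sizes and random data are arbitrary illustrative choices, not part of the derivation) that assembles \(X\), \(Y\), and \(W\) as above and evaluates \(E(W)\):

```python
import numpy as np

# Illustrative sizes: inputs in R^n, outputs in R^m, N samples.
n, m, N = 3, 2, 100

rng = np.random.default_rng(0)
X = rng.normal(size=(n, N))   # columns are x^1, ..., x^N
Y = rng.normal(size=(m, N))   # columns are y^1, ..., y^N
W = rng.normal(size=(m, n))   # coefficient matrix

def squared_error(W, X, Y):
    """E(W) = (1/2) * ||W X - Y||_F^2, with the Frobenius norm."""
    R = W @ X - Y
    return 0.5 * np.sum(R * R)

print(squared_error(W, X, Y))
```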

If we change \(W\) along a direction \(H\), \(E\) changes as follows (here \(\langle \cdot,\, \cdot \rangle\) denotes the Frobenius inner product, \(\langle A,\, B \rangle = \operatorname{tr}(A B^\top)\)): \begin{align} E(W + H) - E(W) &= \frac{1}{2}{ \left\| (W + H)X - Y \right\| }^2 - \frac{1}{2} { \left\| WX - Y \right\| }^2 \notag \\ &= \frac{1}{2} \langle WX - Y + HX,\, WX - Y + HX \rangle - \frac{1}{2} { \left\| WX - Y \right\| }^2 \notag \\ &= \frac{1}{2} {\| WX - Y \|}^2 + \langle WX - Y,\, HX \rangle + \frac{1}{2} {\| HX \|}^2 - \frac{1}{2} {\| WX - Y \|}^2 \notag \\ &= \operatorname{tr} \left( (WX - Y) (HX)^\top \right) + \frac{1}{2} {\|HX\|}^2 \notag \\ &= \operatorname{tr} \left( (WX - Y) X^\top H^\top \right) + \frac{1}{2} {\|HX\|}^2 \notag \\ &= \langle (WX - Y)X^\top,\, H \rangle + \frac{1}{2}{\|HX\|}^2 \label{result} \end{align} Since the Frobenius norm is submultiplicative (by the Cauchy–Schwarz inequality), \(\|HX\|^2 \leq {\|H\|}^2 {\|X\|}^2\). Therefore, for any \(\varepsilon > 0\), if we take \(H\) so small that \(\|H\| \leq \displaystyle \frac{\varepsilon}{2 \|X\|^2 + 1}\), then \begin{align*} \left| E(W + H) - E(W) - \langle ( WX - Y)X^\top,\, H \rangle \right| \leq \varepsilon \|H\|. \end{align*} Hence the Fréchet derivative of \(E\) at an arbitrary point \(W\), evaluated along \(H\), is the first term of the right-hand side of \((\ref{result})\): \begin{align*} D_W E(W)(H) = \langle ( WX - Y )X^\top,\, H \rangle. \end{align*}

Since \(E\) is convex, it attains its global minimum at \(W\) if and only if \(D_W E(W)(H) = 0\) for every \(H\), which holds exactly when \begin{gather*} ( WX - Y )X^\top = 0, \end{gather*} that is, \begin{gather*} W XX^\top = YX^\top. \end{gather*} If \(XX^\top\) is invertible, \begin{gather*} W = Y X^\top (XX^\top)^{-1}. \end{gather*}
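As a numerical sanity check, the NumPy sketch below (again with made-up data and my own variable names) compares \(\langle (WX - Y)X^\top,\, H \rangle\) with a finite-difference approximation of the change in \(E\) along \(H\), and verifies that the closed-form \(W = Y X^\top (XX^\top)^{-1}\) makes the derivative vanish:

```python
import numpy as np

n, m, N = 3, 2, 100
rng = np.random.default_rng(0)
X = rng.normal(size=(n, N))
Y = rng.normal(size=(m, N))
W = rng.normal(size=(m, n))

def E(W):
    R = W @ X - Y
    return 0.5 * np.sum(R * R)

# The Fréchet derivative corresponds to the gradient (W X - Y) X^T.
grad = (W @ X - Y) @ X.T

# Finite-difference check along a random direction H:
# (E(W + tH) - E(W)) / t should approach <grad, H> as t -> 0.
H = rng.normal(size=(m, n))
t = 1e-6
print((E(W + t * H) - E(W)) / t)   # finite-difference estimate
print(np.sum(grad * H))            # <(W X - Y) X^T, H>

# Closed-form minimizer, assuming X X^T is invertible:
# W* = Y X^T (X X^T)^{-1}, computed via a linear solve instead of an explicit inverse.
W_star = np.linalg.solve(X @ X.T, X @ Y.T).T
print(np.linalg.norm((W_star @ X - Y) @ X.T))   # ~0: the derivative vanishes at W*
```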