Consider the following learning scenario: we have a data set $\mathcal{D}$ with $n$ data points $(x_1,y_1),\ldots,(x_n,y_n)$ and a test point $x_t$. Let $h(\cdot)$ be the true hypothesis, i.e. $y = h(x)+\epsilon$, where $\epsilon$ is zero-mean noise. The learnt hypothesis depends on $\mathcal{D}$ and is denoted $f_D$. Since both the dataset $\mathcal{D}$ and the test point $x_t$ are random variables, the generalization error under the squared loss is: $$E_t\left[E_D \left[(f_D(x_t)-y_t)^2 \right] \right]$$ To separate out the noise, add and subtract $h(x_t)$: $$\begin{align}E_t\left[E_D \left[(f_D(x_t)-h(x_t)+h(x_t)-y_t)^2 \right] \right] &= E_t\left[E_D \left[(y_t-h(x_t))^2\right] \right]\\&+E_t\left[E_D \left[(f_D(x_t)-h(x_t))^2\right] \right]\\&-2E_t\left[E_D \left[(f_D(x_t)-h(x_t))(y_t-h(x_t))\right] \right]\end{align}$$ Let's look at each term individually. The first term, $E_t\left[E_D \left[(y_t-h(x_t))^2\right] \right]$, does not depend on $\mathcal{D}$; since $y_t-h(x_t)=\epsilon_t$, it equals $E[\epsilon_t^2]=\sigma^2_{\epsilon}$, the irreducible noise. Further, assuming the noise $\epsilon_t$ is independent of the data $\mathcal{D}$ and of the test point $x_t$, the cross term vanishes: $$E_t\left[E_D \left[(f_D(x_t)-h(x_t))(y_t-h(x_t))\right]\right] = E_t\left[E_D \left[f_D(x_t)-h(x_t)\right]\right]E\left[\epsilon_t\right] = 0$$ Finally, adding and subtracting $E_D[f_D(x_t)]$ in the remaining term, $$\begin{align}E_D \left[(f_D(x_t)-h(x_t))^2\right] &= E_D \left[(f_D(x_t)-E_D[f_D(x_t)]+E_D[f_D(x_t)]-h(x_t))^2\right]\\ &= E_D \left[(f_D(x_t)-E_D[f_D(x_t)])^2\right] +(E_D\left[f_D(x_t)\right]-h(x_t))^2 \\ &= \text{Variance} + \text{Bias}^2 \end{align}$$ where the cross term again vanishes because $E_D\left[f_D(x_t)-E_D[f_D(x_t)]\right]=0$.
This is the decomposition at a given test point $x_t$. Taking the expectation over $x_t$ as well, the mean squared error decomposes into the noise $\sigma^2_{\epsilon}$, the expected variance, and the expected squared bias.
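To make this concrete, here is a minimal Monte Carlo sketch of the decomposition at a fixed test point. The setup is an illustrative assumption, not from the source: the true hypothesis is linear, $h(x)=wx$, and the learner is a deliberately crude intercept-only predictor (it returns the mean of the training labels) so that the bias term is nonzero.

```python
import numpy as np

# Assumed setup for illustration: h(x) = w*x with Gaussian noise, and a biased
# learner that ignores x_t and predicts the mean of the training labels.
rng = np.random.default_rng(0)
n, sigma_eps, n_trials = 20, 0.5, 50_000
w, x_t = 2.0, 0.5
h_t = w * x_t                                   # true value h(x_t)

preds = np.empty(n_trials)
for i in range(n_trials):
    x = rng.uniform(-1.0, 1.0, size=n)          # fresh training set D
    y = w * x + sigma_eps * rng.standard_normal(n)
    preds[i] = y.mean()                         # f_D(x_t): intercept-only fit

y_t = h_t + sigma_eps * rng.standard_normal(n_trials)   # fresh test labels at x_t

mse      = np.mean((preds - y_t) ** 2)
noise    = sigma_eps ** 2
bias_sq  = (preds.mean() - h_t) ** 2
variance = preds.var()
print(f"MSE {mse:.3f} ~= noise {noise:.3f} + bias^2 {bias_sq:.3f} + var {variance:.3f}"
      f" = {noise + bias_sq + variance:.3f}")
```

Up to Monte Carlo error, the estimated MSE matches the sum of the three terms.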
So far, we haven't made use of the fact that our hypothesis class is linear. For training data $\mathcal{D}=(X,Y)$, where $X \in \mathbb{R}^{n\times d}$ is the input matrix and $Y\in \mathbb{R}^{n\times1}$ are the labels, the learnt (least-squares) hypothesis is $f_D(x_t) = x_t^T(X^TX)^{-1}X^TY$ and the true hypothesis is $h(x) = x^Tw$, so $Y = Xw+\epsilon$ with $\epsilon\in\mathbb{R}^{n\times1}$ the training noise vector. The expected prediction over different learnt hypotheses is $$E_D \left[f_D(x_t)\right] = E_D \left[x_t^T(X^TX)^{-1}X^TY\right] = E_D \left[x_t^T(X^TX)^{-1}X^T(Xw+\epsilon)\right]$$ $$= h(x_t) + E_D \left[x_t^T(X^TX)^{-1}X^T\epsilon\right]$$ Again, treating the noise as independent of the data (and zero-mean), the second term vanishes and $$E_D \left[f_D(x_t)\right] = h(x_t)$$ so the estimate is unbiased. What about the variance? $$\begin{align}E_D \left[(f_D(x_t)-E_D[f_D(x_t)])^2\right] &= E_D \left[(f_D(x_t)-h(x_t))^2\right]\\
&= E_D[(x_t^T(X^TX)^{-1}X^T(Xw+\epsilon) - x_t^Tw)^2] \\
&= E_D[(x_t^T(X^TX)^{-1}X^T\epsilon)^2] \\
&= E_D[(x_t^T(X^TX)^{-1}X^T\epsilon)(x_t^T(X^TX)^{-1}X^T\epsilon)^T]\\
&= E_D[x_t^T(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-1}x_t]\\
&= E_D[tr(x_t^T(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-1}x_t)]\\
&= E_D[tr(X(X^TX)^{-1}x_tx_t^T(X^TX)^{-1}X^T\epsilon\epsilon^T)]\\
&= tr(E_D[X(X^TX)^{-1}x_tx_t^T(X^TX)^{-1}X^T]E_D[\epsilon\epsilon^T])\\
&= \sigma^2_{\epsilon}\,E_D[tr(x_t^T(X^TX)^{-1}X^TX(X^TX)^{-1}x_t)] \quad (\text{since } E_D[\epsilon\epsilon^T]=\sigma^2_{\epsilon}I_n\text{; cycle the trace back})\\
&= \sigma^2_{\epsilon}E_D[x_t^T(X^TX)^{-1}x_t]\end{align}$$ If we know the data distribution, we can compute this expectation further. Suppose each $x_i \sim \mathcal{N}(0,I_d)$, with the test point drawn from the same distribution. Then $$\begin{align} \text{variance} &= \sigma_{\epsilon}^2 E_{D,t}[x_t^T(X^TX)^{-1}x_t]\\
&= \sigma^2_{\epsilon} E_{D,t}[tr(x_t^T(X^TX)^{-1}x_t)]\\ &= \sigma^2_{\epsilon} E_{D,t}[tr(x_tx_t^T(X^TX)^{-1})]\\ &= \sigma^2_{\epsilon}tr(E_{t}[x_tx_t^T]E_{D}[(X^TX)^{-1}]) \\ &\approx \frac{\sigma^2_{\epsilon}d}{n} \end{align}$$
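Before simplifying further, here is a small simulation sketch (assumed Gaussian design and weights, not from the source) of the two results just derived: the OLS prediction at a fixed test point is unbiased, and its variance over redraws of the training data matches $\sigma^2_{\epsilon}E_D[x_t^T(X^TX)^{-1}x_t]$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma_eps, n_trials = 50, 5, 0.3, 20_000
w = rng.standard_normal(d)                      # true weights (assumed)
x_t = rng.standard_normal(d)                    # one fixed test point
h_t = x_t @ w

preds = np.empty(n_trials)
quad  = np.empty(n_trials)                      # x_t^T (X^T X)^{-1} x_t per draw of D
for i in range(n_trials):
    X = rng.standard_normal((n, d))
    eps = sigma_eps * rng.standard_normal(n)
    w_hat = np.linalg.solve(X.T @ X, X.T @ (X @ w + eps))   # OLS fit on this D
    preds[i] = x_t @ w_hat
    quad[i] = x_t @ np.linalg.solve(X.T @ X, x_t)

print("mean prediction:", preds.mean(), "vs h(x_t):", h_t)   # unbiasedness
print("empirical variance:", preds.var(),
      "vs sigma^2 * E_D[x_t^T (X^T X)^{-1} x_t]:", sigma_eps**2 * quad.mean())
```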
The final step above, $tr\left(E_{t}[x_tx_t^T]\,E_{D}[(X^TX)^{-1}]\right) \approx \frac{d}{n}$, can be unpacked as follows:
$$\begin{align} E_{t}[x_tx_t^T] &= I_d \\
E_{D}[(X^TX)^{-1}] &\approx E_{D}[X^TX]^{-1} \\
&= \left(E_{D}\left[\sum_{i=1}^{n} x_ix_i^T\right]\right)^{-1}\\
&= \left(\sum_{i=1}^{n} I_d\right)^{-1}\\
&= \frac{1}{n}I_d \\
tr\left(E_{t}[x_tx_t^T]\, E_{D}[(X^TX)^{-1}]\right) &\approx tr\left(I_d \cdot \frac{1}{n}I_d\right)\\
&= \frac{d}{n}
\end{align}$$
Strictly speaking, $E_{D}[(X^TX)^{-1}] \neq E_{D}[X^TX]^{-1}$: for rows $x_i \sim \mathcal{N}(0,I_d)$ the exact value is $E_{D}[(X^TX)^{-1}] = \frac{1}{n-d-1}I_d$ (for $n > d+1$), so the expected variance is exactly $\frac{\sigma^2_{\epsilon}d}{n-d-1}$, and $\frac{\sigma^2_{\epsilon}d}{n}$ is its large-$n$ approximation.
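A quick numerical check of this approximation, under the same Gaussian-design assumption: estimate $E_D[(X^TX)^{-1}]$ by simulation and compare its trace against both $\frac{d}{n}$ and the exact $\frac{d}{n-d-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, n_trials = 50, 5, 20_000

# Monte Carlo estimate of E_D[(X^T X)^{-1}] for X with i.i.d. N(0, I_d) rows.
acc = np.zeros((d, d))
for _ in range(n_trials):
    X = rng.standard_normal((n, d))
    acc += np.linalg.inv(X.T @ X)
E_inv = acc / n_trials

# Since E_t[x_t x_t^T] = I_d, the expected variance factor is just tr(E_inv).
print("tr(E[(X^T X)^{-1}]):", np.trace(E_inv))
print("approximation d/n:  ", d / n)
print("exact d/(n-d-1):    ", d / (n - d - 1))
```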
Source: www.cs.berkeley.edu/~jegonzal/talks/linear_regression.pdf