Thursday, 12 November 2015

Open questions

I'll keep adding problems here for which I can't find a solution.
1) We want to estimate the mean of the normal distribution $N(\mu^*,1)$ given a sample of size $n$ from the distribution. An obvious choice is the sample mean. In this case the estimator, say $\overline{\theta}$, is unbiased, i.e. $E[\overline{\theta}] = \mu^*$, and the variance of the estimate is $\frac{1}{n}$. The question is: what are the mean and variance of the estimate based on the sample median? It seems that the estimate should be unbiased, but I don't know how to show it. No idea about the variance.
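By symmetry of the normal distribution the sampling distribution of the median is symmetric about $\mu^*$, which suggests unbiasedness, and the classical asymptotic result puts its variance at roughly $\frac{\pi}{2n}$, but I don't have a proof. Here is a minimal Python/NumPy sketch (all constants are illustrative choices, not part of the question) that checks both claims numerically:

```python
# A small simulation sketch (illustrative constants, not a proof): empirically
# compare bias and variance of the sample mean and the sample median for
# draws from N(mu_star, 1).
import numpy as np

rng = np.random.default_rng(0)
mu_star, n, trials = 2.0, 101, 20000   # odd n so the median is a single order statistic

samples = rng.normal(mu_star, 1.0, size=(trials, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("mean   : bias=%+.4f  var=%.5f  (1/n     = %.5f)" % (means.mean() - mu_star, means.var(), 1.0 / n))
print("median : bias=%+.4f  var=%.5f  (pi/(2n) = %.5f)" % (medians.mean() - mu_star, medians.var(), np.pi / (2 * n)))
```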

Sunday, 8 November 2015

Bias-Variance decomposition for linear regression

Consider the following learning scenario: we have a data set $\mathcal{D}$ with $n$ data points $(x_1,y_1),\ldots, (x_n,y_n)$ and a test point $x_t$. Let $h(\cdot)$ be the true hypothesis, so that $y = h(x)+\epsilon$, where $\epsilon$ is zero-mean noise with variance $\sigma^2_{\epsilon}$. The learnt hypothesis depends on $\mathcal{D}$ and is denoted $f_D$. As the dataset $\mathcal{D}$ and the test point $x_t$ are random variables, the generalization error for the squared loss is $$E_t\left[E_D \left[(f_D(x_t)-y_t)^2 \right] \right]$$ To separate out the noise part, add and subtract $h(x_t)$: $$\begin{align}E_t\left[E_D \left[(f_D(x_t)-h(x_t)+h(x_t)-y_t)^2 \right] \right] &= E_t\left[E_D \left[(y_t-h(x_t))^2\right] \right]\\&+E_t\left[E_D \left[(f_D(x_t)-h(x_t))^2\right] \right]\\&+2E_t\left[E_D \left[(f_D(x_t)-h(x_t))(y_t-h(x_t))\right] \right]\end{align}$$

Let's look at each of these terms individually. $E_t\left[E_D \left[(y_t-h(x_t))^2\right] \right]$ contains nothing that varies with $D$; since $y_t-h(x_t)=\epsilon_t$, this term is just $\sigma^2_{\epsilon}$. For the cross term, assuming the noise $\epsilon_t$ is independent of the data point $x_t$ and of the training set $\mathcal{D}$, $$E_t\left[E_D \left[(f_D(x_t)-h(x_t))(y_t-h(x_t))\right]\right] = E_t\left[E_D \left[f_D(x_t)-h(x_t)\right]E\left[\epsilon_t\right]\right] = 0$$ Finally, for the remaining term, $$\begin{align}E_D \left[(f_D(x_t)-h(x_t))^2\right] &= E_D \left[(f_D(x_t)-E_D[f_D(x_t)]+E_D[f_D(x_t)]-h(x_t))^2\right]\\ &= E_D \left[(f_D(x_t)-E_D[f_D(x_t)])^2\right] +(E_D\left[f_D(x_t)\right]-h(x_t))^2 \\ &= \text{Variance} + \text{Bias}^2 \end{align}$$

This is the decomposition for a given test point $x_t$. To decompose the mean squared error, we further need to take the expectation over $x_t$: the mean squared error is then the noise variance $\sigma^2_{\epsilon}$ plus the expected variance plus the expected squared bias, where the expectations are over $x_t$. A numerical sanity check of this decomposition is sketched below.
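To make the decomposition concrete, here is a small Python/NumPy sketch under assumed toy settings (a 1-D true hypothesis $h(x)=3x$, a deliberately over-penalized ridge estimator so the bias term is nonzero, and illustrative values of $n$, $\sigma_{\epsilon}$ and the penalty). It estimates the three terms over many training sets and checks that they add up to the generalization error at a fixed test point.

```python
# Sketch: empirically verify MSE ~= noise + bias^2 + variance at a fixed test
# point, using a heavily penalized 1-D ridge estimator (all constants are
# illustrative assumptions, not taken from the post).
import numpy as np

rng = np.random.default_rng(1)
w_true, lam, n, sigma, trials = 3.0, 5.0, 20, 1.0, 20000
x_t = 1.5                      # fixed test point
h_t = w_true * x_t             # true value h(x_t)

preds = np.empty(trials)
for i in range(trials):
    x = rng.normal(0.0, 1.0, n)
    y = w_true * x + rng.normal(0.0, sigma, n)
    w_hat = (x @ y) / (x @ x + lam)    # 1-D ridge solution (biased for lam > 0)
    preds[i] = w_hat * x_t             # f_D(x_t)

bias2 = (preds.mean() - h_t) ** 2
variance = preds.var()
y_t = h_t + rng.normal(0.0, sigma, trials)   # fresh noisy labels at x_t
mse = ((preds - y_t) ** 2).mean()            # E_D[(f_D(x_t) - y_t)^2]
print("noise + bias^2 + variance = %.4f" % (sigma**2 + bias2 + variance))
print("empirical MSE             = %.4f" % mse)
```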

So far, we haven't made use of the fact that our hypothesis class is linear. For training data $\mathcal{D}=(X,Y)$, where $X \in \mathbb{R}^{n\times d}$ is the input data and $Y\in \mathbb{R}^{n\times 1}$ are the labels, the learnt (least-squares) hypothesis is $f_D(x_t) = x_t^T(X^TX)^{-1}X^TY$ and the true hypothesis is $h(x) = x^Tw$, so that $Y = Xw + \epsilon$ with $\epsilon$ the vector of training noise. The expected value of our prediction over different learnt hypotheses is $$E_D \left[f_D(x_t)\right] =  E_D \left[x_t^T(X^TX)^{-1}X^TY\right] = E_D \left[x_t^T(X^TX)^{-1}X^T(Xw+\epsilon)\right]$$ $$= h(x_t) + E_D \left[x_t^T(X^TX)^{-1}X^T\epsilon\right]$$ Again, treating the noise as independent of the data (and zero mean), we have $$E_D \left[f_D(x_t)\right] = h(x_t)$$ so this estimate is unbiased. What about the variance? $$\begin{align}E_D \left[(f_D(x_t)-E_D[f_D(x_t)])^2\right] &= E_D \left[(f_D(x_t)-h(x_t))^2\right]\\
&= E_D[(x_t^T(X^TX)^{-1}X^T(Xw+\epsilon) - x_t^Tw)^2] \\
&= E_D[(x_t^T(X^TX)^{-1}X^T\epsilon)^2] \\
&= E_D[(x_t^T(X^TX)^{-1}X^T\epsilon)(x_t^T(X^TX)^{-1}X^T\epsilon)^T]\\
&= E_D[x_t^T(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-1}x_t]\\
&= E_D[tr(x_t^T(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-1}x_t)]\\
&= E_D[tr(X(X^TX)^{-1}x_tx_t^T(X^TX)^{-1}X^T\epsilon\epsilon^T)]\\
&= tr(E_D[X(X^TX)^{-1}x_tx_t^T(X^TX)^{-1}X^T]E_D[\epsilon\epsilon^T])\\
&= \sigma^2_{\epsilon}E_D[x_t^T(X^TX)^{-1}x_t]\end{align}$$ If we know the data source, we can compute this expectation further. Suppose each $x_i \sim \mathcal{N}(0,I_d)$, and likewise $x_t$; then $$\begin{align} \text{variance} &= \sigma_{\epsilon}^2 E_{D,t}[x_t^T(X^TX)^{-1}x_t]\\
&= \sigma^2_{\epsilon} E_{D,t}[tr(x_t^T(X^TX)^{-1}x_t)]\\ &= \sigma^2_{\epsilon} E_{D,t}[tr(x_tx_t^T(X^TX)^{-1})]\\ &= \sigma^2 _{\epsilon}tr(E_{t}[x_tx_t^T]E_{D}[(X^TX)^{-1}]) \\   &=  \frac{\sigma^2 _{\epsilon}d}{n} \end{align}$$

Here is the simplification used in the last step. Note that swapping the inverse and the expectation, $E_{D}[(X^TX)^{-1}] \approx (E_{D}[X^TX])^{-1}$, is an approximation: for standard normal inputs the exact value is $\frac{1}{n-d-1}I_d$, so the result is $\frac{d}{n}$ up to lower-order terms in $n$.
$$\begin{align} E_{t}[x_tx_t^T] &= I_d \\
E_{D}[(X^TX)^{-1}] &\approx \left(E_{D}[X^TX]\right)^{-1} \\
&=\left(E_{D}\left[\sum_{i=1}^{n} x_ix_i^T\right]\right)^{-1}\\
&=\left(\sum_{i=1}^{n} I_d\right)^{-1}\\
&=\frac{1}{n}I_d \\
tr\left(E_{t}[x_tx_t^T]\, E_{D}[(X^TX)^{-1}]\right) &\approx tr\left( I_d\cdot\frac{1}{n}I_d\right)\\
&=\frac{d}{n}
\end{align}$$
Source: www.cs.berkeley.edu/~jegonzal/talks/linear_regression.pdf
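As a sanity check on the $\frac{\sigma^2_{\epsilon} d}{n}$ result, here is a short Python/NumPy sketch with assumed values of $n$, $d$ and $\sigma_{\epsilon}$: it draws many training sets and test points with standard normal entries, fits ordinary least squares, and compares the empirical variance of $f_D(x_t)$ around $h(x_t)$ with $\sigma^2_{\epsilon} d / n$.

```python
# Sketch: check variance ~= sigma^2 * d / n for OLS with standard normal inputs
# (n, d, sigma and the trial count are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, trials = 200, 5, 1.0, 5000
w = rng.normal(size=d)                             # some fixed true weight vector

sq_dev = np.empty(trials)
for i in range(trials):
    X = rng.normal(size=(n, d))
    y = X @ w + rng.normal(0.0, sigma, n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares fit
    x_t = rng.normal(size=d)                       # test point from the same source
    sq_dev[i] = (x_t @ w_hat - x_t @ w) ** 2       # (f_D(x_t) - h(x_t))^2; OLS is unbiased here

print("empirical variance : %.5f" % sq_dev.mean())
print("sigma^2 * d / n    : %.5f" % (sigma**2 * d / n))
```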

Friday, 20 December 2013

Murphy's ML book

For the next few months, I'll be working on making kernel methods faster.
I was in the process of preparing an FAQ-style article on SVMs and the kernel trick, with emphasis on computational cost, when I came across Murphy's book on ML. I was amazed to find an entirely new perspective on SVMs, and some severe criticism of them, in this book. To understand the criticism, I had to delve into the Bayesian approach to ML, which is the approach the book takes throughout. I had done a course on Machine Learning earlier (CS771: Machine Learning: Tools and Techniques), but not much emphasis was given to the Bayesian approach in that course. I have also watched some of Yaser Abu-Mostafa's lectures on ML, and the Bayesian approach was not covered there either. This approach is new to me, and reading into it has led me to several fascinating concepts. My guide tells me that the community has also been taking a liking to Bayesian approaches of late. I am now reading about several kinds of vector machines: sparse vector machines, l1-regularized vector machines, l2-regularized vector machines, and relevance vector machines. There is a lot of talk about sparsity-promoting priors in Murphy's book. This whole prior business comes from the Bayesian outlook. I don't fully understand it yet, and I'll have to go through the first 14 chapters of Murphy's book before I can continue with my article on SVMs and accelerating kernel methods.