Why do we have to take out the mean before doing a ridge regression? Here's an illustration of why. Suppose we have the following observations:

    targ(y)    attribute(x)
    1002       1000001
     998        999999

The fit for this is obvious:

    y = -2000000 + 1000 + 2*x

(Run through the calculation using the regression formulas from class and check that regression gives this answer.) Here the weight on x is 2 and there's a bias offset between x and y.

When we do ridge regression we penalize the sum of the squares of the regression coefficients. Should we include the constant term or not?

    Penalty with the constant:    roughly 4e12
    Penalty without the constant: 4

Suppose we slightly reformulate the original problem by taking out the mean values from y and from x before we try the fit. Define

    yy = y - 1000
    xx = x - 1000000

Then the reformulated problem is:

    targ(yy)   attribute(xx)
     2          1
    -2         -1

Obviously yy = 2*xx. In this formulation the sum of squares of the xx coefficients is 4. Suppose we penalized the xx coefficient so that its square could be at most 2. Then the estimate would be

    yy = sqrt(2)*xx

Give some thought to what would happen in the original formulation if we only let the sum of squared coefficients be 2.

Remember that ridge regression alters the original regression problem by adding a penalty on the coefficients. The figure of merit becomes

    (sum of squared fit errors) + lambda*(sum of squared coefficients)

What's the asymptotic behavior as lambda increases? If we include the bias coefficient in the penalty, our estimate of y approaches zero. If we exclude the bias coefficient, our estimate of y approaches mean(y). Which of these makes more sense? Is y = 0 a more meaningful limit, or is y = 1000?
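Below is a minimal numerical sketch of this example in NumPy (not part of the original note). It recomputes the least-squares fit from the mean-removed data, checks the two penalty values, and compares the large-lambda behavior of ridge with and without the bias term in the penalty. The function names (ridge_penalizing_bias, ridge_excluding_bias) and the particular lambda values are chosen only for illustration.

    import numpy as np

    # The two observations from the example above.
    x = np.array([1000001.0, 999999.0])
    y = np.array([1002.0, 998.0])

    xbar, ybar = x.mean(), y.mean()      # 1000000 and 1000
    xx, yy = x - xbar, y - ybar          # centered data: xx = [1, -1], yy = [2, -2]

    # Ordinary least squares via the mean-removed formulas:
    # weight = sum(xx*yy)/sum(xx*xx), bias = mean(y) - weight*mean(x).
    w_ols = (xx @ yy) / (xx @ xx)        # -> 2.0
    b_ols = ybar - w_ols * xbar          # -> -1999000.0, i.e. y = -2000000 + 1000 + 2*x
    print("OLS fit: y = %.0f + %.1f*x" % (b_ols, w_ols))
    print("penalty with bias: %.3g   penalty without bias: %.3g"
          % (b_ols**2 + w_ols**2, w_ols**2))

    def ridge_penalizing_bias(lam):
        """Ridge with the bias included in the penalty: w = (X'X + lam*I)^-1 X'y."""
        X = np.column_stack([np.ones_like(x), x])
        w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
        return X @ w                     # fitted values at the training points

    def ridge_excluding_bias(lam):
        """Ridge on centered data; the bias term is left unpenalized."""
        w = (xx @ yy) / (xx @ xx + lam)  # shrunken weight on xx
        return ybar + w * xx             # fitted values at the training points

    # lam = 2*sqrt(2) - 2 shrinks the weight to exactly sqrt(2), matching the
    # "allow the squared coefficient to be 2" case discussed above.
    print("weight at lam = 2*sqrt(2)-2:",
          (xx @ yy) / (xx @ xx + 2 * np.sqrt(2) - 2))   # -> 1.414...

    for lam in (1e6, 1e12, 1e18):
        print("lambda=%.0e  penalize bias -> %10.3f   exclude bias -> %10.3f"
              % (lam, ridge_penalizing_bias(lam)[0], ridge_excluding_bias(lam)[0]))
    # As lambda grows, the "penalize bias" fit heads toward 0,
    # while the "exclude bias" fit heads toward mean(y) = 1000.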