% Script 1 for Week 6: Least Squares Regression
%
% The Theory: Suppose M is ANY kxn matrix and y is a kx1 target column and we
% need to find the nx1 vector a that minimizes the length of M*a-y. This is a
% vector that minimizes the dot product E(a) = (M*a-y)'*(M*a-y).
%     E(a) = a'*M'*M*a - 2*y'*M*a + y'*y.
% If we are at a minimum the partial derivative of E with respect to each of
% its n "slots" must be 0, so we have necessary conditions for a minimum: for
% each i, writing e_i = [0 ... 1 ... 0]' for the ith standard basis column,
%     dE/da_i = 0 = e_i'*M'*M*a + a'*M'*M*e_i - 2*y'*M*e_i.
% Since M'*M is symmetric the first two terms are equal, so the ith row (i.e.
% the ith entry, since it is a column) of M'*y must equal the ith row of
% M'*M*a for every i. That is, those two column matrices are equal at a
% minimum. So a = inv(M'*M)*(M'*y).
%
% The backslash operation A\z is roughly the same as inv(A)*z when A is a
% square matrix and z is a column, so the operation A\z may fail if A is
% square and the columns of A are dependent. If M is NOT square but the
% columns of M are independent then M\y returns the least-squares solution,
% which is inv(M'*M)*(M'*y) but calculated by a very efficient method. In the
% applications below M is never square because the number of constants we are
% looking for will always be fewer than the number of data points, and the
% columns of M are unlikely to be dependent.
% The point of this paragraph: in our applications you may usually (always?)
% use M\y in place of inv(M'*M)*(M'*y) if you wish. And you should.
%
% Below is our first application: fit noisy data to a polynomial of specified
% degree. We start out with a situation where we know the "answer".

stdevofnoise=2;
clearvars x y M
x=sort(10*rand(20,1)-5);
% We have created a random vector of length 20 uniformly selected from [-5,5].
y = -3*sin(5*x-4) + 4*exp(x);
t=linspace(-5,5);
yexact=-3*sin(5*t-4) + 4*exp(t);
% Vector yexact contains the function f(x) = -3*sin(5*x-4) + 4*exp(x)
% evaluated on the vector t of domain values, so we know the exact y values
% given by the perfect-fit function.
ynoisy=y+stdevofnoise*randn(20,1);
% This adds a noise term, normally distributed with standard deviation
% stdevofnoise. It would correspond to measurement uncertainty or noise in the
% environment, and is intended to represent the data you actually have.
plot(t,yexact,x,y,'o',x,ynoisy,'*')

Mfunc=@(t) [ t^3 t^2 t 1]
% These particular functions could be replaced by ANY functions of ANY domain
% variable. The nature of the domain variable is irrelevant; only the function
% values at these domain values and the measured y values are used.
% Let me say this again (it should be in bold italic):
% You could have Mfunc=@(t) [ f1(t) f2(t) ... fL(t) ] and then we would be
% looking for a best fit of the form y=a1*f1(x)+...+aL*fL(x) for ANY functions
% f1,...,fL.
for j=1:20
    M(j,:)=Mfunc(x(j));
end
A=M\ynoisy
% A is the column of calculated coefficients that minimizes the sum of the
% squared errors in the y values.
% As noted above, the backslash operation may fail if the columns of M are
% dependent. That means either that some of the functions f1,...,fL are
% dependent on the others and therefore redundant or, by coincidence, that the
% data points taken were unable to reveal enough differences.
% That would have happened in our example here if we had taken data at only
% the three points -1, 0 and 1. Then matrix M would be
%     -1  1 -1  1
%      0  0  0  1
%      1  1  1  1
% Keeping the current data, you could just delete redundant columns and not
% use some of the functions. Alternatively (if the functions really are
% independent) you could take more/different data until the independence of
% the functions is revealed by the independence of the columns.
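% (Optional sketch, using the M and ynoisy built above:) one quick way to
% check whether the columns of M are independent, and to confirm that
% backslash agrees with the formula inv(M'*M)*(M'*y) derived in the theory
% section. The names Abackslash and Anormal are introduced here only for
% illustration.
if rank(M) < size(M,2)
    disp('Columns of M are dependent: drop a redundant function or take other data.')
else
    Abackslash = M\ynoisy;                 % least-squares coefficients via backslash
    Anormal = inv(M'*M)*(M'*ynoisy);       % the same coefficients from the normal equations
    max(abs(Abackslash - Anormal))         % should be tiny (round-off only)
end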
YA=polyval(A,x);
Error(1)=sum( (y-YA).^2 )
% This is a measure of the error: the sum of the individual squared errors.
% If you have several approximation methods, this is one way to compare them.
% Smaller is generally better.

% The easy way to get A for polynomial fitting:
A
Pmatlab=polyfit(x,ynoisy,3)'
CoefficientDiff=A-Pmatlab
t=linspace(min(x)-0.1,max(x)+0.1);
YA=polyval(A,t);
clf
plot(x,y,'o',t,YA)
hold on

% We can force a "better" fit by using a degree-19 polynomial, which will
% actually go through all 20 noisy data points.
Pbetter=polyfit(x,ynoisy,19)'
Ybetter=polyval(Pbetter,x);
Error(2)=sum( (y-Ybetter).^2 )
Ytbetter=polyval(Pbetter,t);
plot(t,Ytbetter,x,y,'o')
% Just because it reproduces the measured data better ... is it better? What
% will you actually be using the function you are creating FOR?

% Let's start again.
% Suppose from theoretical considerations and/or experience in the subject
% area you felt that your y values SHOULD BE noisy observations of a
% combination of functions of the form cos(5*x), sin(5*x), log(x.^2) and
% exp(x).
Mfunc=@(t) [ cos(5*t) sin(5*t) log(t.^2) exp(t)]
for j=1:20
    M(j,:)=Mfunc(x(j));
end
A=M\ynoisy
% Remember the function we are trying to "find" is y = -3*sin(5*x-4) + 4*exp(x).
Ysol=@(t) A(1)*cos(5*t)+A(2)*sin(5*t)+A(3)*log(t.^2)+A(4)*exp(t)
YB=Ysol(x);
YBt=Ysol(t);
Error(3)=sum( (y-YB).^2 )
clf
plot(x,y,'o',t,YBt)

% Nonlinear Least-Squares Regression
% It frequently happens that your experience in a subject area will prompt you
% to think measured data come from a function of a certain form but with
% unknown parameters, and NOT just from a linear combination of specific
% functions.
% Linearization by various means, such as taking logs, exponentials or
% reciprocals of the suspected function form, may still allow you to use the
% methods outlined above, but these methods have a major flaw. When, for
% instance, you take the log of data you are taking the log of
% true-function-value-plus-noise, and if the noise was normally distributed
% with a certain standard deviation those facts will not be true of the
% transformed data. From a statistical standpoint that is a problem.
% So we use the incredible power of MATLAB's fminsearch to get at the solution
% directly. If g is any function of x with unknown parameters a1,...,aL we
% look at f(a1,...,aL) equal to the sum of all (ynoisy-g(a1,...,aL,x)).^2.
% Remember that ynoisy is the vector of all measured data at the vector x of
% domain values, so both are constant for this problem. It is the a values we
% want.
% [a, fval]=fminsearch(f,a0) should do the trick, where a0 is the vector of
% "seed" or "guessed" parameter values.
% Remember that for our data we added normal noise to the function
% y = -3*sin(5*x-4) + 4*exp(x), and suppose we know by outside means that our
% function OUGHT to be of the form a(1)*sin(a(2)*x+a(3)) + a(4)*exp(a(5)*x).
f=@(a) sum( (ynoisy - ( a(1)*sin(a(2)*x+a(3)) + a(4)*exp(a(5)*x) )).^2 )
[a,fval]=fminsearch(f, [-4,4,-3,5,2])
h=@(x) a(1)*sin(a(2)*x+a(3)) + a(4)*exp(a(5)*x)
yh=h(x);
Error(4)=sum( (y-yh).^2 )
yht=h(t);
clf
plot(x,y,'o',t,yht)
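% (Optional sketch:) fminsearch only finds a LOCAL minimum, so the answer can
% depend heavily on the seed a0. One simple safeguard, sketched here with
% names (bestfval, abest, atrial, ftrial, seed) introduced only for
% illustration, is to try several jittered seeds and keep the best result.
% The objective f and the guessed seed are the ones defined above.
bestfval = Inf;
abest = [-4,4,-3,5,2];
for trial = 1:10
    seed = [-4,4,-3,5,2] + randn(1,5);       % perturb the guessed parameters
    [atrial, ftrial] = fminsearch(f, seed);
    if ftrial < bestfval
        bestfval = ftrial;
        abest = atrial;
    end
end
abest, bestfval    % best parameters and smallest squared error over the 10 restarts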
% Multiple Linear Regression
% If you have real values Y for data X in R^L rather than in R, finding the
% column matrix K and constant c for which Y=K'*X + c minimizes the error is
% called multiple linear regression. (Recall K'*X is a dot product.)
% (In the following the domain data in R^L are listed as rows! We want each
% row in the domain data matrix we create to correspond to a different data
% point, so we transpose the usual notation for R^L!)
% Suppose you have n explanatory (domain) data points, each listed as a row
% X(1,:),...,X(n,:) of matrix X, with n corresponding response values in the
% column Ynoisy. We are trying to find constants K1,...,KL and c so that
% K1*X(j,1)+...+KL*X(j,L)+c is as close as possible to Ynoisy(j) for every j.
% The general recipe (shown here as a template, since n, X and Ynoisy are not
% defined until the example below) is:
%     Mfunc=@(w) [ w 1]
%     for j=1:n
%         M(j,:)=Mfunc(X(j,:));
%     end
%     A=M\Ynoisy
% Mfunc produces a 1x(L+1) row vector; each column before the last is a
% coordinate of the row vector w. So M is an nx(L+1) matrix when there are n
% data points. (Note that with our convention about X, M is just X with a
% column of ones appended at the right end.) K is the first L entries of A,
% and the constant c is the last entry of A.
% If you insist that the solution should go through the origin (i.e. c=0) just
% let Mfunc=@(w) w instead of Mfunc=@(w) [ w 1], in which case M=X.

% Example: We create 11 random ordered pairs whose coordinates are in the low
% hundreds. These are to be the explanatory values, the domain variables.
X=50+100*rand(2,11)';
plot(X(:,1),X(:,2),'o')
w=@(x) (17*x(:,1)-5*x(:,2))
% So this plane will go through the origin.
% We create a column of noisy response variable values:
Ynoisy=w(X)+50*randn(11,1);
clf
plot3(X(:,1),X(:,2),Ynoisy,'o')
hold on
grid on
a=X\Ynoisy
% This vector of constants minimizes the squared-error function
fa=@(c) sum( (Ynoisy-c(1)*X(:,1)-c(2)*X(:,2) ).^2 )
fa(a)
[XT YT]=meshgrid(min(X(:,1)):10:max(X(:,1)),min(X(:,2)):10:max(X(:,2)));
Z=a(1)*XT+a(2)*YT;
surf(XT,YT,Z)

% If you don't insist on the through-the-origin condition, let
M=[X ones(11,1)]
b=M\Ynoisy
% This b minimizes the squared-error function
fb=@(c) sum( (Ynoisy-c(1)*X(:,1)-c(2)*X(:,2)-c(3) ).^2 )
fb(b)
% fb(b) will usually be slightly smaller than fa(a), but that does NOT
% necessarily mean a better fit! In fact, the more extraneous arbitrary
% constants you include, the smaller the squared error will likely be. But
% then you will have extra components in your answer that were not necessarily
% in the original function.
% USE ANY KNOWLEDGE YOU MAY HAVE OF THE PARTICULAR FORM OF THE UNDERLYING
% FUNCTION!
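% (Optional sketch:) the general recipe above, specialized to this 11-point
% example, to confirm that it reproduces b and to pull out K and c explicitly.
% The names Mgen, Agen, K and c are introduced here only for illustration.
Mfunc=@(w) [ w 1];
Mgen=zeros(11,3);
for j=1:11
    Mgen(j,:)=Mfunc(X(j,:));       % each row of X with a 1 appended
end
Agen=Mgen\Ynoisy;                  % should match b above
K=Agen(1:2)                        % coefficients of the two explanatory variables
c=Agen(3)                          % fitted constant term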