Let a function f map Rn into R. In other words, f takes n real numbers and cranks out another real number y. We have reason to believe f is linear, but there are small perturbations, some random noise that gets in the way. So for any given data set, the relation might not be perfectly linear. Given a set of k data points, with their output y1 through yk, we would like to calculate the underlying linear relation g that lies at the heart of f.
If n = 1, we are merely graphing points in the xy plane and drawing a line through them. One point is insufficient, and two points determine a line. The problem doesn't get interesting until there are three or more data points. Draw a line through these points, with some points above and some points below. We want to find the best line, the closest fit.
If x is taken from Rn, n+1 points determine a plane, and additional points imply a "best fit" plane, with deviations, i.e. some points lying above the plane and some points lying below. We want to select a plane g that minimizes this error.
Without any additional information about f, the best guess is the line, or plane, that minimizes the sum of the squares of the differences between g and f. Thus the procedure is called "least squares fit", because it minimizes the deviations squared.
Each point xi is a point from Rn, an input to the experiment, and yi is the corresponding output. Since x is multidimensional, xi,j is the jth coordinate of the ith input. Since yi lives in R1, we can simply leave that one alone.
Remember that g is the best-fit linear relation from x to y, according to our data. Let g = a0 + a.x, where a is a vector of n coefficients a1 through an, and x is any point in Rn. But remember, a is what we are trying to find. The "coefficients" a0 through an are the unknowns. We are given k data points, x1 through xk, and y1 through yk, and these values are treated as constants. In a game of roll reversal, a0 through an are the variables. Let these variables slide about, until the sum of the squares of the deviations reaches a minimum.
Let di be the difference in the ith data point. In other words, di = yi - a0 - a.xi.
For notational convenience, prepend 1 to each vector x in the data set, whence both a and x are vectors of length n.
di = yi - a.xi { as i runs through the k points in our data set }
The error that we want to minimize is the sum of di2. To find a local minimum, all the partials must be 0. For an arbitrary j between 0 and n, and for a data point i, evaluate the partial with respect to aj. This is 2 times di times the partial of di with respect to aj, or 2×di×xi,j. We can forget about the 2, since partials are being set to 0.
∑{i} di × xi,j = 0
When j = 0, and xi,j = 1, the equation says the sum of the differences is 0. The data points are perfectly balanced above and below the plane.
Counting j = 0, There are n+1 partials, hence n+1 equations, in n+1 unknowns. Build a matrix to represent this simultaneous system of linear equations. We just described the jth equation, which becomes the jth row of our matrix M. Focus on the entry Mj,l. This is the coefficient on al in the above equation, and it is equal to the sum of xi,j times xi,l, as i runs from 1 to k. Let's put this another way. Run through all the points in your data set, multiply the jth coordinate by the lth coordinate, take the sum, and place the result in Mj,l.
Build a column of constants z, based on y. By the above equation, zj is the sum over yi×xi,j. The output y is multiplied by the jth coordinate of the input, and the sum becomes zj. With this in mind, the following matrix equation establishes the critical points, i.e. the points with 0 partial derivatives. These points form a linear subspace within Rn+1.
M*a = z
As a warm up exercise, let n = 1. The points lie in the xy plane, and g is a line that approximates the points. The matrix M is {k,s} {s,t}, where s is the sum of the x coordinates (i.e. the inputs), and t is the sum of their squares. Let v be the average s/k. Thus s = kv. Let wi = xi - v. Thus the sum over wi is 0. Write t as the sum of (wi+v)2. This becomes kv2 plus the sum of wi2. The determinant is 0 iff the sum of wi2 is 0, iff each wi is 0, iff xi is constant. A typical experiment is not going to enter the same input k times over, hence M is nonsingular, and there is a unique solution. Let's generalize this to n dimensions.
Assume the k points in our experiment do not lie in a hyperplane within Rn. Let b0 through bn be any nonzero vector in Rn+1. Take the dot product of each xi with b. Suppose each dot product is 0, whence all the points lie in a plane perpendicular to b. Each xi,0 is 1, so pull out b0, and each xi in our original data set lies in a plane. This is a contradiction, so some xi.b is nonzero.
Let u be the sum of yi2, the error when a0 through an are set to 0. Scale b by c, and consider the error of cb as a function of c. The ith term is the square of xi.cb-yi. Add and regroup, and get c2 times the sum of (xi.b)2, -2c times the sum of xi.b×yi, plus u. The coefficient on c2 is nonzero, giving a nontrivial parabola, opening upward, and passing through error(0) = u. There is exactly one positive value of c that gives error(c) = u+1, and one negative value of c. In any direction b, there is one point with error(cb) = u+1.
Verify that the scaling factor c is a continuous function of the direction b. (This is intuitive, but not trivial to prove.) We have placed a level surface, homeomorphic to the sphere, about the origin, with error equal to u+1 on the surface. Since the error is continuous throughout the ball it attains a local minimum somewhere inside the ball, no larger than u. Let a be such a vector, having minimal error.
The critical points form a subspace, the solutions to n+1 simultaneous linear equations. This subspace includes the point a. If this subspace contains a line passing through a then the error must be constant along this line. Otherwise the mean value theorem shows the error is increasing at a point on the line, a directional derivative is nonzero, and the point is not critical. However, the line intersects the sphere, where the error is strictly larger. Therefore the subspace is a single point. The determinant of M is nonzero, and there is one local minimum, which is M-1*z.
Following the aforementioned parabola, all points outside the sphere have error > u+1, hence a is the global minimum. It defines the optimal plane for the data. As long as there are n+1 independent data points in Rn, linear algebra provides the coefficients a0 through an that best fit the data.
Given some other point x not in our data set, is g(x) a reliable predictor of f(x)? Entire books have been written on this subject. Unfortunately, there is almost no statistics on my website, at least not yet. The only thing I remember from my classes, some 25 years ago, is that the confidence in your guess decreases quickly as you stray farther from your original data set. (There is actually a formula for this.) In fact we have come up with two separate words to capture this idea. If x is between, or among, other points of data, then you are interpolating, and g(x) is an interpolation. You can be pretty sure f(x) will be close to g(x). However, if x lies outside the data, you are extrapolating, and g(x) is an extrapolation. There is little confidence in g(x). You can see this, geometrically, as small wiggles in the line g lead to large errors far away.
The risks of extrapolation apply to science in general. If you have smashed two marbles together, and two bowling balls together, you can sort of take the average (interpolate) and guess what would happen if you smashed two pool balls together. But you can't extrapolate down to electrons, where quantum mechanics may cause the particles to pass right through each other. Nor can you extrapolate up to planets and stars, where mutual gravity, rather than kinetic energy, rules the day.
If theory predicts the output y is linear in log(x), take the log of each xi, find the best fit line using the above procedure, and let g = a0 + a1×log(x). If you don't stray too far outside your original data, g(x) will be a good approximation to f(x).
This type of remapping can be done per coordinate when x lives in Rn. In a rather contrived example, y might be linear in the log of the first coordinate, the cube of the second coordinate, and the cosine of the third. In practice, this does not come up very often. Usually y is linear in the inputs, as indicated by a quick glance at the scatter plot - and g, as computed above, is the most likely linear relationship.
Sometimes x and y are two variables that are correlated. It is not clear whether y is a function of x, or x is a function of y. The relationship appears linear, yet the line, and the error, can be quite different under the two scenarios. Let's look at a somewhat contrived example.
Consider the three points 100,0 -100,0 and 0,1. When y is a function of x, the best-fit line is horizontal, floating just above the x axis, and the error is small. Now - leave the graph in place and turn your head sideways, so that x is a function of y. The three points are practically on top of each other, with a large vertical spread. The best-fit line is the y axis, perpendicular to the previous line, and the error is huge.
If you're not sure which variable is the input and which is the output, run the analysis both ways and see if one of the errors is smaller than the other. In practice, the lines will be almost identical, but the errors can differ substantially.