Probability, Least Squares Fit

Overview

Let a function f map Rⁿ into R. In other words, f takes n real numbers and cranks out another real number y. We have reason to believe f is linear, but there are small perturbations, some random noise that gets in the way. So for any given data set, the relation might not be perfectly linear. Given a set of k data points, with their output y₁ through y_k, we would like to calculate the underlying linear relation g that lies at the heart of f.

If n = 1, we are merely graphing points in the xy plane and drawing a line through them. One point is insufficient, and two points determine a line. The problem doesn't get interesting until there are three or more data points. Draw a line through these points, with some points above and some points below. We want to find the best line, the closest fit.

If x is taken from Rⁿ, n+1 points determine a plane, and additional points imply a "best fit" plane, with deviations, i.e. some points lying above the plane and some points lying below. We want to select a plane g that minimizes this error.

Without any additional information about f, the best guess is the line, or plane, that minimizes the sum of the squares of the differences between g and f. Thus the procedure is called "least squares fit", because it minimizes the deviations squared.

Notation

Each point x_i is a point from Rⁿ, an input to the experiment, and y_i is the corresponding output. Since x is multidimensional, x_i,j is the j^th coordinate of the i^th input. Since y_i lives in R¹, we can simply leave that one alone.

Remember that g is the best-fit linear relation from x to y, according to our data. Let g = a₀ + a.x, where a is a vector of n coefficients a₁ through a_n, and x is any point in Rⁿ. But remember, a is what we are trying to find. The "coefficients" a₀ through a_n are the unknowns. We are given k data points, x₁ through x_k, and y₁ through y_k, and these values are treated as constants. In a game of roll reversal, a₀ through a_n are the variables. Let these variables slide about, until the sum of the squares of the deviations reaches a minimum.

Let d_i be the difference in the i^th data point. In other words, d_i = y_i - a₀ - a.x_i.

For notational convenience, prepend 1 to each vector x in the data set, whence both a and x are vectors of length n.

d_i = y_i - a.x_i { as i runs through the k points in our data set }

Critical Points

The error that we want to minimize is the sum of d_i². To find a local minimum, all the partials must be 0. For an arbitrary j between 0 and n, and for a data point i, evaluate the partial with respect to a_j. This is 2 times d_i times the partial of d_i with respect to a_j, or 2×d_i×x_i,j. We can forget about the 2, since partials are being set to 0.

∑{i} d_i × x_i,j = 0

When j = 0, and x_i,j = 1, the equation says the sum of the differences is 0. The data points are perfectly balanced above and below the plane.

Counting j = 0, There are n+1 partials, hence n+1 equations, in n+1 unknowns. Build a matrix to represent this simultaneous system of linear equations. We just described the j^th equation, which becomes the j^th row of our matrix M. Focus on the entry M_j,l. This is the coefficient on a_l in the above equation, and it is equal to the sum of x_i,j times x_i,l, as i runs from 1 to k. Let's put this another way. Run through all the points in your data set, multiply the j^th coordinate by the l^th coordinate, take the sum, and place the result in M_j,l.

Build a column of constants z, based on y. By the above equation, z_j is the sum over y_i×x_i,j. The output y is multiplied by the j^th coordinate of the input, and the sum becomes z_j. With this in mind, the following matrix equation establishes the critical points, i.e. the points with 0 partial derivatives. These points form a linear subspace within Rⁿ⁺¹.

M*a = z

Best Fit Line

As a warm up exercise, let n = 1. The points lie in the xy plane, and g is a line that approximates the points. The matrix M is {k,s} {s,t}, where s is the sum of the x coordinates (i.e. the inputs), and t is the sum of their squares. Let v be the average s/k. Thus s = kv. Let w_i = x_i - v. Thus the sum over w_i is 0. Write t as the sum of (w_i+v)². This becomes kv² plus the sum of w_i². The determinant is 0 iff the sum of w_i² is 0, iff each w_i is 0, iff x_i is constant. A typical experiment is not going to enter the same input k times over, hence M is nonsingular, and there is a unique solution. Let's generalize this to n dimensions.

Best Fit Plane, Existence and Uniqueness

Assume the k points in our experiment do not lie in a hyperplane within Rⁿ. Let b₀ through b_n be any nonzero vector in Rⁿ⁺¹. Take the dot product of each x_i with b. Suppose each dot product is 0, whence all the points lie in a plane perpendicular to b. Each x_i,0 is 1, so pull out b₀, and each x_i in our original data set lies in a plane. This is a contradiction, so some x_i.b is nonzero.

Let u be the sum of y_i², the error when a₀ through a_n are set to 0. Scale b by c, and consider the error of cb as a function of c. The i^th term is the square of x_i.cb-y_i. Add and regroup, and get c² times the sum of (x_i.b)², -2c times the sum of x_i.b×y_i, plus u. The coefficient on c² is nonzero, giving a nontrivial parabola, opening upward, and passing through error(0) = u. There is exactly one positive value of c that gives error(c) = u+1, and one negative value of c. In any direction b, there is one point with error(cb) = u+1.

Verify that the scaling factor c is a continuous function of the direction b. (This is intuitive, but not trivial to prove.) We have placed a level surface, homeomorphic to the sphere, about the origin, with error equal to u+1 on the surface. Since the error is continuous throughout the ball it attains a local minimum somewhere inside the ball, no larger than u. Let a be such a vector, having minimal error.

The critical points form a subspace, the solutions to n+1 simultaneous linear equations. This subspace includes the point a. If this subspace contains a line passing through a then the error must be constant along this line. Otherwise the mean value theorem shows the error is increasing at a point on the line, a directional derivative is nonzero, and the point is not critical. However, the line intersects the sphere, where the error is strictly larger. Therefore the subspace is a single point. The determinant of M is nonzero, and there is one local minimum, which is M^-1*z.

Following the aforementioned parabola, all points outside the sphere have error > u+1, hence a is the global minimum. It defines the optimal plane for the data. As long as there are n+1 independent data points in Rⁿ, linear algebra provides the coefficients a₀ through a_n that best fit the data.

Interpolate / Extrapolate

Given some other point x not in our data set, is g(x) a reliable predictor of f(x)? Entire books have been written on this subject. Unfortunately, there is almost no statistics on my website, at least not yet. The only thing I remember from my classes, some 25 years ago, is that the confidence in your guess decreases quickly as you stray farther from your original data set. (There is actually a formula for this.) In fact we have come up with two separate words to capture this idea. If x is between, or among, other points of data, then you are interpolating, and g(x) is an interpolation. You can be pretty sure f(x) will be close to g(x). However, if x lies outside the data, you are extrapolating, and g(x) is an extrapolation. There is little confidence in g(x). You can see this, geometrically, as small wiggles in the line g lead to large errors far away.

The risks of extrapolation apply to science in general. If you have smashed two marbles together, and two bowling balls together, you can sort of take the average (interpolate) and guess what would happen if you smashed two pool balls together. But you can't extrapolate down to electrons, where quantum mechanics may cause the particles to pass right through each other. Nor can you extrapolate up to planets and stars, where mutual gravity, rather than kinetic energy, rules the day.

Variations on the Coordinates

If theory predicts the output y is linear in log(x), take the log of each x_i, find the best fit line using the above procedure, and let g = a₀ + a₁×log(x). If you don't stray too far outside your original data, g(x) will be a good approximation to f(x).

This type of remapping can be done per coordinate when x lives in Rⁿ. In a rather contrived example, y might be linear in the log of the first coordinate, the cube of the second coordinate, and the cosine of the third. In practice, this does not come up very often. Usually y is linear in the inputs, as indicated by a quick glance at the scatter plot - and g, as computed above, is the most likely linear relationship.

Switch Variables

Sometimes x and y are two variables that are correlated. It is not clear whether y is a function of x, or x is a function of y. The relationship appears linear, yet the line, and the error, can be quite different under the two scenarios. Let's look at a somewhat contrived example.

Consider the three points 100,0 -100,0 and 0,1. When y is a function of x, the best-fit line is horizontal, floating just above the x axis, and the error is small. Now - leave the graph in place and turn your head sideways, so that x is a function of y. The three points are practically on top of each other, with a large vertical spread. The best-fit line is the y axis, perpendicular to the previous line, and the error is huge.

If you're not sure which variable is the input and which is the output, run the analysis both ways and see if one of the errors is smaller than the other. In practice, the lines will be almost identical, but the errors can differ substantially.