## Probability, Mean and Variance

### Mean and Variance

The "mean", or "average", or "expected value" is the weighted sum of all possible outcomes, each outcome weighted by its probability. The roll of two dice, for instance, has a mean of 7. Multiply 2 by 1/36, the probability of rolling a 2. Multiply 3 by 2/36, the probability of rolling a 3. Do this for all outcomes up to 12. Add them up, and the result is 7. Toss the dice 100 times and the sum of all those throws is going to be close to 700, i.e. 100 times the expected value of 7.

The mean need not be one of the possible outcomes. Toss one die, and the mean is 3.5, even though there is no single outcome with value 3.5. But toss the die 100 times and the sum of all those throws will be close to 350.
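The weighted sums for one die and for two dice can be checked with a few lines of Python. This is a minimal sketch, assuming fair six-sided dice; the names `mean`, `one_die`, and `two_dice` are made up for illustration, and exact fractions are used so no rounding creeps in.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Weighted sum: each outcome times its probability."""
    return sum(outcome * p for outcome, p in dist.items())

# One fair die: each face has probability 1/6.
one_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Two fair dice: tally the 36 equally likely pairs by their total.
two_dice = {}
for a, b in product(range(1, 7), repeat=2):
    two_dice[a + b] = two_dice.get(a + b, Fraction(0)) + Fraction(1, 36)

print(mean(one_die))    # 7/2
print(mean(two_dice))   # 7
```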

Given a continuous density function f(x), the expected value is the integral of x×f(x). This is the limit of the discrete weighted sum described above.

Let's consider a pathological example. Let f(x) = 1/x², from 1 to infinity. This is a valid density function with integral equal to 1. What is its expected value? Multiply by x to get 1/x, and integrate to get log(x). Evaluate log(x) at 1 and infinity, giving an infinite expected value. Whatever the outcome, you can expect larger outcomes in the future.
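To watch the expected value run away, cut the integral off at an upper limit b and let b grow. The sketch below uses a plain midpoint rule (my choice, not something the text prescribes) and compares the result with log(b); the function name `truncated_mean` and the step count are assumptions for illustration.

```python
import math

def truncated_mean(b, steps=100_000):
    """Midpoint-rule estimate of the integral of x * (1/x**2) from 1 to b."""
    width = (b - 1) / steps
    total = 0.0
    for i in range(steps):
        x = 1 + (i + 0.5) * width      # midpoint of the i-th slice
        total += x * (1 / x**2) * width
    return total

# The truncated mean tracks log(b), which grows without bound.
for b in (10, 100, 1000):
    print(b, round(truncated_mean(b), 3), round(math.log(b), 3))
```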

Add a constant c to each outcome, and you add c to the expected value. Prove this for discrete and continuous density functions.

Similarly, scale the output by a constant c, and the mean is multiplied by c. This is proved using integration by substitution.
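Both rules, adding a constant and scaling by a constant, are easy to spot-check on a discrete example. A small sketch, assuming a single fair die and an arbitrary constant c = 10 (both chosen only for illustration):

```python
from fractions import Fraction

def mean(dist):
    """Weighted sum of outcomes."""
    return sum(x * p for x, p in dist.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
c = 10

shifted = {x + c: p for x, p in die.items()}   # add c to each outcome
scaled = {x * c: p for x, p in die.items()}    # multiply each outcome by c

print(mean(die), mean(shifted), mean(scaled))   # 7/2 27/2 35
```

This is a check on one example, not a proof; the proof is still yours to write.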

The sum of two independent variables adds their means. This is intuitive, but takes a little effort to prove. If f and g are the density functions of x and y, then the density function for both variables is f(x)g(y). Multiply by x+y and take the integral over the xy plane. Treat it as two integrals:

∫∫{ f(x)g(y)x } + ∫∫{ f(x)g(y)y }

The first integral becomes the mean of x times 1, and the second becomes 1 times the mean of y. Hence the mean of the sum is the sum of the means.
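The discrete version of this argument can be run directly. The sketch below builds the distribution of x + y from the product of the individual probabilities, just as f(x)g(y) is the joint density above. The die and the two-valued "coin" are hypothetical examples, and `sum_distribution` is a name invented here.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Weighted sum of outcomes."""
    return sum(v * p for v, p in dist.items())

def sum_distribution(dist_x, dist_y):
    """Distribution of x + y when x and y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {0: Fraction(1, 2), 10: Fraction(1, 2)}   # hypothetical second variable

total = sum_distribution(die, coin)
print(mean(die), mean(coin), mean(total))   # 7/2 5 17/2
```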

### Arithmetic and Geometric Mean

The arithmetic mean is the mean, as described above. If all values are positive, the geometric mean is computed by taking logs, finding the arithmetic mean, and taking the exponential. If there are just a few values, the same thing can be accomplished by multiplying them together and taking the nth root. In the arithmetic mean, you add up and divide by n; in the geometric mean, you multiply up and take the nth root. The geometric mean of 21, 24, and 147 is 42.
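Here are the two recipes side by side, as a rough sketch. The function names are invented for illustration, and `math.prod` requires Python 3.8 or later. Both results agree with 42 up to floating-point rounding.

```python
import math

def geometric_mean_logs(values):
    """Take logs, average them, then exponentiate."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def geometric_mean_root(values):
    """Multiply everything together and take the nth root."""
    return math.prod(values) ** (1 / len(values))

data = [21, 24, 147]
print(geometric_mean_logs(data))
print(geometric_mean_root(data))
```

The log version is the safer choice for long lists, since the running product can overflow or underflow.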

The geometric mean is used when the log of a measurement is a better indicator (for whatever reason) than the measurement itself. If we wanted to find, for example, the "average" strength of a solar flare, we might use a geometric mean, because the strength can vary by orders of magnitude. Of course, scientists usually develop logarithmic scales for these phenomena, such as the Richter scale, the decibel scale, and so on. When logs are already implicit in the measurements we can return to the arithmetic mean.

### The Arithmetic Mean Exceeds the Geometric Mean

The average of 2, 5, 8, and 9 is 6, yet the geometric mean is about 5.18. The geometric mean never comes out larger, and it is strictly smaller unless all the values are equal.

Let f be a twice differentiable function that maps the reals, or an unbroken segment of the reals, into the reals. Let f′ be everywhere positive, and let f′′ be everywhere negative. Let g be the inverse of f.

Let s be a finite set of real numbers with mean m. Apply f to s, take the average, and apply g. The result is less than m, or equal to m if everything in s is equal to m. When f = log(x), the relationship between the geometric mean and the arithmetic mean is a simple corollary.

Shift f(x) up or down, so that f(m) = 0. Let v = f′(m). If x is a value in s less than m, and if f were a straight line with slope v, f(x) would be v×(x-m). Actually f(x) has to be smaller, else the mean value theorem implies a first derivative ≤ v, and a second derivative ≥ 0. On the other side, when x is greater than m, similar reasoning shows f(x) is less than v×(x-m). The entire curve lies below the line with slope v passing through (m,0).

If f were that line, f(s) would have a mean of 0. But for every x ≠ m, f(x) is smaller. This pulls the mean below 0, and when we apply f inverse, the result lies below m.

If f′′ is everywhere positive then the opposite is true; the mean of the image of s in f pulls back to a value greater than m.
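Both directions of the theorem can be tried numerically on the set 2, 5, 8, 9 from the start of this section. In the sketch below, log plays the role of f with negative second derivative, and x² (restricted to the positive reals, where its first derivative is positive) stands in for an f with positive second derivative. That second choice and the name `mean_through` are assumptions made for illustration.

```python
import math

def mean_through(f, g, values):
    """Apply f, average, then pull back through g (the inverse of f)."""
    return g(sum(f(v) for v in values) / len(values))

s = [2, 5, 8, 9]
m = sum(s) / len(s)   # arithmetic mean, 6.0

concave = mean_through(math.log, math.exp, s)          # f'' < 0: geometric mean
convex = mean_through(lambda x: x * x, math.sqrt, s)   # f'' > 0 on positive reals

# Expect concave < m < convex.
print(concave, m, convex)
```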

All this can be extended to the average of a continuous function h(x) from a to b. Choose Riemann nets with regular spacing, and apply the theorem to the resulting Riemann sums. As the spacing approaches 0, the average remains ≤ m, and in the limit, the average of f(h), pulled back through g, is no larger than the average of h.

If h is nonconstant the average through f comes out strictly smaller than the average of h. You'll need uniform continuity, which is assured by continuity across the closed interval [a,b]. The scaled Riemann sums approach the average of f(h), and after a while, the mean, according to each Riemann sum, can be bounded below f(m). I'll leave the details to you.

### Variance and Standard Deviation

If the mean of a random variable is m, the variance is the sum or integral of f(x)(x-m)². To illustrate, let m = 0. The variance is now the weighted sum of the outcomes squared. In other words, how far does the random variable stray from its mean? If the variance is 0, the outcome is always zero. Any nonzero outcome produces a positive variance.

Consider the example of throwing two dice. The average throw produces 7, so subtract 7 from everything. Ten times out of 36 you get a 6 or an 8, giving 10/36×1², or 10/36. Eight times out of 36 you get a 5 or a 9, so add in 8×4/36. Continue through all possible rolls. When you're done, the variance is 35/6.
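The full tally for two dice is tedious by hand but short in code. A minimal sketch, assuming fair dice and using exact fractions; the variable names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

# Probability of each total from 2 to 12.
dist = {}
for a, b in product(range(1, 7), repeat=2):
    dist[a + b] = dist.get(a + b, Fraction(0)) + Fraction(1, 36)

m = sum(x * p for x, p in dist.items())
variance = sum(p * (x - m) ** 2 for x, p in dist.items())
print(m, variance)   # 7 35/6
```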

Recall that (x-m)² = x²-2mx+m². This lets us compute both mean and variance in one pass, which is helpful if the data set is large. Add up f(x)×x, and f(x)×x². The former becomes m, the mean. The latter is almost the variance, but we must add m² times the sum of f(x) (which is m²), and subtract 2m times the sum of xf(x) (which becomes 2m²). Hence the variance is the sum of f(x)x², minus the square of the mean.

The above is also true for continuous variables. The variance is the integral of f(x)x², minus the square of the mean. The proof is really the same as above.
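For a data set, the one-pass idea looks something like the sketch below. It assumes every sample carries equal weight 1/n, so the result is the variance of the data itself. The function name `mean_and_variance` is invented for illustration.

```python
def mean_and_variance(samples):
    """Accumulate the count, the sum of x, and the sum of x squared in one pass."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in samples:
        n += 1
        total += x
        total_sq += x * x
    m = total / n
    # Mean of the squares, minus the square of the mean.
    return m, total_sq / n - m * m

print(mean_and_variance([2, 5, 8, 9]))   # (6.0, 7.5)
```

One caution: in floating point, subtracting two large, nearly equal numbers loses precision, so this formula can behave badly when the mean is huge compared with the spread.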

Variance is a bit troublesome however, because the units are wrong. Let a random variable indicate the height of a human being on earth. Height is measured in meters, and the mean, the average height of all people, is also measured in meters. Yet the variance, the variation of height about the mean, seems to be measured in meters squared. To compensate for this, the standard deviation is the square root of variance. Now we're back to meters again. If the average height is 1.7 meters, and the standard deviation is 0.3 meters, we can be pretty sure that a person, chosen at random, will be between 1.4 and 2.0 meters tall. How sure? We'll quantify that later on. For now, the standard deviation gives a rough measure of the spread of a random variable about its mean.

### The Variance of the Sum

We showed that the mean of the sum of two random variables is the sum of the individual means. What about variance?

Assume, without loss of generality, that mean(x) = mean(y) = 0. If x and y are independent, with density functions f and g, the individual variances are the integrals of f(x)x² and g(y)y², respectively. Taken together, the combined density function is f×g, and we want to know the variance of x+y. Consider the following double integral.

∫∫f(x)g(y)(x+y)² =

∫∫{ f(x)g(y)x² } + ∫∫{ 2f(x)g(y)xy } + ∫∫{ f(x)g(y)y² }

The first integral is the variance of x, and the third is the variance of y. The middle integral is twice the mean of x times the mean of y, or zero. Therefore the variance of the sum is the sum of the variances.
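The same fact can be checked on discrete variables, where the joint probabilities are products of the individual ones. A sketch, assuming a fair die and an independent ±1 coin flip as the two variables; all names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

def sum_distribution(dist_x, dist_y):
    """Distribution of x + y when x and y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {-1: Fraction(1, 2), 1: Fraction(1, 2)}

print(variance(die), variance(coin), variance(sum_distribution(die, coin)))
# 35/12 1 47/12
```

The die does not have mean 0, yet the variances still add, since shifting a variable by a constant leaves its variance alone.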

### Reverse Engineering

If a random variable has a mean of 0 and a variance of 1, what can we say about it? Not a whole lot. The outcome could be 0 most of the time, and on rare occasions, plus or minus a million. With the right (tiny) probabilities, that gives a mean of 0 and a variance of 1. But for all practical purposes the "random" variable is always 0. Alternatively, x could be ±1, like flipping a coin. This has mean 0 and variance 1, yet the outcome is never 0. Other functions produce values of 1/3, 0.737, sqrt(½), and so on. There's really no way to know.

We can however say something about the odds of finding |x| ≥ c, for c ≥ 1. Let |x| equal or exceed c with probability p. The area under the density curve, beyond c in either direction, is p. This portion of the curve contributes at least pc² to the variance. Since this cannot exceed 1, the probability of finding x at least c away from 0 is bounded by 1/c².

Generalize the above proof to a random variable with mean m and standard deviation s. If c is at least s, x is at least c away from m with probability at most s²/c².
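To see how loose or tight this bound is in practice, compare it with the exact tail probabilities for two dice (mean 7, variance 35/6). This is an illustrative sketch on one distribution, with invented names, not a proof.

```python
from fractions import Fraction
from itertools import product

dist = {}
for a, b in product(range(1, 7), repeat=2):
    dist[a + b] = dist.get(a + b, Fraction(0)) + Fraction(1, 36)

m = sum(x * p for x, p in dist.items())                # mean, 7
var = sum(p * (x - m) ** 2 for x, p in dist.items())   # s squared, 35/6

def tail_probability(c):
    """Probability that the total is at least c away from the mean."""
    return sum(p for x, p in dist.items() if abs(x - m) >= c)

# Each c here is larger than s, which is about 2.4.
for c in (3, 4, 5):
    print(c, tail_probability(c), var / c**2)
```

In each row the exact probability sits comfortably under the bound s²/c²; the bound holds for every distribution, so it is rarely sharp for any particular one.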

### The Mean is Your Best Guess

Let a random variable x have a density function f and a mean m. You would like to predict the value of x, in a manner that minimizes error. If your prediction is t, the error is defined as (x-t)², i.e. the square of the difference between your prediction and the actual outcome. What should you guess to minimize error?

The expected error is the integral of f(x)(x-t)², from -infinity to +infinity. Write this as three separate integrals:

error = ∫{ f(x)x² } - ∫{ 2f(x)xt } + ∫{ f(x)t² }

The first integral becomes a constant, i.e. it does not depend on t. The second term becomes -2mt, where m is the mean, and the third becomes t². This gives a quadratic in t. Find its minimum by setting its first derivative, 2t - 2m, equal to 0. Thus t = m, and the mean is your best guess. With t = m, the expected error is the variance of x.
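A brute-force search makes the same point on a discrete example. The sketch below assumes a single fair die and tries guesses from 1 to 6 in steps of 1/4; the grid and the names are arbitrary choices for illustration.

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_error(dist, t):
    """Expected squared error of guessing t."""
    return sum(p * (x - t) ** 2 for x, p in dist.items())

m = sum(x * p for x, p in die.items())                # mean, 7/2
candidates = [Fraction(k, 4) for k in range(4, 25)]   # 1, 5/4, ..., 6
best = min(candidates, key=lambda t: expected_error(die, t))

print(best, expected_error(die, best))   # 7/2 35/12
```

The winning guess is the mean, and the error it leaves behind, 35/12, is the variance of one die.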