The mean need not be one of the possible outcomes. Toss one die, and the mean is 3.5, even though there is no single outcome with value 3.5. But toss the die 100 times and the sum of all those throws will be close to 350.
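Here's a quick Python sketch, just to watch this happen; the numbers drift a bit from run to run, but the sum lands near 350.

    import random

    # A minimal sketch of the claim above: no single throw can equal 3.5,
    # yet 100 throws typically sum to something near 350.
    rolls = [random.randint(1, 6) for _ in range(100)]
    print(sum(rolls))          # usually in the neighborhood of 350
    print(sum(rolls) / 100)    # usually close to 3.5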
Given a continuous density function f(x), the expected value is the integral of x×f(x). This is the limit of the discrete weighted sum described above.
Let's consider a pathological example. Let f(x) = 1/x², from 1 to infinity. This is a valid density function with integral equal to 1. What is its expected value? Multiply by x to get 1/x, and integrate to get log(x). Evaluate log(x) at 1 and infinity, giving an infinite expected value. Whatever the outcome, you can expect larger outcomes in the future.
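If you'd like to watch the divergence numerically, here is a rough Python sketch. It samples from this density by inverting the CDF, which is 1 - 1/x, and the running average keeps climbing instead of settling down.

    import random

    # A hedged numerical sketch of the pathological density f(x) = 1/x² on [1, infinity).
    # The CDF is 1 - 1/x, so drawing u uniformly from [0, 1) and returning 1/(1 - u)
    # samples from f.  The running mean keeps drifting upward, reflecting the
    # infinite expected value.
    random.seed(0)
    total = 0.0
    for n in range(1, 1_000_001):
        total += 1.0 / (1.0 - random.random())
        if n in (10, 1_000, 100_000, 1_000_000):
            print(n, total / n)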
Add a constant c to each outcome, and you add c to the expected value. Prove this for discrete and continuous density functions.
Similarly, scale each outcome by a constant c, and the mean is multiplied by c. This is proved using integration by substitution.
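Here is a quick numerical illustration of both rules (not a proof), using die throws.

    import random

    # A minimal numeric illustration: adding a constant c shifts the mean by c,
    # and scaling by c multiplies the mean by c.
    random.seed(1)
    rolls = [random.randint(1, 6) for _ in range(100_000)]
    mean = lambda v: sum(v) / len(v)
    c = 10
    print(mean(rolls))                    # about 3.5
    print(mean([x + c for x in rolls]))   # about 13.5
    print(mean([x * c for x in rolls]))   # about 35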
The mean of the sum of two independent variables is the sum of their means. This is intuitive, but takes a little effort to prove. If f and g are the density functions of x and y, then the joint density function for both variables is f(x)g(y). Multiply by x+y and take the integral over the xy plane. Treat it as two integrals:
∫∫{ f(x)g(y)x } + ∫∫{ f(x)g(y)y }
The first integral becomes the mean of x times 1, and the second becomes 1 times the mean of y. Hence the mean of the sum is the sum of the means.
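A quick numerical check, using a die for x and a uniform draw on [0,1) for y:

    import random

    # A minimal numeric check: for two independent variables, the mean of the
    # sum matches the sum of the means.
    random.seed(2)
    n = 200_000
    xs = [random.randint(1, 6) for _ in range(n)]
    ys = [random.random() for _ in range(n)]
    mean = lambda v: sum(v) / len(v)
    print(mean(xs) + mean(ys))                    # about 3.5 + 0.5 = 4.0
    print(mean([x + y for x, y in zip(xs, ys)]))  # about 4.0 as well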
The geometric mean is used when the log of a measurement is a better indicator (for whatever reason) than the measurement itself. If we wanted to find, for example, the "average" strength of a solar flare, we might use a geometric mean, because the strength can vary by orders of magnitude. Of course, scientists usually develop logarithmic scales for these phenomena - such as the Richter scale, the decibel scale, and so on. When logs are already implicit in the measurements, we can return to the arithmetic mean.
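Here is a small sketch of the computation; the flare "strengths" are made-up numbers, chosen only to span several orders of magnitude.

    import math

    # A minimal sketch: the geometric mean is the arithmetic mean taken in log space.
    # The strengths below are hypothetical, spanning several orders of magnitude.
    strengths = [3.0, 40.0, 500.0, 7000.0]
    arith = sum(strengths) / len(strengths)
    geo = math.exp(sum(math.log(s) for s in strengths) / len(strengths))
    print(arith)  # dominated by the largest value
    print(geo)    # a more representative "typical" strength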
Let f be a differentiable function that maps the reals, or an unbroken segment of the reals, into the reals. Let f′ be everywhere positive, and let f′′ be everywhere negative. Let g be the inverse of f.
Let s be a finite set of real numbers with mean m. Apply f to s, take the average, and apply g. The result is less than m, or equal to m if everything in s is equal to m. When f = log(x), the relationship between the geometric mean and the arithmetic mean is a simple corollary.
Shift f(x) up or down, so that f(m) = 0. Let v = f′(m). If x is a value in s less than m, and if f were a straight line with slope v, f(x) would be v×(x-m). Actually f(x) has to be smaller; otherwise the mean value theorem gives a point between x and m where the first derivative is ≤ v, and then a point where the second derivative is ≥ 0, contradicting our assumptions. On the other side, when x is greater than m, similar reasoning shows f(x) is less than v×(x-m). The entire curve lies below the line with slope v passing through (m,0).
If f were a line, f(s) would have a mean of 0. But for every x ≠ m, f(x) is smaller. This pulls the mean below 0, and when we apply f inverse, the result lies below m.
If f′′ is everywhere positive then the opposite is true; the mean of the image of s in f pulls back to a value greater than m.
All this can be extended to the average of a continuous function h(x) from a to b. Choose Riemann nets with regular spacing, and apply the theorem to the resulting Riemann sums. As the spacing approaches 0, each pulled-back average remains ≤ the arithmetic average of its Riemann sum, and in the limit, the average of f(h), pulled back through g, is no larger than the average of h.
If h is nonconstant, the average through f comes out strictly smaller than the average of h. You'll need uniform continuity, which is assured by continuity across the closed interval [a,b]. The scaled Riemann sums approach the average of f(h), and after a while, the mean, according to each Riemann sum, can be bounded strictly below f(m). I'll leave the details to you.
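Here is a quick numerical check of the finite case, using f = log and g = exp.

    import math

    # A minimal numeric check of the theorem with f = log (f' > 0, f'' < 0) and
    # g = exp: averaging through f and pulling back through g never exceeds the
    # ordinary mean, with equality only when every value in s is the same.
    def pulled_back_mean(s, f=math.log, g=math.exp):
        return g(sum(f(x) for x in s) / len(s))

    s = [2.0, 3.0, 10.0, 25.0]
    m = sum(s) / len(s)
    print(pulled_back_mean(s), "<=", m)       # geometric mean <= arithmetic mean
    print(pulled_back_mean([7.0, 7.0, 7.0]))  # 7, up to rounding, when all values agree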
Consider the example of throwing two dice. The average throw produces 7, so subtract 7 from everything. Ten times out of 36 you get a 6 or an 8, giving 10/36 × 1², or 10/36. Eight times out of 36 you get a 5 or a 9, so add in 8/36 × 2², or 32/36. Continue through all possible rolls. When you're done, the variance is 35/6.
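You can let the computer grind through all 36 rolls; here is a short sketch using exact fractions.

    from fractions import Fraction
    from itertools import product

    # A minimal check of the arithmetic above: average the squared distance from 7
    # over all 36 equally likely rolls of two dice.
    variance = sum(Fraction((a + b - 7) ** 2, 36)
                   for a, b in product(range(1, 7), repeat=2))
    print(variance)  # 35/6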
Recall that (x-m)² = x² - 2mx + m². This lets us compute both mean and variance in one pass, which is helpful if the data set is large. Add up f(x)×x, and f(x)×x². The former becomes m, the mean. The latter is almost the variance, but we must add m² times the sum of f(x) (which is m²), and subtract 2m times the sum of xf(x) (which becomes 2m²). Hence the variance is the sum of f(x)x², minus the square of the mean.
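Here is a sketch of the one-pass computation, written for a finite set of weighted outcomes.

    # A one-pass sketch of the shortcut above: accumulate the weighted sums of x
    # and x², then recover the mean and the variance at the end.
    def mean_and_variance(weighted_outcomes):
        """weighted_outcomes: (probability, value) pairs whose probabilities sum to 1."""
        sum_x = sum_x2 = 0.0
        for p, x in weighted_outcomes:
            sum_x += p * x
            sum_x2 += p * x * x
        return sum_x, sum_x2 - sum_x * sum_x

    # One fair die: mean 3.5, variance 35/12.
    print(mean_and_variance([(1/6, face) for face in range(1, 7)]))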
The above is also true for continuous variables. The variance is the integral of f(x)x², minus the square of the mean. The proof is really the same as above.
Variance is a bit troublesome however, because the units are wrong. Let a random variable indicate the height of a human being on earth. Height is measured in meters, and the mean, the average height of all people, is also measured in meters. Yet the variance, the variation of height about the mean, seems to be measured in meters squared. To compensate for this, the standard deviation is the square root of variance. Now we're back to meters again. If the average height is 1.7 meters, and the standard deviation is 0.3 meters, we can be pretty sure that a person, chosen at random, will be between 1.4 and 2.0 meters tall. How sure? We'll quantify that later on. For now, the standard deviation gives a rough measure of the spread of a random variable about its mean.
Assume, without loss of generality, that mean(x) = mean(y) = 0. If x and y have density functions f and g, the individual variances are the integrals of f(x)x² and g(y)y², respectively. Taken together, the combined density function is f×g, and we want to know the variance of x+y. Consider the following double integral.
∫∫f(x)g(y)(x+y)² =
∫∫{ f(x)g(y)x² } + ∫∫{ 2f(x)g(y)xy } + ∫∫{ f(x)g(y)y² }
The first integral is the variance of x, and the third is the variance of y. The middle integral is twice the mean of x times the mean of y, or zero. Therefore the variance of the sum is the sum of the variances.
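A quick numerical check, using a normal variable and a uniform variable as stand-ins for x and y:

    import random

    # A minimal numeric check: for independent variables, the variance of the sum
    # comes out close to the sum of the individual variances.
    random.seed(3)
    n = 200_000
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]     # variance about 1
    ys = [random.uniform(-3.0, 3.0) for _ in range(n)]  # variance about 3
    def variance(v):
        m = sum(v) / len(v)
        return sum((t - m) ** 2 for t in v) / len(v)
    print(variance(xs) + variance(ys))                # about 4
    print(variance([x + y for x, y in zip(xs, ys)]))  # about 4 as well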
We can however say something about the odds of finding |x| ≥ c, for c ≥ 1. Let |x| exceed c with probability p. The area under the curve, beyond ±c, is p. This portion of the curve contributes at least pc² to the variance. Since this cannot exceed 1, the probability of finding |x| beyond c is bounded by 1/c².
Generalize the above proof to a random variable with mean m and standard deviation s. If c is at least s, x is at least c away from m with probability at most s²/c².
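Here is a rough numerical check of the bound, using made-up values for m, s, and c, with normal samples as the test variable.

    import random

    # A minimal numeric check of the bound: for a variable with mean m and standard
    # deviation s, the chance of landing at least c away from m stays below s²/c².
    random.seed(4)
    m, s, c = 10.0, 2.0, 5.0
    n = 200_000
    samples = [random.gauss(m, s) for _ in range(n)]
    observed = sum(abs(x - m) >= c for x in samples) / n
    print(observed, "<=", (s / c) ** 2)  # observed fraction, versus the bound 0.16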
The expected error is the integral of f(x)(x-t)², from -infinity to +infinity. Write this as three separate integrals:
error = ∫{ f(x)x² } - ∫{ 2f(x)xt } + ∫{ f(x)t² }
The first integral becomes a constant, i.e. it does not depend on t. The second becomes -2mt, where m is the mean, and the third becomes t². This gives a quadratic in t. Find its minimum by setting its first derivative equal to 0. Thus t = m, and the mean is your best guess. The expected error is the variance of f.
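Here is a small sketch that sweeps candidate guesses for one fair die; the minimum shows up at the mean, 3.5, with expected error 35/12.

    # A minimal sketch: sweep candidate guesses t and confirm the expected squared
    # error bottoms out at the mean, where it equals the variance.
    outcomes = [(1/6, face) for face in range(1, 7)]
    mean = sum(p * x for p, x in outcomes)            # 3.5

    def expected_error(t):
        return sum(p * (x - t) ** 2 for p, x in outcomes)

    best = min((expected_error(t / 100), t / 100) for t in range(0, 701))
    print(best)                   # minimum near t = 3.5
    print(expected_error(mean))   # 35/12, the variance of one die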