Probability, Normal Distribution and the erf Function

Normal Distribution

The normal distribution is so named because it occurs everywhere in nature.  It is "normal" to see variables distributed in this manner.

Consider the height of sunflowers, for example.  I don't know what the average is, but let's say it is 2 meters.  (One web site says 185cm, so I guess I'm pretty close.)  Thus the "average" sunflower is 2 meters tall, and most mature sunflowers are between 1.5 and 2.5 meters tall.  It's not too hard to find sunflowers that are 1.2 or 2.8 meters in height, but as you move towards the extremes, such as 0.7m or 3.3m, well, those plants are very hard to find.  If you're a statistician, you might begin to wonder if they are sunflowers at all.

In a normal distribution the variable stays close to its mean most of the time.  It may wander a bit to one side or the other, but it won't stray very far, as though it were tethered by a rubber band.  Beyond a certain distance, the probability decreases exponentially.

For convenience, place the average value at 0.  In the sunflower example, make your measurements from 2 meters off the ground.  Tall flowers are positive and short flowers are negative.  We are now ready to describe the normal density function.  I'll call it n(x) for convenience, i.e. normal(x).

n(x) = e^(-x²/2) / a

Here a is the area under the curve e^(-x²/2).  And why do we divide by a?  So the area under the curve becomes 1.  In other words, the probability of finding x between -∞ and +∞ is 1, as it should be.

So what is a?  Substitute x = sqrt(2)·v, giving the following.

area = sqrt(2) ∫ e^(-v²) dv

A clever trick involving double integrals (square the integral and switch to polar coordinates) shows that ∫ e^(-v²) dv = sqrt(π).  The area is therefore sqrt(2)·sqrt(π) = sqrt(2π).  This is the value of a in the formula for n(x) given above.
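
If you don't want to take the double-integral trick on faith, a brute-force numerical check agrees.  This is just a sketch; the midpoint sum, step count, and cutoff at |x| = 10 are my own choices (the tails beyond that point are negligible).

```python
import math

# Approximate the area under e^(-x^2/2) with a midpoint Riemann sum.
# The integrand is vanishingly small beyond |x| = 10.
def area(num=100000, lim=10.0):
    dx = 2 * lim / num
    total = 0.0
    for i in range(num):
        x = -lim + (i + 0.5) * dx
        total += math.exp(-x * x / 2)
    return total * dx

print(area())                    # close to sqrt(2*pi)
print(math.sqrt(2 * math.pi))    # about 2.5066
```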

Now graph n(x) in the xy plane.  The graph reaches its maximum at (0, 0.399), and curves down symmetrically on either side.  As we stray farther from zero the graph flattens out, and approaches 0 as x approaches ±∞.  This curve is called the bell curve, because it looks a little like a bell.  A bell is round at the top, slopes down, and flattens out at the base.  The bell analogy suggests a point of inflection on either side of the center, where the bell switches from concave down to concave up.  Verify this by taking the second derivative of n(x) and setting it equal to 0.  The first derivative is -x·n(x), and the second derivative is (x²-1)·n(x).  The inflection points occur at x = ±1.
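
You can also confirm these claims numerically.  Here is a sketch using central finite differences (the helper names d1 and d2 are mine) to check the peak height, the identity n'(x) = -x·n(x), and the inflection point at x = 1.

```python
import math

def n(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Central differences approximate the first and second derivatives.
def d1(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

print(n(0))                       # peak height, about 0.399
print(d1(n, 0.5) + 0.5 * n(0.5))  # essentially zero: n'(x) = -x*n(x)
print(d2(n, 1.0))                 # essentially zero: inflection at x = 1
```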

What about mean and variance?  Note that n(x) is an even function, symmetric about 0, hence the mean is 0.

To find the variance, multiply n(x) by x² and integrate by parts.  The result is shown below.

∫ x²·n(x) dx = -x·n(x) + ∫ n(x) dx

The term -x·n(x) vanishes at ±∞.  Thus we are left with the integral of n(x) from -∞ to +∞, which is simply 1.  The variance is 1, and the standard deviation is 1.
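
A numerical check of the variance integral backs this up.  Again a sketch, reusing the same midpoint-sum approach as before with my own choice of step count and cutoff.

```python
import math

def n(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Midpoint Riemann sum for the integral of x^2 * n(x) over the real line.
def variance(num=200000, lim=10.0):
    dx = 2 * lim / num
    total = 0.0
    for i in range(num):
        x = -lim + (i + 0.5) * dx
        total += x * x * n(x)
    return total * dx

print(variance())   # close to 1
```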

What if we want a different standard deviation?  Stretch or shrink the graph along the x axis, so that the bell becomes fatter or thinner.  Of course the height of the curve must decrease as the curve widens, or increase as the curve narrows, so that the area under the curve is always 1.  Here is the corresponding algebra.

nσ(x) = e^(-(x/σ)²/2) / (σ·a)

Integrate by substitution, setting x = σv, to show the area under this curve is still 1.  Further integration shows the standard deviation is σ.

Establish a mean of m by replacing x with x-m.  This merely slides the curve along the x axis.  Thus a normal distribution is completely characterized by two parameters, its mean m and its standard deviation σ.
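Putting the two parameters together gives the general density.  A minimal sketch; the function name `normal` and its defaults are my own.

```python
import math

# Normal density with mean m and standard deviation s: shift x by m,
# scale by s, and divide by s so the total area stays 1.
def normal(x, m=0.0, s=1.0):
    z = (x - m) / s
    return math.exp(-z * z / 2) / (s * math.sqrt(2 * math.pi))

print(normal(0.0))                # standard peak, about 0.399
print(normal(2.0, m=2.0, s=0.7))  # sunflower curve peaks at its mean, 2 m
```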

Distribution Function and erf()

Let d(x) be the distribution function corresponding to the density function n(x).  In other words, d(x) is the integral of n(t) as t runs from -∞ to x.  Thus d(x) gives the probability of being less than x.

Note that d(0) is ½.  With 0 as the mean, you're just as likely to be positive as negative.  When x drops below -2, d(x) is practically 0, and when x goes beyond 2, d(x) is practically 1.  Rarely is a random variable less than -2 or greater than 2.  These are the tails of the bell curve, the long thin regions far from the mean.

As illustrated above, we are usually interested in the probability that an outcome lies between -x and x, i.e. within a certain distance of the mean.  Call this function h(x), whence h(x) = d(x)-d(-x).
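
Both functions are easy to compute with Python's math.erf, the same erf function described at the end of this article; the relation d(x) = (1 + erf(x/sqrt(2)))/2 follows from the integral worked out below.  A sketch:

```python
import math

# d(x) = (1 + erf(x/sqrt(2))) / 2, and h(x) = d(x) - d(-x).
def d(x):
    return (1 + math.erf(x / math.sqrt(2))) / 2

def h(x):
    return d(x) - d(-x)

print(d(0))         # 0.5: equally likely to land on either side of the mean
print(d(-2), d(2))  # nearly 0 and nearly 1: the tails
print(h(1))         # about 0.68, the familiar one-sigma rule
```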

Returning to the sunflower example, how unusual is a sunflower 2.5 meters tall, half a meter from the mean?  Sunflowers range (approximately) from 1 to 3 meters, with a standard deviation of 0.7 meters, so 2.5 meters is well inside the bell.  We're nowhere near the tail.  Compute h(0.5/0.7) and get about 52%.  So 52% of sunflowers are closer to the mean, and 48% are farther from it.  This looks like a typical sunflower.

Now picture a sunflower that is 5 meters tall.  This time h(3.0/0.7) = 99.998%.  Virtually all sunflowers are shorter than this one; maybe there's an error.  Maybe this isn't a sunflower after all.  Maybe it's a tree.
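
Here is a quick check of both sunflower figures with Python's math.erf; as the next few paragraphs derive, h(x) works out to erf(x/sqrt(2)).

```python
import math

# h(x) for a standard normal variable; for standard deviation s,
# pass x/s, measuring the distance from the mean in deviations.
def h(x):
    return math.erf(x / math.sqrt(2))

print(100 * h(0.5 / 0.7))  # the 2.5 m flower: about 52%
print(100 * h(3.0 / 0.7))  # the 5 m flower: about 99.998%
```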

Recall that h(x) = d(x)-d(-x).  Since the bell curve is symmetric, we can use the following integral.

h(x) = 2 ∫ n(t) dt [t runs from 0 to x]

Expand n(t), substitute t = sqrt(2)·v, and get the following.

h(x) = (2·sqrt(2)/a) ∫ e^(-v²) dv [v runs from 0 to x/sqrt(2)]

Remember that a = sqrt(2π), so the scaling factor becomes 2/sqrt(π).

If you have access to a Unix/Linux computer, type `man erf'.  The erf function, in the math library, computes the above integral, complete with the 2/sqrt(π) scaling factor, but from 0 to x rather than from 0 to x/sqrt(2).  You have to remember to divide by sqrt(2).  Therefore, a random variable with a normal distribution, and standard deviation s, is within x of its mean with probability erf(x/(sqrt(2)·s)).
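
For example, this formula reproduces the classic one-, two-, and three-sigma probabilities.  A sketch; the function name `within` is mine.

```python
import math

# Probability that a normal variable with standard deviation s lies
# within x of its mean: erf(x / (sqrt(2) * s)).
def within(x, s=1.0):
    return math.erf(x / (math.sqrt(2) * s))

print(within(1.0))  # about 0.6827: within one standard deviation
print(within(2.0))  # about 0.9545: within two
print(within(3.0))  # about 0.9973: within three
```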

Note that the name erf is short for error function, since it helps us decide if we've made some kind of error.  If an experiment produces values far from the mean, relative to the standard deviation, we can use erf() to quantify the likelihood of these anomalies, and decide if something is amiss.