Probability, Geometric Distribution

Geometric Distribution

In the last section we considered an experiment with two outcomes: success, with probability p, and failure, with probability q, where q = 1-p. Now imagine running this experiment again and again, until you achieve success. What is the probability that the experiment will run exactly k times? It must fail k-1 times, then succeed. This gives a probability of $q^{k-1}p$. As k runs through the positive integers, these probabilities form a geometric progression, hence the name geometric distribution.

As a sanity check, take the sum as k runs from 1 to infinity. The powers of q sum to 1/(1-q), or 1/p. Multiply by p and get 1. Your experiment will succeed, some time in the future, with probability 1.
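The sanity check can also be done numerically. The sketch below is illustrative only: the function names `geometric_pmf` and `partial_sum` and the choice p = 0.25 are assumptions made for this example, not part of the original text. It adds up the first few hundred terms of the series and shows that the total is, to floating-point precision, 1.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Probability that the first success occurs on run k (k = 1, 2, ...)."""
    if k < 1:
        return 0.0
    # k-1 failures, each with probability q = 1-p, followed by one success.
    return (1.0 - p) ** (k - 1) * p


def partial_sum(p: float, n_terms: int) -> float:
    """Sum of the first n_terms probabilities; approaches 1 as n_terms grows."""
    return sum(geometric_pmf(k, p) for k in range(1, n_terms + 1))


if __name__ == "__main__":
    p = 0.25  # hypothetical success probability, chosen for illustration
    print(partial_sum(p, 200))
```

The partial sum over n terms equals 1 minus q to the power n, so the shortfall from 1 shrinks geometrically as more terms are included.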

Finding the mean is a bit tricky. We need to take the sum of $kpq^{k-1}$ as k runs from 1 to infinity. Divide out by p; we'll bring it back in at the end. Now treat q as a real variable, and the kth term becomes the derivative of $q^k$. Interchange summation and differentiation. (Yes, this is legal, although it is far from obvious.) In other words, take the sum of these functions, then differentiate with respect to q. We know that the sum of all the powers of q, starting with $q^0$, is 1/p. In this case we are starting with $q^1$, so our sum is actually $1/p - 1$. (Remember that p is shorthand for 1-q.) Take the derivative with respect to q, giving $1/p^2$. Remember that we need to multiply by p at the end, since we divided by p at the outset. Therefore the mean is 1/p.
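Written out as a single chain, with p replaced by 1-q wherever we differentiate, the steps above read as follows. This is only a restatement of the argument in the paragraph, not a separate proof.

```latex
\[
\sum_{k=1}^{\infty} k\,p\,q^{k-1}
  = p \sum_{k=1}^{\infty} \frac{d}{dq}\, q^{k}
  = p\,\frac{d}{dq}\left( \frac{1}{1-q} - 1 \right)
  = \frac{p}{(1-q)^{2}}
  = \frac{p}{p^{2}}
  = \frac{1}{p}
\]
```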

The math is complicated, but the answer is intuitive. If the experiment fails 99% of the time, then p is 1%, and 1/p is 100. We need to run the experiment about 100 times, on average, before it succeeds.
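A quick simulation makes the same point. The sketch below is a rough illustration rather than a rigorous test: the helper names, the seed, and the number of repetitions are all assumptions chosen for this example. It repeats the experiment until success many times over and averages the run counts, which should land near 1/p.

```python
import random


def runs_until_success(p: float, rng: random.Random) -> int:
    """Run the experiment until it succeeds; return how many runs it took."""
    count = 1
    # rng.random() is uniform on [0, 1), so it falls below p with probability p.
    while rng.random() >= p:
        count += 1
    return count


def simulated_mean(p: float, n_repeats: int, seed: int = 0) -> float:
    """Average number of runs over n_repeats independent repetitions."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    total = sum(runs_until_success(p, rng) for _ in range(n_repeats))
    return total / n_repeats


if __name__ == "__main__":
    p = 0.01  # the experiment fails 99% of the time
    print(simulated_mean(p, 20000), "versus the predicted", 1 / p)
```

The simulated average will not equal 100 exactly; it fluctuates around that value, and the fluctuation shrinks as the number of repetitions grows.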

To find the variance, we need to compute the weighted sum of $k^2$, then subtract the mean squared. What is the sum of $pk^2q^{k-1}$?

Start by adding the mean. We'll subtract the mean at the end, to compensate. Remember that the mean is the sum of $pkq^{k-1}$. So the weighted sum of $k^2$, plus the mean, has this formula.

$\sum_{k=1}^{\infty} p\,(k^2+k)\,q^{k-1}$

Divide by p, as we did before. We'll multiply by p later on, to compensate. Now each term, $(k^2+k)q^{k-1} = (k+1)k\,q^{k-1}$, becomes the second derivative of $q^{k+1}$. Interchange summation and differentiation. In other words, add up the terms $q^{k+1}$, then differentiate twice. The sum of the powers of q, starting with $q^2$, equals $1/p - q - 1$. Take the second derivative with respect to q, and the second and third terms disappear, leaving $2/p^3$.
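As with the mean, the step can be written as one chain, with p replaced by 1-q wherever we differentiate. This merely restates the paragraph above in symbols.

```latex
\[
\sum_{k=1}^{\infty} (k^{2}+k)\,q^{k-1}
  = \frac{d^{2}}{dq^{2}} \sum_{k=1}^{\infty} q^{k+1}
  = \frac{d^{2}}{dq^{2}} \left( \frac{1}{1-q} - q - 1 \right)
  = \frac{2}{(1-q)^{3}}
  = \frac{2}{p^{3}}
\]
```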

Now we have to remember where we parked. We divided by p, so multiply by p to get $2/p^2$. And we added the mean, so subtract the mean, giving $2/p^2 - 1/p$. This is the weighted sum of $k^2$. To find the variance, subtract the square of the mean.

variance $= 2/p^2 - 1/p - 1/p^2$
$= 1/p^2 - 1/p$
$= (1-p)/p^2$
$= q/p^2$
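The closed forms for the mean and variance can be checked against a direct, truncated evaluation of the two series. The sketch below is only a numerical cross-check under assumed values: the function name `truncated_moments`, the cutoff of 10000 terms, and p = 0.2 are choices made for this example.

```python
def truncated_moments(p: float, n_terms: int = 10000) -> tuple[float, float]:
    """Approximate the mean and variance by summing the first n_terms terms."""
    q = 1.0 - p
    mean = 0.0
    second_moment = 0.0
    weight = p  # probability of k = 1, namely q^0 * p
    for k in range(1, n_terms + 1):
        mean += k * weight
        second_moment += k * k * weight
        weight *= q  # step from q^(k-1) * p to q^k * p
    # Variance is the weighted sum of k^2 minus the square of the mean.
    return mean, second_moment - mean * mean


if __name__ == "__main__":
    p = 0.2  # hypothetical success probability
    mean, variance = truncated_moments(p)
    print(mean, "versus 1/p =", 1 / p)
    print(variance, "versus q/p^2 =", (1 - p) / p**2)
```

Because the terms decay geometrically, the truncation error is negligible long before the cutoff is reached for any p that is not extremely small.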

The standard deviation is sqrt(q)/p. When q is close to 1, such as 99%, the standard deviation is practically the same as the mean. In other words, you shouldn't be surprised if the experiment runs twice as long as expected, before success occurs.
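To put a number on that last remark, note that the experiment runs more than n times exactly when the first n runs all fail, which happens with probability q to the power n. The sketch below uses that fact; the function name `prob_more_than` and the values p = 0.01 and n = 200 are assumptions picked to match the 99% example.

```python
import math


def prob_more_than(n: int, p: float) -> float:
    """Probability that the first n runs all fail, so success takes more than n runs."""
    return (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.01
    mean = 1 / p
    std_dev = math.sqrt(1 - p) / p
    print("standard deviation / mean =", std_dev / mean)
    # Chance that the experiment takes more than twice the expected number of runs.
    print("P(more than 200 runs) =", prob_more_than(200, p))
```

For these values the ratio of standard deviation to mean is close to 1, and the chance of needing more than 200 runs is roughly 13%, so a run twice as long as expected is uncommon but far from rare.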