Probability, Seeing all the Marbles

Seeing all the Marbles

There are k marbles in a drawer, of different colors, you pull one out at random and put it back; what is the expected number of trials to see all k marbles?

One marble, it's green, one trial, you're done. k = 1: expected value = 1.

Two marbles, first one is red (let's say), 2 trials reveals the green marble with probability one half, green marble appears on the third trial with odds 1/4, green marble on fourth trial with odds 1/8, and so on. Bring up a calculator, like bc, in fact you can paste this right into bc (don't forget to say scale=5):

2/2 + 3/4 + 4/8 + 5/16 + 6/32 + 7/64 = 2.859.

The actual answer is 3, but let's prove it. We want the sum, starting at n = 2, of n times ½ to the n-1. This is the sum, starting at 2, of nx^n-1, evaluated at x = ½. Now for a beautiful calculus trick! The sum of nx^n-1 is the derivative of the sum of xⁿ. Mathematicians have died and gone to heaven so we can exchange differentiation and summation, even for an infinite sum. The sum of xⁿ from 0 to infinity is 1/(1-x). Differentiate 1/(1-x) and get 1/(1-x)². That will equal the sum 0 + 1 + 2x + 3x² + 4x³ + 5x⁴ … Substitute x = ½ in 1/(1-x)² and get 4. But n starts at 2, so subtract away the lead terms, 0 + 1, and get 3. There you have it, you can expect to see both colors by three trials. Sometimes 2, sometimes more than 3, but usually 3. k = 2: expected value = 3.

Move on to three marbles. The last color seen, at trial n, is, without loss of generality, green. It was all red and blue before. For shorthand let p = 2/3 and q = 1/3. Has to be blue or red up to n so p^n-1. But we also didn't see just red or just blue, so subtract away twice q^n-1. This is multiplied by n, then sum over n. Using the same calculus trick, sum of np^n-1 is 9. First three terms trials 0 1 and 2 don't count, you have to pull out at least 3 marbles. Subtract away 1 + 2p, leaving 20/3. The sum of nq^n-1 is 9/4. Subtract away 1+2q and get 7/12. This is doubled so 7/6. Subtract from 20/3 and get 33/6. That's 5 and a half trials to see all three colors. k = 3: expected value = 5.5.

Interestingly, if we don't adjust for the lead terms, if we just write 9 - twice 9/4, we get 4.5. This is 1 less than 5.5. More on this below.

Move on to k = 4, hoping for a pattern. Last marble green: red blue and yellow up to that point, sum of n times ¾^n-1, equals 16. Subtract lead terms 1 + 2×¾ + 3×¾², leaving 11 + 13/16.

We counted just red and blue, and shouldn't have, and also red and yellow and blue and yellow. Subtract 3 times the sum of n×½^n-1. The sum is 4, minus lead terms 1 + 2×½ + 3×½² yields 5/4. Multiply by three, and subtract from 11 + 13/16, and get 8 + 1/16.

All red was counted in the first sum, then subtracted away twice, that is, overcounted, in the second sum. We want it to be subtracted just once, not twice, so add it back in, and same for yellow and for blue. Three times the sum of n×¼^n-1. Sum is 16/9, minus lead terms 1 + 2×¼ + 3×¼² yields 13/144. Multiply by 3 and add it back in and get 8 and a third. As you go past 8 trials, you can expect to see all four colors.

so what about these lead terms? Note that the math follows reality. First trial is not green with odds ¾. Confined to red or blue with odds ½. Fixed at red with odds ¼. Write ¾ - 3×½ + 3×¼, and get 0. The odds of representing all three colors on the first trial are 0. This is multiplied by 2 in our expected value calculation, the term we called 2x, but 2 times 0 is 0.

How bout the term we called 3x²? Start by looking at x², we can multiply by 3 later. This mirrors what happens after two trials. ¾² - 3×½² + 3×¼² = 0. Multiply by 3 and get 0.

Aha, we don't have to subtract away these lead terms as we go, they will all come out 0. Except for the first term, the term of 1. When k = 3 we had 1 - 2×1 = -1. With k = 4 we have 1 - 3×1 + 3×1 = 1. For k in general, the alternating sum is the binary expansion of (1-1) to the k, which is 0. The last term in the expansion is not included in our calculation, so we are off by 1 or -1, depending on the parity of k. Don't worry about the lead terms, just add or subtract 1 at the end. That makes the math easier.

With k = 3, the two series combine to produce 4.5. Add 1 to get 5.5. Let's confirm with k = 4. 16 - 3×4 + 3×16/9 = 9 and a third. Subtract 1 and get 8 and a third, which matches our long-hand calculation.

k = 5, and thank goodness we aren't messing with lead terms any more. I'll do this one on a calculator.

5^2 - 4*(5/2)^2 + 6*(5/3)^2 - 4*(5/4)^2 = 10.41666

Add 1 and get 11.41666. 11 and a half trials to see all 5 colors.

Now let's be bold and try k in general. What are the odds of seeing green, the last color, at trial n?

for shorthand let l = k-1. There is no green before, so we start with l/k to the n-1. I don't want to count, with in this total, the odds of drawing the first l-1 colors only, that is (l-1)/k to the n-1. And this can happen in l different ways, so, subtract l times (l-1)/k to the n-1.

I'm going to stop writing "to the n-1", we'll put it back later.

What if we only saw the first l-2 colors in the first n-1 trials? I have oversubtracted those, so have to add them back in. This is called inclusion exclusion in combinatorics. Add (l:2) times (l-2)/k. Here (l:2) is the binomial coefficient l choose 2. This continues all the way down to just one color across the first n-1 trials, (l:l-1) times (l-(l-1))/k. Thus we have the alternating sum, as j runs from 0 to l-1, of:

(l:j) times (l-j)/k to the n-1.

Employ the calculus trick, and this becomes the alternating sum of:

(l:j) times k² / (k-(l-j))²

(l:j) times k² / (j+1)²

You'll recognize this as the calculation we performed when k = 5.

What if we let j run all the way up to l, instead of l-1? This adds or subtracts 1, as per the parity of l. It is precisely the fudge factor we need. Our formula is then:

k² times {
alternating sum(j = 0,l) (l:j) / (j+1)²
}

Note that k * (l:j) / (j+1) = (l,j+1). This is binomial coefficient magic. With that in mind, we can rewrite the formula this way:

k * {
alternating sum(j = 1,k) (k:j) / j
}

Verify with k = 5.

5 * (5/1 - 10/2 + 10/3 - 5/4 + 1/5) = 11.41666.

k = 6:

6 * (6/1 - 15/2 + 20/3 - 15/4 +6/5 - 1/6) = 14.7.

k = 10:

10 * (10/1 - 45/2 + 120/3 - 210/4 + 252/5 - 210/6 + 120/7 - 45/8 + 10/9 - 1/10) = 29.289.

Let's try to clean up the formula further. Recall that we had the alternating sum of (k:j)/j, as j runs from 1 to k. This is minus ∑ (k:j)/j * x^j, evaluated at x = -1. If we plug in x = 0, we get 0. Thus it is the sum evaluated at 0, minus the sum evaluated at -1.

Each of these terms is the integral of (k:j)*x^j-1, as j runs from -1 to 0. Thus we want the integral of the sum of (k:j)*x^j-1, as x runs from -1 to 0.

Multiply the sum by -x and divide the sum by -x. A positive denominator is helpful here. Now we have - ∫ { (∑ (k:j)*x^j) divided by -x }.

Since j starts at 1, rather than 0, we replace the sum with (1+x)^k - 1.

- ∫ { ((1+x)^k - 1) divided by -x }.

The integral of 1/-x is -log(-x). Put -log(-x) over in the answer column. Since -x is positive, there is no trouble taking the log. That leaves (1+x)^k / -x to deal with.

For shorthand, let q = 1+x, and let l = log(x). Let s = sin(θ) and let c = cosine(θ). Integrate by substitution. Let x be -s². θ ranges from 0 to π/2, as x ranges from -1 to 0.

q^k becomes c^2k. ∂x becomes -2cs. The minus in -2cs cancels the minus that was outside the integral. The denominator is s². The integrand is now 2c^2k+1/s.

Pull 2 outside the integral, and set it to the side.

k is at least 1, so replace c² with 1-s². This gives c^2k-1 times (s + 1/s). The integral of c^2k-1s is - c^2k/(2k). The second piece is a power of cosine, divided by sine, and we can do the same trick again. That spins off - c^2k-2/(2k-2). This continues all the way down to cotangent, with integral l(s).

We had a factor of 2, so it's really 2l(s). Twice l(s) is l(s²), or l(-x). Put l(-x) in the answer column. This joins - l(-x), and they cancel. Good thing, cause they were each approaching infinity as x approaches 0.

Consider a term like -c⁴/4. Evaluate at π/2 and get 0. Evaluate at 0 and get -1/4. Subtract and get 1/4. The answer is a sum of even reciprocals. Bring in the factor of k that was part of the original formula, and get this.

2k * { ∑(i=1 to k) 1/2i }

k * { ∑(i=1 to k) 1/i }

Let's verify with the values of k that we have already established.

k	expected value
1	1
2	3
3	5.5
4	8.333
5	11.416
6	14.7
7	18.15
8	21.743
9	25.460
10	29.289
22	81.198

As k approaches infinity, the harmonic series approaches log(k). Thus k*log(k) is a good approximation. In fact, k*log(k) + ½k is remarkably good.

McDonalds Drinks

What does all this have to do with McDonalds drinks?

Around 2018, McDonalds began printing drink-friendly phrases on the lids of their plastic cups. Naturally, I kept a log of the ones we saw. I couldn't find a list of the lids online, so my home-grown database would have to serve. Here are the lids, as of this writing.

Lids
AAAAH, so Tasty!
Don't Waste the Taste
Time for Taste
Sip the Day
Time to Sip Away
Sip for Joy
Enjoy the Sip
Enjoy Every Sip
Savor Every Sip
Savor Every Last Drop
Refreshingly Tasty
Enter Refresh Mode
Time to Refresh!
Drench Your Thirst
Thirst has no Chance
Goodbye Thirst
Feeling Bubbly
Keep it Bubbly
It's Nice on Ice
It's Nice to Ice
Drinking of You
Jump for Joy

There are 22 lids, and I wondered, is this all of them? Can I be reasonably sure?

Suppose there are 22 lids, and use the earlier formula. We can expect to see all the lids after 81 trials. How often have we been to McDonalds? Let's say we go once a week, and my wife and I each get a drink. We've been keeping trrack for three years, which is 150 weeks. Times 2 drinks is 300 cups. That is far more than 80. If there were more than 22 distinct lids, I think we would have seen them by now. Furthermore, we haven't seen a new lid in well over a year. Therefore, I conclude these are all the lids in circulation.