The Birthday Problem

My wife and I have four kids and every year, we notice that our birthdays all fall between late summer and late winter. Because I was bored of reading and writing about critical theory, I’d decided to work out the exact probability of how likely this clustering is. The problem turned out to be fairly interesting, so I’m posting the answer below. Interested mathematicians can stick around until the end and I’ll sketch the proof.

Problem

The problem is simple: given N birthdays distributed randomly throughout the year, what is the probability P that they will all fall within the span of M months? To simplify things, let’s call x the fraction of the year that the birthdays span so that x=0.25 would mean the birthdays span 3 months, x=0.5 would mean that the birthdays span 6 months, etc… We can label the birthdays x_1, x_2, x_3, …, x_N based on their position with respect to January 1 (see Figure 1). We’re then interested in the probability distribution function P_N(x). Note that months are cyclical. For example if two birthdays fell on Dec. 1 and Feb. 1, they would span 2 months, not 11 months. This feature is what makes the problem interesting.

Figure 1 – Example of N=6 birthdays distributed over an interval x

One key property of the distribution function P_N(x) is that it’s piecewise smooth and is naturally defined over a series of intervals of the general form x = [(r-1)/r, r/(r+1)] for r=1, 2, 3, 4, etc… In other words, Region 1 is the interval x = [0,1/2], Region 2 is the interval x = [1/2, 2/3], Region 3 is the interval x = [2/3, 3/4], etc… The function is smooth within each of these intervals, but has a discontinuous derivative at their boundaries. Another property of the function is that P_N(x) is only non-zero in Regions 1, 2, 3, …, N-1. In Regions N, N+1, N+2, etc… P_N(x) = 0. A simple illustration of the actual probability distribution P_3(x) is show below, illustrating these properties (Figure 3). Note that the function is non-zero in Region 1 and 2 and is zero outside of those regions.

Figure 2 – Predicted distribution for N=2 (blue line) compared to simulated distribution (green bars).

Solution

With this preface, the final solution for P_N(x) is shown below in Eq. 1. Remember that the function is zero everywhere except on the interval x = [0,(N-1)/N]. There are N-1 terms in Eq. 1. and each term is “turned on” by a Heaviside step function. The first term is therefore non-zero in all Regions (i.e. x > 0). The second term is non-zero only in Regions 2, 3, 4, etc… (i.e. x > 1/2). The third term is non-zero only in Regions 3,4, etc… (i.e. x > 2/3). etc…

Equation 1

The behavior of this function is quite interesting. For the simple case N=2 it predictably yields a uniform distribution over the interval x = [0,1/2]. That is, if there are only two birthdays, they can be at most 1/2 a year apart and all spans are equally likely, assuming the birthdays are randomly distributed. The case for N=3 (see Figure 2 above) is more interesting. We see two obviously distinct regions, Region 1 where x = [0, 1/2] and Region 2 where x = [1/2, 2/3]. The fact that the probability is 0 when x > 2/3 makes sense because 3 birthdays can span at most 2/3 of the year (e.g. if they occur on Jan. 1, May 1, and Sept. 1).

The function’s behavior grows more complex as N increases. Figure 3 shows the predicted probability distribution (blue lines) versus the simulated distribution (green bars) for N=6. As N increases, the average number of months spanned by a randomly chosen set of birthdays increases, as we’d expect because as N approaches infinity, the entire calendar will be filled so that the only possible value for x will be 1.

Figure 3 – Predicted distribution for N=6 (blue line) compared to simulated distribution (green bars).

Finally, using Equation 1, we can create a chart of cumulative probabilities as a function of N (table 1). For my family, N=6 and our birthdays all fall roughly within a span of 7 months. The chance of this happening given randomly distributed birthdays is 40.3%, which is not particularly unlikely. In contrast, the chances that a family of N=12 would have birthdays which all fell within the same span is only 3.2%.

Table 1 – Cumulative probabilities for various N

I found this problem fascinating and was surprised that I was able to obtain a solution so quickly. For those interested in the mathematical details, see below:

Mathematical details

To solve this problem, we first convert the set of birthdays {x_1, x_2, x_3, …, x_N} into a set of gaps {a_1, a_2, a_3, …, a_{N-1}} where a_1 = x_2-x_1, a_2 = x_3-x_2, …, a_{N-1} = x_N – x_{N-1}, where we have assumed without loss of generality that the biggest gap occurs between x_N and x_1. The total span of the birthdays x is then given by x = a_1 + a+2 + a_3 + … a_{N-1} (see Figure 4).

Figure 4 – Converting birthdays {x_1, x_2, …, x_N} to gaps {a_1, a_2, …, a_{N-1}}

The problem of finding the probability distribution P_N(x) is then equivalent to finding the “volume” of the (N-1)-dimensional polytope containing all the points (a_1, a_2, …, a_{N-1}) which satisfy Equation 2:

Equation 2

However, we must solve this equation subject to the constraints that for all j, a_j < 1-x. These constraints are necessary to ensure that none of our gaps a_j is larger than the gap between x_N and x_1, which we assumed at the beginning.

In general, finding the volume of a polytope is a hard problem. However, we can solve the problem above through the judicious use of the inclusion-exclusion theorem. First, we find the volume of the polytope described by Eq. 2 without imposing any constraints. Then we subtract the volume of the polytope which satisfies Eq. 2 while violating exactly one constraint. Then we add back in the volume of the polytope which satisfies Eq. 2 while violating exactly two constraints. Then we subtract the volume of the polytope which satisfies Eq. 2 while violating exactly three constraints. And so on and so on. Geometrically, we note that Region 1 corresponds to the region in which no constraints can be violated because 1-x > x. Region 2 corresponds to the region in which at most 1 constraint can be violated because 2(1-x) > x. Region 3 corresponds to the region in which at most 2 constraints can be violated because 3(1-x) > x. etc..

Amazingly, the math required to solve this problem is fairly limited. The calculations can be performed with basic knowledge of calculus. The real difficulty is in setting the problem up properly and then applying the inclusion-exclusion theorem. But it provides an excellent illustration of how real-life problems can be framed as abstract mathematical problems which yield elegant, satisfying solutions.


Related articles: