I was looking to understand the discrepancy between my analysis and simulations. In the last post, I blamed it on not considering conditional probability. But, reexamining the results, I have concluded that the analytic approximation is, in fact, very accurate.
In hindsight, it is remarkable that no simulation at all (i.e., one employing a normal distribution random number generator) is needed to solve this problem.
Here are the results for running a simulation of "sample" persons where the sample ranges from 7 to 447. The mean is the fraction of groups that had at least 1 shared birthday. The distribution of ages is P(i), a normal PDF with a mean of 60 years, a sigma of 12 years, and cutoffs at 0 and 120 years, with i being the day number. The distribution was renormalized to integrate to exactly 1.0 even with the tiny 5-sigma cutoff. I ran each sample 30,000 times, repeating 5 times to get a mean and sigma.
My approximation for the probability P of at least 1 birthday for sample n is given by:
P(n) = 1-P(pair)^(n(n-1)/2),
where P(pair)={sum, i=1,m of p(i)(1-p(i)}
and where m is the number of days in 120 years, m=43825.
p(i) is the usual gaussian.
The program is
here. In the table below, I compare the simulated mean to the analytic approximation. 1-sigma for the simulation is given in parentheses
sample 7, mean= 0.0014 (0.0001), approx = 0.0013
sample 27, mean= 0.0218 (0.0008), approx = 0.0222
sample 47, mean= 0.0672 (0.0009), approx = 0.0669
sample 67, mean= 0.1342 (0.0024), approx = 0.1321
sample 87, mean= 0.2143 (0.0023), approx = 0.2132
sample 107, mean= 0.3092 (0.0033), approx = 0.3047
sample 127, mean= 0.4027 (0.0022), approx = 0.4011
sample 147, mean= 0.5005 (0.0019), approx = 0.4972
sample 167, mean= 0.5916 (0.0016), approx = 0.5886
sample 187, mean= 0.6750 (0.0004), approx = 0.6719
sample 207, mean= 0.7465 (0.0013), approx = 0.7449
sample 227, mean= 0.8107 (0.0008), approx = 0.8067
sample 247, mean= 0.8597 (0.0021), approx = 0.8573
sample 267, mean= 0.8999 (0.0016), approx = 0.8973
sample 287, mean= 0.9301 (0.0008), approx = 0.9279
sample 307, mean= 0.9522 (0.0004), approx = 0.9507
sample 327, mean= 0.9669 (0.0009), approx = 0.9671
sample 347, mean= 0.9791 (0.0009), approx = 0.9787
sample 367, mean= 0.9865 (0.0007), approx = 0.9865
sample 387, mean= 0.9915 (0.0004), approx = 0.9917
sample 407, mean= 0.9952 (0.0003), approx = 0.9950
Conditional probabilities can help explain why the approximation is a tiny bit low: strictly speaking, successive addition of more people to the sample leaves fewer "unclaimed" birthdays. But this fraction nearly 1.
Maximum fraction = (43825-407)/43825 = 0.99 for the samples above.
I am not sure why my previous simulation (with a different algorithm) gave the high number of 157 for the 50% probability. 147 is correct. Simulations with the new algorithm agree give the smaller values above.
An exact analytic analysis might require considering all possible current populated birthday configurations and ways adding one more person to yield a match. But this might alternatively take a probabilistic form. Perhaps there is some neat recursion trick to generate the equation governing the addition of each new person.