< January 2020 >
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930310102

Friday, 03 January 2020

02:00 PM

The birthday paradox puzzle: tidy simulation in R [Variance Explained] 02:00 PM, Friday, 03 January 2020 03:20 PM, Friday, 03 January 2020

Previously in this series:

The birthday problem is a classic probability puzzle, stated something like this.

A room has n people, and each has an equal chance of being born on any of the 365 days of the year. (For simplicity, we’ll ignore leap years). What is the probability that two people in the room have the same birthday?

If you’re not familiar with the puzzle, you might expect that in a moderately sized room (like 20-30 people) that the chance will be pretty small that two people have the same birthday. But in fact, if the room has more than 23 people, the chance is greater than 50%! This makes the puzzle a classic for intro statistics classes.

I’ve been interested for a while in the tidyverse approach to simulation. In this post, I’ll use the birthday problem as an example of this kind of tidy simulation, most notably the use of the underrated crossing() function.

Simulating the birthday paradox

First, I’ll show the combined approach, before breaking it down.

library(tidyverse)
theme_set(theme_light())

summarized <- crossing(people = seq(2, 50, 2),
                       trial = 1:10000) %>%
  mutate(birthday = map(people, ~ sample(365, ., replace = TRUE)),
         multiple = map_lgl(birthday, ~ any(duplicated(.)))) %>%
  group_by(people) %>%
  summarize(chance = mean(multiple))

# Visualizing the probability
ggplot(summarized, aes(people, chance)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "Probability two have the same birthday")

center

This visualization shows that the probability two people have the same birthday is low if there are 10 people in the room, moderate if there are 10-40 people in the room, and very high if there are more than 40. It crosses over to become more likely than not when there are ~23 people in the room.

I’ll break down the simulation a bit below.

Simulating one case

When you’re approaching a simulation problem, it can be worth simulating a single case first.

Suppose we have 20 people in a room. Ignoring leap years (and treating each calendar day as a number from 1 to 365), we can simulate their birthdays with sample(365, 20, replace = TRUE).

# 10 random numbers from 1 to 365
sample(365, 10, replace = TRUE)
##  [1]  53 216 220 309  13  37  35 299 263 333

We then use two handy base R functions, duplicated and any, to discover if there are any duplicated days.

# When the values 2 and 1 appear for the second time, they're TRUE
duplicated(c(1, 2, 3, 4, 2, 5, 1))
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
# This asks "are any values here duplicated?"
any(duplicated(c(1, 2, 3, 4, 2, 5, 1)))
## [1] TRUE

This gives us a one-liner for simulating a single case of the birthday problem, with (say) 20 people:

any(duplicated(sample(365, 20, replace = TRUE)))
## [1] TRUE

If you run this line a few times, you’ll see it’s sometimes true and sometime false, which already gives us the sense that pairs of people sharing a birthday in a moderately-sized room aren’t as rare as you might expect.

The tidy approach to many simulations across multiple parameters

I find the crossing() function in tidyr incredibly valuable for simulation. crossing() creates a tibble of all combinations of its arguments.

crossing(first = c("a", "b", "c"),
         second = c(1, 2, 3))
## # A tibble: 9 x 2
##   first second
##   <chr>  <dbl>
## 1 a          1
## 2 a          2
## 3 a          3
## 4 b          1
## 5 b          2
## 6 b          3
## 7 c          1
## 8 c          2
## 9 c          3

Our simulation starts by generating 10000 * 25 combinations of people and trial, with values of people ranging from 2 to 50. trial exists only so that we have many observations of each, in order to minimize statistical noise.

# Generate combinations of 10,000 replications and number of people (2-50)
crossed <- summarized <- crossing(trial = seq_len(10000),
                                  people = seq(2, 50, 2))

crossed
## # A tibble: 250,000 x 2
##    trial people
##    <int>  <dbl>
##  1     1      2
##  2     1      4
##  3     1      6
##  4     1      8
##  5     1     10
##  6     1     12
##  7     1     14
##  8     1     16
##  9     1     18
## 10     1     20
## # … with 249,990 more rows

We then use functions from purrr, useful for operating across a list, to generate a list column of integer vectors. With a second step, we determine which of them have multiples of any birthday.

# Generate a list column of samples
sim <- crossed %>%
  mutate(birthday = map(people, ~ sample(365, ., replace = TRUE)),
         multiple = map_lgl(birthday, ~ any(duplicated(.))))

sim
## # A tibble: 250,000 x 4
##    trial people birthday   multiple
##    <int>  <dbl> <list>     <lgl>   
##  1     1      2 <int [2]>  FALSE   
##  2     1      4 <int [4]>  FALSE   
##  3     1      6 <int [6]>  TRUE    
##  4     1      8 <int [8]>  FALSE   
##  5     1     10 <int [10]> FALSE   
##  6     1     12 <int [12]> FALSE   
##  7     1     14 <int [14]> FALSE   
##  8     1     16 <int [16]> FALSE   
##  9     1     18 <int [18]> FALSE   
## 10     1     20 <int [20]> FALSE   
## # … with 249,990 more rows

Finally, we use one of my favorite tricks in R: using mean() to find the fraction of a logical vector that’s TRUE.

# Find the percentage of cases where multiple is TRUE
summarized <- sim %>%
  group_by(people) %>%
  summarize(chance = mean(multiple))

summarized
## # A tibble: 25 x 2
##    people chance
##     <dbl>  <dbl>
##  1      2 0.0019
##  2      4 0.0157
##  3      6 0.0434
##  4      8 0.0706
##  5     10 0.113 
##  6     12 0.162 
##  7     14 0.223 
##  8     16 0.279 
##  9     18 0.344 
## 10     20 0.414 
## # … with 15 more rows

Together these three steps (crossed+sim+summarized) make up the solution in the first code chunk above. Notice this simulation combines the base R approach (sample, duplicated, and any) with tidyverse functions from dplyr (like mutate, group_by, and summarize), tidyr (crossing) and purrr (map and map_lgl). I’ve found this crossing/map/summarize approach useful in many simulations.

How can we check our math? R has a built-in pbirthday function that gives an exact solution: for instance, pbirthday(20) gives the probability two people in a room of 20 will have the same birthday. We can compare our su

# pbirthday is not vectorized, so we have to use map_dbl
summarized %>%
  mutate(exact = map_dbl(people, pbirthday)) %>%
  ggplot(aes(people, chance)) +
  geom_line() +
  geom_line(aes(y = exact), lty = 2, color = "red") +
  labs(x = "# of people",
       y = "Probability two have the same birthday")

center

Looks like we got very close!

Generalizing the birthday problem

We’ve seen how likely it is that there’s a pair of people with the same birthday. What about a group of 3, 4, or 5 who all have the same birthday? How likely is that?

First, instead of working with ~ any(duplicated(.)) as our vectorized operation, we could find how many people are in the most common group of birthdays, with another combination of base R functions: max and table.

# table() returns a vector of frequencies
table(c("a", "b", "b", "c", "c", "d", "d", "d"))
## 
## a b c d 
## 1 2 2 3
# So max(table()) tells us that in this group, one letter occurs 3 times
max(table(c("a", "b", "b", "c", "c", "d", "d", "d")))
## [1] 3
# Use this max(table()) approach in the map_int
sim_most_common <- crossing(people = seq(5, 100, 5),
                            trial = 1:5000) %>%
  mutate(birthday = map(people, ~ sample(365, ., replace = TRUE))) %>%
  mutate(most_common = map_int(birthday, ~ max(table(.))))

sim_most_common
## # A tibble: 100,000 x 4
##    people trial birthday  most_common
##     <dbl> <int> <list>          <int>
##  1      5     1 <int [5]>           1
##  2      5     2 <int [5]>           1
##  3      5     3 <int [5]>           1
##  4      5     4 <int [5]>           1
##  5      5     5 <int [5]>           1
##  6      5     6 <int [5]>           1
##  7      5     7 <int [5]>           1
##  8      5     8 <int [5]>           1
##  9      5     9 <int [5]>           2
## 10      5    10 <int [5]>           1
## # … with 99,990 more rows

Now that we know within each trial what the largest group with the same birthday is. We now need to know the fraction within each number where the most_common is at least 2, 3, 4, or 5. We use another application of the ever-useful crossing function, to apply multiple thresholds to each number of people.

summarized_most_common <- sim_most_common %>%
  crossing(threshold = 2:5) %>%
  group_by(people, threshold) %>%
  summarize(chance = mean(most_common >= threshold))

summarized_most_common
## # A tibble: 80 x 3
## # Groups:   people [20]
##    people threshold chance
##     <dbl>     <int>  <dbl>
##  1      5         2 0.0302
##  2      5         3 0     
##  3      5         4 0     
##  4      5         5 0     
##  5     10         2 0.115 
##  6     10         3 0.001 
##  7     10         4 0     
##  8     10         5 0     
##  9     15         2 0.250 
## 10     15         3 0.0036
## # … with 70 more rows

We can visualize these and compare them to the exact values from pbirthday() (pbirthday() takes a coincident argument to extend the problem to more than two participants).

summarized_most_common %>%
  mutate(exact = map2_dbl(people, threshold,
                          ~ pbirthday(.x, coincident = .y))) %>%
  ggplot(aes(people, chance, color = factor(threshold))) +
  geom_line() +
  geom_line(aes(y = exact), lty = 2) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = "# of people in room",
       y = "Probability a group with the same birthday exists",
       color = "Group size")

center

From this we learn that in a room of 100 people, it is almost certain that a pair exists with the same birthday, more likely than not that that a group of 3 does, unlikely that a group of 4 does, and almost impossible that a group of 5 does.

What I like about the tidy approach to simulation is that we can keep adding parameters that we’d like to vary through an extra crossing() and an extra term in the group_by. For instance, if we’d wanted to see how the probability varied as the number of days in a year changed (not relevant for birthdays, but might be relevant in other applications like the cryptographic birthday attack), we could have added that to the crossing() and worked with the first argument to sample.

Feeds

FeedRSSLast fetchedNext fetched after
XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Bits of DNA XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
blogs.perl.org XML 12:00 AM, Tuesday, 18 January 2022 12:15 AM, Tuesday, 18 January 2022
Blue Collar Bioinformatics XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Boing Boing XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Epistasis Blog XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Futility Closet XML 12:00 AM, Tuesday, 18 January 2022 12:15 AM, Tuesday, 18 January 2022
gCaptain XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Hackaday XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
In between lines of code XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
InciWeb Incidents for California XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
LeafSpring XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Living in an Ivory Basement XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
LWN.net XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Mastering Emacs XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Planet Debian XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Planet Emacsen XML 12:00 AM, Tuesday, 18 January 2022 12:15 AM, Tuesday, 18 January 2022
RNA-Seq Blog XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
RStudio Blog - Latest Comments XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
RWeekly.org - Blogs to Learn R from the Community XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
The Adventure Blog XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
The Allium XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
Variance Explained XML 12:00 AM, Tuesday, 18 January 2022 12:30 AM, Tuesday, 18 January 2022
January 2022
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
December 2021
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102
November 2021
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29300102030405
October 2021
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
September 2021
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930010203
August 2021
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
July 2021
MonTueWedThuFriSatSun
28293001020304
05060708091011
12131415161718
19202122232425
26272829303101
June 2021
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293001020304
May 2021
MonTueWedThuFriSatSun
26272829300102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
April 2021
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829300102
March 2021
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
February 2021
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
November 2020
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30010203040506
September 2020
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293001020304
July 2020
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102
June 2020
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29300102030405
May 2020
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
April 2020
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930010203
February 2020
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282901
January 2020
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930310102
December 2019
MonTueWedThuFriSatSun
25262728293001
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
November 2019
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
October 2019
MonTueWedThuFriSatSun
30010203040506
07080910111213
14151617181920
21222324252627
28293031010203
August 2019
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829303101
July 2019
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
June 2019
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282930
May 2019
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102
April 2019
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29300102030405
March 2019
MonTueWedThuFriSatSun
25262728010203
04050607080910
11121314151617
18192021222324
25262728293031
February 2019
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728010203
January 2019
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293031010203
December 2018
MonTueWedThuFriSatSun
26272829300102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
November 2018
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829300102
October 2018
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
September 2018
MonTueWedThuFriSatSun
27282930310102
03040506070809
10111213141516
17181920212223
24252627282930
August 2018
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930310102
July 2018
MonTueWedThuFriSatSun
25262728293001
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
June 2018
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
May 2018
MonTueWedThuFriSatSun
30010203040506
07080910111213
14151617181920
21222324252627
28293031010203
April 2018
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30010203040506
February 2018
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272801020304
January 2018
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
December 2017
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
November 2017
MonTueWedThuFriSatSun
30310102030405
06070809101112
13141516171819
20212223242526
27282930010203
September 2017
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
August 2017
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293031010203
March 2017
MonTueWedThuFriSatSun
27280102030405
06070809101112
13141516171819
20212223242526
27282930310102
January 2017
MonTueWedThuFriSatSun
26272829303101
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
November 2016
MonTueWedThuFriSatSun
31010203040506
07080910111213
14151617181920
21222324252627
28293001020304
October 2016
MonTueWedThuFriSatSun
26272829300102
03040506070809
10111213141516
17181920212223
24252627282930
31010203040506
September 2016
MonTueWedThuFriSatSun
29303101020304
05060708091011
12131415161718
19202122232425
26272829300102
August 2016
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
July 2016
MonTueWedThuFriSatSun
27282930010203
04050607080910
11121314151617
18192021222324
25262728293031
May 2016
MonTueWedThuFriSatSun
25262728293001
02030405060708
09101112131415
16171819202122
23242526272829
30310102030405
April 2016
MonTueWedThuFriSatSun
28293031010203
04050607080910
11121314151617
18192021222324
25262728293001
December 2014
MonTueWedThuFriSatSun
01020304050607
08091011121314
15161718192021
22232425262728
29303101020304
October 2014
MonTueWedThuFriSatSun
29300102030405
06070809101112
13141516171819
20212223242526
27282930310102