Understanding the Pigeonhole Principle with Python

Imagine you’re in a interview for a data science position, and the interviewer throws you a curveball — not about algorithms or coding, but about counting. Yes, you heard that right. So, the question goes like this:

“In a city with 1 million people, is it possible for every person to have a unique number of hairs on their head? Explain why or why not using your analytical skills.”

This might seem like a quirky, even trivial question at first, but it’s a clever application of the Pigeonhole principle. Firstly, let’s consider some facts:

  • The average human head has between 100,000 to 150,000 hair threads.
  • Some people might have fewer hairs due to balding or hair loss, and some might have more, but realistically, the number doesn’t vary wildly from the average for most of the population.

So, there are 1 million people in the city, and the range of possible hair counts a person can have is roughly between 100,000 and 150,00. Even if we assume the extreme case where the number of possible hair counts ranges from 0 (completely bald) to 200,000 (extremely hairy), we have about 200,001 possible hair counts.

Given that we have 1 million people but only around 200,001 possible different hair counts. So, it’s mathematically impossible for each person in the city to have a unique number of hairs on their head — there simply aren’t enough different possible hair counts to go around. So, you have your answer! And this is what Pigeonhole principle is.

Formal Definition

Formally, the Pigeonhole Principle is expressed as follows:

If you have more items than containers to put them into, and each item must go into a container, then at least one container must hold more than one item. This principle is also referred to as Dirichlet’s box principle.

Let’s see an application of this principle.

Birthday Paradox

A common example of the Pigeonhole principle‘s application is the Birthday Paradox. It asks how many people need to be in a room for there to be a 50% chance that at least two of them share the same birthday. Surprisingly, the number is quite low — only 23 people.

Applying the Pigeonhole Principle

Albeit, 50% probability seems surprisingly high, given the number of people and the 365 (number of days in an year) possible birthdays. Birthday Paradox isn’t a true paradox, but rather a counterintuitive statistical result. Let’s understand, why is this so?

To compute the probability that no two people in a group of n individuals share the same birthday, we can use the principle of combinatorics:

  • Person 1 could have birthday on any of the 365 days.
  • Person 2 could have birthday on any of the remaining 364 days. (for unique birthdays)
  • Person 3 could have birthday on any of the remaining 363 days. (for unique birthdays)

and so on..

So, the probability that all n people have different birthdays can be calculated as:

P(all different) = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times ... \times \frac{365 - n + 1}{365}

When n = 23,

P(all different) = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times ... \times \frac{343}{365}

This gives the probability as 0.492 for all 23 people in a room to have unique birthdays. This means that there is a 49.2% chance that no one shares a birthday in a group of 23, or conversely, a 50.8% probability that at least two people share the same birthday.

Python Simulation

import random


def has_duplicate_birthdays(num_people, num_simulations):
    random.seed(42)  # For reproducibility
    num_matches = 0

    # Generating random birthdays (from 365 days) for each simulation
    for _ in range(num_simulations):
        # using a 'set' to check birthdays for uniqueness
        birthdays = [random.randint(1, 365) for _ in range(num_people)]
        if len(birthdays) != len(set(birthdays)):  # Check for duplicates
            num_matches += 1

    return num_matches / num_simulations


# Testing the function
num_people = 23
num_simulations = 10000
probability = has_duplicate_birthdays(num_people, num_simulations)
print(f"Probability of sharing a birthday in a group of {num_people} "
      f"people: {probability:.2f}")

Output

Probability of sharing a birthday in a group of 23 people: 0.51
The Birthday Paradox and the Pigeonhole Principle both highlight unexpected probabilities in scenarios involving groups and shared characteristics. The Pigeonhole Principle states that if you have more items than containers to put them in, at least one container must hold more than one item. Applied to birthdays paradox: if you have 366 people (considering a non-leap year with 365 days), you must have at least one shared birthday due to there being more people than days in the year.

The Birthday Paradox extends this idea by showing that even with only 23 people, there is about a 50% chance that two of them share the same birthday. This happens because each additional person increases the number of potential pairs for shared birthdays, thereby increasing the likelihood of a match.

Similarly, the Pigeonhole principle can be used to guarantee the presence of duplicates or repetitions in large datasets. For instance, if you are analyzing user data and there are more user actions than possible action types, the Pigeonhole principle ensures that at least one action type must have been performed by more than one user.

The Pigeonhole principle is a useful tool that can be applied in various scenarios, from mathematical proofs to everyday problem-solving in programming and data science. By understanding and applying this principle, one can design more efficient algorithms, perform data integrity checks, and make statistical predictions.

Leave a Reply

Your email address will not be published. Required fields are marked *