Bayesian Estimation of Disease Prevalence with Metropolis-Hastings

Author

Jiachen Ai; UID: 206182615

Problem: Bayesian Estimation of Disease Prevalence Rate

Imagine you are analyzing data on a rare disease to estimate its prevalence in a population. You surveyed a random sample of people and found that out of \(n = 100\) individuals, \(y = 7\) tested positive for the disease. Assume the test is perfectly accurate.

Let \(p\) represent the prevalence rate of the disease in the population, which is unknown. We aim to estimate \(p\) using a Bayesian approach with a prior belief about \(p\) and data from the survey.

1. Define the Prior Distribution

Assume a prior distribution for \(p\) given by a Beta distribution: \(p \sim ext{Beta}(2, 5)\) This reflects prior knowledge that the disease is rare but not exceedingly so.

2. Define the Likelihood Function

Given \(p\), the likelihood of observing \(y\) positive cases out of \(n\) can be modeled using the Binomial distribution: \(L(y | p) \propto p^y (1 - p)^{n - y}\)

3. Define the Posterior Distribution

The posterior distribution for \(p\) is proportional to the product of the prior and the likelihood: \({posterior}(p | y) \propto p^{y+1} (1 - p)^{n - y + 4}\) Since this distribution is difficult to sample directly, we will use the Metropolis-Hastings algorithm.

4. Implement the Metropolis-Hastings Algorithm

To sample from the posterior, use a Gaussian proposal distribution with mean equal to the current value of \(p\) and standard deviation \(\sigma = 0.05\), truncated to the interval \([0, 1]\) since \(p\) represents a probability.

Tasks

  1. Set up the parameters: Choose values for \(n\), \(y\), and \(\sigma\) as described above. Initialize the MCMC chain with an initial value for \(p\). (say \(p_0=0.05\))
  2. Write a function for the unnormalized posterior, based on the prior and likelihood.
  3. Implement the Metropolis-Hastings algorithm:
    • At each iteration, propose a new value for \(p\) using the Gaussian proposal distribution.
    • Calculate the acceptance probability.
    • Accept or reject the proposal based on the acceptance probability.
  4. Run the algorithm for a sufficient number of iterations (e.g., 10,000) to obtain a sample of \(p\) values.

Problem Setup and Understanding

1. Bayesian Framework:

  • Goal: Estimate the disease prevalence \(p\) (a probability) in the population.
  • Prior: \(p \sim \text{Beta}(2, 5)\) (reflecting prior belief about the prevalence). This reflects the belief that the disease is rare.
  • Likelihood: Given \(p\), the likelihood of observing \(y = 7\) positive cases out of \(n = 100\) can be modeled using the Binomial distribution: \(L(y | p) \propto p^y (1 - p)^{n - y}\). This represents the probability of observing the data given the prevalence rate.
  • Posterior: The posterior distribution for \(p\) is proportional to the product of the prior and the likelihood: \(p(p | y) \propto p^{y+1} (1 - p)^{n - y + 4}\). This is the distribution we want to sample from to estimate \(p\). Since this distribution is difficult to sample directly, we will use the Metropolis-Hastings algorithm.

2. Metropolis-Hastings Algorithm:

  • Proposal Distribution: Gaussian proposal distribution with mean equal to the current value of \(p_{t}\) (current value) and standard deviation \(\sigma = 0.05\), truncated to the interval \([0, 1]\) since \(p\) represents a probability.
  • Process:
    • Start with an initial value for \(p\) (e.g., \(p_0 = 0.05\)).
    • Propose new values \(p^*\) using the Gaussian proposal distribution.
    • Decide whether to accept the proposed value \(p^*\) based on its posterior probability.

Steps to Implement the Algorithm

Step 1: Parameters and Initialization

  • \(n = 100\), \(y = 7\), \(\sigma = 0.05\)
  • Initialize the MCMC chain with an initial value for \(p_0\) (e.g., \(p_0 = 0.05\)).
  • Total number of iterations: \(T = 10,000\)

Step 2: Posterior Function

The unnormalized posterior is: \[ \text{posterior}(p | y) \propto p^{y+1} (1 - p)^{n - y + 4} \] This function will be used to compute the acceptance ratio.

Step 3: Metropolis-Hastings Algorithm

  1. Initialize \(p_t = p_0\).
  2. For each iteration:
    • Propose a new value \(p^* \sim \text{Normal}(p_t, \sigma)\), truncated to \([0, 1]\).
    • Calculate the acceptance probability: \(\alpha = \min\left(1, \frac{\text{posterior}(p^* | y)}{\text{posterior}(p_t | y)}\right)\).
    • Accept the proposal with probability \(\alpha\); otherwise, keep \(p_t\).
    • Update \(p_t\) based on the acceptance decision.

Step 4: Sampling and Burn-in

  • Discard the first \(K\) samples as “burn-in” to allow the chain to reach its stationary distribution.
  • Use the remaining samples to estimate properties of the posterior.

Implementation in R

Set the parameters

# Parameters
n <- 100        # Total sample size
y <- 7          # Number of positive cases
sigma <- 0.05   # Proposal standard deviation
T <- 10000      # Total iterations
burn_in <- 2000 # Burn-in period
p0 <- 0.05      # Initial value

Define the unnormalized posterior function

# Define the unnormalized posterior function
posterior <- function(p, y, n) {
  if (p < 0 || p > 1) {
    return(0) # p must be within [0, 1]
  }
  
  # Prior: Beta(2, 5)
  prior <- dbeta(p, 2, 5)
  
  # Likelihood: Binomial
  likelihood <- p^y * (1 - p)^(n - y)
  
  # Unnormalized posterior
  return(prior * likelihood)
}

Implement the Metropolis-Hastings algorithm

# Metropolis-Hastings algorithm
metropolis_hastings <- function(y, n, p0, sigma, T) {
  samples <- numeric(T)
  current_p <- p0
  
  for (t in 1:T) {
    # Propose new value from Gaussian distribution
    proposed_p <- rnorm(1, mean = current_p, sd = sigma)
    
    # Truncate to [0, 1]
    proposed_p <- max(0, min(1, proposed_p))
    
    # Compute acceptance ratio
    acceptance_ratio <- posterior(proposed_p, y, n) / posterior(current_p, y, n)
    alpha <- min(1, acceptance_ratio)
    
    # Accept or reject the proposal
    if (runif(1) < alpha) {
      current_p <- proposed_p
    }
    
    samples[t] <- current_p
  }
  
  return(samples)
}

Run the algorithm and summarize the results

# Run the algorithm
samples <- metropolis_hastings(y, n, p0, sigma, T)

# Discard burn-in samples
posterior_samples <- samples[(burn_in + 1):T]

# Plot the results
hist(posterior_samples, breaks = 50, probability = TRUE, col = "lightblue",
     main = "Posterior Distribution of Disease Prevalence", 
     xlab = "Prevalence (p)", ylab = "Density")

# Overlay the true Beta posterior
curve(dbeta(x, y + 2, n - y + 5), col = "red", lwd = 2, add = TRUE)
legend("topright", legend = c("Posterior Samples", "True Posterior (Beta)"), 
       col = c("lightblue", "red"), lty = 1, lwd = 2)

# Summarize the posterior
mean_p <- mean(posterior_samples)
cat(sprintf("Estimated Prevalence (mean): %.3f\n", mean_p))
Estimated Prevalence (mean): 0.085
credible_interval <- quantile(posterior_samples, c(0.025, 0.975))
cat(sprintf("95%% Credible Interval: [%.3f, %.3f]\n", credible_interval[1], credible_interval[2]))
95% Credible Interval: [0.041, 0.146]

Discussion

library(ggplot2)
library(dplyr)
library(tidyr)

# Simulate different scenarios
simulate_mh <- function(y, n, p0, sigma, T, burn_in, prior_alpha, prior_beta) {
  
  posterior <- function(p, y, n, alpha, beta) {
    if (p <= 0 || p >= 1) return(0)
    prior <- dbeta(p, alpha, beta)
    likelihood <- p^y * (1 - p)^(n - y)
    return(prior * likelihood)
  }
  
  metropolis_hastings <- function(y, n, p0, sigma, T, alpha, beta) {
    samples <- numeric(T)
    current_p <- p0
    
    for (t in 1:T) {
      proposed_p <- rnorm(1, mean = current_p, sd = sigma)
      proposed_p <- max(0, min(1, proposed_p))
      acceptance_ratio <- posterior(proposed_p, y, n, alpha, beta) / 
                          posterior(current_p, y, n, alpha, beta)
      alpha_accept <- min(1, acceptance_ratio)
      if (runif(1) < alpha_accept) {
        current_p <- proposed_p
      }
      samples[t] <- current_p
    }
    return(samples)
  }
  
  samples <- metropolis_hastings(y, n, p0, sigma, T, prior_alpha, prior_beta)
  posterior_samples <- samples[(burn_in + 1):T]
  return(posterior_samples)
}
  1. Prior Influence: What effect does the choice of prior (Beta(2, 5)) have on your estimate of \(p\)?

The choice of the prior reflects our beliefs about \(p\) before observing the data.

  • The \(\text{Beta}(2, 5)\) prior puts more weight on smaller values of \(p\), reflecting the belief that the disease is rare but not extremely so.
  • The posterior combines the prior with the likelihood. If the data is sparse or uninformative, the prior will have a stronger influence. For example:
    • A \(\text{Beta}(2, 5)\) prior with \(y = 7\) positive cases out of \(n = 100\) will result in a posterior that is more concentrated around lower values of \(p\).
    • A flat prior (e.g., \(\text{Beta}(1, 1)\)) (Uniform) would allow the data to dominate more.
# Simulate different scenarios
params <- expand.grid(prior_alpha = c(1, 2, 8, 10), 
                      prior_beta = c(1, 5, 7, 10))

results <- params %>%
  rowwise() %>%
  mutate(samples = list(simulate_mh(7, 100, 0.05, 0.05, 10000, 2000, prior_alpha, prior_beta)))

# Unnest results for plotting
results_long <- results %>%
  ungroup() %>%  # Ensure no grouping structure remains
  mutate(id = row_number())  %>%  # Add an ID column for facetting
  unnest(cols = c(samples)) %>% # Use `unnest()` to handle list-columns in modern tidyr
  rename(prevalence = samples)  # Rename for clarity

# Plot with ggplot2
ggplot(results_long, aes(x = prevalence)) +
  geom_density(alpha = 0.6, fill = "skyblue") +
  facet_grid(prior_alpha ~ prior_beta, scales = "fixed") +
  labs(x = "Prevalence (p)", 
       y = "Density", 
       title = "Comparison of Posterior Distributions",
       subtitle = "Faceted by Prior Parameters (Top: Beta, Right: Alpha)") +
  theme_minimal() +
  theme(legend.position = "none")

  1. Proposal Standard Deviation: How does changing the proposal standard deviation \(\sigma\) affect the acceptance rate and efficiency of the algorithm?

The proposal standard deviation (σσ) controls the step size of the proposed values:

  • If \(\sigma\) is too small:
    • The algorithm takes very small steps.
    • The acceptance rate may be high, but the algorithm may take a long time to explore the posterior distribution.
  • If \(\sigma\) is too large:
    • The algorithm takes large steps.
    • Proposed values are often far from the current value.
    • Low acceptance rate, as many proposals fall into regions with low posterior probability.
# Simulate different scenarios
params <- expand.grid(sigma = c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8))

results <- params %>%
  rowwise() %>%
  mutate(samples = list(simulate_mh(7, 100, 0.05, sigma, 10000, 2000, 2, 5)))

# Unnest results for plotting
results_long <- results %>%
  ungroup() %>%  # Ensure no grouping structure remains
  mutate(id = row_number())  %>%  # Add an ID column for facetting
  unnest(cols = c(samples)) %>% # Use `unnest()` to handle list-columns in modern tidyr
  rename(prevalence = samples)  # Rename for clarity

# Plot with ggplot2
ggplot(results_long, aes(x = prevalence)) +
  geom_density(alpha = 0.6, fill = "skyblue") +
  facet_wrap(~ sigma, scales = "fixed") +
  labs(x = "Prevalence (p)", 
       y = "Density", 
       title = "Comparison of Posterior Distributions",
       subtitle = "Faceted by Proposal Standard Deviation (sigma)") +
  theme_minimal() +
  theme(legend.position = "none")

It seems that with the increase in \(\sigma\), the algorithm explores the posterior distribution worse.

Next we want to calculate the approximation performance, acceptance rate and efficiency of the algorithm for different values of \(\sigma\). We need to modify the metropolis_hastings() function to return the acceptance rate and the effective sample size (ESS) of the posterior samples.

posterior <- function(p, y, n, alpha, beta) {
  # Ensure p is in (0, 1)
  if (p <= 0 || p >= 1) return(0)
  likelihood <- dbinom(y, size = n, prob = p, log = FALSE)
  prior <- dbeta(p, alpha, beta, log = FALSE)
  return(likelihood * prior)
}
metropolis_hastings_new <- function(y, n, p0, sigma, T, alpha, beta) {
  samples <- numeric(T)
  current_p <- p0
  accepted <- 0 # Counter for accepted proposals

  for (t in 1:T) {
    proposed_p <- rnorm(1, mean = current_p, sd = sigma)
    proposed_p <- max(0, min(1, proposed_p))
    acceptance_ratio <- posterior(proposed_p, y, n, alpha, beta) / 
                        posterior(current_p, y, n, alpha, beta)
    alpha_accept <- min(1, acceptance_ratio)
    if (runif(1) < alpha_accept) {
      current_p <- proposed_p
      accepted <- accepted + 1 # Count accepted proposal
    }
    samples[t] <- current_p
  }

  return(list(samples = samples, acceptance_rate = accepted / T))
}

Acceptance Rate

Code
results <- params %>%
  rowwise() %>%
  mutate(simulation = list(metropolis_hastings_new(7, 100, 0.05, sigma, 10000, 2, 5)),
         samples = list(simulation$samples),
         acceptance_rate = simulation$acceptance_rate) %>%
  suppressMessages() 

results %>%
  select(sigma, acceptance_rate) %>%
  unnest(cols = c(acceptance_rate)) %>%
  ggplot(aes(x = sigma, y = acceptance_rate)) +
  geom_point() +
  geom_smooth(method = "auto") +
  labs(x = "Proposal Standard Deviation (sigma)", 
       y = "Acceptance Rate", 
       title = "Acceptance Rate vs. Proposal Standard Deviation") %>%
  suppressMessages()
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

It seems that the acceptance rate exponentially decreases as the proposal standard deviation increases. This is expected, as larger step sizes lead to more proposals being rejected due to lower posterior probabilities.

Posterior Approximation Efficiency

Code
# Compute posterior approximation metrics
results <- results %>%
  mutate(
    mean_sampled = mean(unlist(samples)),  # Mean of sampled values
    sd_sampled = sd(unlist(samples)),  # Standard deviation of sampled values
    credible_interval = list(quantile(unlist(samples), c(0.025, 0.975))),  # 95% Credible Interval
    true_mean = (2 + 7) / (2 + 5 + 100),  # True mean of Beta posterior
    true_sd = sqrt((2 + 7) * (5 + 100 - 7) / ((2 + 5 + 100)^2 * (2 + 5 + 100 + 1)))  # True SD of Beta posterior
  )

# Prepare data for overlaying true Beta posterior
true_posterior <- data.frame(
  x = seq(0, 1, length.out = 500),  # Generate x values (p) in [0, 1]
  density = dbeta(seq(0, 1, length.out = 500), shape1 = 7 + 2, shape2 = 100 - 7 + 5)  # Beta density
)

# Plot: Compare posterior samples with the true Beta posterior
ggplot() +
  # Plot the density of posterior samples
  geom_density(data = results_long, aes(x = prevalence), alpha = 0.6, fill = "skyblue") +
  # Overlay the true Beta posterior
  geom_line(data = true_posterior, aes(x = x, y = density), color = "red", linewidth = 0.5) +
  facet_wrap(~ sigma, scales = "fixed") +
  labs(
    x = "Prevalence (p)", 
    y = "Density", 
    title = "Comparison of Posterior Distributions",
    subtitle = "Faceted by Proposal Standard Deviation (sigma); Red line = True Beta Posterior"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

It seems that with the increase in \(\sigma\), the algorithm explores the posterior distribution worse.

Efficiency (Effective Sample Size)

Code
library(coda)

results <- results %>%
  mutate(effective_sample_size = effectiveSize(unlist(samples)))

results %>%
  select(sigma, effective_sample_size) %>%
  unnest(cols = c(effective_sample_size)) %>%
  ggplot(aes(x = sigma, y = effective_sample_size)) +
  geom_bar(stat = "identity") +
  labs(x = "Proposal Standard Deviation (sigma)", 
       y = "Effective Sample Size", 
       title = "Effective Sample Size vs. Proposal Standard Deviation")

It’s interesting to note that the effective sample size is highest when the proposal standard deviation is set to 0.05. This is because the acceptance rate is highest when the proposal standard deviation is set to 0.05, which leads to a more efficient exploration of the parameter space.

  1. Different Data Outcome: How would the posterior change if you had observed \(y = 20\) positive cases out of \(n = 100\)?

If more positive cases are observed \((y = 20)\), the likelihood function shifts toward higher values of \(p\), leading to:

  • A posterior distribution that is more concentrated around higher values of \(p\).
  • A posterior distribution that is less influenced by the prior, especially if the prior is relatively flat or uninformative.
# Simulate different scenarios
params <- expand.grid(y = c(7, 20, 50))

results <- params %>%
  rowwise() %>%
  mutate(samples = list(simulate_mh(y, 100, 0.05, 0.05, 10000, 2000, 2, 5)))

# Unnest results for plotting
results_long <- results %>%
  ungroup() %>%  # Ensure no grouping structure remains
  mutate(id = row_number())  %>%  # Add an ID column for facetting
  unnest(cols = c(samples)) %>% # Use `unnest()` to handle list-columns in modern tidyr
  rename(prevalence = samples)  # Rename for clarity

# Plot with ggplot2
ggplot(results_long, aes(x = prevalence)) +
  geom_density(alpha = 0.6, fill = "skyblue") +
  facet_wrap(~ y, scales = "fixed") +
  labs(x = "Prevalence (p)", 
       y = "Density", 
       title = "Comparison of Posterior Distributions",
       subtitle = "Faceted by Data Outcome (y)") +
  theme_minimal() +
  theme(legend.position = "none")