# load the necessary libraries
library(faraway)
library(tidyverse)
library(GGally)
library(ggplot2)
library(dplyr)
library(car)
library(patchwork)
library(gtsummary)
library(MASS)
library(knitr)
1 The swiss data — use Fertility as the response to practice
An initial data analysis that explores the numerical and graphical characteristics of the data.
We load the swiss data and convert it to a tibble.
# load the swiss data
data(swiss)
swiss |>
  as_tibble()
# plot the pairplot to see the relationship between variables
ggpairs(data = swiss, columns = 2:6) +
  labs(title = "Swiss data")
# plot the violin plot + boxplot
# for each variable to see the distribution
swiss |>
  gather(key = "Variables", value = "value") |>
  ggplot(aes(x = Variables, y = value)) +
  geom_violin(aes(fill = Variables), width = 2.5) +
  geom_boxplot(width = 0.1, fill = "white", alpha = 0.7) +
  labs(title = "Swiss data", x = "", y = "Value") +
  theme_grey()
Warning: `position_dodge()` requires non-overlapping x intervals.
From the descriptive statistics, violin plots, and boxplots, we can see that Agriculture, Infant Mortality, and Fertility have nearly symmetric distributions with a slight left skew, Education and Examination are right-skewed, and Catholic has a bimodal distribution.
From the pairplot, we can see relatively strong negative correlations between Agriculture and Examination, Agriculture and Education, and Examination and Catholic, and a strong positive correlation between Examination and Education.
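As a numerical check on these impressions, the pairwise correlations can be printed directly; a minimal sketch that simply reuses the swiss data frame loaded above:

# pairwise correlations underlying the ggpairs panels
swiss |>
  cor() |>
  round(2)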
Variable selection to choose the best model.
select the best model based on AIC
# fit the full model
full_model_1 <- lm(Fertility ~ ., swiss)
# select the best model based on AIC
small_model_1 <- stepAIC(full_model_1, direction = "both", trace = TRUE)
From the stepwise selection above, without any transformation, the best model is the one with Agriculture, Education, Catholic, and Infant Mortality, which indicates that these are the variables to carry forward in the following analysis.
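As a quick check of this selection (a small sketch reusing full_model_1 and small_model_1 from the chunk above), the AIC values and the retained formula can be printed directly:

# compare the AIC of the full model and the stepAIC-selected model
AIC(full_model_1, small_model_1)
# the retained predictors
formula(small_model_1)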
An exploration of transformations to improve the fit of the model.
transformation on data
# first, use the components plus residual plot to see
# the partial effect of each variable
colnames <- colnames(swiss)[2:6]
CPR <- lapply(colnames, function(var) {
  crPlot(
    # fit linear regression model
    lm(Fertility ~ ., data = swiss),
    variable = var,
    smooth = TRUE
  )
})
# log transformation on Education
swiss_trans <- swiss |>
  mutate(log10Education = log10(Education))
The plots above indicate that a log transformation of Education might help improve the model fit. Adding polynomial terms for Catholic should also be considered.
Diagnostics to check the assumptions of your model.
fit model with log transformation on Education and polynomial terms for Catholic
# fit the model
model <- lm(Fertility ~ Agriculture + Examination + log10Education +
  poly(Catholic, 2) + Infant.Mortality, swiss_trans)
# diagnostic plot
plot(model)
From the diagnostic plots, we can see that the residuals are randomly scattered around the zero line, which suggests the mean structure of the model is adequate. The normal Q-Q plot shows that the residuals are approximately normally distributed, and the scale-location plot suggests the residuals are homoscedastic. The residuals vs leverage plot shows three points with relatively high Cook's distance, which might be influential points.
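To identify those points, the Cook's distances can be listed explicitly; a minimal sketch reusing the fitted model above (the 4/n cutoff is one common rule of thumb, not the only possible threshold):

# largest Cook's distances for the transformed model
cd <- cooks.distance(model)
sort(cd, decreasing = TRUE)[1:5]
# observations above the conventional 4/n cutoff
which(cd > 4 / nrow(swiss_trans))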
select the best model/variables based on BIC
# select the best model based on BIC
small_model_2 <- step(model, direction = "both", trace = TRUE,
  k = log(nrow(swiss)))
drop1(small_model_2, test = "F") |>
  tbl_regression() |>
  bold_labels()
| Characteristic    | p-value |
|-------------------|---------|
| &lt;none&gt;      |         |
| Agriculture       | 0.047   |
| Examination       | <0.001  |
| poly(Catholic, 2) | 0.003   |
| Infant.Mortality  | <0.001  |
From the BIC-based selection above, the best model is the one with Agriculture, Examination, Catholic (polynomial), and Infant Mortality, and log10Education is dropped. However, in the initial model without any transformation, Education was an important variable, so I will include the untransformed Education variable in a new model to see whether it is still important.
try to include the Education variable in the model
# fit the model
model_2 <- lm(Fertility ~ Agriculture + Examination + poly(Catholic, 2) +
  Infant.Mortality + Education, swiss)
# select the best model based on BIC
small_model_3 <- step(model_2, direction = "both", trace = TRUE,
  k = log(nrow(swiss)))
# check the significance of each variable
drop1(small_model_3, test = "F") |>
  tbl_regression() |>
  bold_labels()
| Characteristic    | p-value |
|-------------------|---------|
| &lt;none&gt;      |         |
| Agriculture       | 0.012   |
| poly(Catholic, 2) | <0.001  |
| Infant.Mortality  | 0.003   |
| Education         | <0.001  |
Since the BIC of the model with Education is lower than that of the model without Education, we keep the Education variable in the model. The final model includes Agriculture, Catholic (polynomial), Infant Mortality, and Education.
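The BIC comparison behind this choice can be made explicit; a small sketch reusing the two selected models from the chunks above:

# BIC of the model without Education vs. the model with Education
BIC(small_model_2, small_model_3)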
Some predictions of future observations for interesting values of the predictors.
get the final model
# fit the final model
swiss_poly <- swiss |>
  mutate(Catholic_sqr = Catholic^2)
finalm <- lm(Fertility ~ Agriculture + Education + Catholic + Catholic_sqr +
  Infant.Mortality, swiss_poly)
# summary of the final model
finalm |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic   | Beta  | 95% CI¹      | p-value |
|------------------|-------|--------------|---------|
| Agriculture      | -0.18 | -0.32, -0.04 | 0.012   |
| Education        | -0.84 | -1.2, -0.51  | <0.001  |
| Catholic         | -0.24 | -0.68, 0.20  | 0.3     |
| Catholic_sqr     | 0.00  | 0.00, 0.01   | 0.10    |
| Infant.Mortality | 1.2   | 0.41, 1.9    | 0.003   |

¹ CI = Confidence Interval
# perform the model diagnostics
plot(finalm)
# check the assumptions of the model
colnames_2 <- c("Agriculture", "Education", "Infant.Mortality",
  "Catholic", "Catholic_sqr")
CRP_2 <- lapply(colnames_2, function(var) {
  crPlot(
    # fitted linear regression model
    finalm,
    variable = var,
    smooth = TRUE
  )
})
From the diagnostic plots, we can see that the residuals are randomly scattered around the zero line, which indicates a reasonable fit. The normal Q-Q plot shows that the residuals are approximately normally distributed, and the scale-location plot shows that the residuals are almost homoscedastic. The residuals vs leverage plot shows some points with high Cook's distance, which might be influential points. The components plus residual plots show that the partial effect of each variable is acceptable.
attempt to predict using the final model
Here, we will predict the fertility rate using the final model. We will use the summary statistics of each variable to predict the fertility rate.
# retrieve the summary statistics of each variable
predict_data <- data.frame(
  Agriculture = as.vector(summary(swiss_poly$Agriculture)),
  Education = as.vector(summary(swiss_poly$Education)),
  Catholic = as.vector(summary(swiss_poly$Catholic)),
  Infant.Mortality = as.vector(summary(swiss_poly$Infant.Mortality)),
  Catholic_sqr = as.vector(summary(swiss_poly$Catholic_sqr))
)
# predict the fertility rate using the final model
pp <- predict(finalm, predict_data, se.fit = TRUE)
cbind(predict_data, Predict_fertility = pp) |>
  as_tibble() |>
  kable(format = "markdown")
| Agriculture | Education | Catholic  | Infant.Mortality | Catholic_sqr | Predict_fertility.fit | Predict_fertility.se.fit | Predict_fertility.df | Predict_fertility.residual.scale |
|-------------|-----------|-----------|------------------|--------------|-----------------------|--------------------------|----------------------|----------------------------------|
| 1.20000     | 1.00000   | 2.15000   | 10.80000         | 4.62250      | 73.69253              | 5.964476                 | 41                   | 7.012084                         |
| 35.90000    | 6.00000   | 5.19500   | 18.15000         | 26.98925     | 71.26141              | 2.011032                 | 41                   | 7.012084                         |
| 54.10000    | 8.00000   | 15.14000  | 20.00000         | 229.21960    | 66.82576              | 1.982976                 | 41                   | 7.012084                         |
| 50.65957    | 10.97872  | 41.14383  | 19.94255         | 3395.10300   | 70.14255              | 1.022817                 | 41                   | 7.012084                         |
| 67.65000    | 12.00000  | 93.12500  | 21.70000         | 8672.34125   | 75.02930              | 1.887649                 | 41                   | 7.012084                         |
| 89.70000    | 53.00000  | 100.00000 | 26.60000         | 10000.00000  | 45.41022              | 9.843236                 | 41                   | 7.012084                         |

(The rows correspond to the Min., 1st Qu., Median, Mean, 3rd Qu., and Max. of each predictor, in that order, as returned by summary().)
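The table reports fitted values with standard errors of the fit. For genuinely future observations, prediction intervals, which also account for the residual variance, may be more appropriate; a minimal sketch using the same predict_data:

# 95% prediction intervals for the same predictor combinations
predict(finalm, predict_data, interval = "prediction", level = 0.95) |>
  round(2)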
An interpretation of the meaning of the model by writing a scientific abstract.
(BACKGROUND) The fertility rate in Switzerland is influenced by various factors like agriculture, education, Catholicism, and infant mortality.
(OBJECTIVE) Using the “swiss” dataset, this study aimed to explore these associations and develop a predictive model for fertility rates.
(METHODS) After initial data analysis, transformation, and variable selection, a linear regression model was fitted and evaluated using diagnostic plots and the Bayesian Information Criterion (BIC).
(RESULTS) The final model included Agriculture, Education, Catholic (with a quadratic term), and Infant Mortality as predictors. Agriculture, Education, and Infant Mortality significantly affected fertility rates, while Catholicism had a non-linear effect. The model exhibited a good fit (adjusted R-squared = 0.61) and can predict fertility rates for new observations.
(CONCLUSIONS) This study provides insights into Swiss fertility patterns, aiding policymakers and public health experts in developing strategies to enhance fertility rates. Validation and exploration of additional predictors are recommended for future research.
2 Concavity of logistic regression log-likelihood
Write down the log-likelihood function of logistic regression for binomial responses.
The link function for logistic regression is the logit; its inverse, the logistic function, maps the linear predictor \(\eta_i\) to a probability: \[
\begin{eqnarray*}
p_i &=& \frac{e^{\eta_i}}{1 + e^{\eta_i}} \\
&=& \frac{1}{1 + e^{-\eta_i}} \\
&=& \text{logit}^{-1}(\eta_i)
\end{eqnarray*}
\] Where \(\eta_i = \mathbf{x}_i^T \boldsymbol{\beta}\) is the linear predictor, \(\mathbf{x}_i\) is the \(p \times 1\) vector of predictors for the \(i\)-th observation, and \(\boldsymbol{\beta}\) is the vector of coefficients. So, \(p_i\) is the probability of the \(i\)-th observation belonging to the positive class and can also be written as: \[
\begin{eqnarray*}
p_i = P(Y_i = 1 | \mathbf{x}_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}}
&=& \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}}
\end{eqnarray*}
\] And, the log-likelihood function for logistic regression is defined as: \[
\begin{eqnarray*}
\ell(\boldsymbol{\beta})
&=& \log \left[ \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i} \right] \\
&=& \sum_i \log \left[p_i^{y_i} (1 - p_i)^{1 - y_i}\right] \\
&=& \sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\
&=& \sum_i \left[ y_i \log \frac{e^{\eta_i}}{1 + e^{\eta_i}} + (1 - y_i) \log \frac{1}{1 + e^{\eta_i}} \right] \\
&=& \sum_i \left[ y_i \eta_i - \log (1 + e^{\eta_i}) \right] \\
&=& \sum_i \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right]
\end{eqnarray*}
\]
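This closed-form expression can be checked numerically against logLik() on any fitted binary logistic model; a small self-contained sketch on simulated data (the coefficients 0.5 and 1.2 are arbitrary illustrative values):

# numerical check of the log-likelihood formula
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
eta <- drop(model.matrix(fit) %*% coef(fit))   # eta_i = x_i^T beta
ll <- sum(y * eta - log(1 + exp(eta)))         # sum_i [ y_i eta_i - log(1 + e^eta_i) ]
c(closed_form = ll, logLik = as.numeric(logLik(fit)))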
Derive the gradient vector and Hessian matrix of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\).
The gradient vector of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\) is given by: \[
\begin{eqnarray*}
\nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta})
&=& \nabla_{\boldsymbol{\beta}} \left[ \sum_i \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right] \right] \\
&=& \sum_i \nabla_{\boldsymbol{\beta}} \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right] \\
&=& \sum_i \left[\nabla_{\boldsymbol{\beta}} \left( y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} \right) - \nabla_{\boldsymbol{\beta}} \left( \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right) \right] \\
&=& \sum_i \left[ y_i \mathbf{x}_i - \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}} \mathbf{x}_i \right] \\
&=& \sum_i \left[ y_i \mathbf{x}_i - p_i \mathbf{x}_i \right] \\
&=& \sum_i \left[ (y_i - p_i) \mathbf{x}_i \right] \\
&=& \mathbf{X}^T (\mathbf{y} - \mathbf{p})
\end{eqnarray*}
\] where \(\mathbf{X}\) is the design matrix, \(\mathbf{y}\) is the vector of observed responses, \(\mathbf{p}\) is the vector of predicted probabilities, and \(\mathbf{p} = \frac{e^{\mathbf{X} \boldsymbol{\beta}}}{1 + e^{\mathbf{X} \boldsymbol{\beta}}}\).
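At the maximum-likelihood estimate the score equations \(\mathbf{X}^T (\mathbf{y} - \mathbf{p}) = \mathbf{0}\) hold, which gives a direct numerical check of this gradient; a small self-contained sketch on the same kind of simulated example as above:

# the gradient X^T (y - p) is numerically zero at the fitted MLE
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
p_hat <- fitted(fit)             # p_i = logit^{-1}(x_i^T beta)
drop(t(X) %*% (y - p_hat))       # both entries are ~0, up to convergence tolerance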
The Hessian matrix of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\) is obtained by differentiating the gradient once more, using \(\nabla_{\boldsymbol{\beta}}\, p_i = p_i (1 - p_i)\, \mathbf{x}_i\): \[
\begin{eqnarray*}
\nabla_{\boldsymbol{\beta}} \nabla_{\boldsymbol{\beta}}^T \ell(\boldsymbol{\beta})
&=& \nabla_{\boldsymbol{\beta}}^T \left[ \sum_i (y_i - p_i) \mathbf{x}_i \right] \\
&=& - \sum_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T \\
&=& - \mathbf{X}^T \mathbf{W} \mathbf{X}
\end{eqnarray*}
\] where \(\mathbf{W}\) is the diagonal matrix with entries \(p_i (1 - p_i)\).
Show that the log-likelihood function of logistic regression is a concave function in regression coefficients \(\boldsymbol{\beta}\). (Hint: show that the negative Hessian is a positive semidefinite matrix.)
The negative Hessian is
\[
\begin{eqnarray*}
- \nabla_{\boldsymbol{\beta}} \nabla_{\boldsymbol{\beta}}^T \ell(\boldsymbol{\beta})
&=& \sum_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T \\
&=& \mathbf{X}^T \mathbf{W} \mathbf{X} \\
\end{eqnarray*}
\] where \(\mathbf{W}\) is a diagonal matrix with \(p_i (1 - p_i)\) on the diagonal. And \(\mathbf{X}\) is the design matrix.
To prove that the log-likelihood function is concave, we need to show that the negative Hessian is positive semidefinite. Let \(\mathbf{v}\) be any vector. Then \[
\begin{eqnarray*}
\mathbf{v}^T \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right) \mathbf{v}
&=& \sum_i p_i (1 - p_i)\, \mathbf{v}^T \mathbf{x}_i \mathbf{x}_i^T \mathbf{v} \\
&=& \sum_i p_i (1 - p_i) \left( \mathbf{x}_i^T \mathbf{v} \right)^2 \\
&\geq& 0
\end{eqnarray*}
\]
where the inequality follows from the fact that \(p_i (1 - p_i) \geq 0\) for all \(i\). Therefore, the negative Hessian is positive semidefinite, which implies that the log-likelihood function is concave.
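The positive semidefiniteness of \(\mathbf{X}^T \mathbf{W} \mathbf{X}\) can also be verified numerically by checking that its eigenvalues are non-negative; a small self-contained sketch on simulated data:

# eigenvalues of the negative Hessian X^T W X are all non-negative
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
p <- fitted(fit)
W <- diag(p * (1 - p))            # diagonal weights p_i (1 - p_i)
eigen(t(X) %*% W %*% X)$values    # all >= 0, so X^T W X is positive semidefinite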
3 GLM model fitting
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The data may be found in the dataset pima.
3.0.1 Q3.1
Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot?
3.0.1.1 Answer
read the data
data(pima)
Plot the distribution of insulin by test result
# separate the test results into negative and positive
pima$test_label <- ifelse(pima$test == 0, "Negative", "Positive")
# plot the distribution of insulin by test result
ggplot(pima, aes(insulin, fill = test_label)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Distribution of Insulin by Test Result")
Yes. From the plot above, I see that there are a large number of zero values for insulin. This is not possible as insulin is a hormone that is always present in the body. The zero values are likely to be missing values that have been coded as zero.
3.0.2 Q3.2
Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution.
3.0.2.1 Answer
# replace the zero values of insulin with NA
pima_insulin_with_na <- pima |>
  mutate(insulin = ifelse(insulin == 0, NA, insulin))
# plot the distribution of insulin by test result
ggplot(pima_insulin_with_na, aes(insulin, fill = test_label)) +
  geom_histogram(position = "dodge", bins = 30)
Warning: Removed 374 rows containing non-finite outside the scale range
(`stat_bin()`).
Comment:
After replacing the zero values of insulin with NA, the distribution of insulin by test result looks more reasonable. The plot shows that insulin levels tend to be higher among women who tested positive for diabetes than among those who tested negative, and the distribution for the negative group is more right-skewed.
3.0.3 Q3.3
Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame?
3.0.3.1 Answer
# replace the zero values of other variables with NA
pima_with_na <- pima |>
  mutate(
    glucose = ifelse(glucose == 0, NA, glucose),
    diastolic = ifelse(diastolic == 0, NA, diastolic),
    triceps = ifelse(triceps == 0, NA, triceps),
    insulin = ifelse(insulin == 0, NA, insulin),
    bmi = ifelse(bmi == 0, NA, bmi),
    age = ifelse(age == 0, NA, age),
    diabetes = ifelse(diabetes == 0, NA, diabetes)
  )
# fit a model1 with the result of the diabetes test as
# the response and all the other variables as predictors
model1 <- glm(test ~ glucose + diastolic + triceps + insulin + bmi +
  pregnant + age + diabetes, data = pima_with_na, family = binomial)
summary(model1)
Call:
glm(formula = test ~ glucose + diastolic + triceps + insulin +
bmi + pregnant + age + diabetes, family = binomial, data = pima_with_na)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
diastolic -1.420e-03 1.183e-02 -0.120 0.90446
triceps 1.122e-02 1.708e-02 0.657 0.51128
insulin -8.253e-04 1.306e-03 -0.632 0.52757
bmi 7.054e-02 2.734e-02 2.580 0.00989 **
pregnant 8.216e-02 5.543e-02 1.482 0.13825
age 3.395e-02 1.838e-02 1.847 0.06474 .
diabetes 1.141e+00 4.274e-01 2.669 0.00760 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 498.10 on 391 degrees of freedom
Residual deviance: 344.02 on 383 degrees of freedom
(376 observations deleted due to missingness)
AIC: 362.02
Number of Fisher Scoring iterations: 5
# the number of observations used in the model fitting
nobs(model1)
[1] 392
In model1, glucose, bmi, and diabetes are significant predictors, while age is only marginally significant (p = 0.065).
The model fitting process only used the 392 observations that did not contain missing values. 376 observations were deleted from the data frame because they contained missing values.
Reason: by default (na.action = na.omit), the glm function drops any observation that has a missing value in any of the variables used in the model.
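This can be confirmed directly: the number of rows with no missing value in any of the model variables should equal nobs(model1); a small sketch using the data frame created above:

# complete cases among the variables used in model1
model_vars <- c("test", "glucose", "diastolic", "triceps", "insulin",
                "bmi", "pregnant", "age", "diabetes")
sum(complete.cases(pima_with_na[, model_vars]))   # 392, matching nobs(model1)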
3.0.4 Q3.4
Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question.
3.0.4.1 Answer
First, fit the model without the insulin and triceps predictors. In this case, the model fitting process will use the observations that do not contain missing values in the glucose, diastolic, bmi, pregnant, age, and diabetes variables. So, it is expected that the number of observations used in the model fitting will be larger than the number of observations used in the model1 fitting process.
# fit a model2 without the insulin and triceps predictors
model2 <- glm(test ~ glucose + diastolic + bmi + pregnant + age + diabetes,
  data = pima_with_na, family = binomial)
summary(model2)
Call:
glm(formula = test ~ glucose + diastolic + bmi + pregnant + age +
diabetes, family = binomial, data = pima_with_na)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.962146 0.820892 -10.918 < 2e-16 ***
glucose 0.035194 0.003605 9.763 < 2e-16 ***
diastolic -0.008916 0.008618 -1.035 0.30084
bmi 0.090926 0.015740 5.777 7.61e-09 ***
pregnant 0.117863 0.033418 3.527 0.00042 ***
age 0.016944 0.009834 1.723 0.08489 .
diabetes 0.960515 0.306415 3.135 0.00172 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 931.94 on 723 degrees of freedom
Residual deviance: 672.86 on 717 degrees of freedom
(44 observations deleted due to missingness)
AIC: 686.86
Number of Fisher Scoring iterations: 5
# the number of observations used in the model fitting
nobs(model2)
[1] 724
This means that the model fitting process used the 724 observations that did not contain missing values. 44 observations were deleted from the data frame because they contained missing values.
Comparison of the two models:
The number of observations used in the model1 fitting process is 392, while the number used for model2 is 724, so the two fits cannot be compared directly. To compare the two models, both must be fitted to the same set of observations. Therefore, I will use the drop_na function to remove the missing values from the data frame before fitting the models.
And then, I will use the deviance statistic to compare the two models.
# delete the missing values from the data frame
pima_complete <- pima_with_na |>
  drop_na()
# fit the two models
model_a <- glm(test ~ glucose + diastolic + triceps + insulin + bmi +
  pregnant + age + diabetes, data = pima_complete, family = binomial)
model_b <- glm(test ~ glucose + diastolic + bmi + pregnant + age + diabetes,
  data = pima_complete, family = binomial)
# the number of observations used in the model fitting
nobs(model_a)
[1] 392
nobs(model_b)
[1] 392
In this case, the number of observations used in the model fitting process is 392 for both models. Now, I will use the deviance statistic to compare the two models.
# compare the two models
anova(model_a, model_b, test = "Chisq")
Analysis of Deviance Table
Model 1: test ~ glucose + diastolic + triceps + insulin + bmi + pregnant +
age + diabetes
Model 2: test ~ glucose + diastolic + bmi + pregnant + age + diabetes
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 383 344.02
2 385 344.88 -2 -0.85931 0.6507
Since the p-value of the test is 0.6507, we fail to reject the null hypothesis that the coefficients of insulin and triceps are zero. The model without insulin and triceps does not fit significantly worse than the model with them, so these two predictors do not contribute significantly to the model.
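The p-value reported by anova() can be reproduced by hand from the change in deviance and in residual degrees of freedom; a small sketch reusing model_a and model_b:

# likelihood-ratio (deviance) test computed by hand
dev_diff <- deviance(model_b) - deviance(model_a)        # about 0.86
df_diff <- df.residual(model_b) - df.residual(model_a)   # 2
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)       # about 0.65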
3.0.5 Q3.5
Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model?
3.0.5.1 Answer
I need to use the data without missing values and fit a model. So, I will continue to use model_a from the previous question to select the model using AIC.
# select a model using AIC
model_aic <- stepAIC(model_a, direction = "both", trace = TRUE)
# the number of observations used in the selected model
nobs(model_aic)
[1] 392
# print the summary of the selected model
model_aic |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| glucose        | 0.04     | 0.03, 0.05  | <0.001  |
| bmi            | 0.08     | 0.04, 0.12  | <0.001  |
| pregnant       | 0.08     | -0.02, 0.19 | 0.13    |
| age            | 0.03     | 0.00, 0.07  | 0.054   |
| diabetes       | 1.2      | 0.34, 2.0   | 0.007   |

¹ OR = Odds Ratio, CI = Confidence Interval
The selected model includes the predictors glucose, diabetes, bmi, pregnant and age. The number of observations used in the selected model is 392.
3.0.6 Q3.6
Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this.
3.0.6.1 Answer
# identify cases with missing values
pima <- pima |>
  mutate(
    glucose2 = ifelse(glucose == 0, NA, glucose),
    diastolic2 = ifelse(diastolic == 0, NA, diastolic),
    triceps2 = ifelse(triceps == 0, NA, triceps),
    insulin2 = ifelse(insulin == 0, NA, insulin),
    bmi2 = ifelse(bmi == 0, NA, bmi),
    age2 = ifelse(age == 0, NA, age),
    diabetes2 = ifelse(diabetes == 0, NA, diabetes)
  )
# create a variable that indicates
# whether the case contains a missing value
pima$missingNA <- ifelse(
  apply(is.na(dplyr::select(pima, contains("2"))), 1, sum) > 0, 1, 0)
# fit a model with the missingness variable as a predictor
missing.glm <- glm(test ~ missingNA, data = pima, family = binomial)
# summary of the model
missing.glm |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| missingNA      | 0.16     | -0.14, 0.45 | 0.3     |

¹ OR = Odds Ratio, CI = Confidence Interval
Missingness is not associated with the test result (p = 0.3). This suggests that removing cases with missing values does not distort the distribution of the outcome, which justifies a "complete case" analysis.
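A simple cross-tabulation of the missingness indicator against the test result tells the same story as the logistic fit; a small sketch using the missingNA variable created above:

# proportion testing positive among complete (0) and incomplete (1) cases
xtabs(~ missingNA + test, data = pima) |>
  prop.table(margin = 1)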
In the following code, I refit the selected model using as much of the data as is reasonable. In the previous question, the selected model related test to glucose, diabetes, bmi, pregnant, and age, and we have just seen that missingness is not associated with the outcome. Therefore, I drop only the cases with missing values in the variables that appear in the selected model and refit it.
# drop the missing values
# just according to the variables in the selected model
pima_complete_2 <- pima |>
  drop_na(glucose2, bmi2, age2, diabetes2)
# refit the selected model using the new complete cases
model_4 <- glm(test ~ glucose2 + bmi2 + age2 + diabetes2 + pregnant,
  data = pima_complete_2, family = binomial)
# summary of the model
model_4 |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| glucose2       | 0.04     | 0.03, 0.04  | <0.001  |
| bmi2           | 0.09     | 0.06, 0.12  | <0.001  |
| age2           | 0.01     | -0.01, 0.03 | 0.2     |
| diabetes2      | 0.92     | 0.34, 1.5   | 0.002   |
| pregnant       | 0.12     | 0.05, 0.18  | <0.001  |

¹ OR = Odds Ratio, CI = Confidence Interval
summary(model_4)
Call:
glm(formula = test ~ glucose2 + bmi2 + age2 + diabetes2 + pregnant,
family = binomial, data = pima_complete_2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.322789 0.737279 -12.645 < 2e-16 ***
glucose2 0.035941 0.003555 10.110 < 2e-16 ***
bmi2 0.087529 0.014722 5.945 2.76e-09 ***
age2 0.011366 0.009315 1.220 0.222405
diabetes2 0.920583 0.300832 3.060 0.002212 **
pregnant 0.115058 0.032341 3.558 0.000374 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 974.75 on 751 degrees of freedom
Residual deviance: 703.24 on 746 degrees of freedom
AIC: 715.24
Number of Fisher Scoring iterations: 5
# the number of observations used in the model fitting
nobs(model_4)
[1] 752
The selected model includes the predictors glucose, diabetes, bmi, pregnant and age. The number of observations used in the selected model is 752.
Reason: it is appropriate to use a "complete case" analysis here because missingness is not associated with the outcome, so removing cases with missing values does not distort the distribution of the outcome and the model fitted on the complete cases remains valid.
3.0.7 Q3.7
Using the last fitted model of the previous question, what is the difference in the log odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Give a confidence interval for this difference.
3.0.7.1 Answer
# calculate the first and third quartiles
# of BMI for women who tested positive
bmi_q1 <- quantile(pima_complete_2$bmi2[pima_complete_2$test == "1"], 0.25)
bmi_q3 <- quantile(pima_complete_2$bmi2[pima_complete_2$test == "1"], 0.75)
# difference in the log odds (log odds ratio)
diff_of_odds <- coef(model_4)["bmi2"] * (bmi_q3 - bmi_q1)
# odds ratio
exp(diff_of_odds)
bmi2
1.957731
# calculate the 95% confidence interval for the difference
CI <- confint(model_4)["bmi2", ] * c(bmi_q3 - bmi_q1)
# confidence interval for odds ratio
exp(CI)
2.5 % 97.5 %
1.575269 2.454763
cbind(diff_of_odds, CI[1], CI[2]) |>
  as_tibble() |>
  kable(col.names = c("Log Difference in Odds (Log Odds Ratio)",
                      "95% CI Lower", "95% CI Upper"))
| Log Difference in Odds (Log Odds Ratio) | 95% CI Lower | 95% CI Upper |
|-----------------------------------------|--------------|--------------|
| 0.6717861                               | 0.4544263    | 0.8980302    |
Therefore, the difference in the log odds of testing positive for diabetes for a woman with a BMI at the third quartile compared with one at the first quartile is 0.672 (95% CI: 0.454 to 0.898). On the odds scale, this corresponds to an odds ratio of 1.958 (95% CI: 1.575 to 2.455).
3.0.8 Q3.8
Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.
3.0.8.1 Answer
Do women who test positive have higher diastolic blood pressures?
# plot the distribution of diastolic blood pressure
# based on the test result
pima_complete_2 |>
  ggplot(aes(x = test_label, y = diastolic2, fill = test_label)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Diastolic Blood Pressure by Test Result",
    x = "Test Result",
    y = "Diastolic Blood Pressure"
  ) +
  theme_minimal()
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_boxplot()`).
# fit a model with only diastolic blood pressure as a predictor
positive_diastolic.glm <- glm(test ~ diastolic2, data = pima_complete_2,
  family = binomial)
# summary of the model
positive_diastolic.glm |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹    | p-value |
|----------------|----------|------------|---------|
| diastolic2     | 0.03     | 0.02, 0.04 | <0.001  |

¹ OR = Odds Ratio, CI = Confidence Interval
From the boxplot and the single-predictor model, diastolic blood pressure is positively and significantly associated with the test result (p < 0.001). So yes, women who test positive tend to have higher diastolic blood pressures.
Is the diastolic blood pressure significant in the regression model?
# fit a model with all predictors
model_5 <- glm(test ~ glucose2 + diastolic2 + triceps2 + insulin2 + bmi2 +
  pregnant + age2 + diabetes2, data = pima_complete_2, family = binomial)
# summary of the model
model_5 |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| glucose2       | 0.04     | 0.03, 0.05  | <0.001  |
| diastolic2     | 0.00     | -0.02, 0.02 | >0.9    |
| triceps2       | 0.01     | -0.02, 0.04 | 0.5     |
| insulin2       | 0.00     | 0.00, 0.00  | 0.5     |
| bmi2           | 0.07     | 0.02, 0.13  | 0.010   |
| pregnant       | 0.08     | -0.03, 0.19 | 0.14    |
| age2           | 0.03     | 0.00, 0.07  | 0.065   |
| diabetes2      | 1.1      | 0.32, 2.0   | 0.008   |

¹ OR = Odds Ratio, CI = Confidence Interval
From the model, we can see that the diastolic blood pressure is not significant in the regression model. The p-value is greater than 0.05. This means that the diastolic blood pressure is not associated with the test result when other predictors are included in the model.
Explain the distinction:
The two questions are different: the first asks whether diastolic blood pressure is marginally associated with the test result, that is, when it is the only predictor in the model; the second asks whether it is associated with the test result after adjusting for the other predictors. The answers only appear contradictory because the significance of a predictor depends on what else is in the model. On its own, diastolic blood pressure is significant, but once predictors correlated with it (such as BMI and age) are included, they explain the variation in the test result that diastolic blood pressure previously appeared to explain, so its adjusted effect is no longer distinguishable from zero.
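One way to see the mechanism is to check that diastolic blood pressure is correlated with predictors that carry more of the signal, such as BMI and age; a small sketch using pairwise complete observations, since diastolic2 still contains missing values in pima_complete_2:

# correlation of diastolic blood pressure with stronger predictors
cor(pima_complete_2[, c("diastolic2", "bmi2", "age2", "glucose2")],
    use = "pairwise.complete.obs") |>
  round(2)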