# load the necessary libraries
library(faraway)
library(tidyverse)
library(GGally)
library(ggplot2)
library(dplyr)
library(car)
library(patchwork)
library(gtsummary)
library(MASS)
library(knitr)
1 The swiss data — use Fertility as the response to practice
An initial data analysis that explores the numerical and graphical characteristics of the data.
We load the swiss data and convert it to a tibble.
# load the swiss data
data(swiss)
swiss |>
  as_tibble()
# plot the pairplot to see the relationship between variables
ggpairs(data = swiss, columns = 2:6) +
  labs(title = "Swiss data")
# plot the violin plot + boxplot
# for each variable to see the distribution
swiss |>
  gather(key = "Variables", value = "value") |>
  ggplot(aes(x = Variables, y = value)) +
  geom_violin(aes(fill = Variables), width = 2.5) +
  geom_boxplot(width = 0.1, fill = "white", alpha = 0.7) +
  labs(title = "Swiss data", x = "", y = "Value") +
  theme_grey()
Warning: `position_dodge()` requires non-overlapping x intervals.
From the descriptive statistics, violin plots, and boxplots, we can see that Agriculture, Infant Mortality, and Fertility have nearly symmetric distributions with a slight left skew, Education and Examination are right-skewed, and Catholic has a bimodal distribution.
From the pairplot, we can see relatively strong negative correlations between Agriculture and Examination, Agriculture and Education, and Examination and Catholic, and a strong positive correlation between Examination and Education.
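As a numerical check on these impressions, the pairwise correlations can be printed directly; a minimal sketch that simply reuses the swiss data frame loaded above:

# pairwise correlations underlying the ggpairs panels
swiss |>
  cor() |>
  round(2)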
Variable selection to choose the best model.
select the best model based on AIC
# fit the full model
full_model_1 <- lm(Fertility ~ ., swiss)
# select the best model based on AIC
small_model_1 <- stepAIC(full_model_1, direction = "both", trace = TRUE)
From the stepwise selection above, without any transformation, the best model is the one with Agriculture, Education, Catholic, and Infant Mortality, which indicates that these are the variables to carry forward in the following analysis.
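As a quick check of this selection (a small sketch reusing full_model_1 and small_model_1 from the chunk above), the AIC values and the retained formula can be printed directly:

# compare the AIC of the full model and the stepAIC-selected model
AIC(full_model_1, small_model_1)
# the retained predictors
formula(small_model_1)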
An exploration of transformations to improve the fit of the model.
transformation on data
# first, use the components plus residual plot to see
# the partial effect of each variable
colnames <- colnames(swiss)[2:6]
CPR <- lapply(colnames, function(var) {
  crPlot(
    # fit linear regression model
    lm(Fertility ~ ., data = swiss),
    variable = var,
    smooth = TRUE
  )
})
# log transformation on Education
swiss_trans <- swiss |>
  mutate(log10Education = log10(Education))
The plots above indicate that a log transformation of Education might help improve the model fit. Adding polynomial terms for Catholic should also be considered.
Diagnostics to check the assumptions of your model.
fit model with log transformation on Education and polynomial terms for Catholic
# fit the model
model <- lm(Fertility ~ Agriculture + Examination + log10Education +
  poly(Catholic, 2) + Infant.Mortality, swiss_trans)
# diagnostic plot
plot(model)
From the diagnostic plots, we can see that the residuals are randomly scattered around the zero line, which suggests the mean structure of the model is adequate. The normal Q-Q plot shows that the residuals are approximately normally distributed, and the scale-location plot suggests the residuals are homoscedastic. The residuals vs leverage plot shows three points with relatively high Cook's distance, which might be influential points.
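To identify those points, the Cook's distances can be listed explicitly; a minimal sketch reusing the fitted model above (the 4/n cutoff is one common rule of thumb, not the only possible threshold):

# largest Cook's distances for the transformed model
cd <- cooks.distance(model)
sort(cd, decreasing = TRUE)[1:5]
# observations above the conventional 4/n cutoff
which(cd > 4 / nrow(swiss_trans))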
select the best model/variables based on BIC
# select the best model based on BIC
small_model_2 <- step(model, direction = "both", trace = TRUE,
  k = log(nrow(swiss)))
drop1(small_model_2, test = "F") |>
  tbl_regression() |>
  bold_labels()
| Characteristic    | p-value |
|-------------------|---------|
| &lt;none&gt;      |         |
| Agriculture       | 0.047   |
| Examination       | <0.001  |
| poly(Catholic, 2) | 0.003   |
| Infant.Mortality  | <0.001  |
From the BIC-based selection above, the best model is the one with Agriculture, Examination, Catholic (polynomial), and Infant Mortality, and log10Education is dropped. However, in the initial model without any transformation, Education was an important variable, so I will include the untransformed Education variable in a new model to see whether it is still important.
try to include the Education variable in the model
# fit the model
model_2 <- lm(Fertility ~ Agriculture + Examination + poly(Catholic, 2) +
  Infant.Mortality + Education, swiss)
# select the best model based on BIC
small_model_3 <- step(model_2, direction = "both", trace = TRUE,
  k = log(nrow(swiss)))
# check the significance of each variable
drop1(small_model_3, test = "F") |>
  tbl_regression() |>
  bold_labels()
| Characteristic    | p-value |
|-------------------|---------|
| &lt;none&gt;      |         |
| Agriculture       | 0.012   |
| poly(Catholic, 2) | <0.001  |
| Infant.Mortality  | 0.003   |
| Education         | <0.001  |
Since the BIC of the model with Education is lower than that of the model without Education, we keep the Education variable in the model. The final model includes Agriculture, Catholic (polynomial), Infant Mortality, and Education.
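The BIC comparison behind this choice can be made explicit; a small sketch reusing the two selected models from the chunks above:

# BIC of the model without Education vs. the model with Education
BIC(small_model_2, small_model_3)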
Some predictions of future observations for interesting values of the predictors.
get the final model
# fit the final model
swiss_poly <- swiss |>
  mutate(Catholic_sqr = Catholic^2)
finalm <- lm(Fertility ~ Agriculture + Education + Catholic + Catholic_sqr +
  Infant.Mortality, swiss_poly)
# summary of the final model
finalm |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic   | Beta  | 95% CI¹      | p-value |
|------------------|-------|--------------|---------|
| Agriculture      | -0.18 | -0.32, -0.04 | 0.012   |
| Education        | -0.84 | -1.2, -0.51  | <0.001  |
| Catholic         | -0.24 | -0.68, 0.20  | 0.3     |
| Catholic_sqr     | 0.00  | 0.00, 0.01   | 0.10    |
| Infant.Mortality | 1.2   | 0.41, 1.9    | 0.003   |

¹ CI = Confidence Interval
# perform the model diagnostics
plot(finalm)
# check the assumptions of the model
colnames_2 <- c("Agriculture", "Education", "Infant.Mortality",
  "Catholic", "Catholic_sqr")
CRP_2 <- lapply(colnames_2, function(var) {
  crPlot(
    # fitted linear regression model
    finalm,
    variable = var,
    smooth = TRUE
  )
})
From the diagnostic plots, we can see that the residuals are randomly scattered around the zero line, which indicates a reasonable fit. The normal Q-Q plot shows that the residuals are approximately normally distributed, and the scale-location plot shows that the residuals are almost homoscedastic. The residuals vs leverage plot shows some points with high Cook's distance, which might be influential points. The components plus residual plots show that the partial effect of each variable is acceptable.
attempt to predict using the final model
Here, we will predict the fertility rate using the final model. We will use the summary statistics of each variable to predict the fertility rate.
# retrieve the summary statistics of each variable
predict_data <- data.frame(
  Agriculture = as.vector(summary(swiss_poly$Agriculture)),
  Education = as.vector(summary(swiss_poly$Education)),
  Catholic = as.vector(summary(swiss_poly$Catholic)),
  Infant.Mortality = as.vector(summary(swiss_poly$Infant.Mortality)),
  Catholic_sqr = as.vector(summary(swiss_poly$Catholic_sqr))
)
# predict the fertility rate using the final model
pp <- predict(finalm, predict_data, se.fit = TRUE)
cbind(predict_data, Predict_fertility = pp) |>
  as_tibble() |>
  kable(format = "markdown")
| Agriculture | Education | Catholic  | Infant.Mortality | Catholic_sqr | Predict_fertility.fit | Predict_fertility.se.fit | Predict_fertility.df | Predict_fertility.residual.scale |
|-------------|-----------|-----------|------------------|--------------|-----------------------|--------------------------|----------------------|----------------------------------|
| 1.20000     | 1.00000   | 2.15000   | 10.80000         | 4.62250      | 73.69253              | 5.964476                 | 41                   | 7.012084                         |
| 35.90000    | 6.00000   | 5.19500   | 18.15000         | 26.98925     | 71.26141              | 2.011032                 | 41                   | 7.012084                         |
| 54.10000    | 8.00000   | 15.14000  | 20.00000         | 229.21960    | 66.82576              | 1.982976                 | 41                   | 7.012084                         |
| 50.65957    | 10.97872  | 41.14383  | 19.94255         | 3395.10300   | 70.14255              | 1.022817                 | 41                   | 7.012084                         |
| 67.65000    | 12.00000  | 93.12500  | 21.70000         | 8672.34125   | 75.02930              | 1.887649                 | 41                   | 7.012084                         |
| 89.70000    | 53.00000  | 100.00000 | 26.60000         | 10000.00000  | 45.41022              | 9.843236                 | 41                   | 7.012084                         |

(The rows correspond to the Min., 1st Qu., Median, Mean, 3rd Qu., and Max. of each predictor, in that order, as returned by summary().)
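The table reports fitted values with standard errors of the fit. For genuinely future observations, prediction intervals, which also account for the residual variance, may be more appropriate; a minimal sketch using the same predict_data:

# 95% prediction intervals for the same predictor combinations
predict(finalm, predict_data, interval = "prediction", level = 0.95) |>
  round(2)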
An interpretation of the meaning of the model by writing a scientific abstract.
(BACKGROUND) The fertility rate in Switzerland is influenced by various factors like agriculture, education, Catholicism, and infant mortality.
(OBJECTIVE) Using the “swiss” dataset, this study aimed to explore these associations and develop a predictive model for fertility rates.
(METHODS) After initial data analysis, transformation, and variable selection, a linear regression model was fitted and evaluated using diagnostic plots and the Bayesian Information Criterion (BIC).
(RESULTS) The final model included Agriculture, Education, Catholic (with a quadratic term), and Infant Mortality as predictors. Agriculture, Education, and Infant Mortality significantly affected fertility rates, while Catholicism had a non-linear effect. The model exhibited a good fit (adjusted R-squared = 0.61) and can predict fertility rates for new observations.
(CONCLUSIONS) This study provides insights into Swiss fertility patterns, aiding policymakers and public health experts in developing strategies to enhance fertility rates. Validation and exploration of additional predictors are recommended for future research.
2 Concavity of logistic regression log-likelihood
Write down the log-likelihood function of logistic regression for binomial responses.
The link function for logistic regression is the logit; its inverse, the logistic function, maps the linear predictor \(\eta_i\) to a probability: \[
\begin{eqnarray*}
p_i &=& \frac{e^{\eta_i}}{1 + e^{\eta_i}} \\
&=& \frac{1}{1 + e^{-\eta_i}} \\
&=& \text{logit}^{-1}(\eta_i)
\end{eqnarray*}
\] Where \(\eta_i = \mathbf{x}_i^T \boldsymbol{\beta}\) is the linear predictor, \(\mathbf{x}_i\) is the \(p \times 1\) vector of predictors for the \(i\)-th observation, and \(\boldsymbol{\beta}\) is the vector of coefficients. So, \(p_i\) is the probability of the \(i\)-th observation belonging to the positive class and can also be written as: \[
\begin{eqnarray*}
p_i = P(Y_i = 1 | \mathbf{x}_i) = \frac{e^{\eta_i}}{1 + e^{\eta_i}}
&=& \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}}
\end{eqnarray*}
\] And, the log-likelihood function for logistic regression is defined as: \[
\begin{eqnarray*}
\ell(\boldsymbol{\beta})
&=& \log \left[ \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i} \right] \\
&=& \sum_i \log \left[p_i^{y_i} (1 - p_i)^{1 - y_i}\right] \\
&=& \sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\
&=& \sum_i \left[ y_i \log \frac{e^{\eta_i}}{1 + e^{\eta_i}} + (1 - y_i) \log \frac{1}{1 + e^{\eta_i}} \right] \\
&=& \sum_i \left[ y_i \eta_i - \log (1 + e^{\eta_i}) \right] \\
&=& \sum_i \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right]
\end{eqnarray*}
\]
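This closed-form expression can be checked numerically against logLik() on any fitted binary logistic model; a small self-contained sketch on simulated data (the coefficients 0.5 and 1.2 are arbitrary illustrative values):

# numerical check of the log-likelihood formula
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
eta <- drop(model.matrix(fit) %*% coef(fit))   # eta_i = x_i^T beta
ll <- sum(y * eta - log(1 + exp(eta)))         # sum_i [ y_i eta_i - log(1 + e^eta_i) ]
c(closed_form = ll, logLik = as.numeric(logLik(fit)))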
Derive the gradient vector and Hessian matrix of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\).
The gradient vector of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\) is given by: \[
\begin{eqnarray*}
\nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta})
&=& \nabla_{\boldsymbol{\beta}} \left[ \sum_i \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right] \right] \\
&=& \sum_i \nabla_{\boldsymbol{\beta}} \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right] \\
&=& \sum_i \left[\nabla_{\boldsymbol{\beta}} \left( y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} \right) - \nabla_{\boldsymbol{\beta}} \left( \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right) \right] \\
&=& \sum_i \left[ y_i \mathbf{x}_i - \frac{e^{\mathbf{x}_i^T \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}} \mathbf{x}_i \right] \\
&=& \sum_i \left[ y_i \mathbf{x}_i - p_i \mathbf{x}_i \right] \\
&=& \sum_i \left[ (y_i - p_i) \mathbf{x}_i \right] \\
&=& \mathbf{X}^T (\mathbf{y} - \mathbf{p})
\end{eqnarray*}
\] where \(\mathbf{X}\) is the design matrix, \(\mathbf{y}\) is the vector of observed responses, \(\mathbf{p}\) is the vector of predicted probabilities, and \(\mathbf{p} = \frac{e^{\mathbf{X} \boldsymbol{\beta}}}{1 + e^{\mathbf{X} \boldsymbol{\beta}}}\).
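At the maximum-likelihood estimate the score equations \(\mathbf{X}^T (\mathbf{y} - \mathbf{p}) = \mathbf{0}\) hold, which gives a direct numerical check of this gradient; a small self-contained sketch on the same kind of simulated example as above:

# the gradient X^T (y - p) is numerically zero at the fitted MLE
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
p_hat <- fitted(fit)             # p_i = logit^{-1}(x_i^T beta)
drop(t(X) %*% (y - p_hat))       # both entries are ~0, up to convergence tolerance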
The Hessian matrix of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\) is obtained by differentiating the gradient once more, using \(\nabla_{\boldsymbol{\beta}}\, p_i = p_i (1 - p_i)\, \mathbf{x}_i\): \[
\begin{eqnarray*}
\nabla_{\boldsymbol{\beta}} \nabla_{\boldsymbol{\beta}}^T \ell(\boldsymbol{\beta})
&=& \nabla_{\boldsymbol{\beta}}^T \left[ \sum_i (y_i - p_i) \mathbf{x}_i \right] \\
&=& - \sum_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T \\
&=& - \mathbf{X}^T \mathbf{W} \mathbf{X}
\end{eqnarray*}
\] where \(\mathbf{W}\) is the diagonal matrix with entries \(p_i (1 - p_i)\).
Show that the log-likelihood function of logistic regression is a concave function in regression coefficients \(\boldsymbol{\beta}\). (Hint: show that the negative Hessian is a positive semidefinite matrix.)
The negative Hessian is
\[
\begin{eqnarray*}
- \nabla_{\boldsymbol{\beta}} \nabla_{\boldsymbol{\beta}}^T \ell(\boldsymbol{\beta})
&=& \sum_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T \\
&=& \mathbf{X}^T \mathbf{W} \mathbf{X} \\
\end{eqnarray*}
\] where \(\mathbf{W}\) is a diagonal matrix with \(p_i (1 - p_i)\) on the diagonal. And \(\mathbf{X}\) is the design matrix.
To prove that the log-likelihood function is concave, we need to show that the negative Hessian is positive semidefinite. Let \(\mathbf{v}\) be any vector. Then \[
\begin{eqnarray*}
\mathbf{v}^T \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right) \mathbf{v}
&=& \sum_i p_i (1 - p_i)\, \mathbf{v}^T \mathbf{x}_i \mathbf{x}_i^T \mathbf{v} \\
&=& \sum_i p_i (1 - p_i) \left( \mathbf{x}_i^T \mathbf{v} \right)^2 \\
&\geq& 0
\end{eqnarray*}
\]
where the inequality follows from the fact that \(p_i (1 - p_i) \geq 0\) for all \(i\). Therefore, the negative Hessian is positive semidefinite, which implies that the log-likelihood function is concave.
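The positive semidefiniteness of \(\mathbf{X}^T \mathbf{W} \mathbf{X}\) can also be verified numerically by checking that its eigenvalues are non-negative; a small self-contained sketch on simulated data:

# eigenvalues of the negative Hessian X^T W X are all non-negative
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)
X <- model.matrix(fit)
p <- fitted(fit)
W <- diag(p * (1 - p))            # diagonal weights p_i (1 - p_i)
eigen(t(X) %*% W %*% X)$values    # all >= 0, so X^T W X is positive semidefinite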
3 GLM model fitting
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The purpose of the study was to investigate factors related to diabetes. The data may be found in the dataset pima.
3.0.1 Q3.1
Create a factor version of the test results and use this to produce an interleaved histogram to show how the distribution of insulin differs between those testing positive and negative. Do you notice anything unbelievable about the plot?
3.0.1.1 Answer
read the data
data(pima)
Plot the distribution of insulin by test result
# separate the test results into negative and positive
pima$test_label <- ifelse(pima$test == 0, "Negative", "Positive")
# plot the distribution of insulin by test result
ggplot(pima, aes(insulin, fill = test_label)) +
  geom_histogram(position = "dodge", bins = 30) +
  labs(title = "Distribution of Insulin by Test Result")
Yes. From the plot above, I see that there are a large number of zero values for insulin. This is not possible as insulin is a hormone that is always present in the body. The zero values are likely to be missing values that have been coded as zero.
3.0.2 Q3.2
Replace the zero values of insulin with the missing value code NA. Recreate the interleaved histogram plot and comment on the distribution.
3.0.2.1 Answer
# replace the zero values of insulin with NA
pima_insulin_with_na <- pima |>
  mutate(insulin = ifelse(insulin == 0, NA, insulin))
# plot the distribution of insulin by test result
ggplot(pima_insulin_with_na, aes(insulin, fill = test_label)) +
  geom_histogram(position = "dodge", bins = 30)
Warning: Removed 374 rows containing non-finite outside the scale range
(`stat_bin()`).
Comment:
After replacing the zero values of insulin with NA, the distribution of insulin by test result looks more reasonable. The plot shows that insulin levels tend to be higher among women who tested positive for diabetes than among those who tested negative, and the distribution for the negative group is more right-skewed.
3.0.3 Q3.3
Replace the incredible zeroes in other variables with the missing value code. Fit a model with the result of the diabetes test as the response and all the other variables as predictors. How many observations were used in the model fitting? Why is this less than the number of observations in the data frame?
3.0.3.1 Answer
# replace the zero values of other variables with NA
pima_with_na <- pima |>
  mutate(
    glucose = ifelse(glucose == 0, NA, glucose),
    diastolic = ifelse(diastolic == 0, NA, diastolic),
    triceps = ifelse(triceps == 0, NA, triceps),
    insulin = ifelse(insulin == 0, NA, insulin),
    bmi = ifelse(bmi == 0, NA, bmi),
    age = ifelse(age == 0, NA, age),
    diabetes = ifelse(diabetes == 0, NA, diabetes)
  )
# fit a model1 with the result of the diabetes test as
# the response and all the other variables as predictors
model1 <- glm(test ~ glucose + diastolic + triceps + insulin + bmi +
  pregnant + age + diabetes, data = pima_with_na, family = binomial)
summary(model1)
Call:
glm(formula = test ~ glucose + diastolic + triceps + insulin +
bmi + pregnant + age + diabetes, family = binomial, data = pima_with_na)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
diastolic -1.420e-03 1.183e-02 -0.120 0.90446
triceps 1.122e-02 1.708e-02 0.657 0.51128
insulin -8.253e-04 1.306e-03 -0.632 0.52757
bmi 7.054e-02 2.734e-02 2.580 0.00989 **
pregnant 8.216e-02 5.543e-02 1.482 0.13825
age 3.395e-02 1.838e-02 1.847 0.06474 .
diabetes 1.141e+00 4.274e-01 2.669 0.00760 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 498.10 on 391 degrees of freedom
Residual deviance: 344.02 on 383 degrees of freedom
(376 observations deleted due to missingness)
AIC: 362.02
Number of Fisher Scoring iterations: 5
# the number of observations used in the model fitting
nobs(model1)
[1] 392
In model1, glucose, bmi, and diabetes are significant predictors, while age is only marginally significant (p = 0.065).
The model fitting process only used the 392 observations that did not contain missing values. 376 observations were deleted from the data frame because they contained missing values.
Reason: by default (na.action = na.omit), the glm function drops any observation that has a missing value in any of the variables used in the model.
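This can be confirmed directly: the number of rows with no missing value in any of the model variables should equal nobs(model1); a small sketch using the data frame created above:

# complete cases among the variables used in model1
model_vars <- c("test", "glucose", "diastolic", "triceps", "insulin",
                "bmi", "pregnant", "age", "diabetes")
sum(complete.cases(pima_with_na[, model_vars]))   # 392, matching nobs(model1)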
3.0.4 Q3.4
Refit the model but now without the insulin and triceps predictors. How many observations were used in fitting this model? Devise a test to compare this model with that in the previous question.
3.0.4.1 Answer
First, fit the model without the insulin and triceps predictors. In this case, the model fitting process will use the observations that do not contain missing values in the glucose, diastolic, bmi, pregnant, age, and diabetes variables. So, it is expected that the number of observations used in the model fitting will be larger than the number of observations used in the model1 fitting process.
# fit a model2 without the insulin and triceps predictors
model2 <- glm(test ~ glucose + diastolic + bmi + pregnant + age + diabetes,
  data = pima_with_na, family = binomial)
summary(model2)
Call:
glm(formula = test ~ glucose + diastolic + bmi + pregnant + age +
diabetes, family = binomial, data = pima_with_na)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.962146 0.820892 -10.918 < 2e-16 ***
glucose 0.035194 0.003605 9.763 < 2e-16 ***
diastolic -0.008916 0.008618 -1.035 0.30084
bmi 0.090926 0.015740 5.777 7.61e-09 ***
pregnant 0.117863 0.033418 3.527 0.00042 ***
age 0.016944 0.009834 1.723 0.08489 .
diabetes 0.960515 0.306415 3.135 0.00172 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 931.94 on 723 degrees of freedom
Residual deviance: 672.86 on 717 degrees of freedom
(44 observations deleted due to missingness)
AIC: 686.86
Number of Fisher Scoring iterations: 5
# the number of observations used in the model fitting
nobs(model2)
[1] 724
This means that the model fitting process used the 724 observations that did not contain missing values. 44 observations were deleted from the data frame because they contained missing values.
Comparison of the two models:
The number of observations used in the model1 fitting process is 392, while the number used for model2 is 724, so the two fits cannot be compared directly. To compare the two models, both must be fitted to the same set of observations. Therefore, I will use the drop_na function to remove the missing values from the data frame before fitting the models.
And then, I will use the deviance statistic to compare the two models.
# delete the missing values from the data frame
pima_complete <- pima_with_na |>
  drop_na()
# fit the two models
model_a <- glm(test ~ glucose + diastolic + triceps + insulin + bmi +
  pregnant + age + diabetes, data = pima_complete, family = binomial)
model_b <- glm(test ~ glucose + diastolic + bmi + pregnant + age + diabetes,
  data = pima_complete, family = binomial)
# the number of observations used in the model fitting
nobs(model_a)
[1] 392
nobs(model_b)
[1] 392
In this case, the number of observations used in the model fitting process is 392 for both models. Now, I will use the deviance statistic to compare the two models.
# compare the two models
anova(model_a, model_b, test = "Chisq")
Analysis of Deviance Table
Model 1: test ~ glucose + diastolic + triceps + insulin + bmi + pregnant +
age + diabetes
Model 2: test ~ glucose + diastolic + bmi + pregnant + age + diabetes
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 383 344.02
2 385 344.88 -2 -0.85931 0.6507
Since the p-value of the test is 0.6507, we fail to reject the null hypothesis that the coefficients of insulin and triceps are zero. The model without insulin and triceps does not fit significantly worse than the model with them, so these two predictors do not contribute significantly to the model.
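The p-value reported by anova() can be reproduced by hand from the change in deviance and in residual degrees of freedom; a small sketch reusing model_a and model_b:

# likelihood-ratio (deviance) test computed by hand
dev_diff <- deviance(model_b) - deviance(model_a)        # about 0.86
df_diff <- df.residual(model_b) - df.residual(model_a)   # 2
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)       # about 0.65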
3.0.5 Q3.5
Use AIC to select a model. You will need to take account of the missing values. Which predictors are selected? How many cases are used in your selected model?
3.0.5.1 Answer
I need to use the data without missing values and fit a model. So, I will continue to use model_a from the previous question to select the model using AIC.
# select a model using AIC
model_aic <- stepAIC(model_a, direction = "both", trace = TRUE)
# the number of observations used in the selected model
nobs(model_aic)
[1] 392
# print the summary of the selected model
model_aic |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| glucose        | 0.04     | 0.03, 0.05  | <0.001  |
| bmi            | 0.08     | 0.04, 0.12  | <0.001  |
| pregnant       | 0.08     | -0.02, 0.19 | 0.13    |
| age            | 0.03     | 0.00, 0.07  | 0.054   |
| diabetes       | 1.2      | 0.34, 2.0   | 0.007   |

¹ OR = Odds Ratio, CI = Confidence Interval
The selected model includes the predictors glucose, diabetes, bmi, pregnant and age. The number of observations used in the selected model is 392.
3.0.6 Q3.6
Create a variable that indicates whether the case contains a missing value. Use this variable as a predictor of the test result. Is missingness associated with the test result? Refit the selected model, but now using as much of the data as reasonable. Explain why it is appropriate to do this.
3.0.6.1 Answer
# identify cases with missing values
pima <- pima |>
  mutate(
    glucose2 = ifelse(glucose == 0, NA, glucose),
    diastolic2 = ifelse(diastolic == 0, NA, diastolic),
    triceps2 = ifelse(triceps == 0, NA, triceps),
    insulin2 = ifelse(insulin == 0, NA, insulin),
    bmi2 = ifelse(bmi == 0, NA, bmi),
    age2 = ifelse(age == 0, NA, age),
    diabetes2 = ifelse(diabetes == 0, NA, diabetes)
  )
# create a variable that indicates
# whether the case contains a missing value
pima$missingNA <- ifelse(
  apply(is.na(dplyr::select(pima, contains("2"))), 1, sum) > 0, 1, 0)
# fit a model with the missingness variable as a predictor
missing.glm <- glm(test ~ missingNA, data = pima, family = binomial)
# summary of the model
missing.glm |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| missingNA      | 0.16     | -0.14, 0.45 | 0.3     |

¹ OR = Odds Ratio, CI = Confidence Interval
Missingness is not associated with the test result (p = 0.3). This suggests that removing cases with missing values does not distort the distribution of the outcome, which justifies a "complete case" analysis.
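A simple cross-tabulation of the missingness indicator against the test result tells the same story as the logistic fit; a small sketch using the missingNA variable created above:

# proportion testing positive among complete (0) and incomplete (1) cases
xtabs(~ missingNA + test, data = pima) |>
  prop.table(margin = 1)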
In the following code, I refit the selected model using as much of the data as is reasonable. In the previous question, the selected model related test to glucose, diabetes, bmi, pregnant, and age, and we have just seen that missingness is not associated with the outcome. Therefore, I drop only the cases with missing values in the variables that appear in the selected model and refit it.
# drop the missing values
# just according to the variables in the selected model
pima_complete_2 <- pima |>
  drop_na(glucose2, bmi2, age2, diabetes2)
# refit the selected model using the new complete cases
model_4 <- glm(test ~ glucose2 + bmi2 + age2 + diabetes2 + pregnant,
  data = pima_complete_2, family = binomial)
# summary of the model
model_4 |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| glucose2       | 0.04     | 0.03, 0.04  | <0.001  |
| bmi2           | 0.09     | 0.06, 0.12  | <0.001  |
| age2           | 0.01     | -0.01, 0.03 | 0.2     |
| diabetes2      | 0.92     | 0.34, 1.5   | 0.002   |
| pregnant       | 0.12     | 0.05, 0.18  | <0.001  |

¹ OR = Odds Ratio, CI = Confidence Interval
summary(model_4)
Call:
glm(formula = test ~ glucose2 + bmi2 + age2 + diabetes2 + pregnant,
family = binomial, data = pima_complete_2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.322789 0.737279 -12.645 < 2e-16 ***
glucose2 0.035941 0.003555 10.110 < 2e-16 ***
bmi2 0.087529 0.014722 5.945 2.76e-09 ***
age2 0.011366 0.009315 1.220 0.222405
diabetes2 0.920583 0.300832 3.060 0.002212 **
pregnant 0.115058 0.032341 3.558 0.000374 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 974.75 on 751 degrees of freedom
Residual deviance: 703.24 on 746 degrees of freedom
AIC: 715.24
Number of Fisher Scoring iterations: 5
# the number of observations used in the model fitting
nobs(model_4)
[1] 752
The selected model includes the predictors glucose, diabetes, bmi, pregnant and age. The number of observations used in the selected model is 752.
Reason: it is appropriate to use a "complete case" analysis here because missingness is not associated with the outcome, so removing cases with missing values does not distort the distribution of the outcome and the model fitted on the complete cases remains valid.
3.0.7 Q3.7
Using the last fitted model of the previous question, what is the difference in the log odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors are held constant? Give a confidence interval for this difference.
3.0.7.1 Answer
# calculate the first and third quartiles
# of BMI for women who tested positive
bmi_q1 <- quantile(pima_complete_2$bmi2[pima_complete_2$test == "1"], 0.25)
bmi_q3 <- quantile(pima_complete_2$bmi2[pima_complete_2$test == "1"], 0.75)
# difference in the log odds (log odds ratio)
diff_of_odds <- coef(model_4)["bmi2"] * (bmi_q3 - bmi_q1)
# odds ratio
exp(diff_of_odds)
bmi2
1.957731
# calculate the 95% confidence interval for the difference
CI <- confint(model_4)["bmi2", ] * c(bmi_q3 - bmi_q1)
# confidence interval for odds ratio
exp(CI)
2.5 % 97.5 %
1.575269 2.454763
cbind(diff_of_odds, CI[1], CI[2]) |>
  as_tibble() |>
  kable(col.names = c("Log Difference in Odds (Log Odds Ratio)",
                      "95% CI Lower", "95% CI Upper"))
| Log Difference in Odds (Log Odds Ratio) | 95% CI Lower | 95% CI Upper |
|-----------------------------------------|--------------|--------------|
| 0.6717861                               | 0.4544263    | 0.8980302    |
Therefore, the difference in the log odds of testing positive for diabetes for a woman with a BMI at the third quartile compared with one at the first quartile is 0.672 (95% CI: 0.454 to 0.898). On the odds scale, this corresponds to an odds ratio of 1.958 (95% CI: 1.575 to 2.455).
3.0.8 Q3.8
Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure significant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.
3.0.8.1 Answer
Do women who test positive have higher diastolic blood pressures?
# plot the distribution of diastolic blood pressure
# based on the test result
pima_complete_2 |>
  ggplot(aes(x = test_label, y = diastolic2, fill = test_label)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Diastolic Blood Pressure by Test Result",
    x = "Test Result",
    y = "Diastolic Blood Pressure"
  ) +
  theme_minimal()
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_boxplot()`).
# fit a model with only diastolic blood pressure as a predictor
positive_diastolic.glm <- glm(test ~ diastolic2, data = pima_complete_2,
  family = binomial)
# summary of the model
positive_diastolic.glm |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹    | p-value |
|----------------|----------|------------|---------|
| diastolic2     | 0.03     | 0.02, 0.04 | <0.001  |

¹ OR = Odds Ratio, CI = Confidence Interval
From the boxplot and the single-predictor model, diastolic blood pressure is positively and significantly associated with the test result (p < 0.001). So yes, women who test positive tend to have higher diastolic blood pressures.
Is the diastolic blood pressure significant in the regression model?
# fit a model with all predictors
model_5 <- glm(test ~ glucose2 + diastolic2 + triceps2 + insulin2 + bmi2 +
  pregnant + age2 + diabetes2, data = pima_complete_2, family = binomial)
# summary of the model
model_5 |>
  tbl_regression() |>
  bold_labels() |>
  bold_p(t = 0.05)
| Characteristic | log(OR)¹ | 95% CI¹     | p-value |
|----------------|----------|-------------|---------|
| glucose2       | 0.04     | 0.03, 0.05  | <0.001  |
| diastolic2     | 0.00     | -0.02, 0.02 | >0.9    |
| triceps2       | 0.01     | -0.02, 0.04 | 0.5     |
| insulin2       | 0.00     | 0.00, 0.00  | 0.5     |
| bmi2           | 0.07     | 0.02, 0.13  | 0.010   |
| pregnant       | 0.08     | -0.03, 0.19 | 0.14    |
| age2           | 0.03     | 0.00, 0.07  | 0.065   |
| diabetes2      | 1.1      | 0.32, 2.0   | 0.008   |

¹ OR = Odds Ratio, CI = Confidence Interval
From the model, we can see that the diastolic blood pressure is not significant in the regression model. The p-value is greater than 0.05. This means that the diastolic blood pressure is not associated with the test result when other predictors are included in the model.
Explain the distinction:
The two questions are different: the first asks whether diastolic blood pressure is marginally associated with the test result, that is, when it is the only predictor in the model; the second asks whether it is associated with the test result after adjusting for the other predictors. The answers only appear contradictory because the significance of a predictor depends on what else is in the model. On its own, diastolic blood pressure is significant, but once predictors correlated with it (such as BMI and age) are included, they explain the variation in the test result that diastolic blood pressure previously appeared to explain, so its adjusted effect is no longer distinguishable from zero.
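One way to see the mechanism is to check that diastolic blood pressure is correlated with predictors that carry more of the signal, such as BMI and age; a small sketch using pairwise complete observations, since diastolic2 still contains missing values in pima_complete_2:

# correlation of diastolic blood pressure with stronger predictors
cor(pima_complete_2[, c("diastolic2", "bmi2", "age2", "glucose2")],
    use = "pairwise.complete.obs") |>
  round(2)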