GROUP
WORK - DEADLINE 27-Nov-23.
Please submit your final report using this form.
For this lab, we will use the tidyverse package because it will help us with loading data, wrangling, visualizing and analyzing it. You can find more about the packages in the tidyverse at tidyverse.org.
We will be using R Markdown (see tutorials here, here
and here) to
create reproducible lab reports. The file that we use to create the
report has the extension *.rmd, and consists of three
different parts:
R code appears in a gray box, which we call “code chunks.”
The R Markdown file knows that the gray box contains R code
because it begins with three tick marks (```), followed by two curly
braces that contain a lowercase letter r ({r}).Potentially the most important feature of R Markdown files is that
they allow for us to embed our R code within a written
report.
Instead of typing our R code into the console, we
encourage you to type any code you want to save in the R
code chunk associated with each problem. You can execute the
R code you type in these code chunks similar to how you
typed code into the console and pressed enter/return. Within the code
chunk there are a few ways to execute lines of R code:
Ctrl-Enter or Cmd-Enter at the same time,
orR code in a given code
chunk, you can click on the “Play” button in the upper right hand corner
of the code chunk (green sideways triangle).If at any point you need to start over and run all of the code chunks
before a specific code chunk, you click on the “Fastforward” button in
the upper right hand corner of that code chunk (gray upside down
triangle with a bar below). This will run every code chunk that occurred
before that code chunk, but will not execute the
R code included in that code chunk.
You will find all the work-space for your lab on posit cloud using this link.
Before starting, you are advised to change some of the global options
as shown in the figure on the right. We will be using R Markdown to
create reproducible lab reports. In RStudio, you will find the file
lab01.Rmd in the Files panel. Double click it
to show the script in the code panel.
tidyverse, broom,
table1, GGally, car,
easystats, datawizard,
marginaleffects and the modelsummary
packages.natality.rds data set into your workspace, and
save it as natality.The 2018 United States Birth Data were compiled from information on birth certificates by the National Vital Statistics System, part of the National Center for Health Statistics, in cooperation with States (National Center for Health Statistics 2022). The dataset consists of a random sample from the original dataset, registering births in the United States in the year 2018. Specifically, we will be looking at the following variables:
table1 package to create a table with
appropriate descriptive information about the data set. Use the examples
provided in the following vignette to help you build a table.
Bonus: Provide a brief description, interpretation or
reflection about your findings from the table: (e.g., what do you
notice? what do you wonder? how does this relate to you or your
community?)| Non smoker (N=1564) |
Smoker (N=124) |
|
|---|---|---|
| Birth Weight (g) | ||
| Mean (SD) | 3250 (601) | 3070 (597) |
| Median [Min, Max] | 3290 [369, 5880] | 3090 [630, 4220] |
| Risks | ||
| No | 1088 (69.6%) | 84 (67.7%) |
| Yes | 476 (30.4%) | 40 (32.3%) |
| Mother Age | ||
| Mean (SD) | 29.0 (5.79) | 27.8 (5.79) |
| Median [Min, Max] | 29.0 [15.0, 49.0] | 28.0 [17.0, 42.0] |
| Mother Race/Hispanic Origin | ||
| NH White | 844 (54.0%) | 94 (75.8%) |
| NH Black | 266 (17.0%) | 14 (11.3%) |
| NH Other | 125 (8.0%) | 10 (8.1%) |
| Hispanic | 329 (21.0%) | 6 (4.8%) |
| Marital Status | ||
| Married | 1017 (65.0%) | 40 (32.3%) |
| Unmarried | 547 (35.0%) | 84 (67.7%) |
| Mother Education | ||
| lower | 177 (11.3%) | 34 (27.4%) |
| HS | 377 (24.1%) | 45 (36.3%) |
| Some college | 437 (27.9%) | 37 (29.8%) |
| Uni | 573 (36.6%) | 8 (6.5%) |
| Prior births (living) | ||
| 0 | 601 (38.4%) | 38 (30.6%) |
| 1 | 516 (33.0%) | 41 (33.1%) |
| 2 | 265 (16.9%) | 17 (13.7%) |
| 3 | 93 (5.9%) | 17 (13.7%) |
| 4+ | 89 (5.7%) | 11 (8.9%) |
| Prior births (dead) | ||
| 0 | 1551 (99.2%) | 123 (99.2%) |
| 1+ | 13 (0.8%) | 1 (0.8%) |
| Mother BMI | ||
| Mean (SD) | 26.9 (6.32) | 27.3 (6.42) |
| Median [Min, Max] | 25.6 [15.6, 58.4] | 26.3 [16.5, 50.1] |
ggpairs function to explore the association
between the different variables. Try to replicate the figure below, or
create your own variation. Bonus: Provide a brief
description, interpretation or reflection about your findings from the
table: (e.g., what do you notice? what do you wonder? how does this
relate to you or your community?)In the following exercises, you will assess the association between birth weight (the outcome) and smoking during pregnancy (the primary predictor), adjusted for risk factors reported, mother’s age, mother’s race/ethnicity, marital status, mother’s education, prior births now living and prior births now dead, and mother’s body mass index.
step function find the optimal model using the stepwise
method. Then use anova to compare the four models. What is
your conclusion? Now compare your models using modelsummary
function. What does the number in the (brackets) mean? How can you
identify the optimal model just by looking at the \(R^2\) and the AIC?M.NULL <- lm(DBWT ~ 1, natality)
M.UNADJUSTED <- lm(DBWT ~ CIG_REC, natality)
# The dot represents all variables except the outcome DBWT
M.FULL <- lm(DBWT ~ ., natality)
# Finding the optimal model
step(M.NULL,
scope = list(upper = M.FULL),
direction="both",
data=natality)
# Displaying the four models by each other
modelsummary(
list(
null = M.NULL,
unadjasted = M.UNADJUSTED,
___ = ___,
___ = ___
),
stars = TRUE
)Stepwise regression is a systematic method for adding and removing terms from a linear or generalized linear model based on their statistical significance in explaining the response variable. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models.
The step function uses forward and backward stepwise regression to determine a final model. At each step, the function searches for terms to add to the model or remove from the model based on the value of the ‘Criterion’ name-value pair argument.
Use the step function to run through different models -
the last model shown is the one with the smallest Akaike
information criterion (AIC)
| null | unadjusted | optimal | full | |
|---|---|---|---|---|
| (Intercept) | 3234.746 (14.658)*** | 3247.903 (15.187)*** | 3080.559 (64.625)*** | 3162.373 (107.021)*** |
| CIG_RECYes | −179.096 (56.032)** | −160.083 (55.632)** | −156.593 (56.325)** | |
| DMARUnmarried | −184.275 (31.823)*** | −189.782 (35.084)*** | ||
| MRACEHISPNH Black | −292.140 (41.690)*** | −290.175 (41.829)*** | ||
| MRACEHISPNH Other | −121.695 (52.796)* | −120.774 (52.960)* | ||
| MRACEHISPHispanic | −74.492 (38.075)+ | −72.464 (39.120)+ | ||
| BMI | 10.261 (2.259)*** | 10.493 (2.277)*** | ||
| risksYes | −158.808 (31.655)*** | −152.902 (32.055)*** | ||
| PRIORLIVE1 | 130.686 (33.814)*** | 140.483 (34.725)*** | ||
| PRIORLIVE2 | 107.170 (41.746)* | 123.385 (43.880)** | ||
| PRIORLIVE3 | 138.776 (59.972)* | 161.298 (63.149)* | ||
| PRIORLIVE4+ | 131.949 (62.333)* | 162.508 (67.562)* | ||
| PRIORDEAD1+ | 258.434 (153.905)+ | 265.467 (154.184)+ | ||
| MAGER | −3.836 (3.085) | |||
| MEDUCHS | 8.448 (49.909) | |||
| MEDUCSome college | −0.240 (50.838) | |||
| MEDUCUni | 31.610 (56.902) | |||
| Num.Obs. | 1688 | 1688 | 1688 | 1688 |
| R2 | 0.000 | 0.006 | 0.104 | 0.105 |
| R2 Adj. | 0.000 | 0.005 | 0.098 | 0.097 |
| AIC | 26401.9 | 26393.7 | 26239.9 | 26246.1 |
| BIC | 26412.7 | 26410.0 | 26315.9 | 26343.8 |
| Log.Lik. | −13198.942 | −13193.843 | −13105.946 | −13105.039 |
| F | 10.217 | 16.259 | 12.291 | |
| RMSE | 602.05 | 600.23 | 569.78 | 569.47 |
Identify the largest coefficient in the optimal model (on an absolute scale). Do you think it has the largest effect on the birth-weight?
Not necessarily! Each variable has a different scale with different boundaries, so coefficients cannot be compared when they are not on a common scale. The model only tells us which parameters are statistically different from zero, but it does not allow us to compare between them.
This situation is why Gelman recommends centering each continuous variable (that is, subtracting the mean from each value) and scaling it by two standard deviations (that is - dividing it by two standard deviations). You may often hear people scaling by just one standard deviation. This is useful; the regression coefficient that emerges comes from inputs where 0 is the mean, communicates magnitude effects that might be more interesting, and it will make the intercept interpretable as well. However, scaling by two standard deviations does all that and has the helpful side effect of putting everything on a roughly common scale. Basically, scaling by two standard deviations results in variables that have a mean 0 and a standard deviation of .5. Binary variables usually also have standard deviations close to 0.5. This may not be perfect, but it’s really good and often close.
# Scaling
natality.ctr <- natality |>
standardize(two_sd = TRUE)
# Fitting the model
M.OPT.ctr <- lm(
DBWT ~ DMAR + MRACEHISP + BMI +
risks + PRIORLIVE + CIG_REC +
PRIORDEAD,
data = natality.ctr
)
# creating a forest plot
M.OPT.ctr %>%
model_parameters() %>%
plot()You can display the various statistics using
M.FULL |> tidy(conf.int = TRUE).
| model | CIG_RECYes | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| Unadjasted | −0.149 | 0.047 | −3.196 | 1.42 × 10−3 | −0.240 | −0.057 |
| Optimal | −0.133 | 0.046 | −2.878 | 4.06 × 10−3 | −0.224 | −0.042 |
| Full | −0.130 | 0.047 | −2.780 | 5.49 × 10−3 | −0.222 | −0.038 |
anova
to test whether the model with the interaction is significantly better
than the one without it. Regardless of statistical significance, create
a dumbell
plot like the ones shown below, illustrating the smoking effect at
each level of mother’s education and estimate the cigarette smoking
effect at each level of mother’s education.Notes: You may get a different plot than the ones
shown below, depending on the way you set your models. We will be using
the marginaleffects
package to create the plots. You need only to create one of the
plots, but bonus for creating ALL the plots. Bonus:
Provide a brief description, interpretation or reflection about what you
see in the plot (e.g., what do you notice? what do you wonder?)
To create the dumbbell figure, you will need to predict the birth weight for smoking and non-smoking mums with different education levels. Here is the code for the calculations you can use (this uses a three-way interaction, which is not strictly requested in the question above):
library(marginaleffects)
M.FULL_int.ctr <- lm(formula = DBWT ~ ___, data = natality.ctr)
predictions(M.FULL_int.ctr,
newdata = datagrid(
MRACEHISP = c("NH White", "NH Black", "NH Other", "Hispanic"),
MEDUC = c("lower", "HS", "Some college", "Uni"),
CIG_REC = c("No", "Yes")
)
) |>
as.data.frame() |>
mutate(DBWT = estimate) |>
select(DBWT, MRACEHISP, CIG_REC, MEDUC) Once you have your data, here is how you create your dumbells (for
the race NH White, for example):
tribble(
~Education, ~No_Smoking, ~Smoking,
"Lower", 0.06227469, 0.04447809,
"HS", 0.08854779, -0.05014673,
# ___, ___, ___,
# ___, ___, ___
) |> ggplot() +
geom_segment(
aes(
x = "No Smoking", y = No_Smoking,
xend = "Smoking", yend = Smoking,
color = Education
),
linewidth = 2
) +
geom_point(
aes(x = "No Smoking", y = No_Smoking, color = Education),
size = 6
) +
geom_point(
aes(x = "Smoking", y = Smoking, color = Education),
size = 6
)anova to test whether the model
with the interaction is significantly better than the one without it.
Regardless of statistical significance, estimate the BMI effect at each
level of mother’s race/ethnicity and plot the BMI effect at each level
of mother’s race/ethnicity using a dumbell plot. Bonus:
Provide a brief description, interpretation or reflection about what you
see in the plot (e.g., what do you notice? what do you wonder?)Note: Since we have scaled the continuous variable
BMI by TWO standard deviations, you will
need to estimate the right tip of the dumbbell at two
units and not at one unit.
anova to test whether the model with the interaction is
significantly better than the one without it. Regardless of statistical
significance, estimate the BMI effect for mothers age 20, 30, and 40
years and make a plot illustrating the BMI effect at each of these ages.
Bonus: Provide a brief description, interpretation or
reflection about what you see in the plot (e.g., what do you notice?
what do you wonder?)