GROUP WORK - DEADLINE 27-Nov-23.
Please submit your final report using this form.

For this lab, we will use the tidyverse package because it will help us with loading data, wrangling, visualizing and analyzing it. You can find more about the packages in the tidyverse at tidyverse.org.

We will be using R Markdown (see tutorials here, here and here) to create reproducible lab reports. The file that we use to create the report has the extension *.rmd, and consists of three different parts:

  • The YAML: This is the header of the file, and it includes the file’s meta-data: information about the appearance of the final report. Go ahead and put your name(s) in the YAML in the appropriate place.
  • The Markdown: This is the formatted text that goes into the report, and is like the text you would write in a word document, for example
  • The code chunks: In an R Markdown file, R code appears in a gray box, which we call “code chunks.” The R Markdown file knows that the gray box contains R code because it begins with three tick marks (```), followed by two curly braces that contain a lowercase letter r ({r}).

Potentially the most important feature of R Markdown files is that they allow for us to embed our R code within a written report.

Instead of typing our R code into the console, we encourage you to type any code you want to save in the R code chunk associated with each problem. You can execute the R code you type in these code chunks similar to how you typed code into the console and pressed enter/return. Within the code chunk there are a few ways to execute lines of R code:

  • For a single line, place your cursor on the line on code and press Ctrl-Enter or Cmd-Enter at the same time, or
  • Alternatively, place your cursor on the line and press the “Run” button in the upper right hand corner of the R Markdown file.

  • If you wanted to run all of the R code in a given code chunk, you can click on the “Play” button in the upper right hand corner of the code chunk (green sideways triangle).

If at any point you need to start over and run all of the code chunks before a specific code chunk, you click on the “Fastforward” button in the upper right hand corner of that code chunk (gray upside down triangle with a bar below). This will run every code chunk that occurred before that code chunk, but will not execute the R code included in that code chunk.

Creating a reproducible lab report

You will find all the work-space for your lab on posit cloud using this link.

Before starting, you are advised to change some of the global options as shown in the figure on the right. We will be using R Markdown to create reproducible lab reports. In RStudio, you will find the file lab01.Rmd in the Files panel. Double click it to show the script in the code panel.

  • In the file, update the YAML with your name, the date and the name of the lab.
  • Load the tidyverse, broom, table1, GGally, car, easystats, datawizard, marginaleffects and the modelsummary packages.
  • Load the natality.rds data set into your workspace, and save it as natality.
  • Knit your file to see that everything is working.
  • For each question, please add the text of the question, the code chunks and your answer.

The 2018 United States Birth Data were compiled from information on birth certificates by the National Vital Statistics System, part of the National Center for Health Statistics, in cooperation with States (National Center for Health Statistics 2022). The dataset consists of a random sample from the original dataset, registering births in the United States in the year 2018. Specifically, we will be looking at the following variables:

  • DBWT - birth weight (g) 👶. This will be the outcome we will be predicting using a variety of linear regression models
  • CIG_REC - smoked during pregnancy 🚭
  • risks - risk factors reported
  • MAGER - mother’s age 🕜
  • MRACEHISP - mother’s race/ethnicity
  • DMAR - marital status💍
  • MEDUC - mother’s education
  • PRIORLIVE - prior births now living
  • PRIORDEAD - prior births now dead
  • BMI - mother’s body mass index, (kg/m2)

Descriptive statistics

  1. Use the table1 package to create a table with appropriate descriptive information about the data set. Use the examples provided in the following vignette to help you build a table. Bonus: Provide a brief description, interpretation or reflection about your findings from the table: (e.g., what do you notice? what do you wonder? how does this relate to you or your community?)
Maternal demographics stratified by smoking
Non smoker
(N=1564)
Smoker
(N=124)
Birth Weight (g)
Mean (SD) 3250 (601) 3070 (597)
Median [Min, Max] 3290 [369, 5880] 3090 [630, 4220]
Risks
No 1088 (69.6%) 84 (67.7%)
Yes 476 (30.4%) 40 (32.3%)
Mother Age
Mean (SD) 29.0 (5.79) 27.8 (5.79)
Median [Min, Max] 29.0 [15.0, 49.0] 28.0 [17.0, 42.0]
Mother Race/Hispanic Origin
NH White 844 (54.0%) 94 (75.8%)
NH Black 266 (17.0%) 14 (11.3%)
NH Other 125 (8.0%) 10 (8.1%)
Hispanic 329 (21.0%) 6 (4.8%)
Marital Status
Married 1017 (65.0%) 40 (32.3%)
Unmarried 547 (35.0%) 84 (67.7%)
Mother Education
lower 177 (11.3%) 34 (27.4%)
HS 377 (24.1%) 45 (36.3%)
Some college 437 (27.9%) 37 (29.8%)
Uni 573 (36.6%) 8 (6.5%)
Prior births (living)
0 601 (38.4%) 38 (30.6%)
1 516 (33.0%) 41 (33.1%)
2 265 (16.9%) 17 (13.7%)
3 93 (5.9%) 17 (13.7%)
4+ 89 (5.7%) 11 (8.9%)
Prior births (dead)
0 1551 (99.2%) 123 (99.2%)
1+ 13 (0.8%) 1 (0.8%)
Mother BMI
Mean (SD) 26.9 (6.32) 27.3 (6.42)
Median [Min, Max] 25.6 [15.6, 58.4] 26.3 [16.5, 50.1]




  1. Use the ggpairs function to explore the association between the different variables. Try to replicate the figure below, or create your own variation. Bonus: Provide a brief description, interpretation or reflection about your findings from the table: (e.g., what do you notice? what do you wonder? how does this relate to you or your community?)

Linear regression

In the following exercises, you will assess the association between birth weight (the outcome) and smoking during pregnancy (the primary predictor), adjusted for risk factors reported, mother’s age, mother’s race/ethnicity, marital status, mother’s education, prior births now living and prior births now dead, and mother’s body mass index.

  1. Fit four linear models, each nested within the other: the null model, the unadjusted model, the full model and the optimal model. Use the step function find the optimal model using the stepwise method. Then use anova to compare the four models. What is your conclusion? Now compare your models using modelsummary function. What does the number in the (brackets) mean? How can you identify the optimal model just by looking at the \(R^2\) and the AIC?
M.NULL <- lm(DBWT ~ 1, natality)
M.UNADJUSTED <- lm(DBWT ~ CIG_REC, natality)
# The dot represents all variables except the outcome DBWT
M.FULL <- lm(DBWT ~ ., natality)  

# Finding the optimal model
step(M.NULL,
     scope = list(upper = M.FULL),
     direction="both",
     data=natality)


# Displaying the four models by each other
modelsummary(
  list(
    null = M.NULL, 
    unadjasted = M.UNADJUSTED, 
    ___ = ___, 
    ___ = ___
    ), 
  stars = TRUE
  )

Stepwise regression is a systematic method for adding and removing terms from a linear or generalized linear model based on their statistical significance in explaining the response variable. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models.

The step function uses forward and backward stepwise regression to determine a final model. At each step, the function searches for terms to add to the model or remove from the model based on the value of the ‘Criterion’ name-value pair argument.

Use the step function to run through different models - the last model shown is the one with the smallest Akaike information criterion (AIC)

null unadjusted optimal full
(Intercept) 3234.746 (14.658)*** 3247.903 (15.187)*** 3080.559 (64.625)*** 3162.373 (107.021)***
CIG_RECYes −179.096 (56.032)** −160.083 (55.632)** −156.593 (56.325)**
DMARUnmarried −184.275 (31.823)*** −189.782 (35.084)***
MRACEHISPNH Black −292.140 (41.690)*** −290.175 (41.829)***
MRACEHISPNH Other −121.695 (52.796)* −120.774 (52.960)*
MRACEHISPHispanic −74.492 (38.075)+ −72.464 (39.120)+
BMI 10.261 (2.259)*** 10.493 (2.277)***
risksYes −158.808 (31.655)*** −152.902 (32.055)***
PRIORLIVE1 130.686 (33.814)*** 140.483 (34.725)***
PRIORLIVE2 107.170 (41.746)* 123.385 (43.880)**
PRIORLIVE3 138.776 (59.972)* 161.298 (63.149)*
PRIORLIVE4+ 131.949 (62.333)* 162.508 (67.562)*
PRIORDEAD1+ 258.434 (153.905)+ 265.467 (154.184)+
MAGER −3.836 (3.085)
MEDUCHS 8.448 (49.909)
MEDUCSome college −0.240 (50.838)
MEDUCUni 31.610 (56.902)
Num.Obs. 1688 1688 1688 1688
R2 0.000 0.006 0.104 0.105
R2 Adj. 0.000 0.005 0.098 0.097
AIC 26401.9 26393.7 26239.9 26246.1
BIC 26412.7 26410.0 26315.9 26343.8
Log.Lik. −13198.942 −13193.843 −13105.946 −13105.039
F 10.217 16.259 12.291
RMSE 602.05 600.23 569.78 569.47

Identify the largest coefficient in the optimal model (on an absolute scale). Do you think it has the largest effect on the birth-weight?

Not necessarily! Each variable has a different scale with different boundaries, so coefficients cannot be compared when they are not on a common scale. The model only tells us which parameters are statistically different from zero, but it does not allow us to compare between them.

This situation is why Gelman recommends centering each continuous variable (that is, subtracting the mean from each value) and scaling it by two standard deviations (that is - dividing it by two standard deviations). You may often hear people scaling by just one standard deviation. This is useful; the regression coefficient that emerges comes from inputs where 0 is the mean, communicates magnitude effects that might be more interesting, and it will make the intercept interpretable as well. However, scaling by two standard deviations does all that and has the helpful side effect of putting everything on a roughly common scale. Basically, scaling by two standard deviations results in variables that have a mean 0 and a standard deviation of .5. Binary variables usually also have standard deviations close to 0.5. This may not be perfect, but it’s really good and often close.

  1. Center and scale all the continuous variables, and create a forest plot for the optimal model. Bonus: Provide a brief description, interpretation or reflection about what you see in the forest plot (e.g., what do you notice? what do you wonder?)
# Scaling
natality.ctr <- natality |> 
  standardize(two_sd = TRUE)

# Fitting the model
M.OPT.ctr <- lm(
  DBWT ~ DMAR + MRACEHISP + BMI + 
    risks + PRIORLIVE + CIG_REC + 
    PRIORDEAD, 
  data = natality.ctr
  )

# creating a forest plot
M.OPT.ctr %>% 
  model_parameters() %>% 
  plot()
  1. How do the effect estimate of smoking, the standard error of its estimate, its 95% confidence interval, and its statistical significance change between the un-adjusted, optimal and full models? Please use the centered and scaled models to answer this question, adding a brief description, interpretation or reflection about what your findings.

You can display the various statistics using M.FULL |> tidy(conf.int = TRUE).

model CIG_RECYes std.error statistic p.value conf.low conf.high
Unadjasted −0.149 0.047 −3.196 1.42 × 10−3 −0.240 −0.057
Optimal −0.133 0.046 −2.878 4.06 × 10−3 −0.224 −0.042
Full −0.130 0.047 −2.780 5.49 × 10−3 −0.222 −0.038

Exploring interactions

  1. Does the association between cigarette smoking during pregnancy and birth weight depend on mother’s education? Modify the FULL model appropriately to be able to answer this question. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, create a dumbell plot like the ones shown below, illustrating the smoking effect at each level of mother’s education and estimate the cigarette smoking effect at each level of mother’s education.

Notes: You may get a different plot than the ones shown below, depending on the way you set your models. We will be using the marginaleffects package to create the plots. You need only to create one of the plots, but bonus for creating ALL the plots. Bonus: Provide a brief description, interpretation or reflection about what you see in the plot (e.g., what do you notice? what do you wonder?)

To create the dumbbell figure, you will need to predict the birth weight for smoking and non-smoking mums with different education levels. Here is the code for the calculations you can use (this uses a three-way interaction, which is not strictly requested in the question above):

library(marginaleffects)
M.FULL_int.ctr <- lm(formula = DBWT ~  ___, data = natality.ctr)

predictions(M.FULL_int.ctr, 
            newdata = datagrid(
              MRACEHISP = c("NH White", "NH Black", "NH Other", "Hispanic"),
              MEDUC   = c("lower", "HS", "Some college", "Uni"),
              CIG_REC = c("No", "Yes")
              )
            ) |> 
  as.data.frame() |> 
  mutate(DBWT = estimate)  |> 
  select(DBWT, MRACEHISP, CIG_REC, MEDUC) 

Once you have your data, here is how you create your dumbells (for the race NH White, for example):

tribble(
  ~Education, ~No_Smoking, ~Smoking, 
  "Lower",    0.06227469,  0.04447809, 
  "HS",       0.08854779,  -0.05014673, 
  # ___,        ___,         ___, 
  # ___,        ___,         ___
  ) |> ggplot() + 
  geom_segment(
    aes(
      x = "No Smoking", y = No_Smoking, 
      xend = "Smoking", yend = Smoking, 
      color = Education
      ),
    linewidth = 2
    ) + 
  geom_point(
    aes(x = "No Smoking", y = No_Smoking, color = Education), 
    size = 6
    ) + 
  geom_point(
    aes(x = "Smoking", y = Smoking, color = Education), 
    size = 6
    )
  1. Does the association between mother’s BMI and birth-weight differ between mothers of different race/ethnicity? Modify the OPTIMAL model appropriately to be able to answer this question, using the dataset with the centered variables. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, estimate the BMI effect at each level of mother’s race/ethnicity and plot the BMI effect at each level of mother’s race/ethnicity using a dumbell plot. Bonus: Provide a brief description, interpretation or reflection about what you see in the plot (e.g., what do you notice? what do you wonder?)

Note: Since we have scaled the continuous variable BMI by TWO standard deviations, you will need to estimate the right tip of the dumbbell at two units and not at one unit.

  1. Does the association between mother’s BMI and birth-weight depend on mother’s age? Modify the FULL model appropriately to be able to answer this question, using the dataset with the centered variables. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, estimate the BMI effect for mothers age 20, 30, and 40 years and make a plot illustrating the BMI effect at each of these ages. Bonus: Provide a brief description, interpretation or reflection about what you see in the plot (e.g., what do you notice? what do you wonder?)