Generalized Linear Mixed Models (GLMM)

Binary Outcomes (GLMM)

Apply generalized linear mixed models for binary repeated measures outcomes, modeling the probability of events like substance use initiation over time while accounting for individual heterogeneity.

GLMM โ€บ Binary
glmmTMBabcd-studygeneralized-linear-mixed-modelbinary-outcomeslogistic-regression
Work in ProgressExamples are a work in progress. Please exercise caution when using code examples, as they may not be fully verified. If you spot gaps, errors, or have suggestions, we'd love your feedbackโ€”use the "Suggest changes" button to help us improve!

Overview

Generalized Linear Mixed Models (GLMMs) extend linear mixed models to handle non-continuous outcomes through link functions that connect the linear predictor to the expected value of the outcome. For binary outcomes (yes/no, present/absent), the logit link transforms probabilities into log-odds, enabling modeling of repeated binary measurements while accounting for the correlation structure induced by repeated observations on the same individuals. This tutorial applies GLMM with a logit link to analyze substance use initiation across ABCD assessments, demonstrating how to specify random effects structures, interpret odds ratios, and translate results to the probability scale for substantive interpretation.

When to Use:
Apply when your outcome is binary (0/1) and measured repeatedly over time, and you want to model the probability of the event while accounting for individual differences in baseline risk and change over time.
Key Advantage:
GLMM handles the bounded nature of probabilities correctly, provides odds ratios for interpretation, and allows both random intercepts (individual differences in baseline risk) and random slopes (individual differences in change).
What You'll Learn:
How to specify binary GLMMs using glmmTMB, interpret fixed effects as odds ratios, understand marginal vs. conditional effects, and visualize predicted probabilities over time.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

  • Automatic data joining - Merges variables from multiple tables automatically
  • Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
  • Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Key Parameters
  • vars - Vector of variable names to load
  • release - ABCD data release version (e.g., "6.0")
  • format - File format, typically "parquet" for efficiency
  • categ_to_factor - Automatically converts categorical variables to factors
  • value_to_na - Converts ABCD missing value codes to R's NA
  • add_labels - Adds descriptive labels to variables and values
Additional NBDCtools Resources

For more details on using NBDCtools:

Data Preparation

NBDCtools Setup and Data Loading
R31 lines
### Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(arrow)        # For reading Parquet files
library(tidyverse)    # For data manipulation & visualization
library(gtsummary)    # For generating publication-quality summary tables
library(glmmTMB)      # Generalized linear mixed models
library(lme4)         # Alternative GLMM fitting
library(broom.mixed)  # Tidy output for mixed models
library(gt)           # For creating formatted tables
library(emmeans)      # For marginal means and contrasts

### Load harmonized ABCD data required for this analysis
requested_vars <- c(
  "ab_g_dyn__design_site",
  "ab_g_stc__design_id__fam",
  "su_y_tlfb__alc__1mo_ud",       # Alcohol use days (last 30 days)
  "ab_g_stc__cohort_sex"          # Sex (time-invariant covariate)
)

data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,
  value_to_na = TRUE,
  add_labels = TRUE
)
Data Transformation
R36 lines
# Prepare data for GLMM analysis
df_long <- abcd_data %>%
  # Filter to available annual assessments using NBDCtools
  filter_events_abcd(conditions = c("annual")) %>%
  # Rename variables for clarity
  rename(
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
    alcohol_use = su_y_tlfb__alc__1mo_ud,
    sex = ab_g_stc__cohort_sex
  ) %>%
  # Create binary outcome: any alcohol use vs. none
  mutate(
    any_use = as.integer(alcohol_use > 0)
  ) %>%
  # Keep only participants with at least 2 observations
  group_by(participant_id) %>%
  filter(sum(!is.na(any_use)) >= 2) %>%
  ungroup() %>%
  drop_na(any_use, sex) %>%
  # Create numeric time variable from session_id
  mutate(
    time = as.numeric(session_id) - 1,
    # Convert sex codes to labeled factor (1=Male, 2=Female in ABCD)
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Check the prevalence of use at each wave
df_long %>%
  group_by(session_id) %>%
  summarise(
    n = n(),
    n_users = sum(any_use),
    prevalence = mean(any_use),
    .groups = "drop"
  )
Descriptive Statistics
R26 lines
# Create prevalence table by wave
prevalence_table <- df_long %>%
  mutate(any_use = factor(any_use, levels = c(0, 1), labels = c("No Use", "Any Use"))) %>%
  select(session_id, any_use) %>%
  tbl_summary(
    by = session_id,
    missing = "no",
    label = list(
      any_use ~ "Alcohol Use"
    )
  ) %>%
  modify_header(all_stat_cols() ~ "**{level}**<br>N = {n}") %>%
  modify_spanning_header(all_stat_cols() ~ "**Assessment Wave**") %>%
  bold_labels() %>%
  italicize_levels()

### Apply compact styling
theme_gtsummary_compact()

prevalence_table <- as_gt(prevalence_table)

### Save the table as HTML
gt::gtsave(prevalence_table, filename = "prevalence_table.html")

### Print the table
prevalence_table
Characteristic
Assessment Wave
ses-00A
N = 11541
1
ses-01A
N = 11213
1
ses-02A
N = 10926
1
ses-03A
N = 10400
1
ses-04A
N = 9638
1
ses-05A
N = 8855
1
ses-06A
N = 5026
1
Alcohol Use






ย ย ย ย No Use 11,537 (100%) 11,209 (100%) 10,920 (100%) 10,360 (100%) 9,486 (98%) 8,524 (96%) 4,626 (92%)
ย ย ย ย Any Use 4 (<0.1%) 4 (<0.1%) 6 (<0.1%) 40 (0.4%) 152 (1.6%) 331 (3.7%) 400 (8.0%)
1 n (%)

Statistical Analysis

Model 1: Random Intercept Only
R9 lines
# Fit GLMM with random intercept only
# This allows individuals to differ in baseline risk (log-odds)
m1 <- glmmTMB(
  any_use ~ time + (1 | site:family_id:participant_id),
  family = binomial(link = "logit"),
  data = df_long
)

summary(m1)
Model 2: Random Intercept + Random Slope
R9 lines
# Add random slope for time
# This allows individuals to differ in how their risk changes over time
m2 <- glmmTMB(
  any_use ~ time + (1 + time | site:family_id:participant_id),
  family = binomial(link = "logit"),
  data = df_long
)

summary(m2)
Model 3: Add Time-Invariant Covariate (Sex)
R20 lines
# Add sex as a predictor
m3 <- glmmTMB(
  any_use ~ time + sex + (1 + time | site:family_id:participant_id),
  family = binomial(link = "logit"),
  data = df_long
)

# Full summary
summary(m3)

# Create formatted coefficient table
model_summary_table <- broom.mixed::tidy(m3, conf.int = TRUE, exponentiate = FALSE) %>%
  filter(effect == "fixed") %>%
  select(term, estimate, std.error, statistic, p.value, conf.low, conf.high) %>%
  gt() %>%
  tab_header(title = "GLMM for Binary Outcomes: Fixed Effects (Log-Odds Scale)") %>%
  fmt_number(columns = c(estimate, std.error, statistic, conf.low, conf.high), decimals = 3) %>%
  fmt_number(columns = p.value, decimals = 4)

gt::gtsave(model_summary_table, filename = "model_summary.html")
GLMM for Binary Outcomes: Fixed Effects (Log-Odds Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) โˆ’13.609 0.605 โˆ’22.496 0.0000 โˆ’14.794 โˆ’12.423
time 0.661 0.105 6.276 0.0000 0.454 0.867
sexFemale 0.128 0.234 0.546 0.5847 โˆ’0.331 0.587
Odds Ratios
R24 lines
# Convert to odds ratios for interpretation
odds_ratios <- broom.mixed::tidy(m3, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(effect == "fixed") %>%
  select(term, estimate, conf.low, conf.high, p.value) %>%
  rename(
    OR = estimate,
    CI_lower = conf.low,
    CI_upper = conf.high
  )

or_table <- odds_ratios %>%
  gt() %>%
  tab_header(title = "Odds Ratios with 95% Confidence Intervals") %>%
  fmt_number(columns = c(OR, CI_lower, CI_upper), decimals = 3) %>%
  fmt_number(columns = p.value, decimals = 4) %>%
  cols_label(
    term = "Predictor",
    OR = "Odds Ratio",
    CI_lower = "95% CI Lower",
    CI_upper = "95% CI Upper",
    p.value = "p-value"
  )

gt::gtsave(or_table, filename = "odds_ratios.html")
Odds Ratios with 95% Confidence Intervals
Predictor Odds Ratio 95% CI Lower 95% CI Upper p-value
(Intercept) 0.000 0.000 0.000 0.0000
time 1.937 1.575 2.381 0.0000
sexFemale 1.137 0.718 1.799 0.5847
Interpretation
Interpretation

The GLMM estimated a conditional odds ratio for time of 1.94 (95% CI: 1.58โ€“2.38, p < .001), indicating that the odds of any alcohol use nearly doubled with each assessment wave for a given individual. The sex effect was not significant (OR = 1.14, 95% CI: 0.72โ€“1.80, p = .585), suggesting no reliable difference between males and females in alcohol use probability after accounting for individual heterogeneity.

Model comparison strongly favored adding random slopes (M2 AIC = 5,539) over random intercepts alone (M1 AIC = 6,067), indicating that individuals differed substantially in their rate of increase in alcohol use probability. Adding sex as a covariate (M3 AIC = 5,541) did not meaningfully improve fit.

Conditional vs. marginal interpretation: These odds ratios are conditional on the random effects (subject-specific). The population-averaged (marginal) effects would be attenuated because the random intercept variance is absorbed. The prevalence data confirm the trajectory: alcohol use rose from near-zero at Baseline (<0.1%) to 8.0% by Year 6, consistent with the large conditional time effect operating across a population where most youth remain abstinent throughout the study period.

Model Comparison
R14 lines
# Compare models using AIC/BIC
model_comparison <- data.frame(
  Model = c("M1: Random Intercept", "M2: + Random Slope", "M3: + Sex Covariate"),
  AIC = c(AIC(m1), AIC(m2), AIC(m3)),
  BIC = c(BIC(m1), BIC(m2), BIC(m3)),
  LogLik = c(logLik(m1), logLik(m2), logLik(m3))
)

comparison_table <- model_comparison %>%
  gt() %>%
  tab_header(title = "Model Comparison") %>%
  fmt_number(columns = c(AIC, BIC, LogLik), decimals = 2)

gt::gtsave(comparison_table, filename = "model_comparison.html")
Model Comparison
Model AIC BIC LogLik
M1: Random Intercept 6,067.16 6,094.52 โˆ’3,030.58
M2: + Random Slope 5,538.84 5,584.45 โˆ’2,764.42
M3: + Sex Covariate 5,540.54 5,595.27 โˆ’2,764.27
Predicted Probabilities Visualization
R46 lines
# Generate predicted probabilities across time by sex
# Create prediction grid
newdata <- expand.grid(
  time = seq(0, 3, by = 0.5),
  sex = c("Male", "Female")
)

# Get marginal predictions (averaging over random effects)
# Using emmeans for marginal means
emm <- emmeans(m3, ~ time + sex, at = list(time = seq(0, 3, by = 0.5)),
               type = "response")
pred_df <- as.data.frame(emm)

# Create probability plot
prob_plot <- ggplot(pred_df, aes(x = time, y = prob, color = sex, fill = sex)) +
  geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(
    breaks = 0:3,
    labels = c("Baseline", "Year 2", "Year 4", "Year 6")
  ) +
  scale_y_continuous(
    limits = c(0, 1),
    labels = scales::percent_format()
  ) +
  scale_color_manual(values = c("Male" = "#2E86AB", "Female" = "#A23B72")) +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#A23B72")) +
  labs(
    title = "Predicted Probability of Any Alcohol Use Over Time",
    subtitle = "Marginal predictions from GLMM with 95% confidence intervals",
    x = "Assessment Wave",
    y = "Probability of Any Use",
    color = "Sex",
    fill = "Sex"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

prob_plot

ggsave(
  filename = "visualization.png",
  plot = prob_plot,
  width = 8, height = 6, dpi = 300
)
Predicted Probabilities of Alcohol Use
Interpretation
Visualization Notes

The plot displays marginal predicted probabilities - the average probability of alcohol use at each time point, accounting for the random effects distribution. The shaded regions represent 95% confidence intervals. This visualization makes the time trend and sex differences more interpretable than log-odds or odds ratios, as probabilities are bounded between 0 and 1 and directly represent the likelihood of the outcome.

Note that these marginal probabilities will be closer to 0.5 than the conditional (subject-specific) probabilities would be, because averaging over the random effects distribution compresses the probability scale. Individual participants may have substantially higher or lower probabilities depending on their random intercept value.

Discussion

The analysis reveals a steep developmental increase in alcohol use probability across adolescence, with odds nearly doubling per biennial assessment wave. Despite this strong time effect, prevalence remained low overall โ€” rising from near-zero at Baseline to roughly 8% by Year 6 โ€” so the large conditional odds ratio reflects rapid relative change from a very low base rather than high absolute risk. Sex did not significantly predict alcohol use in this sample after accounting for individual heterogeneity.

The substantial random slope variance (M2 AIC improvement of ~500 over M1) indicates that adolescents follow markedly different trajectories โ€” some transition to use earlier and more steeply than others. This heterogeneity is a natural target for conditional models: adding time-invariant predictors (e.g., family risk factors, neighborhood characteristics) could help explain who transitions earlier. Time-varying covariates (e.g., peer substance use, stress) could further clarify within-person drivers of use onset.

The distinction between conditional and marginal effects is particularly consequential here because the random intercept variance is large relative to the fixed effects. Population-averaged estimates from a GEE model would show a more attenuated time effect, reflecting the fact that most youth remain abstinent throughout the study period. Both perspectives โ€” individual-level risk trajectories (GLMM) and population-level prevalence trends (GEE) โ€” are informative for different policy and intervention questions.

Additional Resources

4

glmmTMB Package Documentation

DOCS

Comprehensive documentation for the glmmTMB package, covering model specification for binary and other non-Gaussian outcomes, including random effects structures and model diagnostics.

Visit Resource

Agresti: Categorical Data Analysis

BOOK

Foundational textbook covering generalized linear models for categorical outcomes, including detailed treatment of logistic regression, random effects models, and interpretation of odds ratios.

Visit Resource

emmeans Package for Marginal Means

TOOL

R package for computing estimated marginal means from fitted models, essential for obtaining predicted probabilities and pairwise comparisons from GLMMs.

Visit Resource

Bolker et al. (2009): GLMMs in Ecology

PAPER

Influential paper providing practical guidance on fitting and interpreting GLMMs, including discussion of random effects interpretation and model selection for binary outcomes.

Visit Resource