Overview

Generalized Linear Mixed Models (GLMMs) extend linear mixed models to handle non-continuous outcomes through link functions that connect the linear predictor to the expected value of the outcome. For binary outcomes (yes/no, present/absent), the logit link transforms probabilities into log-odds, enabling modeling of repeated binary measurements while accounting for the correlation structure induced by repeated observations on the same individuals. This tutorial applies GLMM with a logit link to analyze substance use initiation across ABCD assessments, demonstrating how to specify random effects structures, interpret odds ratios, and translate results to the probability scale for substantive interpretation.

When to Use:

Apply when your outcome is binary (0/1) and measured repeatedly over time, and you want to model the probability of the event while accounting for individual differences in baseline risk and change over time.

Key Advantage:

GLMM handles the bounded nature of probabilities correctly, provides odds ratios for interpretation, and allows both random intercepts (individual differences in baseline risk) and random slopes (individual differences in change).

What You'll Learn:

How to specify binary GLMMs using glmmTMB, interpret fixed effects as odds ratios, understand marginal vs. conditional effects, and visualize predicted probabilities over time.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

Automatic data joining - Merges variables from multiple tables automatically
Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)

Key Parameters

vars - Vector of variable names to load
release - ABCD data release version (e.g., "6.0")
format - File format, typically "parquet" for efficiency
categ_to_factor - Automatically converts categorical variables to factors
value_to_na - Converts ABCD missing value codes to R's NA
add_labels - Adds descriptive labels to variables and values

Additional NBDCtools Resources

For more details on using NBDCtools:

NBDCtools Getting Started Guide - Complete package overview
Joining Data - Advanced data merging strategies
Filtering Events - Selecting specific assessment waves
Data Transformations - Preprocessing and cleaning

Data Preparation

NBDCtools Setup and Data Loading

31 lines

### Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(arrow)        # For reading Parquet files
library(tidyverse)    # For data manipulation & visualization
library(gtsummary)    # For generating publication-quality summary tables
library(glmmTMB)      # Generalized linear mixed models
library(lme4)         # Alternative GLMM fitting
library(broom.mixed)  # Tidy output for mixed models
library(gt)           # For creating formatted tables
library(emmeans)      # For marginal means and contrasts

### Load harmonized ABCD data required for this analysis
requested_vars <- c(
  "ab_g_dyn__design_site",
  "ab_g_stc__design_id__fam",
  "su_y_tlfb__alc__1mo_ud",       # Alcohol use days (last 30 days)
  "ab_g_stc__cohort_sex"          # Sex (time-invariant covariate)
)

data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,
  value_to_na = TRUE,
  add_labels = TRUE
)

Show all 31 linesShow less

Data Transformation

36 lines

# Prepare data for GLMM analysis
df_long <- abcd_data %>%
  # Filter to available annual assessments using NBDCtools
  filter_events_abcd(conditions = c("annual")) %>%
  # Rename variables for clarity
  rename(
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
    alcohol_use = su_y_tlfb__alc__1mo_ud,
    sex = ab_g_stc__cohort_sex
  ) %>%
  # Create binary outcome: any alcohol use vs. none
  mutate(
    any_use = as.integer(alcohol_use > 0)
  ) %>%
  # Keep only participants with at least 2 observations
  group_by(participant_id) %>%
  filter(sum(!is.na(any_use)) >= 2) %>%
  ungroup() %>%
  drop_na(any_use, sex) %>%
  # Create numeric time variable from session_id
  mutate(
    time = as.numeric(session_id) - 1,
    # Convert sex codes to labeled factor (1=Male, 2=Female in ABCD)
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Check the prevalence of use at each wave
df_long %>%
  group_by(session_id) %>%
  summarise(
    n = n(),
    n_users = sum(any_use),
    prevalence = mean(any_use),
    .groups = "drop"
  )

Show all 36 linesShow less

Descriptive Statistics

26 lines

# Create prevalence table by wave
prevalence_table <- df_long %>%
  mutate(any_use = factor(any_use, levels = c(0, 1), labels = c("No Use", "Any Use"))) %>%
  select(session_id, any_use) %>%
  tbl_summary(
    by = session_id,
    missing = "no",
    label = list(
      any_use ~ "Alcohol Use"
    )
  ) %>%
  modify_header(all_stat_cols() ~ "**{level}**<br>N = {n}") %>%
  modify_spanning_header(all_stat_cols() ~ "**Assessment Wave**") %>%
  bold_labels() %>%
  italicize_levels()

### Apply compact styling
theme_gtsummary_compact()

prevalence_table <- as_gt(prevalence_table)

### Save the table as HTML
gt::gtsave(prevalence_table, filename = "prevalence_table.html")

### Print the table
prevalence_table

Show all 26 linesShow less

Characteristic	Assessment Wave
Characteristic	ses-00A N = 11541¹	ses-01A N = 11213¹	ses-02A N = 10926¹	ses-03A N = 10400¹	ses-04A N = 9638¹	ses-05A N = 8855¹	ses-06A N = 5026¹
Alcohol Use
No Use	11,537 (100%)	11,209 (100%)	10,920 (100%)	10,360 (100%)	9,486 (98%)	8,524 (96%)	4,626 (92%)
Any Use	4 (<0.1%)	4 (<0.1%)	6 (<0.1%)	40 (0.4%)	152 (1.6%)	331 (3.7%)	400 (8.0%)
¹ n (%)

Statistical Analysis

Model 1: Random Intercept Only

9 lines

# Fit GLMM with random intercept only
# This allows individuals to differ in baseline risk (log-odds)
m1 <- glmmTMB(
  any_use ~ time + (1 | site:family_id:participant_id),
  family = binomial(link = "logit"),
  data = df_long
)

summary(m1)

Model 2: Random Intercept + Random Slope

9 lines

# Add random slope for time
# This allows individuals to differ in how their risk changes over time
m2 <- glmmTMB(
  any_use ~ time + (1 + time | site:family_id:participant_id),
  family = binomial(link = "logit"),
  data = df_long
)

summary(m2)

Model 3: Add Time-Invariant Covariate (Sex)

20 lines

# Add sex as a predictor
m3 <- glmmTMB(
  any_use ~ time + sex + (1 + time | site:family_id:participant_id),
  family = binomial(link = "logit"),
  data = df_long
)

# Full summary
summary(m3)

# Create formatted coefficient table
model_summary_table <- broom.mixed::tidy(m3, conf.int = TRUE, exponentiate = FALSE) %>%
  filter(effect == "fixed") %>%
  select(term, estimate, std.error, statistic, p.value, conf.low, conf.high) %>%
  gt() %>%
  tab_header(title = "GLMM for Binary Outcomes: Fixed Effects (Log-Odds Scale)") %>%
  fmt_number(columns = c(estimate, std.error, statistic, conf.low, conf.high), decimals = 3) %>%
  fmt_number(columns = p.value, decimals = 4)

gt::gtsave(model_summary_table, filename = "model_summary.html")

Show all 20 linesShow less

term	estimate	std.error	statistic	p.value	conf.low	conf.high
GLMM for Binary Outcomes: Fixed Effects (Log-Odds Scale)
(Intercept)	−13.609	0.605	−22.496	0.0000	−14.794	−12.423
time	0.661	0.105	6.276	0.0000	0.454	0.867
sexFemale	0.128	0.234	0.546	0.5847	−0.331	0.587

Odds Ratios

24 lines

# Convert to odds ratios for interpretation
odds_ratios <- broom.mixed::tidy(m3, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(effect == "fixed") %>%
  select(term, estimate, conf.low, conf.high, p.value) %>%
  rename(
    OR = estimate,
    CI_lower = conf.low,
    CI_upper = conf.high
  )

or_table <- odds_ratios %>%
  gt() %>%
  tab_header(title = "Odds Ratios with 95% Confidence Intervals") %>%
  fmt_number(columns = c(OR, CI_lower, CI_upper), decimals = 3) %>%
  fmt_number(columns = p.value, decimals = 4) %>%
  cols_label(
    term = "Predictor",
    OR = "Odds Ratio",
    CI_lower = "95% CI Lower",
    CI_upper = "95% CI Upper",
    p.value = "p-value"
  )

gt::gtsave(or_table, filename = "odds_ratios.html")

Show all 24 linesShow less

Predictor	Odds Ratio	95% CI Lower	95% CI Upper	p-value
Odds Ratios with 95% Confidence Intervals
(Intercept)	0.000	0.000	0.000	0.0000
time	1.937	1.575	2.381	0.0000
sexFemale	1.137	0.718	1.799	0.5847

Interpretation

The GLMM estimated a conditional odds ratio for time of 1.94 (95% CI: 1.58–2.38, p < .001), indicating that the odds of any alcohol use nearly doubled with each assessment wave for a given individual. The sex effect was not significant (OR = 1.14, 95% CI: 0.72–1.80, p = .585), suggesting no reliable difference between males and females in alcohol use probability after accounting for individual heterogeneity.

Model comparison strongly favored adding random slopes (M2 AIC = 5,539) over random intercepts alone (M1 AIC = 6,067), indicating that individuals differed substantially in their rate of increase in alcohol use probability. Adding sex as a covariate (M3 AIC = 5,541) did not meaningfully improve fit.

Conditional vs. marginal interpretation: These odds ratios are conditional on the random effects (subject-specific). The population-averaged (marginal) effects would be attenuated because the random intercept variance is absorbed. The prevalence data confirm the trajectory: alcohol use rose from near-zero at Baseline (<0.1%) to 8.0% by Year 6, consistent with the large conditional time effect operating across a population where most youth remain abstinent throughout the study period.

Model Comparison

14 lines

# Compare models using AIC/BIC
model_comparison <- data.frame(
  Model = c("M1: Random Intercept", "M2: + Random Slope", "M3: + Sex Covariate"),
  AIC = c(AIC(m1), AIC(m2), AIC(m3)),
  BIC = c(BIC(m1), BIC(m2), BIC(m3)),
  LogLik = c(logLik(m1), logLik(m2), logLik(m3))
)

comparison_table <- model_comparison %>%
  gt() %>%
  tab_header(title = "Model Comparison") %>%
  fmt_number(columns = c(AIC, BIC, LogLik), decimals = 2)

gt::gtsave(comparison_table, filename = "model_comparison.html")

Model	AIC	BIC	LogLik
Model Comparison
M1: Random Intercept	6,067.16	6,094.52	−3,030.58
M2: + Random Slope	5,538.84	5,584.45	−2,764.42
M3: + Sex Covariate	5,540.54	5,595.27	−2,764.27

Predicted Probabilities Visualization

46 lines

# Generate predicted probabilities across time by sex
# Create prediction grid
newdata <- expand.grid(
  time = seq(0, 3, by = 0.5),
  sex = c("Male", "Female")
)

# Get marginal predictions (averaging over random effects)
# Using emmeans for marginal means
emm <- emmeans(m3, ~ time + sex, at = list(time = seq(0, 3, by = 0.5)),
               type = "response")
pred_df <- as.data.frame(emm)

# Create probability plot
prob_plot <- ggplot(pred_df, aes(x = time, y = prob, color = sex, fill = sex)) +
  geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  scale_x_continuous(
    breaks = 0:3,
    labels = c("Baseline", "Year 2", "Year 4", "Year 6")
  ) +
  scale_y_continuous(
    limits = c(0, 1),
    labels = scales::percent_format()
  ) +
  scale_color_manual(values = c("Male" = "#2E86AB", "Female" = "#A23B72")) +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#A23B72")) +
  labs(
    title = "Predicted Probability of Any Alcohol Use Over Time",
    subtitle = "Marginal predictions from GLMM with 95% confidence intervals",
    x = "Assessment Wave",
    y = "Probability of Any Use",
    color = "Sex",
    fill = "Sex"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

prob_plot

ggsave(
  filename = "visualization.png",
  plot = prob_plot,
  width = 8, height = 6, dpi = 300
)

Show all 46 linesShow less

Visualization Notes

The plot displays marginal predicted probabilities - the average probability of alcohol use at each time point, accounting for the random effects distribution. The shaded regions represent 95% confidence intervals. This visualization makes the time trend and sex differences more interpretable than log-odds or odds ratios, as probabilities are bounded between 0 and 1 and directly represent the likelihood of the outcome.

Note that these marginal probabilities will be closer to 0.5 than the conditional (subject-specific) probabilities would be, because averaging over the random effects distribution compresses the probability scale. Individual participants may have substantially higher or lower probabilities depending on their random intercept value.

Discussion

The analysis reveals a steep developmental increase in alcohol use probability across adolescence, with odds nearly doubling per biennial assessment wave. Despite this strong time effect, prevalence remained low overall — rising from near-zero at Baseline to roughly 8% by Year 6 — so the large conditional odds ratio reflects rapid relative change from a very low base rather than high absolute risk. Sex did not significantly predict alcohol use in this sample after accounting for individual heterogeneity.

The substantial random slope variance (M2 AIC improvement of ~500 over M1) indicates that adolescents follow markedly different trajectories — some transition to use earlier and more steeply than others. This heterogeneity is a natural target for conditional models: adding time-invariant predictors (e.g., family risk factors, neighborhood characteristics) could help explain who transitions earlier. Time-varying covariates (e.g., peer substance use, stress) could further clarify within-person drivers of use onset.

The distinction between conditional and marginal effects is particularly consequential here because the random intercept variance is large relative to the fixed effects. Population-averaged estimates from a GEE model would show a more attenuated time effect, reflecting the fact that most youth remain abstinent throughout the study period. Both perspectives — individual-level risk trajectories (GLMM) and population-level prevalence trends (GEE) — are informative for different policy and intervention questions.

Additional Resources

glmmTMB Package Documentation

DOCS

Comprehensive documentation for the glmmTMB package, covering model specification for binary and other non-Gaussian outcomes, including random effects structures and model diagnostics.

Visit Resource

Agresti: Categorical Data Analysis

BOOK

Foundational textbook covering generalized linear models for categorical outcomes, including detailed treatment of logistic regression, random effects models, and interpretation of odds ratios.

Visit Resource

emmeans Package for Marginal Means

TOOL

R package for computing estimated marginal means from fitted models, essential for obtaining predicted probabilities and pairwise comparisons from GLMMs.

Visit Resource

Bolker et al. (2009): GLMMs in Ecology

PAPER

Influential paper providing practical guidance on fitting and interpreting GLMMs, including discussion of random effects interpretation and model selection for binary outcomes.

Visit Resource