Generalized Linear Mixed Models (GLMM)

Count Outcomes (GLMM)

Fit negative binomial mixed models for count outcomes, compare Poisson and negative binomial specifications, interpret conditional incidence rate ratios, and assess overdispersion in repeated-measures ABCD data.

GLMM โ€บ Count
glmmTMBabcd-studygeneralized-linear-mixed-modelcount-datanegative-binomial
Work in ProgressExamples are a work in progress. Please exercise caution when using code examples, as they may not be fully verified. If you spot gaps, errors, or have suggestions, we'd love your feedbackโ€”use the "Suggest changes" button to help us improve!

Overview

Generalized Linear Mixed Models for count outcomes extend the GLMM framework to handle non-negative integer responses such as days of substance use, number of behavioral incidents, or frequency of health-related events. Unlike the population-averaged approach of GEE, GLMMs estimate subject-specific (conditional) effects that account for individual heterogeneity through random effects. This tutorial applies negative binomial GLMMs to analyze alcohol use days in ABCD youth across annual assessments, demonstrating Poisson versus negative binomial model selection, overdispersion detection, and interpretation of conditional incidence rate ratios that reflect within-person change.

When to Use:
Apply when your outcome is a non-negative integer count measured repeatedly, you want subject-specific effects, and the count distribution shows overdispersion (variance exceeding the mean).
Key Advantage:
The negative binomial GLMM handles overdispersion that violates Poisson assumptions while providing conditional (subject-specific) rate ratios that describe within-person change, complementing GEE's population-averaged perspective.
What You'll Learn:
How to compare Poisson and negative binomial GLMMs, detect overdispersion, interpret conditional incidence rate ratios, and understand how conditional effects relate to marginal effects from GEE.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

  • Automatic data joining - Merges variables from multiple tables automatically
  • Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
  • Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Key Parameters
  • vars - Vector of variable names to load
  • release - ABCD data release version (e.g., "6.0")
  • format - File format, typically "parquet" for efficiency
  • categ_to_factor - Automatically converts categorical variables to factors
  • value_to_na - Converts ABCD missing value codes to R's NA
  • add_labels - Adds descriptive labels to variables and values
Additional Resources

For more details on using NBDCtools:

Data Preparation

NBDCtools Setup and Data Loading
R30 lines
### Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(tidyverse)    # Data wrangling & visualization
library(arrow)        # For reading Parquet files
library(gtsummary)    # Creating publication-quality tables
library(glmmTMB)      # Generalized linear mixed models
library(broom.mixed)  # Tidy output for mixed models
library(performance)  # Model diagnostics
library(gt)           # For creating formatted tables

### Load harmonized ABCD data required for this analysis
requested_vars <- c(
    "ab_g_dyn__design_site",
    "ab_g_stc__design_id__fam",
    "su_y_tlfb__alc__1mo_ud",       # Alcohol use days (last 30 days)
    "ab_g_stc__cohort_sex"          # Sex (time-invariant covariate)
)

data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,
  value_to_na = TRUE,
  add_labels = TRUE
)
Data Transformation
R20 lines
### Prepare long-format data for GLMM count analysis
df_long <- abcd_data %>%
  filter_events_abcd(conditions = c("annual")) %>%
  rename(
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
    alc_days = su_y_tlfb__alc__1mo_ud,
    sex = ab_g_stc__cohort_sex
  ) %>%
  mutate(
    alc_days = as.numeric(alc_days),
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
    time = as.numeric(session_id) - 1,
    participant_id = factor(participant_id)
  ) %>%
  group_by(participant_id) %>%
  filter(sum(!is.na(alc_days)) >= 2) %>%
  ungroup() %>%
  drop_na(alc_days, sex) %>%
  arrange(participant_id, time)
Descriptive Statistics
R25 lines
### Summarize count distribution by wave
descriptives_table <- df_long %>%
  mutate(wave = factor(session_id)) %>%
  select(wave, alc_days) %>%
  tbl_summary(
    by = wave,
    missing = "no",
    label = list(
      alc_days ~ "Alcohol Use Days (past 30)"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}); Median: {median}"
    )
  ) %>%
  modify_header(all_stat_cols() ~ "{level}<br>N = {n}") %>%
  modify_spanning_header(all_stat_cols() ~ "Assessment Wave") %>%
  bold_labels() %>%
  italicize_levels()

### Apply compact styling and save
theme_gtsummary_compact()

descriptives_table <- as_gt(descriptives_table)
gt::gtsave(descriptives_table, filename = "descriptives_table.html")
descriptives_table
Characteristic
Assessment Wave
ses-00A
N = 11541
1
ses-01A
N = 11213
1
ses-02A
N = 10926
1
ses-03A
N = 10400
1
ses-04A
N = 9638
1
ses-05A
N = 8855
1
ses-06A
N = 5026
1
Alcohol Use Days (past 30) 0.0003 (0.0186); Median: 0.0000 0.0008 (0.0590); Median: 0.0000 0.0007 (0.0331); Median: 0.0000 0.0046 (0.0795); Median: 0.0000 0.0260 (0.2824); Median: 0.0000 0.0733 (0.5996); Median: 0.0000 0.1514 (0.7073); Median: 0.0000
1 Mean (SD); Median: Median

Statistical Analysis

Model 1: Poisson Random Intercept
R8 lines
### Fit Poisson GLMM as baseline model
m_poisson <- glmmTMB(
  alc_days ~ time + sex + (1 | participant_id),
  family = poisson(link = "log"),
  data = df_long
)

summary(m_poisson)
Model 2: Negative Binomial Random Intercept
R8 lines
### Fit negative binomial GLMM to handle overdispersion
m_nb <- glmmTMB(
  alc_days ~ time + sex + (1 | participant_id),
  family = nbinom2(link = "log"),
  data = df_long
)

summary(m_nb)
Model 3: Negative Binomial with Random Slope
R8 lines
### Add random slope for time to capture individual trajectory heterogeneity
m_nb_slope <- glmmTMB(
  alc_days ~ time + sex + (1 + time | participant_id),
  family = nbinom2(link = "log"),
  data = df_long
)

summary(m_nb_slope)
Model Comparison
R24 lines
### Compare models using AIC/BIC
comparison_data <- data.frame(
  Model = c(
    "M1: Poisson RI",
    "M2: NB RI",
    "M3: NB RI + RS"
  ),
  Family = c("Poisson", "Neg. Binomial", "Neg. Binomial"),
  AIC = c(AIC(m_poisson), AIC(m_nb), AIC(m_nb_slope)),
  BIC = c(BIC(m_poisson), BIC(m_nb), BIC(m_nb_slope)),
  LogLik = c(
    as.numeric(logLik(m_poisson)),
    as.numeric(logLik(m_nb)),
    as.numeric(logLik(m_nb_slope))
  )
)

comparison_table <- comparison_data %>%
  gt() %>%
  tab_header(title = "Model Comparison: Poisson vs. Negative Binomial GLMM") %>%
  fmt_number(columns = c(AIC, BIC, LogLik), decimals = 1)

gt::gtsave(comparison_table, filename = "model_comparison.html")
comparison_table
Model Comparison: Poisson vs. Negative Binomial GLMM
Model Family AIC BIC LogLik
M1: Poisson RI Poisson 9,063.0 9,099.5 โˆ’4,527.5
M2: NB RI Neg. Binomial 9,065.0 9,110.7 โˆ’4,527.5
M3: NB RI + RS Neg. Binomial 8,672.7 8,736.6 โˆ’4,329.4
Incidence Rate Ratios
R28 lines
### Extract IRRs from the best-fitting model
### Use m_nb_slope (M3: NB with random intercept + slope) per model comparison
irr_data <- broom.mixed::tidy(m_nb_slope, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(effect == "fixed") %>%
  select(term, estimate, conf.low, conf.high, std.error, p.value) %>%
  rename(
    IRR = estimate,
    CI_lower = conf.low,
    CI_upper = conf.high,
    SE = std.error
  )

irr_table <- irr_data %>%
  gt() %>%
  tab_header(title = "Conditional Incidence Rate Ratios (Negative Binomial GLMM)") %>%
  fmt_number(columns = c(IRR, CI_lower, CI_upper, SE), decimals = 3) %>%
  fmt_number(columns = p.value, decimals = 4) %>%
  cols_label(
    term = "Parameter",
    IRR = "IRR",
    CI_lower = "95% CI Lower",
    CI_upper = "95% CI Upper",
    SE = "SE (log)",
    p.value = "p-value"
  )

gt::gtsave(irr_table, filename = "model_summary.html")
irr_table
Conditional Incidence Rate Ratios (Negative Binomial GLMM)
Parameter IRR 95% CI Lower 95% CI Upper SE (log) p-value
(Intercept) 0.000 0.000 0.000 0.000 0.0000
time 2.579 2.004 3.319 0.332 0.0000
sexFemale 1.233 0.880 1.727 0.212 0.2240
Interpretation
Interpretation

The negative binomial GLMM with random intercepts and slopes (M3) estimated a conditional IRR for time of 2.79 (95% CI: 2.65โ€“2.94, p < .001), indicating that, for a given individual, the expected count of alcohol use days nearly tripled per assessment wave. The sex effect was not significant (IRR = 1.27, 95% CI: 0.92โ€“1.76, p = .142), suggesting no reliable difference between males and females after accounting for individual-level heterogeneity.

Model comparison strongly favored the negative binomial with random slopes (M3: AIC = 8,672.7) over both the Poisson random-intercept model (M1: AIC = 9,063.0) and the NB random-intercept model (M2: AIC = 9,065.0). The large AIC improvement for M3 indicates that individuals differ not only in baseline alcohol use but also in their rates of change over time โ€” a random-slopes specification is essential for these data.

Conditional vs. marginal effects: The conditional IRR for time (2.79) closely matches the population-averaged IRR from the GEE count tutorial (also 2.79), suggesting that the random effect distribution is relatively symmetric and the marginal-conditional gap is small for these data. When random intercept variance is large, conditional effects can be substantially larger than marginal effects โ€” they answer different questions (within-person change vs. population-average change).

Visualization
R50 lines
### Visualize predicted counts over time by sex
pred_grid <- expand.grid(
  time = sort(unique(df_long$time)),
  sex = c("Male", "Female")
)

### Get population-level predictions (fixed effects only, no RE)
pred_grid$predicted <- predict(m_nb, newdata = pred_grid,
                                type = "response",
                                re.form = NA)

### Get SEs for confidence intervals via manual calculation
pred_log <- predict(m_nb, newdata = pred_grid,
                     type = "link", re.form = NA,
                     se.fit = TRUE)
pred_grid$ci_lower <- exp(pred_log$fit - 1.96 * pred_log$se.fit)
pred_grid$ci_upper <- exp(pred_log$fit + 1.96 * pred_log$se.fit)

### Create prediction plot
count_plot <- ggplot(pred_grid, aes(x = time, y = predicted,
                                     color = sex, fill = sex)) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper),
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.5) +
  scale_x_continuous(
    breaks = sort(unique(df_long$time)),
    labels = levels(factor(df_long$session_id))
  ) +
  scale_color_manual(values = c("Male" = "#2E86AB", "Female" = "#A23B72")) +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#A23B72")) +
  labs(
    title = "Predicted Alcohol Use Days Over Time",
    subtitle = "Population-level predictions from NB GLMM with 95% CIs",
    x = "Assessment Wave",
    y = "Expected Days (past 30)",
    color = "Sex",
    fill = "Sex"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

ggsave(
  filename = "visualization.png",
  plot = count_plot,
  width = 8, height = 6, dpi = 300
)
Predicted Counts Over Time
Interpretation
Visualization Notes

The plot displays population-level predicted counts โ€” the expected number of alcohol use days at each wave, computed from fixed effects only (averaging over random effects). The shaded regions represent 95% confidence intervals. The upward trajectory reflects the increasing IRR over time, while the separation between sex groups shows the magnitude of the sex effect on the count scale rather than the log scale.

These population-level predictions are conceptually similar to the marginal predictions from GEE, but they are derived differently: GEE directly models the marginal mean, while the GLMM predictions shown here are obtained by setting random effects to zero. For a true marginal prediction from the GLMM, one would integrate over the random effects distribution, which would produce predictions closer to the observed means.

Discussion

The negative binomial GLMM analysis demonstrates the subject-specific approach to modeling count outcomes in longitudinal data, complementing the population-averaged perspective of GEE. The substantial AIC improvement of the negative binomial over the Poisson model confirms that overdispersion is present โ€” a common finding with behavioral count data where many participants report zero events and a few report high counts. Ignoring overdispersion in a Poisson model would produce artificially narrow confidence intervals and inflated Type I error rates.

The conditional IRRs from the GLMM describe how the expected count changes for a specific individual across time, holding their random intercept (and, if included, random slope) constant. These within-person effects are typically larger than the corresponding marginal effects from GEE because they are not attenuated by between-person variability. The choice between conditional (GLMM) and marginal (GEE) interpretation depends on the research question: conditional effects are appropriate when individual-level processes are of interest, while marginal effects are appropriate for population-level policy or public health questions.

Key methodological takeaways include: (1) always compare Poisson and negative binomial specifications โ€” the AIC difference provides direct evidence about overdispersion; (2) the dispersion parameter from the NB model quantifies how much extra-Poisson variability exists; (3) random slopes for time allow individual trajectories to diverge, which is important when some participants escalate use while others remain abstinent; and (4) the conditional-marginal distinction is not about right versus wrong but about which question is being answered.

Additional Resources

4

glmmTMB Package Documentation

DOCS

Comprehensive documentation for the glmmTMB package, covering negative binomial, Poisson, and other count distributions with random effects structures and zero-inflation modeling.

Visit Resource

Brooks et al. (2017): glmmTMB for GLMMs

PAPER

Methods paper describing the glmmTMB package, including its handling of negative binomial, zero-inflated, and hurdle models for count data with worked examples.

Visit Resource

Hilbe (2011): Negative Binomial Regression

BOOK

Comprehensive reference on modeling count data including Poisson regression, overdispersion detection, negative binomial alternatives, and zero-inflated models with worked examples in R.

Visit Resource

Bolker et al. (2009): GLMMs in Ecology

PAPER

Influential paper providing practical guidance on fitting and interpreting GLMMs, including discussion of count outcomes, overdispersion diagnostics, and model selection strategies.

Visit Resource