Generalized Estimating Equations (GEE)

Count Outcomes (GEE)

Fit population-averaged GEE models for count outcomes, detect overdispersion, and interpret rate ratios for repeated-measures ABCD data.

GEE โ€บ Count
geepackabcd-studygeemarginal-modelcount-data
Work in ProgressExamples are a work in progress. Please exercise caution when using code examples, as they may not be fully verified. If you spot gaps, errors, or have suggestions, we'd love your feedbackโ€”use the "Suggest changes" button to help us improve!

Overview

Generalized Estimating Equations for count outcomes extend the GEE framework from binary to Poisson-family responses, modeling how the rate of events changes over time at the population level while accounting for within-subject correlation. Count outcomes โ€” such as days of substance use, number of behavioral incidents, or frequency of health-related events โ€” are common in longitudinal developmental research but violate the assumptions of standard linear models. This tutorial applies Poisson GEE to analyze alcohol use days in the past 30 days among ABCD youth across annual assessments, demonstrating overdispersion detection, robust standard error estimation, and interpretation of incidence rate ratios.

When to Use:
Apply when your outcome is a non-negative integer count measured repeatedly and you want population-averaged trends while accounting for within-subject correlation.
Key Advantage:
GEE with Poisson family provides valid marginal rate ratio estimates even when the variance-mean relationship is misspecified, as long as the mean model is correct and robust standard errors are used.
What You'll Learn:
How to fit Poisson GEE, detect and handle overdispersion, compare robust versus naive standard errors, and interpret incidence rate ratios for count outcomes.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

  • Automatic data joining - Merges variables from multiple tables automatically
  • Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
  • Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Key Parameters
  • vars - Vector of variable names to load
  • release - ABCD data release version (e.g., "6.0")
  • format - File format, typically "parquet" for efficiency
  • categ_to_factor - Automatically converts categorical variables to factors
  • value_to_na - Converts ABCD missing value codes to R's NA
  • add_labels - Adds descriptive labels to variables and values
Additional Resources

For more details on using NBDCtools:

Data Preparation

NBDCtools Setup and Data Loading
R29 lines
### Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(tidyverse)    # Data wrangling & visualization
library(arrow)        # For reading Parquet files
library(gtsummary)    # Creating publication-quality tables
library(geepack)      # Generalized Estimating Equations
library(broom)        # For tidying model outputs
library(gt)           # For creating formatted tables

### Load harmonized ABCD data required for this analysis
requested_vars <- c(
    "ab_g_dyn__design_site",
    "ab_g_stc__design_id__fam",
    "su_y_tlfb__alc__1mo_ud",       # Alcohol use days (last 30 days)
    "ab_g_stc__cohort_sex"          # Sex (time-invariant covariate)
)

data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,
  value_to_na = TRUE,
  add_labels = TRUE
)
Data Transformation
R21 lines
### Prepare long-format data for GEE count analysis
df_long <- abcd_data %>%
  filter_events_abcd(conditions = c("annual")) %>%
  rename(
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
    alc_days = su_y_tlfb__alc__1mo_ud,
    sex = ab_g_stc__cohort_sex
  ) %>%
  mutate(
    alc_days = as.numeric(alc_days),
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
    time = as.numeric(session_id) - 1,
    participant_id = factor(participant_id)
  ) %>%
  # Keep participants with at least 2 observations
  group_by(participant_id) %>%
  filter(sum(!is.na(alc_days)) >= 2) %>%
  ungroup() %>%
  drop_na(alc_days, sex) %>%
  arrange(participant_id, time)
Descriptive Statistics
R27 lines
### Create descriptive summary table for count outcome
descriptives_table <- df_long %>%
  mutate(
    wave = factor(session_id)
  ) %>%
  select(wave, alc_days) %>%
  tbl_summary(
    by = wave,
    missing = "no",
    label = list(
      alc_days ~ "Alcohol Use Days (past 30)"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd}); Median: {median}"
    )
  ) %>%
  modify_header(all_stat_cols() ~ "{level}<br>N = {n}") %>%
  modify_spanning_header(all_stat_cols() ~ "Assessment Wave") %>%
  bold_labels() %>%
  italicize_levels()

### Apply compact styling and save
theme_gtsummary_compact()

descriptives_table <- as_gt(descriptives_table)
gt::gtsave(descriptives_table, filename = "descriptives_table.html")
descriptives_table
Characteristic
Assessment Wave
ses-00A
N = 11541
1
ses-01A
N = 11213
1
ses-02A
N = 10926
1
ses-03A
N = 10400
1
ses-04A
N = 9638
1
ses-05A
N = 8855
1
ses-06A
N = 5026
1
Alcohol Use Days (past 30) 0.0003 (0.0186); Median: 0.0000 0.0008 (0.0590); Median: 0.0000 0.0007 (0.0331); Median: 0.0000 0.0046 (0.0795); Median: 0.0000 0.0260 (0.2824); Median: 0.0000 0.0733 (0.5996); Median: 0.0000 0.1514 (0.7073); Median: 0.0000
1 Mean (SD); Median: Median

Statistical Analysis

Step 1: Fit Poisson GEE
R16 lines
### Fit Poisson GEE: modeling alcohol use days over time
poisson_gee <- geeglm(
  alc_days ~ time + sex,
  id = participant_id,
  data = df_long,
  family = poisson(link = "log"),
  corstr = "exchangeable"
)

### Extract Poisson GEE results
poisson_summary <- summary(poisson_gee)

### Check for overdispersion: scale parameter substantially > 1 indicates overdispersion
scale_param <- poisson_summary$dispersion$Estimate
cat("Scale parameter:", round(scale_param, 3), "\n")
cat("Overdispersion detected:", scale_param > 1.5, "\n")
Step 2: Compare Robust vs Naive Standard Errors
R25 lines
### Extract coefficients with both naive and robust SEs
naive_se <- poisson_summary$coefficients[, "Std.err"]
robust_results <- as.data.frame(poisson_summary$coefficients)

### Format SE comparison table
se_comparison <- data.frame(
  Parameter = rownames(robust_results),
  Estimate = robust_results$Estimate,
  Naive_SE = naive_se,
  Robust_SE = robust_results$Std.err,
  SE_Ratio = robust_results$Std.err / naive_se
) %>%
  gt() %>%
  tab_header(title = "Robust vs. Naive Standard Errors (Poisson GEE)") %>%
  fmt_number(columns = c(Estimate, Naive_SE, Robust_SE, SE_Ratio), decimals = 3) %>%
  cols_label(
    Parameter = "Parameter",
    Estimate = "Estimate (log)",
    Naive_SE = "Naive SE",
    Robust_SE = "Robust SE",
    SE_Ratio = "Robust/Naive"
  )

gt::gtsave(se_comparison, filename = "se_comparison.html")
se_comparison
Robust vs. Naive Standard Errors (Poisson GEE)
Parameter Estimate (log) Naive SE Robust SE Robust/Naive
(Intercept) โˆ’8.025 0.208 0.208 1.000
time 1.026 0.033 0.033 1.000
sexFemale 0.199 0.109 0.109 1.000
Step 3: Fit Model and Format Rate Ratios
R26 lines
### Compute incidence rate ratios (IRRs) from the Poisson GEE
coefs <- as.data.frame(poisson_summary$coefficients)

irr_table <- data.frame(
  Parameter = rownames(coefs),
  Log_Estimate = coefs$Estimate,
  IRR = exp(coefs$Estimate),
  IRR_Lower = exp(coefs$Estimate - 1.96 * coefs$Std.err),
  IRR_Upper = exp(coefs$Estimate + 1.96 * coefs$Std.err),
  p_value = coefs[, "Pr(>|W|)"]
) %>%
  gt() %>%
  tab_header(title = "Incidence Rate Ratios: Poisson GEE") %>%
  fmt_number(columns = c(Log_Estimate, IRR, IRR_Lower, IRR_Upper), decimals = 3) %>%
  fmt_number(columns = p_value, decimals = 4) %>%
  cols_label(
    Parameter = "Parameter",
    Log_Estimate = "Log(IRR)",
    IRR = "IRR",
    IRR_Lower = "95% CI Lower",
    IRR_Upper = "95% CI Upper",
    p_value = "p-value"
  )

gt::gtsave(irr_table, filename = "model_summary.html")
irr_table
Incidence Rate Ratios: Poisson GEE
Parameter Log(IRR) IRR 95% CI Lower 95% CI Upper p-value
(Intercept) โˆ’8.025 0.000 0.000 0.000 0.0000
time 1.026 2.790 2.613 2.979 0.0000
sexFemale 0.199 1.220 0.985 1.511 0.0678
Step 4: Model Diagnostics
R30 lines
### Create GEE diagnostics summary
diagnostics_data <- data.frame(
  Characteristic = c(
    "Family",
    "Link Function",
    "Correlation Structure",
    "Correlation Parameter",
    "Scale Parameter",
    "Overdispersion Detected",
    "Number of Participants",
    "Total Observations"
  ),
  Value = c(
    "Poisson",
    "log",
    poisson_gee$corstr,
    round(poisson_gee$geese$alpha, 3),
    round(scale_param, 3),
    ifelse(scale_param > 1.5, "Yes", "No"),
    length(unique(df_long$participant_id)),
    nrow(df_long)
  )
)

diagnostics_table <- diagnostics_data %>%
  gt() %>%
  gt::tab_header(title = "GEE Model Diagnostics")

gt::gtsave(diagnostics_table, filename = "model_diagnostics.html")
diagnostics_table
GEE Model Diagnostics
Characteristic Value
Family Poisson
Link Function log
Correlation Structure exchangeable
Correlation Parameter 0.048
Scale Parameter 2.448
Overdispersion Detected Yes
Number of Participants 11541
Total Observations 67599
Interpretation
Interpretation

The Poisson GEE estimated a population-averaged IRR for time of 2.79 (95% CI: 2.61โ€“2.98, p < .001), indicating that the expected count of alcohol use days nearly tripled with each assessment wave. The sex effect was not significant (IRR = 1.22, 95% CI: 0.99โ€“1.51, p = .068), suggesting only a marginal difference between males and females in average alcohol use days after accounting for the time trend.

The scale parameter of 2.448 confirms substantial overdispersion โ€” the data have roughly 2.4 times more variability than the Poisson distribution assumes. This is expected for count data like alcohol use days with many zeros and occasional large values. GEE with robust (sandwich) standard errors remains valid under overdispersion because inference relies on the sandwich variance estimator rather than the assumed Poisson variance structure. The exchangeable correlation parameter of 0.048 indicates modest within-person clustering across waves.

The SE comparison table shows that robust and naive standard errors are essentially identical for this model, suggesting the working correlation structure is well-specified. When these diverge substantially, it signals variance misspecification โ€” but robust inference protects against this regardless.

Visualization
R19 lines
### Visualize count distribution over time
count_plot <- df_long %>%
  mutate(wave = factor(session_id)) %>%
  ggplot(aes(x = alc_days)) +
  geom_histogram(binwidth = 1, fill = "#4A90D9", color = "white", alpha = 0.8) +
  facet_wrap(~wave, scales = "free_y", ncol = 2) +
  labs(
    title = "Distribution of Alcohol Use Days by Assessment Wave",
    x = "Number of Days (past 30)",
    y = "Count of Participants"
  ) +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 30))

ggsave(
  filename = "visualization.png",
  plot = count_plot,
  width = 10, height = 6, dpi = 300
)
Count Distribution by Wave
Interpretation
Visualization Notes

The histograms reveal the characteristic shape of count data with excess zeros: most youth report zero alcohol use days at each wave, with a long right tail of those reporting higher counts. As waves progress, the distribution shifts rightward โ€” the proportion reporting zero decreases and the tail extends further โ€” consistent with the increasing IRR from the GEE model. This zero-inflated, right-skewed distribution explains the overdispersion detected by the scale parameter: the Poisson distribution assumes the variance equals the mean, but the combination of many zeros and a few high counts produces variance far exceeding the mean. The GEE's robust standard errors account for this pattern, making the inference valid despite the distributional mismatch.

Discussion

The Poisson GEE analysis demonstrates how to model count outcomes in a population-averaged framework while handling the practical challenges of overdispersion and excess zeros that are typical of behavioral count data. By using robust sandwich standard errors, the GEE provides valid inference on incidence rate ratios even when the Poisson variance assumption is violated โ€” a common occurrence with developmental count data where many participants report zero events.

The rate ratio interpretation is particularly useful for applied researchers: rather than reporting log-scale coefficients, IRRs directly communicate the multiplicative change in the expected count associated with each predictor. For time effects, this translates to statements about the percentage increase (or decrease) in alcohol use frequency per assessment wave, which is more interpretable than additive effects on a transformed scale.

Key methodological takeaways include: (1) always report the scale parameter and compare robust versus naive standard errors as overdispersion diagnostics; (2) GEE with robust SEs is a practical default for count outcomes because it protects against variance misspecification; and (3) the marginal interpretation of GEE โ€” population-averaged rate ratios rather than subject-specific effects โ€” is appropriate when the research question targets population-level trends. For subject-specific interpretations or when random effects are of substantive interest, a GLMM count model (negative binomial or Poisson) would be the complementary approach.

Additional Resources

4

geepack Package Documentation

DOCS

Official CRAN documentation for the geepack package, covering the geeglm() function for Poisson and other generalized estimating equations with different working correlation structures.

Visit Resource

Zeger & Liang (1986): Longitudinal Data Analysis Using GEE

PAPER

Foundational paper introducing GEE methodology for longitudinal data, establishing the theoretical basis for robust inference under working correlation misspecification for both binary and count outcomes.

Visit Resource

Hilbe (2011): Negative Binomial Regression

BOOK

Comprehensive reference on modeling count data including Poisson regression, overdispersion detection, negative binomial alternatives, and zero-inflated models with worked examples in R.

Visit Resource

QIC for Model Selection in GEE

PAPER

Methodology paper on using Quasi-likelihood Information Criterion (QIC) for selecting working correlation structures in GEE models (Pan, 2001). Note: access may require institutional or paid subscription.

Visit Resource