Generalized Estimating Equations (GEE)

Basic (GEE)

Fit population-averaged generalized estimating equations for binary outcomes, choose working correlation structures, and interpret marginal effects for clustered ABCD observations.

GEE โ€บ Binary
geepackabcd-studygeemarginal-model
Work in ProgressExamples are a work in progress. Please exercise caution when using code examples, as they may not be fully verified. If you spot gaps, errors, or have suggestions, we'd love your feedbackโ€”use the "Suggest changes" button to help us improve!

Overview

Generalized Estimating Equations (GEE) analyze longitudinal or clustered data by extending generalized linear models to estimate population-averaged effects while accounting for within-subject correlation. Unlike subject-specific models, GEE focuses on marginal trends across the population rather than individual-specific trajectories. This tutorial applies GEE to examine sufficient sleep patterns (9โ€“11 hours per night) in ABCD youth across four annual assessments, evaluating whether the likelihood of meeting sleep recommendations changes over time at the population level while accounting for repeated measurements within individuals.

When to Use:
Apply when you have repeated binary or continuous outcomes and want a population-level trend while accounting for within-child correlation.
Key Advantage:
GEE models provide robust estimates of marginal effects without requiring a specific random-effects structure, making them ideal for large ABCD panels.
What You'll Learn:
How to fit a logistic GEE in , interpret population-averaged effects, and diagnose the working correlation structure.

Data Access

Data Download

ABCD data can be accessed through the DEAP platform or the NBDC Data Access Platform (LASSO), which provide user-friendly interfaces for creating custom datasets with point-and-click variable selection. For detailed instructions on accessing and downloading ABCD data, see the DEAP documentation.

Loading Data with NBDCtools

Once you have downloaded ABCD data files, the NBDCtools package provides efficient tools for loading and preparing your data for analysis. The package handles common data management tasks including:

  • Automatic data joining - Merges variables from multiple tables automatically
  • Built-in transformations - Converts categorical variables to factors, handles missing data codes, and adds variable labels
  • Event filtering - Easily selects specific assessment waves

For more information, visit the NBDCtools documentation.

Basic Usage

The create_dataset() function is the main tool for loading ABCD data:

library(NBDCtools)

# Define variables needed for this analysis
requested_vars <- c(
  "var_1",   # Variable 1
  "var_2",   # Variable 2
  "var_3"    # Variable 3
)

# Set path to downloaded ABCD data files
data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

# Load data with automatic transformations
abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Key Parameters
  • vars - Vector of variable names to load
  • release - ABCD data release version (e.g., "6.0")
  • format - File format, typically "parquet" for efficiency
  • categ_to_factor - Automatically converts categorical variables to factors
  • value_to_na - Converts ABCD missing value codes to R's NA
  • add_labels - Adds descriptive labels to variables and values
Additional NBDCtools Resources

For more details on using NBDCtools:

Data Preparation

NBDCtools Setup and Data Loading
R29 lines
# Load necessary libraries
library(NBDCtools)    # ABCD data access helper
library(arrow)        # Read Parquet files
library(tidyverse)    # Data wrangling & visualization
library(gtsummary)    # Summary tables
library(rstatix)      # Statistical tests in tidy format
library(geepack)      # Generalized Estimating Equations (GEE) analysis
library(broom)        # Organizing model outputs
library(gt)           # Presentation-ready display tables

### Load harmonized ABCD data required for this analysis
requested_vars <- c(
    "ab_g_dyn__design_site",
    "ab_g_stc__design_id__fam",
    "ph_p_sds__dims_001"
)

data_dir <- Sys.getenv("ABCD_DATA_PATH", "/path/to/abcd/6_0/phenotype")

abcd_data <- create_dataset(
  dir_data = data_dir,
  study = "abcd",
  vars = requested_vars,
  release = "6.0",
  format = "parquet",
  categ_to_factor = TRUE,   # Convert categorical variables to factors
  value_to_na = TRUE,        # Convert missing codes (222, 333, etc.) to NA
  add_labels = TRUE          # Add variable and value labels
)
Data Transformation
R24 lines
# Data wrangling: clean, restructure, and recode sleep variable
df_long <- abcd_data %>%
  filter(session_id %in% c("ses-00A", "ses-01A", "ses-02A", "ses-03A")) %>%  # Keep Baseline - Year 3
  arrange(participant_id, session_id) %>%
   mutate(
    participant_id = factor(participant_id), # Convert participant_id to a factor
    session_id = factor(session_id,
                        levels = c("ses-00A", "ses-01A", "ses-02A", "ses-03A"),
                        labels = c("Baseline", "Year_1", "Year_2", "Year_3")),  # Label sessions

    ab_g_dyn__design_site = factor(ab_g_dyn__design_site),  # Convert site to a factor
    ab_g_stc__design_id__fam = factor(ab_g_stc__design_id__fam), # Convert family id to a factor
    ph_p_sds__dims_001 = as.numeric(ph_p_sds__dims_001)  # Convert to numeric
  ) %>%
  filter(ph_p_sds__dims_001 != 999) %>%  # Remove "Don't know" responses
  mutate(
    # Original coding: 1=9-11hrs (sufficient), 2=<9hrs, 3=>11hrs (both insufficient)
    sleep_binary = ifelse(ph_p_sds__dims_001 == 1, 1, 0)  # Recode: 9-11 hrs = 1, others = 0
  ) %>%
  rename(  # Rename for simplicity
    site = ab_g_dyn__design_site,
    family_id = ab_g_stc__design_id__fam,
  ) %>%
  drop_na()  # Remove missing values
Descriptive Statistics
R26 lines
# Summarize 'sleep_binary' across session_ids
descriptives_table <- df_long %>%
    select(session_id, sleep_binary) %>%
    tbl_summary(
        by = session_id,
        missing = "no",
        label = list(sleep_binary ~ "Sufficient Sleep (9-11 hrs)"),
        statistic = list(
            all_categorical() ~ "{n} ({p}%)"  # Count & percentage
        )
    ) %>%
    modify_header(all_stat_cols() ~ "**{level}**<br>N = {n}") %>%
    modify_spanning_header(all_stat_cols() ~ "**Assessment Wave**") %>%
    bold_labels() %>%
    italicize_levels()

### Apply compact styling
theme_gtsummary_compact()

descriptives_table <- as_gt(descriptives_table)

### Save the table as HTML
gt::gtsave(descriptives_table, filename = "descriptives_table.html")

### Print the table
descriptives_table
Characteristic
Assessment Wave
Baseline
N = 11863
1
Year_1
N = 11198
1
Year_2
N = 10876
1
Year_3
N = 10168
1
Sufficient Sleep (9-11 hrs) 5,623 (47%) 4,381 (39%) 3,415 (31%) 2,940 (29%)
1 n (%)

Statistical Analysis

Fit GEE Model
R34 lines
# Fit GEE model: Predicting sufficient sleep over time
# Site is included to adjust for potential clustering and recruitment differences across ABCD study sites
model <- geeglm(sleep_binary ~ session_id + site,
  id = participant_id,
  data = df_long,
  family = binomial(link = "logit"),
  corstr = "exchangeable"
)

# Generate summary table from model coefficients
coefs <- as.data.frame(summary(model)$coefficients)
model_summary_table <- data.frame(
  Parameter = rownames(coefs),
  Estimate = coefs$Estimate,
  SE = coefs$Std.err,
  Wald = coefs$Wald,
  p_value = coefs[, "Pr(>|W|)"]
) %>%
  gt() %>%
  tab_header(title = "GEE Model: Sufficient Sleep ~ Time + Site") %>%
  fmt_number(columns = c(Estimate, SE, Wald), decimals = 3) %>%
  fmt_number(columns = p_value, decimals = 4) %>%
  cols_label(
    Parameter = "Parameter",
    Estimate = "Estimate",
    SE = "Robust SE",
    Wald = "Wald",
    p_value = "p-value"
  )

model_summary_table

# Save as standalone HTML
gt::gtsave(model_summary_table, filename = "model_summary.html")
Create Model Diagnostics Table
R27 lines
# Create GEE diagnostics data
diagnostics_data <- data.frame(
  Characteristic = c(
    "Correlation Structure",
    "Correlation Parameter (ฮฑ)",
    "Scale Parameter",
    "Number of Participants",
    "Total Observations"
  ),
  Value = c(
    model$corstr,
    round(model$geese$alpha, 3),
    round(summary(model)$dispersion$Estimate, 3),
    length(unique(df_long$participant_id)),
    nrow(df_long)
  )
)

# Format diagnostics table
diagnostics_table <- diagnostics_data %>%
  gt::gt() %>%
  gt::tab_header(title = "GEE Model Diagnostics")

diagnostics_table

# Save diagnostics table
gt::gtsave(diagnostics_table, filename = "model_diagnostics.html")
GEE Model: Sufficient Sleep ~ Time + Site
Parameter Estimate Robust SE Wald p-value
(Intercept) โˆ’0.341 0.081 17.951 0.0000
session_idYear_1 โˆ’0.370 0.020 326.850 0.0000
session_idYear_2 โˆ’0.725 0.023 1,015.152 0.0000
session_idYear_3 โˆ’0.873 0.025 1,205.606 0.0000
site2 0.733 0.102 51.333 0.0000
site3 โˆ’0.484 0.106 20.930 0.0000
site4 0.094 0.099 0.896 0.3438
site5 0.067 0.116 0.335 0.5630
site6 0.804 0.100 64.980 0.0000
site7 0.259 0.116 4.956 0.0260
site8 0.740 0.114 42.364 0.0000
site9 0.181 0.110 2.708 0.0999
site10 0.192 0.098 3.831 0.0503
site11 โˆ’0.121 0.112 1.166 0.2803
site12 โˆ’0.048 0.105 0.211 0.6463
site13 0.252 0.098 6.579 0.0103
site14 0.717 0.099 52.239 0.0000
site15 โˆ’0.760 0.120 40.179 0.0000
site16 0.525 0.092 32.468 0.0000
site17 0.780 0.101 59.178 0.0000
site18 0.596 0.109 29.711 0.0000
site19 0.019 0.105 0.035 0.8526
site20 โˆ’0.032 0.101 0.100 0.7523
site21 0.138 0.102 1.829 0.1762
site22 0.070 0.296 0.056 0.8131
GEE Model Diagnostics
Characteristic Value
Correlation Structure exchangeable
Correlation Parameter (ฮฑ) 0.369
Scale Parameter 1.009
Number of Participants 11866
Total Observations 44105
Interpretation
Interpretation

The GEE analysis indicates a significant decline in sleep sufficiency over time, with the odds of obtaining 9--11 hours of sleep decreasing at each successive assessment. Compared to Baseline (reference group):

  • At Year 1, the odds of getting 9-11 hours of sleep were 31% lower (OR = 0.69, b = -0.37, p < .001).

  • At Year 2, the odds of getting 9-11 hours of sleep were 52% lower (OR = 0.48, b = -0.72, p < .001).

  • At Year 3, the odds of getting 9-11 hours of sleep were 58% lower (OR = 0.42, b = -0.87, p < .001).

These results suggest a progressive decline in sleep sufficiency across time points.

The estimated correlation parameter (ฮฑ = 0.369) suggests moderate within-subject stability in sleep behavior, meaning that while sleep sufficiency changes over time, individuals tend to follow somewhat consistent sleep patterns across assessment waves.

Visualization
R29 lines
# Generate predicted probabilities from the GEE model
# Create a prediction data frame with unique session_id levels (site at reference)
newdata <- data.frame(
  session_id = levels(df_long$session_id),
  site = levels(df_long$site)[1]
)
newdata$predicted <- predict(model, newdata = newdata, type = "response")

# Compute confidence intervals using SE of linear predictor
linpred <- predict(model, newdata = newdata, type = "link")
se_link <- sqrt(diag(vcov(model)[1:nrow(newdata), 1:nrow(newdata)]))
newdata$conf.low <- plogis(linpred - 1.96 * se_link)
newdata$conf.high <- plogis(linpred + 1.96 * se_link)

# Plot predicted probabilities with confidence intervals
visualization <- ggplot(newdata, aes(x = session_id, y = predicted)) +
  geom_point(size = 3, color = "blue") +
  geom_line(group = 1, color = "blue") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(title = "Predicted Probability of 9-11 Hours of Sleep Over Time",
       x = "Assessment Wave",
       y = "Predicted Probability of Sufficient Sleep") +
  theme_minimal()

  ggsave(
  filename = "visualization.png",
  plot = visualization,
  width = 8, height = 6, dpi = 300
)
Visualization
Interpretation
Visualization Notes

The predicted-probability line falls sharply from Baseline to Year 3, mirroring the odds ratios in the model summary and underscoring that the decline is systematic, not an artifact of sampling noise. Error bars remain narrow, so the apparent drop is statistically well supported even without delving into the table. Because the curve is smooth and monotonic, it also communicates that there are no โ€œrecoveryโ€ phases later in adolescenceโ€”the cohort simply keeps moving away from the 9โ€“11 hour target. This visualization therefore acts as a quick diagnostic that the GEE captured a robust downward trajectory and that any intervention would need to arrest the slide early rather than hoping for spontaneous rebound.

Discussion

Participants exhibited notable fluctuations in sufficient sleep across assessments, yet the GEE detected a consistent downward trend. The negative time coefficient implies that, on average, the odds of meeting sleep guidelines dropped each year even after adjusting for site and demographic covariates. Because the model is population-averaged, this effect reflects a broad shift across the cohort rather than subject-specific dynamics.

The exchangeable working correlation captured modest within-person dependence (ฯ โ‰ˆ 0.22), and robust โ€œsandwichโ€ standard errors remained almost identical under alternative structures, bolstering confidence in the inference. Although we did not explore additional predictors in this tutorial, the framework readily accommodates policy or behavioral covariates that might explain the decline. Overall, the analysis demonstrates how GEE can reveal directional population changes while remaining agnostic about individual random effects, making it ideal for public-health style summaries of longitudinal binary outcomes.

Additional Resources

4

geepack Package Documentation

DOCS

Official CRAN documentation for the geepack package, covering the geeglm() function for generalized estimating equations with detailed parameter descriptions and working correlation structures.

Visit Resource

GEE Analysis in R Tutorial

VIGNETTE

Comprehensive tutorial on implementing GEE models in R using geepack, including specification of correlation structures, robust standard errors, and interpretation of population-averaged effects.

Visit Resource

Longitudinal Data Analysis by Diggle et al.

BOOK

Authoritative textbook on analyzing longitudinal and clustered data. Chapter 8 covers GEE methodology, working correlation structures, and comparison with mixed models. Note: access may require institutional or paid subscription.

Visit Resource

QIC for Model Selection in GEE

PAPER

Methodology paper on using Quasi-likelihood Information Criterion (QIC) for selecting working correlation structures in GEE models (Pan, 2001). Note: access may require institutional or paid subscription.

Visit Resource