Aggregate Health Data into Time Series — sus_data

Aggregates individual-level health data into time series counts by specified time units and grouping variables. This function is essential for preparing data for time series analysis, DLNM models, and other temporal epidemiological methods.

Usage

sus_data_aggregate(
  df,
  time_unit = "day",
  fun = "count",
  group_by = NULL,
  value_col = NULL,
  complete_dates = FALSE,
  date_col = NULL,
  backend = "arrow",
  lang = "pt",
  verbose = TRUE
)

Arguments

df

A data frame containing health data (output from sus_data_standardize() or one of the sus_data_filter*() functions).

time_unit

Character string specifying the temporal aggregation unit. Standard units: "day", "week", "month", "quarter", "year". Multi-day/week/month units: "2 days", "5 days" (pentads), "14 days" (fortnightly), "3 months" (trimester), "6 months" (semester). Special unit: "season" (Brazilian seasons: DJF, MAM, JJA, SON). Default is "day".
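To make the multi-day units concrete, the following base R sketch bins event dates into consecutive 5-day periods and counts the events in each. It only illustrates the idea of pentad aggregation and is not the package's implementation; the `events` data frame and its values are hypothetical, and the choice to start the first period at the earliest observed date is an assumption.

```r
# Hypothetical individual-level records: one row per event.
events <- data.frame(
  date = as.Date("2023-01-01") + c(0, 0, 1, 3, 5, 6, 6, 9, 12)
)

# Bin dates into consecutive 5-day periods starting at the earliest date.
# cut() on Date objects accepts "day", "week", "month", "quarter" or "year",
# optionally preceded by an integer and a space, e.g. "5 days".
events$period <- cut(events$date, breaks = "5 days")

# Count events per period; each factor level is the start date of its period.
counts <- as.data.frame(table(period = events$period), responseName = "n")
counts$period <- as.Date(as.character(counts$period))
counts
```

Each row of `counts` is labelled with the start of its period, which matches the "start of period" convention described for the returned date column.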

fun

Character string or list of functions specifying the aggregation function(s). Options: "count" (default), "sum", "mean", "median", "min", "max", "sd", "q25" (25th percentile), "q75", "q95", and "q99". Can also be a named list for multiple aggregations, e.g., list(mean_temp = "mean", max_temp = "max").
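As a rough illustration of what a named list of aggregations produces, the base R sketch below computes one output column per named entry within each time period. The mapping from option names to base R functions is an assumption made for this example (in particular, the percentile options are taken to mean quantile() with its default interpolation), and the `obs` data frame is hypothetical.

```r
# Hypothetical exposure values; "temperature" is an illustrative column name.
obs <- data.frame(
  month = c("2023-01", "2023-01", "2023-01", "2023-02", "2023-02"),
  temperature = c(24, 28, 30, 22, 26)
)

# Assumed mapping from option names to base R summaries.
summaries <- list(
  mean = mean,
  max  = max,
  q25  = function(x) unname(quantile(x, 0.25)),
  q95  = function(x) unname(quantile(x, 0.95))
)

# Analogue of fun = list(mean_temp = "mean", max_temp = "max"):
# each name becomes an output column, each value selects a summary.
requested <- c(mean_temp = "mean", max_temp = "max")
by_month <- split(obs$temperature, obs$month)
result <- data.frame(month = names(by_month))
for (out_name in names(requested)) {
  f <- summaries[[requested[[out_name]]]]
  result[[out_name]] <- unname(vapply(by_month, f, numeric(1)))
}
result
```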

group_by

Character vector with names of columns to group by (e.g., c("sex", "age_group", "race")). If NULL (default), aggregates across "municipality_code" records.

value_col

Character string with the name of the column to aggregate. Required for any function other than "count" (e.g., "sum", "mean"). Example columns: "temperature", "precipitation", "pm25".

complete_dates

Logical. If TRUE, fills in missing time periods with zero counts to create a complete time series without gaps. Default is FALSE.
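The gap-filling step can be sketched in base R as a join against the full sequence of periods, with missing counts replaced by zero. This is a minimal illustration for an ungrouped daily series, not the package's own code; the `counts` data frame is hypothetical.

```r
# Hypothetical daily counts with two missing days (2023-01-03 and 2023-01-04).
counts <- data.frame(
  date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-05")),
  n = c(3L, 1L, 2L)
)

# Full daily sequence covering the observed range.
all_days <- data.frame(
  date = seq(min(counts$date), max(counts$date), by = "day")
)

# Left join the counts onto the full sequence, then replace NA with zero.
completed <- merge(all_days, counts, by = "date", all.x = TRUE)
completed$n[is.na(completed$n)] <- 0L
completed
```

With grouping variables, the full grid would need one row per combination of period and group before the join. Explicit zeros matter for count models, where a missing row and a zero count are not equivalent.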

date_col

Character string with the name of the date column to use for aggregation. If NULL (default), the function will attempt to auto-detect the date column based on common patterns.

backend

Character string specifying the data processing backend. Use "arrow" for out-of-memory, lazy processing (recommended for large datasets), or "tibble" for in-memory processing (recommended for small to medium datasets).

"arrow": operations are performed lazily using the Apache Arrow engine, avoiding loading the full dataset into memory. Ideal for large files (e.g., Parquet, Feather) and high-performance workflows.
"tibble": data is fully loaded into memory as a tibble and processed eagerly using dplyr. Simpler and more predictable, but may be slow or fail for large datasets.

If not specified, the function may automatically choose the backend based on the input data type.

lang

Character string specifying the language for messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).

verbose

Logical. If TRUE (default), prints progress messages.

Value

A tibble with aggregated data containing:

date: The aggregated date (start of period)
Grouping columns (if group_by was specified)
Aggregated value column(s) with smart names based on system and function

Details

New Features:

Multiple aggregation functions: Beyond counting, you can now calculate mean, sum, median, percentiles, etc., useful for climate and environmental data.
Smart column naming: The aggregated column is automatically named based on the health system (e.g., n_deaths for SIM, n_hospitalizations for SIH-RD, n_births for SINASC, n_cases for SINAN, n_procedures for SIA, and n_establishments for CNES).

Epidemiological Use Cases:

Daily/Weekly: Standard time series analysis, DLNM for short-term effects
Pentads (5 days): Heat wave analysis, smoothing daily noise
Fortnightly (14 days): Diseases with longer incubation periods
Monthly: Seasonal patterns, long-term trends
Quarterly: SUS management reports, policy evaluation
Seasonal: Dengue, Influenza, respiratory diseases aligned with Brazilian climate
Yearly: Long-term trend analysis, climate change impacts

Brazilian Seasons (when time_unit = "season"):

Summer (Verao): December-January-February (DJF)
Autumn (Outono): March-April-May (MAM)
Winter (Inverno): June-July-August (JJA)
Spring (Primavera): September-October-November (SON)
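The month-to-season mapping listed above can be written as a small lookup. The sketch below only assigns the three-letter label to each date; the helper name `season_of` is hypothetical, and how the package dates a season that spans two calendar years (December with the following January and February) is not specified here.

```r
# Map each date to its season label using the month groupings listed above.
season_of <- function(date) {
  m <- as.integer(format(date, "%m"))
  # One label per calendar month, January through December.
  labels <- c("DJF", "DJF", "MAM", "MAM", "MAM", "JJA",
              "JJA", "JJA", "SON", "SON", "SON", "DJF")
  labels[m]
}

dates <- as.Date(c("2023-01-15", "2023-04-02", "2023-07-30",
                   "2023-10-10", "2023-12-25"))
season_of(dates)
```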

Examples

if (FALSE) { # \dontrun{
library(climasus4r)

# Basic daily aggregation
df_daily <- sus_data_import(uf = "SP", year = 2023, system = "SIM-DO") %>%
  sus_data_standardize() %>%
  sus_data_filter_cid(disease_group = "respiratory") %>%
  sus_data_aggregate(time_unit = "day")

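# In the examples below, df is a standardized (and optionally filtered) data frame,
# e.g. the output of sus_data_standardize()

# Pentad aggregation (5-day periods) for heat wave analysis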
df_pentad <- sus_data_aggregate(df, time_unit = "5 days")

# Fortnightly aggregation for diseases with longer incubation
df_fortnightly <- sus_data_aggregate(df, time_unit = "14 days")

# Monthly aggregation by race and sex
df_monthly <- sus_data_aggregate(
  df,
  time_unit = "month",
  group_by = c("race", "sex"),
  lang = "pt"
)

# Quarterly aggregation for SUS reports
df_quarterly <- sus_data_aggregate(df, time_unit = "quarter")

# Seasonal aggregation for dengue analysis (Brazilian seasons)
df_seasonal <- sus_data_aggregate(
  df,
  time_unit = "season"
)

# Weekly aggregation by age group and sex
df_weekly <- sus_data_aggregate(
  df,
  time_unit = "week",
  group_by = c("age_group", "sex") # age_group comes from `sus_data_create_variables()`
)
} # }