Create Derived Variables for Epidemiological Analysis
Source: R/sus_create_variables.R
Creates commonly used derived variables from health data across all SUS systems (SIM, SINAN, SIH, SIA, SINASC, CNES), including age groups, calendar variables, and other epidemiologically relevant categorizations. Age calculation handles the different data formats used across systems.
Usage
sus_create_variables(
  df,
  create_age_groups = TRUE,
  age_breaks = c(0, 5, 15, 60, Inf),
  age_labels = NULL,
  create_calendar_vars = TRUE,
  create_climate_vars = TRUE,
  climate_region = NULL,
  date_col = NULL,
  age_col = NULL,
  hemisphere = "south",
  lang = "pt",
  verbose = TRUE
)
Arguments
- df
A data frame containing health data from any SUS system.
- create_age_groups
Logical. If TRUE, derives age-based variables. When enabled, the function attempts to determine age using the following hierarchy (in order of preference):
1. An existing age column supplied via age_col.
2. Calculation from birth date and event date columns (gold standard).
3. Decoding of DATASUS age code variables (fallback method).
If age is successfully determined, the following variables may be created:
- User-defined age groups: A categorical variable based on age_breaks and age_labels, created using cut(). This variable is always created when create_age_groups = TRUE.
- Climate risk age group: A coarse age classification designed for climate–health analyses, with three categories:
  - 0–4: High risk
  - 5–64: Standard risk
  - 65+: High risk
  The variable name depends on the selected language:
  - English: climate_risk_group
  - Portuguese: grupo_risco_climatico
  - Spanish: grupo_riesgo_climatico
- IBGE quinquennial age groups: A standardized 17-group age classification following the Brazilian Institute of Geography and Statistics (IBGE) quinquennial structure: 0–4, 5–9, ..., 75–79, 80+. This variable ensures national and international comparability and is particularly useful for demographic and epidemiological analyses. Variable names by language:
  - English: ibge_age_group
  - Portuguese: faixa_etaria_ibge
  - Spanish: grupo_edad_ibge
All age group variables are created only if age can be reliably determined. If age cannot be inferred, the function stops with an informative error.
- age_breaks
Numeric vector specifying the breakpoints for age groups. Default is c(0, 5, 15, 60, Inf).
- age_labels
Character vector with labels for age groups. If NULL (default), labels are generated automatically from the breaks.
- create_calendar_vars
Logical. If TRUE (default), creates calendar variables (day of week, month, season, etc.).
- create_climate_vars
Logical. If TRUE (default), creates climate and seasonal variables.
- climate_region
Character. Climate region used for seasonal calculations. Options: "norte", "nordeste", "centro-oeste", "sudeste", "sul".
- date_col
Character string with the name of the date column. If NULL (default), the date column is auto-detected.
- age_col
Character string with the name of the age column (in years). If NULL (default), age is auto-detected and calculated using hierarchical logic:
1. Direct age column (if it exists)
2. Calculation from dates (event_date - birth_date)
3. Decoding of DATASUS age codes (NU_IDADE_N, IDADE)
- hemisphere
Character string specifying the hemisphere for season calculation. Options: "south" (default, for Brazil), "north".
- lang
Character string specifying the language for variable labels and messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).
- verbose
Logical. If TRUE (default), prints progress messages.
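The climate risk and IBGE classifications described under create_age_groups can be approximated in base R. The following is a minimal sketch, not the package's internal code: the sample ages are made up, the labels use plain hyphens, and the left-closed intervals (right = FALSE) are an assumption about how boundaries are treated.

```r
# Hypothetical ages in completed years (illustrative data only)
age <- c(0, 4, 5, 37, 64, 65, 79, 80, 102)

# IBGE-style quinquennial groups: 0-4, 5-9, ..., 75-79, 80+ (17 levels)
ibge_breaks <- c(seq(0, 80, by = 5), Inf)
ibge_labels <- c(paste0(seq(0, 75, by = 5), "-", seq(4, 79, by = 5)), "80+")

# right = FALSE (assumed) gives intervals [0, 5), [5, 10), ..., [80, Inf)
ibge_age_group <- cut(age, breaks = ibge_breaks, labels = ibge_labels,
                      right = FALSE)

# Coarse climate risk classification: 0-4 and 65+ are high risk
climate_risk_group <- ifelse(age < 5 | age >= 65, "High risk", "Standard risk")

data.frame(age, ibge_age_group, climate_risk_group)
```

With lang = "en", the corresponding output columns would be named ibge_age_group and climate_risk_group, as listed above.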
Details
Age Calculation (Hierarchical Logic):
The function uses a 3-tier hierarchy to ensure age is calculated correctly across all SUS systems:
1. Direct Age Column (Fastest): If a column with age in years already exists (common in SIM after microdatasus processing), uses it directly.
2. Date Calculation (Gold Standard): If birth date and event date are available, calculates exact age as interval(birth_date, event_date) / years(1). This is the most accurate method and works for:
   - SINAN: Has DTNASC and DT_NOTIFIC
   - SIH: Has NASC and DT_INTER
   - SINASC: Has DTNASC (mother) and DTNASC (newborn)
3. DATASUS Code Decoder (Fallback): If dates are missing (common in anonymized data), decodes the composite age code used by DATASUS:
   - Codes starting with 1: Hours (converted to 0 years)
   - Codes starting with 2: Days (converted to 0 years)
   - Codes starting with 3: Months (converted to 0 years for <12 months)
   - Codes starting with 4: Years (e.g., 4035 = 35 years)
   - Codes starting with 5: 100+ years (e.g., 5105 = 105 years)
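The fallback decoding rules above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: decode_age_code is a hypothetical helper, and it assumes the four-digit layout implied by the examples (leading digit = unit, remaining three digits = value). Fields with a different width would need a different divisor.

```r
# Hypothetical helper: decode a four-digit DATASUS composite age code
# into completed years, following the rules listed above.
decode_age_code <- function(code) {
  code  <- as.integer(code)
  unit  <- code %/% 1000L   # leading digit: unit of measurement
  value <- code %% 1000L    # remaining digits: quantity in that unit

  age <- rep(NA_integer_, length(code))

  # Hours (1) and days (2) are always less than one year
  age[unit %in% c(1L, 2L)] <- 0L

  # Months (3): whole years contained in the month count (0 for < 12 months)
  is_month <- unit %in% 3L
  age[is_month] <- value[is_month] %/% 12L

  # Years (4) and 100+ years (5): the value is already in years,
  # matching the examples 4035 = 35 and 5105 = 105
  is_year <- unit %in% c(4L, 5L)
  age[is_year] <- value[is_year]

  age
}

decoded <- decode_age_code(c(1012, 2015, 3006, 4035, 5105, NA))
decoded
# 0 0 0 35 105 NA
```

Missing or unrecognized codes stay NA here, so they can be counted and reported rather than silently assigned an age.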
Age Groups: Creates a factor variable age_group based on the specified breaks and labels. Common epidemiological categories:
- Pediatric: 0-4, 5-14, 15-19
- Adult: 20-39, 40-59, 60+
- Elderly: 65-74, 75-84, 85+
- Climate-Health: 0-4, 5-64, 65+ (vulnerable populations)
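As a rough illustration of how breaks turn into a labelled factor, the sketch below builds labels from a breaks vector and passes them to cut(). make_age_labels is a hypothetical helper written for this example. The exact label format the package generates when age_labels = NULL is not documented here, so the "lower-upper" and "lower+" pattern is an assumption, as is the use of left-closed intervals.

```r
# Hypothetical helper: derive labels such as "0-4" or "60+" from breaks
make_age_labels <- function(breaks) {
  lower <- head(breaks, -1)
  upper <- tail(breaks, -1)
  ifelse(is.infinite(upper),
         paste0(lower, "+"),
         paste0(lower, "-", upper - 1))
}

age_breaks <- c(0, 5, 15, 60, Inf)       # default from the Usage section
age <- c(2, 5, 14, 15, 59, 60, 91)       # illustrative ages in years

# right = FALSE (assumed) so that age 5 falls in "5-14", not "0-4"
age_group <- cut(age,
                 breaks = age_breaks,
                 labels = make_age_labels(age_breaks),
                 right = FALSE)

table(age_group)
```

Supplying age_labels explicitly, as in the examples further down, avoids any dependence on how automatic labels are formatted.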
Calendar Variables (when create_calendar_vars = TRUE):
- day_of_week: Day of the week (1 = Monday, 7 = Sunday)
- day_of_week_name: Day name (e.g., "Monday", "Segunda-feira")
- month: Month number (1-12)
- month_name: Month name (e.g., "January", "Janeiro")
- year: Year
- quarter: Quarter (1-4)
- season: Season (Summer, Autumn, Winter, Spring)
- is_weekend: Logical indicating whether the date falls on a weekend
- day_of_year: Day of year (1-365/366)
- semester: Semester (1 or 2)
Seasons are calculated based on hemisphere:
Southern Hemisphere (Brazil): Summer (Dec-Feb), Autumn (Mar-May), Winter (Jun-Aug), Spring (Sep-Nov)
Northern Hemisphere: Summer (Jun-Aug), Autumn (Sep-Nov), Winter (Dec-Feb), Spring (Mar-May)
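Several of the calendar variables and the hemisphere-dependent season can be reproduced in base R. The sketch below is only an approximation of the behavior described above: season_from_date is a hypothetical helper, the dates are made up, and the package's own implementation may differ.

```r
# Hypothetical helper: map a date to a season using the month ranges above
season_from_date <- function(date, hemisphere = c("south", "north")) {
  hemisphere <- match.arg(hemisphere)
  m <- as.integer(format(as.Date(date), "%m"))

  # Season for each month, January through December
  north <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
             "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
  south <- c("Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter",
             "Winter", "Winter", "Spring", "Spring", "Spring", "Summer")

  if (hemisphere == "south") south[m] else north[m]
}

dates <- as.Date(c("2023-01-15", "2023-07-04", "2023-10-31"))
month <- as.integer(format(dates, "%m"))

calendar <- data.frame(
  date        = dates,
  day_of_week = as.integer(format(dates, "%u")),  # ISO: 1 = Monday, 7 = Sunday
  month       = month,
  year        = as.integer(format(dates, "%Y")),
  quarter     = (month - 1L) %/% 3L + 1L,
  semester    = ifelse(month <= 6L, 1L, 2L),
  day_of_year = as.integer(format(dates, "%j")),
  season      = season_from_date(dates, "south")
)
calendar$is_weekend <- calendar$day_of_week >= 6L

calendar
```

The same date yields opposite seasons in the two hemispheres: season_from_date("2023-01-15", "north") returns "Winter", while the southern value is "Summer".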
Examples
if (FALSE) { # \dontrun{
library(climasus4r)

# ===== EXAMPLE 1: SIM (Mortality) - Age already calculated =====
df_sim <- sus_data_import(uf = "SP", year = 2023, system = "SIM-DO") |>
  sus_data_standardize(lang = "en") |>
  sus_create_variables(
    create_age_groups = TRUE,
    age_breaks = c(0, 5, 65, Inf),
    age_labels = c("0-4", "5-64", "65+"),
    create_calendar_vars = TRUE,
    lang = "en"
  )
# Uses direct age column (fastest)

# ===== EXAMPLE 2: SINAN (Dengue) - Calculate from dates =====
df_sinan <- sus_data_import(uf = "RJ", year = 2023, system = "SINAN-DENGUE") |>
  sus_data_standardize(lang = "pt") |>
  sus_create_variables(
    create_age_groups = TRUE,
    age_breaks = c(0, 15, 60, Inf),
    create_calendar_vars = TRUE,
    lang = "pt"
  )
# Calculates age from DTNASC and DT_NOTIFIC (gold standard)

# ===== EXAMPLE 3: SIH (Hospitalizations) - Decode age codes =====
df_sih <- sus_data_import(uf = "MG", year = 2023, system = "SIH-RD") |>
  sus_data_standardize(lang = "es") |>
  sus_create_variables(
    create_age_groups = TRUE,
    age_breaks = c(0, 18, 60, Inf),
    age_labels = c("0-17", "18-59", "60+"),
    create_calendar_vars = TRUE,
    lang = "es"
  )
# Decodes DATASUS age codes if dates are missing (fallback)

# ===== EXAMPLE 4: Custom age groups for elderly analysis =====
# df is any standardized SUS data frame
df_elderly <- sus_create_variables(
  df,
  create_age_groups = TRUE,
  age_breaks = c(60, 70, 80, 90, Inf),
  age_labels = c("60-69", "70-79", "80-89", "90+"),
  lang = "pt"
)

# ===== EXAMPLE 5: Calendar variables and climate variables =====
# semester is part of the calendar variables; there is no separate
# create_semester argument in the function signature
df_calendar_climate <- sus_create_variables(
  df,
  create_calendar_vars = TRUE,
  create_climate_vars = TRUE,
  climate_region = "norte"
)
} # }