
Creates commonly used derived variables from health data across all SUS systems (SIM, SINAN, SIH, SIA, SINASC, CNES), including age groups, calendar variables, and other epidemiologically relevant categorizations. Age is derived through a hierarchical procedure that accommodates the different data formats used by each system.

Usage

sus_create_variables(
  df,
  create_age_groups = TRUE,
  age_breaks = c(0, 5, 15, 60, Inf),
  age_labels = NULL,
  create_calendar_vars = TRUE,
  create_climate_vars = TRUE,
  climate_region = NULL,
  date_col = NULL,
  age_col = NULL,
  hemisphere = "south",
  lang = "pt",
  verbose = TRUE
)

Arguments

df

A data frame containing health data from any SUS system.

create_age_groups

Logical. If TRUE, derives age-based variables. When enabled, the function will attempt to determine age using the following hierarchy (in order of preference):

  1. An existing age column supplied via age_col.

  2. Calculation from birth date and event date columns (gold standard).

  3. Decoding of DATASUS age code variables (fallback method).

If age is successfully determined, the following variables may be created:

  • User-defined age groups: A categorical variable based on age_breaks and age_labels, created using cut(). This variable is created whenever create_age_groups = TRUE and age can be determined.

  • Climate risk age group: A coarse age classification designed for climate–health analyses, with three categories:

    • 0–4: High risk

    • 5–64: Standard risk

    • 65+: High risk

    The variable name depends on the selected language:

    • English: climate_risk_group

    • Portuguese: grupo_risco_climatico

    • Spanish: grupo_riesgo_climatico

  • IBGE quinquennial age groups: A standardized 17-group age classification following the Brazilian Institute of Geography and Statistics (IBGE) quinquennial structure: 0–4, 5–9, ..., 75–79, 80+. This variable ensures national and international comparability and is particularly useful for demographic and epidemiological analyses. Variable names by language:

    • English: ibge_age_group

    • Portuguese: faixa_etaria_ibge

    • Spanish: grupo_edad_ibge

All age group variables are created only if age can be reliably determined. If age cannot be inferred, the function will stop with an informative error.
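The IBGE and climate risk classifications described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the sample ages are made up, the object names simply mirror the English variable names, and the exact label strings the function emits are an assumption.

```r
# Illustrative ages in completed years (made-up values).
age <- c(0, 4, 5, 37, 64, 65, 79, 80, 93)

# IBGE-style quinquennial groups: 0-4, 5-9, ..., 75-79, 80+ (17 levels).
ibge_breaks <- c(seq(0, 80, by = 5), Inf)
ibge_labels <- c(
  paste(seq(0, 75, by = 5), seq(4, 79, by = 5), sep = "-"),
  "80+"
)
# right = FALSE makes intervals closed on the left: [0, 5), [5, 10), ...
ibge_age_group <- cut(age, breaks = ibge_breaks, labels = ibge_labels,
                      right = FALSE)

# Climate risk classification: 0-4 and 65+ are high risk, 5-64 standard risk.
climate_risk_group <- ifelse(age < 5 | age >= 65, "High risk", "Standard risk")

data.frame(age, ibge_age_group, climate_risk_group)
```

Using left-closed intervals keeps an age of exactly 5 in the 5-9 group rather than 0-4, which matches how the groups are named.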

age_breaks

Numeric vector specifying the breakpoints for age groups. Default is c(0, 5, 15, 60, Inf) for standard epidemiological categories.

age_labels

Character vector with labels for age groups. If NULL (default), generates labels automatically from breaks.
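As an illustration of automatic labelling, the sketch below derives labels from a vector of breaks. The helper make_age_labels() is hypothetical and the "lo-hi" / "lo+" format is an assumption; the labels the function actually generates may be formatted differently.

```r
# Hypothetical helper (not part of the package): builds "lo-hi" labels from
# consecutive breaks and "lo+" for an open-ended final group.
make_age_labels <- function(breaks) {
  lower <- head(breaks, -1)
  upper <- tail(breaks, -1)
  ifelse(is.infinite(upper),
         paste0(lower, "+"),
         paste0(lower, "-", upper - 1))
}

age_breaks <- c(0, 5, 15, 60, Inf)
age_labels <- make_age_labels(age_breaks)
age_labels

# The labels can then be passed to cut() together with the breaks.
age_group <- cut(c(3, 10, 59, 60), breaks = age_breaks,
                 labels = age_labels, right = FALSE)
age_group
```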

create_calendar_vars

Logical. If TRUE (default), creates calendar variables (day of week, month, season, etc.).

create_climate_vars

Logical. If TRUE (default), creates climate and seasonal variables.

climate_region

Character. Climate region used for seasonal calculations. Options: "norte", "nordeste", "centro-oeste", "sudeste", "sul".

date_col

Character string with the name of the date column. If NULL (default), auto-detects the date column.

age_col

Character string with the name of the age column (in years). If NULL (default), auto-detects and calculates age using hierarchical logic:

  1. Direct age column (if exists)

  2. Calculate from dates (event_date - birth_date)

  3. Decode DATASUS age codes (NU_IDADE_N, IDADE)
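Step 2 of this hierarchy can be sketched in base R as age in completed years. The helper age_in_years() is hypothetical and is not the package's implementation; returning NA when the event precedes the birth date is also an assumption.

```r
# Hypothetical helper: age in completed years between two Date vectors.
age_in_years <- function(birth_date, event_date) {
  b <- as.POSIXlt(birth_date)
  e <- as.POSIXlt(event_date)
  age <- e$year - b$year
  # Subtract one year when the birthday has not yet occurred in the event year.
  before_birthday <- (e$mon < b$mon) | (e$mon == b$mon & e$mday < b$mday)
  age <- age - as.integer(before_birthday)
  # Assumption: an event dated before birth is treated as invalid.
  age[!is.na(age) & age < 0] <- NA
  age
}

birth <- as.Date("1990-06-15")
event <- as.Date(c("2023-06-14", "2023-06-15", "1980-01-01"))
age_in_years(birth, event)
```

Comparing month and day explicitly avoids the drift that comes from dividing a day count by 365.25.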

hemisphere

Character string specifying the hemisphere for season calculation. Options: "south" (default, for Brazil), "north".

lang

Character string specifying the language for variable labels and messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).

verbose

Logical. If TRUE (default), prints progress messages.

Value

The input data frame with additional columns for the created variables.

Details

Age Calculation (Hierarchical Logic):

The function uses a 3-tier hierarchy to ensure age is calculated correctly across all SUS systems:

  1. Direct Age Column (Fastest): If a column with age in years already exists (common in SIM after microdatasus processing), uses it directly.

  2. Date Calculation (Gold Standard): If birth date and event date are available, calculates exact age as: interval(birth_date, event_date) / years(1). This is the most accurate method and works for:

    • SINAN: Has DTNASC and DT_NOTIFIC

    • SIH: Has NASC and DT_INTER

    • SINASC: Has DTNASCMAE (mother) and DTNASC (newborn)

  3. DATASUS Code Decoder (Fallback): If dates are missing (common in anonymized data), decodes the composite age code used by DATASUS:

    • Codes starting with 1: Hours (converted to 0 years)

    • Codes starting with 2: Days (converted to 0 years)

    • Codes starting with 3: Months (converted to 0 years for <12 months)

    • Codes starting with 4: Years (e.g., 4035 = 35 years)

    • Codes starting with 5: 100+ years (e.g., 5105 = 105 years)
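The decoding rules above can be sketched as follows. The helper decode_age_code() is hypothetical and assumes four-digit codes laid out as in the examples on this page (first digit = unit, last three digits = quantity). Code widths can differ between systems, so check the data dictionary of the file you are working with. Converting month counts with integer division by 12 is also an assumption.

```r
# Hypothetical decoder for composite age codes (unit digit + 3-digit value).
decode_age_code <- function(code) {
  code  <- suppressWarnings(as.integer(as.character(code)))
  unit  <- code %/% 1000L   # first digit: unit of measurement
  value <- code %% 1000L    # last three digits: quantity in that unit
  age   <- rep(NA_integer_, length(code))

  # Hours (1) and days (2) are always below one year.
  age[unit %in% c(1L, 2L)] <- 0L

  # Months (3): 0 years when below 12 months.
  is_months <- unit %in% 3L
  age[is_months] <- value[is_months] %/% 12L

  # Years (4) and 100+ years (5): the value is the age, e.g. 4035 and 5105.
  is_years <- unit %in% c(4L, 5L)
  age[is_years] <- value[is_years]

  age  # unknown units and unparseable codes stay NA
}

decode_age_code(c("1012", "2020", "3006", "4035", "5105", NA, "9999"))
```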

Age Groups: Creates a factor variable age_group based on the specified breaks and labels. Common epidemiological categories:

  • Pediatric: 0-4, 5-14, 15-19

  • Adult: 20-39, 40-59, 60+

  • Elderly: 65-74, 75-84, 85+

  • Climate-Health: 0-4, 5-64, 65+ (vulnerable populations)

Calendar Variables (when create_calendar_vars = TRUE):

  • day_of_week: Day of the week (1 = Monday, 7 = Sunday)

  • day_of_week_name: Day name (e.g., "Monday", "Segunda-feira")

  • month: Month number (1-12)

  • month_name: Month name (e.g., "January", "Janeiro")

  • year: Year

  • quarter: Quarter (1-4)

  • season: Season (Summer, Autumn, Winter, Spring)

  • is_weekend: Logical indicating if date is weekend

  • day_of_year: Day of year (1-365/366)

  • semester: Semester (1 or 2)
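Most of these variables can be derived from a Date column with base R alone. The sketch below is one possible approach, not the package's code; the sample dates are made up. Name variables such as day_of_week_name are omitted because weekdays() and months() depend on the session locale.

```r
dates <- as.Date(c("2023-01-01", "2023-07-15", "2024-12-31"))
lt <- as.POSIXlt(dates)

calendar <- data.frame(
  date        = dates,
  # POSIXlt uses 0 = Sunday; remap so that 1 = Monday and 7 = Sunday.
  day_of_week = ifelse(lt$wday == 0, 7L, lt$wday),
  month       = lt$mon + 1L,      # POSIXlt months are 0-based
  year        = lt$year + 1900L,  # POSIXlt years count from 1900
  day_of_year = lt$yday + 1L      # POSIXlt day of year is 0-based
)
calendar$quarter    <- (calendar$month - 1L) %/% 3L + 1L
calendar$semester   <- ifelse(calendar$month <= 6L, 1L, 2L)
calendar$is_weekend <- calendar$day_of_week >= 6L

calendar
```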

Seasons are calculated based on hemisphere:

  • Southern Hemisphere (Brazil): Summer (Dec-Feb), Autumn (Mar-May), Winter (Jun-Aug), Spring (Sep-Nov)

  • Northern Hemisphere: Summer (Jun-Aug), Autumn (Sep-Nov), Winter (Dec-Feb), Spring (Mar-May)
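The month-to-season mapping above can be written as a small lookup. The helper get_season() is hypothetical, and English season names are used for illustration; the function itself labels seasons according to lang.

```r
# Hypothetical helper: season from the month, following the table above.
get_season <- function(date, hemisphere = c("south", "north")) {
  hemisphere <- match.arg(hemisphere)
  month <- as.POSIXlt(date)$mon + 1L
  # Northern Hemisphere seasons, indexed by month number (Jan = 1).
  north <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
             "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
  # The Southern Hemisphere is the same calendar shifted by six months.
  south <- north[c(7:12, 1:6)]
  if (hemisphere == "south") south[month] else north[month]
}

d <- as.Date(c("2023-01-15", "2023-04-15", "2023-07-15", "2023-10-15"))
get_season(d)                        # default: "south"
get_season(d, hemisphere = "north")
```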

Examples

if (FALSE) { # \dontrun{
library(climasus4r)

# ===== EXAMPLE 1: SIM (Mortality) - Age already calculated =====
df_sim <- sus_data_import(uf = "SP", year = 2023, system = "SIM-DO") |>
  sus_data_standardize(lang = "en") |>
  sus_create_variables(
    create_age_groups = TRUE,
    age_breaks = c(0, 5, 65, Inf),
    age_labels = c("0-4", "5-64", "65+"),
    create_calendar_vars = TRUE,
    lang = "en"
  )
# Uses direct age column (fastest)

# ===== EXAMPLE 2: SINAN (Dengue) - Calculate from dates =====
df_sinan <- sus_data_import(uf = "RJ", year = 2023, system = "SINAN-DENGUE") |>
  sus_data_standardize(lang = "pt") |>
  sus_create_variables(
    create_age_groups = TRUE,
    age_breaks = c(0, 15, 60, Inf),
    create_calendar_vars = TRUE,
    lang = "pt"
  )
# Calculates age from DTNASC and DT_NOTIFIC (gold standard)

# ===== EXAMPLE 3: SIH (Hospitalizations) - Decode age codes =====
df_sih <- sus_data_import(uf = "MG", year = 2023, system = "SIH-RD") |>
  sus_data_standardize(lang = "es") |>
  sus_create_variables(
    create_age_groups = TRUE,
    age_breaks = c(0, 18, 60, Inf),
    age_labels = c("0-17", "18-59", "60+"),
    create_calendar_vars = TRUE,
    lang = "es"
  )
# Decodes DATASUS age codes if dates are missing (fallback)

# ===== EXAMPLE 4: Custom age groups for elderly analysis =====
df_elderly <- sus_create_variables(
  df,
  create_age_groups = TRUE,
  age_breaks = c(60, 70, 80, 90, Inf),
  age_labels = c("60-69", "70-79", "80-89", "90+"),
  lang = "pt"
)

# ===== EXAMPLE 5: Calendar variables and climate variables =====
# semester is part of the calendar variables, so no separate flag is needed
df_calendar_climate <- sus_create_variables(
  df,
  create_calendar_vars = TRUE,
  create_climate_vars = TRUE,
  climate_region = "norte"
)
} # }