Standardize SUS data column names and values — sus_data

This function standardizes column names and categorical values in SUS datasets, ensuring consistency across different years and versions. It supports three languages: English (en), Portuguese (pt), and Spanish (es).

Usage

sus_data_standardize(
  df,
  lang = "pt",
  translate_columns = TRUE,
  standardize_values = TRUE,
  keep_original = FALSE,
  backend = "arrow",
  verbose = TRUE
)

Arguments

df

A data.frame or tibble to be standardized (typically output from sus_data_import()).

lang

Character. Output language for column names and values. Options: "en" (English), "pt" (Portuguese), "es" (Spanish). Default is "pt".

translate_columns

Logical. If TRUE, translates column names. Default is TRUE.
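As a rough illustration of what column-name translation involves, the sketch below renames columns through a lookup table in base R. The dictionary `col_dict_en`, the helper `translate_names()`, and the English names are hypothetical and are not taken from the package; only the general technique is shown.

```r
# Toy input using DATASUS-style column names.
df_raw <- data.frame(
  SEXO = c("Masculino", "Feminino"),
  IDADE = c(34, 61),
  CODMUNRES = c("355030", "330455"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary: original name -> English name (assumed mapping).
col_dict_en <- c(SEXO = "sex", IDADE = "age")

# Rename only the columns found in the dictionary; leave the rest untouched.
translate_names <- function(df, dict) {
  hit <- names(df) %in% names(dict)
  names(df)[hit] <- unname(dict[names(df)[hit]])
  df
}

df_en <- translate_names(df_raw, col_dict_en)
names(df_en)
#> [1] "sex"       "age"       "CODMUNRES"
```

Columns without a dictionary entry keep their original name in this sketch; whether the package behaves the same way is not stated here.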

standardize_values

Logical. If TRUE, standardizes categorical values. Default is TRUE.
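Value standardization can be pictured as a lookup from source labels to target-language labels. The following is a minimal sketch in base R; the dictionary `sex_dict_en` and its labels are assumptions made for illustration and may differ from what the package uses.

```r
# Toy input: labels as they might appear after Portuguese preprocessing.
sexo <- c("Masculino", "Feminino", "Ignorado", "Masculino", NA)

# Hypothetical value dictionary (assumed labels, not the package's).
sex_dict_en <- c(Masculino = "Male", Feminino = "Female", Ignorado = "Unknown")

# Look up each value by name; NA and unmatched values come back as NA.
sex_en <- unname(sex_dict_en[sexo])
sex_en
#> [1] "Male"    "Female"  "Unknown" "Male"    NA
```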

keep_original

Logical. If TRUE, keeps original columns alongside standardized ones. Default is FALSE.
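The effect of keeping or dropping the source column can be sketched as follows. The helper `standardize_sex()`, the dictionary, and the naming of the new column are all hypothetical; how the package names retained original columns is not documented here.

```r
df_raw <- data.frame(SEXO = c("Masculino", "Feminino"), stringsAsFactors = FALSE)

# Hypothetical value dictionary (assumed labels).
sex_dict_en <- c(Masculino = "Male", Feminino = "Female")

# Hypothetical helper: add a standardized column, optionally keep the source.
standardize_sex <- function(df, keep_original = FALSE) {
  df$sex <- unname(sex_dict_en[df$SEXO])  # new standardized column
  if (!keep_original) df$SEXO <- NULL     # drop the source column
  df
}

names_default <- names(standardize_sex(df_raw))
names_both <- names(standardize_sex(df_raw, keep_original = TRUE))

names_default
#> [1] "sex"
names_both
#> [1] "SEXO" "sex"
```

Keeping both columns is useful for spot-checking a mapping, at the cost of a wider table.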

backend

Character string specifying the data processing backend. Use "arrow" for out-of-memory, lazy processing (recommended for large datasets), or "tibble" for in-memory processing (recommended for small to medium datasets).

"arrow": operations are performed lazily using the Apache Arrow engine, avoiding loading the full dataset into memory. Ideal for large files (e.g., Parquet, Feather) and high-performance workflows.
"tibble": data is fully loaded into memory as a tibble and processed eagerly using dplyr. Simpler and more predictable, but may be slow or fail for large datasets.

The usage above lists "arrow" as the default. The function may also choose the backend automatically based on the input data type.
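The difference between lazy and eager processing can be seen with the arrow and dplyr packages directly. This sketch does not call `sus_data_standardize()` and says nothing about its internals; the column names and the filter are invented for illustration, and the code runs only when both packages are installed.

```r
# Runs only if both packages are available.
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  df <- data.frame(SEXO = c("Masculino", "Feminino"), IDADE = c(34, 61))

  # Lazy: dplyr verbs on an Arrow Table build a query; nothing is computed yet.
  lazy_query <- arrow::arrow_table(df) |>
    dplyr::rename(sex = SEXO, age = IDADE) |>
    dplyr::filter(age > 40)

  # collect() runs the query and returns the result as an in-memory tibble.
  result_arrow <- dplyr::collect(lazy_query)

  # Eager: the same verbs on a tibble are evaluated immediately.
  result_tibble <- dplyr::as_tibble(df) |>
    dplyr::rename(sex = SEXO, age = IDADE) |>
    dplyr::filter(age > 40)
}
```

Both paths should yield the same rows; the lazy one defers the work until `collect()`, which is what makes it attractive for data that does not fit in memory.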

verbose

Logical. If TRUE, prints a report of standardization actions. Default is TRUE.

Value

A data.frame with standardized column names and values in the specified language.

Details

The function builds on the preprocessing done by microdatasus, adding a layer of standardization designed for climate-health research workflows.

References

Brazilian Ministry of Health. DATASUS. http://datasus.saude.gov.br

SALDANHA, Raphael de Freitas; BASTOS, Ronaldo Rocha; BARCELLOS, Christovam. Microdatasus: pacote para download e pre-processamento de microdados do Departamento de Informatica do SUS (DATASUS). Cad. Saude Publica, Rio de Janeiro, v. 35, n. 9, e00032419, 2019. Available from https://doi.org/10.1590/0102-311x00032419.

Examples

if (FALSE) { # \dontrun{
# Standardize to English
df_en <- sus_data_standardize(df_raw, lang = "en")

# Standardize to Portuguese (default)
df_pt <- sus_data_standardize(df_raw, lang = "pt")

# Standardize to Spanish
df_es <- sus_data_standardize(df_raw, lang = "es")

# Keep original columns for comparison
df_both <- sus_data_standardize(
  df_raw,
  lang = "pt",
  keep_original = TRUE
)

# Only translate column names (not values)
df_cols_only <- sus_data_standardize(
  df_raw,
  lang = "en",
  translate_columns = TRUE,
  standardize_values = FALSE
)

# Complete pipeline
df_analysis_ready <- sus_data_import(uf = "SP", year = 2023, system = "SIM-DO") |>
  sus_data_clean_encoding() |>
  sus_data_standardize(lang = "pt")
} # }