Detect and correct character encoding issues — sus_data_clean

This function scans the text columns of a data.frame and corrects common encoding problems (e.g., "São Paulo" appearing with garbled or missing accents) that occur when Latin1 data is incorrectly read as UTF-8. It acts as a final audit step that ensures all text data is properly encoded, complementing the preprocessing done by microdatasus. Output messages are available in English, Portuguese, and Spanish.
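
As a rough illustration of the problem described above, the base R sketch below builds a string that holds Latin1 bytes wrongly declared as UTF-8, detects it with validUTF8(), and repairs it with iconv(). The helper name fix_latin1_as_utf8 is hypothetical, and this is not necessarily how sus_data_clean_encoding works internally.

```r
# Hypothetical helper (not part of the package): repair strings whose bytes
# are Latin1 but which were declared as UTF-8.
fix_latin1_as_utf8 <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)   # TRUE where the bytes are not valid UTF-8
  x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
  x
}

# Build a broken value: Latin1 bytes declared as UTF-8
broken <- iconv("S\u00e3o Paulo", from = "UTF-8", to = "latin1")
Encoding(broken) <- "UTF-8"

is_valid_before <- validUTF8(broken)
fixed <- fix_latin1_as_utf8(c(broken, "Rio de Janeiro"))
is_valid_after <- all(validUTF8(fixed))
```

Only the values that fail the validity check are converted, so strings that are already valid UTF-8 (including plain ASCII) pass through unchanged.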

Usage

sus_data_clean_encoding(df, backend = "arrow", lang = "pt", verbose = TRUE)

Arguments

df

A data.frame or tibble to be cleaned.

backend

Character string specifying the data processing backend: "arrow" for lazy, out-of-memory processing (recommended for large datasets) or "tibble" for in-memory processing (recommended for small to medium datasets).

"arrow": operations are performed lazily using the Apache Arrow engine, which avoids loading the full dataset into memory. Ideal for large files (e.g., Parquet, Feather) and high-performance workflows.
"tibble": data is fully loaded into memory as a tibble and processed eagerly using dplyr. Simpler and more predictable, but may be slow or run out of memory on large datasets.

If not specified, the function may automatically choose the backend based on the input data type.
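
One way such an automatic choice could work is to inspect the class of the input. The sketch below is an assumption for illustration only: choose_backend is a hypothetical helper, not a function of this package, and the package's real rule may differ. The class names checked are those used by the arrow R package for tables, record batches, datasets, and lazy dplyr queries.

```r
# Hypothetical helper: guess a backend from the class of the input.
# Arrow objects map to "arrow"; anything else (data.frame, tibble) to "tibble".
choose_backend <- function(x) {
  arrow_classes <- c("ArrowTabular", "Dataset", "arrow_dplyr_query")
  if (inherits(x, arrow_classes)) "arrow" else "tibble"
}

backend_for_df <- choose_backend(data.frame(id = 1:3))
```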

lang

Character. Language for UI messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).

verbose

Logical. If TRUE, prints a report of columns checked and corrected. Default is TRUE.
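
To give a sense of what a report of checked and corrected columns could be based on, the sketch below counts invalid UTF-8 values in each character column of a small data.frame using base R. The object names (df, report) and the report layout are assumptions made for illustration; the messages actually printed by the function are not reproduced here.

```r
# Sample data with one broken value: Latin1 bytes declared as UTF-8
broken <- iconv("S\u00e3o Paulo", from = "UTF-8", to = "latin1")
Encoding(broken) <- "UTF-8"
df <- data.frame(
  id = 1:3,
  city = c(broken, "Rio de Janeiro", "Belo Horizonte"),
  state = c("SP", "RJ", "MG"),
  stringsAsFactors = FALSE
)

# Hypothetical report: count invalid UTF-8 values in each character column
text_cols <- names(df)[vapply(df, is.character, logical(1))]
n_invalid <- vapply(
  df[text_cols],
  function(col) sum(!validUTF8(col[!is.na(col)])),
  integer(1)
)
report <- data.frame(column = text_cols, invalid_values = unname(n_invalid))
print(report)
```

Non-character columns such as id are skipped, which mirrors the description of the function as operating on text columns only.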

Value

A data.frame with corrected text columns.

Examples

if (FALSE) { # \dontrun{
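# Create a sample dataset; in real data, encoding issues typically affect
# accented Brazilian Portuguese text such as "São Paulo"
df_problem <- data.frame(
  id = 1:3,
  city = c("Sao Paulo", "Rio de Janeiro", "Belo Horizonte"),
  state = c("SP", "RJ", "MG"),
  stringsAsFactors = FALSE
)

# Simulate an encoding issue (for demonstration only): store "São Paulo" as
# Latin1 bytes but declare them UTF-8, as happens when Latin1 text is read as UTF-8
city_broken <- iconv("S\u00e3o Paulo", from = "UTF-8", to = "latin1")
Encoding(city_broken) <- "UTF-8"
df_problem$city[1] <- city_broken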

# Correct encoding with English messages
df_clean_en <- sus_data_clean_encoding(df_problem, lang = "en")

# Correct encoding with Portuguese messages
df_clean_pt <- sus_data_clean_encoding(df_problem, lang = "pt")

# Correct encoding with Spanish messages
df_clean_es <- sus_data_clean_encoding(df_problem, lang = "es")

# Use in a pipeline (%>% is provided by magrittr and re-exported by dplyr)
library(dplyr)
df_clean <- sus_data_import(uf = "RJ", year = 2022, system = "SIM") %>%
  sus_data_clean_encoding(lang = "pt")
} # }