This function scans the character columns of a data.frame and corrects common encoding problems (e.g., "S\xe3o Paulo" instead of "São Paulo") that occur when Latin1 data is incorrectly read as UTF-8. It acts as a final audit step to ensure all text data is valid UTF-8, complementing the preprocessing done by microdatasus. Output messages are available in English, Portuguese, and Spanish.
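The failure mode this function targets can be reproduced with base R alone. A minimal sketch of how the broken state arises and how it is repaired, using only base R's iconv() and validUTF8() (this sketch does not call the package itself):

```r
# Start from correct UTF-8 text
good <- "S\u00e3o Paulo"            # "São Paulo"

# Convert to Latin1 bytes, then mis-declare them as UTF-8:
# this is the broken state that encoding cleaning must detect
bad <- iconv(good, from = "UTF-8", to = "latin1")
Encoding(bad) <- "UTF-8"
validUTF8(bad)                      # FALSE: the bytes are not valid UTF-8

# Repair: re-interpret the raw bytes with the correct source encoding
fixed <- iconv(bad, from = "latin1", to = "UTF-8")
identical(fixed, good)              # TRUE
```

sus_data_clean_encoding() applies this kind of repair across every character column of a data.frame; the sketch above only shows the per-string mechanics.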

Usage

sus_data_clean_encoding(df, lang = "pt", verbose = TRUE)

Arguments

df

A data.frame or tibble to be cleaned.

lang

Character. Language for UI messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).

verbose

Logical. If TRUE, prints a report of columns checked and corrected. Default is TRUE.

Value

A data.frame with corrected text columns.

Examples

if (FALSE) { # \dontrun{
# Create a sample dataset with Brazilian Portuguese text
df_problem <- data.frame(
  id = 1:3,
  city = c("São Paulo", "Rio de Janeiro", "Belo Horizonte"),
  state = c("SP", "RJ", "MG"),
  stringsAsFactors = FALSE
)

# Simulate the encoding issue (for demonstration only):
# convert to Latin1 bytes and mis-declare them as UTF-8,
# as happens when Latin1 text is read as UTF-8
df_problem$city <- iconv(df_problem$city, from = "UTF-8", to = "latin1")
Encoding(df_problem$city) <- "UTF-8"
# Correct encoding with English messages
df_clean_en <- sus_data_clean_encoding(df_problem, lang = "en")

# Correct encoding with Portuguese messages
df_clean_pt <- sus_data_clean_encoding(df_problem, lang = "pt")

# Correct encoding with Spanish messages
df_clean_es <- sus_data_clean_encoding(df_problem, lang = "es")

# Use in a pipeline
df_clean <- sus_data_import(uf = "RJ", year = 2022, system = "SIM") %>%
  sus_data_clean_encoding(lang = "pt")
} # }