Detect and correct character encoding issues
Source: R/sus_data_clean_encoding.R
This function scans text columns in a data.frame and corrects common encoding
problems (e.g., "SÃ£o Paulo" appearing in place of "São Paulo") that occur when
text is read with the wrong encoding, typically a Latin1/UTF-8 mismatch. It acts
as a final check that all text data is properly encoded, complementing the
preprocessing done by microdatasus.
Supports multilingual output messages (English, Portuguese, Spanish).
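The kind of damage described above can be reproduced and reversed in base R. The following is a minimal sketch, not the package's implementation: `fix_mojibake()` is a hypothetical helper, and it only covers the case where UTF-8 bytes were reinterpreted as Latin1.

```r
# "São Paulo", written with an escape so the source stays ASCII
original <- "S\u00e3o Paulo"

# Simulate the damage: treat the UTF-8 bytes as if they were Latin1
garbled <- iconv(original, from = "latin1", to = "UTF-8")

# Hypothetical helper (an assumption for illustration only):
# undo the misinterpretation, keeping the input where the round trip fails
fix_mojibake <- function(x) {
  repaired <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(repaired) <- "UTF-8"  # the recovered bytes are UTF-8 again
  ok <- !is.na(repaired) & validUTF8(repaired)
  x[ok] <- repaired[ok]          # only replace values that repaired cleanly
  x
}

fixed <- fix_mojibake(garbled)
```

Text that is already correct is left alone, because converting it to Latin1 produces bytes that are not valid UTF-8 and the helper then falls back to the input.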
Arguments
- df
A data.frame or tibble to be cleaned.
- backend
Character string specifying the data processing backend. Use "arrow" for out-of-memory, lazy processing (recommended for large datasets), or "tibble" for in-memory processing (recommended for small to medium datasets).
  - "arrow": operations are performed lazily using the Apache Arrow engine, avoiding loading the full dataset into memory. Ideal for large files (e.g., Parquet, Feather) and high-performance workflows.
  - "tibble": data is fully loaded into memory as a tibble and processed eagerly using dplyr. Simpler and more predictable, but may be slow or fail for large datasets.
If not specified, the function may automatically choose the backend based on the input data type.
- lang
Character. Language for UI messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).
- verbose
Logical. If TRUE, prints a report of columns checked and corrected. Default is TRUE.
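To give a rough idea of what a per-column report could contain, here is a minimal base R sketch. It is an assumption about the general approach, not the function's actual logic: `report_invalid_utf8()` is a hypothetical helper that counts, for each character column, the values whose bytes are not valid UTF-8 (which is what Latin1 bytes look like when they are treated as UTF-8).

```r
# Hypothetical helper (illustration only): count invalid UTF-8 values
# in every character column of a data.frame
report_invalid_utf8 <- function(df) {
  is_text <- vapply(df, is.character, logical(1))
  vapply(df[is_text], function(col) sum(!validUTF8(col)), integer(1))
}

# "\xe3" is the Latin1 byte for "ã"; on its own it is not valid UTF-8
df_raw <- data.frame(
  id = 1:2,
  city = c("S\xe3o Paulo", "Recife"),
  stringsAsFactors = FALSE
)

before <- report_invalid_utf8(df_raw)

# Declare the bytes as Latin1 and convert them to UTF-8
df_raw$city <- iconv(df_raw$city, from = "latin1", to = "UTF-8")

after <- report_invalid_utf8(df_raw)
```

Here `before` flags one problematic value in `city` and `after` flags none; non-character columns such as `id` are skipped.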
Examples
if (FALSE) { # \dontrun{
# Create a sample dataset with encoding issues
# In real data, this might happen with Brazilian Portuguese text
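df_problem <- data.frame(
  id = 1:3,
  city = c("S\u00e3o Paulo", "Rio de Janeiro", "Belo Horizonte"),
  state = c("SP", "RJ", "MG"),
  stringsAsFactors = FALSE
)
# Simulate the encoding issue (for demonstration only) by reinterpreting
# the UTF-8 bytes as Latin1, which turns "São Paulo" into "SÃ£o Paulo"
# In practice, this happens when a file is read with the wrong encoding
df_problem$city <- iconv(df_problem$city, from = "latin1", to = "UTF-8")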
# Correct encoding with English messages
df_clean_en <- sus_data_clean_encoding(df_problem, lang = "en")
# Correct encoding with Portuguese messages
df_clean_pt <- sus_data_clean_encoding(df_problem, lang = "pt")
# Correct encoding with Spanish messages
df_clean_es <- sus_data_clean_encoding(df_problem, lang = "es")
# Use in a pipeline
df_clean <- sus_data_import(uf = "RJ", year = 2022, system = "SIM") %>%
  sus_data_clean_encoding(lang = "pt")
} # }
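The `backend` argument is described as possibly being chosen from the input data type when it is not given. One way such a choice could work is sketched below. This is an assumption for illustration: `guess_backend()` is a hypothetical helper, not part of the package, and it relies only on the class names that Arrow objects carry in R.

```r
# Hypothetical helper (illustration only): pick a backend from the input class
guess_backend <- function(df) {
  # Class names used by Arrow tables, datasets, and lazy dplyr queries
  arrow_classes <- c("ArrowTabular", "Dataset", "arrow_dplyr_query")
  if (inherits(df, arrow_classes)) "arrow" else "tibble"
}

# An ordinary data.frame is processed in memory
backend_df <- guess_backend(data.frame(x = 1))

# A stand-in object with an Arrow query class, so the arrow package
# is not needed to run this sketch
lazy_stub <- structure(list(), class = "arrow_dplyr_query")
backend_lazy <- guess_backend(lazy_stub)
```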