Detect and correct character encoding issues
Source: R/sus_data_clean_encoding.R
This function scans text columns in a data.frame and corrects common encoding
problems (e.g., "São Paulo" displayed with a garbled "ã") that occur when Latin1 data is
incorrectly read as UTF-8. It acts as a final auditor to ensure all text data
is properly encoded, complementing the preprocessing done by microdatasus.
Output messages are available in English, Portuguese, and Spanish.
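The kind of repair described above can be sketched in base R. This is only an illustration, not the implementation of sus_data_clean_encoding, which this page does not show. It assumes the problem strings are Latin1 bytes that fail UTF-8 validation; the helper name fix_latin1_as_utf8 and the demo data frame are hypothetical.

```r
# Hypothetical helper (not part of the package): strings that are not valid
# UTF-8 are treated as Latin1 bytes and converted to UTF-8.
fix_latin1_as_utf8 <- function(x) {
  stopifnot(is.character(x))
  bad <- !is.na(x) & !validUTF8(x)   # entries whose bytes are not valid UTF-8
  x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
  x
}

# Build a Latin1-encoded "São Paulo" and mislabel its bytes as UTF-8,
# which mimics Latin1 text that was read as if it were UTF-8.
latin1_city <- iconv("S\u00e3o Paulo", from = "UTF-8", to = "latin1")
Encoding(latin1_city) <- "UTF-8"

demo <- data.frame(
  id = 1:3,
  city = c(latin1_city, "Rio de Janeiro", NA),
  stringsAsFactors = FALSE
)

# Apply the helper to every character column, leaving other columns untouched
demo[] <- lapply(demo, function(col) {
  if (is.character(col)) fix_latin1_as_utf8(col) else col
})
fixed <- demo$city
```

Checking validity first means strings that are already valid UTF-8 pass through unchanged, so the helper can be run on data that has been partly cleaned upstream.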
Examples
if (FALSE) { # \dontrun{
# Create a sample dataset containing accented Brazilian Portuguese text
df_problem <- data.frame(
  id = 1:3,
  city = c("S\u00e3o Paulo", "Rio de Janeiro", "Belo Horizonte"),
  state = c("SP", "RJ", "MG"),
  stringsAsFactors = FALSE
)
# Simulate an encoding issue (for demonstration only): convert the text to
# Latin1 bytes, then mislabel those bytes as UTF-8.
# In practice, this happens when Latin1 text is read as UTF-8
df_problem$city <- iconv(df_problem$city, from = "UTF-8", to = "latin1")
Encoding(df_problem$city) <- "UTF-8"
# Correct encoding with English messages
df_clean_en <- sus_data_clean_encoding(df_problem, lang = "en")
# Correct encoding with Portuguese messages
df_clean_pt <- sus_data_clean_encoding(df_problem, lang = "pt")
# Correct encoding with Spanish messages
df_clean_es <- sus_data_clean_encoding(df_problem, lang = "es")
# Use in a pipeline
df_clean <- sus_data_import(uf = "RJ", year = 2022, system = "SIM") %>%
  sus_data_clean_encoding(lang = "pt")
} # }