Import and preprocess data from DATASUS with intelligent caching
Source: R/sus_data_import.R
This function acts as a wrapper for microdatasus::fetch_datasus,
simplifying the download and reading of data from Brazilian public health
information systems (SIM, SINAN, SIH, SIA, CNES, SINASC).
It includes parallel processing, caching, and user-friendly CLI feedback.
Usage
sus_data_import(
uf = NULL,
region = NULL,
year,
month = NULL,
system,
use_cache = TRUE,
cache_dir = "~/.climasus4r_cache/data",
force_redownload = FALSE,
parallel = FALSE,
workers = 4,
lang = "pt",
verbose = TRUE
)
Arguments
- uf
A string or vector of strings with state abbreviations, e.g., "AM" or c("SP", "RJ"). Ignored if 'region' is provided. Valid UF codes: AC, AL, AP, AM, BA, CE, DF, ES, GO, MA, MT, MS, MG, PA, PB, PR, PE, PI, RJ, RN, RS, RO, RR, SC, SP, SE, TO.
- region
A string indicating a predefined group of states (supports multilingual names PT, EN, ES). Available regions:
IBGE Macro-regions:
"norte": c("AC", "AP", "AM", "PA", "RO", "RR", "TO")
"nordeste": c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE")
"centro_oeste": c("DF", "GO", "MT", "MS")
"sudeste": c("ES", "MG", "RJ", "SP")
"sul": c("PR", "RS", "SC")
Biomes (Ecological Borders):
"amazonia_legal": c("AC", "AP", "AM", "PA", "RO", "RR", "MT", "MA", "TO")
"mata_atlantica": c("AL", "BA", "CE", "ES", "GO", "MA", "MG", "MS", "PB", "PE", "PI", "PR", "RJ", "RN", "RS", "SC", "SE", "SP")
"caatinga": c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE", "MG")
"cerrado": c("BA", "DF", "GO", "MA", "MG", "MS", "MT", "PA", "PI", "PR", "RO", "SP", "TO")
"pantanal": c("MT", "MS")
"pampa": c("RS")
Hydrography & Climate:
"bacia_amazonia": c("AC", "AM", "AP", "MT", "PA", "RO", "RR")
"bacia_sao_francisco": c("AL", "BA", "DF", "GO", "MG", "PE", "SE")
"bacia_parana": c("GO", "MG", "MS", "PR", "SP")
"bacia_tocantins": c("GO", "MA", "PA", "TO")
"semi_arido": c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE", "MG")
Health, Agriculture & Geopolitics:
"matopiba": c("MA", "TO", "PI", "BA")
"arco_desmatamento": c("RO", "AC", "AM", "PA", "MT", "MA")
"dengue_hyperendemic": c("GO", "MS", "MT", "PR", "RJ", "SP")
"sudene": c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE", "MG", "ES")
"fronteira_brasil": c("AC", "AM", "AP", "MT", "MS", "PA", "PR", "RO", "RR", "RS", "SC")
- year
An integer or vector of integers with the desired years (4 digits).
- month
An integer or vector of integers with the desired months (1-12). This argument is only used with monthly-based health information systems: SIH, CNES, and SIA. For annual systems (SIM, SINAN, SINASC), this parameter is ignored.
- system
A string indicating the information system. Available systems:
Mortality Systems (SIM - Mortality Information System):
"SIM-DO": Death certificates (Declaracoes de Obito) - Complete dataset
"SIM-DOFET": Fetal deaths (Obitos Fetais)
"SIM-DOEXT": External causes deaths (Obitos por Causas Externas)
"SIM-DOINF": Infant deaths (Obitos Infantis)
"SIM-DOMAT": Maternal deaths (Obitos Maternos)
Hospitalization Systems (SIH - Hospital Information System):
"SIH-RD": Reduced Hospital Admission Authorizations (AIH Reduzida - Autorizacoes de Internacao Hospitalar)
"SIH-RJ": Rejected Hospital Admission Authorizations (AIH Rejeitadas)
"SIH-SP": Professional Services (Servicos Profissionais)
"SIH-ER": Rejected Hospital Admission Authorizations with error code (AIH Rejeitadas com codigo de erro)
Notifiable Diseases (SINAN - Notifiable Diseases Information System):
"SINAN-DENGUE": Dengue fever cases
"SINAN-CHIKUNGUNYA": Chikungunya cases
"SINAN-ZIKA": Zika virus cases
"SINAN-MALARIA": Malaria cases
"SINAN-CHAGAS": Chagas disease cases
"SINAN-LEISHMANIOSE-VISCERAL": Visceral leishmaniasis cases
"SINAN-LEISHMANIOSE-TEGUMENTAR": Cutaneous leishmaniasis cases
"SINAN-LEPTOSPIROSE": Leptospirosis cases
Outpatient Systems (SIA - Outpatient Information System):
"SIA-AB": APAC for Bariatric Surgery (APAC de Cirurgia Bariatrica)
"SIA-ABO": APAC for Post-Bariatric Surgery Follow-up (APAC de Acompanhamento Pos Cirurgia Bariatrica)
"SIA-ACF": APAC for Fistula Creation (APAC de Confeccao de Fistula)
"SIA-AD": APAC for Miscellaneous Reports (APAC de Laudos Diversos)
"SIA-AN": APAC for Nephrology (APAC de Nefrologia)
"SIA-AM": APAC for Medications (APAC de Medicamentos)
"SIA-AQ": APAC for Chemotherapy (APAC de Quimioterapia)
"SIA-AR": APAC for Radiotherapy (APAC de Radioterapia)
"SIA-ATD": APAC for Dialysis Treatment (APAC de Tratamento Dialitico)
"SIA-PA": Outpatient Production (Producao Ambulatorial)
"SIA-PS": Psychosocial Care Records (RAAS Psicossocial)
"SIA-SAD": Home Care Records (RAAS de Atencao Domiciliar)
Health Establishments (CNES - National Health Establishment Registry):
"CNES-LT": Beds (Leitos)
"CNES-ST": Establishments (Estabelecimentos)
"CNES-DC": Complementary Data (Dados Complementares)
"CNES-EQ": Equipment (Equipamentos)
"CNES-SR": Specialized Services (Servicos Especializados)
"CNES-HB": Accreditations (Habilitacao)
"CNES-PF": Health Professionals (Profissional)
"CNES-EP": Health Teams (Equipes)
"CNES-RC": Contractual Rules (Regra Contratual)
"CNES-IN": Incentives (Incentivos)
"CNES-EE": Teaching Establishments (Estabelecimento de Ensino)
"CNES-EF": Philanthropic Establishments (Estabelecimento Filantropico)
"CNES-GM": Management and Goals (Gestao e Metas)
Live Births (SINASC - Live Birth Information System):
"SINASC": Live Birth Declarations (Declaracoes de Nascidos Vivos)
- use_cache
Logical. If TRUE (default), will use cached data to avoid re-downloads. Cache is based on UF, year, month, and system parameters.
- cache_dir
Character. Directory to store cached files. Default is "~/.climasus4r_cache/data".
- force_redownload
Logical. If TRUE, ignores cache and re-downloads everything. Useful when you suspect cached data is corrupted or outdated.
- parallel
Logical. If TRUE, uses parallel processing for multiple UF/year combinations, which can significantly speed up bulk downloads. Default is FALSE.
- workers
Integer. Number of parallel workers to use. Default is 4. Set to 1 to disable parallel processing.
- lang
Character string specifying the language for variable labels and messages. Options:
"en" (English)
"pt" (Portuguese, default)
"es" (Spanish)
- verbose
Logical. If TRUE (default), prints detailed progress information including cache status, download progress, and time estimates.
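The precedence between 'uf' and 'region' described above can be illustrated with a minimal sketch. The lookup table below copies only a few of the documented groups, and `region_groups` and `resolve_ufs()` are hypothetical names used for illustration, not functions exported by the package. The upper-casing of 'uf' and the error messages are also assumptions.

```r
# Hypothetical lookup table: a subset of the region groups documented above.
region_groups <- list(
  norte        = c("AC", "AP", "AM", "PA", "RO", "RR", "TO"),
  centro_oeste = c("DF", "GO", "MT", "MS"),
  sul          = c("PR", "RS", "SC"),
  matopiba     = c("MA", "TO", "PI", "BA")
)

# Hypothetical helper: return the UF vector a call would operate on.
# 'region' takes precedence over 'uf', as stated in the argument description.
resolve_ufs <- function(uf = NULL, region = NULL) {
  if (!is.null(region)) {
    key <- tolower(region)
    if (!key %in% names(region_groups)) {
      stop("Unknown region: ", region)
    }
    return(region_groups[[key]])
  }
  if (is.null(uf)) {
    stop("Provide either 'uf' or 'region'.")
  }
  toupper(uf)  # assumption: abbreviations are normalized to upper case
}

resolve_ufs(uf = "sp", region = "sul")  # region wins: "PR" "RS" "SC"
resolve_ufs(uf = c("sp", "rj"))         # "SP" "RJ"
```

Resolving the region to a plain UF vector up front keeps the rest of a download pipeline independent of how the states were specified.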
Value
A tibble (or data.frame) with the requested data, combining
multiple UFs/years when requested. The output includes:
All original variables from the DATASUS system
Additional metadata columns: source_system, download_timestamp
Standardized date formats (Date objects instead of strings)
UTF-8 encoded character variables
Note: Large datasets (especially SIA and SIH) may require significant memory (1GB+ for national annual data).
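A quick sanity check on a returned table might look like the sketch below. It only tests for the two metadata columns named above and lists which columns are Date objects. `check_import_metadata()` is a hypothetical helper, and the toy data frame stands in for real output, so no download is needed to run it.

```r
# Hypothetical check, not part of the package: report missing metadata
# columns and list the columns stored as Date objects.
check_import_metadata <- function(df) {
  required <- c("source_system", "download_timestamp")
  missing_cols <- setdiff(required, names(df))
  date_cols <- names(df)[vapply(df, inherits, logical(1), what = "Date")]
  list(missing_metadata = missing_cols, date_columns = date_cols)
}

# Toy stand-in for imported data (values are illustrative only).
toy <- data.frame(
  DTOBITO = as.Date("2022-03-15"),
  source_system = "SIM-DO",
  download_timestamp = as.POSIXct("2024-01-01 12:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

check_import_metadata(toy)
```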
Details
Data Sources
All data is sourced from the Brazilian Ministry of Health's DATASUS portal (http://datasus.saude.gov.br).
Caching System
The cache uses SHA-256 hashing of parameters to create unique cache keys. Cached files are stored as compressed RDS files and include metadata about the download date and parameter combination. Cache is automatically invalidated after 30 days for dynamic systems (CNES, SIA, SIH) and 365 days for static systems (SIM, SINAN, SINASC).
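The caching rules above can be sketched roughly as follows. This is not the package's implementation: the key layout, the helper names (`cache_key_string()`, `cache_key()`, `cache_max_age_days()`, `cache_is_fresh()`), and the use of the digest package for SHA-256 are all assumptions made for illustration. Only the 30-day and 365-day limits and the system grouping come from the description.

```r
# Hypothetical: build a canonical string from the parameters the cache
# is said to depend on (UF, year, month, system).
cache_key_string <- function(uf, year, month = NULL, system) {
  paste(
    system,
    paste(sort(uf), collapse = "+"),
    paste(sort(year), collapse = "+"),
    if (is.null(month)) "all" else paste(sort(month), collapse = "+"),
    sep = "|"
  )
}

# Hypothetical: hash the canonical string with SHA-256.
# Assumes the 'digest' package is installed.
cache_key <- function(...) {
  digest::digest(cache_key_string(...), algo = "sha256", serialize = FALSE)
}

# Expiry window in days: 30 for dynamic systems, 365 for static ones.
cache_max_age_days <- function(system) {
  family <- sub("-.*$", "", system)  # "SIH-RD" -> "SIH"
  if (family %in% c("CNES", "SIA", "SIH")) 30 else 365
}

# TRUE while a cached file is still within its expiry window.
cache_is_fresh <- function(system, downloaded_on, today = Sys.Date()) {
  age_days <- as.numeric(today - as.Date(downloaded_on))
  age_days <= cache_max_age_days(system)
}

cache_key_string(c("SP", "MG"), 2023, system = "SINAN-DENGUE")
# "SINAN-DENGUE|MG+SP|2023|all"
cache_is_fresh("SIH-RD", "2024-01-01", today = as.Date("2024-02-15"))  # FALSE (45 days old)
cache_is_fresh("SIM-DO", "2024-01-01", today = as.Date("2024-02-15"))  # TRUE
```

Sorting the parameter vectors before hashing means that c("SP", "MG") and c("MG", "SP") map to the same key, which is one plausible way to avoid duplicate cache entries.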
References
Brazilian Ministry of Health. DATASUS. http://datasus.saude.gov.br
SALDANHA, Raphael de Freitas; BASTOS, Ronaldo Rocha; BARCELLOS, Christovam. Microdatasus: pacote para download e pre-processamento de microdados do Departamento de Informatica do SUS (DATASUS). Cad. Saude Publica, Rio de Janeiro, v. 35, n. 9, e00032419, 2019. Available from https://doi.org/10.1590/0102-311x00032419.
See also
Official DATASUS documentation: http://datasus.saude.gov.br
Microdatasus package: https://github.com/rfsaldanha/microdatasus
Examples
if (FALSE) { # \dontrun{
# Basic example: Mortality data for Rio de Janeiro in 2022
df_sim <- sus_data_import(
  uf = "RJ",
  year = 2022,
  system = "SIM-DO",
  use_cache = TRUE
)

# Dengue cases for two states with parallel processing
df_dengue <- sus_data_import(
  uf = c("SP", "MG"),
  year = 2023,
  system = "SINAN-DENGUE",
  parallel = TRUE,
  workers = 3
)

# Hospitalizations with monthly specification
df_hospital <- sus_data_import(
  uf = "SP",
  year = 2024,
  month = 1:6,  # January to June
  system = "SIH-RD",
  verbose = TRUE
)

# Force re-download ignoring cache
df_births <- sus_data_import(
  uf = "BA",
  year = 2020:2022,
  system = "SINASC",
  use_cache = TRUE,
  force_redownload = TRUE  # Refresh cached data
)
} # }