Read Processed Health Data with Batch and Parallel Support
Source: R/sus_data_read.R
Smartly reads one or more health data files exported by sus_data_export().
Supports automatic format detection, batch processing, parallel execution, spatial data,
metadata loading, and data validation.
Usage
sus_data_read(
path,
format = NULL,
parallel = FALSE,
workers = 4,
read_metadata = FALSE,
lang = "pt",
verbose = TRUE
)
Arguments
- path
Character vector of file paths, or a single directory path. If a directory is provided, all matching files will be read.
- format
Character string specifying the input format. Options: "dbf", "dbc", "rds", "parquet", "geoparquet", "shapefile", "gpkg", "geojson", "csv". If NULL (default), the format is detected automatically from the file extension.
- parallel
Logical. If TRUE, uses parallel processing for multiple files. Requires the future and future.apply packages. Default: FALSE.
- workers
Integer. Number of parallel workers when parallel = TRUE. Default: 4.
- read_metadata
Logical. If TRUE, loads companion metadata files and attaches them as attributes. Default: FALSE.
- lang
Character string specifying the language for messages. Options: "en" (English), "pt" (Portuguese, default), "es" (Spanish).
- verbose
Logical. If TRUE (default), prints progress messages and a summary.
Value
A data frame or sf object (for spatial data) containing the loaded data.
For batch reads, all files are combined with dplyr::bind_rows().
Metadata is attached as attributes:
- Single file: attr(df, "metadata")
- Batch: attr(df, "batch_metadata") (list of metadata from each file)
- Batch: attr(df, "n_files_combined") (number of files)
Details
Batch Processing: Pass a vector of file paths or a directory path to read multiple files at once. All files are automatically combined into a single object.
Parallel Processing:
When parallel = TRUE, files are read simultaneously using future.apply.
This significantly speeds up batch reads of large files.
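The mechanics of such a parallel batch read can be sketched with future and future.apply directly. This is an illustration of the approach, not the package's internal code; read_one stands in for the format-specific reader:

```r
library(future)
library(future.apply)
library(dplyr)

# Hypothetical per-file reader standing in for the format-specific logic
read_one <- function(f) arrow::read_parquet(f)

files <- list.files("output/", pattern = "\\.parquet$", full.names = TRUE)

# Start one R session per worker, read the files concurrently,
# then combine the pieces into a single data frame
plan(multisession, workers = 4)
parts <- future_lapply(files, read_one)
plan(sequential)  # release the workers

df <- bind_rows(parts)
```

Because each file is read in a separate R session, the speedup is largest when individual files take seconds or more to parse; for many tiny files the session overhead can dominate.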
Format Detection:
Automatically detects the format from the file extension. For .parquet files,
it additionally determines whether the file is GeoParquet (spatial) or regular Parquet.
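Extension-based detection of this kind can be sketched as follows. This is illustrative, not the package's actual code; the GeoParquet check assumes the convention that GeoParquet files carry a "geo" key in their file-level metadata, inspected here via the arrow package:

```r
# Map a file extension to one of the supported format names (sketch)
detect_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    dbf     = "dbf",
    dbc     = "dbc",
    rds     = "rds",
    parquet = "parquet",   # may still be GeoParquet; see below
    shp     = "shapefile",
    gpkg    = "gpkg",
    geojson = "geojson",
    csv     = "csv",
    stop("Unrecognised extension: .", ext)
  )
}

# GeoParquet files store spatial metadata under the "geo" key of the
# Parquet schema metadata (an assumption about the check performed)
is_geoparquet <- function(path) {
  schema <- arrow::open_dataset(path)$schema
  "geo" %in% names(schema$metadata)
}
```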
Memory Efficiency: For very large datasets (>50 GB), consider using chunked processing or reading files individually instead of batch mode.
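When combined output would not fit in memory, one pattern is to read each file on its own, reduce it, and keep only the aggregate before moving to the next file. A minimal sketch, assuming the files contain a year column (hypothetical here):

```r
library(dplyr)

files <- list.files("output/", pattern = "\\.parquet$", full.names = TRUE)

# Read one file at a time and summarise it immediately, so at most one
# file's worth of raw rows is held in memory at any point
per_file <- lapply(files, function(f) {
  df <- sus_data_read(f, verbose = FALSE)
  count(df, year, name = "n_records")   # assumes a `year` column
})

# Combine only the small per-file summaries
totals <- bind_rows(per_file) |>
  group_by(year) |>
  summarise(n_records = sum(n_records))
```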
Examples
if (FALSE) { # \dontrun{
library(climasus4r)
# Single file
df <- sus_data_read("output/data.parquet")
# Multiple files (vector)
df <- sus_data_read(c("output/2020.parquet", "output/2021.parquet"))
# Directory (all Parquet files)
df <- sus_data_read("output/", format = "parquet")
# Parallel batch read
df <- sus_data_read("output/", format = "dbf",
parallel = TRUE, workers = 6)
# Access batch metadata
batch_meta <- attr(df, "batch_metadata")
n_files <- attr(df, "n_files_combined")
} # }