Fill gaps in climate and air-quality time series using XGBoost

sus_climate_fill_gaps() imputes missing values in climate time series using station-wise XGBoost models with automated feature engineering.

Key features:

Single-target focus: Each call imputes ONE variable
Station-wise modeling: Separate model per station
Temporal features: Automatic creation of lags and rolling statistics
Quality control: Stations whose missing proportion exceeds quality_threshold are excluded
Parallel processing: Stations processed in parallel for speed
Evaluation mode: Assess accuracy by creating artificial gaps

Usage

sus_climate_fill_gaps(
  df,
  target_var,
  datetime_col = NULL,
  station_col = NULL,
  quality_threshold = 0.4,
  run_evaluation = FALSE,
  gap_percentage = 0.2,
  keep_features = FALSE,
  parallel = TRUE,
  workers = NULL,
  verbose = TRUE,
  lang = "pt"
)

Arguments

df

A data frame (or tibble) containing climate data, typically from sus_climate_inmet(). Must contain:

A datetime column (POSIXct or convertible)
A station identifier column
The target numeric column to be imputed

target_var

Single character string specifying the column to impute. Example: target_var = "tair_dry_bulb_c".

datetime_col

Character. Name of the datetime column. If NULL (default), auto-detected.

station_col

Character. Name of the station identifier column. If NULL (default), auto-detected.

quality_threshold

Numeric (0-1). Maximum allowed missing proportion per station. Stations exceeding this are excluded. Default: 0.4 (40%).
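
The threshold rule can be illustrated with a few lines of base R. This is a minimal sketch of the idea, not the package's internal code; the toy data and the column names station and tair_dry_bulb_c are assumptions made for the example.

```r
# Toy data (hypothetical): station "A" has 1 of 5 values missing (20%),
# station "B" has 3 of 5 values missing (60%).
toy <- data.frame(
  station = rep(c("A", "B"), each = 5),
  tair_dry_bulb_c = c(21.0, NA, 22.5, 23.1, 22.0,
                      NA, NA, 19.4, NA, 20.2)
)

quality_threshold <- 0.4

# Proportion of missing target values per station
missing_prop <- tapply(toy$tair_dry_bulb_c, toy$station,
                       function(x) mean(is.na(x)))

# Keep stations whose missing proportion does not exceed the threshold
kept_stations <- names(missing_prop)[missing_prop <= quality_threshold]
toy_kept <- toy[toy$station %in% kept_stations, ]
```

With the default of 0.4, station "B" (60% missing) would be dropped and station "A" (20% missing) retained.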

run_evaluation

Logical. If TRUE, runs in evaluation mode:

Creates artificial MCAR gaps in observed data
Imputes and compares predictions with true values
Returns metrics by station

Default: FALSE (production mode).

gap_percentage

Numeric (0-1). Proportion of observed data to set as missing in evaluation mode. Gaps are created under a Missing Completely At Random (MCAR) mechanism by default. Default: 0.2 (20%).
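
To make the MCAR idea concrete, the sketch below masks a random 20% of an observed series in base R. It only illustrates the mechanism; the series, the variable names, and the use of sample() are assumptions, and the package may select gap positions differently.

```r
set.seed(42)  # for a reproducible illustration

# Hypothetical complete series of 100 observations
observed <- sin(seq_len(100) / 10) + 20

gap_percentage <- 0.2

# MCAR: every observed position has the same chance of being masked,
# independent of its timestamp and of the value itself.
candidates <- which(!is.na(observed))
n_gaps <- round(gap_percentage * length(candidates))
gap_idx <- sample(candidates, size = n_gaps)

with_gaps <- observed
with_gaps[gap_idx] <- NA

# The held-out true values are kept for later comparison with predictions
truth <- observed[gap_idx]
```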

keep_features

Logical. If TRUE, retains engineered features (lags, rolling stats). Default: FALSE (returns only original columns + is_imputed).

parallel

Logical. If TRUE (default), processes stations in parallel using furrr.

workers

Integer. Number of parallel workers. If NULL, uses availableCores() - 1.

verbose

Logical. If TRUE (default), prints progress messages.

lang

Character. Message language: "pt" (Portuguese), "en" (English), or "es" (Spanish). Default: "pt".

Value

Production mode (run_evaluation = FALSE): Returns a tibble (same class as input) with:

Original columns plus imputed values in target_var
is_imputed: Logical flag (TRUE for filled values)
climasus_meta attribute with imputation metadata

Evaluation mode (run_evaluation = TRUE): Returns a list of class climasus_eval containing:

$data: Data frame with artificial gaps and predictions
$metrics: A tibble with per-station performance metrics:
- station: Station identifier
- rmse: Root Mean Squared Error
- mae: Mean Absolute Error
- r_squared: R-squared (at most 1; higher is better)
- smape: Symmetric MAPE (0-200%, lower is better)
- slope_bias: Regression slope between predicted and observed values. Should be close to 1.0; departures from 1.0 indicate systematic under- or overestimation
- n_gaps: Number of artificial gaps
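
For readers who want to see how such metrics are commonly computed, here is a minimal base R sketch for a single station. The numbers are made up, and the exact formulas used by the package are not stated on this page: R-squared as 1 - SSE/SST and slope_bias as the slope of predicted regressed on observed are assumptions.

```r
# Hypothetical held-out true values and model predictions
observed  <- c(20.1, 21.4, 19.8, 22.3, 23.0, 21.7)
predicted <- c(20.4, 21.0, 20.1, 22.0, 23.5, 21.5)

err <- predicted - observed

rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))

# R-squared as 1 - SSE/SST (assumed definition; can be negative)
r_squared <- 1 - sum(err^2) / sum((observed - mean(observed))^2)

# Symmetric MAPE on the 0-200% scale
smape <- 100 * mean(2 * abs(err) / (abs(observed) + abs(predicted)))

# Slope of predicted regressed on observed (assumed definition)
slope_bias <- unname(coef(lm(predicted ~ observed))[2])

metrics <- data.frame(rmse, mae, r_squared, smape, slope_bias,
                      n_gaps = length(observed))
```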

Methodological Notes

Important limitations:

Not forecasting: Predicts only where data are missing
No future data: Uses only past information (lags)
Station independence: Models don't share information
Quality filter: Stations whose missing proportion exceeds quality_threshold are skipped
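
The station-wise, past-only design can be sketched with a split/apply/combine pattern in base R. The imputer below is a deliberately simple stand-in (last observation carried forward), not XGBoost; the toy data, the column name value, and the helper fill_locf() are hypothetical.

```r
# Toy data (hypothetical) with gaps in two stations
toy <- data.frame(
  station = rep(c("A", "B"), each = 4),
  value   = c(10, NA, 12, NA, 30, 31, NA, 33)
)

# Stand-in imputer that uses past values only:
# last observation carried forward (NOT the XGBoost model).
fill_locf <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]
  }
  v
}

# One independent fit per station: split, impute, recombine.
# No information is shared between stations.
filled <- do.call(rbind, lapply(split(toy, toy$station), function(st) {
  st$is_imputed <- is.na(st$value)  # flag values that will be filled
  st$value <- fill_locf(st$value)
  st
}))
rownames(filled) <- NULL
```

Because each station is handled in its own function call, the per-station step is also the natural unit to distribute across parallel workers.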

Feature engineering: Automatically creates:

Time features: hour, day, month, year, cyclic transforms
Lag features: 1, 2, 3, 6, 12, 24, 48, 72, 168 periods
Rolling statistics: mean and sd over windows of 3, 6, 12, 24, 48, 72 periods
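
As an illustration of what these features look like, the base R sketch below builds a few of them for one station. It is not the package's implementation: the helper names lag_k() and roll_past(), the output column names, and the choice to compute rolling windows over strictly past values are assumptions.

```r
# Hypothetical hourly series for one station
datetime <- seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"),
                by = "hour", length.out = 48)
x <- 20 + 5 * sin(2 * pi * seq_along(datetime) / 24)

# Lag by k periods: shift values forward and pad the start with NA
lag_k <- function(v, k) c(rep(NA_real_, k), head(v, -k))

# Rolling statistic over the previous `width` values, excluding the
# current one, so that only past information is used
roll_past <- function(v, width, fun) {
  vapply(seq_along(v), function(i) {
    if (i <= width) return(NA_real_)
    fun(v[(i - width):(i - 1)])
  }, numeric(1))
}

hour <- as.integer(format(datetime, "%H"))

features <- data.frame(
  datetime    = datetime,
  target      = x,
  hour        = hour,
  hour_sin    = sin(2 * pi * hour / 24),  # cyclic transform of hour
  hour_cos    = cos(2 * pi * hour / 24),
  lag_1       = lag_k(x, 1),
  lag_24      = lag_k(x, 24),
  roll_mean_3 = roll_past(x, 3, mean),
  roll_sd_3   = roll_past(x, 3, sd)
)
```

The sine/cosine pair places hour 23 next to hour 0, which a raw integer hour cannot express.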

Evaluation Mode Details

When run_evaluation = TRUE, the function:

Creates artificial gaps (MCAR by default)
Runs imputation on the data with gaps
Compares predictions with true values
Returns per-station performance

This helps assess model accuracy before production use.

Examples

if (FALSE) { # \dontrun{
# ===== PRODUCTION MODE =====
# Impute missing temperature data
filled_temp <- sus_climate_fill_gaps(
  df = climate_data,
  target_var = "tair_dry_bulb_c",
  quality_threshold = 0.3,
  parallel = TRUE
)


# ===== EVALUATION MODE =====
# Assess model performance on artificially created gaps
eval_results <- sus_climate_fill_gaps(
  df = climate_data,
  target_var = "ws_2_m_s",
  run_evaluation = TRUE,
  gap_percentage = 0.2,
  workers = 4
)

# View performance metrics
eval_results$metrics

} # }