
sus_climate_fill_gaps() imputes missing values in climate time series using station-wise XGBoost models with automated feature engineering.

Key features:

  • Single-target focus: Each call imputes ONE variable

  • Station-wise modeling: Separate model per station

  • Temporal features: Automatic creation of lags and rolling statistics

  • Quality control: Stations whose proportion of missing values exceeds quality_threshold are excluded

  • Parallel processing: Stations processed in parallel for speed

  • Evaluation mode: Assess accuracy by creating artificial gaps

Usage

sus_climate_fill_gaps(
  df,
  target_var,
  datetime_col = NULL,
  station_col = NULL,
  quality_threshold = 0.4,
  run_evaluation = FALSE,
  gap_percentage = 0.2,
  keep_features = FALSE,
  parallel = TRUE,
  workers = NULL,
  verbose = TRUE,
  lang = "pt"
)

Arguments

df

A data frame (or tibble) containing climate data, typically from sus_climate_inmet(). Must contain:

  • A datetime column (POSIXct or convertible)

  • A station identifier column

  • The target numeric column to be imputed

target_var

Single character string specifying the column to impute. Example: target_var = "tair_dry_bulb_c".

datetime_col

Character. Name of the datetime column. If NULL (default), auto-detected.

station_col

Character. Name of the station identifier column. If NULL, auto-detected.

quality_threshold

Numeric (0-1). Maximum allowed missing proportion per station. Stations exceeding this are excluded. Default: 0.4 (40%).
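
As a rough illustration of this filter, the sketch below computes the missing proportion per station in base R and keeps only stations at or below the threshold. The toy data frame and the names toy, missing_prop and kept are hypothetical, and the function's internal implementation may differ.

```r
# Hypothetical toy data: two stations with different amounts of missing data
toy <- data.frame(
  station = rep(c("A", "B"), each = 10),
  tair_dry_bulb_c = c(20, NA, 21, 22, NA, 23, 24, 25, 26, 27,  # A: 2 of 10 missing
                      NA, NA, NA, 18, NA, NA, 19, 20, NA, 21)  # B: 6 of 10 missing
)
quality_threshold <- 0.4

# Proportion of missing target values per station
missing_prop <- tapply(is.na(toy$tair_dry_bulb_c), toy$station, mean)

# Keep stations whose missing proportion does not exceed the threshold
kept <- names(missing_prop)[missing_prop <= quality_threshold]

print(missing_prop)
print(kept)
```

With these toy values, station B has 60% missing data and would be excluded at the default threshold of 0.4.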

run_evaluation

Logical. If TRUE, runs in evaluation mode:

  • Creates artificial MCAR gaps in observed data

  • Imputes and compares predictions with true values

  • Returns metrics by station

Default: FALSE (production mode).

gap_percentage

Numeric (0-1). Proportion of data to set as missing in evaluation mode. Gaps are generated as Missing Completely At Random (MCAR) by default. Default: 0.2 (20%).
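
A minimal sketch of MCAR masking in base R follows. It assumes the proportion is applied to the observed (non-missing) values only, which this page does not state; the variable names are hypothetical and the sketch is not taken from the package source.

```r
set.seed(42)  # fixed seed only so the sketch is reproducible

# Hypothetical series that already contains two real gaps
observed <- c(21.3, 22.1, NA, 23.4, 24.0, 23.8, 22.9, NA, 21.7, 20.5, 19.8, 19.2)
gap_percentage <- 0.2

# Candidate positions: only values that were actually observed
obs_idx <- which(!is.na(observed))

# MCAR: every observed position has the same probability of being masked
n_gaps <- round(gap_percentage * length(obs_idx))
gap_idx <- sample(obs_idx, size = n_gaps)

# Series passed to the imputer, with artificial gaps added
masked <- observed
masked[gap_idx] <- NA

# True values held back for comparison after imputation
truth <- observed[gap_idx]
```

Because the masked positions are drawn uniformly, the artificial gaps do not reproduce patterns such as long sensor outages; accuracy on real gaps of that kind may differ.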

keep_features

Logical. If TRUE, retains engineered features (lags, rolling stats). Default: FALSE (returns only original columns + is_imputed).

parallel

Logical. If TRUE (default), processes stations in parallel using furrr.

workers

Integer. Number of parallel workers. If NULL, uses availableCores() - 1.

verbose

Logical. If TRUE (default), prints progress messages.

lang

Character. Message language: "pt" (Portuguese), "en" (English), or "es" (Spanish). Default: "pt".

Value

Production mode (run_evaluation = FALSE): Returns a tibble (same class as input) with:

  • Original columns plus imputed values in target_var

  • is_imputed: Logical flag (TRUE for filled values)

  • climasus_meta attribute with imputation metadata

Evaluation mode (run_evaluation = TRUE): Returns a list of class climasus_eval containing:

  • $data: Data frame with artificial gaps and predictions

  • $metrics: A tibble with per-station performance metrics:

    • station: Station identifier

    • rmse: Root Mean Squared Error

    • mae: Mean Absolute Error

    • r_squared: R-squared (at most 1; higher is better)

    • smape: Symmetric MAPE (0-200%, lower is better)

    • slope_bias: Regression slope between predictions and true values. Should be close to 1.0; departures from 1.0 indicate systematic underestimation or overestimation

    • n_gaps: Number of artificial gaps
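
For readers who want to sanity-check these metrics, the sketch below computes them with common textbook definitions in base R. The exact formulas used by the package are not given on this page, so the definitions here are assumptions, in particular r_squared as 1 - SSE/SST and slope_bias as the slope of predictions regressed on true values. The truth and pred vectors are made-up values.

```r
# Hypothetical true values at the artificial gaps and their predictions
truth <- c(20.0, 22.0, 24.0, 26.0, 28.0)
pred  <- c(20.5, 21.5, 24.5, 25.0, 28.5)

err <- pred - truth

rmse <- sqrt(mean(err^2))   # Root Mean Squared Error
mae  <- mean(abs(err))      # Mean Absolute Error

# R-squared as 1 - SSE/SST (assumed definition; can be negative for poor fits)
r_squared <- 1 - sum(err^2) / sum((truth - mean(truth))^2)

# Symmetric MAPE in percent, bounded between 0 and 200
smape <- 100 * mean(2 * abs(err) / (abs(truth) + abs(pred)))

# Slope of predictions regressed on true values (assumed definition)
slope_bias <- unname(coef(lm(pred ~ truth))[2])

n_gaps <- length(truth)

print(c(rmse = rmse, mae = mae, r_squared = r_squared,
        smape = smape, slope_bias = slope_bias, n_gaps = n_gaps))
```

Under this definition, a slope below 1.0 means the predictions span a narrower range than the true values, so extremes are pulled toward the mean.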

Methodological Notes

Important limitations:

  • Not forecasting: Predicts only where data are missing

  • No future data: Uses only past information (lags)

  • Station independence: Models don't share information

  • Quality filter: Stations whose proportion of missing values exceeds quality_threshold are skipped

Feature engineering: Automatically creates:

  • Time features: hour, day, month, year, cyclic transforms

  • Lag features: 1,2,3,6,12,24,48,72,168 periods

  • Rolling statistics: mean and sd over windows 3,6,12,24,48,72
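
The feature types listed above can be sketched in base R as follows. This is an illustration only: the helper names lag_k and roll_past are hypothetical, only a few lags and one window are shown, and the choice to exclude the current value from each rolling window is an assumption made to match the "only past information" note; the package may build its features differently.

```r
# Hypothetical hourly series for a single station
datetime <- seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"),
                by = "hour", length.out = 12)
temp <- c(18, 17, 17, 16, 16, 17, 19, 21, 23, 25, 26, 27)

# Lag feature: the value k periods earlier, padded with NA at the start
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

# Trailing rolling statistic over the previous w values
# (current value excluded, so only past information is used)
roll_past <- function(x, w, fun) {
  sapply(seq_along(x), function(i) {
    if (i <= w) NA_real_ else fun(x[(i - w):(i - 1)])
  })
}

# Cyclic transform: maps hour 23 and hour 0 to neighbouring points on a circle
hour <- as.integer(format(datetime, "%H", tz = "UTC"))

features <- data.frame(
  datetime    = datetime,
  temp        = temp,
  lag_1       = lag_k(temp, 1),
  lag_3       = lag_k(temp, 3),
  roll_mean_3 = roll_past(temp, 3, mean),
  roll_sd_3   = roll_past(temp, 3, sd),
  hour_sin    = sin(2 * pi * hour / 24),
  hour_cos    = cos(2 * pi * hour / 24)
)

print(head(features, 5))
```

The leading rows contain NA in the lag and rolling columns because no earlier observations exist; tree-based learners such as XGBoost can handle missing feature values natively.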

Evaluation Mode Details

When run_evaluation = TRUE, the function:

  1. Creates artificial gaps (MCAR by default)

  2. Runs imputation on the data with gaps

  3. Compares predictions with true values

  4. Returns per-station performance

This helps assess model accuracy before production use.

Examples

if (FALSE) { # \dontrun{
# ===== PRODUCTION MODE =====
# Impute missing temperature data
filled_temp <- sus_climate_fill_gaps(
  df = climate_data,
  target_var = "tair_dry_bulb_c",
  quality_threshold = 0.3,
  parallel = TRUE
)


# ===== EVALUATION MODE =====
# Assess model performance on artificially created gaps
eval_results <- sus_climate_fill_gaps(
  df = climate_data,
  target_var = "ws_2_m_s",
  run_evaluation = TRUE,
  gap_percentage = 0.2,
  workers = 4
)

# View performance metrics
eval_results$metrics

} # }