Fill gaps in INMET climate time series using XGBoost — sus_climate_fill

sus_climate_fill_inmet() imputes missing values in INMET automatic station data using station-wise XGBoost models with automated feature engineering.

Key features:

Multi-target support: Impute one, many, or ALL numeric variables in a single call
Station-wise modeling: Separate model per station per variable
Temporal features: Automatic creation of lags and rolling statistics
Quality control: Stations whose proportion of missing values exceeds quality_threshold are excluded
Parallel processing: Stations processed in parallel; workers resolved once and reused across all variable iterations
Evaluation mode: Assess accuracy by creating artificial gaps

Important: This function is designed exclusively for data imported by sus_climate_inmet() and works with the standard INMET variable set.

Usage

sus_climate_fill_inmet(
  df,
  target_var,
  datetime_col = NULL,
  station_col = NULL,
  quality_threshold = 0.4,
  run_evaluation = FALSE,
  gap_percentage = 0.2,
  keep_features = FALSE,
  parallel = TRUE,
  workers = NULL,
  verbose = TRUE,
  lang = "pt"
)

Arguments

df

A data frame (or tibble) containing climate data, typically from sus_climate_inmet(). Must contain:

A datetime column (POSIXct or convertible)
A station identifier column
The target numeric column(s) to be imputed

target_var

Character vector of column name(s) to impute, or the special string "all" to impute every supported numeric variable present in the data (see INMET Variable Set). The datetime and station columns are never imputed. Examples:

Single variable: target_var = "tair_dry_bulb_c"
Multiple variables: target_var = c("tair_dry_bulb_c", "rh_mean_porc", "ws_2_m_s")
All numeric variables: target_var = "all"

datetime_col

Character. Name of the datetime column. If NULL (default), auto-detected.

station_col

Character. Name of the station identifier column. If NULL, auto-detected.

quality_threshold

Numeric (0-1). Maximum allowed missing proportion per station. Stations exceeding this are excluded. Default: 0.4 (40%).

Recommended values for INMET data:

0.3 (30%) - Conservative, high-quality stations only
0.4 (40%) - Default, balanced approach
0.6 (60%) - Lenient, includes stations with significant gaps
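
Before choosing a threshold, it can help to check how much of the target variable is missing at each station. The following is a minimal base-R sketch of that check, not the package's internal code; the toy data frame and the column name station_code are assumptions made for illustration.

```r
# Toy data: two stations, six observations each (illustrative values only)
climate_toy <- data.frame(
  station_code = rep(c("A001", "A002"), each = 6),
  tair_dry_bulb_c = c(21.0, NA, 22.5, 23.1, NA, 24.0,
                      NA, NA, NA, 19.8, NA, 20.4)
)

# Proportion of missing values in the target variable, per station
missing_prop <- tapply(
  is.na(climate_toy$tair_dry_bulb_c),
  climate_toy$station_code,
  mean
)

quality_threshold <- 0.4

# Stations a filter of this kind would keep (missing proportion <= threshold)
kept_stations <- names(missing_prop)[missing_prop <= quality_threshold]
```

In this toy case station A001 is missing 2 of 6 values and is kept, while A002 is missing 4 of 6 and would be excluded at the default threshold.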

run_evaluation

Logical. If TRUE, runs in evaluation mode:

Creates artificial MCAR gaps in observed data
Imputes and compares predictions with true values
Returns metrics by station

Default: FALSE (production mode).

gap_percentage

Numeric (0-1). Proportion of data to set as missing in evaluation mode. Gaps are placed following the MCAR (Missing Completely At Random) mechanism. Default: 0.2 (20%).

keep_features

Logical. If TRUE, retains engineered features (lags, rolling stats). Default: FALSE (returns only the original columns plus the is_imputed_<var> flag column(s)).

parallel

Logical. If TRUE (default), processes stations in parallel using furrr.

workers

Integer. Number of parallel workers. If NULL, uses max(1, availableCores() - 1). The value is resolved once at the start and reused across all variable iterations, avoiding repeated plan switches.
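
The "resolve once" behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the package's actual code: resolve_workers() is a hypothetical helper, and the fallback to parallel::detectCores() is added here only so the sketch runs when the parallelly package is not installed.

```r
# Hypothetical helper: resolve the number of workers once, up front
resolve_workers <- function(workers = NULL) {
  if (!is.null(workers)) {
    return(max(1L, as.integer(workers)))
  }
  # Prefer parallelly::availableCores() when installed; otherwise fall back
  # to parallel::detectCores(), which may return NA on some platforms
  cores <- if (requireNamespace("parallelly", quietly = TRUE)) {
    parallelly::availableCores()
  } else {
    parallel::detectCores()
  }
  if (is.na(cores)) cores <- 1L
  # Leave one core free, but never go below one worker
  max(1L, as.integer(cores) - 1L)
}

# Resolved a single time; the same value can then be reused for every variable
n_workers <- resolve_workers()
```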

verbose

Logical. If TRUE (default), prints progress messages.

lang

Character. Message language: "pt" (Portuguese), "en" (English), or "es" (Spanish). Default: "pt".

Value

Production mode (run_evaluation = FALSE):

Single variable: a tibble (same class as input) with imputed values in target_var and an is_imputed_<var> flag column.
Multiple variables / "all": the same tibble with all requested variables imputed sequentially, each with its own is_imputed_<var> flag column.
A sus_meta attribute records all imputed variables and their rates.

Evaluation mode (run_evaluation = TRUE): Returns a named list (one element per variable) of class climasus_eval, each with:

$data: Data frame with artificial gaps and predictions
$metrics: A tibble with per-station performance metrics:
- station: Station identifier
- rmse: Root Mean Squared Error
- mae: Mean Absolute Error
- r_squared: R-squared (at most 1, higher is better)
- smape: Symmetric MAPE (0-200%, lower is better)
- slope_bias: Should be close to 1.0; departures from 1.0 indicate systematic underestimation or overestimation
- n_gaps: Number of artificial gaps
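
For orientation, the metrics above can be computed from held-out observations and predictions roughly as shown below. The exact formulas used by the package are not stated here, so treat this as a sketch under assumptions: eval_metrics() is a hypothetical helper, R-squared is taken as 1 - SSE/SST, sMAPE uses the 0-200% variant, and slope_bias is assumed to be the slope of a linear regression of predicted on observed values.

```r
# Hypothetical helper computing common imputation-accuracy metrics
eval_metrics <- function(observed, predicted) {
  err <- predicted - observed
  denom <- abs(observed) + abs(predicted)
  # Treat pairs where both values are zero as zero error (avoids 0/0)
  smape_terms <- ifelse(denom == 0, 0, 2 * abs(err) / denom)
  data.frame(
    rmse = sqrt(mean(err^2)),
    mae = mean(abs(err)),
    r_squared = 1 - sum(err^2) / sum((observed - mean(observed))^2),
    smape = 100 * mean(smape_terms),
    slope_bias = unname(coef(lm(predicted ~ observed))[2]),
    n_gaps = length(observed)
  )
}

# Toy held-out values and their predictions (illustrative only)
obs <- c(20, 22, 24, 26)
pred <- c(21, 22, 23, 27)
m <- eval_metrics(obs, pred)
```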

INMET Variable Set

The function is optimized for the following 17 INMET variables:

Atmospheric Pressure: patm_mb, patm_max_mb, patm_min_mb
Temperature: tair_dry_bulb_c, tair_max_c, tair_min_c
Dew Point: dew_tmean_c, dew_tmax_c, dew_tmin_c
Relative Humidity: rh_max_porc, rh_min_porc, rh_mean_porc
Precipitation: rainfall_mm
Wind: ws_gust_m_s, ws_2_m_s, wd_degrees
Solar Radiation: sr_kj_m2

When target_var = "all", the function automatically detects and imputes only these 17 variables if present in the data.
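
The detection step described above amounts to intersecting the supported variable names with the columns of the input. A minimal sketch, assuming a toy data frame with hypothetical station_code and elevation_m columns (how the package implements this internally is not shown here):

```r
# The 17 supported INMET variables listed above
inmet_vars <- c(
  "patm_mb", "patm_max_mb", "patm_min_mb",
  "tair_dry_bulb_c", "tair_max_c", "tair_min_c",
  "dew_tmean_c", "dew_tmax_c", "dew_tmin_c",
  "rh_max_porc", "rh_min_porc", "rh_mean_porc",
  "rainfall_mm",
  "ws_gust_m_s", "ws_2_m_s", "wd_degrees",
  "sr_kj_m2"
)

# Toy data frame: two supported variables and one unrelated numeric column
climate_toy <- data.frame(
  station_code = "A001",
  tair_dry_bulb_c = 21.3,
  rainfall_mm = 0,
  elevation_m = 760
)

# Only supported variables that are present become imputation targets
targets <- intersect(inmet_vars, names(climate_toy))
```

Here elevation_m is numeric but is not selected, because it is not part of the supported set.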

Methodological Notes

Important limitations:

Not forecasting: Predicts only where data are missing
No future data: Uses only past information (lags)
Station independence: Models don't share information
Quality filter: Stations whose proportion of missing values exceeds quality_threshold are skipped

Feature engineering: Automatically creates:

Time features: hour, day, month, year, cyclic transforms
Lag features: 1, 2, 3, 6, 12, 24, 48, 72, and 168 periods
Rolling statistics: mean and sd over windows of 3, 6, 12, 24, 48, and 72 periods
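
To make the feature types above concrete, here is a small base-R sketch that builds a cyclic hour encoding, two lags, and a rolling mean and sd for a toy hourly series. It is an illustration, not the package's implementation: lag_vec() and roll_stat() are hypothetical helpers, and computing the rolling statistics on the series shifted by one period (so that only past values are used) is an assumption.

```r
# Hypothetical helper: shift a series back by k periods
lag_vec <- function(x, k) {
  n <- length(x)
  if (k >= n) return(rep(NA_real_, n))
  c(rep(NA_real_, k), x[seq_len(n - k)])
}

# Hypothetical helper: trailing-window statistic over the last w values
roll_stat <- function(x, w, fun) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= w) {
    for (i in w:n) out[i] <- fun(x[(i - w + 1):i])
  }
  out
}

# Toy hourly series (illustrative values only)
datetime <- seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
                by = "hour", length.out = 6)
temp <- c(10, 12, 11, 13, 15, 14)
hour <- as.integer(format(datetime, "%H"))

# Shift by one period so rolling statistics see only past values
past <- lag_vec(temp, 1)

feats <- data.frame(
  datetime = datetime,
  temp = temp,
  hour_sin = sin(2 * pi * hour / 24),   # cyclic encoding of hour of day
  hour_cos = cos(2 * pi * hour / 24),
  lag_1 = past,
  lag_3 = lag_vec(temp, 3),
  roll_mean_3 = roll_stat(past, 3, mean),
  roll_sd_3 = roll_stat(past, 3, sd)
)
```

The first rows of the lag and rolling columns are NA because not enough history exists yet.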

Evaluation Mode Details

When run_evaluation = TRUE, the function:

Creates artificial gaps (MCAR by default)
Runs imputation on the data with gaps
Compares predictions with true values
Returns per-station performance

This helps assess model accuracy before production use.
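
The gap-creation step can be pictured with the following base-R sketch. It assumes that gap_percentage is applied to the observed (non-missing) values of a single series; whether the package samples per station or over the whole data set is not specified here, and the toy values are illustrative.

```r
set.seed(42)  # reproducible gap placement

# Toy series with two pre-existing gaps
values <- c(20.1, 20.4, NA, 21.0, 21.6, 22.3, 22.9, NA, 23.4, 23.0, 22.1, 21.5)
gap_percentage <- 0.2

# Only observed (non-missing) positions are candidates for artificial gaps
observed_idx <- which(!is.na(values))
n_gaps <- max(1L, round(gap_percentage * length(observed_idx)))

# MCAR: every observed position has the same chance of being masked
gap_idx <- sample(observed_idx, n_gaps)

truth <- values[gap_idx]   # held-out true values, kept for later comparison
masked <- values
masked[gap_idx] <- NA      # series that would be passed to the imputation step
```

After imputing masked, the predictions at gap_idx are compared with truth to obtain the per-station metrics.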

Examples

if (FALSE) { # \dontrun{
# ===== PRODUCTION MODE — single variable =====
filled_temp <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "tair_dry_bulb_c",
  quality_threshold = 0.3,
  parallel = TRUE
)

# ===== PRODUCTION MODE — multiple variables =====
filled_multi <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = c("tair_dry_bulb_c", "rh_mean_porc", "ws_2_m_s"),
  quality_threshold = 0.3,
  parallel = TRUE,
  workers = 4
)

# ===== PRODUCTION MODE — all numeric variables =====
filled_all <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "all",
  parallel = TRUE
)

# ===== EVALUATION MODE — single variable =====
eval_results <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "ws_2_m_s",
  run_evaluation = TRUE,
  gap_percentage = 0.2,
  workers = 4
)
eval_results$ws_2_m_s$metrics

# ===== EVALUATION MODE — multiple variables =====
eval_multi <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = c("tair_dry_bulb_c", "ws_2_m_s"),
  run_evaluation = TRUE,
  gap_percentage = 0.2
)
eval_multi$tair_dry_bulb_c$metrics
eval_multi$ws_2_m_s$metrics
} # }