
sus_climate_fill_inmet() imputes missing values in INMET automatic station data using station-wise XGBoost models with automated feature engineering.

Key features:

  • Multi-target support: Impute one, many, or ALL numeric variables in a single call

  • Station-wise modeling: Separate model per station per variable

  • Temporal features: Automatic creation of lags and rolling statistics

  • Quality control: Stations whose proportion of missing values exceeds quality_threshold are excluded

  • Parallel processing: Stations processed in parallel; workers resolved once and reused across all variable iterations

  • Evaluation mode: Assess accuracy by creating artificial gaps

Important: This function is designed exclusively for data imported by sus_climate_inmet() and works with the standard INMET variable set.

Usage

sus_climate_fill_inmet(
  df,
  target_var,
  datetime_col = NULL,
  station_col = NULL,
  quality_threshold = 0.4,
  run_evaluation = FALSE,
  gap_percentage = 0.2,
  keep_features = FALSE,
  parallel = TRUE,
  workers = NULL,
  verbose = TRUE,
  lang = "pt"
)

Arguments

df

A data frame (or tibble) containing climate data, typically from sus_climate_inmet(). Must contain:

  • A datetime column (POSIXct or convertible)

  • A station identifier column

  • The target numeric column(s) to be imputed

target_var

Character vector of column name(s) to impute, or the special string "all" to impute every numeric column that is not the datetime or station column, restricted to the standard INMET variables listed under INMET Variable Set. Examples:

  • Single variable: target_var = "tair_dry_bulb_c"

  • Multiple variables: target_var = c("tair_dry_bulb_c", "rh_mean_porc", "ws_2_m_s")

  • All numeric variables: target_var = "all"

datetime_col

Character. Name of the datetime column. If NULL (default), auto-detected.

station_col

Character. Name of the station identifier column. If NULL, auto-detected.

quality_threshold

Numeric (0-1). Maximum allowed missing proportion per station. Stations exceeding this are excluded. Default: 0.4 (40%).

Recommended values for INMET data:

  • 0.3 (30%) - Conservative, high-quality stations only

  • 0.4 (40%) - Default, balanced approach

  • 0.6 (60%) - Lenient, includes stations with significant gaps
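
The station filter described above can be illustrated with a minimal base R sketch. It is not the package's internal code: the toy data, the column names, and the choice to keep stations whose missing proportion is exactly at the threshold are assumptions made for illustration.

```r
# Toy data: two stations, one target variable with missing values.
# Column names here are hypothetical, chosen only for this sketch.
df <- data.frame(
  station = rep(c("A001", "A002"), each = 10),
  tair_dry_bulb_c = c(
    20, NA, 21, 22, NA, 23, 24, 25, 26, 27,   # A001: 2 of 10 missing (20%)
    NA, NA, NA, 18, NA, NA, 19, 20, NA, 21    # A002: 6 of 10 missing (60%)
  )
)

quality_threshold <- 0.4

# Proportion of missing values per station
missing_prop <- tapply(
  df$tair_dry_bulb_c, df$station,
  function(x) mean(is.na(x))
)

# Stations that do not exceed the threshold are kept (assumed inclusive bound)
kept_stations <- names(missing_prop)[missing_prop <= quality_threshold]
df_kept <- df[df$station %in% kept_stations, ]

print(missing_prop)
print(kept_stations)
```

With quality_threshold = 0.4, only the station with 20% missing values survives the filter; raising the threshold to 0.6 would keep both.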

run_evaluation

Logical. If TRUE, runs in evaluation mode:

  • Creates artificial MCAR gaps in observed data

  • Imputes and compares predictions with true values

  • Returns metrics by station

Default: FALSE (production mode).

gap_percentage

Numeric (0-1). Proportion of data to set as missing in evaluation mode. Gaps are created with the Missing Completely At Random (MCAR) mechanism by default. Default: 0.2 (20%).
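
As a rough illustration of MCAR gap creation, the base R sketch below masks a random subset of observed values so that the true values remain available for scoring. Applying gap_percentage to the observed values only (rather than to all rows) is an assumption of this sketch, and the series is made up.

```r
set.seed(42)  # reproducible gap positions

# Toy series with two values that are already missing
x <- c(20.1, 20.4, NA, 21.0, 21.3, 21.9, 22.4, NA, 22.0, 21.5, 21.1, 20.8)
gap_percentage <- 0.2

# Only observed values can be masked, so the truth is known for scoring
observed_idx <- which(!is.na(x))
n_gaps <- round(gap_percentage * length(observed_idx))

# MCAR: every observed position has the same chance of being masked
gap_idx <- sample(observed_idx, size = n_gaps)

truth <- x[gap_idx]        # held-out values for later comparison
x_gapped <- x
x_gapped[gap_idx] <- NA    # artificial gaps added to the existing ones

print(gap_idx)
print(x_gapped)
```

Any imputation method can then be run on x_gapped and its predictions at gap_idx compared with truth.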

keep_features

Logical. If TRUE, retains the engineered features (lags, rolling statistics) in the output. Default: FALSE (returns only the original columns plus the is_imputed_<var> flag columns).

parallel

Logical. If TRUE (default), processes stations in parallel using furrr.

workers

Integer. Number of parallel workers. If NULL, uses max(1, availableCores() - 1). The value is resolved once at the start and reused across all variable iterations, avoiding repeated plan switches.

verbose

Logical. If TRUE (default), prints progress messages.

lang

Character. Message language: "pt" (Portuguese), "en" (English), or "es" (Spanish). Default: "pt".

Value

Production mode (run_evaluation = FALSE):

  • Single variable: a tibble (same class as input) with imputed values in target_var and an is_imputed_<var> flag column.

  • Multiple variables / "all": the same tibble with all requested variables imputed sequentially, each with its own is_imputed_<var> flag column.

  • A sus_meta attribute records all imputed variables and their rates.

Evaluation mode (run_evaluation = TRUE): Returns a named list (one element per variable) of class climasus_eval, each with:

  • $data: Data frame with artificial gaps and predictions

  • $metrics: A tibble with per-station performance metrics:

    • station: Station identifier

    • rmse: Root Mean Squared Error

    • mae: Mean Absolute Error

    • r_squared: R-squared (at most 1; higher is better)

    • smape: Symmetric MAPE (0-200%, lower is better)

    • slope_bias: Should be close to 1.0; values away from 1.0 indicate systematic underestimation or overestimation

    • n_gaps: Number of artificial gaps
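
To make the metrics above concrete, the following base R sketch computes them for a small made-up set of held-out values. The exact formulas used by the package are not stated here, so the definitions below are assumptions: r_squared as 1 - SSE/SST, smape on the 0-200% scale, and slope_bias as the slope of a linear regression of predictions on observations.

```r
# Made-up held-out truth and corresponding predictions
observed  <- c(10, 12, 14, 16, 18)
predicted <- c(11, 12, 13, 17, 18)

err <- predicted - observed

rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))

# 1 - SSE/SST: equals 1 for a perfect fit and can be negative for poor fits
r_squared <- 1 - sum(err^2) / sum((observed - mean(observed))^2)

# Symmetric MAPE on the 0-200% scale
smape <- 100 * mean(2 * abs(err) / (abs(observed) + abs(predicted)))

# Slope of predicted ~ observed (assumed definition of slope_bias)
slope_bias <- unname(coef(lm(predicted ~ observed))[2])

metrics <- data.frame(
  rmse = rmse, mae = mae, r_squared = r_squared,
  smape = smape, slope_bias = slope_bias, n_gaps = length(observed)
)
print(metrics)
```

Under this definition, a slope below 1.0 means the predictions vary less than the observations, so high values tend to be underestimated and low values overestimated.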

INMET Variable Set

The function is optimized for the following 17 INMET variables:

  • Atmospheric Pressure: patm_mb, patm_max_mb, patm_min_mb

  • Temperature: tair_dry_bulb_c, tair_max_c, tair_min_c

  • Dew Point: dew_tmean_c, dew_tmax_c, dew_tmin_c

  • Relative Humidity: rh_max_porc, rh_min_porc, rh_mean_porc

  • Precipitation: rainfall_mm

  • Wind: ws_gust_m_s, ws_2_m_s, wd_degrees

  • Solar Radiation: sr_kj_m2

When target_var = "all", the function automatically detects and imputes only these 17 variables if present in the data.
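
The detection step for target_var = "all" can be pictured as an intersection between the variable list above and the column names of the data. This is a sketch of the idea in base R, not the package's actual implementation; the toy data frame and the extra elevation_m column are hypothetical.

```r
# The 17 INMET variables listed above
inmet_vars <- c(
  "patm_mb", "patm_max_mb", "patm_min_mb",
  "tair_dry_bulb_c", "tair_max_c", "tair_min_c",
  "dew_tmean_c", "dew_tmax_c", "dew_tmin_c",
  "rh_max_porc", "rh_min_porc", "rh_mean_porc",
  "rainfall_mm",
  "ws_gust_m_s", "ws_2_m_s", "wd_degrees",
  "sr_kj_m2"
)

# Toy data containing two INMET variables and one unrelated numeric column
df <- data.frame(
  station = c("A001", "A001"),
  tair_dry_bulb_c = c(20.5, NA),
  rainfall_mm = c(0, 1.2),
  elevation_m = c(900, 900)  # numeric, but not part of the INMET set
)

# Only INMET variables that are actually present become imputation targets
targets <- intersect(inmet_vars, names(df))
print(targets)
```

In this sketch elevation_m is ignored even though it is numeric, because it is not part of the variable set.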

Methodological Notes

Important limitations:

  • Not forecasting: Predicts only where data are missing

  • No future data: Uses only past information (lags)

  • Station independence: Models don't share information

  • Quality filter: Stations whose proportion of missing values exceeds quality_threshold are skipped

Feature engineering: Automatically creates:

  • Time features: hour, day, month, year, cyclic transforms

  • Lag features: 1,2,3,6,12,24,48,72,168 periods

  • Rolling statistics: mean and sd over windows 3,6,12,24,48,72
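
The kinds of features listed above can be sketched in base R as follows. The helper names lag_vec and roll_mean are hypothetical, the series is made up, and details such as whether the rolling window includes the current value are assumptions; the package may build its features differently.

```r
# Hourly toy series for one station (values are made up)
datetime <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + 3600 * (0:9)
temp <- c(20, 21, 22, 23, 24, 25, 26, 27, 28, 29)

# Lag feature: value observed k periods earlier (NA where no history exists)
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

# Trailing rolling mean over the current and previous (w - 1) values
roll_mean <- function(x, w) {
  as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
}

# Cyclic encoding of the hour so that 23:00 and 00:00 end up close together
hour <- as.integer(format(datetime, "%H"))

features <- data.frame(
  datetime = datetime,
  temp = temp,
  temp_lag1 = lag_vec(temp, 1),
  temp_lag3 = lag_vec(temp, 3),
  temp_roll_mean3 = roll_mean(temp, 3),
  hour_sin = sin(2 * pi * hour / 24),
  hour_cos = cos(2 * pi * hour / 24)
)
print(features)
```

The first rows of each lag and rolling column are NA because there is not yet enough history to fill them.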

Evaluation Mode Details

When run_evaluation = TRUE, the function:

  1. Creates artificial gaps (MCAR by default)

  2. Runs imputation on the data with gaps

  3. Compares predictions with true values

  4. Returns per-station performance

This helps assess model accuracy before production use.

Examples

if (FALSE) { # \dontrun{
# ===== PRODUCTION MODE — single variable =====
filled_temp <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "tair_dry_bulb_c",
  quality_threshold = 0.3,
  parallel = TRUE
)

# ===== PRODUCTION MODE — multiple variables =====
filled_multi <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = c("tair_dry_bulb_c", "rh_mean_porc", "ws_2_m_s"),
  quality_threshold = 0.3,
  parallel = TRUE,
  workers = 4
)

# ===== PRODUCTION MODE — all numeric variables =====
filled_all <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "all",
  parallel = TRUE
)

# ===== EVALUATION MODE — single variable =====
eval_results <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "ws_2_m_s",
  run_evaluation = TRUE,
  gap_percentage = 0.2,
  workers = 4
)
eval_results$ws_2_m_s$metrics

# ===== EVALUATION MODE — multiple variables =====
eval_multi <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = c("tair_dry_bulb_c", "ws_2_m_s"),
  run_evaluation = TRUE,
  gap_percentage = 0.2
)
eval_multi$tair_dry_bulb_c$metrics
eval_multi$ws_2_m_s$metrics
} # }