Fill gaps in INMET climate time series using XGBoost
Source: R/sus_climate_fill_gap.R
sus_climate_fill_inmet() imputes missing values in INMET automatic station data
using station-wise XGBoost models with automated feature engineering.
Key features:
Multi-target support: Impute one, many, or ALL numeric variables in a single call
Station-wise modeling: Separate model per station per variable
Temporal features: Automatic creation of lags and rolling statistics
Quality control: Stations whose missing proportion exceeds quality_threshold are excluded
Parallel processing: Stations processed in parallel; workers resolved once and reused across all variable iterations
Evaluation mode: Assess accuracy by creating artificial gaps
Important: This function is designed exclusively for data imported by
sus_climate_inmet() and works with the standard INMET variable set.
Usage
sus_climate_fill_inmet(
df,
target_var,
datetime_col = NULL,
station_col = NULL,
quality_threshold = 0.4,
run_evaluation = FALSE,
gap_percentage = 0.2,
keep_features = FALSE,
parallel = TRUE,
workers = NULL,
verbose = TRUE,
lang = "pt"
)
Arguments
- df
A data frame (or tibble) containing climate data, typically from
sus_climate_inmet(). Must contain:
A datetime column (POSIXct or convertible)
A station identifier column
The target numeric column(s) to be imputed
- target_var
Character vector of column name(s) to impute, or the special string
"all" to impute every numeric column that is not the datetime or station column. Examples:
Single variable: target_var = "tair_dry_bulb_c"
Multiple variables: target_var = c("tair_dry_bulb_c", "rh_mean_porc", "ws_2_m_s")
All numeric variables: target_var = "all"
- datetime_col
Character. Name of the datetime column. If NULL (default), auto-detected.
- station_col
Character. Name of the station identifier column. If NULL, auto-detected.
- quality_threshold
Numeric (0-1). Maximum allowed missing proportion per station. Stations exceeding this are excluded. Default: 0.4 (40%).
Recommended values for INMET data:
0.3 (30%): Conservative, high-quality stations only
0.4 (40%): Default, balanced approach
0.6 (60%): Lenient, includes stations with significant gaps
- run_evaluation
Logical. If TRUE, runs in evaluation mode:
Creates artificial MCAR gaps in observed data
Imputes and compares predictions with true values
Returns metrics by station
Default: FALSE (production mode).
- gap_percentage
Numeric (0-1). Proportion of observed data to set as missing in evaluation mode. Gaps are created under the MCAR (Missing Completely At Random) mechanism. Default: 0.2 (20%).
- keep_features
Logical. If TRUE, retains engineered features (lags, rolling stats). Default: FALSE (returns only original columns + is_imputed).
- parallel
Logical. If TRUE (default), processes stations in parallel using furrr.
- workers
Integer. Number of parallel workers. If NULL, uses max(1, availableCores() - 1). The value is resolved once at the start and reused across all variable iterations, avoiding repeated plan switches.
- verbose
Logical. If TRUE (default), prints progress messages.
- lang
Character. Message language: "pt" (Portuguese), "en" (English), or "es" (Spanish). Default: "pt".
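The quality_threshold rule described above can be illustrated with a minimal base R sketch. It is not the package's internal code: the column name station_code and the toy values are assumptions made for illustration, and the comparison follows the documented wording (stations exceeding the threshold are excluded).

```r
# Toy data: two stations with ten records each (hypothetical values).
df <- data.frame(
  station_code = rep(c("A001", "A002"), each = 10),
  tair_dry_bulb_c = c(
    20, NA, 21, 22, NA, 23, 24, 25, 26, 27,  # A001: 2 of 10 missing (20%)
    NA, NA, NA, 18, NA, NA, 19, 20, NA, 21   # A002: 6 of 10 missing (60%)
  )
)

quality_threshold <- 0.4

# Proportion of missing values in the target variable, per station.
missing_prop <- tapply(is.na(df$tair_dry_bulb_c), df$station_code, mean)

# Stations at or below the threshold are kept; the others are skipped.
kept_stations <- names(missing_prop)[missing_prop <= quality_threshold]
df_kept <- df[df$station_code %in% kept_stations, ]
```

With the default of 0.4, only station A001 remains in df_kept; raising the threshold to 0.6 would also keep A002.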
Value
Production mode (run_evaluation = FALSE):
Single variable: a tibble (same class as input) with imputed values in target_var and an is_imputed_<var> flag column.
Multiple variables / "all": the same tibble with all requested variables imputed sequentially, each with its own is_imputed_<var> flag column.
A sus_meta attribute records all imputed variables and their rates.
Evaluation mode (run_evaluation = TRUE):
Returns a named list (one element per variable) of class climasus_eval, each with:
$data: Data frame with artificial gaps and predictions
$metrics: A tibble with per-station performance metrics:
station: Station identifier
rmse: Root Mean Squared Error
mae: Mean Absolute Error
r_squared: R-squared (at most 1, higher is better)
smape: Symmetric MAPE (0-200%, lower is better)
slope_bias: Should be close to 1.0; departures from 1.0 indicate systematic under- or overestimation
n_gaps: Number of artificial gaps
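For readers who want to check the metrics by hand, the sketch below computes them in base R from a vector of true values and a vector of predictions. The formulas are the standard textbook definitions; how the package computes each one internally is not documented here. In particular, treating slope_bias as the slope of a regression of predicted on observed values is an assumption, and eval_metrics is a hypothetical helper name.

```r
# Hypothetical helper: standard definitions, not the package's own code.
eval_metrics <- function(observed, predicted) {
  err <- predicted - observed
  c(
    rmse = sqrt(mean(err^2)),
    mae = mean(abs(err)),
    # 1 minus residual sum of squares over total sum of squares.
    r_squared = 1 - sum(err^2) / sum((observed - mean(observed))^2),
    # Symmetric MAPE in percent; NaN where observed and predicted are both 0.
    smape = 100 * mean(2 * abs(err) / (abs(observed) + abs(predicted))),
    # Assumed definition: slope of the regression predicted ~ observed.
    slope_bias = unname(coef(lm(predicted ~ observed))[2]),
    n_gaps = length(observed)
  )
}

observed <- c(10, 12, 14, 16)
predicted <- c(11, 12, 13, 17)
m <- eval_metrics(observed, predicted)
```

Under these definitions a slope below 1.0 means the predictions are compressed toward the mean (high values underestimated, low values overestimated).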
INMET Variable Set
The function is optimized for the following 17 INMET variables:
Atmospheric Pressure: patm_mb, patm_max_mb, patm_min_mb
Temperature: tair_dry_bulb_c, tair_max_c, tair_min_c
Dew Point: dew_tmean_c, dew_tmax_c, dew_tmin_c
Relative Humidity: rh_max_porc, rh_min_porc, rh_mean_porc
Precipitation: rainfall_mm
Wind: ws_gust_m_s, ws_2_m_s, wd_degrees
Solar Radiation: sr_kj_m2
When target_var = "all", the function automatically detects and imputes
only these 17 variables if present in the data.
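One way to picture the "all" resolution described in this section is an intersection between the 17 documented names and the numeric columns actually present. The sketch below is an illustration under that assumption; resolve_targets and inmet_vars are hypothetical names, not exported package objects.

```r
# The 17 variable names listed in this section.
inmet_vars <- c(
  "patm_mb", "patm_max_mb", "patm_min_mb",
  "tair_dry_bulb_c", "tair_max_c", "tair_min_c",
  "dew_tmean_c", "dew_tmax_c", "dew_tmin_c",
  "rh_max_porc", "rh_min_porc", "rh_mean_porc",
  "rainfall_mm",
  "ws_gust_m_s", "ws_2_m_s", "wd_degrees",
  "sr_kj_m2"
)

# Hypothetical helper: expand "all" or validate explicit column names.
resolve_targets <- function(df, target_var) {
  if (identical(target_var, "all")) {
    is_num <- vapply(df, is.numeric, logical(1))
    return(intersect(inmet_vars, names(df)[is_num]))
  }
  missing_cols <- setdiff(target_var, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  target_var
}

toy <- data.frame(
  station_code = c("A001", "A001"),
  rainfall_mm = c(0, 1.2),
  tair_dry_bulb_c = c(21.5, NA),
  elevation_m = c(760, 760)  # numeric, but not one of the 17 variables
)
targets <- resolve_targets(toy, "all")
```

Here targets contains tair_dry_bulb_c and rainfall_mm, while elevation_m is ignored because it is not in the documented set.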
Methodological Notes
Important limitations:
Not forecasting: Predicts only where data are missing
No future data: Uses only past information (lags)
Station independence: Models don't share information
Quality filter: Stations whose missing proportion exceeds quality_threshold are skipped
Feature engineering: Automatically creates:
Time features: hour, day, month, year, cyclic transforms
Lag features: 1, 2, 3, 6, 12, 24, 48, 72, 168 periods
Rolling statistics: mean and sd over windows 3, 6, 12, 24, 48, 72
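The feature types listed above can be sketched in base R for a single station. This is only an illustration: lag_vec and roll_stat are hypothetical helpers, the column names are invented, and computing the rolling statistics over the 1-step lagged series (so that the current value is never used) is an assumption about how "past information only" might be enforced.

```r
# Shift a series k steps into the past, padding the start with NA.
lag_vec <- function(x, k) {
  n <- length(x)
  if (k >= n) return(rep(NA_real_, n))
  c(rep(NA_real_, k), x[seq_len(n - k)])
}

# Trailing rolling statistic: the window ends at position i.
# A window containing any NA yields NA.
roll_stat <- function(x, width, fun) {
  out <- rep(NA_real_, length(x))
  if (length(x) < width) return(out)
  for (i in seq(width, length(x))) {
    out[i] <- fun(x[(i - width + 1):i])
  }
  out
}

# Toy hourly series for one station (48 hours).
datetime <- seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"),
                by = "hour", length.out = 48)
temp <- 20 + 5 * sin(2 * pi * (0:47) / 24)
hour <- as.integer(format(datetime, "%H"))

features <- data.frame(
  datetime = datetime,
  temp = temp,
  hour_sin = sin(2 * pi * hour / 24),  # cyclic transform of hour
  hour_cos = cos(2 * pi * hour / 24),
  lag_1 = lag_vec(temp, 1),
  lag_24 = lag_vec(temp, 24),
  roll_mean_3 = roll_stat(lag_vec(temp, 1), 3, mean),
  roll_sd_3 = roll_stat(lag_vec(temp, 1), 3, sd)
)
```

The first rows of each lag and rolling column are NA because there is not enough history yet; tree-based learners such as XGBoost can accept missing feature values, which is one reason this layout is common for gap filling.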
Evaluation Mode Details
When run_evaluation = TRUE, the function:
Creates artificial gaps (MCAR by default)
Runs imputation on the data with gaps
Compares predictions with true values
Returns per-station performance
This helps assess model accuracy before production use.
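The first step, creating artificial MCAR gaps, can be sketched as follows. This is a simplified stand-in rather than the package's implementation: make_mcar_gaps is a hypothetical helper, and the fixed seed and the rounding with floor() are assumptions.

```r
# Hypothetical helper: hide a random share of the observed values.
make_mcar_gaps <- function(x, gap_percentage = 0.2, seed = 42) {
  set.seed(seed)
  observed_idx <- which(!is.na(x))
  n_gaps <- floor(length(observed_idx) * gap_percentage)
  # Every observed position has the same chance of being hidden (MCAR).
  gap_idx <- sort(observed_idx[sample.int(length(observed_idx), n_gaps)])
  x_gapped <- x
  x_gapped[gap_idx] <- NA
  list(x_gapped = x_gapped, gap_idx = gap_idx, truth = x[gap_idx])
}

# 50 observed values plus 2 values that were already missing.
x <- c(seq(15, 27, length.out = 50), NA, NA)
gaps <- make_mcar_gaps(x, gap_percentage = 0.2)
```

Only previously observed positions are hidden, so gaps$truth holds the values that predictions at gaps$gap_idx would later be compared against.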
Examples
if (FALSE) { # \dontrun{
# ===== PRODUCTION MODE: single variable =====
filled_temp <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "tair_dry_bulb_c",
  quality_threshold = 0.3,
  parallel = TRUE
)

# ===== PRODUCTION MODE: multiple variables =====
filled_multi <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = c("tair_dry_bulb_c", "rh_mean_porc", "ws_2_m_s"),
  quality_threshold = 0.3,
  parallel = TRUE,
  workers = 4
)

# ===== PRODUCTION MODE: all numeric variables =====
filled_all <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "all",
  parallel = TRUE
)

# ===== EVALUATION MODE: single variable =====
eval_results <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = "ws_2_m_s",
  run_evaluation = TRUE,
  gap_percentage = 0.2,
  workers = 4
)
# Per-station metrics for the evaluated variable
eval_results$ws_2_m_s$metrics

# ===== EVALUATION MODE: multiple variables =====
eval_multi <- sus_climate_fill_inmet(
  df = climate_data,
  target_var = c("tair_dry_bulb_c", "ws_2_m_s"),
  run_evaluation = TRUE,
  gap_percentage = 0.2
)
# One list element per variable, each with its own metrics table
eval_multi$tair_dry_bulb_c$metrics
eval_multi$ws_2_m_s$metrics
} # }