Fill gaps in climate and air-quality time series using XGBoost
Source: R/sus_climate_fill_gap.R
sus_climate_fill_gaps() imputes missing values in climate time series using
station-wise XGBoost models with automated feature engineering.
Key features:
Single-target focus: Each call imputes ONE variable
Station-wise modeling: Separate model per station
Temporal features: Automatic creation of lags and rolling statistics
Quality control: Stations whose missing proportion exceeds quality_threshold are excluded
Parallel processing: Stations processed in parallel for speed
Evaluation mode: Assess accuracy by creating artificial gaps
Usage
sus_climate_fill_gaps(
  df,
  target_var,
  datetime_col = NULL,
  station_col = NULL,
  quality_threshold = 0.4,
  run_evaluation = FALSE,
  gap_percentage = 0.2,
  keep_features = FALSE,
  parallel = TRUE,
  workers = NULL,
  verbose = TRUE,
  lang = "pt"
)
Arguments
- df
A data frame (or tibble) containing climate data, typically from
sus_climate_inmet(). Must contain:
A datetime column (POSIXct or convertible)
A station identifier column
The target numeric column to be imputed
- target_var
Single character string specifying the column to impute. Example:
target_var = "tair_dry_bulb_c".
- datetime_col
Character. Name of the datetime column. If NULL (default), auto-detected.
- station_col
Character. Name of the station identifier column. If NULL (default), auto-detected.
- quality_threshold
Numeric (0-1). Maximum allowed missing proportion per station. Stations exceeding this are excluded. Default: 0.4 (40%).
- run_evaluation
Logical. If TRUE, runs in evaluation mode:
Creates artificial MCAR gaps in observed data
Imputes and compares predictions with true values
Returns metrics by station
Default: FALSE (production mode).
- gap_percentage
Numeric (0-1). Proportion of data to set as missing in evaluation mode. Gaps are generated under the MCAR (Missing Completely At Random) mechanism by default. Default: 0.2 (20%).
- keep_features
Logical. If TRUE, retains engineered features (lags, rolling stats). Default: FALSE (returns only original columns + is_imputed).
- parallel
Logical. If TRUE (default), processes stations in parallel using furrr.
- workers
Integer. Number of parallel workers. If NULL, uses availableCores() - 1.
- verbose
Logical. If TRUE (default), prints progress messages.
- lang
Character. Message language: "pt" (Portuguese), "en" (English), or "es" (Spanish). Default: "pt".
Value
Production mode (run_evaluation = FALSE):
Returns a tibble (same class as input) with:
Original columns plus imputed values in target_var
is_imputed: Logical flag (TRUE for filled values)
climasus_meta attribute with imputation metadata
Evaluation mode (run_evaluation = TRUE):
Returns a list of class climasus_eval containing:
$data: Data frame with artificial gaps and predictions
$metrics: A tibble with per-station performance metrics:
station: Station identifier
rmse: Root Mean Squared Error
mae: Mean Absolute Error
r_squared: R-squared (at most 1, higher is better)
smape: Symmetric MAPE (0-200%, lower is better)
slope_bias: Should be close to 1.0; values away from 1.0 indicate systematic under- or overestimation
n_gaps: Number of artificial gaps
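The exact formulas behind these metrics are not given on this page, so the following is only a sketch of how such per-station metrics are commonly computed in base R. The vectors obs and pred are made-up values, and treating slope_bias as the slope of a linear regression of predictions on observations is an assumption, not a confirmed detail of the package.

```r
# Hypothetical true values at the artificial gaps and the matching predictions
obs  <- c(20.1, 21.4, 19.8, 22.0, 23.5)
pred <- c(20.0, 21.0, 20.2, 22.4, 23.0)

err <- pred - obs

metrics <- list(
  rmse      = sqrt(mean(err^2)),                               # Root Mean Squared Error
  mae       = mean(abs(err)),                                  # Mean Absolute Error
  r_squared = 1 - sum(err^2) / sum((obs - mean(obs))^2),       # 1 - SSE / SST
  smape     = 100 * mean(2 * abs(err) / (abs(obs) + abs(pred))), # bounded by 0 and 200
  # Assumption: slope of pred regressed on obs; 1.0 means no proportional bias
  slope_bias = unname(coef(lm(pred ~ obs))[2]),
  n_gaps    = length(obs)
)

str(metrics)
```

Under this definition, r_squared can become negative when predictions are worse than simply using the mean of the observed values.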
Methodological Notes
Important limitations:
Not forecasting: Predicts only where data are missing
No future data: Uses only past information (lags)
Station independence: Models don't share information
Quality filter: Stations whose missing proportion exceeds quality_threshold are skipped
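As an illustration of the quality filter, the sketch below computes the missing proportion per station and keeps only stations at or below the threshold. It uses base R and a made-up data frame (toy); the package's internal implementation may differ.

```r
# Hypothetical toy data: station A has 2 of 10 values missing, station B has 7 of 10
toy <- data.frame(
  station = rep(c("A", "B"), each = 10),
  tair_dry_bulb_c = c(20, NA, 21, 22, NA, 23, 24, 25, 26, 27,
                      NA, NA, NA, NA, NA, NA, 18, 19, NA, 20)
)

quality_threshold <- 0.4

# Proportion of missing target values per station
missing_prop <- tapply(toy$tair_dry_bulb_c, toy$station,
                       function(x) mean(is.na(x)))

# Stations whose missing proportion exceeds the threshold would be skipped
kept_stations <- names(missing_prop)[missing_prop <= quality_threshold]

print(missing_prop)
print(kept_stations)
```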
Feature engineering: Automatically creates:
Time features: hour, day, month, year, cyclic transforms
Lag features: 1,2,3,6,12,24,48,72,168 periods
Rolling statistics: mean and sd over windows 3,6,12,24,48,72
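The feature types listed above can be sketched in base R as follows. The helpers lag_vec and roll_past are hypothetical names introduced for illustration, and the choice to compute rolling statistics over the previous values only (excluding the current one) is an assumption made to stay consistent with the "no future data" note; the package's actual feature code is not shown on this page.

```r
# Hypothetical helper: shift a numeric vector back by k periods, padding with NA
lag_vec <- function(x, k) c(rep(NA_real_, k), x)[seq_along(x)]

# Hypothetical helper: statistic over the `width` values preceding each position
roll_past <- function(x, width, fun) {
  vapply(seq_along(x), function(i) {
    if (i <= width) return(NA_real_)           # not enough history yet
    fun(x[(i - width):(i - 1)], na.rm = TRUE)  # past values only
  }, numeric(1))
}

# Made-up hourly series for one station
datetime <- seq(as.POSIXct("2024-01-01 00:00", tz = "UTC"),
                by = "hour", length.out = 48)
temp <- 20 + 5 * sin(2 * pi * (0:47) / 24)
hour <- as.integer(format(datetime, "%H"))

features <- data.frame(
  datetime    = datetime,
  temp        = temp,
  hour_sin    = sin(2 * pi * hour / 24),  # cyclic transform: hour 23 sits next to hour 0
  hour_cos    = cos(2 * pi * hour / 24),
  lag_1       = lag_vec(temp, 1),
  lag_24      = lag_vec(temp, 24),
  roll_mean_3 = roll_past(temp, 3, mean),
  roll_sd_3   = roll_past(temp, 3, sd)
)

head(features)
```

Only a few of the listed lags and windows are shown; the same helpers extend to the others by changing k and width.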
Evaluation Mode Details
When run_evaluation = TRUE, the function:
Creates artificial gaps (MCAR by default)
Runs imputation on the data with gaps
Compares predictions with true values
Returns per-station performance
This helps assess model accuracy before production use.
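The evaluation steps above can be sketched for a single series in base R. This is an illustration under stated assumptions: the series is synthetic, and linear interpolation stands in for the XGBoost model purely to keep the example self-contained.

```r
set.seed(42)

# Made-up fully observed hourly series
observed <- 20 + 5 * sin(2 * pi * (1:200) / 24)
gap_percentage <- 0.2

# Step 1: MCAR gaps, every position has the same chance of being masked
n_gaps  <- round(gap_percentage * length(observed))
gap_idx <- sample(seq_along(observed), n_gaps)
with_gaps <- observed
with_gaps[gap_idx] <- NA

# Step 2: impute. Stand-in imputer (NOT XGBoost): linear interpolation,
# with rule = 2 carrying the nearest value to gaps at either end
ok <- !is.na(with_gaps)
filled <- approx(which(ok), with_gaps[ok],
                 xout = seq_along(with_gaps), rule = 2)$y

# Step 3: compare predictions with the held-back true values
rmse <- sqrt(mean((filled[gap_idx] - observed[gap_idx])^2))
mae  <- mean(abs(filled[gap_idx] - observed[gap_idx]))

c(n_gaps = n_gaps, rmse = rmse, mae = mae)
```

Because the masked values are known, the error is measured only at the artificial gaps, which is what makes the resulting metrics an estimate of imputation accuracy rather than of in-sample fit.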
Examples
if (FALSE) { # \dontrun{
# ===== PRODUCTION MODE =====
# Impute missing temperature data
filled_temp <- sus_climate_fill_gaps(
  df = climate_data,
  target_var = "tair_dry_bulb_c",
  quality_threshold = 0.3,
  parallel = TRUE
)

# ===== EVALUATION MODE =====
# Assess model performance using artificial gaps
eval_results <- sus_climate_fill_gaps(
  df = climate_data,
  target_var = "ws_2_m_s",
  run_evaluation = TRUE,
  gap_percentage = 0.2,
  workers = 4
)

# View performance metrics
eval_results$metrics
} # }