Iteratively impute missing values by regression with user-specified model acceptance criteria

Iteratively imputes missing, "not assigned" values (NAs) in a data frame using regression models derived from correlation analyses of site records (columns). Available models include ordinary least squares (OLS), ridge regression (ridge), and the maintenance of variance extension, type 1 (MOVE.1). Models meeting a user-specified error threshold are used to impute missing values at target sites. Error thresholds are assessed based on commonly used metrics to evaluate model fits (RMSE, MAE, NRMSE, and NSE). Imputed records can be used as references (predictors) to build subsequent regression models until gap-filling ceases. Selection of reference sites used to impute values at target sites can be constrained by geospatial, group, and correlation-based cutoff filters. A non-parametric bootstrapping routine is used to generate prediction intervals for imputed values.

Usage

impute_grid(
  input_grid,
  model = "ridge",
  error_thresh = 0,
  error_method = "NSE",
  n_refwl = "max",
  p_per_n = 0.5,
  relax = 0.1,
  final_pass = T,
  r_cutoff = NA,
  d_cutoff = NA,
  sites = NA,
  sites_crs = "NAD83",
  group_sites = F,
  rnd = 2,
  bootstrap_PI = F,
  B = 500,
  PI_alpha = 0.05,
  cv_lambda = "lambda.min",
  nfolds = 10,
  add_std_means = T,
  verbose = T
)

Arguments

input_grid: data frame populated with observed, numeric values and NAs. Sites (variables) are included as columns and timesteps (observations) are included as rows. An additional leading timestep column formatted as a vector of R "Date" or "numeric" class type can be included and is used for indexing output into long format. If absent, a timestep column will be generated from sequential row numbers and included for output indexing. Each row must have at least one observed value for imputation to proceed. Only columns with at least ten observed values will be considered as viable imputation targets. Accepts output from timestep_grid and trim_grid functions.
model: regression model to use for imputation. Can be set to "OLS", "ridge", or "MOVE.1". Default is "ridge".
error_thresh: error threshold used to evaluate regression model results. When error_method is set to "RSME","MAE", or "NRMSE" model errors must be less than the error_thresh to accept modeled values for imputation. When error_method is set to "NSE" model errors must be greater than the error_thresh to accept modeled values for imputation. Default is 0 (only applicable to the default "NSE" method; model results are not accepted if errors are higher than the mean of the observed data).
error_method: method by which to compute regression-model errors. Can be set to "RMSE", "MAE", "NRMSE", or "NSE". See details for descriptions of error methods. Default is "NSE".
n_refwl: maximum number of reference sites used to build regression model. Can be set to a positive numeric value or "max". Numeric values specify an upper limit to the number of reference sites used to build regression models for imputation. When specified as "max", uses all available sites that meet basic reference-selection criteria. Default is "max".
p_per_n: number of predictors allowed per observation. Must be numeric between 0 and 1. For example, if set to 0.1, a model using 100 observed values is limited to the top 10 predictors (most correlated references to the target). Set to NA to remove this constraint. Default is 0.5, which is recommended for ridge regression when n_refwl is set to "max". Recommended to set to 0.1 or lower when using the OLS model to prevent overfitting.
relax: relaxation parameter. Specify as a numeric from 0 to 1. 1 - relax constrains the proportion of matching, row indexed observations at among target and reference sites required to build a correlation model. If set to 0, all observed values at the target site must be used to build a regression model, and only complete reference site records can be used for imputation. Default is 0.1, which requires at least 90 percent of observations at the target site be used to build regression models with correlated reference sites.
final_pass: logical flag determining whether initial imputation models will be overwritten during a final pass. If TRUE, allows overwriting of initial imputation results if improvement in model error is achieved after initial imputation with a more complete dataset. Typically results in improved model fits, but less traceable results. Default is TRUE.
r_cutoff: correlation cutoff value. Minimum allowable Pearson's correlation coefficient (r) between reference and target sites. Default is NA, which does not apply an r_cutoff.
d_cutoff: distance cutoff in kilometers (km). Maximum allowable distance a reference site can be from a target site. Requires input of latitude and longitude coordinates in sites input argument. Default is NA, which does not apply a d_cutoff.
sites: data frame of site metadata. Requires unique site identifiers (character or numeric) matching site columns in the input_grid data frame without the leading "X." character string if present. If using the d_cutoff method, columns named latitude and longitude must be present. If group_sites is set to TRUE, a column named group (character or numeric) must be present.
sites_crs: coordinate reference system used for geospatial coordinates included in sites data frame. Can be set to "NAD83" (EPSG: 4269) or "WGS84" (EPSG: 4326). Default is "NAD83".
group_sites: logical flag determining whether to restrict references to the target site's group. If TRUE only sites within the target site's group, as specified in the site data frame, can be used as references for imputation. Default is FALSE.
rnd: decimals to round model output to. Specified as numeric of zero or greater. Default is 2.
bootstrap_PI: logical flag to generate prediction intervals using non-parametric bootstrap routine based on Mougan and Nielsen (2022). Default is FALSE.
B: numeric specifying number of bootstrap iterations. Default is 500.
PI_alpha: significance level used to compute prediction intervals. Specify as numeric between 0 and 1. Default is 0.05 to compute a 95 percent prediction interval.
cv_lambda: lambda value selection from ridge regression cross-validation results. Can be specified as "lambda.min" or "lambda.1se". Default is "lambda.min". See cv.glmnet for details.
nfolds: number of cross-validation folds for automated selection of the lambda penalty parameter for ridge regression models. Default is 10. See cv.glmnet for details.
add_std_means: logical flag to use z-score standardized group means as additional synthetic reference sites. If TRUE computes sitewise z-scores for all observed values prior to imputation and creates a synthetic reference to create synthetic reference records for imputation. A reference named all_sites is created by averaging all site z-score values by timestep. Produces additional group means if a group field is supplied with the sites input data frame. Default is TRUE.
verbose: logical flag whether to print detailed information to the console while the function is running. Default is TRUE.

Value

A named list object with S3 class "ARCHI" containing:

imputed_grid: data frame with imputed input_grid
model_grid: data frame with structure of input_grid containing regression model predictions for imputed sites.
PI_upr: data frame with structure of input_grid containing upper prediction intervals for modeled sites.
PI_lwr: data frame with structure of input_grid containing lower prediction intervals for modeled sites.
refs: named list of imputed target sites containing reference site identifiers. Reference sites are ordered from most to least correlated to the named target.
model_stats: data frame containing statistical summary of modeling results for all input sites. Fields include: site_no site identifier, proportion_complete proportion of timesteps (rows) containing observed values, group grouping variable included if included in sites input (defaults to "all_sites"), RMSE root mean square error, MAE mean absolute error, NRMSE normalized root mean square error (RMSE divided by the standard deviation of observations), NSE Nash-Sutcliffe efficiency, var_ratio ratio of modeled to observed variance, lambda penalty parameter used for ridge regression, mod_n number of observations used to build regression model, n_refs number of reference sites (predictors) used to build regression model, mod_notes notes on model output, regression_model regression model used.
out_long: Observed and imputed values output as long-format data frame with the following fields: site_no site identifier, timestep index value from timestep field (date formatting is preserved from input), value imputed or observed response variable, PI_lwr lower prediction interval, PI_upr upper prediction interval, data_range defines whether an imputed value represents interpolation (within range of training data) or extrapolation (outside the range of training data), time_range defines whether an imputed value represents gap_fill (within the period of the training data) or extension (outside the period of the training data) value_type defines whether value is imputed or observed.
std_means: data frame of standardized group means used as synthetic reference sites. Only included in output list when add_std_means = TRUE.

Details

The impute_grid() function imputes NAs in a data frame of numeric values (input_grid). A leading timestep column can be included with "Date" or "numeric" type values for output indexing. If a leading the timestep column is not included in the input_grid one will be generated using sequential row numbers as index values. Remaining columns represent individual site records with rows indexed to common timesteps.

"Target" sites refer to sites with missing values (NAs) that will be attempted for imputation. "Reference" sites refer to sites used a predictors to build regression models fit to the record at the target site (response variable). Evans and others (2020) provide a description of this basic modeling approach to impute groundwater-level records at a single target site. The impute_grid() algorithm extends the approach of Evans and others (2020) by allowing all sites included in the input data frame to serve as targets and references for each other.

Algorithm Overview

The ARCHI algorithm has a similar generalized form to those of the R packages mice (van Buuren and Groothuis-Oudshoorn, 2011) and missForest (Stekhoven and Bühlmann, 2012), which employ variable-by-variable imputation strategies sometimes referred to as chained equations, fully conditional specification, or sequential regression (van Buuren, 2018).

The ARCHI algorithm proceeds as follows. Each incomplete site record is attempted for imputation beginning with the site with the most observed values and proceeding in order of completeness to the site with the least observed values. Reference sites are selected using the criteria defined below (see "Selection Criteria for Reference Sites"). A regression model is fit to the target (response) using the selected references (predictors) as described below (see "Regression Models"). The regression model is evaluated using the user-specified method error_method and error_thresh arguments. If the model fit is sufficient (less than error_thresh for "RMSE", "MAE", or "NRMSE" and greater than error_thresh for "NSE"), model results are saved and used to impute NAs in the target record. The next site is then considered for imputation using the entire imputed dataset as a potential reference pool and so on.

An iteration is completed once all candidate target sites are cycled through once. Iterations continue until regression models cease to be able to impute missing values or none remain. If final_pass is set to TRUE a "final pass" iteration is initiated once the initial stopping criteria has been met. During the final pass, initial imputations can be over-written if an improved model fit is attained. After the final pass, iterations proceed again without over-writing until gap-filling once again ceases.

Boostrapping of model prediction intervals commences after imputation is completed if Bootstrap_PI = TRUE. Bootstrapped prediction intervals are generated using the method described by Mougan and Nielsen (2022). Some truncation of prediction intervals from ideal values may occur if final_pass is set to TRUE. Imputation accuracy and coverage rates for prediction intervals can be evaluated with the hold_grid and hold_eval functions included in the ARCHI package.

Selection Criteria For Reference Sites

The following basic criteria is used to select reference sites for a given target:

reference sites must have row-indexed data corresponding to all NAs in the target record;
reference sites must have row-indexed data corresponding to the proportion of observed values in the target record allowed by the relax parameter; and
reference records (observed and imputed values) must have a significant linear correlation (p-value for Pearson's r less than or equal 0.05) to observed values in the target record.

If relax = 0 only complete reference records can be used to impute a given site. The default of relax = 0.1 allows for reference correlations to be based on a minimum of 90 percent of the observed target record. Target records with less than 10 observed values are not attempted for imputation.

After the initial pool of potential reference sites is selected, a series of cutoffs can be applied to further constrain the references used for imputation:

The d_cutoff parameter only allows references to be selected if they are within a user-specified buffer radius of the target (in kilometers).
The r_cutoff parameter only allows references to be used if they have a correlation to the target greater than or equal to the cutoff.
When group_sites = TRUE references can only be used to impute a target if they share the same user-supplied group designation specified in the sites data frame.

From the final pool of potential reference sites, the top n_refwl most correlated sites are used to fit a regression model to the target. If p_per_n is specified, the maximum number of references is limited on the basis of the number of observations used to fit the regression model. For MOVE.1 only the top most correlated reference site is used to build the regression model. For OLS and ridge models the number of references is constrained to always be less than the number of observations (p < n).

Regression Models

Three regression models are available in the impute_grid() function: ordinary least squares (OLS) using lm, ridge regression (ridge) using glmnet, and the maintenance of variance extension, type 1 (MOVE.1) using methods described by Hirsch (1982). The OLS and ridge models support multiple predictors whereas the MOVE.1 model can only use one predictor. If only one reference is available to impute a target site and OLS or ridge is selected, the regression model will default to simple linear regression (SLR) defined here as OLS with only one predictor. Ridge regression requires estimation of a penalty parameter (lambda). This is accomplished using the R package cv.glmnet, which employs an efficient cross-validation routine for selection of lambda. The impute_grid() function arguments nfolds and cv_lambda are passed to glmnet to select the optimal lambda from cross-validation results. Because the lambda cross-validation routine incorporates an element of randomness, if using the ridge model run set.seed prior to impute_grid() for reproducible results.

Prior to modeling, target and selected reference records ($R$) are normalized to the range of the training data using the following equation adapted from Evans and others (2020): $$R^{normi} =\frac {R^i - min(R^{traini})}{max(R^{traini}) - min(R^{traini})}$$ Where $R^{normi}$ denotes the $ith$ normalized value, $R^{i}$ denotes the $ith$ un-normalized value, and $R^{traini}$ denotes values in the training dataset. This ensures all training data are normalized to a set of values from 0 to 1. Predictions are back-transformed to original units following model fitting.

Evaluation of Model Error

Model-error thresholds are evaluated with the following error methods. Modeled ("simulated") values are denoted with by $S$ and observed values are denoted by $O$.

Root mean square error (RMSE): $$RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^N{ \left( S_i - O_i \right)^2 }}$$

Mean absolute error (MAE): $$MAE = \frac{1}{N} \sum_{i=1}^N { \left|S_i - O_i \right| }$$

Normalized root mean square error (NRMSE): $$NRMSE = \frac {\sqrt{ \frac{1}{N} \sum_{i=1}^N { \left( S_i - O_i \right)^2 } } } {\sigma_o} $$

Where $\sigma_o$ is the standard deviation of the observed values.

Nash-Sutcliffe efficiency (NSE): $$NSE = 1 -\frac{ \sum_{i=1}^N { \left( S_i - O_i \right)^2 }}{ \sum_{i=1}^N { \left( O_i - \bar{O} \right)^2 }}$$

An additional metric for model evaluation, the variance ratio, is included in the model_stats output data frame:

Variance ratio (var_ratio): $$ {\sigma_{ratio}}^2 = \frac{{\sigma_s}^2}{{\sigma_o}^2} $$

Where ${\sigma_o}^2$ and ${\sigma_s}^2$ are the variances of observed and simulated values, respectively.

$RMSE$ and $MAE$ are in the units of the observation values. $NRMSE$ and $NSE$ are dimensionless. The $NSE$ ranges from -Inf to 1, where 1 represents a perfect model fit. An $NSE$ value of zero signifies model predictions are as accurate as the mean of the observed data. A negative $NSE$ value indicates the model is a worse predictor than the mean of the observed data (Nash and Sutcliffe, 1970).

References

Evans, S.W., Jones, N.L., Williams, G.P., Ames, D.P., and Nelson, E.J., 2020, Groundwater Level Mapping Tool: An open source web application for assessing groundwater sustainability: Environmental Modelling and Software, 131, 104782, https://doi.org/10.1016/j.envsoft.2020.104782.

Hirsch, R.M., 1982, A comparison of four streamflow record extension techniques: Water Resources Research, v. 18, p. 1081–1088, https://doi.org/10.1029/WR018i004p01081

Mougan, C., and Nielsen, D.S., 2022, Monitoring Model Deterioration with Explainable Uncertainty Estimation via Non-parametric Bootstrap. Proceedings of the AAAI Conference on Artificial Intelligence 1686, 15037-15045, Available online: https://doi.org/10.1609/aaai.v37i12.26755.

Nash, J.E., and Sutcliffe, J.V., 1970, River flow forecasting through conceptual models part I — A discussion of principles: Journal of Hydrology, v. 10, no. 3, p. 282–290, https://doi.org/10.1016/0022-1694(70)90255-6.

Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, no. 1: 112-118. https://doi.org/10.1093/bioinformatics/btr597.

van Buuren, S., 2018, Flexible Imputation of Missing Data, Second Edition, CRC Press, Boca Raton. https://doi.org/10.1201/9780429492259.

van Buuren, S. and Groothuis-Oudshoorn, K., 2011, mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45, no. 3: 1-67. https://doi.org/10.18637/jss.v045.i03.

Author

Zeno F. Levy, Timothy J. Stagnitta, and Robin L. Glas
Maintainer: Zeno F. Levy zlevy@usgs.gov

Examples

if (FALSE) { # \dontrun{
# load example Long Island dataset
data(LI_data)

# grid data at monthly timestep using median observed values
  grid <- timestep_grid(data = LI_data, 
                        timestep = "monthly", 
                        agg_method = "median")

# trim grid to remove sites that are less than 35 percent complete
  grid <- trim_grid(grid, data_thresh = 0.35)

# impute grid using default settings
# ridge regression with an NSE error threshold of 0, 
# using all correlated references limited by a p_per_n ratio of 0.5
# (one predictor for every two observations)

  mod <- impute_grid(input_grid = grid) 

# Modify default settings using top 10
# most correlated reference sites as predictors 
  
  mod10 <- impute_grid(input_grid = grid,
                       n_refwl = 10,
                       p_per_n = NA)

# use all correlated references within 50 km of target for imputation
  data(LI_sites)                        
  
  mod50km <- impute_grid(input_grid = grid,
                         sites = LI_sites,
                         sites_crs = "NAD83",
                         d_cutoff = 50,
                         p_per_n = NA)

# limit reference selection to top 10 correlated sites within target's group 
  
  modgrp10 <- impute_grid(input_grid = grid,
                         sites = LI_sites,
                         n_refwl = 10,
                         group_sites = TRUE)

# use OLS regression with maximum number of reference sites limited to one 
# predictor per every ten observations
  
  ols <- impute_grid(input_grid = grid,
                     model = "OLS",
                     n_refwl = "max",
                     p_per_n = 0.1)
                   
# use MOVE.1 model and require all references have correlations to the target 
# with Person's r values greater than or equal to 0.7
  
  m1 <- impute_grid(input_grid = grid,
                    model = "MOVE.1",
                    r_cutoff = 0.7)
                    
# generate 95% prediction intervals for MOVE.1 model using non-parametric bootstrapping
  
  m1_PI <- impute_grid(input_grid = grid,
                       model = "MOVE.1",
                       bootstrap_PI = TRUE,
                       PI_alpha = 0.05,
                       B = 500)
} # }