Iteratively impute missing values by regression with user-specified model acceptance criteria
impute_grid.Rd
Iteratively imputes missing, "not assigned" values (NA
s) in a data frame using regression
models derived from correlation analyses of site records (columns).
Available models include ordinary least squares (OLS
), ridge regression
(ridge
), and the maintenance of variance extension, type 1 (MOVE.1
).
Models meeting a user-specified error threshold are used to impute missing
values at target sites. Error thresholds are assessed based on commonly used
metrics to evaluate model fits (RMSE
, MAE
, NRMSE
, and NSE
).
Imputed records can be used as references (predictors) to build subsequent
regression models until gap-filling ceases. Selection of reference sites used
to impute values at target sites can be constrained by geospatial, group, and
correlation-based cutoff filters. A non-parametric bootstrapping routine is used
to generate prediction intervals for imputed values.
Usage
impute_grid(
input_grid,
model = "ridge",
error_thresh = 0,
error_method = "NSE",
n_refwl = "max",
p_per_n = 0.5,
relax = 0.1,
final_pass = T,
r_cutoff = NA,
d_cutoff = NA,
sites = NA,
sites_crs = "NAD83",
group_sites = F,
rnd = 2,
bootstrap_PI = F,
B = 500,
PI_alpha = 0.05,
cv_lambda = "lambda.min",
nfolds = 10,
add_std_means = T,
verbose = T
)
Arguments
- input_grid
data frame populated with observed, numeric values and
NA
s. Sites (variables) are included as columns and timesteps (observations) are included as rows. An additional leadingtimestep
column formatted as a vector of R"Date"
or"numeric"
class type can be included and is used for indexing output into long format. If absent, atimestep
column will be generated from sequential row numbers and included for output indexing. Each row must have at least one observed value for imputation to proceed. Only columns with at least ten observed values will be considered as viable imputation targets. Accepts output fromtimestep_grid
andtrim_grid
functions.- model
regression model to use for imputation. Can be set to
"OLS"
,"ridge"
, or"MOVE.1"
. Default is"ridge"
.- error_thresh
error threshold used to evaluate regression model results. When
error_method
is set to"RSME"
,"MAE"
, or"NRMSE"
model errors must be less than theerror_thresh
to accept modeled values for imputation. Whenerror_method
is set to"NSE"
model errors must be greater than theerror_thresh
to accept modeled values for imputation. Default is0
(only applicable to the default"NSE"
method; model results are not accepted if errors are higher than the mean of the observed data).- error_method
method by which to compute regression-model errors. Can be set to
"RMSE"
,"MAE"
,"NRMSE"
, or"NSE"
. See details for descriptions of error methods. Default is"NSE"
.- n_refwl
maximum number of reference sites used to build regression model. Can be set to a positive numeric value or
"max"
. Numeric values specify an upper limit to the number of reference sites used to build regression models for imputation. When specified as"max",
uses all available sites that meet basic reference-selection criteria. Default is"max"
.- p_per_n
number of predictors allowed per observation. Must be numeric between 0 and 1. For example, if set to 0.1, a model using 100 observed values is limited to the top 10 predictors (most correlated references to the target). Set to
NA
to remove this constraint. Default is0.5
, which is recommended for ridge regression whenn_refwl
is set to"max"
. Recommended to set to0.1
or lower when using theOLS
model to prevent overfitting.- relax
relaxation parameter. Specify as a numeric from 0 to 1.
1 - relax
constrains the proportion of matching, row indexed observations at among target and reference sites required to build a correlation model. If set to 0, all observed values at the target site must be used to build a regression model, and only complete reference site records can be used for imputation. Default is0.1
, which requires at least 90 percent of observations at the target site be used to build regression models with correlated reference sites.- final_pass
logical flag determining whether initial imputation models will be overwritten during a final pass. If
TRUE
, allows overwriting of initial imputation results if improvement in model error is achieved after initial imputation with a more complete dataset. Typically results in improved model fits, but less traceable results. Default isTRUE
.- r_cutoff
correlation cutoff value. Minimum allowable Pearson's correlation coefficient (r) between reference and target sites. Default is
NA
, which does not apply anr_cutoff
.- d_cutoff
distance cutoff in kilometers (km). Maximum allowable distance a reference site can be from a target site. Requires input of latitude and longitude coordinates in
sites
input argument. Default isNA
, which does not apply ad_cutoff
.- sites
data frame of site metadata. Requires unique site identifiers (character or numeric) matching site columns in the
input_grid
data frame without the leading "X." character string if present. If using thed_cutoff
method, columns namedlatitude
andlongitude
must be present. Ifgroup_sites
is set toTRUE
, a column namedgroup
(character or numeric) must be present.- sites_crs
coordinate reference system used for geospatial coordinates included in
sites
data frame. Can be set to"NAD83"
(EPSG: 4269) or"WGS84"
(EPSG: 4326). Default is"NAD83"
.- group_sites
logical flag determining whether to restrict references to the target site's group. If
TRUE
only sites within the target site's group, as specified in thesite
data frame, can be used as references for imputation. Default isFALSE
.- rnd
decimals to round model output to. Specified as numeric of zero or greater. Default is
2
.- bootstrap_PI
logical flag to generate prediction intervals using non-parametric bootstrap routine based on Mougan and Nielsen (2022). Default is
FALSE
.- B
numeric specifying number of bootstrap iterations. Default is
500
.- PI_alpha
significance level used to compute prediction intervals. Specify as numeric between 0 and 1. Default is
0.05
to compute a 95 percent prediction interval.- cv_lambda
lambda value selection from ridge regression cross-validation results. Can be specified as
"lambda.min"
or"lambda.1se"
. Default is"lambda.min"
. Seecv.glmnet
for details.- nfolds
number of cross-validation folds for automated selection of the lambda penalty parameter for ridge regression models. Default is
10
. Seecv.glmnet
for details.- add_std_means
logical flag to use z-score standardized group means as additional synthetic reference sites. If
TRUE
computes sitewise z-scores for all observed values prior to imputation and creates a synthetic reference to create synthetic reference records for imputation. A reference namedall_sites
is created by averaging all site z-score values by timestep. Produces additional group means if agroup
field is supplied with thesites
input data frame. Default isTRUE
.- verbose
logical flag whether to print detailed information to the console while the function is running. Default is
TRUE
.
Value
A named list object with S3 class "ARCHI"
containing:
- imputed_grid
data frame with imputed
input_grid
- model_grid
data frame with structure of
input_grid
containing regression model predictions for imputed sites.- PI_upr
data frame with structure of
input_grid
containing upper prediction intervals for modeled sites.- PI_lwr
data frame with structure of
input_grid
containing lower prediction intervals for modeled sites.- refs
named list of imputed target sites containing reference site identifiers. Reference sites are ordered from most to least correlated to the named target.
- model_stats
data frame containing statistical summary of modeling results for all input sites. Fields include: site_no site identifier, proportion_complete proportion of timesteps (rows) containing observed values, group grouping variable included if included in
sites
input (defaults to"all_sites"
), RMSE root mean square error, MAE mean absolute error, NRMSE normalized root mean square error (RMSE divided by the standard deviation of observations), NSE Nash-Sutcliffe efficiency, var_ratio ratio of modeled to observed variance, lambda penalty parameter used for ridge regression, mod_n number of observations used to build regression model, n_refs number of reference sites (predictors) used to build regression model, mod_notes notes on model output, regression_model regression model used.- out_long
Observed and imputed values output as long-format data frame with the following fields: site_no site identifier, timestep index value from timestep field (date formatting is preserved from input), value imputed or observed response variable, PI_lwr lower prediction interval, PI_upr upper prediction interval, data_range defines whether an imputed value represents
interpolation
(within range of training data) orextrapolation
(outside the range of training data), time_range defines whether an imputed value representsgap_fill
(within the period of the training data) orextension
(outside the period of the training data) value_type defines whether value isimputed
orobserved
.- std_means
data frame of standardized group means used as synthetic reference sites. Only included in output list when
add_std_means = TRUE
.
Details
The impute_grid()
function imputes NA
s in a data frame of numeric
values (input_grid
). A leading timestep
column can be included with
"Date"
or "numeric"
type values for output indexing. If a leading the
timestep
column is not included in the input_grid
one will be generated
using sequential row numbers as index values. Remaining columns represent
individual site records with rows indexed to common timesteps.
"Target" sites refer to sites with missing values (NAs
) that will be
attempted for imputation. "Reference" sites refer to sites used a predictors
to build regression models fit to the record at the target site (response variable).
Evans and others (2020) provide a description of this basic modeling approach
to impute groundwater-level records at a single target site. The impute_grid()
algorithm extends the approach of Evans and others (2020) by allowing all sites
included in the input data frame to serve as targets and references for each other.
Algorithm Overview
The ARCHI algorithm has a similar generalized form to those of
the R packages mice
(van Buuren and Groothuis-Oudshoorn, 2011) and missForest
(Stekhoven and Bühlmann, 2012), which employ variable-by-variable imputation
strategies sometimes referred to as chained equations, fully conditional
specification, or sequential regression (van Buuren, 2018).
The ARCHI algorithm proceeds as follows. Each incomplete site record is attempted
for imputation beginning with the site with the most observed values and
proceeding in order of completeness to the site with the least
observed values. Reference sites are selected using the criteria defined
below (see "Selection Criteria for Reference Sites"). A regression model
is fit to the target (response) using the selected references (predictors)
as described below (see "Regression Models"). The regression model is
evaluated using the user-specified method error_method
and error_thresh
arguments. If the model fit is sufficient (less than error_thresh
for
"RMSE"
, "MAE"
, or "NRMSE"
and greater than error_thresh
for "NSE"
),
model results are saved and used to impute NA
s in the target record.
The next site is then considered for imputation using the entire imputed
dataset as a potential reference pool and so on.
An iteration is completed once all candidate target sites are cycled through
once. Iterations continue until regression models cease to be able to impute
missing values or none remain. If final_pass
is set to TRUE
a "final pass"
iteration is initiated once the initial stopping criteria has been met. During
the final pass, initial imputations can be over-written if an improved model
fit is attained. After the final pass, iterations proceed again without
over-writing until gap-filling once again ceases.
Boostrapping of model prediction intervals commences after imputation is
completed if Bootstrap_PI = TRUE
. Bootstrapped prediction intervals are
generated using the method described by Mougan and Nielsen (2022). Some
truncation of prediction intervals from ideal values may occur if final_pass
is set to TRUE
. Imputation accuracy and coverage rates for prediction
intervals can be evaluated with the hold_grid
and
hold_eval
functions included in the ARCHI package.
Selection Criteria For Reference Sites
The following basic criteria is used to select reference sites for a given target:
reference sites must have row-indexed data corresponding to all
NA
s in the target record;reference sites must have row-indexed data corresponding to the proportion of observed values in the target record allowed by the
relax
parameter; andreference records (observed and imputed values) must have a significant linear correlation (p-value for Pearson's r less than or equal 0.05) to observed values in the target record.
If relax = 0
only complete reference records can be used to impute a given site.
The default of relax = 0.1
allows for reference correlations to be based
on a minimum of 90 percent of the observed target record. Target records with
less than 10 observed values are not attempted for imputation.
After the initial pool of potential reference sites is selected, a series of cutoffs can be applied to further constrain the references used for imputation:
The
d_cutoff
parameter only allows references to be selected if they are within a user-specified buffer radius of the target (in kilometers).The
r_cutoff
parameter only allows references to be used if they have a correlation to the target greater than or equal to the cutoff.When
group_sites = TRUE
references can only be used to impute a target if they share the same user-supplied group designation specified in thesites
data frame.
From the final pool of potential reference sites, the top n_refwl
most correlated sites are used to fit a regression model to the target. If
p_per_n
is specified, the maximum number of references is limited on the
basis of the number of observations used to fit the regression model. For
MOVE.1
only the top most correlated reference site is used to
build the regression model. For OLS
and ridge
models the number of
references is constrained to always be less than the number of observations
(p < n).
Regression Models
Three regression models are available in the impute_grid()
function:
ordinary least squares (OLS
) using lm
, ridge regression
(ridge
) using glmnet
, and the maintenance of variance
extension, type 1 (MOVE.1
) using
methods
described by Hirsch (1982). The OLS
and ridge
models support multiple
predictors whereas the MOVE.1
model can only use one predictor. If only
one reference is available to impute a target site and OLS
or ridge
is selected, the regression model will default to simple linear regression
(SLR
) defined here as OLS
with only one predictor.
Ridge regression requires estimation of a penalty parameter (lambda
).
This is accomplished using the R package cv.glmnet
,
which employs an efficient cross-validation routine for selection of lambda
.
The impute_grid()
function arguments nfolds
and cv_lambda
are passed to
glmnet
to select the optimal lambda
from cross-validation
results. Because the lambda
cross-validation routine incorporates an element
of randomness, if using the ridge
model run set.seed
prior to impute_grid()
for reproducible results.
Prior to modeling, target and selected reference records (\(R\)) are normalized to the range of the training data using the following equation adapted from Evans and others (2020): $$R^{normi} =\frac {R^i - min(R^{traini})}{max(R^{traini}) - min(R^{traini})}$$ Where \(R^{normi}\) denotes the \(ith\) normalized value, \(R^{i}\) denotes the \(ith\) un-normalized value, and \(R^{traini}\) denotes values in the training dataset. This ensures all training data are normalized to a set of values from 0 to 1. Predictions are back-transformed to original units following model fitting.
Evaluation of Model Error
Model-error thresholds are evaluated with the following error methods. Modeled ("simulated") values are denoted with by \(S\) and observed values are denoted by \(O\).
Root mean square error (RMSE
):
$$RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^N{ \left( S_i - O_i \right)^2 }}$$
Mean absolute error (MAE
):
$$MAE = \frac{1}{N} \sum_{i=1}^N { \left|S_i - O_i \right| }$$
Normalized root mean square error (NRMSE
):
$$NRMSE = \frac {\sqrt{ \frac{1}{N} \sum_{i=1}^N { \left( S_i - O_i \right)^2 } } } {\sigma_o} $$
Where \(\sigma_o\) is the standard deviation of the observed values.
Nash-Sutcliffe efficiency (NSE
):
$$NSE = 1 -\frac{ \sum_{i=1}^N { \left( S_i - O_i \right)^2 }}{ \sum_{i=1}^N { \left( O_i - \bar{O} \right)^2 }}$$
An additional metric for model evaluation, the variance ratio, is included in the model_stats
output data frame:
Variance ratio (var_ratio
):
$$ {\sigma_{ratio}}^2 = \frac{{\sigma_s}^2}{{\sigma_o}^2} $$
Where \({\sigma_o}^2\) and \({\sigma_s}^2\) are the variances of observed and simulated values, respectively.
\(RMSE\) and \(MAE\) are in the units of the observation values.
\(NRMSE\) and \(NSE\) are dimensionless. The \(NSE\) ranges from -Inf
to 1,
where 1 represents a perfect model fit. An \(NSE\) value of zero signifies model
predictions are as accurate as the mean of the observed data. A negative \(NSE\)
value indicates the model is a worse predictor than the mean of the observed data
(Nash and Sutcliffe, 1970).
References
Evans, S.W., Jones, N.L., Williams, G.P., Ames, D.P.,
and Nelson, E.J., 2020, Groundwater Level Mapping Tool: An open source web
application for assessing groundwater sustainability: Environmental
Modelling and Software, 131, 104782, https://doi.org/10.1016/j.envsoft.2020.104782.
Hirsch, R.M., 1982, A comparison of four streamflow record
extension techniques: Water Resources Research, v. 18, p. 1081–1088,
https://doi.org/10.1029/WR018i004p01081
Mougan, C., and Nielsen, D.S., 2022, Monitoring Model Deterioration with
Explainable Uncertainty Estimation via Non-parametric Bootstrap. Proceedings
of the AAAI Conference on Artificial Intelligence 1686, 15037-15045,
Available online: https://doi.org/10.1609/aaai.v37i12.26755.
Nash, J.E., and Sutcliffe, J.V., 1970, River flow forecasting through
conceptual models part I — A discussion of principles: Journal of
Hydrology, v. 10, no. 3, p. 282–290,
https://doi.org/10.1016/0022-1694(70)90255-6.
Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value
imputation for mixed-type data. Bioinformatics 28, no. 1: 112-118.
https://doi.org/10.1093/bioinformatics/btr597.
van Buuren, S., 2018, Flexible Imputation of Missing Data, Second Edition,
CRC Press, Boca Raton. https://doi.org/10.1201/9780429492259.
van Buuren, S. and Groothuis-Oudshoorn, K., 2011, mice: Multivariate Imputation
by Chained Equations in R. Journal of Statistical Software 45, no. 3: 1-67.
https://doi.org/10.18637/jss.v045.i03.
Author
Zeno F. Levy, Timothy J. Stagnitta, and Robin L. Glas
Maintainer: Zeno F. Levy zlevy@usgs.gov
Examples
if (FALSE) { # \dontrun{
# load example Long Island dataset
data(LI_data)
# grid data at monthly timestep using median observed values
grid <- timestep_grid(data = LI_data,
timestep = "monthly",
agg_method = "median")
# trim grid to remove sites that are less than 35 percent complete
grid <- trim_grid(grid, data_thresh = 0.35)
# impute grid using default settings
# ridge regression with an NSE error threshold of 0,
# using all correlated references limited by a p_per_n ratio of 0.5
# (one predictor for every two observations)
mod <- impute_grid(input_grid = grid)
# Modify default settings using top 10
# most correlated reference sites as predictors
mod10 <- impute_grid(input_grid = grid,
n_refwl = 10,
p_per_n = NA)
# use all correlated references within 50 km of target for imputation
data(LI_sites)
mod50km <- impute_grid(input_grid = grid,
sites = LI_sites,
sites_crs = "NAD83",
d_cutoff = 50,
p_per_n = NA)
# limit reference selection to top 10 correlated sites within target's group
modgrp10 <- impute_grid(input_grid = grid,
sites = LI_sites,
n_refwl = 10,
group_sites = TRUE)
# use OLS regression with maximum number of reference sites limited to one
# predictor per every ten observations
ols <- impute_grid(input_grid = grid,
model = "OLS",
n_refwl = "max",
p_per_n = 0.1)
# use MOVE.1 model and require all references have correlations to the target
# with Person's r values greater than or equal to 0.7
m1 <- impute_grid(input_grid = grid,
model = "MOVE.1",
r_cutoff = 0.7)
# generate 95% prediction intervals for MOVE.1 model using non-parametric bootstrapping
m1_PI <- impute_grid(input_grid = grid,
model = "MOVE.1",
bootstrap_PI = TRUE,
PI_alpha = 0.05,
B = 500)
} # }