Cluster complete site records in a timestep grid
cluster_grid.Rd
Uses agglomerative hierarchical clustering to group complete site records in a
timestep grid with similar correlation structures. The number of cluster
groups (k) can be specified directly or optimized by selecting the k with the
highest mean silhouette width.
Usage
cluster_grid(
input_grid,
k = 2,
opt_k = T,
show_opt_k = F,
method = "ward.D2",
plot_cols = 1,
y_limits = c(3, -3),
autoscale_y = T
)
Arguments
- input_grid
data frame populated with observed, numeric values and NA (not available) values. Sites (variables) are included as columns and timesteps (observations) are included as rows. A leading index column named timestep is not included in cluster assessment.
- k
number of clusters to split complete time series into. Default is 2.
- opt_k
logical flag whether to optimize k on the basis of mean silhouette width. Overrides user-specified k input when TRUE. Default is TRUE.
- show_opt_k
logical flag to plot cluster k optimization. Default is FALSE.
- method
specifies cluster linkage method passed to hclust. Default is "ward.D2".
- plot_cols
number of plot columns for faceted ggplot of cluster means. Default is 1.
- y_limits
two-element numeric vector specifying y-axis limits for faceted ggplot of cluster means. Default is c(3, -3).
- autoscale_y
logical flag to autoscale y-axis to reversed range of cluster means for faceted ggplot of cluster means. Overrides user-specified y_limits when TRUE. Default is TRUE.
Value
named list containing:
- clust_plot
ggplot object of standardized site records faceted by cluster assignment. Standardized cluster means are plotted in color over individual sites in black.
- sites_out
data frame containing summary of cluster analysis results. Fields include:
site_no - site identifier,
cluster - cluster assignment,
neighbor - next closest cluster assignment,
silhouette_width - cluster validation index (see details).
- means_out
data frame of standardized cluster means indexed by timestep.
Details
Clusters complete sites by agglomerative hierarchical clustering with hclust.
Sites are clustered by their distance matrix (d), defined as 1 - c, where c
is the correlation matrix for Pearson's r. Clusters are assessed based on the
mean silhouette width using silhouette. The silhouette width is a common
cluster validation index and ranges from -1 to 1. Silhouette widths less than
or equal to 0 indicate ambiguous cluster assignment (overlap with other
clusters) and widths greater than 0 indicate more distinct cluster assignment
(separation from other clusters) as described by Rousseeuw (1987).
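The steps described above can be sketched outside the package with base R and the cluster package. This is a minimal illustration under stated assumptions, not the internals of cluster_grid: the toy data frame, the candidate range of k (2 to 4), and all object names are hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Hypothetical toy grid: 60 timesteps, 6 complete site records
# built from two underlying signals plus noise (assumption for illustration)
n <- 60
base_a <- cumsum(rnorm(n))
base_b <- cumsum(rnorm(n))
toy <- data.frame(
  s1 = base_a + rnorm(n, sd = 0.3),
  s2 = base_a + rnorm(n, sd = 0.3),
  s3 = base_a + rnorm(n, sd = 0.3),
  s4 = base_b + rnorm(n, sd = 0.3),
  s5 = base_b + rnorm(n, sd = 0.3),
  s6 = base_b + rnorm(n, sd = 0.3)
)

# Distance matrix d = 1 - c, where c is the Pearson correlation matrix
d <- as.dist(1 - cor(toy, method = "pearson"))

# Agglomerative hierarchical clustering with the "ward.D2" linkage
hc <- hclust(d, method = "ward.D2")

# Mean silhouette width for each candidate k (range chosen arbitrarily here)
k_candidates <- 2:4
mean_sil <- sapply(k_candidates, function(k) {
  groups <- cutree(hc, k = k)
  mean(silhouette(groups, d)[, "sil_width"])
})

# Keep the k with the highest mean silhouette width
best_k <- k_candidates[which.max(mean_sil)]
clusters <- cutree(hc, k = best_k)
```

Here clusters is a named integer vector of cluster assignments, one per site column. The silhouette matrix also carries cluster and neighbor columns, which correspond in spirit to the fields reported in sites_out.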
References
Rousseeuw, P.J., 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis: Journal of Computational and Applied Mathematics v. 20, 53-65, https://doi.org/10.1016/0377-0427(87)90125-7.
Author
Maintainer: Zeno F. Levy zlevy@usgs.gov
Examples
if (FALSE) { # \dontrun{
# load example Long Island dataset
data(LI_data)
# grid data at monthly timestep using median observed values
grid <- timestep_grid(data = LI_data,
timestep = "monthly",
agg_method = "median")
# trim grid to remove sites that are less than 35 percent complete
grid <- trim_grid(grid, data_thresh = 0.35)
# impute grid with default settings
out <- impute_grid(input_grid = grid)
# cluster imputed grid to group sites with similar correlation structures
cg <- cluster_grid(out$imputed_grid)
} # }