Cluster complete site records in a timestep grid

Uses agglomerative hierarchical clustering to group complete site records in a timestep grid by the similarity of their correlation structures. The number of clusters (k) can be specified directly or optimized by selecting the k with the highest mean silhouette width.

Usage

cluster_grid(
  input_grid,
  k = 2,
  opt_k = T,
  show_opt_k = F,
  method = "ward.D2",
  plot_cols = 1,
  y_limits = c(3, -3),
  autoscale_y = T
)

Arguments

input_grid: data frame populated with observed numeric values and NA (not available) values. Sites (variables) are columns and timesteps (observations) are rows. A leading index column named timestep is excluded from the cluster assessment.
k: number of clusters to split complete time series into. Default is 2.
opt_k: logical flag indicating whether to optimize k based on mean silhouette width. Overrides the user-specified k when TRUE. Default is TRUE.
show_opt_k: logical flag to plot cluster k optimization. Default is FALSE.
method: cluster linkage method passed to hclust. Default is "ward.D2".
plot_cols: number of plot columns for faceted ggplot of cluster means. Default is 1.
y_limits: two-element numeric vector specifying y-axis limits for faceted ggplot of cluster means. Default is c(3, -3).
autoscale_y: logical flag to autoscale y-axis to reversed range of cluster means for faceted ggplot of cluster means. Overrides user-specified y_limits when TRUE. Default is TRUE.
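
As an illustration of the input_grid layout described above, the following minimal sketch builds a small grid by hand. The data frame toy_grid, its site names, and its values are hypothetical and are not part of the package. How cluster_grid treats incomplete columns internally is not documented here, so the filtering step only shows what "complete site records" means for this layout.

```r
# Hypothetical toy grid: 6 monthly timesteps (rows) by 3 sites (columns).
toy_grid <- data.frame(
  timestep = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  site_A = c(1.2, 1.4, 1.1, 0.9, 1.0, 1.3),
  site_B = c(2.0, NA, 2.3, 2.1, 1.9, 2.2),  # one missing value
  site_C = c(0.5, 0.7, 0.6, 0.4, 0.3, 0.6)
)

# Drop the leading index column, which is not part of the cluster assessment.
site_cols <- toy_grid[, names(toy_grid) != "timestep", drop = FALSE]

# Keep only complete sites, i.e. columns with no NA values.
complete_sites <- site_cols[, colSums(is.na(site_cols)) == 0, drop = FALSE]

names(complete_sites)
```

In the package workflow, gaps would typically be filled with impute_grid before clustering, as in the Examples section, rather than dropped by hand.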

Value

named list containing:

clust_plot

ggplot object of standardized site records faceted by cluster assignment. Standardized cluster means are plotted in color over the individual site records, which are plotted in black.

sites_out

data frame containing summary of cluster analysis results. Fields include:

site_no - site identifier,
cluster - cluster assignment,
neighbor - next closest cluster assignment,
silhouette_width - cluster validation index (see details).

means_out

data frame of standardized cluster means indexed by timestep.
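
One way to picture how a table like means_out could be built is sketched below. It assumes that "standardized" means a z-score per site (zero mean, unit standard deviation) and that each cluster mean is the average of the standardized sites in that cluster at every timestep. Both are assumptions made for illustration, not the package's confirmed method. The site data, cluster assignments, and column names are hypothetical.

```r
set.seed(1)

# Hypothetical complete grid: 12 timesteps by 4 sites.
sites <- data.frame(
  s1 = rnorm(12), s2 = rnorm(12), s3 = rnorm(12), s4 = rnorm(12)
)

# Hypothetical cluster assignments, one per site.
cluster <- c(s1 = 1, s2 = 1, s3 = 2, s4 = 2)

# Assumed standardization: z-score each site (column).
z <- scale(sites)

# Average the standardized sites within each cluster at every timestep.
cluster_ids <- sort(unique(cluster))
means <- sapply(cluster_ids, function(cl) {
  rowMeans(z[, names(cluster)[cluster == cl], drop = FALSE])
})

# Assemble a table of cluster means indexed by timestep.
means_out <- data.frame(timestep = seq_len(nrow(sites)), means)
names(means_out)[-1] <- paste0("cluster_", cluster_ids)

head(means_out)
```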

Details

Complete sites are clustered by agglomerative hierarchical clustering with hclust. The distance matrix (d) between sites is defined as 1 - c, where c is the matrix of Pearson correlation coefficients (r). Cluster solutions are assessed by their mean silhouette width, computed with silhouette. The silhouette width is a common cluster validation index that ranges from -1 to 1. Silhouette widths less than or equal to 0 indicate ambiguous cluster assignment (overlap with other clusters), and widths greater than 0 indicate more distinct cluster assignment (separation from other clusters), as described by Rousseeuw (1987).
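
The procedure described above can be sketched with base R and the cluster package. This is a minimal illustration under stated assumptions, not the package's actual implementation: the synthetic sites, the candidate range for k (2 to one less than the number of sites), and the way the summary table is assembled are all hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(42)
n <- 60

# Hypothetical data: two underlying signals, three noisy sites each.
base_a <- cumsum(rnorm(n))
base_b <- cumsum(rnorm(n))
sites <- data.frame(
  a1 = base_a + rnorm(n, sd = 0.3),
  a2 = base_a + rnorm(n, sd = 0.3),
  a3 = base_a + rnorm(n, sd = 0.3),
  b1 = base_b + rnorm(n, sd = 0.3),
  b2 = base_b + rnorm(n, sd = 0.3),
  b3 = base_b + rnorm(n, sd = 0.3)
)

# Distance between sites: d = 1 - c, with c the Pearson correlation matrix.
d <- as.dist(1 - cor(sites, method = "pearson"))

# Agglomerative hierarchical clustering with Ward linkage.
tree <- hclust(d, method = "ward.D2")

# Mean silhouette width for each candidate k (range is an assumption).
k_candidates <- 2:(ncol(sites) - 1)
mean_sil <- sapply(k_candidates, function(k) {
  sil <- silhouette(cutree(tree, k = k), d)
  mean(sil[, "sil_width"])
})

# Choose the k with the highest mean silhouette width.
best_k <- k_candidates[which.max(mean_sil)]

# Per-site summary, mirroring the fields listed for sites_out.
sil <- silhouette(cutree(tree, k = best_k), d)
sites_out <- data.frame(
  site_no = names(sites),
  cluster = sil[, "cluster"],
  neighbor = sil[, "neighbor"],
  silhouette_width = sil[, "sil_width"]
)

sites_out
```

Because the distance is 1 - r, two sites that move together (r near 1) are close to each other, while sites that move in opposite directions (r near -1) are the farthest apart, at a distance near 2.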

References

Rousseeuw, P.J., 1987, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis: Journal of Computational and Applied Mathematics, v. 20, p. 53-65, https://doi.org/10.1016/0377-0427(87)90125-7.

Author

Maintainer: Zeno F. Levy zlevy@usgs.gov

Examples

if (FALSE) { # \dontrun{
# load example Long Island dataset
data(LI_data)

# grid data at monthly timestep using median observed values
grid <- timestep_grid(data = LI_data,
                      timestep = "monthly",
                      agg_method = "median")

# trim grid to remove sites that are less than 35 percent complete
grid <- trim_grid(grid, data_thresh = 0.35)

# impute grid with default settings
out <- impute_grid(input_grid = grid)

# cluster imputed grid to group sites with similar correlation structures
cg <- cluster_grid(out$imputed_grid)
} # }