Aggregate data into wide format by timestep and site

Aggregates observed values, observation dates, and site identifiers from a long-format data frame into a wide-format data frame in which regularly spaced timesteps (rows) are indexed to unique sites (columns) and cells with no data are populated with NA. Generated timesteps can be daily, weekly, monthly, seasonal, or annual. Data can be aggregated by min, max, mean, median, or a user-specified quantile.
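
The long-to-wide reshaping described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data, the variable names (long, steps, step_f, wide_df), and the use of tapply() are assumptions made only for illustration.

```r
# Toy long-format data: one row per observation (values are made up)
long <- data.frame(
  site_no = c("A", "A", "A", "B", "B"),
  date    = as.Date(c("2020-01-05", "2020-01-20", "2020-03-10",
                      "2020-01-15", "2020-04-11")),
  value   = c(1, 3, 5, 10, 20)
)

# Label each observation with the first day of its calendar month
long$timestep <- as.Date(format(long$date, "%Y-%m-01"))

# Regular monthly sequence from the earliest to the latest timestep,
# so months with no observations at any site still get a row
steps  <- seq(min(long$timestep), max(long$timestep), by = "month")
step_f <- factor(as.character(long$timestep), levels = as.character(steps))

# Median per timestep (rows) and site (columns); empty cells become NA
wide <- tapply(long$value, list(step_f, long$site_no), median)

# Wide data frame with a leading Date column
wide_df <- data.frame(timestep = steps, wide)
rownames(wide_df) <- NULL
```

In this sketch February has no observations at either site, so its row contains only NA values.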

Usage

timestep_grid(
  data,
  timestep = "monthly",
  agg_method = "median",
  q_perc = NULL,
  type_year = "calendar",
  start_month = 1,
  n_seasons = 4,
  year_range = NULL,
  months = c(1:12),
  output_date = "first_date"
)

Arguments

data: a data frame with at least three named columns: site_no (character or numeric), date (character or Date), and value (numeric). Character dates are parsed using the following lubridate::parse_date_time() orders: "ymd", "dmy", "mdy", and "ymd HMS", with or without separators.
timestep: temporal frequency for data aggregation. Can be set to "daily", "weekly", "monthly", "seasonal", or "annual". Default is "monthly". If "seasonal" is selected, the type_year input argument must be set to "water" and start_month must be set to a numeric value from 1 to 12.
agg_method: data aggregation function. All data values with dates within a given timestep are aggregated by this function. Can be set to "min", "max", "mean", "median", or "quantile". Default is "median". If "quantile" is selected, the q_perc input argument must be set to a numeric value from 0 to 1.
q_perc: user-defined quantile, used for data aggregation when agg_method is set to "quantile". Must be set to a numeric value from 0 to 1. Default is NULL.
type_year: type of year used for data aggregation. Can be set to "calendar" or "water". Must be set to "water" if timestep is set to "seasonal". Default is "calendar".
start_month: first month defining water year. Must be set to a numeric value corresponding to calendar months from 1 (January) to 12 (December). Default is 1.
n_seasons: number of seasons. Defines seasons by evenly dividing months of the year beginning with start_month into n_seasons. Only months included in the months argument are considered. The number of months considered must be divisible by n_seasons. Default is 4.
year_range: two-element vector containing first and last year to filter input dates by. The year_range filter applies to calendar or water years depending on the type_year argument, but all output dates are calendar. Default is NULL, which does not filter the output by year.
months: vector of months to be included in the output. Calendar months included in construction of timesteps can range from 1 (January) to 12 (December). Months not included in the months argument will not be considered during timestep discretization and aggregation. Default is c(1:12).
output_date: date used as timestep identifier. Can be set to "first_date" or "median_date", which labels each timestep with its first or median date, respectively. Default is "first_date".
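
The interaction of start_month, n_seasons, and months can be sketched in base R as follows. This is one plausible reading of the description above, not the package's code; the helper variables (ordered, season_id, seasons) are hypothetical.

```r
start_month <- 10    # water year begins in October
n_seasons   <- 4
months      <- 1:12  # months to include

# Order the calendar months so the year begins at start_month
ordered <- ((start_month - 1 + 0:11) %% 12) + 1

# Keep only the requested months
ordered <- ordered[ordered %in% months]

# The number of months considered must be divisible by n_seasons
stopifnot(length(ordered) %% n_seasons == 0)

# Assign consecutive, equally sized runs of months to each season
season_id <- rep(seq_len(n_seasons), each = length(ordered) / n_seasons)
seasons   <- split(ordered, season_id)
```

With these settings the four seasons are October to December, January to March, April to June, and July to September.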

Value

A data frame with one row per timestep and one column per unique site, plus a leading timestep index column. The timestep index column is of class "Date" and all other columns are numeric. Each numeric value is the aggregate, computed with the function specified by the agg_method input argument, of all values observed at the indexed site during that timestep. Timesteps with no observed data at a given site are populated with NA. Site column headers are derived from the site_no input field and prefixed with "X." to prevent truncation of numeric site identifiers.
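
If the original site identifiers are needed downstream, the prefix can be stripped from the column headers. The sketch below assumes the prefix is literally "X." as described above; the column names and site numbers are hypothetical stand-ins for names(grid).

```r
# Hypothetical column headers of a returned grid
grid_names <- c("timestep", "X.01304000", "X.01305500")

# Drop the leading timestep column, then remove the "X." prefix
site_ids <- sub("^X\\.", "", grid_names[-1])
```

The identifiers are kept as character strings so that leading zeros are preserved.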

Details

All dates in the timestep field of the returned data frame are calendar dates, even if data are aggregated by water year. If year_range is not specified, sequential timesteps are generated from the earliest to the latest timestep containing an observed value. Sub-daily gridding (e.g., hours, minutes, seconds) is not currently available, but the impute_grid function accepts user-formatted grids of such data without a leading timestep column and indexes timesteps in model output by sequential integers.
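
As an illustration of water-year aggregation with calendar-date labels, the base R sketch below assigns each observation date to the first calendar date of its water year. It is an assumption about how such labels could be derived, not the package's implementation, and the variable names are hypothetical.

```r
dates       <- as.Date(c("1975-01-15", "1975-09-30", "1975-10-01"))
start_month <- 10L  # water year begins in October

yr <- as.integer(format(dates, "%Y"))
mo <- as.integer(format(dates, "%m"))

# Calendar year in which the containing water year begins
start_year <- ifelse(mo >= start_month, yr, yr - 1L)

# First calendar date of that water year, usable as a timestep label
first_date <- as.Date(sprintf("%d-%02d-01", start_year, start_month))
```

Under this reading, an observation from January 1975 falls in the water year that begins on 1974-10-01, which is consistent with the first timestep shown in the October water-year example in the Examples section.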

Author

Timothy J. Stagnitta, Zeno F. Levy
Maintainer: Zeno F. Levy zlevy@usgs.gov

Examples

# load example Long Island dataset
data(LI_data)

# aggregate data at monthly timestep using median observed values
grid <- timestep_grid(data = LI_data,
                      timestep = "monthly",
                      agg_method = "median")

# view first five timesteps
grid$timestep[1:5]
#> [1] "1975-01-01" "1975-02-01" "1975-03-01" "1975-04-01" "1975-05-01"
  
# output median dates for timestep indices and apply year range filter
grid <- timestep_grid(data = LI_data,
                      timestep = "monthly",
                      agg_method = "median",
                      output_date = "median_date",
                      year_range = c(1990, 2000))

# view first five timesteps
grid$timestep[1:5]
#> [1] "1990-01-16" "1990-02-14" "1990-03-16" "1990-04-15" "1990-05-16"

# aggregate data by water year beginning in October using median observed values
grid <- timestep_grid(data = LI_data,
                      timestep = "annual",
                      agg_method = "median",
                      type_year = "water",
                      start_month = 10)

# view first five timesteps
grid$timestep[1:5]
#> [1] "1974-10-01" "1975-10-01" "1976-10-01" "1977-10-01" "1978-10-01"
                        
# aggregate data seasonally by four-season water year beginning in October
grid <- timestep_grid(data = LI_data,
                      timestep = "seasonal",
                      n_seasons = 4,
                      agg_method = "median",
                      type_year = "water",
                      start_month = 10)

# view first five timesteps
grid$timestep[1:5]
#> [1] "1975-01-01" "1975-04-01" "1975-07-01" "1975-10-01" "1976-01-01"
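
The "median_date" labels in the second example above can be reproduced with simple date arithmetic. The sketch below is an assumption about how a median date could be computed (the midpoint of the timestep, rounded down to a whole day), not a statement of the package's method; first, last, and mid are hypothetical names.

```r
# First and last dates of three monthly timesteps in 1990
first <- as.Date(c("1990-01-01", "1990-02-01", "1990-04-01"))
last  <- as.Date(c("1990-01-31", "1990-02-28", "1990-04-30"))

# Midpoint of each timestep, rounded down to a whole day
mid <- first + (as.integer(last - first) %/% 2L)
```

For these three months the midpoints are 1990-01-16, 1990-02-14, and 1990-04-15, matching the dates printed in that example.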