Introducing read_waterdata_statistics

The read_waterdata_stats_por and read_waterdata_stats_daterange functions replace the legacy readNWISstat function. This replacement is necessary because the legacy API service that readNWISstat uses will be decommissioned and replaced with a modernized API. This new API has two available endpoints, observationNormals and observationIntervals, that appear similar at first yet have important differences we want to highlight here.

library(dataRetrieval)
library(ggplot2)
library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

site1 <- "USGS-02037500"

Fetching day-of-year and month-of-year statistics

Day-of-year and month-of-year statistics aggregate observations for the same calendar day or month across multiple years to describe typical seasonal conditions (e.g., all Januarys or all January 1sts). Consider the output below, where we request day-of-year discharge averages for January 1 and January 2. Note that the start_date and end_date are set in month-year format to describe the day-of-year range.

jan_por_mean <-
  read_waterdata_stats_por(
  monitoring_location_id = site1,
  parameter_code = "00060",
  computation = "arithmetic_mean",
  start_date = "01-01",
  end_date = "01-02"
)

jan_por_mean
#> Simple feature collection with 3 features and 18 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS:  WGS 84
#>   parameter_code unit_of_measure            parent_time_series_id time_of_year
#> 1          00060          ft^3/s 3995ee5508334084af0a1325f951eadf        01-01
#> 2          00060          ft^3/s 3995ee5508334084af0a1325f951eadf        01-02
#> 3          00060          ft^3/s 3995ee5508334084af0a1325f951eadf           01
#>   time_of_year_type    value sample_count approval_status
#> 1       day_of_year 9342.396           91        approved
#> 2       day_of_year 9437.714           91        approved
#> 3     month_of_year 7118.020           91        approved
#>                         computation_id     computation percentile
#> 1 3cf38aa6-e836-41e6-9598-9f2f0b813099 arithmetic_mean         NA
#> 2 df3d1834-6337-48ef-9632-0ee3d7076fa2 arithmetic_mean         NA
#> 3 d244a78a-d5c4-464d-8cc2-88f7b39ac0fa arithmetic_mean         NA
#>   monitoring_location_id      monitoring_location_name site_type site_type_code
#> 1          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 2          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 3          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#>   country_code state_code county_code                  geometry
#> 1           US         51         087 POINT (-77.54693 37.5632)
#> 2           US         51         087 POINT (-77.54693 37.5632)
#> 3           US         51         087 POINT (-77.54693 37.5632)

The first two rows show the average discharge values, aggregated across all January 1sts and 2nds in the site’s Period of Record (POR). But wait: what’s in that third row? Looking at the time_of_year and time_of_year_type columns, we see the third row represents the average discharge value aggregated across all Januarys. This illustrates one quirk of the modern statistics API: any time the start_date to end_date range overlaps with the first day of a month (e.g., "01-01"), we will get month-of-year as well as the day-of-year summary statistics. You can filter these rows out of the data if you don’t want them in downstream analyses:

jan_por_mean[jan_por_mean$time_of_year_type != "month_of_year",]
#> Simple feature collection with 2 features and 18 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS:  WGS 84
#>   parameter_code unit_of_measure            parent_time_series_id time_of_year
#> 1          00060          ft^3/s 3995ee5508334084af0a1325f951eadf        01-01
#> 2          00060          ft^3/s 3995ee5508334084af0a1325f951eadf        01-02
#>   time_of_year_type    value sample_count approval_status
#> 1       day_of_year 9342.396           91        approved
#> 2       day_of_year 9437.714           91        approved
#>                         computation_id     computation percentile
#> 1 3cf38aa6-e836-41e6-9598-9f2f0b813099 arithmetic_mean         NA
#> 2 df3d1834-6337-48ef-9632-0ee3d7076fa2 arithmetic_mean         NA
#>   monitoring_location_id      monitoring_location_name site_type site_type_code
#> 1          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 2          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#>   country_code state_code county_code                  geometry
#> 1           US         51         087 POINT (-77.54693 37.5632)
#> 2           US         51         087 POINT (-77.54693 37.5632)

Let’s now look at an example that illustrates the benefits of the statistics API. In the example below, we pull all day-of-year discharge percentiles for our site. Keep in mind that doing so without the statistics API would require us to download the entire daily period of record for this site and hand-compute these percentiles ourselves, a time- and resource-intensive process indeed.

For demonstration, we filter the output to the January 1 day-of-year percentiles, which include a set of percentiles commonly used on WDFN webpages (e.g., Wisconsin water conditions).

library(dplyr)
library(tidyr)
library(ggplot2)

full_por_percentiles <-
  read_waterdata_stats_por(
    monitoring_location_id = site1,
    parameter_code = "00060",
    computation = c("minimum", "maximum", "percentile"),
    start_date = "01-01",
    end_date = "12-31"
  )

full_por_percentiles |>
  filter(time_of_year == "01-01" & time_of_year_type == "day_of_year") |>
  arrange(percentile) |>
  select(
    time_of_year, computation, percentile, value
  )
#> Simple feature collection with 9 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS:  WGS 84
#>   time_of_year computation percentile value                  geometry
#> 1        01-01     minimum          0   380 POINT (-77.54693 37.5632)
#> 2        01-01  percentile          5  1266 POINT (-77.54693 37.5632)
#> 3        01-01  percentile         10  1922 POINT (-77.54693 37.5632)
#> 4        01-01  percentile         25  3580 POINT (-77.54693 37.5632)
#> 5        01-01  percentile         50  5550 POINT (-77.54693 37.5632)
#> 6        01-01  percentile         75  9890 POINT (-77.54693 37.5632)
#> 7        01-01  percentile         90 22000 POINT (-77.54693 37.5632)
#> 8        01-01  percentile         95 37780 POINT (-77.54693 37.5632)
#> 9        01-01     maximum        100 60000 POINT (-77.54693 37.5632)

After a bit of data manipulation, we can then visualize the percentiles as “ribbons” on a plot. Each ribbon spans between two percentiles returned by the /statistics API (e.g., minimum to 5th, 5th to 10th, etc).

doy_perc_bands <-
  full_por_percentiles |>
  sf::st_drop_geometry() |>
  dplyr::filter(time_of_year_type == "day_of_year") |>
  select(time_of_year, percentile, value) |>
  mutate(time_of_year = as.Date(time_of_year, format = "%m-%d")) |>
  pivot_wider(names_from = percentile, values_from = value) 

pcode_info <- read_waterdata_parameter_codes(parameter_code = "00060")

ggplot(data = doy_perc_bands, 
       aes(x = time_of_year)) +
  geom_ribbon(aes(ymin = `95`, ymax = `100`), fill = "#292f6b") +
  geom_ribbon(aes(ymin = `90`, ymax = `95`), fill = "#5699c0") +
  geom_ribbon(aes(ymin = `75`, ymax = `90`), fill = "#aacee0") +
  geom_ribbon(aes(ymin = `25`, ymax = `75`), fill = "#e9e9e9") +
  geom_ribbon(aes(ymin = `10`, ymax = `25`), fill = "#ebd6ab") +
  geom_ribbon(aes(ymin = `5`, ymax = `10`), fill = "#dcb668") +
  geom_ribbon(aes(ymin = `0`, ymax = `5`), fill = "#8f4f1f") +
  geom_line(aes(y = `50`), linewidth = .2, color = "black") +
  scale_x_date(date_labels = "%b", date_breaks = "1 month",
               expand = expand_scale(mult = c(0, 0))) +
  scale_y_log10(breaks = scales::breaks_log(base = 10),
                labels = scales::label_log(base = 10)) +
  labs(
    x = "Month-day",
    y = pcode_info$parameter_description,
    title = "Day-of-year percentile bands"
  ) +
  theme_bw()

Finally, let’s overlay daily mean data onto the plot:

range <- as.Date(c("2025-01-01", "2026-02-01"))

complete_df <- data.frame(time = seq.Date(from = range[1], 
                                          to = as.Date("2026-12-31"),
                                          by = "day")) |>
  mutate(time_of_year = as.Date(format(time, "%m-%d"), format = "%m-%d"))

daily_data <- complete_df |> 
  left_join(read_waterdata_daily(
                monitoring_location_id = site1,
                parameter_code = "00060",
                statistic_id = "00003", 
                time = range),
            by = "time") |> 
  left_join(doy_perc_bands, by = "time_of_year")

update_geom_defaults("ribbon",
                   list(alpha = 0.5))

ggplot(data = daily_data, 
       aes(x = time)) +
  geom_ribbon(aes(ymin = `95`, ymax = `100`), fill = "#292f6b") +
  geom_ribbon(aes(ymin = `90`, ymax = `95`), fill = "#5699c0") +
  geom_ribbon(aes(ymin = `75`, ymax = `90`), fill = "#aacee0") +
  geom_ribbon(aes(ymin = `25`, ymax = `75`), fill = "#e9e9e9") +
  geom_ribbon(aes(ymin = `10`, ymax = `25`), fill = "#ebd6ab") +
  geom_ribbon(aes(ymin = `5`, ymax = `10`), fill = "#dcb668") +
  geom_ribbon(aes(ymin = `0`, ymax = `5`), fill = "#8f4f1f") +
  geom_line(aes(y = value, color = approval_status), linewidth = 1) +
  scale_x_date(date_labels = "%b %Y",
               date_breaks = "3 month",
               expand = expand_scale(mult = c(0, 0)))+
  scale_y_log10(breaks = scales::breaks_log(base = 10),
                labels = scales::label_log(base = 10)) +
  scale_color_manual("Status",
                     values = c("Approved" = "black",
                                "Provisional" = "red"), 
                     breaks = c("Approved", "Provisional"),
                     limits = force,
                     drop = TRUE) +
  labs(
    x = "",
    y = pcode_info$unit_of_measure,
    title = paste0("January 1, 2025 - December 31, 2026\n",
                   pcode_info$parameter_description)
  ) +
  theme_bw()

As https://waterdata.usgs.gov/monitoring-location/USGS-02037500/statistical-graphs/

Fetching monthly and annual statistics within a date range

Unlike day- or month-of-year statistics, which summarize typical seasonal conditions across many years, the read_waterdata_stats_daterange() function returns summaries for specific months and years within a requested date range. Consider the example below where we fetch the average discharge value for the month of January, 2024. Notice that the start_date and end_date arguments are given in YYYY-MM-DD format.

jan_daterange_mean <-
  read_waterdata_stats_daterange(
  monitoring_location_id = site1,
  parameter_code = "00060",
  computation = "arithmetic_mean",
  start_date = "2024-01-01",
  end_date = "2024-01-31"
)

jan_daterange_mean
#> Simple feature collection with 3 features and 19 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS:  WGS 84
#>   parameter_code unit_of_measure            parent_time_series_id start_date
#> 1          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 2          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 3          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-10-01
#>     end_date interval_type     value sample_count approval_status
#> 1 2024-01-31         month 15787.097           31        approved
#> 2 2024-12-31 calendar_year  6795.699          366        approved
#> 3 2024-09-30    water_year  6306.902          366        approved
#>                         computation_id     computation percentile
#> 1 a1f576b1-0ca9-4e51-97a7-f12ca591b1e3 arithmetic_mean         NA
#> 2 1d49386c-d7d8-4e03-a076-530b32408bda arithmetic_mean         NA
#> 3 20148e1b-cae8-4342-8123-96c7aaa68e37 arithmetic_mean         NA
#>   monitoring_location_id      monitoring_location_name site_type site_type_code
#> 1          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 2          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 3          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#>   country_code state_code county_code                  geometry
#> 1           US         51         087 POINT (-77.54693 37.5632)
#> 2           US         51         087 POINT (-77.54693 37.5632)
#> 3           US         51         087 POINT (-77.54693 37.5632)

Instead of time_of_year and time_of_year_type columns, this output contains start_date, end_date, and interval_type columns representing the daterange over which the average was calculated. The first row shows the average January, 2024 discharge was about 112 cubic feet per second. We again have extra rows: the second row contains the calendar year 2024 average and the third contains the water year 2024 average.

Annual statistics will be returned for any calendar/water years than intersect with the specified date range. Consider the example below, where the start_date to end_date range is only 93 days yet happens to intersect with calendar and water years 2023 and 2024. We see this reflected in output containing monthly averages for September, 2023 through January, 2024 as well as 2023 and 2024 calendar and water year averages.

multiyear_daterange_mean <-
  read_waterdata_stats_daterange(
  monitoring_location_id = site1,
  parameter_code = "00060",
  computation = "arithmetic_mean",
  start_date = "2023-09-30",
  end_date = "2024-01-01"
)

multiyear_daterange_mean
#> Simple feature collection with 9 features and 19 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS:  WGS 84
#>   parameter_code unit_of_measure            parent_time_series_id start_date
#> 1          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-09-01
#> 2          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-10-01
#> 3          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-11-01
#> 4          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-12-01
#> 5          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 6          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-01-01
#> 7          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 8          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2022-10-01
#> 9          00060          ft^3/s 3995ee5508334084af0a1325f951eadf 2023-10-01
#>     end_date interval_type     value sample_count approval_status
#> 1 2023-09-30         month  1917.667           30        approved
#> 2 2023-10-31         month  1264.516           31        approved
#> 3 2023-11-30         month  2000.333           30        approved
#> 4 2023-12-31         month  5939.355           31        approved
#> 5 2024-01-31         month 15787.097           31        approved
#> 6 2023-12-31 calendar_year  5062.219          365        approved
#> 7 2024-12-31 calendar_year  6795.699          366        approved
#> 8 2023-09-30    water_year  5740.247          365        approved
#> 9 2024-09-30    water_year  6306.902          366        approved
#>                         computation_id     computation percentile
#> 1 3014af0d-fda4-4cdc-9dbd-9c0656d0d72e arithmetic_mean         NA
#> 2 06950f19-e7f1-4ac7-a6ad-5792ec7b8120 arithmetic_mean         NA
#> 3 0ba2a5d6-910f-4a14-af64-d25a0a457105 arithmetic_mean         NA
#> 4 431e2bce-5ad5-4d15-9108-74a60d41c649 arithmetic_mean         NA
#> 5 a1f576b1-0ca9-4e51-97a7-f12ca591b1e3 arithmetic_mean         NA
#> 6 74e5f350-d0fc-4959-9bd2-42686d410d9a arithmetic_mean         NA
#> 7 1d49386c-d7d8-4e03-a076-530b32408bda arithmetic_mean         NA
#> 8 04917fde-3847-4045-abdf-a74a41351022 arithmetic_mean         NA
#> 9 20148e1b-cae8-4342-8123-96c7aaa68e37 arithmetic_mean         NA
#>   monitoring_location_id      monitoring_location_name site_type site_type_code
#> 1          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 2          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 3          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 4          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 5          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 6          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 7          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 8          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#> 9          USGS-02037500 JAMES RIVER NEAR RICHMOND, VA    Stream             ST
#>   country_code state_code county_code                  geometry
#> 1           US         51         087 POINT (-77.54693 37.5632)
#> 2           US         51         087 POINT (-77.54693 37.5632)
#> 3           US         51         087 POINT (-77.54693 37.5632)
#> 4           US         51         087 POINT (-77.54693 37.5632)
#> 5           US         51         087 POINT (-77.54693 37.5632)
#> 6           US         51         087 POINT (-77.54693 37.5632)
#> 7           US         51         087 POINT (-77.54693 37.5632)
#> 8           US         51         087 POINT (-77.54693 37.5632)
#> 9           US         51         087 POINT (-77.54693 37.5632)

Before we move on, consider the following example where we create a Monthly mean statistics table similar to what you’d find in the Water Year Summaries.

monthly_means <-
  read_waterdata_stats_daterange(
    monitoring_location_id = site1,
    parameter_code = "00060",
    computation = "arithmetic_mean"
  ) |>
  dplyr::filter(interval_type == "month") |>
  sf::st_drop_geometry()

monthly_means |>
  filter(start_date >= "2004-10-01" & start_date < "2025-09-01") |>
  mutate(
    Month = lubridate::month(start_date, label = TRUE),
    # reorder based on WY
    Month = factor(Month, month.abb[c(10:12, 1:9)]),
    water_year = lubridate::year(as.Date(start_date) + months(3))
  ) |>
  arrange(Month) |>
  group_by(Month) |>
  summarize(
    `Mean` = round(mean(value), 0),
    `Max (WY)` = paste0(
      round(max(value), 0),
      " (",
      water_year[which.max(value)],
      ")"
    ),
    `Min (WY)` = paste0(
      round(min(value), 0),
      " (",
      water_year[which.min(value)],
      ")"
    ),
    .groups = "drop"
  ) |>
  t() |>
  as.data.frame() |>
  knitr::kable(
    col.names = NULL
  )

Month	Oct	Nov	Dec	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep
Mean	4752	6211	9464	9315	11258	10980	10901	10397	5855	3337	2661	3370
Max (WY)	14059 (2019)	20085 (2019)	24194 (2019)	18138 (2010)	26647 (2025)	19426 (2019)	17269 (2019)	19037 (2020)	13455 (2013)	11236 (2013)	6883 (2020)	15211 (2018)
Min (WY)	1230 (2009)	1497 (2008)	1508 (2018)	2536 (2018)	2900 (2009)	3316 (2017)	5459 (2025)	3344 (2006)	1833 (2008)	1163 (2010)	1094 (2008)	968 (2010)

Statistics API tips

The statistics API does not follow the same OGC standards as the https://api.waterdata.usgs.gov/ogcapi/v0/ endpoints. This section will focus on important differences between the statistics and OGC-compliant APIs and other tips for working with the endpoint.

No request limit or API token: at time of writing, the statistics API does not limit the number of requests that can be made per hour. It also does not require you sign up for an API token. Requesting data from the statistics API does not count against your total request limit to the OGC-compliant APIs.
The API always returns all columns: compared to the OGC-compliant endpoints, which come with skipGeometry and properties arguments to limit the number of columns returned by the API, there is no way to request a subset of columns from the API.
Month-of-year statistics: to return month-of-year statistics using read_waterdata_stats_por, make sure the start_date to end_date range overlaps with the first day of the month for which you want to data. For example, start_date = "01-01" and end_date = "03-01" will return the month-of-year statistics for January, February, and March (in addition to the day-of-year statistics for each month-day in this range).
Monthly and annual statistics: when using read_waterdata_stats_daterange, the output will return monthly and annual summaries for every calendar month, calendar year, and water year that intersects with the start_date to end_date range. For example, start_date = 2023-12-31 and end_date = 2024-10-01 will return monthly statistics for each month between December, 2023 through October, 2024 and calendar year statistics for 2023 and 2024 and water year statistics for WY2024 and WY2025.
Median comes with percentiles: you never need to set computation = c("median", "percentile") as the median is returned as the 50th percentile. If you do ask for both median and percentiles, your data set will have two rows containing the median for each parent_time_series_id.
Minimum and maximum do not come with percentiles: minimum and maximum statistics are not returned as percentiles so use computation = c("minimum", "maximum", "percentile") if you want a “complete” set of order statistics.
The API returns specific percentiles: for computation = "percentile", the API will only ever return the following percentiles: 5th, 10th, 25th, 50th, 75th, 90th, and 95th. If you want other percentiles, you’ll need to pull the daily data period of record using read_waterdata_daily and compute them yourself.
Pay attention to sample_count: the sample_count column represents the number of observations used to compute the statistic. As stated in the statistics documentation, there is no minimum requirement for the number of observations to calculate a statistic. This means reported monthly and annual statistics can be based on one daily observation from that month/year. In the case of a single observation, the reported minimum, maximum, median, and arithmetic_mean will all be equal to the value of that observation and any other percentiles will be NA.

Introducing read_waterdata_statistics_*

Fetching day-of-year and month-of-year statistics

Fetching monthly and annual statistics within a date range

Statistics API tips