
Introducing read_waterdata_stats_*
Source: vignettes/daily_data_statistics.Rmd

The read_waterdata_stats_por and
read_waterdata_stats_daterange functions replace the legacy
readNWISstat function. This replacement is necessary
because the legacy API service that readNWISstat uses will
be decommissioned and replaced with a modernized
API. The new API has two available endpoints,
observationNormals and observationIntervals,
which appear similar at first but have important differences we want to
highlight here.
library(dataRetrieval)
library(ggplot2)
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
site1 <- "USGS-02037500"

Fetching day-of-year and month-of-year statistics
Day-of-year and month-of-year statistics aggregate observations for
the same calendar day or month across multiple years to describe typical
seasonal conditions (e.g., all Januarys or all January 1sts). Consider
the output below, where we request day-of-year discharge averages for
January 1 and January 2. Note that start_date and
end_date are set in month-day (MM-DD) format to
describe the day-of-year range.
jan_por_mean <-
read_waterdata_stats_por(
monitoring_location_id = site1,
parameter_code = "00060",
computation = "arithmetic_mean",
start_date = "01-01",
end_date = "01-02"
)
jan_por_mean
#> Simple feature collection with 3 features and 18 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS: WGS 84
#> parameter_code unit_of_measure parent_time_series_id time_of_year
#> 1 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 01-01
#> 2 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 01-02
#> 3 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 01
#> time_of_year_type value sample_count approval_status
#> 1 day_of_year 9342.396 91 approved
#> 2 day_of_year 9437.714 91 approved
#> 3 month_of_year 7118.020 91 approved
#> computation_id computation percentile
#> 1 0d0e09bc-af10-4740-8c33-afca7b4a9475 arithmetic_mean NA
#> 2 d0f4bba0-5534-46a9-a7e6-ed7b74152d72 arithmetic_mean NA
#> 3 d244a78a-d5c4-464d-8cc2-88f7b39ac0fa arithmetic_mean NA
#> monitoring_location_id monitoring_location_name site_type site_type_code
#> 1 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 2 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 3 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> country_code state_code county_code geometry
#> 1 US 51 087 POINT (-77.54693 37.5632)
#> 2 US 51 087 POINT (-77.54693 37.5632)
#> 3 US 51 087 POINT (-77.54693 37.5632)

The first two rows show the average discharge values, aggregated
across all January 1sts and 2nds in the site’s Period of Record (POR).
But wait: what’s in that third row? Looking at the
time_of_year and time_of_year_type columns, we
see the third row represents the average discharge value aggregated
across all Januarys. This illustrates one quirk of the modern
statistics API: any time the start_date to
end_date range overlaps with the first day of a month
(e.g., "01-01"), we will get month-of-year as well as the
day-of-year summary statistics. You can filter these rows out of the
data if you don’t want them in downstream analyses:
jan_por_mean[jan_por_mean$time_of_year_type != "month_of_year",]
#> Simple feature collection with 2 features and 18 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS: WGS 84
#> parameter_code unit_of_measure parent_time_series_id time_of_year
#> 1 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 01-01
#> 2 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 01-02
#> time_of_year_type value sample_count approval_status
#> 1 day_of_year 9342.396 91 approved
#> 2 day_of_year 9437.714 91 approved
#> computation_id computation percentile
#> 1 0d0e09bc-af10-4740-8c33-afca7b4a9475 arithmetic_mean NA
#> 2 d0f4bba0-5534-46a9-a7e6-ed7b74152d72 arithmetic_mean NA
#> monitoring_location_id monitoring_location_name site_type site_type_code
#> 1 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 2 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> country_code state_code county_code geometry
#> 1 US 51 087 POINT (-77.54693 37.5632)
#> 2 US 51 087 POINT (-77.54693 37.5632)

Let’s now look at an example that illustrates the benefits of the statistics API. In the example below, we pull all day-of-year discharge percentiles for our site. Without the statistics API, this would require downloading the site’s entire daily period of record and hand-computing the percentiles ourselves, a time- and resource-intensive process.
For demonstration, we filter the output to the January 1 day-of-year percentiles, which include a set of percentiles commonly used on WDFN webpages (e.g., Wisconsin water conditions).
library(dplyr)
library(tidyr)
library(ggplot2)
full_por_percentiles <-
read_waterdata_stats_por(
monitoring_location_id = site1,
parameter_code = "00060",
computation = c("minimum", "maximum", "percentile"),
start_date = "01-01",
end_date = "12-31"
)
full_por_percentiles |>
filter(time_of_year == "01-01" & time_of_year_type == "day_of_year") |>
arrange(percentile) |>
select(
time_of_year, computation, percentile, value
)
#> Simple feature collection with 9 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS: WGS 84
#> time_of_year computation percentile value geometry
#> 1 01-01 minimum 0 380 POINT (-77.54693 37.5632)
#> 2 01-01 percentile 5 1266 POINT (-77.54693 37.5632)
#> 3 01-01 percentile 10 1922 POINT (-77.54693 37.5632)
#> 4 01-01 percentile 25 3580 POINT (-77.54693 37.5632)
#> 5 01-01 percentile 50 5550 POINT (-77.54693 37.5632)
#> 6 01-01 percentile 75 9890 POINT (-77.54693 37.5632)
#> 7 01-01 percentile 90 22000 POINT (-77.54693 37.5632)
#> 8 01-01 percentile 95 37780 POINT (-77.54693 37.5632)
#> 9 01-01 maximum 100 60000 POINT (-77.54693 37.5632)

After a bit of data manipulation, we can visualize the percentiles as “ribbons” on a plot. Each ribbon spans between two percentiles returned by the /statistics API (e.g., minimum to 5th, 5th to 10th, etc.).
doy_perc_bands <-
full_por_percentiles |>
sf::st_drop_geometry() |>
dplyr::filter(time_of_year_type == "day_of_year") |>
select(time_of_year, percentile, value) |>
mutate(time_of_year = as.Date(time_of_year, format = "%m-%d")) |>
pivot_wider(names_from = percentile, values_from = value)
pcode_info <- read_waterdata_parameter_codes(parameter_code = "00060")
ggplot(data = doy_perc_bands,
aes(x = time_of_year)) +
geom_ribbon(aes(ymin = `95`, ymax = `100`), fill = "#292f6b") +
geom_ribbon(aes(ymin = `90`, ymax = `95`), fill = "#5699c0") +
geom_ribbon(aes(ymin = `75`, ymax = `90`), fill = "#aacee0") +
geom_ribbon(aes(ymin = `25`, ymax = `75`), fill = "#e9e9e9") +
geom_ribbon(aes(ymin = `10`, ymax = `25`), fill = "#ebd6ab") +
geom_ribbon(aes(ymin = `5`, ymax = `10`), fill = "#dcb668") +
geom_ribbon(aes(ymin = `0`, ymax = `5`), fill = "#8f4f1f") +
geom_line(aes(y = `50`), linewidth = .2, color = "black") +
scale_x_date(date_labels = "%b", date_breaks = "1 month",
expand = expansion(mult = c(0, 0))) +
scale_y_log10(breaks = scales::breaks_log(base = 10),
labels = scales::label_log(base = 10)) +
labs(
x = "Month-day",
y = pcode_info$parameter_description,
title = "Day-of-year percentile bands"
) +
theme_bw()
Finally, let’s overlay daily mean data onto the plot:
range <- as.Date(c("2025-01-01", "2026-02-01"))
complete_df <- data.frame(time = seq.Date(from = range[1],
to = as.Date("2026-12-31"),
by = "day")) |>
mutate(time_of_year = as.Date(format(time, "%m-%d"), format = "%m-%d"))
daily_data <- complete_df |>
left_join(read_waterdata_daily(
monitoring_location_id = site1,
parameter_code = "00060",
statistic_id = "00003",
time = range),
by = "time") |>
left_join(doy_perc_bands, by = "time_of_year")
update_geom_defaults("ribbon",
list(alpha = 0.5))
ggplot(data = daily_data,
aes(x = time)) +
geom_ribbon(aes(ymin = `95`, ymax = `100`), fill = "#292f6b") +
geom_ribbon(aes(ymin = `90`, ymax = `95`), fill = "#5699c0") +
geom_ribbon(aes(ymin = `75`, ymax = `90`), fill = "#aacee0") +
geom_ribbon(aes(ymin = `25`, ymax = `75`), fill = "#e9e9e9") +
geom_ribbon(aes(ymin = `10`, ymax = `25`), fill = "#ebd6ab") +
geom_ribbon(aes(ymin = `5`, ymax = `10`), fill = "#dcb668") +
geom_ribbon(aes(ymin = `0`, ymax = `5`), fill = "#8f4f1f") +
geom_line(aes(y = value, color = approval_status), linewidth = 1) +
scale_x_date(date_labels = "%b %Y",
date_breaks = "3 month",
expand = expansion(mult = c(0, 0))) +
scale_y_log10(breaks = scales::breaks_log(base = 10),
labels = scales::label_log(base = 10)) +
scale_color_manual("Status",
values = c("Approved" = "black",
"Provisional" = "red"),
breaks = c("Approved", "Provisional"),
limits = force,
drop = TRUE) +
labs(
x = "",
y = pcode_info$unit_of_measure,
title = paste0("January 1, 2025 - December 31, 2026\n",
pcode_info$parameter_description)
) +
theme_bw()
This mirrors the statistical graphs available for this site at https://waterdata.usgs.gov/monitoring-location/USGS-02037500/statistical-graphs/
Fetching monthly and annual statistics within a date range
Unlike day- or month-of-year statistics, which summarize typical
seasonal conditions across many years, the
read_waterdata_stats_daterange() function returns summaries
for specific months and years within a requested date range. Consider
the example below where we fetch the average discharge value for the
month of January, 2024. Notice that the start_date and
end_date arguments are given in YYYY-MM-DD
format.
jan_daterange_mean <-
read_waterdata_stats_daterange(
monitoring_location_id = site1,
parameter_code = "00060",
computation = "arithmetic_mean",
start_date = "2024-01-01",
end_date = "2024-01-31"
)
jan_daterange_mean
#> Simple feature collection with 3 features and 19 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS: WGS 84
#> parameter_code unit_of_measure parent_time_series_id start_date
#> 1 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 2 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 3 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-10-01
#> end_date interval_type value sample_count approval_status
#> 1 2024-01-31 month 15787.097 31 approved
#> 2 2024-12-31 calendar_year 6795.699 366 approved
#> 3 2024-09-30 water_year 6306.902 366 approved
#> computation_id computation percentile
#> 1 a1f576b1-0ca9-4e51-97a7-f12ca591b1e3 arithmetic_mean NA
#> 2 1d49386c-d7d8-4e03-a076-530b32408bda arithmetic_mean NA
#> 3 20148e1b-cae8-4342-8123-96c7aaa68e37 arithmetic_mean NA
#> monitoring_location_id monitoring_location_name site_type site_type_code
#> 1 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 2 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 3 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> country_code state_code county_code geometry
#> 1 US 51 087 POINT (-77.54693 37.5632)
#> 2 US 51 087 POINT (-77.54693 37.5632)
#> 3 US 51 087 POINT (-77.54693 37.5632)

Instead of time_of_year and
time_of_year_type columns, this output contains
start_date, end_date, and
interval_type columns representing the date range over which
the average was calculated. The first row shows the average January
2024 discharge was about 15,787 cubic feet per second. We again have extra
rows: the second row contains the calendar year 2024
average and the third contains the water year 2024
average.
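If you only want the monthly summary, the extra annual rows can be dropped the same way we filtered out the month-of-year rows earlier. A small sketch, using the jan_daterange_mean object created above:

```r
# Keep only the monthly interval; drop the calendar-year and water-year rows
jan_monthly_only <-
  jan_daterange_mean[jan_daterange_mean$interval_type == "month", ]
```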
Annual statistics will be returned for any calendar/water years that
intersect with the specified date range. Consider the example below,
where the start_date to end_date range is only
93 days yet happens to intersect with calendar and
water years 2023 and 2024. We see this reflected in output containing
monthly averages for September, 2023 through January, 2024 as well as
2023 and 2024 calendar and water year averages.
multiyear_daterange_mean <-
read_waterdata_stats_daterange(
monitoring_location_id = site1,
parameter_code = "00060",
computation = "arithmetic_mean",
start_date = "2023-09-30",
end_date = "2024-01-01"
)
multiyear_daterange_mean
#> Simple feature collection with 9 features and 19 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -77.54693 ymin: 37.5632 xmax: -77.54693 ymax: 37.5632
#> Geodetic CRS: WGS 84
#> parameter_code unit_of_measure parent_time_series_id start_date
#> 1 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-09-01
#> 2 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-10-01
#> 3 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-11-01
#> 4 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-12-01
#> 5 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 6 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-01-01
#> 7 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2024-01-01
#> 8 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2022-10-01
#> 9 00060 ft^3/s 3995ee5508334084af0a1325f951eadf 2023-10-01
#> end_date interval_type value sample_count approval_status
#> 1 2023-09-30 month 1917.667 30 approved
#> 2 2023-10-31 month 1264.516 31 approved
#> 3 2023-11-30 month 2000.333 30 approved
#> 4 2023-12-31 month 5939.355 31 approved
#> 5 2024-01-31 month 15787.097 31 approved
#> 6 2023-12-31 calendar_year 5062.219 365 approved
#> 7 2024-12-31 calendar_year 6795.699 366 approved
#> 8 2023-09-30 water_year 5740.247 365 approved
#> 9 2024-09-30 water_year 6306.902 366 approved
#> computation_id computation percentile
#> 1 3014af0d-fda4-4cdc-9dbd-9c0656d0d72e arithmetic_mean NA
#> 2 06950f19-e7f1-4ac7-a6ad-5792ec7b8120 arithmetic_mean NA
#> 3 0ba2a5d6-910f-4a14-af64-d25a0a457105 arithmetic_mean NA
#> 4 431e2bce-5ad5-4d15-9108-74a60d41c649 arithmetic_mean NA
#> 5 a1f576b1-0ca9-4e51-97a7-f12ca591b1e3 arithmetic_mean NA
#> 6 74e5f350-d0fc-4959-9bd2-42686d410d9a arithmetic_mean NA
#> 7 1d49386c-d7d8-4e03-a076-530b32408bda arithmetic_mean NA
#> 8 04917fde-3847-4045-abdf-a74a41351022 arithmetic_mean NA
#> 9 20148e1b-cae8-4342-8123-96c7aaa68e37 arithmetic_mean NA
#> monitoring_location_id monitoring_location_name site_type site_type_code
#> 1 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 2 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 3 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 4 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 5 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 6 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 7 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 8 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> 9 USGS-02037500 JAMES RIVER NEAR RICHMOND, VA Stream ST
#> country_code state_code county_code geometry
#> 1 US 51 087 POINT (-77.54693 37.5632)
#> 2 US 51 087 POINT (-77.54693 37.5632)
#> 3 US 51 087 POINT (-77.54693 37.5632)
#> 4 US 51 087 POINT (-77.54693 37.5632)
#> 5 US 51 087 POINT (-77.54693 37.5632)
#> 6 US 51 087 POINT (-77.54693 37.5632)
#> 7 US 51 087 POINT (-77.54693 37.5632)
#> 8 US 51 087 POINT (-77.54693 37.5632)
#> 9 US 51 087 POINT (-77.54693 37.5632)

Before we move on, consider the following example, where we create a monthly mean statistics table similar to what you’d find in the Water Year Summaries.
monthly_means <-
read_waterdata_stats_daterange(
monitoring_location_id = site1,
parameter_code = "00060",
computation = "arithmetic_mean"
) |>
dplyr::filter(interval_type == "month") |>
sf::st_drop_geometry()
monthly_means |>
filter(start_date >= "2004-10-01" & start_date < "2025-09-01") |>
mutate(
Month = lubridate::month(start_date, label = TRUE),
# reorder based on WY
Month = factor(Month, month.abb[c(10:12, 1:9)]),
water_year = lubridate::year(as.Date(start_date) + months(3))
) |>
arrange(Month) |>
group_by(Month) |>
summarize(
`Mean` = round(mean(value), 0),
`Max (WY)` = paste0(
round(max(value), 0),
" (",
water_year[which.max(value)],
")"
),
`Min (WY)` = paste0(
round(min(value), 0),
" (",
water_year[which.min(value)],
")"
),
.groups = "drop"
) |>
t() |>
as.data.frame() |>
knitr::kable(
col.names = NULL
)

| Month | Oct | Nov | Dec | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 4752 | 6211 | 9464 | 9315 | 11258 | 10980 | 10901 | 10397 | 5855 | 3337 | 2683 | 3370 |
| Max (WY) | 14059 (2019) | 20085 (2019) | 24194 (2019) | 18138 (2010) | 26647 (2025) | 19426 (2019) | 17269 (2019) | 19037 (2020) | 13455 (2013) | 11236 (2013) | 6883 (2020) | 15211 (2018) |
| Min (WY) | 1230 (2009) | 1497 (2008) | 1508 (2018) | 2536 (2018) | 2900 (2009) | 3316 (2017) | 5459 (2025) | 3344 (2006) | 1833 (2008) | 1163 (2010) | 1094 (2008) | 968 (2010) |
Statistics API tips
The statistics API does not follow the same OGC standards as the https://api.waterdata.usgs.gov/ogcapi/v0/ endpoints. This section will focus on important differences between the statistics and OGC-compliant APIs and other tips for working with the endpoint.
No request limit or API token: at the time of writing, the statistics API does not limit the number of requests that can be made per hour, and it does not require you to sign up for an API token. Requesting data from the statistics API does not count against your request limit for the OGC-compliant APIs.
The API always returns all columns: unlike the OGC-compliant endpoints, which provide skipGeometry and properties arguments to limit the columns returned by the API, there is no way to request a subset of columns from the statistics API.

Month-of-year statistics: to return month-of-year statistics using read_waterdata_stats_por, make sure the start_date to end_date range overlaps with the first day of the month for which you want data. For example, start_date = "01-01" and end_date = "03-01" will return the month-of-year statistics for January, February, and March (in addition to the day-of-year statistics for each month-day in this range).

Monthly and annual statistics: when using read_waterdata_stats_daterange, the output will contain monthly and annual summaries for every calendar month, calendar year, and water year that intersects with the start_date to end_date range. For example, start_date = "2023-12-31" and end_date = "2024-10-01" will return monthly statistics for each month from December 2023 through October 2024, calendar year statistics for 2023 and 2024, and water year statistics for WY2024 and WY2025.

Median comes with percentiles: you never need to set computation = c("median", "percentile"), as the median is returned as the 50th percentile. If you do ask for both median and percentiles, your data set will have two rows containing the median for each parent_time_series_id.

Minimum and maximum do not come with percentiles: minimum and maximum statistics are not returned as percentiles, so use computation = c("minimum", "maximum", "percentile") if you want a “complete” set of order statistics.

The API returns specific percentiles: for computation = "percentile", the API will only ever return the following percentiles: 5th, 10th, 25th, 50th, 75th, 90th, and 95th. If you want other percentiles, you’ll need to pull the daily data period of record using read_waterdata_daily and compute them yourself.

Pay attention to sample_count: the sample_count column represents the number of observations used to compute the statistic. As stated in the statistics documentation, there is no minimum number of observations required to calculate a statistic, so reported monthly and annual statistics can be based on a single daily observation from that month/year. In the case of a single observation, the reported minimum, maximum, median, and arithmetic_mean will all equal the value of that observation, and any other percentiles will be NA.
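As a sketch of the custom-percentile workaround above: here is one way to compute a day-of-year percentile the API does not provide (e.g., the 20th) from the daily record. This assumes read_waterdata_daily returns the full period of record when no time range is supplied, and that the output has the time and value columns used earlier in this vignette:

```r
library(dplyr)

# Pull the full daily mean discharge record for the site
daily_por <- read_waterdata_daily(
  monitoring_location_id = "USGS-02037500",
  parameter_code = "00060",
  statistic_id = "00003"
)

# Compute the 20th percentile for each day of year
custom_p20 <- daily_por |>
  sf::st_drop_geometry() |>
  mutate(time_of_year = format(time, "%m-%d")) |>
  group_by(time_of_year) |>
  summarize(p20 = quantile(value, probs = 0.2, na.rm = TRUE),
            .groups = "drop")
```

Because the computation happens locally, any probs value (or several at once) can be substituted into quantile().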