As we bid adieu to the NWIS web services, we welcome a host of new
web service offerings: the USGS Water Data OGC APIs. These are the modern
access points for USGS water data. The USGS will be modernizing all
of the NWIS web services in the near future. For each of these
updates, dataRetrieval
will create a new function to access
the new services and deprecate the functions for accessing the legacy
services.
This document will introduce each new function (as more come online, we’ll update this document to include them). The timeline for shutting down the NWIS servers is currently very uncertain, so we’d recommend incorporating these new functions as soon as possible to avoid future headaches.
New Features
Each new “API endpoint” will deliver a new type of USGS data. Currently the available endpoints are “monitoring-locations”, “time-series-metadata”, and “daily”. All of these endpoints offer some new features that the legacy services did not have:
Flexible Queries
When you look at the help file for the new functions, you’ll notice
there are many more arguments. These are mostly set by default to NA.
You DO NOT need to (and most likely should not!) specify all of these
parameters. The requested filters are appended as Boolean “AND”s,
meaning if you specify a vector of monitoring locations and parameter
codes, you will only get back those specified monitoring locations with
the specified parameter codes.
A side bonus of this is that since all of the potential arguments are defined, your favorite integrated development environment (IDE) will almost certainly have an autocomplete feature to let you tab through the potential options.
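For example, combining a couple of monitoring locations with a single parameter code returns only the daily values that match both filters. A minimal sketch, using sites that appear elsewhere in this document:
flow_daily <- read_USGS_daily(monitoring_location_id = c("USGS-01491000",
                                                         "USGS-02238500"),
                              parameter_code = "00060",
                              time = "P7D")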
Flexible Columns Returned
Users can pick which columns are returned with the “properties”
argument. Available properties can be discovered with the
check_OGC_requests
function:
daily_schema <- check_OGC_requests(endpoint = "daily", type = "schema")
names(daily_schema$properties)
## [1] "geometry"
## [2] "id"
## [3] "time_series_id"
## [4] "monitoring_location_id"
## [5] "parameter_code"
## [6] "statistic_id"
## [7] "time"
## [8] "value"
## [9] "unit_of_measure"
## [10] "approval_status"
## [11] "qualifier"
## [12] "last_modified"
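These property names can then be passed to the corresponding read function. A minimal sketch, assuming read_USGS_daily accepts the same “properties” argument demonstrated later with read_USGS_monitoring_location:
daily_slim <- read_USGS_daily(monitoring_location_id = "USGS-01491000",
                              parameter_code = "00060",
                              time = "P7D",
                              # only request a few of the "daily" schema columns listed above
                              properties = c("monitoring_location_id",
                                             "time",
                                             "value"))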
API Tokens
You can register an API key for use with USGS water data APIs. There are now limits on how many queries can be requested per IP address per hour. If you find yourself running into limits, you can request an API token here: https://api.waterdata.usgs.gov/signup/
Then save your token in your .Renviron file like this:
API_USGS_PAT = "my_super_secret_token"
You can use usethis::edit_r_environ()
to find and open your .Renviron file. You will need to restart R for that
variable to be recognized. You should not add this file to git projects or
otherwise share your API key. Anyone else using your API key will count
against the number of requests available to you!
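After restarting R, you can confirm the token is available to your session (this check assumes the variable name shown above):
Sys.getenv("API_USGS_PAT")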
Contextual Query Language Support
Supports Contextual
Query Language (CQL2) syntax for flexible queries. We’ll show how to
use the read_USGS_data
function to make specific CQL2
queries.
Simple Features
Provides Simple Features
functionality. The data is returned with a “geometry” column, which is a
simple feature object, allowing the data to be integrated with the sf
package and
associated geospatial workflows.
Lessons Learned
This section will initially be a random stream of consciousness on lessons learned while developing these functions and playing with the services.
Query limits
A semi-common way to find a lot of data in the past would have been to use a monitoring location query to get a huge list of sites, and then use that huge list of sites (maybe winnowing it down a little) to get the data. These new services return a 403 error if your request is too big (“web server understands your request but refuses to authorize it”). This is true whether the request is a GET or POST request (something that is taken care of under the hood), and the limit appears to be on the character length of the overall request. Roughly, it seems that requesting more than about 250 monitoring locations will immediately return a 403 error.
There are at least 2 ways to deal with this. One is to manually split the data requests and bind the results together later. The other is to use the bounding box of the initial request as an input to the data request. Potentially some sites would need to be filtered out later using this method.
Example:
ohio <- read_USGS_monitoring_location(state_name = "Ohio",
site_type_code = "ST")
There are 2883 rows returned that are stream sites in Ohio. If we tried to ask for all the discharge data over the last 7 days from that list of sites:
ohio_discharge <- read_USGS_daily(monitoring_location_id = ohio$monitoring_location_id,
parameter_code = "00060",
time = "P7D")
Error in `req_perform()`:
! HTTP 403 Forbidden.
• Query request denied. Possible reasons include query exceeding server limits.
We could use the fact that the ohio
data frame contains
geospatial information, create a bounding box, and ask for that data
like this:
ohio_discharge <- read_USGS_daily(bbox = sf::st_bbox(ohio),
parameter_code = "00060",
time = "P7D")
A reasonable 1780 rows are returned with the bounding box query.
Maybe you have a list of sites that are scattered around the country. The bounding box method might not be ideal. There are several ways to loop through a set of sites, here is one simple example:
big_vector_of_sites <- ohio$monitoring_location_id

# Split the site vector into chunks of 200 to stay under the request size limit
site_list <- split(big_vector_of_sites,
                   ceiling(seq_along(big_vector_of_sites)/200))

data_returned <- data.frame()
for(sites in site_list){
  df_sites <- read_USGS_daily(monitoring_location_id = sites,
                              parameter_code = "00060",
                              time = "P7D")
  # Skip chunks that return no data
  if(nrow(df_sites) == 0){
    next
  }
  # Bind each chunk onto the running results
  data_returned <- rbind(data_returned, df_sites)
}
Note there are fewer rows returned in data_returned
because those sites were already filtered down to just “Stream” sites.
The bounding box results in ohio_discharge
contained other monitoring location types.
Result limits
There’s a hard cap of 50,000 rows returned per request. This
means that for one dataRetrieval
request, only 50,000 rows
will be returned even if there is more data! If you know you are making
a big request, it will be up to you to split it into
“reasonable” chunks. Note that sometimes you’ll notice a big request
gets chunked up and actually makes a bunch of
requests - this is done automatically (it’s called paging), and the
50,000 limit still applies to the total number of rows returned from all
the pages.
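One simple way to stay under that cap is to split a long request by time and bind the pieces together, similar to the site-splitting loop above. A minimal sketch, splitting a daily-value request into calendar years:
# Request one calendar year at a time so no single request approaches the row cap
years <- 2015:2024
daily_by_year <- lapply(years, function(yr){
  read_USGS_daily(monitoring_location_id = "USGS-01491000",
                  parameter_code = "00060",
                  time = c(paste0(yr, "-01-01"), paste0(yr, "-12-31")))
})
daily_all <- do.call(rbind, daily_by_year)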
limit vs max_results
A user can specify a limit or a max_results.
The max_results argument defines how many rows are
returned (assuming the data has at least max_results rows
to return). This can be used as a handy way to make sure you aren’t
requesting a ton of data, perhaps to do some initial coding or
troubleshooting.
The limit
argument defines how many rows are returned
per page of data, but does NOT affect the overall number of rows
returned. With a good internet connection, you can probably get away
with ignoring this argument. By default it will be set to the highest
value that the services allow. The reason you might want to change this
argument is that it might be easier on a spotty internet connection to
page through smaller sets of data.
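A minimal sketch of both arguments (the limits chosen here are arbitrary):
# max_results caps the total number of rows returned - handy for a quick peek
peek <- read_USGS_daily(monitoring_location_id = "USGS-01491000",
                        parameter_code = "00060",
                        max_results = 10)

# limit only controls the page size; the full result is still returned
paged <- read_USGS_daily(monitoring_location_id = "USGS-01491000",
                         parameter_code = "00060",
                         time = "P7D",
                         limit = 5)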
id
Each API endpoint natively returns a column named “id”. The results of the “id” column can be used as inputs into other endpoints, HOWEVER the corresponding input in those functions has a different name. For example, the “id” column of the monitoring location endpoint is considered the “monitoring_location_id” when used as an input to any of the other functions.
Therefore, dataRetrieval
functions will rename the “id”
column to whatever it is referred to in other functions. Here are the id
translations:
Function | ID returned |
---|---|
read_USGS_monitoring_location | monitoring_location_id |
read_USGS_ts_meta | time_series_id |
read_USGS_daily | daily_id |
If a user would prefer the columns to come back as “id”, they can
specify that using the properties
argument:
site <- "USGS-02238500"
site_1 <- read_USGS_monitoring_location(monitoring_location_id = site,
properties = c("monitoring_location_id",
"state_name",
"country_name"))
names(site_1)
## [1] "monitoring_location_id"
## [2] "state_name"
## [3] "country_name"
## [4] "geometry"
site_2 <- read_USGS_monitoring_location(monitoring_location_id = site,
properties = c("id",
"state_name",
"country_name"))
names(site_2)
## [1] "id" "state_name" "country_name"
## [4] "geometry"
New Functions
As new API endpoints come online, this section will be updated with
any dataRetrieval
function that is created.
Monitoring Location
The read_USGS_monitoring_location
function replaces the
readNWISsite
function.
Location information is basic information about the monitoring location including the name, identifier, agency responsible for data collection, and the date the location was established. It also includes information about the type of location, such as stream, lake, or groundwater, and geographic information about the location, such as state, county, latitude and longitude, and hydrologic unit code (HUC).
To access these services on a web browser, go to https://api.waterdata.usgs.gov/ogcapi/v0/collections/monitoring-locations.
Here is a simple example of requesting one known USGS site:
sites_information <- read_USGS_monitoring_location(monitoring_location_id = "USGS-01491000")
The output includes the following information:
Property | Value |
---|---|
monitoring_location_id | USGS-01491000 |
agency_code | USGS |
state_code | 24 |
basin_code | NA |
horizontal_positional_accuracy | Accurate to + or - 1 sec. |
time_zone_abbreviation | EST |
state_name | Maryland |
altitude | 2.73 |
uses_daylight_savings | Y |
construction_date | NA |
agency_name | U.S. Geological Survey |
county_code | 011 |
altitude_accuracy | 0.1 |
horizontal_position_method_code | M |
aquifer_code | NA |
monitoring_location_number | 01491000 |
county_name | Caroline County |
altitude_method_code | N |
horizontal_position_method_name | Interpolated from MAP. |
national_aquifer_code | NA |
monitoring_location_name | CHOPTANK RIVER NEAR GREENSBORO, MD |
minor_civil_division_code | NA |
altitude_method_name | Interpolated from Digital Elevation Model |
original_horizontal_datum | NAD83 |
aquifer_type_code | NA |
district_code | 24 |
site_type_code | ST |
vertical_datum | NAVD88 |
original_horizontal_datum_name | North American Datum of 1983 |
well_constructed_depth | NA |
country_code | US |
site_type | Stream |
vertical_datum_name | North American Vertical Datum of 1988 |
drainage_area | 113 |
hole_constructed_depth | NA |
country_name | United States of America |
hydrologic_unit_code | 020600050203 |
horizontal_positional_accuracy_code | S |
contributing_drainage_area | NA |
depth_source_code | NA |
geometry | c(-75.7858055555556, 38.9971944444444) |
Maybe that is more information than you need. You can specify which columns get returned with the “properties” argument:
site_info <- read_USGS_monitoring_location(monitoring_location_id = "USGS-01491000",
properties = c("monitoring_location_id",
"site_type",
"drainage_area",
"monitoring_location_name"))
knitr::kable(site_info)
monitoring_location_id | site_type | drainage_area | monitoring_location_name | geometry |
---|---|---|---|---|
USGS-01491000 | Stream | 113 | CHOPTANK RIVER NEAR GREENSBORO, MD | POINT (-75.78581 38.99719) |
Any of those original outputs can also be inputs to the function! So let’s say we want to find all the stream sites in Wisconsin:
sites_wi <- read_USGS_monitoring_location(state_name = "Wisconsin",
site_type = "Stream")
The returned data includes a column “geometry” which is a collection
of simple feature (sf) points. This allows for seamless integration with
the sf
package. Here are 2 quick examples of using the
sf
object in ggplot2 and leaflet:
library(ggplot2)
ggplot(data = sites_wi) +
geom_sf() +
theme_minimal()
library(leaflet)
leaflet_crs <- "+proj=longlat +datum=WGS84" #default leaflet crs
leaflet(data = sites_wi |>
sf::st_transform(crs = leaflet_crs)) |>
addProviderTiles("CartoDB.Positron") |>
addCircleMarkers(popup = ~monitoring_location_name,
radius = 0.1,
opacity = 1)
Time Series Metadata
The read_USGS_ts_meta
function replaces the
whatNWISdata
function.
Daily data and continuous measurements are grouped into time series, which represent a collection of observations of a single parameter, potentially aggregated using a standard statistic, at a single monitoring location. This endpoint provides metadata about those time series, including their operational thresholds, units of measurement, and when the earliest and most recent observations in a time series occurred.
To access these services on a web browser, go to https://api.waterdata.usgs.gov/ogcapi/v0/collections/time-series-metadata.
ts_available <- read_USGS_ts_meta(monitoring_location_id = "USGS-01491000",
parameter_code = c("00060", "00010"))
The returned table gives information on all available time series. For instance, we can pull a few columns out and see when each time series started, ended, or was last modified.
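Here is one way to pull those columns out (a sketch; this assumes the column names shown below and drops the sf geometry column first so only the requested columns remain):
ts_summary <- sf::st_drop_geometry(ts_available)[, c("parameter_name",
                                                     "statistic_id",
                                                     "begin",
                                                     "end",
                                                     "last_modified")]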
parameter_name | statistic_id | begin | end | last_modified |
---|---|---|---|---|
Discharge | NA | 1948-01-14 | 2025-04-13 | 2025-04-13 |
Temperature, water | 00011 | 2023-04-20 | 2025-06-06 | 2025-06-06 |
Temperature, water | 00002 | 2023-04-22 | 2025-06-06 | 2025-06-06 |
Discharge | 00003 | 1948-01-02 | 2025-06-06 | 2025-06-06 |
Temperature, water | 00001 | 2023-04-22 | 2025-06-06 | 2025-06-06 |
Temperature, water | 00003 | 2023-04-22 | 2025-06-06 | 2025-06-06 |
Discharge | 00011 | 1985-10-01 | 2025-06-06 | 2025-06-06 |
Temperature, water | 00001 | 1988-10-02 | 2012-05-10 | 2020-08-27 |
Temperature, water | 00002 | 2010-10-02 | 2012-05-10 | 2020-08-27 |
Temperature, water | 00003 | 2010-10-02 | 2012-05-10 | 2020-08-27 |
Daily Values
The read_USGS_daily
function replaces the
readNWISdv
function.
Daily data provide one data value to represent water conditions for the day. Throughout much of the history of the USGS, the primary water data available was daily data collected manually at the monitoring location once each day. With improved availability of computer storage and automated transmission of data, the daily data published today are generally a statistical summary or metric of the continuous data collected each day, such as the daily mean, minimum, or maximum value. Daily data are automatically calculated from the continuous data of the same parameter code and are described by parameter code and a statistic code. These data have also been referred to as “daily values” or “DV”.
To access these services on a web browser, go to https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily.
library(dataRetrieval)
daily_modern <- read_USGS_daily(monitoring_location_id = "USGS-01491000",
parameter_code = c("00060", "00010"),
statistic_id = "00003",
time = c("2023-10-01", "2024-09-30"))
The first thing you might notice is that the new service serves data
in a “long” format, which means there is just a single observation per
row of the data frame (see: Pivot
Data). Many functions are easier and more efficient to use with
a “long” data frame. For instance, here we can see how
ggplot2
can use the parameter_code column to create a
“facet” plot:
library(ggplot2)
ggplot(data = daily_modern) +
geom_point(aes(x = time, y = value,
color = approval_status)) +
facet_grid(parameter_code ~ ., scales = "free") +
theme_bw()
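If you prefer the “wide” format that legacy functions such as readNWISdv returned, you can reshape the long data yourself. A minimal sketch using tidyr (this assumes the column names shown above and drops the geometry column first):
library(tidyr)
library(dplyr)

daily_wide <- daily_modern |>
  sf::st_drop_geometry() |>
  select(monitoring_location_id, parameter_code, time, value) |>
  # one column of values per parameter code, one row per day
  pivot_wider(names_from = parameter_code,
              values_from = value)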
General Retrieval
The read_USGS_data
function replaces the
readNWISdata
function. This is a lower-level, generalized
function for querying any of the API endpoints.
The new APIs can handle complex requests. For those queries, users will need to construct their own request using Contextual Query Language (CQL2). There’s an excellent article on building these queries here: https://api.waterdata.usgs.gov/docs/ogcapi/complex-queries/.
Let’s try to find sites in Wisconsin and Minnesota that have a drainage area greater than 1000 mi^2.
cql <- '{
"op": "and",
"args": [
{
"op": "in",
"args": [
{ "property": "state_name" },
[ "Wisconsin", "Minnesota" ]
]
},
{
"op": ">",
"args": [
{ "property": "drainage_area" },
1000
]
}
]
}'
sites_mn_wi <- read_USGS_data(service = "monitoring-locations",
CQL = cql)
Let’s see what that looks like:
leaflet_crs <- "+proj=longlat +datum=WGS84" #default leaflet crs
pal <- colorNumeric("viridis", sites_mn_wi$drainage_area)
leaflet(data = sites_mn_wi |>
sf::st_transform(crs = leaflet_crs)) |>
addProviderTiles("CartoDB.Positron") |>
addCircleMarkers(popup = ~monitoring_location_name,
color = ~ pal(drainage_area),
radius = 0.1,
opacity = 1) |>
addLegend(pal = pal,
position = "bottomleft",
title = "Drainage Area",
values = ~drainage_area)
Discrete Samples
Discrete USGS water quality data can be accessed via the
read_USGS_samples
function. While this is a new, modern
USGS endpoint, it is not served from the same infrastructure as the other
services described here. See the Samples Data article for information on
accessing USGS discrete water quality data.
Notes on dataRetrieval development
New Features
Style
New functions will use “snake case”, such as
read_USGS_samples
. Older functions use camel case, such as
readNWISdv
. The difference is the underscore between words.
This should be a handy way to tell the difference between the newer,
modern data access functions and the older, traditional functions.
Structure
Historically, we allowed users to customize their queries via the
...
argument structure. With ...
, users needed
to know the exact names of query parameters before using the function.
Now, the new functions will include ALL possible
arguments that the web service APIs support. This will allow users to
use tab-autocompletes (available in RStudio and other IDEs).
Users will need to understand that they are not required to
specify all of these parameters. In fact, it is not advisable: the
systems can get bogged down with redundant query parameters. We
expect this will be easier for users, but it might take some time to
smooth out the documentation and test usability. There may be additional
consequences, such as users not being able to build up argument lists to
pass into the function.
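A minimal sketch of the difference (the legacy call is shown only for contrast, and the exact legacy arguments vary by service):
# Legacy style: arguments passed through `...`, so the exact query
# parameter names had to be known ahead of time.
legacy_daily <- readNWISdata(siteNumber = "01491000",
                             parameterCd = "00060",
                             service = "dv")

# New style: every supported argument is a named formal, so IDE
# autocomplete can list the available options.
modern_daily <- read_USGS_daily(monitoring_location_id = "USGS-01491000",
                                parameter_code = "00060",
                                time = "P7D")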
Dependencies
Under the hood, dataRetrieval
changed the dependency
from httr
to httr2
. httr2
is the
modern R package for web requests that is actively developed/maintained.
As we develop functions for the modern USGS web services, we’ll continue
to explore updating package dependencies. Since the new services offer
geospatial output, we also now require the sf
package. The
whisker
package was also included to help create POST CQL2
queries.
Developmental workflow
CRAN-stable documentation will be available on the GitHub pages: https://doi-usgs.github.io/dataRetrieval/
In-development documentation will be available on the USGS GitLab pages: https://water.code-pages.usgs.gov/dataRetrieval
Development of dataRetrieval
will happen on a git branch
called “develop”. The “develop” branch will only move to the “main”
branch when we submit to CRAN, unless there are bug fixes that pertain
to the CRAN release. The “develop” branch WILL change frequently, and
there are no promises of future behavior. Users must accept that they
are using those functions at their own risk. If you are willing to accept
this risk, the installation instructions are:
library(remotes)
install_github("DOI-USGS/dataRetrieval",
ref = "develop")
HELP!
That’s a lot of new information and changes. There are certainly going to be scripts that have been passed down through the years that will start breaking once the NWIS servers are decommissioned.
Check back on the documentation often: https://doi-usgs.github.io/dataRetrieval/. Peruse the “Additional Articles”; when we find common issues people have with converting their old workflows, we will try to add articles to clarify the new workflows.
If you have additional questions, email comptools@usgs.gov. General questions and bug reports can be reported here: https://github.com/DOI-USGS/dataRetrieval/issues