
There is an increasing amount of continuous data available from the USGS. Continuous data are collected via automated sensors installed at a monitoring location. They are collected at a high frequency, often at a fixed 15-minute interval. Depending on the specific monitoring location, the data may be transmitted automatically via telemetry and be available on Water Data for the Nation (WDFN) within minutes of collection; at other locations, delivery may be delayed if the station cannot automatically transmit data. Continuous data are described by parameter name and parameter code (pcode). These data might also be referred to as “instantaneous values” or “IV”.
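For example, you can look up what a given pcode represents with dataRetrieval’s long-standing readNWISpCode() helper (shown here as one convenient option; it queries the USGS parameter code service):

```r
library(dataRetrieval)

# Look up the metadata for a parameter code, e.g. 00060 (discharge):
pcode_info <- readNWISpCode("00060")
pcode_info$parameter_nm
```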

The current service that delivers these data only allows up to 3 years of continuous data to be requested at once. We will update this article if that limitation changes, or if alternative, more efficient workflows become available. In the meantime, let’s walk through a couple of ways to request a full period of record of continuous data.

If you are running on a fairly standard laptop, feel free to make requests in parallel. However, please don’t run queries in parallel on a supercomputer or HPC-type environment: your requests will be stopped/killed. There may be techniques to avoid overwhelming the system; contact us if you need help figuring that out.

First, let’s look for what continuous data is available at a particular site (USGS-0208458892):

library(dataRetrieval)
site <- "USGS-0208458892"

continuous_available <- read_waterdata_ts_meta(
  monitoring_location_id = site,
  computation_period_identifier = "Points"
)

We can see the available data here:

parameter_name               begin       end
Turbidity, FNU               2015-10-02  2020-02-25
Reservoir elevation          2013-10-01  2026-02-20
Specific cond at 25C         2012-09-21  2026-02-20
Temperature, water           2012-09-21  2026-02-20
Dissolved oxygen             2012-09-21  2026-02-20
pH                           2012-09-21  2026-02-20
Salinity                     2012-09-21  2026-02-20
Elevation, lake/res, NAVD88  2013-10-01  2026-02-20

Let’s say we are interested in “Specific cond at 25C” and “Salinity” for the water years 2013-2025.
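Those two parameters correspond to pcodes 00095 and 00480 (the same codes used in the targets configuration later in this article), so let’s store them for the requests below:

```r
# Parameter codes for the two time series of interest:
# 00095 = Specific cond at 25C, 00480 = Salinity
parameter_codes <- c("00095", "00480")
```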

Basic R scripts

First, we’ll create a list of beginning and ending dates that split the period of record into chunks. The services are more efficient at returning queries for a single calendar year, so we will set up a data frame that gives those time limits:

start <- as.Date("2012-10-01")
end <- as.Date("2025-09-30")

year_range <- lubridate::year(c(start, end))
date_ranges <- as.Date(paste0(seq(from = year_range[1], 
                                  to = year_range[2]),
                              "-01-01"))[-1]
starts <- c(start, date_ranges)
ends <- c(date_ranges - 1, end)

# create a data frame where each element starts at the beginning
# of a chunk, and ends the day before the next chunk:
time_df <- data.frame(start = starts,
                      end = ends)
time_df
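As a quick sanity check (pure base R, no web requests, using the time_df built above), we can confirm the chunks tile the full range with no gaps or overlaps:

```r
# Each chunk should start the day after the previous chunk ends,
# and the endpoints should match the requested range:
stopifnot(
  time_df$start[1] == as.Date("2012-10-01"),
  time_df$end[nrow(time_df)] == as.Date("2025-09-30"),
  all(time_df$start[-1] == time_df$end[-nrow(time_df)] + 1)
)
nrow(time_df) # 14 annual chunks
```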

Then we can use a for loop, apply, or purrr to get the data:

for Loop

all_data <- data.frame()
for(i in seq_along(time_df$start)){
  sub_df <- read_waterdata_continuous(monitoring_location_id = site,
                                      parameter_code = parameter_codes,
                                      time = c(time_df$start[i],
                                               time_df$end[i]))
  all_data <- rbind(all_data, sub_df)
}

apply

get_data_list <- apply(time_df, 1, FUN = function(x){
  read_waterdata_continuous(monitoring_location_id = site,
                            parameter_code = parameter_codes,
                            time = c(x[["start"]],
                                     x[["end"]]))
})

all_data <- do.call("rbind", get_data_list)

purrr

purrr_list <- time_df |>
  purrr::pmap(
    \(start, end, ...) read_waterdata_continuous(
      monitoring_location_id = site,
      parameter_code = parameter_codes,
      time = c(start, end))
  )

all_data <- purrr::list_rbind(purrr_list)

That’s all well and good if everything works perfectly. What if something goes wrong in the middle of the pull? You could add tryCatch statements to the code above, post-process to figure out what was missed, and re-request the missing data…OR…consider using a targets pipeline to take care of all of that!
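For reference, a hand-rolled version of that error handling might look like the sketch below (assuming site, parameter_codes, and time_df from above); targets does the equivalent bookkeeping for you:

```r
# Wrap each chunked request in tryCatch, keeping track of which
# chunks failed so their date ranges can be re-requested later.
all_data <- data.frame()
failed <- integer(0)
for (i in seq_along(time_df$start)) {
  sub_df <- tryCatch(
    read_waterdata_continuous(monitoring_location_id = site,
                              parameter_code = parameter_codes,
                              time = c(time_df$start[i],
                                       time_df$end[i])),
    error = function(e) {
      message("Chunk ", i, " failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(sub_df)) {
    failed <- c(failed, i)
  } else {
    all_data <- rbind(all_data, sub_df)
  }
}
# time_df[failed, ] now holds any date ranges that still need re-requesting.
```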

Basic “targets” pipeline

We can use the targets package to create a data pipeline.

First step, save this code in a file called _targets.R. This code creates dynamic branches in the “get_data” target based on the annual time chunks.

# Load packages required to define the pipeline:
library(targets)
library(future)
plan(multisession)

# Set target options:
tar_option_set(packages = c("dataRetrieval"),
               error = "trim")

get_time_df <- function(start, end){
  year_range <- lubridate::year(c(start, end))
  date_ranges <- as.Date(paste0(seq(from = year_range[1], 
                                    to = year_range[2]),
                                "-01-01"))[-1]
  starts <- c(start, date_ranges)
  ends <- c(date_ranges - 1, end)

  time_df <- data.frame(start = starts,
                        end = ends)
  
  return(time_df)
}

list(
  tar_target(name = config,
             command = list(site = "USGS-0208458892",
                            parameter_codes = c("00095", "00480"),
                            start = as.Date("2012-10-01"),
                            end = as.Date("2025-09-30"))),
  tar_target(name = time_df,
             command = get_time_df(start = config[["start"]],
                                   end = config[["end"]])),
  tar_target(name = get_data,
             command = read_waterdata_continuous(
               monitoring_location_id = config[["site"]],
               parameter_code = config[["parameter_codes"]],
               time = c(time_df$start, time_df$end)),
             pattern = map(time_df),
             iteration = "list"),
  tar_target(name = all_data,
           command = dplyr::bind_rows(get_data))
)

Once the _targets.R file is created, you can run:

library(targets)
tar_make()

If everything went according to plan, the last several printed messages would look like this:

> tar_make()
+ config dispatched                          
✔ config completed [3.8s, 178 B]                            
+ time_df dispatched                                        
✔ time_df completed [810ms, 255 B]                          
+ get_data declared [14 branches]                           
Initiating curl with CURL_SSL_BACKEND: openssl
Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/continuous/items?f=json&lang=en-US&skipGeometry=TRUE&monitoring_location_id=USGS-0208458892&parameter_code=00095,00480&time=2012-10-01T00%3A00%3A00Z%2F2012-12-31T00%3A00%3A00Z&limit=50000
Remaining requests this hour: 2818

...

✔ get_data completed [56.5s, 463.86 kB]                        
+ all_data dispatched                                          
✔ all_data completed [450ms, 1.81 MB]                            
✔ ended pipeline [1m 6.2s, 4 completed, 13 skipped]  

If instead it ended with any errors, it would look something like this:

✖ errored pipeline [9.3s, 0 completed, 21 skipped]          
Error:
! Error in tar_make():

If the errors were related to the internet, service outages, or exceeding API requests, you could re-run the pipeline and only the failed jobs would be re-run:

# Rerun:
tar_make()

To load the all_data target into your R environment:

tar_load(all_data)
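Alternatively, targets::tar_read() returns a target’s value without loading it by name into the global environment, which can be handy inside scripts:

```r
all_data <- targets::tar_read(all_data)
```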

Run in parallel

As mentioned above, if you are running on a fairly standard laptop, feel free to make requests in parallel. However, please don’t run queries in parallel on a supercomputer or HPC-type environment: your requests will be stopped/killed. There may be techniques to avoid overwhelming the system; contact us if you need help figuring that out.
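One polite option (our suggestion, not an official service limit) is to cap the number of parallel workers so only a few requests hit the service at once:

```r
library(future)
# Limit simultaneous R sessions; 2 or 3 workers is plenty for
# a dozen or so annual chunks.
plan(multisession, workers = 2)
```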

future.apply

library(future.apply)
plan(multisession)

get_data_list <- future_apply(time_df, 1, FUN = function(x){
  read_waterdata_continuous(monitoring_location_id = site,
                            parameter_code = parameter_codes,
                            time = c(x[["start"]],
                                     x[["end"]]))
})

all_data <- do.call("rbind", get_data_list)

furrr

library(furrr)
plan(multisession)

furrr_list <- time_df |>
  furrr::future_pmap(
    \(start, end, ...) read_waterdata_continuous(
      monitoring_location_id = site, 
      parameter_code = parameter_codes, 
      time = c(start, end))
  )

all_data <- purrr::list_rbind(furrr_list)

Parallel “targets” pipeline

You can run the target pipeline in parallel in a few different ways (see the targets documentation). One way is to use the future package. We already added that to the top of the _targets.R file:

# Load packages required to define the pipeline:
library(targets)
library(future)
plan(multisession)

To run in parallel, use tar_make_future() instead of tar_make():

> targets::tar_make_future()
+ config dispatched                          
✔ config completed [3.5s, 178 B]                            
+ time_df dispatched                                        
✔ time_df completed [0ms, 205 B]                            
+ get_data declared [5 branches]                            
✔ get_data completed [3m 33.6s, 1.89 MB]                       
+ all_data dispatched                                         
✔ all_data completed [750ms, 1.81 MB]                           
✔ ended pipeline [3m 43.3s, 8 completed, 0 skipped]