header_tag.html

Skip to contents

Increasingly there are high frequency sensor data available for water quality data. There’s a need to join the sensor and discrete data by the closest time.

This article will discuss how to do that using the data.table package.

Let’s look at site “01646500”, and a nearby site with a real-time nitrate-plus-nitrite sensor. Our goal is to get the discharge and nitrate-plus-nitrite sensor data aligned with the discrete water quality samples.

library(dataRetrieval)

site_uv <- "USGS-01646500"
site_samples <- "USGS-01646580"
pcode_uv <- "99133"
pcode_samples <- "00631"
start_date <- as.Date("2018-01-01")
end_date <- as.Date("2020-01-01")
samples_data <- readWQPqw(site_samples, pcode_samples, 
                     startDate = start_date,
                     endDate = end_date)

uv_data <- read_waterdata_continuous(monitoring_location_id = site_uv,
                                     parameter_code =  c(pcode_uv),
                                     time = c(start_date, end_date))

Next we’ll clean up the discrete water quality data to make it easy to follow in this tutorial.

samples_trim <- samples_data |> 
  filter(ActivityTypeCode == "Sample-Routine",
         !is.na(ActivityStartDateTime)) |> 
  select(samples_date = ActivityStartDateTime,
         val_samples = ResultMeasureValue,
         det_txt = ResultDetectionConditionText)
samples_date val_samples det_txt
2018-01-04 15:30:00 1.39 NA
2018-01-15 15:15:00 1.73 NA
2018-02-01 15:30:00 1.66 NA
2018-02-11 19:00:00 1.33 NA
2018-02-13 17:30:00 1.53 NA
2018-02-15 16:00:00 1.65 NA

Finally, we’ll use the data.table package to do a join to the nearest date. The code to do that is here:

library(data.table)

setDT(samples_trim)[, join_date := samples_date]

setDT(uv_data)[, join_date := time]

closest_dt <- uv_data[samples_trim, on = .(join_date), roll = "nearest"]

closest_dt is a data.table object. It similar to a data.frame, but not identical. We can convert it to a data.frame and then use dplyr commands. Note: the whole analysis can be done via data.table, but most examples in dataRetrieval have used dplyr, which is why we bring it back to data.frame. dplyr also has a join_by(closest()) option, but it is more complicated because you can only specify the closeness in either the forward or backwards direction (and we want either direction).

We can calculate “delta_time” - the difference in time between the uv and samples data. We’ll probably want to add a threshold that we don’t join values if they are too far apart in time. In this example, if the difference is greater than 24 hours, we’ll substitute NA.

samples_closest <- data.frame(closest_dt) |> 
  mutate(delta_time = difftime(samples_date, time, 
                               units = "hours"),
         val_uv = if_else(abs(as.numeric(delta_time)) >= 24, NA, value)) |> 
  select(-join_date)
monitoring_location_id time value last_modified parameter_code statistic_id unit_of_measure approval_status qualifier time_series_id samples_date val_samples det_txt delta_time val_uv
USGS-01646500 2018-01-04 15:30:00 1.43 2025-08-25 23:48:02 99133 00011 mg/l Approved ceea3bb9b349410ba4ad72128cb63392 2018-01-04 15:30:00 1.39 NA 0 hours 1.43
USGS-01646500 2018-01-15 15:15:00 1.84 2025-08-25 23:48:02 99133 00011 mg/l Approved ceea3bb9b349410ba4ad72128cb63392 2018-01-15 15:15:00 1.73 NA 0 hours 1.84
USGS-01646500 2018-02-01 15:30:00 1.67 2025-08-25 23:48:02 99133 00011 mg/l Approved ceea3bb9b349410ba4ad72128cb63392 2018-02-01 15:30:00 1.66 NA 0 hours 1.67
USGS-01646500 2018-02-11 19:00:00 1.29 2025-08-25 23:48:02 99133 00011 mg/l Approved ceea3bb9b349410ba4ad72128cb63392 2018-02-11 19:00:00 1.33 NA 0 hours 1.29
USGS-01646500 2018-02-13 17:30:00 1.63 2025-08-25 23:48:02 99133 00011 mg/l Approved ceea3bb9b349410ba4ad72128cb63392 2018-02-13 17:30:00 1.53 NA 0 hours 1.63
USGS-01646500 2018-02-15 16:00:00 1.68 2025-08-25 23:48:02 99133 00011 mg/l Approved ceea3bb9b349410ba4ad72128cb63392 2018-02-15 16:00:00 1.65 NA 0 hours 1.68

Here are a few plots to show the applications of these joins:

library(ggplot2)

ggplot(data = samples_closest) +
  geom_point(aes(x = val_uv, y = val_samples)) +
  theme_bw() +
  xlab("Sensor") +
  ylab("Discrete")

ggplot() +
  geom_line(data = uv_data,
            aes(x = time, value),
            color = "lightgrey") +
  geom_point(data = samples_closest,
             aes(x = samples_date, y = val_samples),
             color= "red") +
  theme_bw() +
  ggtitle("Red dots = discrete samples, grey lines = continuous sensor") +
  xlab("") +
  ylab("Concentration")