Add read_waterdata_nearest_continuous helper #881

Draft
thodson-usgs wants to merge 7 commits into DOI-USGS:develop from thodson-usgs:feat/nearest-continuous

Conversation

thodson-usgs commented Apr 23, 2026

Summary

Adds read_waterdata_nearest_continuous(targets, ...) — for each target timestamp, returns the single continuous observation closest to that timestamp, fetched in one HTTP round-trip (auto-chunked when the CQL filter gets long).

Try it

Copy-paste into an R session — installs this branch and runs one end-to-end call:

remotes::install_github(
  "thodson-usgs/dataRetrieval-1",
  ref = "feat/nearest-continuous"
)

library(dataRetrieval)

targets <- as.POSIXct(
  c("2023-06-15 10:30:31", "2023-06-15 14:07:12", "2023-06-16 03:45:19"),
  tz = "UTC"
)

near <- read_waterdata_nearest_continuous(
  targets = targets,
  monitoring_location_id = "USGS-02238500",
  parameter_code = "00060"
)
near[, c("monitoring_location_id", "time", "value", "target_time")]
#> # A tibble: 3 × 4
#>   monitoring_location_id time                value target_time
#>   <chr>                  <dttm>              <dbl> <dttm>
#> 1 USGS-02238500          2023-06-15 10:30:00  22.4 2023-06-15 10:30:31
#> 2 USGS-02238500          2023-06-15 14:00:00  22.4 2023-06-15 14:07:12
#> 3 USGS-02238500          2023-06-16 03:45:00  22.4 2023-06-16 03:45:19

One HTTP request goes out, carrying a three-clause (time >= t-window AND time <= t+window) OR ... CQL filter; three rows come back, one per target. Each time is the nearest observation on the 15-minute grid; target_time identifies which target the row corresponds to.
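The filter construction described above can be sketched in a few lines (an illustration only, not the package internals; the 450-second half-window and timestamp formatting are assumptions):

```r
# Sketch: one bracketed AND clause per target, joined by OR.
# Assumes `targets` from the snippet above.
window <- 450  # seconds, half of the 15-minute cadence
clauses <- character(length(targets))
for (i in seq_along(targets)) {
  lo <- format(targets[i] - window, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  hi <- format(targets[i] + window, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  clauses[i] <- sprintf("(time >= '%s' AND time <= '%s')", lo, hi)
}
cql <- paste(clauses, collapse = " OR ")
```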

(No API_USGS_PAT needed to run the snippet — the Water Data API serves unauthenticated requests at a lower rate limit. Set it if you're iterating.)

Tie-mode and wider-window variations:

# Widen the window and average numeric columns for targets that fall
# on the midpoint between two grid observations.
read_waterdata_nearest_continuous(
  targets = targets,
  monitoring_location_id = "USGS-02238500",
  parameter_code = "00060",
  window = "PT15M",      # or "15:00", or "15 minutes"
  on_tie = "mean"
)

Why

The Water Data API's time= parameter treats a single instant as an exact match, not a nearest-match — time = "2023-06-15T10:30:31Z" on a 15-minute gauge returns 0 rows. The advertised sortby parameter would make "nearest" expressible as filter = "time <= 'target'" + sortby = -time + limit = 1, but sortby is per-query, so N targets would mean N HTTP round-trips. There is no T_NEAREST CQL function either.

The narrow-window + client-side reduction implemented here is the one pattern that folds N targets into a single request today.

Knobs

  • window = "PT7M30S" — half-window around each target (7.5 minutes, ISO 8601; half of the 15-minute continuous cadence, so most windows contain exactly one observation). Accepts:
    • ISO 8601 durations ("PT7M30S", "PT15M", "PT1H", ...) or any other string lubridate::duration() parses (e.g. "7 minutes 30 seconds")
    • "MM:SS" or "HH:MM:SS" clock-style strings (e.g. "07:30", "15:00", "00:30:00", "01:00:00")
    • for programmatic callers: a number of seconds, a difftime, or a lubridate::Period/Duration
  • on_tie = "first" — how to resolve ties when a target falls at the midpoint between two grid points (rare but possible). Alternatives: "last" (keep the later observation), "mean" (average numeric columns; set time to the target).

Multi-site calls return one row per (target, monitoring_location_id) pair. Targets with no observations in their window are silently dropped. Passing time, filter, or filter_lang raises an error — the helper builds those itself.
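The client-side nearest-observation reduction, including the tie modes, works roughly like this (a minimal sketch under stated assumptions; the real helper operates on the full multi-site result and its internals differ):

```r
# Pick the in-window row closest to `target`. Assumes `rows` is a data
# frame with a POSIXct `time` column and a numeric `value` column.
nearest_row <- function(rows, target, on_tie = c("first", "last", "mean")) {
  on_tie <- match.arg(on_tie)
  d <- abs(as.numeric(rows$time) - as.numeric(target))
  hits <- which(d == min(d))
  if (length(hits) == 1 || on_tie == "first") return(rows[hits[1], ])
  if (on_tie == "last") return(rows[hits[length(hits)], ])
  # "mean": average the tied values and set time to the target itself
  out <- rows[hits[1], ]
  out$value <- mean(rows$value[hits])
  out$time <- target
  out
}
```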

Naming

Renamed from the Python get_nearest_continuous to read_waterdata_nearest_continuous to match the R package's convention (read_waterdata_* for OGC-backed functions).

Relationship to #880

This PR is stacked on #880 (Add CQL filter passthrough to OGC waterdata functions) and will look lighter once that lands. The helper's core trick of fanning N targets into one request is only possible because #880 adds filter / filter_lang support and automatic URL-length-safe chunking to read_waterdata_continuous. The branch feat/nearest-continuous is stacked on feat/cql-filter-passthrough, so until #880 merges the diff here includes both changesets; once #880 merges, its commits become common ancestors and this PR's diff reduces to the single commit introducing read_waterdata_nearest_continuous and its tests.

Please merge #880 first.

Test plan

  • Non-network unit tests via with_mocked_bindings — 44/44 pass. Covers filter construction (one bracketed AND clause per target, joined by OR), nearest-observation reduction, all three on_tie modes (first / last / mean), missing-window drop, multi-site fan-out, empty targets, forbidden-kwarg validation, and window input shapes (ISO 8601 like "PT7M30S" / "PT15M" / "PT1H", natural-language strings like "7 minutes 30 seconds", "MM:SS" and "HH:MM:SS" including fractional seconds, numeric seconds, difftime, lubridate::Period).
  • R CMD check — 0 errors, 0 warnings, 3 unrelated NOTEs.
  • Live end-to-end against USGS-02238500 00060 with three off-grid targets (output shown in the Try it section above). One HTTP request, three rows returned, time snapped to the 15-minute grid, target_time preserved as POSIXct.

Marked as draft pending maintainer review.

🤖 Generated with Claude Code

thodson-usgs and others added 2 commits April 22, 2026 15:21
Every OGC read_waterdata_* function (continuous, daily, field_measurements,
monitoring_location, ts_meta, latest_continuous, latest_daily, channel) now
accepts `filter` and `filter_lang` arguments that are forwarded as the
OGC `filter` / `filter-lang` query parameters. The R argument `filter_lang`
is translated to the hyphenated `filter-lang` URL parameter that the
service expects.

When a filter is a top-level OR chain that exceeds a conservative
URI-length budget (5 KB), the library transparently splits it into
multiple sub-requests and concatenates (and deduplicates) the results.
This keeps the common multi-interval use case out of the caller's way --
they don't need to know about the server's 414 boundary.
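
The splitting idea amounts to greedily packing top-level OR clauses under a byte budget; a rough sketch (the 5 KB figure comes from the description above, everything else here is assumed):

```r
# Greedily pack top-level OR clauses so each chunk's filter stays under
# the byte budget; each chunk becomes one sub-request whose results are
# later concatenated and deduplicated.
chunk_or_filter <- function(clauses, budget = 5000) {
  chunks <- list()
  cur <- character(0)
  for (cl in clauses) {
    cand <- paste(c(cur, cl), collapse = " OR ")
    if (length(cur) > 0 && nchar(cand, type = "bytes") > budget) {
      chunks[[length(chunks) + 1]] <- paste(cur, collapse = " OR ")
      cur <- cl
    } else {
      cur <- c(cur, cl)
    }
  }
  chunks[[length(chunks) + 1]] <- paste(cur, collapse = " OR ")
  chunks
}
```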

Mirrors dataretrieval-python PR DOI-USGS#238.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rame handling

Addresses feedback on the companion Python PR (DOI-USGS/dataretrieval-python#238):

- Skip chunking when `filter_lang` is not `cql-text`. The splitter is
  text- and single-quote-aware and would corrupt cql-json. Non-cql-text
  filters are now forwarded as-is.
- Budget each chunk against the server's URL byte limit
  (`.WATERDATA_URL_BYTE_LIMIT = 8000`, matching the observed HTTP 414
  cliff of ~8,200 bytes) rather than a fixed raw filter length.
  `effective_filter_budget` probes the non-filter URL, subtracts, and
  converts back to raw CQL bytes using the max per-clause encoding
  ratio (with the " OR " joiner included — in R's percent-encoding the
  joiner inflates 2x, heavier than typical clause ratios, and the
  previous clause-only max let chunks overflow the URL cap).
- When the non-filter URL already exceeds the byte limit, return a
  budget larger than the filter so it passes through unchanged — one
  clear 414 is better feedback than N failing sub-requests.
- Move filter chunking out of the recursive `get_ogc_data` path and
  into the post-transform branch, so the probe sees the real request
  args. Collect raw frames, drop empty ones before `rbind` (a plain
  empty frame first would downgrade a later sf result and drop
  geometry/CRS), and dedup on the pre-rename feature `id`.
- Add regression tests for doubled single-quote CQL escape, the URL
  byte budget guarantee, and non-cql-text pass-through.
- Document CQL filter usage with two examples on
  `read_waterdata_continuous`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs and others added 5 commits April 23, 2026 15:45
Mirrors the helper organization in the merged Python PR
(DOI-USGS/dataretrieval-python#238) so the per-language
implementations stay easy to read alongside each other.

The single-vs-fanned distinction is now expressed once, in
`plan_filter_chunks`, which always returns a list of "chunk
overrides" -- `list(NULL)` for "send `args` as-is", or a list of
chunked cql-text expressions otherwise. `fetch_chunks` issues one
request per entry and returns the per-chunk frames plus the first
sub-request (for the `request` attribute). `combine_chunk_frames`
handles the empty-frame and dedup-by-`id` cases.

`get_ogc_data` is now a linear pipeline:

    chunks   <- plan_filter_chunks(args)
    fetched  <- fetch_chunks(args, chunks)
    return_list <- combine_chunk_frames(fetched$frames)
    req      <- fetched$req
    ... post-processing ...

Behavior unchanged: same chunk sizing (URL-byte-budget aware),
same cql-text-only guard, same empty-frame and id-dedup handling.
The only observable difference is that the `request` attribute
now points at the first sub-request instead of the last (matching
Python's choice of representative metadata), which is a
debugging-only change for the chunked path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For each target timestamp, returns the single continuous observation
closest to that timestamp, fetched in one HTTP round-trip (auto-chunked
when the underlying CQL filter gets long).

Why: the Water Data API's `time=` parameter treats a single instant as
an *exact match*, not a nearest-match -- `time=2023-06-15T10:30:31Z` on
a 15-minute gauge returns 0 rows. The advertised `sortby` parameter
would make "nearest" expressible as
`filter=time <= 'target' & sortby=-time & limit=1`, but `sortby` is
per-query, so N targets would mean N HTTP round-trips. There is no
`T_NEAREST` CQL function either.

The narrow-window + client-side reduction implemented here is the one
pattern that folds N targets into a single request today, made
possible by the CQL filter passthrough + auto-chunking on the preceding
filter PR.

Knobs:
- `window` (default 450s, i.e. 7.5 min, half of the 15-min continuous
  cadence) -- accepts numeric seconds, a difftime, a lubridate
  Period/Duration, or a string coercible to one.
- `on_tie` in {"first", "last", "mean"} controls behavior when a target
  sits exactly at the midpoint between two observations.

Passing `time`, `filter`, or `filter_lang` raises an error -- this
function builds those itself.

Mirrors dataretrieval-python PR DOI-USGS#239, renamed from `get_nearest_continuous`
to `read_waterdata_nearest_continuous` to match R package conventions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The primary way to specify `window` is now an `"HH:MM:SS"` string:

    window = "00:07:30"  # default (7.5 min, half of 15-min cadence)
    window = "00:15:00"
    window = "00:30:00"
    window = "01:00:00"

Reads more cleanly than raw seconds (`450`) or a loose time-unit string
(`"7.5 mins"`) when comparing windows at a glance. Programmatic callers
can still pass a number of seconds, a `difftime`, or a
`lubridate::Period`/`Duration` -- the fuzzy `"7.5 mins"` /
`lubridate::duration` string path is dropped in favor of the unambiguous
`HH:MM:SS` form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Default is now `"07:30"` (MM:SS) instead of `"00:07:30"` -- reads
  cleanly at a glance for the common sub-hour case and matches how
  people write cadence offsets for 15-minute gauges.
- The parser now accepts:
    * MM:SS / HH:MM:SS clock-style strings (new MM:SS form for brevity),
    * ISO 8601 duration strings (`"PT7M30S"`, `"PT15M"`, `"PT1H"`, ...)
      or any other string `lubridate::duration()` parses,
    * numeric seconds, `difftime`, `lubridate::Period`/`Duration`
      (unchanged).
- Error message and tests updated accordingly.
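
As a rough illustration of the clock-string path (an assumed implementation, not the actual parser, which also handles the ISO 8601 and lubridate forms):

```r
# Convert "MM:SS" or "HH:MM:SS" to seconds; MM:SS is padded to HH:MM:SS.
parse_clock <- function(x) {
  parts <- as.numeric(strsplit(x, ":", fixed = TRUE)[[1]])
  if (length(parts) == 2) parts <- c(0, parts)
  parts[1] * 3600 + parts[2] * 60 + parts[3]
}
parse_clock("07:30")   # 450 seconds, the default half-window
```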

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All non-ISO string forms (MM:SS, HH:MM:SS, natural-language via
lubridate) still parse; only the declared default changes. Picks the
unambiguous, internationally-standard form for what shows up in the
function signature and the generated help page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>