Add read_waterdata_nearest_continuous helper #881
Draft
thodson-usgs wants to merge 7 commits into DOI-USGS:develop from feat/nearest-continuous
Conversation
Every OGC `read_waterdata_*` function (continuous, daily, field_measurements, monitoring_location, ts_meta, latest_continuous, latest_daily, channel) now accepts `filter` and `filter_lang` arguments that are forwarded as the OGC `filter` / `filter-lang` query parameters. The R argument `filter_lang` is translated to the hyphenated `filter-lang` URL parameter that the service expects.

When a filter is a top-level OR chain that exceeds a conservative URI-length budget (5 KB), the library transparently splits it into multiple sub-requests and concatenates (and deduplicates) the results. This keeps the common multi-interval use case out of the caller's way: callers don't need to know about the server's 414 boundary.

Mirrors dataretrieval-python PR DOI-USGS#238.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rame handling

Addresses feedback on the companion Python PR (DOI-USGS/dataretrieval-python#238):

- Skip chunking when `filter_lang` is not `cql-text`. The splitter is text- and single-quote-aware and would corrupt cql-json. Non-cql-text filters are now forwarded as-is.
- Budget each chunk against the server's URL byte limit (`.WATERDATA_URL_BYTE_LIMIT = 8000`, matching the observed HTTP 414 cliff of ~8,200 bytes) rather than a fixed raw filter length. `effective_filter_budget` probes the non-filter URL, subtracts, and converts back to raw CQL bytes using the max per-clause encoding ratio, with the " OR " joiner included: in R's percent-encoding the joiner inflates 2x, heavier than typical clause ratios, and the previous clause-only max let chunks overflow the URL cap.
- When the non-filter URL already exceeds the byte limit, return a budget larger than the filter so it passes through unchanged; one clear 414 is better feedback than N failing sub-requests.
- Move filter chunking out of the recursive `get_ogc_data` path and into the post-transform branch, so the probe sees the real request args. Collect raw frames, drop empty ones before `rbind` (a plain empty frame first would downgrade a later sf result and drop geometry/CRS), and dedup on the pre-rename feature `id`.
- Add regression tests for the doubled single-quote CQL escape, the URL byte budget guarantee, and non-cql-text pass-through.
- Document CQL filter usage with two examples on `read_waterdata_continuous`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
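The chunk-splitting idea described above can be sketched in base R. This is a hypothetical illustration, not the package's internals: `split_top_level_or` and `chunk_clauses` are made-up names, and the real budget is computed against percent-encoded URL bytes rather than the raw-byte check shown here.

```r
# Hypothetical sketch: split a cql-text filter on top-level " OR ",
# respecting single-quoted literals (doubled '' escapes toggle twice,
# so they are handled naturally), then greedily pack clauses into
# chunks whose raw byte size stays under a budget.
split_top_level_or <- function(filter) {
  chars <- strsplit(filter, "")[[1]]
  in_quote <- FALSE
  clauses <- character()
  start <- 1
  i <- 1
  n <- length(chars)
  while (i <= n) {
    ch <- chars[i]
    if (ch == "'") in_quote <- !in_quote
    if (!in_quote && i + 3 <= n &&
        paste(chars[i:(i + 3)], collapse = "") == " OR ") {
      clauses <- c(clauses, paste(chars[start:(i - 1)], collapse = ""))
      i <- i + 4
      start <- i
      next
    }
    i <- i + 1
  }
  c(clauses, paste(chars[start:n], collapse = ""))
}

chunk_clauses <- function(clauses, budget) {
  chunks <- list()
  current <- character()
  for (cl in clauses) {
    candidate <- paste(c(current, cl), collapse = " OR ")
    if (length(current) > 0 && nchar(candidate, type = "bytes") > budget) {
      chunks[[length(chunks) + 1]] <- paste(current, collapse = " OR ")
      current <- cl
    } else {
      current <- c(current, cl)
    }
  }
  chunks[[length(chunks) + 1]] <- paste(current, collapse = " OR ")
  unlist(chunks)
}
```

Note that a quote-naive splitter would break the third clause below at the `OR` inside the string literal, which is exactly the corruption the commit's quote-aware scanner avoids.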
Mirrors the helper organization in the merged Python PR (DOI-USGS/dataretrieval-python#238) so the per-language implementations stay easy to read alongside each other.

The single-vs-fanned distinction is now expressed once, in `plan_filter_chunks`, which always returns a list of "chunk overrides": `list(NULL)` for "send `args` as-is", or a list of chunked cql-text expressions otherwise. `fetch_chunks` issues one request per entry and returns the per-chunk frames plus the first sub-request (for the `request` attribute). `combine_chunk_frames` handles the empty-frame and dedup-by-`id` cases. `get_ogc_data` is now a linear pipeline:

```r
chunks <- plan_filter_chunks(args)
fetched <- fetch_chunks(args, chunks)
return_list <- combine_chunk_frames(fetched$frames)
req <- fetched$req
# ... post-processing ...
```

Behavior unchanged: same chunk sizing (URL-byte-budget aware), same cql-text-only guard, same empty-frame and id-dedup handling. The only observable difference is that the `request` attribute now points at the first sub-request instead of the last (matching Python's choice of representative metadata), which is a debugging-only change for the chunked path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For each target timestamp, returns the single continuous observation
closest to that timestamp, fetched in one HTTP round-trip (auto-chunked
when the underlying CQL filter gets long).
Why: the Water Data API's `time=` parameter treats a single instant as
an *exact match*, not a nearest-match -- `time=2023-06-15T10:30:31Z` on
a 15-minute gauge returns 0 rows. The advertised `sortby` parameter
would make "nearest" expressible as
`filter=time <= 'target' & sortby=-time & limit=1`, but `sortby` is
per-query, so N targets would mean N HTTP round-trips. There is no
`T_NEAREST` CQL function either.
The narrow-window + client-side reduction implemented here is the one
pattern that folds N targets into a single request today, made
possible by the CQL filter passthrough + auto-chunking on the preceding
filter PR.
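The OR-of-AND filter this pattern produces can be sketched as follows. `build_nearest_filter` is a hypothetical name for illustration; the real construction lives inside `read_waterdata_nearest_continuous` and the timestamp format shown is assumed, not confirmed from the source.

```r
# Hypothetical sketch: one bracketed AND clause per target, joined by
# OR, with the half-window applied on both sides. UTC formatting is
# an assumption for the illustration.
build_nearest_filter <- function(targets, window_secs = 450) {
  fmt <- function(t) format(t, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  clauses <- character(length(targets))
  for (i in seq_along(targets)) {
    t <- targets[i]  # single-bracket indexing keeps the POSIXct class
    clauses[i] <- sprintf("(time >= '%s' AND time <= '%s')",
                          fmt(t - window_secs), fmt(t + window_secs))
  }
  paste(clauses, collapse = " OR ")
}
```

With three targets this yields the three-clause filter the PR description shows going out in a single request; past the URL byte budget, the auto-chunking from the preceding filter PR splits it transparently.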
Knobs:
- `window` (default 450s, i.e. 7.5 min, half of the 15-min continuous
cadence) -- accepts numeric seconds, a difftime, a lubridate
Period/Duration, or a string coercible to one.
- `on_tie` in {"first", "last", "mean"} controls behavior when a target
sits exactly at the midpoint between two observations.
Passing `time`, `filter`, or `filter_lang` raises an error -- this
function builds those itself.
Mirrors dataretrieval-python PR DOI-USGS#239, renamed from `get_nearest_continuous`
to `read_waterdata_nearest_continuous` to match R package conventions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The primary way to specify `window` is now an `"HH:MM:SS"` string:
window = "00:07:30" # default (7.5 min, half of 15-min cadence)
window = "00:15:00"
window = "00:30:00"
window = "01:00:00"
Reads more cleanly than raw seconds (`450`) or a loose time-unit string
(`"7.5 mins"`) when comparing windows at a glance. Programmatic callers
can still pass a number of seconds, a `difftime`, or a
`lubridate::Period`/`Duration` -- the fuzzy `"7.5 mins"` /
`lubridate::duration` string path is dropped in favor of the unambiguous
`HH:MM:SS` form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Default is now `"07:30"` (MM:SS) instead of `"00:07:30"` -- reads
cleanly at a glance for the common sub-hour case and matches how
people write cadence offsets for 15-minute gauges.
- The parser now accepts:
* MM:SS / HH:MM:SS clock-style strings (new MM:SS form for brevity),
* ISO 8601 duration strings (`"PT7M30S"`, `"PT15M"`, `"PT1H"`, ...)
or any other string `lubridate::duration()` parses,
* numeric seconds, `difftime`, `lubridate::Period`/`Duration`
(unchanged).
- Error message and tests updated accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All non-ISO string forms (MM:SS, HH:MM:SS, natural-language via lubridate) still parse; only the declared default changes. Picks the unambiguous, internationally-standard form for what shows up in the function signature and the generated help page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
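The accepted `window` shapes can be illustrated with a base-R normalizer. This is a hypothetical sketch: the package goes through `lubridate` for the natural-language path, and `parse_window` is a made-up name covering only the shapes the commits above describe.

```r
# Hypothetical sketch: normalize a window spec to seconds.
# Covers numeric seconds, difftime, MM:SS / HH:MM:SS clock strings,
# and ISO 8601 durations of the PT#H#M#S form.
parse_window <- function(window) {
  if (is.numeric(window)) return(as.numeric(window))
  if (inherits(window, "difftime")) {
    return(as.numeric(window, units = "secs"))
  }
  if (is.character(window) && length(window) == 1) {
    # Clock-style: MM:SS or HH:MM:SS, fractional seconds allowed
    if (grepl("^[0-9]+(:[0-9]+){1,2}(\\.[0-9]+)?$", window)) {
      parts <- as.numeric(strsplit(window, ":", fixed = TRUE)[[1]])
      mult <- rev(60^(seq_along(parts) - 1))  # ..., 3600, 60, 1
      return(sum(parts * mult))
    }
    # ISO 8601 duration: PT#H#M#S (each part optional)
    m <- regmatches(window, regexec(
      "^PT(?:([0-9]+)H)?(?:([0-9]+)M)?(?:([0-9]+(?:\\.[0-9]+)?)S)?$",
      window, perl = TRUE))[[1]]
    if (length(m)) {
      nums <- suppressWarnings(as.numeric(m[2:4]))
      nums[is.na(nums)] <- 0
      return(nums[1] * 3600 + nums[2] * 60 + nums[3])
    }
  }
  stop("window must be numeric seconds, a difftime, ",
       "'MM:SS'/'HH:MM:SS', or an ISO 8601 duration")
}
```

Under this sketch, `"07:30"`, `"00:07:30"`, `"PT7M30S"`, and `450` all normalize to the same 450-second half-window, which is why the declared default could move between forms without changing behavior.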
Summary
Adds `read_waterdata_nearest_continuous(targets, ...)`: for each target timestamp, returns the single continuous observation closest to that timestamp, fetched in one HTTP round-trip (auto-chunked when the CQL filter gets long).

Try it
Copy-paste into an R session — installs this branch and runs one end-to-end call:
One HTTP request goes out, carrying a three-clause `(time >= t-window AND time <= t+window) OR ...` CQL filter; three rows come back, one per target. Each `time` is the nearest observation on the 15-minute grid; `target_time` identifies which target the row corresponds to.

(No `API_USGS_PAT` needed to run the snippet: the Water Data API serves unauthenticated requests at a lower rate limit. Set it if you're iterating.)

Tie-mode and wider-window variations:
Why
The Water Data API's `time=` parameter treats a single instant as an exact match, not a nearest-match: `time = "2023-06-15T10:30:31Z"` on a 15-minute gauge returns 0 rows. The advertised `sortby` parameter would make "nearest" expressible as `filter = "time <= 'target'"` + `sortby = -time` + `limit = 1`, but `sortby` is per-query, so N targets would mean N HTTP round-trips. There is no `T_NEAREST` CQL function either.

The narrow-window + client-side reduction implemented here is the one pattern that folds N targets into a single request today.
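The client-side reduction can be sketched as follows. `reduce_nearest` is a hypothetical stand-in with illustrative column names (`time`, `value`), not the helper's actual internals.

```r
# Hypothetical sketch: given the observations that came back for one
# target, keep the row nearest to the target, resolving exact-midpoint
# ties per `on_tie`. Column names are illustrative assumptions.
reduce_nearest <- function(obs, target,
                           on_tie = c("first", "last", "mean")) {
  on_tie <- match.arg(on_tie)
  if (nrow(obs) == 0) return(obs)  # empty-window target is dropped
  d <- abs(as.numeric(difftime(obs$time, target, units = "secs")))
  nearest <- obs[d == min(d), , drop = FALSE]
  if (nrow(nearest) == 1) return(nearest)
  switch(on_tie,
    first = nearest[1, , drop = FALSE],
    last  = nearest[nrow(nearest), , drop = FALSE],
    mean  = {
      out <- nearest[1, , drop = FALSE]
      out$value <- mean(nearest$value)  # average the numeric column
      out$time <- target                # set time to the target itself
      out
    })
}
```

The `mean` branch matches the documented tie behavior: numeric columns are averaged and `time` is set to the target, since neither tied observation is more representative than the other.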
Knobs
- `window = "PT7M30S"`: half-window around each target (7.5 minutes, ISO 8601; half of the 15-minute continuous cadence, so most windows contain exactly one observation). Accepts:
  - ISO 8601 duration strings (`"PT7M30S"`, `"PT15M"`, `"PT1H"`, ...) or any other string `lubridate::duration()` parses (e.g. `"7 minutes 30 seconds"`)
  - `"MM:SS"` or `"HH:MM:SS"` clock-style strings (e.g. `"07:30"`, `"15:00"`, `"00:30:00"`, `"01:00:00"`)
  - numeric seconds, a `difftime`, or a `lubridate::Period`/`Duration`
- `on_tie = "first"`: how to resolve ties when a target falls at the midpoint between two grid points (rare but possible). Alternatives: `"last"` (keep the later observation), `"mean"` (average numeric columns; set `time` to the target).

Multi-site calls return one row per `(target, monitoring_location_id)` pair. Targets with no observations in their window are silently dropped. Passing `time`, `filter`, or `filter_lang` raises an error; the helper builds those itself.

Naming
Renamed from the Python `get_nearest_continuous` to `read_waterdata_nearest_continuous` to match the R package's convention (`read_waterdata_*` for OGC-backed functions).

Relationship to #880
This PR is built on top of #880 (Add CQL filter passthrough to OGC waterdata functions) and will look lighter once that lands. The helper's core trick, fanning N targets into one request, is only possible because #880 adds `filter`/`filter_lang` support plus automatic URL-length-safe chunking to `read_waterdata_continuous`. The branch `feat/nearest-continuous` is stacked on `feat/cql-filter-passthrough`, so until #880 merges the diff here will include both changesets; after #880 merges, the commits on its branch become common ancestors and this PR's diff reduces to the one commit introducing `read_waterdata_nearest_continuous` and its tests. Please merge #880 first.
Test plan
- Unit tests with `with_mocked_bindings`: 44/44 pass. Covers filter construction (one bracketed AND clause per target, joined by OR), nearest-observation reduction, all three `on_tie` modes (first/last/mean), missing-window drop, multi-site fan-out, empty targets, forbidden-kwarg validation, and `window` input shapes (ISO 8601 like `"PT7M30S"`/`"PT15M"`/`"PT1H"`, natural-language strings like `"7 minutes 30 seconds"`, `"MM:SS"` and `"HH:MM:SS"` including fractional seconds, numeric seconds, `difftime`, `lubridate::Period`).
- `R CMD check`: 0 errors, 0 warnings, 3 unrelated NOTEs.
- Live call against `USGS-0223850000060` with three off-grid targets (output shown in the Try it section above). One HTTP request, three rows returned, `time` snapped to the 15-minute grid, `target_time` preserved as `POSIXct`.

Marked as draft pending maintainer review.
🤖 Generated with Claude Code