Add CQL filter passthrough to OGC waterdata functions #880
thodson-usgs wants to merge 3 commits into DOI-USGS:develop
Every OGC read_waterdata_* function (continuous, daily, field_measurements, monitoring_location, ts_meta, latest_continuous, latest_daily, channel) now accepts `filter` and `filter_lang` arguments that are forwarded as the OGC `filter` / `filter-lang` query parameters. The R argument `filter_lang` is translated to the hyphenated `filter-lang` URL parameter that the service expects.

When a filter is a top-level OR chain that exceeds a conservative URI-length budget (5 KB), the library transparently splits it into multiple sub-requests and concatenates (and deduplicates) the results. This keeps the common multi-interval use case out of the caller's way -- they don't need to know about the server's 414 boundary.

Mirrors dataretrieval-python PR DOI-USGS/dataretrieval-python#238.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
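The argument-name translation described in this commit can be illustrated with a minimal sketch. The helper name `to_query_params` is hypothetical and is not taken from the PR; it only shows the idea of renaming `filter_lang` to `filter-lang` and dropping unset arguments before a query string is built.

```r
# Hypothetical helper (not the PR's code): map R-friendly argument names
# to the hyphenated query keys, dropping NULL / NA entries first.
to_query_params <- function(args) {
  unset <- vapply(args, function(x) is.null(x) || all(is.na(x)), logical(1))
  args <- args[!unset]
  # Hyphens are not valid in R argument names, so the rename happens here.
  names(args) <- sub("^filter_lang$", "filter-lang", names(args))
  args
}

params <- to_query_params(list(
  filter = "time >= '2023-01-01T00:00:00Z'",
  filter_lang = "cql-text",
  limit = NA
))
names(params)
```

Keeping the rename in one place means every `read_waterdata_*` wrapper can expose the same R-style argument name.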
|
@ldecicco-USGS, would you give some high-level feedback on how we could expose waterdata filters through the dataretrieval API? Feel free to review more of the code. It's AI generated, so you might start with a quick pass and we'll use your feedback to steer the bot. Feel free to ask questions here or iterate with your own bot locally. A couple of iterations of this might get it to a shippable state. Then I'll take that feedback back to the Python implementation. |
…rame handling

Addresses feedback on the companion Python PR (DOI-USGS/dataretrieval-python#238):

- Skip chunking when `filter_lang` is not `cql-text`. The splitter is text- and single-quote-aware and would corrupt cql-json. Non-cql-text filters are now forwarded as-is.
- Budget each chunk against the server's URL byte limit (`.WATERDATA_URL_BYTE_LIMIT = 8000`, matching the observed HTTP 414 cliff of ~8,200 bytes) rather than a fixed raw filter length. `effective_filter_budget` probes the non-filter URL, subtracts, and converts back to raw CQL bytes using the max per-clause encoding ratio (with the " OR " joiner included: in R's percent-encoding the joiner inflates 2x, heavier than typical clause ratios, and the previous clause-only max let chunks overflow the URL cap).
- When the non-filter URL already exceeds the byte limit, return a budget larger than the filter so it passes through unchanged; one clear 414 is better feedback than N failing sub-requests.
- Move filter chunking out of the recursive `get_ogc_data` path and into the post-transform branch, so the probe sees the real request args. Collect raw frames, drop empty ones before `rbind` (a plain empty frame first would downgrade a later sf result and drop geometry/CRS), and dedup on the pre-rename feature `id`.
- Add regression tests for doubled single-quote CQL escape, the URL byte budget guarantee, and non-cql-text pass-through.
- Document CQL filter usage with two examples on `read_waterdata_continuous`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
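The budget arithmetic in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `effective_filter_budget`: the names `encoded_bytes` and `filter_budget` are hypothetical, percent-encoding is approximated with `utils::URLencode(reserved = TRUE)`, and the overhead of the `filter=` key itself is assumed to be counted in `base_url`.

```r
url_byte_limit <- 8000  # value quoted in the commit message above

# Bytes each string occupies after percent-encoding (approximation).
encoded_bytes <- function(x) {
  vapply(x, function(s) nchar(utils::URLencode(s, reserved = TRUE), type = "bytes"),
         integer(1), USE.NAMES = FALSE)
}

# Hypothetical sketch: how many raw CQL bytes fit in one request.
filter_budget <- function(base_url, clauses, limit = url_byte_limit) {
  available <- limit - nchar(base_url, type = "bytes")
  raw_total <- sum(nchar(clauses, type = "bytes")) + 4 * (length(clauses) - 1)
  if (available <= 0) {
    # The non-filter URL is already too long: return a budget larger than
    # the whole filter so it passes through unchanged.
    return(raw_total + 1)
  }
  # Worst-case inflation across clauses AND the " OR " joiner.
  pieces <- c(clauses, " OR ")
  ratio <- max(encoded_bytes(pieces) / nchar(pieces, type = "bytes"))
  floor(available / ratio)
}

clauses <- c("time >= '2023-01-01T00:00:00Z'", "time <= '2023-01-02T00:00:00Z'")
budget <- filter_budget(strrep("x", 1000), clauses)
```

Because encoding is applied character by character, a chunk whose raw size is within `budget` encodes to at most `budget * ratio` bytes, which stays within the available space.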
Mirrors the helper organization in the merged Python PR (DOI-USGS/dataretrieval-python#238) so the per-language implementations stay easy to read alongside each other.

The single-vs-fanned distinction is now expressed once, in `plan_filter_chunks`, which always returns a list of "chunk overrides" -- `list(NULL)` for "send `args` as-is", or a list of chunked cql-text expressions otherwise. `fetch_chunks` issues one request per entry and returns the per-chunk frames plus the first sub-request (for the `request` attribute). `combine_chunk_frames` handles the empty-frame and dedup-by-`id` cases.

`get_ogc_data` is now a linear pipeline:

    chunks <- plan_filter_chunks(args)
    fetched <- fetch_chunks(args, chunks)
    return_list <- combine_chunk_frames(fetched$frames)
    req <- fetched$req
    ... post-processing ...

Behavior unchanged: same chunk sizing (URL-byte-budget aware), same cql-text-only guard, same empty-frame and id-dedup handling. The only observable difference is that the `request` attribute now points at the first sub-request instead of the last (matching Python's choice of representative metadata), which is a debugging-only change for the chunked path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
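The empty-frame and dedup-by-`id` handling mentioned in this commit might look something like the sketch below. It is not the PR's `combine_chunk_frames`: the name `combine_frames_sketch` is hypothetical, and plain data frames stand in for the sf objects the real code has to preserve.

```r
# Hypothetical sketch: combine per-chunk results, dropping empty frames
# before rbind and keeping the first row seen for each feature id.
combine_frames_sketch <- function(frames, id_col = "id") {
  non_empty <- Filter(function(df) nrow(df) > 0, frames)
  if (length(non_empty) == 0) {
    return(frames[[1]])  # nothing came back; return an empty frame
  }
  combined <- do.call(rbind, non_empty)
  combined <- combined[!duplicated(combined[[id_col]]), , drop = FALSE]
  rownames(combined) <- NULL
  combined
}

chunk_a <- data.frame(id = c("x1", "x2"), value = c(1, 2), stringsAsFactors = FALSE)
chunk_b <- data.frame()  # a sub-request that matched nothing
chunk_c <- data.frame(id = c("x2", "x3"), value = c(2, 3), stringsAsFactors = FALSE)
combined <- combine_frames_sketch(list(chunk_b, chunk_a, chunk_c))
```

Filtering out empty frames first avoids letting a column-less frame determine the class or columns of the combined result.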
|
There's a ton of overlap between this and #879. I'll consider how the filter argument could be used in monitoring_locations, ts_meta, combine... but again, because the chunking is being added to the other PR we're waiting on, I'm going to close this one. |
|
I didn't think the time arguments would accept multi-window requests, which the filter allows (e.g., |
|
So this was my thought process: my gut says there would be way more people trying stuff like that and then either getting the wrong results unknowingly or complaining that dataRetrieval is broken, versus those who would use the filter for time windows. Users can pass custom CQL2 into the read_waterdata function like this example:

    # A wildcard in CQL2 is %
    # Here's how to get HUCs that fall within 02070010
    cql_huc_wildcard <- '{
    "op": "like",
    "args": [
    { "property": "hydrologic_unit_code" },
    "02070010%"
    ]
    }'
    what_huc_sites <- read_waterdata(service = "monitoring-locations",
                                     CQL = cql_huc_wildcard)

So we're not prohibiting a complex time window. I'm more inclined to set up an article, or expand this one: |
Summary

Every OGC `read_waterdata_*` function (`read_waterdata_continuous`, `read_waterdata_daily`, `read_waterdata_field_measurements`, `read_waterdata_monitoring_location`, `read_waterdata_ts_meta`, `read_waterdata_latest_continuous`, `read_waterdata_latest_daily`, `read_waterdata_channel`) now accepts `filter` and `filter_lang` arguments that are forwarded as the OGC `filter` / `filter-lang` query parameters. The R argument `filter_lang` is translated to the hyphenated `filter-lang` URL parameter that the service expects (hyphens aren't valid in R argument names).

When a `filter` is a top-level `OR` chain that exceeds a conservative URI-length budget (5 KB), the library transparently splits it into multiple sub-requests and concatenates the results, deduplicated by id. This keeps the common multi-interval use case out of the caller's way: they don't need to know about the server's 414 boundary.

This mirrors the Python companion PR: DOI-USGS/dataretrieval-python#238.
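To make the splitting idea concrete, here is a rough sketch of a paren- and quote-aware splitter plus a greedy chunker. These are illustrative stand-ins with hypothetical names, not the PR's `split_top_level_or` / `chunk_cql_or`; the sketch assumes the joiner is exactly `" OR "` with single spaces and counts characters rather than encoded bytes.

```r
# Split on " OR " only when outside parentheses and string literals.
split_top_level_or_sketch <- function(filter) {
  if (!nzchar(filter)) return(filter)
  chars <- strsplit(filter, "")[[1]]
  n <- length(chars)
  depth <- 0
  in_quote <- FALSE
  clauses <- character(0)
  start <- 1
  i <- 1
  while (i <= n) {
    ch <- chars[i]
    if (ch == "'") {
      # A doubled '' escape toggles twice, so we stay inside the literal.
      in_quote <- !in_quote
    } else if (!in_quote && ch == "(") {
      depth <- depth + 1
    } else if (!in_quote && ch == ")") {
      depth <- depth - 1
    } else if (!in_quote && depth == 0 && i > start && i + 4 <= n &&
               toupper(paste(chars[i:(i + 3)], collapse = "")) == " OR ") {
      clauses <- c(clauses, paste(chars[start:(i - 1)], collapse = ""))
      start <- i + 4
      i <- i + 3  # skip the rest of the joiner
    }
    i <- i + 1
  }
  c(clauses, paste(chars[start:n], collapse = ""))
}

# Greedily pack clauses into chunks of at most max_len characters.
chunk_or_clauses_sketch <- function(clauses, max_len) {
  if (length(clauses) < 2 || any(nchar(clauses) > max_len)) {
    return(paste(clauses, collapse = " OR "))  # send as-is
  }
  chunks <- character(0)
  current <- clauses[1]
  for (clause in clauses[-1]) {
    candidate <- paste(current, clause, sep = " OR ")
    if (nchar(candidate) <= max_len) {
      current <- candidate
    } else {
      chunks <- c(chunks, current)
      current <- clause
    }
  }
  c(chunks, current)
}

parts <- split_top_level_or_sketch("a = 1 OR (b = 2 OR c = 3) OR d = 'x OR y'")
chunks <- chunk_or_clauses_sketch(c("a = 1", "b = 2", "c = 3"), max_len = 14)
```

Each resulting chunk is itself a valid `OR` expression, so it can be sent as the `filter` of an independent sub-request.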
Motivation

The OGC `time` parameter accepts a single instant, a single bounded interval, or a half-bounded interval; it does not accept a list of intervals. For workflows that need to pull short windows of continuous data around many field-measurement timestamps (e.g., pairing discrete discharge measurements with the index velocity at the time of each measurement), the existing client requires one HTTP round-trip per window.

The waterdata OGC API already supports a `filter` query parameter with CQL OR-expressions, but this isn't currently exposed through the R client's signatures. This PR threads the passthrough through:

Long OR chains are handled for the caller:
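As a hedged illustration of the multi-window use case, the sketch below builds one OR clause per measurement timestamp. The helper `window_filter` is hypothetical, and the property name `time`, the quoted-timestamp literal syntax, and the site and parameter codes in the commented call are assumptions rather than details confirmed by this PR.

```r
# Hypothetical helper: one "[t - pad, t + pad]" clause per timestamp,
# joined into a single top-level OR chain.
window_filter <- function(times, pad_secs = 900, property = "time") {
  fmt <- function(x) format(x, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  clauses <- sprintf("(%s >= '%s' AND %s <= '%s')",
                     property, fmt(times - pad_secs),
                     property, fmt(times + pad_secs))
  paste(clauses, collapse = " OR ")
}

measured <- as.POSIXct(c("2023-06-01 14:30:00", "2023-07-15 09:00:00"), tz = "UTC")
cql <- window_filter(measured)

# Not run: needs network access and the `filter` / `filter_lang` arguments
# proposed in this PR. Site and parameter codes are placeholders.
# read_waterdata_continuous(
#   monitoring_location_id = "USGS-02238500",
#   parameter_code = "00060",
#   filter = cql,
#   filter_lang = "cql-text"
# )
```

With hundreds of timestamps the resulting string grows quickly, which is the situation the chunking is meant to absorb.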
Chunking behavior

- Top-level `OR` chains are split. The splitter is paren- and quote-aware, so `OR` inside sub-expressions like `(A OR B)` or string literals like `'foo OR bar'` is preserved.
- If the filter has no top-level `OR`, or any single clause already exceeds the budget, the filter is sent as-is (server decides) rather than being mangled.
- Results are deduplicated by the output id (`continuous_id`, `daily_id`, etc.) so overlapping user-supplied OR clauses combine losslessly.
- The budget (`.CQL_FILTER_CHUNK_LEN = 5000`) is private and conservative; the continuous endpoint has been observed to return HTTP 414 around ~7 KB of filter text.

Caveats
- Accepted filter languages are `cql-text` (default) and `cql-json`; `cql2-text` / `cql2-json` return `400 Invalid filter language`.

Changes
- `R/construct_api_requests.R`: translates `filter_lang` → `filter-lang` URL key and adds `filter` / `filter-lang` to the `single_params` list.
- `R/get_ogc_data.R`: adds private `split_top_level_or` and `chunk_cql_or` helpers, and fans a long `filter` into per-chunk sub-requests when needed, concatenating and deduping results by output id.
- `R/read_waterdata_{continuous,daily,field_measurements,monitoring_location,ts_meta,latest_continuous,latest_daily,channel}.R`: add `filter` and `filter_lang` arguments with documentation.
- `tests/testthat/tests_userFriendly_fxns.R`: adds non-network unit tests for the passthrough, hyphenation, and splitter/chunker semantics.
- `NEWS`: short announcement.

Test plan
- `NOT_CRAN=true API_USGS_PAT=… Rscript -e 'devtools::test()'`: 303/303 pass (includes ~9 new tests for filter/filter_lang/split/chunk).
- `Rscript -e 'devtools::check(vignettes = FALSE, args = c("--no-tests", "--no-examples", "--no-manual"))'`: 0 errors, 0 warnings, 0 notes related to these changes.

Marked as draft pending maintainer review.
🤖 Generated with Claude Code