Skip to content

Add functions to update custom datasets#45

Open
jashapiro wants to merge 13 commits into
mainfrom
jashapiro/update-datasets
Open

Add functions to update custom datasets#45
jashapiro wants to merge 13 commits into
mainfrom
jashapiro/update-datasets

Conversation

@jashapiro
Copy link
Copy Markdown
Member

Closes #43

Here I am adding a few external functions to allow updates to datasets:

  • replace_dataset_data() allows updating a dataset with a new set of samples and/or projects. It is essentially the same as create_dataset(), but does not allow modification of the format or email.
  • add_dataset_samples()/remove_dataset_samples() add or remove samples (or a project, but the name seemed fine with samples unless you have another idea)
  • set_dataset_email() does what it says... I plan to also allow setting the email when starting processing, but that will be part of the download process in Add download_dataset() function #35

I also renamed the get_dataset_info() function temporarily, or rather I renamed it as get_dataset_detail and made it internal. I plan to write wrappers around it for #34 and #36 with simpler lists for users (they don't need all that nesting).

The main api functions for now still return the full detail from the api.

There are also a number of internal functions for things like merging dataset lists and such.

Next up: processing and downloading a dataset!

Copy link
Copy Markdown
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_dataset_samples()/remove_dataset_samples() add or remove samples (or a project, but the name seemed fine with samples unless you have another idea)

This seems fine to me.

I did a first round of review here to understand how all the functions interact, but I didn't directly run/test any code yet because I had some questions about functionality - namely, is the replace function really needed and/or is it the right approach for users.

Comment thread R/datasets.R
#' @keywords internal
#'
#' @returns the dataset ID as a length-1 character string
resolve_dataset_id <- function(dataset) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread R/datasets.R
"Dataset {response$id} created.",
" Use get_dataset_info() to inspect the dataset."
))
message(glue::glue("Dataset {response$id} created."))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I've already commented this but a small thought is to check the id is there, but also if the id is not there we have bigger problems so the check probably isn't needed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a 404/API status check here too (In a separate PR), but I if the API isn't returning id, we are definitely in the bad place.

Comment thread R/datasets.R
dataset_id <- dataset
}
#' @keywords internal
get_dataset_detail <- function(dataset, auth_token) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread R/datasets.R
req_perform() |>
resp_body_json()
},
httr2_http_409 = \(cnd) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if we wanted to catch other types of errors too, but I'm not sure we'd offer a more informative error message than what we'd get anyways

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be a bit better with 403 errors in particular, which indicate something off with authorization. I have some ideas for doing this better, but I think I would rather do this in a separate PR, as I would want to touch almost every existing http error handler.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #48 to track this

Comment thread R/datasets.R Outdated
Comment thread R/datasets.R Outdated
Comment thread R/datasets.R
#' \dontrun{
#' replace_dataset_data(ds, auth_token = token, samples = c("SCPCS000001", "SCPCS000002"))
#' }
replace_dataset_data <- function(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does feel strange to me as a user to pass in an existing dataset to a function whose effect is to essentially remove everything from it (formatting aside). I would think, just create a new dataset altogether if I'm replacing all the data. To me, a dataset and the data it contains feel pretty equivalent so my first instinct would be to just create a new one if I wanted to wholesale replace.

One question though is, is only one dataset allowed at a time? If so, I would think that just re-issuing create_dataset() could have the effect (can be controlled with an arg) of just overwriting the existing dataset, so a standalone replace function wouldn't necessarily be need. But if not and multiple datasets can be floating around, then replace makes more sense to me.

Can you share a little more about how you envision this being used?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that most of the time people will be using the add_ or remove_ functions, but since the underlying mechanism is a full replace, it was easy enough to have this ability to just start fresh. It is every so slightly nicer to the API not to create a new dataset id every time and abandon the existing one.

Comment thread R/datasets.R
#' \dontrun{
#' set_dataset_email(ds, auth_token = token, email = "user@example.com")
#' }
set_dataset_email <- function(dataset, auth_token, email) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to capture the email provided to auth_token to use as a default here. But it's probably best to make them be explicit with this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I do see we have email as an optional argument in create_dataset(). You mentioned that you plan to allow them to set the email at the start of processing too. So, multiple routes would be available to provide the email here? That's fine with me.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about making an environment variable as the default, but really I expect that most of the time people using the API will not set email addresses, and will just poll the API for when the dataset is ready, so I didn't do that.

Comment thread R/datasets.R Outdated
Comment on lines +392 to +393
keep <- purrr::map_lgl(existing, \(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0)
existing[keep]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not purrr::keep()?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

Comment thread R/datasets.R Outdated
#'
#' @returns the updated dataset detail as a list (invisibly)
#'
#' @import httr2
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually needed since it's just calling other scpcar functions?

jashapiro and others added 2 commits June 1, 2026 13:07
Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>
Copy link
Copy Markdown
Member Author

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I made some small changes in as requested, and filed an issue about error handling, as there are enough things stacked here to make updating that broadly a bit annoying at the moment (and it would be annoying anyway, touching a many existing functions to do it properly).

I replied about the replace_function(), mostly to say that I don't expect it to be used much at all, but it more closely mirrors the underlying API behavior, so it seemed appropriate to have as an option. As far as I understand, datasets never disappear, so if someone is never going to use a particular dataset, it seems good to let them replace the contents rather than starting fully fresh.

Comment thread R/datasets.R
req_perform() |>
resp_body_json()
},
httr2_http_409 = \(cnd) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be a bit better with 403 errors in particular, which indicate something off with authorization. I have some ideas for doing this better, but I think I would rather do this in a separate PR, as I would want to touch almost every existing http error handler.

Comment thread R/datasets.R
"Dataset {response$id} created.",
" Use get_dataset_info() to inspect the dataset."
))
message(glue::glue("Dataset {response$id} created."))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a 404/API status check here too (In a separate PR), but I if the API isn't returning id, we are definitely in the bad place.

Comment thread R/datasets.R
#' \dontrun{
#' replace_dataset_data(ds, auth_token = token, samples = c("SCPCS000001", "SCPCS000002"))
#' }
replace_dataset_data <- function(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that most of the time people will be using the add_ or remove_ functions, but since the underlying mechanism is a full replace, it was easy enough to have this ability to just start fresh. It is every so slightly nicer to the API not to create a new dataset id every time and abandon the existing one.

Comment thread R/datasets.R
#' \dontrun{
#' set_dataset_email(ds, auth_token = token, email = "user@example.com")
#' }
set_dataset_email <- function(dataset, auth_token, email) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about making an environment variable as the default, but really I expect that most of the time people using the API will not set email addresses, and will just poll the API for when the dataset is ready, so I didn't do that.

Comment thread R/datasets.R Outdated
Comment on lines +392 to +393
keep <- purrr::map_lgl(existing, \(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0)
existing[keep]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

Comment thread R/datasets.R
req_perform() |>
resp_body_json()
},
httr2_http_409 = \(cnd) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #48 to track this

@jashapiro jashapiro requested a review from sjspielman June 1, 2026 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add update dataset functions

2 participants