Add functions to update custom datasets#45
Conversation
we are in R after all
sjspielman
left a comment
There was a problem hiding this comment.
add_dataset_samples()/remove_dataset_samples() add or remove samples (or a project, but the name seemed fine with samples unless you have another idea)
This seems fine to me.
I did a first round of review here to understand how all the functions interact, but I didn't directly run/test any code yet because I had some questions about functionality - namely, is the replace function really needed and/or is it the right approach for users.
| #' @keywords internal | ||
| #' | ||
| #' @returns the dataset ID as a length-1 character string | ||
| resolve_dataset_id <- function(dataset) { |
| "Dataset {response$id} created.", | ||
| " Use get_dataset_info() to inspect the dataset." | ||
| )) | ||
| message(glue::glue("Dataset {response$id} created.")) |
There was a problem hiding this comment.
Maybe I've already commented this but a small thought is to check the id is there, but also if the id is not there we have bigger problems so the check probably isn't needed.
There was a problem hiding this comment.
I will add a 404/API status check here too (In a separate PR), but I if the API isn't returning id, we are definitely in the bad place.
| dataset_id <- dataset | ||
| } | ||
| #' @keywords internal | ||
| get_dataset_detail <- function(dataset, auth_token) { |
| req_perform() |> | ||
| resp_body_json() | ||
| }, | ||
| httr2_http_409 = \(cnd) { |
There was a problem hiding this comment.
I wondered if we wanted to catch other types of errors too, but I'm not sure we'd offer a more informative error message than what we'd get anyways
There was a problem hiding this comment.
I could be a bit better with 403 errors in particular, which indicate something off with authorization. I have some ideas for doing this better, but I think I would rather do this in a separate PR, as I would want to touch almost every existing http error handler.
| #' \dontrun{ | ||
| #' replace_dataset_data(ds, auth_token = token, samples = c("SCPCS000001", "SCPCS000002")) | ||
| #' } | ||
| replace_dataset_data <- function( |
There was a problem hiding this comment.
It does feel strange to me as a user to pass in an existing dataset to a function whose effect is to essentially remove everything from it (formatting aside). I would think, just create a new dataset altogether if I'm replacing all the data. To me, a dataset and the data it contains feel pretty equivalent so my first instinct would be to just create a new one if I wanted to wholesale replace.
One question though is, is only one dataset allowed at a time? If so, I would think that just re-issuing create_dataset() could have the effect (can be controlled with an arg) of just overwriting the existing dataset, so a standalone replace function wouldn't necessarily be need. But if not and multiple datasets can be floating around, then replace makes more sense to me.
Can you share a little more about how you envision this being used?
There was a problem hiding this comment.
I would expect that most of the time people will be using the add_ or remove_ functions, but since the underlying mechanism is a full replace, it was easy enough to have this ability to just start fresh. It is every so slightly nicer to the API not to create a new dataset id every time and abandon the existing one.
| #' \dontrun{ | ||
| #' set_dataset_email(ds, auth_token = token, email = "user@example.com") | ||
| #' } | ||
| set_dataset_email <- function(dataset, auth_token, email) { |
There was a problem hiding this comment.
I wonder if we want to capture the email provided to auth_token to use as a default here. But it's probably best to make them be explicit with this.
There was a problem hiding this comment.
Also, I do see we have email as an optional argument in create_dataset(). You mentioned that you plan to allow them to set the email at the start of processing too. So, multiple routes would be available to provide the email here? That's fine with me.
There was a problem hiding this comment.
I was thinking about making an environment variable as the default, but really I expect that most of the time people using the API will not set email addresses, and will just poll the API for when the dataset is ready, so I didn't do that.
| keep <- purrr::map_lgl(existing, \(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0) | ||
| existing[keep] |
| #' | ||
| #' @returns the updated dataset detail as a list (invisibly) | ||
| #' | ||
| #' @import httr2 |
There was a problem hiding this comment.
Is this actually needed since it's just calling other scpcar functions?
Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>
jashapiro
left a comment
There was a problem hiding this comment.
Thanks for the review. I made some small changes in as requested, and filed an issue about error handling, as there are enough things stacked here to make updating that broadly a bit annoying at the moment (and it would be annoying anyway, touching a many existing functions to do it properly).
I replied about the replace_function(), mostly to say that I don't expect it to be used much at all, but it more closely mirrors the underlying API behavior, so it seemed appropriate to have as an option. As far as I understand, datasets never disappear, so if someone is never going to use a particular dataset, it seems good to let them replace the contents rather than starting fully fresh.
| req_perform() |> | ||
| resp_body_json() | ||
| }, | ||
| httr2_http_409 = \(cnd) { |
There was a problem hiding this comment.
I could be a bit better with 403 errors in particular, which indicate something off with authorization. I have some ideas for doing this better, but I think I would rather do this in a separate PR, as I would want to touch almost every existing http error handler.
| "Dataset {response$id} created.", | ||
| " Use get_dataset_info() to inspect the dataset." | ||
| )) | ||
| message(glue::glue("Dataset {response$id} created.")) |
There was a problem hiding this comment.
I will add a 404/API status check here too (In a separate PR), but I if the API isn't returning id, we are definitely in the bad place.
| #' \dontrun{ | ||
| #' replace_dataset_data(ds, auth_token = token, samples = c("SCPCS000001", "SCPCS000002")) | ||
| #' } | ||
| replace_dataset_data <- function( |
There was a problem hiding this comment.
I would expect that most of the time people will be using the add_ or remove_ functions, but since the underlying mechanism is a full replace, it was easy enough to have this ability to just start fresh. It is every so slightly nicer to the API not to create a new dataset id every time and abandon the existing one.
| #' \dontrun{ | ||
| #' set_dataset_email(ds, auth_token = token, email = "user@example.com") | ||
| #' } | ||
| set_dataset_email <- function(dataset, auth_token, email) { |
There was a problem hiding this comment.
I was thinking about making an environment variable as the default, but really I expect that most of the time people using the API will not set email addresses, and will just poll the API for when the dataset is ready, so I didn't do that.
| keep <- purrr::map_lgl(existing, \(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0) | ||
| existing[keep] |
| req_perform() |> | ||
| resp_body_json() | ||
| }, | ||
| httr2_http_409 = \(cnd) { |
Closes #43
Here I am adding a few external functions to allow updates to datasets:
replace_dataset_data()allows updating a dataset with a new set of samples and/or projects. It is essentially the same ascreate_dataset(), but does not allow modification of the format or email.add_dataset_samples()/remove_dataset_samples()add or remove samples (or a project, but the name seemed fine with samples unless you have another idea)set_dataset_email()does what it says... I plan to also allow setting the email when starting processing, but that will be part of the download process in Adddownload_dataset()function #35I also renamed the
get_dataset_info()function temporarily, or rather I renamed it asget_dataset_detailand made it internal. I plan to write wrappers around it for #34 and #36 with simpler lists for users (they don't need all that nesting).The main api functions for now still return the full detail from the api.
There are also a number of internal functions for things like merging dataset lists and such.
Next up: processing and downloading a dataset!