Add functions to update custom datasets by jashapiro · Pull Request #45 · AlexsLemonade/ScPCAr

jashapiro · 2026-05-29T17:51:53Z

Closes #43

Here I am adding a few external functions to allow updates to datasets:

replace_dataset_data() allows updating a dataset with a new set of samples and/or projects. It is essentially the same as create_dataset(), but does not allow modification of the format or email.
add_dataset_samples()/remove_dataset_samples() add or remove samples (or a project, but the name seemed fine with samples unless you have another idea)
set_dataset_email() does what it says... I plan to also allow setting the email when starting processing, but that will be part of the download process in Add download_dataset() function #35

I also renamed get_dataset_info() to get_dataset_detail() and made it internal, at least for now. I plan to write wrappers around it for #34 and #36 that return simpler lists for users (they don't need all that nesting).

The main API functions for now still return the full detail from the API.

There are also a number of internal functions for things like merging dataset lists and such.
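
To make the add/remove behavior concrete, here is a minimal sketch of how dataset lists might be merged or pruned before a full replace. It assumes the dataset data is a named list keyed by project ID, with `SINGLE_CELL` and `SPATIAL` sample vectors per project (the shape suggested by the review hunks in this PR). `merge_dataset_data()` and `drop_dataset_samples()` are hypothetical names, not the internal functions added here, and the project IDs are illustrative.

```r
# Hypothetical helper: combine sample IDs from two sources without duplicates
combine_ids <- function(a, b) unique(c(as.character(unlist(a)), as.character(unlist(b))))

# Hypothetical helper: union `new` into `existing`, per project and modality
merge_dataset_data <- function(existing, new) {
  modalities <- c("SINGLE_CELL", "SPATIAL")
  projects <- union(names(existing), names(new))
  merged <- lapply(projects, function(project) {
    per_modality <- lapply(modalities, function(m) {
      # A missing project or modality yields NULL, which combine_ids() treats as empty
      combine_ids(existing[[project]][[m]], new[[project]][[m]])
    })
    stats::setNames(per_modality, modalities)
  })
  stats::setNames(merged, projects)
}

# Hypothetical helper: remove samples, then drop projects left with no samples
drop_dataset_samples <- function(existing, samples) {
  pruned <- lapply(existing, function(project) {
    lapply(project, function(ids) setdiff(as.character(unlist(ids)), samples))
  })
  Filter(function(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0, pruned)
}

existing <- list(
  SCPCP000001 = list(SINGLE_CELL = "SCPCS000001", SPATIAL = character(0))
)
new <- list(
  SCPCP000001 = list(SINGLE_CELL = c("SCPCS000001", "SCPCS000002")),
  SCPCP000002 = list(SPATIAL = "SCPCS000100")
)
merged <- merge_dataset_data(existing, new)
pruned <- drop_dataset_samples(merged, "SCPCS000100")
```

In this sketch the result of either helper would be sent as the complete new contents of the dataset, which is why a plain replace function falls out of the same mechanism.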

Next up: processing and downloading a dataset!

sjspielman

> add_dataset_samples()/remove_dataset_samples() add or remove samples (or a project, but the name seemed fine with samples unless you have another idea)

This seems fine to me.

I did a first round of review here to understand how all the functions interact, but I didn't directly run/test any code yet because I had some questions about functionality - namely, is the replace function really needed and/or is it the right approach for users.

sjspielman · 2026-06-01T15:57:47Z

+#' @keywords internal
+#'
+#' @returns the dataset ID as a length-1 character string
+resolve_dataset_id <- function(dataset) {


sjspielman · 2026-06-01T16:01:01Z

-    "Dataset {response$id} created.",
-    " Use get_dataset_info() to inspect the dataset."
-  ))
+  message(glue::glue("Dataset {response$id} created."))


Maybe I've already commented this but a small thought is to check the id is there, but also if the id is not there we have bigger problems so the check probably isn't needed.

jashapiro · 2026-06-01T16:57:17Z

I will add a 404/API status check here too (in a separate PR), but if the API isn't returning id, we are definitely in the bad place.

sjspielman · 2026-06-01T16:01:20Z

-    dataset_id <- dataset
-  }
+#' @keywords internal
+get_dataset_detail <- function(dataset, auth_token) {


sjspielman · 2026-06-01T16:04:05Z

+        req_perform() |>
+        resp_body_json()
+    },
+    httr2_http_409 = \(cnd) {


I wondered if we wanted to catch other types of errors too, but I'm not sure we'd offer a more informative error message than what we'd get anyways

jashapiro · 2026-06-01T16:56:07Z

I could be a bit better with 403 errors in particular, which indicate something off with authorization. I have some ideas for doing this better, but I think I would rather do this in a separate PR, as I would want to touch almost every existing HTTP error handler.

jashapiro · 2026-06-01T17:31:25Z

Filed #48 to track this.
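
For context on this thread: httr2 signals HTTP failures as R conditions whose class includes the status code (for example `httr2_http_409`) along with the more general `httr2_http`, which is what lets `tryCatch()` treat specific statuses separately. The sketch below simulates such conditions in base R instead of calling the real API. `fake_http_error()` and `handle_update()` are hypothetical helpers, and the returned strings are made up for illustration.

```r
# Hypothetical stand-in for an httr2 HTTP error: a condition that follows the
# httr2 class naming pattern (httr2_http_<status>, then httr2_http)
fake_http_error <- function(status) {
  structure(
    list(message = paste("HTTP", status), call = NULL),
    class = c(
      paste0("httr2_http_", status),
      "httr2_http", "httr2_error", "error", "condition"
    )
  )
}

# Hypothetical wrapper: `request_fn` plays the role of the req_perform() pipeline
handle_update <- function(request_fn) {
  tryCatch(
    request_fn(),
    # Status-specific handlers are listed first so they win over the general one
    httr2_http_409 = \(cnd) "handled 409 specifically",
    httr2_http_403 = \(cnd) "handled 403 specifically",
    # Any other HTTP error is caught through the shared parent class
    httr2_http = \(cnd) paste("generic handler:", conditionMessage(cnd))
  )
}

conflict <- handle_update(function() stop(fake_http_error(409)))
other <- handle_update(function() stop(fake_http_error(500)))
success <- handle_update(function() "ok")
```

A shared handler along these lines is one way the 403 case could be covered everywhere at once, though the actual design is left to the follow-up issue.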

sjspielman · 2026-06-01T16:12:25Z

+#' \dontrun{
+#' replace_dataset_data(ds, auth_token = token, samples = c("SCPCS000001", "SCPCS000002"))
+#' }
+replace_dataset_data <- function(


It does feel strange to me as a user to pass in an existing dataset to a function whose effect is to essentially remove everything from it (formatting aside). I would think, just create a new dataset altogether if I'm replacing all the data. To me, a dataset and the data it contains feel pretty equivalent so my first instinct would be to just create a new one if I wanted to wholesale replace.

One question though: is only one dataset allowed at a time? If so, I would think that just re-issuing create_dataset() could have the effect (controlled with an arg) of overwriting the existing dataset, so a standalone replace function wouldn't necessarily be needed. But if not, and multiple datasets can be floating around, then replace makes more sense to me.

Can you share a little more about how you envision this being used?

jashapiro · 2026-06-01T17:01:31Z

I would expect that most of the time people will be using the add_ or remove_ functions, but since the underlying mechanism is a full replace, it was easy enough to have this ability to just start fresh. It is ever so slightly nicer to the API not to create a new dataset id every time and abandon the existing one.

sjspielman · 2026-06-01T16:13:56Z

+#' \dontrun{
+#' set_dataset_email(ds, auth_token = token, email = "user@example.com")
+#' }
+set_dataset_email <- function(dataset, auth_token, email) {


I wonder if we want to capture the email provided to auth_token to use as a default here. But it's probably best to make them be explicit with this.

Also, I do see we have email as an optional argument in create_dataset(). You mentioned that you plan to allow them to set the email at the start of processing too. So, multiple routes would be available to provide the email here? That's fine with me.

jashapiro · 2026-06-01T17:03:18Z

I was thinking about making an environment variable as the default, but really I expect that most of the time people using the API will not set email addresses, and will just poll the API for when the dataset is ready, so I didn't do that.

sjspielman · 2026-06-01T16:21:48Z

+  keep <- purrr::map_lgl(existing, \(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0)
+  existing[keep]


why not purrr::keep()?
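
For readers comparing the two forms: `purrr::keep()` takes a list and a predicate and returns the elements for which the predicate is `TRUE`, which is what the `map_lgl()` plus subsetting pair does in two steps. A small sketch with made-up data (the project and sample IDs are illustrative, and `has_samples` is a name introduced here):

```r
library(purrr)

existing <- list(
  SCPCP000001 = list(SINGLE_CELL = c("SCPCS000001", "SCPCS000002"), SPATIAL = character(0)),
  SCPCP000002 = list(SINGLE_CELL = character(0), SPATIAL = character(0))
)

# Predicate from the diff: does the project still contain any samples?
has_samples <- \(p) length(p$SINGLE_CELL) > 0 || length(p$SPATIAL) > 0

# Two-step form from the diff: compute a logical vector, then subset
two_step <- existing[map_lgl(existing, has_samples)]

# Single call suggested in review
one_call <- keep(existing, has_samples)

same_result <- identical(two_step, one_call)
```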

sjspielman · 2026-06-01T16:24:00Z

+#'
+#' @returns the updated dataset detail as a list (invisibly)
+#'
+#' @import httr2


Is this actually needed since it's just calling other scpcar functions?

Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>

jashapiro

Thanks for the review. I made some small changes as requested, and filed an issue about error handling, as there are enough things stacked here to make updating that broadly a bit annoying at the moment (and it would be annoying anyway, touching many existing functions to do it properly).

I replied about replace_dataset_data(), mostly to say that I don't expect it to be used much at all, but it more closely mirrors the underlying API behavior, so it seemed appropriate to have as an option. As far as I understand, datasets never disappear, so if someone is never going to use a particular dataset, it seems good to let them replace the contents rather than starting fully fresh.

jashapiro added 11 commits May 29, 2026 08:59

add method to scpca_requests

44d4f67

add patch handler and id helper

703bf7a

add functions to replace datasets

2aafe58

make the default format sce

b13a87c

we are in R after all

we need to use PUT not PATCH

1521b51

Add tests

89875f1

add/update docs

1f5881e

docs updates (merge add and remove)

657a7fe

Merge remote-tracking branch 'origin/main' into jashapiro/update-datasets

b4b890c

Add separate function to set email address

1237c3c

make the raw dataset detail an internal function

8234603

jashapiro requested a review from sjspielman May 29, 2026 17:56

This was referenced May 29, 2026

Store auth token in env #46

Open

Add functions to start processing and check status #47

Open

sjspielman reviewed Jun 1, 2026

jashapiro and others added 2 commits June 1, 2026 13:07

Apply suggestions from code review

5625ff0

Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>

use purrr::keep, remove extra httr2 imports

c63d5bd

jashapiro commented Jun 1, 2026

jashapiro requested a review from sjspielman June 1, 2026 18:50
