Skip to content

fix: move runNumber in legacy scientificMetada to top#2609

Draft
minottic wants to merge 6 commits intomasterfrom
run_scientific
Draft

fix: move runNumber in legacy scientificMetada to top#2609
minottic wants to merge 6 commits intomasterfrom
run_scientific

Conversation

@minottic
Copy link
Member

@minottic minottic commented Mar 10, 2026

Description

Since runNumber was moved from scientificMetadata to a top level dataset property #1485 , this PR adds a migration for legacy data and a fallback when posting or patching v3 datasets which moves scientificMetadata.runNumber to runNumber. This 2nd change is to facilitate customers' adoption as they would need to change their ingestions.

Closes #2592

Tests included

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

official documentation info

@minottic minottic requested a review from a team as a code owner March 10, 2026 09:00
Comment on lines +45 to +48
filter: { _id: dataset._id },
update: {
$set: { "scientificMetadata.runNumber": dataset.runNumber },
$unset: {runNumber: ""},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this script addresses the following runNumber format?

                    "runNumber" : {
                        "value" : NumberInt(482),
                        "unit" : ""
                    },

Copy link
Member Author

@minottic minottic Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is anyone doing that? runNumber is simply an integer in the top level schema so I guess noone did that? I think we would have a hard time trying to imagine how everyone could have used runNumber in scientificMetadata, but if you reckon this shape is something they used, I can add to the migration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, what should the migration do if there are unit? Simply concat?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked our DB and some scientificMetadata contains the format above.
This change does not affect our side, as we are going to use high level run number anyways.
But I'm not sure would it affect others or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, happy to add that!

@nitrosx
Copy link
Member

nitrosx commented Mar 11, 2026

I am not oppose to this migration script but it can only be optional.
Even if we have an high level field for run number, it is only a best practice to use this field.
We cannot assume what a facility decides to do with run number: use the high level field or store it in a scientific metadata entry.

Also it is a facility decision if they want to migrate legacy data or not and how to convert it from quantity (if they have it as quantity) to a numeric field.

@minottic
Copy link
Member Author

I am not oppose to this migration script but it can only be optional.
Even if we have an high level field for run number, it is only a best practice to use this field.
We cannot assume what a facility decides to do with run number: use the high level field or store it in a scientific metadata entry.

Also it is a facility decision if they want to migrate legacy data or not and how to convert it from quantity (if they have it as quantity) to a numeric field.

the migration will run only if there are datasets with scientificMetadata.runNumber. I am not sure I understand the comment fully, doesn't this apply to all migrations then? This is a change in behaviour between the old backend and the new one, so a migration is needed. For example the FE runNumber in the datasets overview used to rely on scientificMetadata.runNumber and now fetches the top level run number. People will always be able to use scientificMetadata.runNumber in the v3 endpoints and this will move it to the topLevel runNumber. One could argue why this field was moved to a top level field without checking the old behaviour, but that's another question...

@nitrosx
Copy link
Member

nitrosx commented Mar 11, 2026

I am not oppose to this migration script but it can only be optional.
Even if we have an high level field for run number, it is only a best practice to use this field.
We cannot assume what a facility decides to do with run number: use the high level field or store it in a scientific metadata entry.
Also it is a facility decision if they want to migrate legacy data or not and how to convert it from quantity (if they have it as quantity) to a numeric field.

the migration will run only if there are datasets with scientificMetadata.runNumber. I am not sure I understand the comment fully, doesn't this apply to all migrations then? This is a change in behaviour between the old backend and the new one, so a migration is needed. For example the FE runNumber in the datasets overview used to rely on scientificMetadata.runNumber and now fetches the top level run number. People will always be able to use scientificMetadata.runNumber in the v3 endpoints and this will move it to the topLevel runNumber. One could argue why this field was moved to a top level field without checking the old behaviour, but that's another question...

The migration should run only if the facility decided to do so.
It is not up to SciCat project to force a facility to use high level field runNumber, although we can suggest it as best practice.
If a facility has datasets with scientific metadata entry named runNumber, we can not assume that they want to migrate and the migration makes sense.
I will be happy to discuss in person

Copy link
Member

@nitrosx nitrosx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change, although technical and syntactically correct, needs more community discussion

@minottic
Copy link
Member Author

minottic commented Mar 11, 2026

The migration should run only if the facility decided to do so.

but why not for all migrations then? Also, noone forces anyone to run migrations, they are not run on startup. We do the same exact thing here

Say I am a user and I have a look at the POST datasets DTO. I will see a runNumber field. So I will add my runNumber there. Then all historical metadata used to use scientificMetadata.runNumber. If no migration I will have inconsistent behaviour

If you think we should not standardise how to use runNumber as a top level element, I think this will create a lot of confusion for the use, but I can manage that from my side

@minottic
Copy link
Member Author

minottic commented Mar 11, 2026

just to summarise the change, given the latest commit:

  1. on PATCH/POST on v3 endpoint runNumber is populated from scientificMetadata.runNumber if runNumber is missing. Both are kept (we might think to remove the scientificMetadata.runNumber in this case but I am fine keeping it)
  2. on GET on v3 scientificMetadata.runNumber is populated from runNumber if missing

since the migration is a source of discussion, I am happy to manage it from my side and not have it in this repo

@minottic minottic requested a review from nitrosx March 17, 2026 09:56
@minottic
Copy link
Member Author

any further reason this is blocked @nitrosx ?

@minottic
Copy link
Member Author

@nitrosx , @Junjiequan any news here?

Junjiequan
Junjiequan previously approved these changes Mar 19, 2026
Copy link
Member

@Junjiequan Junjiequan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@Junjiequan
Copy link
Member

If this PR is urgent we can dismiss review from @nitrosx , as he will mostly not available untill the week after next week.
I don't see any other blockers with this PR

@Junjiequan Junjiequan dismissed nitrosx’s stale review March 20, 2026 11:31

Requested changes has been addressed. I don't see any other blockers. If later we find any blockers we can revert the PR by then

@minottic
Copy link
Member Author

If this PR is urgent we can dismiss review from @nitrosx , as he will mostly not available untill the week after next week.
I don't see any other blockers with this PR

If you are happy with the changes, I would indeed proceed with the merge

@Junjiequan
Copy link
Member

Junjiequan commented Mar 20, 2026

I just thought about one potential issues here. if we allow overwritting runNumber from scientificMetadata.runNumber if not exists, what about old datasets? wouldn't it causing incosistent datasets?
For example, when creating or updating a dataset, dataset.runNumber may not be provided, but the user might set scientificMetadata.runNumber for a different purpose than the actual meaning of runNumber.

@Junjiequan Junjiequan dismissed their stale review March 20, 2026 14:52

Still need more discussion

@minottic
Copy link
Member Author

minottic commented Mar 20, 2026

I just thought about one potential issues here. if we allow overwritting runNumber from scientificMetadata.runNumber if not exists, what about old datasets? wouldn't it causing incosistent datasets?

That's exactly why we will run a migration at PSI. I think it's indeed not a trivial matter, because:

  1. the FE consumes top level runNumber. Before it used to consume scientificMetadata.runNumber. Already this, kinda supports the argument that they have the same meaning
  2. If I want runNumber to be displayed in FE I need to use the top level field
  3. Even if I did not want the FE to use this field, any new user who would see the DTO of Datasets will put runNumber as a top level field, as that's the most intuitive thing if having a look at the schema (they would not think they need to use the scientificMetadata.runNumber)
  4. Given that we have legacy users and data, we will very likely have historic users using "scientificMetadata.runNumber" and new users using "runNumber"
  5. This will create inconsistencies
  6. To overcome this, PSI will run a migration of its data moving scientificMetadata.runNumber to runNumber and the change of this PR would keep new data consistent

I am happy with other solutions but this bug is causing complains from our users.

Also, I honestly think asking ourselves "what if the user meant something by leaving runNumber empty and used scientificMetadata.runNumber" is a sensible question, but in reality a bit too theoretic, I believe anyone seeing runNumber would interpret it in the same way as scientificMetadata.runNumber and simply leave one of the 2 out (likely the nested one as it's not in the DTO)

Last point, when we moved scientificMetadata.runNumber to runNumber not only we caused this possible inconsistence, but we also introduced a breaking change, as any client would not know anymore what field to consume.

@Junjiequan
Copy link
Member

After this change I'm afraid It still remains the root issue that what if user’s intent for scientificMeata.runNumber is not meant to be used for dataset.runNumber.

@minottic
Copy link
Member Author

sorry, I updated the comment here quite a bit

@minottic
Copy link
Member Author

and also, we've done these defaultings a lot of time it's done in publishedData DTOs and job DTOs.

But maybe to be pragmatic, what if I make another PR where I expose conditionally runNumber as a top level field? In this way, PSI could opt out from having it as a top level field and keep the old behaviour or I can make the change in this PR configurable via ENV?

@Junjiequan
Copy link
Member

Is the complaining comes from mainly for FE display ? if it's the case the table config accepts scientificMetadata.runNumber.value and displays like so:
image
also within labelLocalization the label can be renamed to runNumber

@minottic
Copy link
Member Author

no, the main problem will be this

Even if I did not want the FE to use this field, any new user who would see the DTO of Datasets will put runNumber as a top level field, as that's the most intuitive thing if having a look at the schema (they would not think they need to use the scientificMetadata.runNumber)

@Junjiequan
Copy link
Member

In this PR I think the output part looks fine to me, my only concern was the create/update dto part which tries to predict/ twist user’s intent that it assumes dataset.scientificMeatadata.runNumber = dataset.runNumber

@Junjiequan
Copy link
Member

If I understand the whole situation corretctly, the issue is that legacyusers or services that dont want to use or see the top level runNumber , as it can be confusing which one to use.

if that's the case, please go ahead and add env variables within this PR and expose the field conditionally. And we can treat top level runNumber as a new feature in v4

@minottic
Copy link
Member Author

In this PR I think the output part looks fine to me, my only concern was the create/update dto part which tries to predict/ twist user’s intent that it assumes dataset.scientificMeatadata.runNumber = dataset.runNumber

I think the change cannot happen only in the get dto though, we have dashboarding tools that query the db directly, I'd like to avoid querying for both fields in each query.

But that's valuable feedback, I'll propose something different next week

Copy link
Member

@nitrosx nitrosx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minottic apologies for being so demanding!!!
can you add a configuration variable in order to enable or disable this behavior?

@minottic minottic marked this pull request as draft March 26, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dataset's runNumber inconsistencies

4 participants