Merge option create #86

Open
RoteKekse wants to merge 5 commits into main from merge_option_create

@RoteKekse

Collaborator

Hey Carla, can you review this PR and understand what it does and maybe also test it?

I am not yet 100% sure if this is the way to go, let me know what you think!
Best
Micha

@RoteKekse RoteKekse force-pushed the merge_option_create branch from 84b6938 to 4da3197 Compare March 25, 2025 07:08
@carla-terboven

Collaborator

What it does:
If the archive.json for the file already exists, the new archive.json is “merged” into the existing archive.json.

The method create_archive is usually called in the parsers. This PR is therefore relevant when a file with the same name is uploaded again and when reprocessing uploads.
It is only called for (archive).json files.

In contrast to the already existing overwrite option where the old archive is replaced by the new one, the merge works a little differently.
Of course, the question arises as to what merging entries with many quantities and subsections actually means.
Here it seems to be implemented in such a way that only “new” values are merged: if a quantity or subsection did not exist before or is empty, it is filled with the content of the new entry. Changes to a previously existing quantity are therefore ignored. (The term merge may be somewhat misleading: only what did not exist before is added, not everything is merged.)

Even if the term merge may seem a little misleading, this corresponds to the logic that we often use in the normalizers. There we have the logic: if the quantity is not yet set (None), then set it as follows. This means that overwriting previous values is not possible there either, only setting new quantities.
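
To make the "fill only what is missing" reading concrete, here is a minimal sketch on plain dictionaries. It is not the code from this PR: the names merge_missing and _is_empty are hypothetical, and treating None, empty strings, empty lists and empty dicts as "not set" is an assumption.

```python
from copy import deepcopy


def _is_empty(value):
    """Assumption: None and empty str/list/dict count as 'not set'."""
    return value is None or value == "" or value == [] or value == {}


def merge_missing(existing, new):
    """Return a copy of `existing` with gaps filled from `new`.

    Values that are already set in `existing` are never overwritten.
    """
    merged = deepcopy(existing)
    for key, new_value in new.items():
        old_value = merged.get(key)
        if _is_empty(old_value):
            # Missing or empty so far: take the value from the new entry.
            merged[key] = deepcopy(new_value)
        elif isinstance(old_value, dict) and isinstance(new_value, dict):
            # Both sides are sections: descend and fill gaps inside.
            merged[key] = merge_missing(old_value, new_value)
        # Otherwise the existing value wins and the new one is ignored.
    return merged


existing = {"data": {"name": "sample_1", "temperature": 300, "comment": None}}
new = {"data": {"name": "renamed", "temperature": 350,
                "comment": "annealed", "pressure": 1.0}}

merged = merge_missing(existing, new)
print(merged)
# {'data': {'name': 'sample_1', 'temperature': 300,
#           'comment': 'annealed', 'pressure': 1.0}}
```

In this sketch the changed name and temperature are ignored, while the empty comment and the missing pressure are filled in. Non-empty lists (for example repeating subsections) are kept as they are, which is a simplification.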

At the moment I can't think of a use case where you want to add all the new things without overwriting, but do not have a normalizer to do it.
I'll have to put together an example later to try out the code. Maybe then I'll find more problems or use cases.

Comment thread src/baseclasses/helper/utilities.py Outdated

- if not archive.m_context.raw_path_exists(file_name) or overwrite:
+ if not archive.m_context.raw_path_exists(file_name) or overwrite or merge:
+     if merge and file_name.endswith('.json'):

Collaborator

I guess this could be even more specific, with .archive.json instead of only .json? Just in case one day some group wants to upload JSON data.

Collaborator Author

Yes, that is true. I also thought about checking whether the file exists, so: or (merge and archive.m_context.raw_path_exists(file_name)), and even file_name.endswith('archive.json').
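
A small sketch of how the two checks could be combined. One thing to note: not exists or overwrite or (merge and exists) is logically the same as not exists or overwrite or merge, so the existence check mainly matters in the inner merge branch. The function decide_action is hypothetical and takes exists as a plain boolean instead of calling archive.m_context.raw_path_exists. Letting merge win over overwrite, and skipping a non-archive file when only merge is set, are assumptions.

```python
def decide_action(file_name, exists, overwrite=False, merge=False):
    """Return 'create', 'overwrite', 'merge' or 'skip' for one file.

    `exists` stands in for archive.m_context.raw_path_exists(file_name).
    """
    if not exists:
        # Nothing to overwrite or merge into, so just write the file.
        return "create"
    if merge and file_name.endswith(".archive.json"):
        # Only archive files are merged, not arbitrary uploaded JSON data.
        return "merge"
    if overwrite:
        return "overwrite"
    # The file exists and neither option applies: leave it untouched.
    return "skip"


print(decide_action("cell_1.archive.json", exists=False, merge=True))  # create
print(decide_action("cell_1.archive.json", exists=True, merge=True))   # merge
print(decide_action("raw_data.json", exists=True, merge=True))         # skip
```

With this shape, merge=True on a file that does not exist yet falls back to a plain create instead of trying to read a missing archive.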

Comment thread src/baseclasses/helper/utilities.py
Comment thread src/baseclasses/helper/utilities.py Outdated
Comment thread src/baseclasses/helper/utilities.py Outdated
import json

- if not archive.m_context.raw_path_exists(file_name) or overwrite:
+ if not archive.m_context.raw_path_exists(file_name) or overwrite or merge:

Collaborator

merge only seems to work if archive.m_context.raw_path_exists(file_name) is True. Maybe we want to check for that.

In the case of merge==True and (not archive.m_context.raw_path_exists(file_name))==True, I get an error.

Collaborator Author

Yes, I also wrote something about that above.

@carla-terboven

carla-terboven commented Mar 25, 2025

Collaborator

OK, it seems like everything that is parsed from a file gets replaced when uploading a new file and reprocessing. Other quantities that can be set manually are not replaced.

But quantities do not get removed.

The behavior is also slightly different depending on whether the save button is clicked in between: in that case a different "new_entry" is used than when just reprocessing. That's not intuitive to me.
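
One way to picture the observed behavior (parsed values replaced, manual values kept, nothing removed) is an update in which the re-parsed values win and keys absent from the new parse are left alone. This is only a model of the observation on plain dictionaries, not the code in this PR. The name update_keep_extra and the example keys are hypothetical.

```python
from copy import deepcopy


def update_keep_extra(existing, reparsed):
    """Return a copy of `existing` updated with `reparsed`.

    Values present in `reparsed` replace the old ones. Keys that only
    exist in `existing` are kept, so nothing is ever removed.
    """
    merged = deepcopy(existing)
    for key, new_value in reparsed.items():
        old_value = merged.get(key)
        if isinstance(old_value, dict) and isinstance(new_value, dict):
            # Both sides are sections: update inside the section.
            merged[key] = update_keep_extra(old_value, new_value)
        else:
            # Parsed value from the new file replaces the old one.
            merged[key] = deepcopy(new_value)
    return merged


existing = {"voltage": 1.1, "current": 0.2, "operator": "set by hand"}
reparsed = {"voltage": 1.2}  # the new file no longer contains "current"

result = update_keep_extra(existing, reparsed)
print(result)
# {'voltage': 1.2, 'current': 0.2, 'operator': 'set by hand'}
```

Here the manually set operator survives and the re-parsed voltage is replaced, but the stale current also stays, which matches "quantities do not get removed".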

@RoteKekse

Collaborator Author

So the background for this is the following. Right now we usually do the parsing in the normalizer of an entry. For the huge Excel table in the solar cell field, we actually create all the process entries during parsing, so the create_archive function does not just get an entry with sample_id and file name, but one with all properties filled out. So the Excel table creates 30-40 archives, and none of the processes have a data file reference.

Now imagine that someone adds a column which is not picked up. Then I need a way for it to be added upon reprocessing.

Actually, writing this, I realize that the whole parsing here is completely different from the case where there is a one-to-one correspondence between entry and measurement file.

The actual issue was that there was a typo in a column header of the Excel file, and I fixed the parsing. But reprocessing did not update the archive, since it already existed, and overwriting might destroy manually changed data.

I am really not sure how to deal with this.

@carla-terboven

Collaborator

OK, I see. I did not test this "one file to many archives" option, but I understand the idea now. Actually, this is quite typical for what we do with the Excel files. I think Christina also replaced a lot of the already uploaded data by hand in order to reprocess it with the new parsing.

But I am also not sure how to deal with that. Overwriting is really tricky to do automatically, since NOMAD does not provide a real history of changes, and things could get lost in a really uncontrolled way.

@RoteKekse

Collaborator Author

Yes, this is why I never overwrite, only add.

@carla-terboven

Collaborator

When uploading a file with an already existing name this code somehow overwrites all quantities that are different from the existing archive without even clicking the reprocess button.
