The NCBI and PDB (coming soon!) pipelines both involve an archiving step in which stale files are moved from the primary dataset folder (refdata-files/datasets/{DATASET_NAME}/raw_data/) to an archive folder. Future integrations will likely also need some form of archiving for outdated files.
Should we develop a common strategy and support functionality for archiving files for any dataset?
If so, I think there are several options for how we organize the archive folders:
Group by archive date
This is how NCBI currently works (roughly). These files:
refdata-files/datasets/{DATASET_NAME}/raw_data/some/recordset/file.txt
refdata-files/datasets/{DATASET_NAME}/metadata/some_recordset.json
would be archived to:
refdata-files/datasets/{DATASET_NAME}/archive/{ARCHIVE_DATE}/{ARCHIVE_REASON}/raw_data/some/recordset/file.txt
refdata-files/datasets/{DATASET_NAME}/archive/{ARCHIVE_DATE}/{ARCHIVE_REASON}/metadata/some_recordset.json
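As a rough sketch of what a shared helper for this layout could look like (not the existing NCBI code): the function name, the ISO date format, and the "ncbi" and "stale" values are all assumptions made for illustration.

```python
from datetime import date
from pathlib import PurePosixPath

# Root taken from the paths in this issue.
DATASETS_ROOT = PurePosixPath("refdata-files/datasets")


def archive_path_by_date(dataset_name, relative_path, archive_date, archive_reason):
    """Map a path relative to the dataset folder to its date-grouped archive path.

    Hypothetical helper: the archive date is rendered as YYYY-MM-DD, which is
    an assumption, not something the current pipelines are known to do.
    """
    return (
        DATASETS_ROOT
        / dataset_name
        / "archive"
        / archive_date.isoformat()
        / archive_reason
        / PurePosixPath(relative_path)
    )


# Example with made-up dataset name and reason.
print(archive_path_by_date(
    "ncbi", "raw_data/some/recordset/file.txt", date(2024, 1, 15), "stale"
))
# refdata-files/datasets/ncbi/archive/2024-01-15/stale/raw_data/some/recordset/file.txt
```

The original relative path is preserved unchanged under each date folder, so restoring a file is a plain move back to the dataset folder.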
Group by recordset
Another option would be to have the same files archived to:
refdata-files/datasets/{DATASET_NAME}/archive/raw_data/some/recordset/file.{ARCHIVE_DATE}.{ARCHIVE_REASON}.txt
refdata-files/datasets/{DATASET_NAME}/archive/metadata/some_recordset.{ARCHIVE_DATE}.{ARCHIVE_REASON}.json
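A comparable sketch for this layout, again with a hypothetical function name, an assumed ISO date format, and made-up "ncbi" and "stale" values:

```python
from datetime import date
from pathlib import PurePosixPath

# Root taken from the paths in this issue.
DATASETS_ROOT = PurePosixPath("refdata-files/datasets")


def archive_path_by_recordset(dataset_name, relative_path, archive_date, archive_reason):
    """Map a path relative to the dataset folder to its recordset-grouped archive path.

    Hypothetical helper: the date and reason are inserted before the final
    file extension, and the date is rendered as YYYY-MM-DD (an assumption).
    """
    rel = PurePosixPath(relative_path)
    stamped_name = f"{rel.stem}.{archive_date.isoformat()}.{archive_reason}{rel.suffix}"
    return DATASETS_ROOT / dataset_name / "archive" / rel.with_name(stamped_name)


# Example with made-up dataset name and reason.
print(archive_path_by_recordset(
    "ncbi", "raw_data/some/recordset/file.txt", date(2024, 1, 15), "stale"
))
# refdata-files/datasets/ncbi/archive/raw_data/some/recordset/file.2024-01-15.stale.txt
```

One thing to decide with this layout is how multi-part extensions should be handled: as written, only the last suffix is treated as the extension, so file.tar.gz would become file.tar.{ARCHIVE_DATE}.{ARCHIVE_REASON}.gz.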
I'm leaning toward the second option. With the first, finding an old version of a file requires knowing the exact date it was archived. With the second, all previous versions of a file sit in the same folder.
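To make the lookup difference concrete, here is a minimal sketch of finding every archived version of one file under each layout. Both function names are hypothetical, and the code assumes the archive lives on a local filesystem and that dates are formatted so they sort chronologically (for example YYYY-MM-DD).

```python
from pathlib import Path


def versions_by_date(dataset_dir, relative_path):
    """Date-grouped layout: search every {ARCHIVE_DATE}/{ARCHIVE_REASON} folder."""
    archive = Path(dataset_dir) / "archive"
    # One wildcard per level: the date folder, then the reason folder.
    return sorted(archive.glob(f"*/*/{relative_path}"))


def versions_by_recordset(dataset_dir, relative_path):
    """Recordset-grouped layout: every version sits next to the others."""
    rel = Path(relative_path)
    folder = Path(dataset_dir) / "archive" / rel.parent
    # Matches names like file.{ARCHIVE_DATE}.{ARCHIVE_REASON}.txt
    return sorted(folder.glob(f"{rel.stem}.*.*{rel.suffix}"))
```

Without the exact date, the first layout can only be searched with a wildcard scan, and the matches are spread across one folder per archive date. The second layout needs only a listing of a single folder, which is also what someone browsing the archive by hand would see.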