
Common archiving workflow #175


@mattldawson

The NCBI and PDB (coming soon!) pipelines both involve an archiving step in which stale files are moved from the primary dataset folder (refdata-files/datasets/{DATASET_NAME}/raw_data/) to an archive folder. Future integrations will likely also need some way to archive outdated files.

Should we develop a common strategy and support functionality for archiving files for any dataset?

If so, I see two options for how we organize the archive folders:

Group by archive date

This is how NCBI currently works (roughly). These files:

refdata-files/datasets/{DATASET_NAME}/raw_data/some/recordset/file.txt
refdata-files/datasets/{DATASET_NAME}/metadata/some_recordset.json

would be archived to:

refdata-files/datasets/{DATASET_NAME}/archive/{ARCHIVE_DATE}/{ARCHIVE_REASON}/raw_data/some/recordset/file.txt
refdata-files/datasets/{DATASET_NAME}/archive/{ARCHIVE_DATE}/{ARCHIVE_REASON}/metadata/some_recordset.json
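As a rough sketch of the date-grouped mapping, a shared helper might look something like the following. The function name, the ISO date format, and the "stale" reason value are all assumptions made for illustration; none of them come from the existing NCBI code.

```python
from pathlib import PurePosixPath

# Root of the dataset tree, as written in this issue.
ROOT = PurePosixPath("refdata-files/datasets")


def archive_path_by_date(dataset, relative_path, archive_date, archive_reason):
    """Return the date-grouped archive location for a file.

    `relative_path` is relative to the dataset folder, e.g.
    "raw_data/some/recordset/file.txt". The helper name and the
    argument formats are hypothetical.
    """
    return ROOT / dataset / "archive" / archive_date / archive_reason / relative_path


# Example with made-up values for the dataset, date, and reason.
print(archive_path_by_date(
    "ncbi", "raw_data/some/recordset/file.txt", "2024-01-31", "stale"
))
```

The directory layout under raw_data/ and metadata/ is preserved as-is below the {ARCHIVE_DATE}/{ARCHIVE_REASON} prefix.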

Group by recordset

Another option would be to have the same files archived to:

refdata-files/datasets/{DATASET_NAME}/archive/raw_data/some/recordset/file.{ARCHIVE_DATE}.{ARCHIVE_REASON}.txt
refdata-files/datasets/{DATASET_NAME}/archive/metadata/some_recordset.{ARCHIVE_DATE}.{ARCHIVE_REASON}.json
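A minimal sketch of the recordset-grouped mapping, under the same caveats: the helper name, the ISO date format, and the "stale" reason are assumptions for illustration only.

```python
from pathlib import PurePosixPath

# Root of the dataset tree, as written in this issue.
ROOT = PurePosixPath("refdata-files/datasets")


def archive_path_by_recordset(dataset, relative_path, archive_date, archive_reason):
    """Return the recordset-grouped archive location for a file.

    The archive date and reason are inserted before the final file
    extension, so "file.txt" becomes "file.{date}.{reason}.txt".
    The helper name and argument formats are hypothetical.
    """
    rel = PurePosixPath(relative_path)
    tagged_name = f"{rel.stem}.{archive_date}.{archive_reason}{rel.suffix}"
    return ROOT / dataset / "archive" / rel.with_name(tagged_name)


# Example with made-up values for the dataset, date, and reason.
print(archive_path_by_recordset(
    "ncbi", "raw_data/some/recordset/file.txt", "2024-01-31", "stale"
))
```

One thing to decide with this scheme is how to treat multi-part extensions: as written, only the last suffix is kept at the end, so file.tar.gz would become file.tar.{ARCHIVE_DATE}.{ARCHIVE_REASON}.gz.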

I'm leaning toward the second option. In the first scenario, finding an old version of a file requires knowing the exact date it was archived. In the second scenario, all the previous versions of a file sit in the same folder.
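To illustrate the lookup with recordset grouping, here is a hedged sketch that lists every archived version of one file from a single folder. It assumes archived names follow the file.{ARCHIVE_DATE}.{ARCHIVE_REASON}.txt pattern with ISO-formatted dates (so a name sort is also a date sort); the function name is hypothetical.

```python
from pathlib import Path


def archived_versions(archive_folder: Path, original_name: str) -> list[Path]:
    """List archived copies of `original_name` found in `archive_folder`.

    Assumes names of the form "<stem>.<date>.<reason><suffix>". With ISO
    dates (YYYY-MM-DD), sorting by name also sorts oldest to newest.
    """
    original = Path(original_name)
    pattern = f"{original.stem}.*{original.suffix}"
    return sorted(archive_folder.glob(pattern))
```

With date grouping, the equivalent lookup would have to search across every {ARCHIVE_DATE}/{ARCHIVE_REASON} folder rather than a single directory.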
