Mid-Run Checkpointing And Validation

In the current implementation of SigLLM, all completed progress is lost if a run terminates before it finishes. Because SigLLM runs can take many hours, this is an important issue for end users who don't want to waste compute on failed all-or-nothing runs. A run can be terminated early for two main reasons:

1. Compute limits are exceeded. For example, some GPU clusters enforce strict time limits on jobs. Currently, if a run exceeds its allotted time, it is shut down and no intermediate results are saved to persistent storage.
2. SigLLM produces a malformed output for at least one of the windows. In this case, the entire run crashes during post-processing, and none of the SigLLM predictions, even the correctly formatted ones, are saved to persistent storage.

I propose a mid-run checkpointing scheme that saves intermediate results for each window during a run. This would entail creating a checkpoint folder where individual window predictions are saved as they are generated. Additionally, to ensure that all saved progress is useful, LLM predictions for each window will be validated during the run before they are saved. I will implement a retry policy for malformed LLM outputs that fail this validation, so that all windows succeed with high probability.
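A minimal sketch of the validate-then-save loop with retries might look like the following. Everything here is hypothetical and not taken from the SigLLM codebase: the function names (`parse_prediction`, `save_window`, `predict_with_retry`), the `window_NNNNNN.json` file layout, and the assumption that a well-formed output is a comma-separated list of integers. The `generate` callable stands in for whatever produces the raw LLM output for a window.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def parse_prediction(raw: str) -> Optional[List[int]]:
    """Return parsed values, or None if the raw output is malformed.

    Assumption: a valid output is a comma-separated list of integers.
    """
    try:
        return [int(token) for token in raw.split(",")]
    except ValueError:
        return None


def save_window(checkpoint_dir: Path, index: int, values: List[int]) -> Path:
    """Write one window's prediction to its own file in the checkpoint folder.

    The file is written to a temporary name and then renamed, so a job
    killed mid-write does not leave a truncated checkpoint behind.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    target = checkpoint_dir / f"window_{index:06d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"index": index, "values": values}, handle)
    os.replace(tmp_name, target)
    return target


def predict_with_retry(
    generate: Callable[[int], str],
    index: int,
    checkpoint_dir: Path,
    max_attempts: int = 3,
) -> Optional[List[int]]:
    """Generate a window, validate it, and save it only if it is valid.

    Returns None if every attempt produced a malformed output, so the
    caller can decide how to handle the failed window without crashing.
    """
    for _ in range(max_attempts):
        values = parse_prediction(generate(index))
        if values is not None:
            save_window(checkpoint_dir, index, values)
            return values
    return None
```

Validating before saving means every file in the checkpoint folder is usable as-is, and a window that still fails after `max_attempts` tries is reported to the caller instead of bringing down the whole run.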

With the above changes in place, I will implement a scheme where, on future runs over the same data, windows that were already predicted are not regenerated and are simply loaded from the checkpoint folder.
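The resume behavior could be sketched roughly as below. This is an illustration under stated assumptions, not the SigLLM implementation: `load_checkpoints` and `run_with_resume` are hypothetical names, each window is assumed to be stored as a JSON file with `index` and `values` keys, and the checkpoint folder is assumed to be dedicated to a single dataset and configuration. The `predict` callable stands in for the validated per-window prediction step.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def load_checkpoints(checkpoint_dir: Path) -> Dict[int, List[int]]:
    """Load every readable window checkpoint, keyed by window index."""
    done: Dict[int, List[int]] = {}
    if not checkpoint_dir.is_dir():
        return done
    for path in sorted(checkpoint_dir.glob("window_*.json")):
        try:
            record = json.loads(path.read_text())
            done[int(record["index"])] = list(record["values"])
        except (ValueError, KeyError, TypeError):
            # Unreadable checkpoint: skip it so the window is regenerated.
            continue
    return done


def run_with_resume(
    num_windows: int,
    predict: Callable[[int], List[int]],
    checkpoint_dir: Path,
) -> Dict[int, List[int]]:
    """Predict only the windows that have no checkpoint yet."""
    done = load_checkpoints(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    for index in range(num_windows):
        if index in done:
            continue  # already predicted in an earlier run
        values = predict(index)
        path = checkpoint_dir / f"window_{index:06d}.json"
        path.write_text(json.dumps({"index": index, "values": values}))
        done[index] = values
    return done
```

Calling `run_with_resume` a second time with the same folder would invoke `predict` only for windows that are missing, which is the case for a job that hit its time limit partway through.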


Mid-Run Checkpointing And Validation #64
