Motivation
DolphinScheduler already supports manual workflow instance deletion, but it still does not provide a built-in retention and cleanup system for historical workflow runtime data.
Today, operators usually solve this problem in one of two ways:
- call the existing workflow instance delete APIs one by one or in batches
- write custom scripts that delete data directly from the metadata database and log directories
This creates several problems:
- no built-in retention policy
- no built-in scheduler for cleanup
- no dry-run or preview capability
- no centralized metrics or operation summary for cleanup runs
- cleanup logic is currently centered in the API layer, so it is not reusable by an internal scheduler
- the current delete path does not own all runtime artifacts that reference workflow instances or task instances
The existing delete path is useful, but it is not yet a platform-level lifecycle management feature.
Current built-in behavior mainly comes from:
- WorkflowInstanceServiceImpl.deleteWorkflowInstanceById(...)
- TaskInstanceServiceImpl.deleteByWorkflowInstanceId(...)
- AlertDao.deleteByWorkflowInstanceId(...)
- leader-only control loops managed by MasterCoordinator
The current delete chain already removes workflow instances, task instances, task logs, alerts, and sub-workflow instances recursively. However, it still has important gaps for a production-grade retention feature:
- no project-level or global policy model
- no leader-only scheduled cleanup job
- no cleanup preview API
- no cleanup run summary and metrics
- no batch-oriented internal cleanup service in dolphinscheduler-service
- no explicit cleanup for some related runtime tables such as:
  - t_ds_command
  - t_ds_error_command
  - t_ds_task_instance_context
  - t_ds_relation_sub_workflow
  - child-side relation cleanup in t_ds_relation_workflow_instance
This DSIP proposes a built-in workflow instance retention and cleanup feature that is safe, observable, opt-in, and reusable by both manual APIs and scheduled cleanup.
Out of scope (NOT included)
- workflow definition cleanup
- resource center cleanup
- workflow definition archive/export
- generic database partition management
- cleanup of external artifacts not directly managed by DolphinScheduler
- soft delete / recycle bin / restore flow
- master workflow log cleanup and UI workflow log browsing (that is a separate topic, e.g. DSIP-107)
Design Detail
1. Goals
The MVP should provide:
- a built-in project opt-in retention policy for historical workflow instances
- a reusable internal cleanup service shared by manual delete APIs and scheduled cleanup
- manual preview and manual trigger APIs
- leader-only scheduled cleanup on the active master
- conservative safety controls: final-state only, dry-run, bounded batch size, project opt-in, disabled by default
- complete cleanup coverage for workflow-instance-related runtime data owned by DolphinScheduler
2. Current state and main gaps
2.1 Current delete path
Current deletion starts in the API layer:
- WorkflowInstanceServiceImpl.deleteWorkflowInstanceById(User, Integer) checks project auth and requires a final state
- WorkflowInstanceServiceImpl.deleteWorkflowInstanceById(int) performs the recursive delete
- TaskInstanceServiceImpl.deleteByWorkflowInstanceId(Integer) removes task logs on a best-effort basis and deletes task rows and task-group queue rows
- AlertDao.deleteByWorkflowInstanceId(Integer) removes alerts and alert send status rows
2.2 Architecture gap
The real delete primitive currently lives in dolphinscheduler-api, not in dolphinscheduler-service. That means scheduled cleanup cannot reuse the same business logic cleanly.
2.3 Coverage gap
The current recursive delete path does not fully own all runtime tables that should be cleaned with a workflow instance family. In particular, the retention design must explicitly cover:
- t_ds_workflow_instance
- t_ds_task_instance
- t_ds_task_group_queue
- t_ds_task_instance_context
- t_ds_alert
- t_ds_alert_send_status
- t_ds_command
- t_ds_error_command
- t_ds_serial_command
- t_ds_relation_workflow_instance
- t_ds_relation_sub_workflow
Two relation models already exist in the codebase:
- t_ds_relation_workflow_instance: runtime parent-task-instance to child-workflow-instance relation
- t_ds_relation_sub_workflow: runtime parent-task-code to sub-workflow-instance relation, used by dynamic sub-workflow query paths
The cleanup service must treat both tables as runtime metadata that should be deleted when the corresponding workflow instance family is removed.
3. Proposal summary
This DSIP proposes the following architecture:
- add a new internal cleanup service in dolphinscheduler-service
- refactor existing manual delete APIs to delegate to the new cleanup service
- add a project-scoped retention policy table
- add leader-only scheduled cleanup in dolphinscheduler-master
- add manual preview and manual run APIs in dolphinscheduler-api
- use workflow-instance families as the cleanup unit instead of deleting individual rows independently
4. Architecture design
4.1 New cleanup service in dolphinscheduler-service
Add a new service, for example:
- WorkflowInstanceCleanupService
- WorkflowInstanceRetentionPolicyService
Responsibilities of WorkflowInstanceCleanupService:
- resolve cleanup candidates
- expand root workflow instances into complete workflow-instance families
- validate cleanup eligibility
- perform batch cleanup of all related runtime artifacts
- support dry-run / preview mode
- return structured cleanup summaries
This service should become the single business entry point for workflow instance cleanup.
4.2 Refactor existing manual delete onto the same service
Keep current single-delete and batch-delete APIs, but refactor them into thin wrappers over WorkflowInstanceCleanupService.
That gives DolphinScheduler one canonical delete engine for:
- current manual instance delete
- batch instance delete
- manual retention preview/run
- scheduled retention cleanup
4.3 Leader-only scheduler in dolphinscheduler-master
Add a new leader-only coordinator in the master, for example:
WorkflowInstanceCleanupCoordinator
The coordinator should be started only when MasterCoordinator becomes active, similar to TaskGroupCoordinator and WorkflowSerialCoordinator.
This should not be implemented as a user Quartz schedule. This is an internal maintenance task and belongs to the master control plane.
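The leader-only lifecycle described above could look roughly like the following. This is a minimal sketch, not the actual DolphinScheduler coordinator API: the class name `CleanupCoordinatorSketch`, its `start()` / `close()` methods, and the `Runnable` cleanup pass are all hypothetical, and the real integration point would be wherever MasterCoordinator starts and stops its other leader-only loops.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a maintenance loop that only runs while this master is the leader. */
final class CleanupCoordinatorSketch {
    private final Runnable cleanupRun;     // one cleanup pass, assumed to come from the service layer
    private final long scanIntervalMillis; // mirrors the proposed scan-interval setting
    private ScheduledExecutorService executor;

    CleanupCoordinatorSketch(Runnable cleanupRun, long scanIntervalMillis) {
        this.cleanupRun = cleanupRun;
        this.scanIntervalMillis = scanIntervalMillis;
    }

    /** Called when the master becomes active; calling it twice has no extra effect. */
    synchronized void start() {
        if (executor != null) {
            return;
        }
        executor = Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the next run is scheduled only after the previous one has finished.
        executor.scheduleWithFixedDelay(
                this::runSafely, scanIntervalMillis, scanIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called when the master loses leadership or shuts down; calling it twice has no extra effect. */
    synchronized void close() {
        if (executor != null) {
            executor.shutdownNow();
            executor = null;
        }
    }

    synchronized boolean isRunning() {
        return executor != null;
    }

    private void runSafely() {
        try {
            cleanupRun.run();
        } catch (RuntimeException e) {
            // Swallowed so that one failed run does not cancel the periodic schedule.
        }
    }
}
```

A fixed delay rather than a fixed rate is used here so that a slow cleanup pass can never overlap with the next one.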
4.4 Module split
dolphinscheduler-service
- cleanup business logic
- candidate selection
- family expansion
- delete graph orchestration
- cleanup summary generation
dolphinscheduler-api
- permission checks
- request / response DTOs
- policy CRUD APIs
- manual preview / manual run APIs
- audit / operation logging for manual actions
dolphinscheduler-master
- leader-only scheduled trigger
- config binding for cleanup scheduler
- cleanup metrics emission
- structured run summary logs
dolphinscheduler-dao
- retention policy entity / mapper / repository
- batch delete methods for missing cleanup tables
- new retention-related indexes
5. Cleanup unit and semantics
5.1 Cleanup unit = workflow-instance family
The scheduled cleanup unit should be a root workflow instance plus all descendant sub-workflow instances.
For scheduled cleanup, candidate selection should start from root workflow instances only:
- is_sub_workflow = 0
- final state only
- end_time is not null
The cleanup service then expands the family using t_ds_relation_workflow_instance.
5.2 Family eligibility rule
A family is deletable only when all family members satisfy:
- final state
- end_time is not null
- end_time < cutoff
If a root instance is old enough but any descendant sub-workflow is still too new or not in a final state, the whole family must be skipped for that run.
This is safer than partial family cleanup.
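The all-or-nothing rule can be expressed as a small predicate. The sketch below is illustrative only: `FamilyEligibility` and `Member` are hypothetical names, and the cutoff formula (now minus retention days minus safety lag) is an assumed reading of how the safety lag combines with the retention period.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

final class FamilyEligibility {
    /** Minimal view of one workflow instance; field names are illustrative. */
    static final class Member {
        final boolean finalState;
        final Instant endTime; // null while the instance has not finished

        Member(boolean finalState, Instant endTime) {
            this.finalState = finalState;
            this.endTime = endTime;
        }
    }

    /** Assumed interpretation: cutoff = now - retentionDays - safetyLag. */
    static Instant cutoff(Instant now, int retentionDays, Duration safetyLag) {
        return now.minus(Duration.ofDays(retentionDays)).minus(safetyLag);
    }

    /** A family is deletable only if every member is final, has an end time, and ended before the cutoff. */
    static boolean isDeletable(List<Member> family, Instant cutoff) {
        if (family.isEmpty()) {
            return false;
        }
        for (Member m : family) {
            if (!m.finalState || m.endTime == null || !m.endTime.isBefore(cutoff)) {
                return false; // one ineligible member skips the whole family
            }
        }
        return true;
    }
}
```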
5.3 Hard delete for MVP
The MVP should use hard delete, not soft delete.
Reasons:
- current manual delete is already hard delete
- soft delete would require pervasive filtering across workflow, task, alert, and query APIs
- soft delete would increase long-term storage and index cost instead of solving the retention problem
- recycle-bin and restore semantics require UI and API redesign far beyond the scope of this DSIP
To reduce risk, hard delete is combined with strong safety controls:
- feature disabled by default
- project opt-in policy
- dry-run support
- final-state only
- safety lag beyond retention cutoff
- bounded family count per run
Archive / export support can be added later without blocking the MVP.
6. Policy model
6.1 Global operator config in master
Add a new nested config in MasterConfig, for example master.workflow-instance-cleanup.*.
Suggested config keys:
| Key | Default | Description |
| --- | --- | --- |
| master.workflow-instance-cleanup.enabled | false | Enables the cleanup coordinator |
| master.workflow-instance-cleanup.scan-interval | 1h | Fixed delay between cleanup runs |
| master.workflow-instance-cleanup.default-retention-days | 30 | Default retention used by project policy |
| master.workflow-instance-cleanup.minimum-retention-days | 7 | Lower bound allowed for project policy or manual override |
| master.workflow-instance-cleanup.safety-lag | 1d | Additional delay after retention cutoff |
| master.workflow-instance-cleanup.max-families-per-run | 100 | Upper bound of workflow families per scheduled run |
| master.workflow-instance-cleanup.delete-task-logs | true | Whether scheduled cleanup should attempt physical task-log deletion |
| master.workflow-instance-cleanup.dry-run | false | Global dry-run mode for safe rollout |
Global config is controlled by operators and is not exposed as a user-facing API in the MVP.
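As a rough illustration, the keys above could map onto a plain holder class like the one below. The class name `WorkflowInstanceCleanupConfigSketch` and the `validate()` checks are assumptions made for this sketch; how the real class would be nested into and bound from MasterConfig is left open.

```java
import java.time.Duration;

/** Hypothetical holder for the master.workflow-instance-cleanup.* keys with the suggested defaults. */
final class WorkflowInstanceCleanupConfigSketch {
    boolean enabled = false;
    Duration scanInterval = Duration.ofHours(1);
    int defaultRetentionDays = 30;
    int minimumRetentionDays = 7;
    Duration safetyLag = Duration.ofDays(1);
    int maxFamiliesPerRun = 100;
    boolean deleteTaskLogs = true;
    boolean dryRun = false;

    /** Sanity checks that are assumptions of this sketch, not rules stated by the proposal. */
    void validate() {
        if (minimumRetentionDays <= 0) {
            throw new IllegalArgumentException("minimum-retention-days must be positive");
        }
        if (defaultRetentionDays < minimumRetentionDays) {
            throw new IllegalArgumentException("default-retention-days must be >= minimum-retention-days");
        }
        if (maxFamiliesPerRun <= 0) {
            throw new IllegalArgumentException("max-families-per-run must be positive");
        }
        if (scanInterval.isZero() || scanInterval.isNegative()) {
            throw new IllegalArgumentException("scan-interval must be positive");
        }
        if (safetyLag.isNegative()) {
            throw new IllegalArgumentException("safety-lag must not be negative");
        }
    }
}
```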
6.2 Project opt-in policy table
Add a dedicated table instead of reusing t_ds_project_preference.
t_ds_project_preference currently stores opaque string preferences and is not a good operational table for structured cleanup policies.
Suggested new table:
CREATE TABLE `t_ds_workflow_instance_retention_policy` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`project_code` bigint(20) NOT NULL,
`enabled` tinyint(4) NOT NULL DEFAULT '0',
`retention_days` int(11) NOT NULL,
`delete_task_logs` tinyint(4) NOT NULL DEFAULT '1',
`create_user_id` int(11) DEFAULT NULL,
`update_user_id` int(11) DEFAULT NULL,
`create_time` datetime NOT NULL,
`update_time` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uk_project_code` (`project_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Project policy is opt-in. If no policy row exists for a project, scheduled cleanup does not run for that project.
Project policy rules:
- enabled = true turns on scheduled cleanup for that project
- retention_days >= master.workflow-instance-cleanup.minimum-retention-days
- delete_task_logs can override the global default for that project
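The opt-in rules can be sketched as two small helpers. Everything here is hypothetical: `ProjectPolicyResolver` and `PolicyRow` are illustrative names, and treating a row whose retention is below the minimum as "cleanup does not apply" (rather than failing loudly) is an assumption of this sketch.

```java
final class ProjectPolicyResolver {
    /** Illustrative mirror of one row of the proposed policy table. */
    static final class PolicyRow {
        final boolean enabled;
        final int retentionDays;
        final boolean deleteTaskLogs;

        PolicyRow(boolean enabled, int retentionDays, boolean deleteTaskLogs) {
            this.enabled = enabled;
            this.retentionDays = retentionDays;
            this.deleteTaskLogs = deleteTaskLogs;
        }
    }

    /** Scheduled cleanup applies only if the global switch is on and the project has opted in. */
    static boolean scheduledCleanupApplies(boolean globalEnabled, int minimumRetentionDays, PolicyRow row) {
        if (!globalEnabled || row == null || !row.enabled) {
            return false; // no row means the project never opted in
        }
        return row.retentionDays >= minimumRetentionDays;
    }

    /** The project value wins when a row exists; otherwise the global default is used. */
    static boolean effectiveDeleteTaskLogs(boolean globalDefault, PolicyRow row) {
        return row == null ? globalDefault : row.deleteTaskLogs;
    }
}
```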
7. Index and DAO changes
High-volume cleanup needs better scan and delete support.
7.1 New indexes
Recommended additions:
ALTER TABLE `t_ds_workflow_instance`
ADD KEY `idx_retention_scan` (`project_code`, `is_sub_workflow`, `state`, `end_time`, `id`);
ALTER TABLE `t_ds_command`
ADD KEY `idx_workflow_instance_id` (`workflow_instance_id`);
ALTER TABLE `t_ds_error_command`
ADD KEY `idx_workflow_instance_id` (`workflow_instance_id`);
Notes:
- t_ds_serial_command already has an index on workflow_instance_id
- t_ds_relation_workflow_instance already has parent and child indexes
- t_ds_relation_sub_workflow already has parent and child indexes
- t_ds_task_instance_context already has a unique key including task_instance_id
7.2 New DAO methods
The cleanup service needs missing batch operations. Add DAO / mapper methods such as:
- CommandMapper.deleteByWorkflowInstanceIds(List<Integer>)
- ErrorCommandMapper.deleteByWorkflowInstanceIds(List<Integer>)
- SerialCommandMapper.deleteByWorkflowInstanceIds(List<Integer>)
- WorkflowInstanceRelationMapper.deleteByParentWorkflowInstanceIds(List<Integer>)
- WorkflowInstanceRelationMapper.deleteByWorkflowInstanceIds(List<Integer>)
- RelationSubWorkflowMapper.deleteByParentWorkflowInstanceIds(List<Integer>)
- RelationSubWorkflowMapper.deleteBySubWorkflowInstanceIds(List<Integer>)
- TaskInstanceContextMapper.deleteByTaskInstanceIds(List<Integer>)
The cleanup service should prefer repository-layer batch methods instead of calling raw mappers directly from master or API logic.
8. Cleanup graph and delete order
For a workflow-instance family, the cleanup service should own one canonical delete graph.
8.1 Resolution phase
- start from one root workflow instance
- expand all descendant workflow instances from t_ds_relation_workflow_instance
- collect all workflow instance ids in the family
- query all task instances belonging to the family
- collect all task instance ids and alert ids
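The expansion step in the resolution phase is essentially a breadth-first walk over parent-to-child relations. The sketch below assumes the relation rows have already been loaded into an in-memory map; `FamilyExpansion` and its parameter names are hypothetical, and a real implementation would more likely query the relation table level by level.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FamilyExpansion {
    /**
     * @param rootId           id of the root workflow instance
     * @param childrenByParent parent workflow instance id -> child workflow instance ids,
     *                         as could be derived from t_ds_relation_workflow_instance
     * @return all workflow instance ids in the family, root first, in breadth-first order
     */
    static List<Integer> expand(int rootId, Map<Integer, List<Integer>> childrenByParent) {
        Set<Integer> seen = new LinkedHashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        seen.add(rootId);
        queue.add(rootId);
        while (!queue.isEmpty()) {
            Integer current = queue.poll();
            List<Integer> children =
                    childrenByParent.getOrDefault(current, Collections.<Integer>emptyList());
            for (Integer child : children) {
                // The seen-set guards against duplicate or cyclic relation rows.
                if (seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```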
8.2 Delete order
Recommended order:
- remove physical task logs via ILogService (best effort)
- delete t_ds_task_group_queue by workflow instance ids
- delete t_ds_task_instance_context by task instance ids
- delete t_ds_task_instance by workflow instance ids
- delete t_ds_alert_send_status by alert ids
- delete t_ds_alert by workflow instance ids
- delete t_ds_serial_command by workflow instance ids
- delete t_ds_command by workflow instance ids
- delete t_ds_error_command by workflow instance ids
- delete t_ds_relation_sub_workflow by parent workflow instance ids and by sub-workflow instance ids
- delete t_ds_relation_workflow_instance by parent workflow instance ids and by child workflow instance ids
- delete t_ds_workflow_instance by ids
This delete graph explicitly closes the orphan-data gaps in the current implementation.
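One way to keep this order canonical is to encode it once and let every caller iterate over it. The sketch below is an assumption about structure, not existing code: `DeleteGraph`, the `Step` enum, and the handler callback are hypothetical, and the handler stands in for the repository calls that would perform each delete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class DeleteGraph {
    /** Steps in the recommended order; the enum declaration order is the execution order. */
    enum Step {
        TASK_LOGS,
        TASK_GROUP_QUEUE,
        TASK_INSTANCE_CONTEXT,
        TASK_INSTANCE,
        ALERT_SEND_STATUS,
        ALERT,
        SERIAL_COMMAND,
        COMMAND,
        ERROR_COMMAND,
        RELATION_SUB_WORKFLOW,
        RELATION_WORKFLOW_INSTANCE,
        WORKFLOW_INSTANCE
    }

    /**
     * Runs every step in order and returns the steps that were executed.
     * The log step is skipped when task-log deletion is turned off.
     */
    static List<Step> execute(Consumer<Step> handler, boolean deleteTaskLogs) {
        List<Step> executed = new ArrayList<>();
        for (Step step : Step.values()) {
            if (step == Step.TASK_LOGS && !deleteTaskLogs) {
                continue;
            }
            handler.accept(step);
            executed.add(step);
        }
        return executed;
    }
}
```

Child and detail rows come before the rows they reference, and the workflow instance rows go last, so an interrupted run never leaves dependent rows pointing at a deleted instance.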
8.3 Log deletion policy
Task log deletion should remain best effort.
If a worker or master host is unavailable, the cleanup service should:
- record the failure in metrics / logs
- continue deleting metadata rows
Log cleanup failure should not block retention forever.
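Best-effort deletion amounts to catching failures per log file, counting them, and carrying on. In the sketch below, `BestEffortLogCleanup` and the `remover` callback are hypothetical; the callback stands in for whatever remote call removes a task log, and no ILogService method signature is assumed.

```java
import java.util.List;
import java.util.function.Consumer;

final class BestEffortLogCleanup {
    /**
     * Tries to remove every log path and never throws.
     *
     * @param remover stand-in for the remote log deletion call
     * @return the number of paths that could not be removed, suitable for a failure metric
     */
    static int removeLogs(List<String> logPaths, Consumer<String> remover) {
        int failures = 0;
        for (String path : logPaths) {
            try {
                remover.accept(path);
            } catch (RuntimeException e) {
                // Host unavailable or similar: count it and keep going.
                failures++;
            }
        }
        return failures;
    }
}
```

The caller would record the returned count in metrics and logs and then proceed with the metadata deletes.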
9. API design
9.1 Policy APIs
Get policy
GET /projects/{projectCode}/workflow-instance-retention-policy
Response example:
{
  "enabled": true,
  "retentionDays": 30,
  "deleteTaskLogs": true,
  "minimumRetentionDays": 7,
  "defaultRetentionDays": 30
}
Update policy
PUT /projects/{projectCode}/workflow-instance-retention-policy
Request example:
{
  "enabled": true,
  "retentionDays": 30,
  "deleteTaskLogs": true
}
Validation rules:
- retentionDays > 0
- retentionDays >= minimumRetentionDays
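The two validation rules could be checked as follows. `RetentionPolicyValidator` is a hypothetical name, and returning a list of messages rather than throwing is just one possible design for surfacing errors to the API layer.

```java
import java.util.ArrayList;
import java.util.List;

final class RetentionPolicyValidator {
    /** Returns an empty list when the request is valid, otherwise one message per violated rule. */
    static List<String> validate(int retentionDays, int minimumRetentionDays) {
        List<String> errors = new ArrayList<>();
        if (retentionDays <= 0) {
            errors.add("retentionDays must be > 0");
        }
        if (retentionDays < minimumRetentionDays) {
            errors.add("retentionDays must be >= " + minimumRetentionDays);
        }
        return errors;
    }
}
```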
9.2 Manual preview API
POST /projects/{projectCode}/workflow-instances/cleanup/preview
Request example:
{
  "retentionDays": 30,
  "limit": 100,
  "deleteTaskLogs": true
}
Response example:
{
  "candidateFamilyCount": 12,
  "candidateWorkflowInstanceCount": 37,
  "candidateTaskInstanceCount": 824,
  "oldestEndTime": "2026-01-01T00:00:00",
  "sampleRootWorkflowInstanceIds": [101, 102, 103],
  "skippedFamilies": {
    "NON_FINAL_MEMBER": 1,
    "RETENTION_NOT_REACHED": 3
  }
}
9.3 Manual run API
POST /projects/{projectCode}/workflow-instances/cleanup/run
Request example:
{
  "dryRun": false,
  "retentionDays": 30,
  "limit": 100,
  "deleteTaskLogs": true
}
Response example:
{
  "dryRun": false,
  "deletedFamilyCount": 10,
  "deletedWorkflowInstanceCount": 31,
  "deletedTaskInstanceCount": 721,
  "taskLogDeleteFailureCount": 2,
  "skippedFamilyCount": 2,
  "durationMillis": 8123
}
9.4 Existing APIs
Keep existing workflow instance delete APIs unchanged from the outside, but re-implement them using WorkflowInstanceCleanupService.
10. Permission and control model
10.1 Global operator control
Operators keep final control because scheduled cleanup only works when:
- master.workflow-instance-cleanup.enabled=true
- a project policy explicitly enables cleanup for that project
10.2 Project-side permissions
For the MVP, reuse the existing project read and write permission checks:
- get policy: project read permission is acceptable
- update policy: project write permission
- preview cleanup: project write permission
- manual run: project write permission
This aligns with current ProjectServiceImpl permission patterns and avoids introducing a new permission matrix in the MVP.
10.3 Existing instance delete permission
Keep current INSTANCE_DELETE permission for the existing single-instance and batch-instance delete endpoints.
11. Scheduler behavior
11.1 Trigger model
WorkflowInstanceCleanupCoordinator runs only on the active master.
It should be started in the same active/standby lifecycle where MasterCoordinator already starts:
- TaskGroupCoordinator
- WorkflowSerialCoordinator
11.2 Run algorithm
Each scheduled run should:
- load enabled project policies
- iterate projects in stable order
- for each project, scan root workflow instances ordered by end_time asc, id asc
- expand each root into a workflow-instance family
- validate family eligibility
- execute delete or dry-run summary
- stop when max-families-per-run is reached
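The run algorithm can be sketched as a bounded scan. The names below (`ScheduledRunSketch`, `selectFamilies`, the `eligible` predicate) are hypothetical, the candidate roots are assumed to be pre-sorted oldest first, and counting only processed families against the limit (not skipped ones) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.function.Predicate;

final class ScheduledRunSketch {
    /**
     * @param rootsByProject    project code -> candidate root ids, already ordered by end_time asc, id asc
     * @param eligible          stand-in for family expansion plus eligibility validation
     * @param maxFamiliesPerRun upper bound of families handled in one run
     * @return root ids selected for delete (or for the dry-run summary), in processing order
     */
    static List<Integer> selectFamilies(SortedMap<Long, List<Integer>> rootsByProject,
                                        Predicate<Integer> eligible,
                                        int maxFamiliesPerRun) {
        List<Integer> selected = new ArrayList<>();
        // A SortedMap iterates in key order, which gives a stable project order.
        for (List<Integer> roots : rootsByProject.values()) {
            for (Integer rootId : roots) {
                if (selected.size() >= maxFamiliesPerRun) {
                    return selected; // bound reached: the rest waits for the next run
                }
                if (eligible.test(rootId)) {
                    selected.add(rootId);
                }
            }
        }
        return selected;
    }
}
```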
11.3 Transaction boundary
Use one transaction per workflow-instance family.
That gives a good balance:
- the family remains the atomic cleanup unit
- a large global transaction is avoided
- partial failure only affects one family
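The per-family boundary can be illustrated with a loop that isolates each family's failure. In this sketch, `PerFamilyTransactionSketch` and `RunResult` are hypothetical, and the callback stands in for a method that opens one transaction, runs the delete graph, and commits or rolls back; in DolphinScheduler that would presumably be a Spring-managed transaction.

```java
import java.util.List;
import java.util.function.Consumer;

final class PerFamilyTransactionSketch {
    /** Counters for one cleanup run. */
    static final class RunResult {
        int deletedFamilies;
        int failedFamilies;
    }

    /**
     * @param deleteFamilyInTransaction stand-in for "delete one family inside its own transaction";
     *                                  it is expected to roll back and throw on failure
     */
    static RunResult run(List<Integer> rootIds, Consumer<Integer> deleteFamilyInTransaction) {
        RunResult result = new RunResult();
        for (Integer rootId : rootIds) {
            try {
                deleteFamilyInTransaction.accept(rootId);
                result.deletedFamilies++;
            } catch (RuntimeException e) {
                // Only this family is affected; the remaining families are still processed.
                result.failedFamilies++;
            }
        }
        return result;
    }
}
```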
11.4 Failover behavior
If the active master dies during cleanup:
- the in-flight family transaction is rolled back by the database if not committed
- the new leader resumes on the next scheduled run
No extra distributed lease table is required for the MVP because the coordinator is already leader-only.
12. Safety rules
Mandatory safeguards:
- disabled by default
- project opt-in only
- final-state only
- end_time must be present
- retentionDays >= minimumRetentionDays
- configurable safety lag
- dry-run support
- one transaction per family
- bounded families per run
- eligibility re-check before delete
Optional extension after MVP:
- maintenance-window support
- cleanup pause switch in UI
- persistent cleanup run history table
13. Observability
Add metrics in master, for example:
- ds.workflow.cleanup.run.count
- ds.workflow.cleanup.run.failure.count
- ds.workflow.cleanup.run.duration
- ds.workflow.cleanup.candidate.family.count
- ds.workflow.cleanup.deleted.workflow.count
- ds.workflow.cleanup.deleted.task.count
- ds.workflow.cleanup.log.delete.failure.count
- ds.workflow.cleanup.skipped.family.count
Each run should also emit one structured summary log containing:
- trigger type (SCHEDULED / MANUAL)
- project code
- retention days
- dry-run flag
- deleted counts
- skipped counts
- error summary
Persistent run-history storage is not required in the MVP.
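A structured summary line could be assembled as key=value pairs so that it stays easy to search. The format, the `CleanupRunSummary` class, and the field names below are assumptions for illustration; the proposal only lists which pieces of information the log should contain.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CleanupRunSummary {
    /** Builds one log line with the fields in a fixed, predictable order. */
    static String format(String triggerType, long projectCode, int retentionDays, boolean dryRun,
                         int deletedFamilies, int skippedFamilies, String errorSummary) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("triggerType", triggerType);
        fields.put("projectCode", projectCode);
        fields.put("retentionDays", retentionDays);
        fields.put("dryRun", dryRun);
        fields.put("deletedFamilies", deletedFamilies);
        fields.put("skippedFamilies", skippedFamilies);
        fields.put("errorSummary", errorSummary);

        StringBuilder line = new StringBuilder("workflow-instance-cleanup");
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            line.append(' ').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return line.toString();
    }
}
```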
14. Detailed implementation plan
Phase 1: internal cleanup refactor
- add WorkflowInstanceCleanupService in dolphinscheduler-service
- move canonical delete graph out of WorkflowInstanceServiceImpl
- add missing DAO batch delete methods
- refactor existing delete APIs to use the new service
Phase 2: policy and manual APIs
- add t_ds_workflow_instance_retention_policy
- add policy service, repository, mapper, entity
- add preview and manual-run APIs
- add validation and permission checks
Phase 3: master scheduled cleanup
- add WorkflowInstanceCleanupCoordinator
- add cleanup config to MasterConfig
- integrate with MasterCoordinator
- add metrics and structured logs
Phase 4: UI and operator visibility
- project policy UI
- cleanup preview UI entry
- operation result display
UI can be delivered in separate PRs after the backend behavior is stable.
15. Alternatives considered
15.1 External scripts only
Rejected.
That is the current situation and exactly the gap this DSIP aims to close.
15.2 Direct database purge only
Rejected as the core solution.
Database-only purge cannot safely clean DolphinScheduler-managed task logs, runtime relations, and cross-table workflow family metadata.
15.3 Soft delete first, hard delete later
Rejected for the MVP.
This would require broad query-path and UI changes and would turn a retention feature into a much larger product redesign.
15.4 Archive before purge
Valuable, but out of scope for the MVP. The cleanup service should be designed so that an archive/export hook can be added later.
16. Risks and mitigations
| Risk | Mitigation |
| --- | --- |
| Missing a related runtime table | Centralize cleanup graph and add explicit batch delete methods with tests |
| Cleanup puts too much load on DB | Add scan indexes, oldest-first scan, per-family transaction, bounded families per run |
| Hard delete is considered risky | Disable by default, require project opt-in, enforce minimum retention days, support dry-run |
| Leader failover during cleanup | Leader-only coordinator + per-family transaction + next-run retry |
| Worker log deletion may fail | Make log deletion best effort and observable |
Compatibility, Deprecation, and Migration Plan
- Fully backward compatible by default:
  - existing delete endpoints remain
  - no existing API is removed
  - no workflow execution behavior changes when the feature is disabled
- Schema changes are additive:
  - new t_ds_workflow_instance_retention_policy table
  - new indexes on existing tables
  - new mapper / repository methods
- No cross-service RPC compatibility break is required
- No data migration is required for existing workflow instances
- Scheduled cleanup is disabled by default and project opt-in, so existing deployments keep the current behavior until operators enable the feature
- Existing orphan rows from historical manual scripts are not automatically repaired globally, but future cleanup of a workflow family will clean all artifacts covered by the new delete graph
Test Plan
1. Unit tests
- family expansion from t_ds_relation_workflow_instance
- root-only candidate selection
- family eligibility validation
- minimum retention day validation
- dry-run summary generation
- cleanup delete-order orchestration
2. DAO tests
- new policy table CRUD
- batch delete by workflow instance ids for command / error command / serial command
- relation-table delete methods
- task-instance-context batch delete methods
3. API tests
- get / update retention policy
- preview cleanup
- manual cleanup run
- permission validation for project read/write access
- compatibility of existing instance delete APIs after refactor
4. Integration tests
- scheduled cleanup only runs on active master
- standby master does not run cleanup
- active-master failover during cleanup does not corrupt data
- metrics and structured logs are emitted correctly
5. End-to-end data scenarios
Prepare workflow families that contain:
- plain workflow instances
- nested sub-workflows
- dynamic sub-workflows
- task logs
- alerts
- serial command rows
- dependent-task contexts in t_ds_task_instance_context
Verify that after cleanup:
- all workflow-instance rows are deleted
- all related task / queue / context rows are deleted
- command / error command / serial command rows are deleted
- relation tables are deleted from both parent and child perspectives
- task logs are removed when enabled
- no unexpected rows remain in the covered runtime tables