[DSIP-108][WorkflowInstance] Built-in Workflow Instance Retention and Cleanup #18295

@xinxingi

Search before asking

  • I have searched the existing DSIPs and found no similar one.

Motivation

DolphinScheduler already supports manual workflow instance deletion, but it still does not provide a built-in retention and cleanup system for historical workflow runtime data.

Today, operators usually solve this problem in one of two ways:

  1. call the existing workflow instance delete APIs one by one or in batches
  2. write custom scripts that delete data directly from the metadata database and log directories

This creates several problems:

  • no built-in retention policy
  • no built-in scheduler for cleanup
  • no dry-run or preview capability
  • no centralized metrics or operation summary for cleanup runs
  • cleanup logic is currently centered in the API layer, so it is not reusable by an internal scheduler
  • the current delete path does not own all runtime artifacts that reference workflow instances or task instances

The existing delete path is useful, but it is not yet a platform-level lifecycle management feature.

Current built-in behavior mainly comes from:

  • WorkflowInstanceServiceImpl.deleteWorkflowInstanceById(...)
  • TaskInstanceServiceImpl.deleteByWorkflowInstanceId(...)
  • AlertDao.deleteByWorkflowInstanceId(...)
  • leader-only control loops managed by MasterCoordinator

The current delete chain already removes workflow instances, task instances, task logs, alerts, and sub-workflow instances recursively. However, it still has important gaps for a production-grade retention feature:

  • no project-level or global policy model
  • no leader-only scheduled cleanup job
  • no cleanup preview API
  • no cleanup run summary and metrics
  • no batch-oriented internal cleanup service in dolphinscheduler-service
  • no explicit cleanup for some related runtime tables such as:
    • t_ds_command
    • t_ds_error_command
    • t_ds_task_instance_context
    • t_ds_relation_sub_workflow
    • child-side relation cleanup in t_ds_relation_workflow_instance

This DSIP proposes a built-in workflow instance retention and cleanup feature that is safe, observable, opt-in, and reusable by both manual APIs and scheduled cleanup.

Out of scope (NOT included)

  • workflow definition cleanup
  • resource center cleanup
  • workflow definition archive/export
  • generic database partition management
  • cleanup of external artifacts not directly managed by DolphinScheduler
  • soft delete / recycle bin / restore flow
  • master workflow log cleanup and UI workflow log browsing (that is a separate topic, e.g. DSIP-107)

Design Detail

1. Goals

The MVP should provide:

  1. a built-in project opt-in retention policy for historical workflow instances
  2. a reusable internal cleanup service shared by manual delete APIs and scheduled cleanup
  3. manual preview and manual trigger APIs
  4. leader-only scheduled cleanup on the active master
  5. conservative safety controls: final-state only, dry-run, bounded batch size, project opt-in, disabled by default
  6. complete cleanup coverage for workflow-instance-related runtime data owned by DolphinScheduler

2. Current state and main gaps

2.1 Current delete path

Current deletion starts in the API layer:

  • WorkflowInstanceServiceImpl.deleteWorkflowInstanceById(User, Integer) checks project auth and requires a final state
  • WorkflowInstanceServiceImpl.deleteWorkflowInstanceById(int) performs recursive delete
  • TaskInstanceServiceImpl.deleteByWorkflowInstanceId(Integer) removes task logs best-effort and deletes task rows and task-group queue rows
  • AlertDao.deleteByWorkflowInstanceId(Integer) removes alerts and alert send status rows

2.2 Architecture gap

The real delete primitive currently lives in dolphinscheduler-api, not in dolphinscheduler-service. That means scheduled cleanup cannot reuse the same business logic cleanly.

2.3 Coverage gap

The current recursive delete path does not fully own all runtime tables that should be cleaned with a workflow instance family. In particular, the retention design must explicitly cover:

  • t_ds_workflow_instance
  • t_ds_task_instance
  • t_ds_task_group_queue
  • t_ds_task_instance_context
  • t_ds_alert
  • t_ds_alert_send_status
  • t_ds_command
  • t_ds_error_command
  • t_ds_serial_command
  • t_ds_relation_workflow_instance
  • t_ds_relation_sub_workflow

Two relation models already exist in the codebase:

  • t_ds_relation_workflow_instance: runtime parent-task-instance to child-workflow-instance relation
  • t_ds_relation_sub_workflow: runtime parent-task-code to sub-workflow-instance relation, used by dynamic sub-workflow query paths

The cleanup service must treat both tables as runtime metadata that should be deleted when the corresponding workflow instance family is removed.

3. Proposal summary

This DSIP proposes the following architecture:

  1. add a new internal cleanup service in dolphinscheduler-service
  2. refactor existing manual delete APIs to delegate to the new cleanup service
  3. add a project-scoped retention policy table
  4. add leader-only scheduled cleanup in dolphinscheduler-master
  5. add manual preview and manual run APIs in dolphinscheduler-api
  6. use workflow-instance families as the cleanup unit instead of deleting individual rows independently

4. Architecture design

4.1 New cleanup service in dolphinscheduler-service

Add new services, for example:

  • WorkflowInstanceCleanupService
  • WorkflowInstanceRetentionPolicyService

Responsibilities of WorkflowInstanceCleanupService:

  • resolve cleanup candidates
  • expand root workflow instances into complete workflow-instance families
  • validate cleanup eligibility
  • perform batch cleanup of all related runtime artifacts
  • support dry-run / preview mode
  • return structured cleanup summaries

This service should become the single business entry point for workflow instance cleanup.

4.2 Refactor existing manual delete onto the same service

Keep current single-delete and batch-delete APIs, but refactor them into thin wrappers over WorkflowInstanceCleanupService.

That gives DolphinScheduler one canonical delete engine for:

  • current manual instance delete
  • batch instance delete
  • manual retention preview/run
  • scheduled retention cleanup

4.3 Leader-only scheduler in dolphinscheduler-master

Add a new leader-only coordinator in the master, for example:

  • WorkflowInstanceCleanupCoordinator

The coordinator should be started only when MasterCoordinator becomes active, similar to TaskGroupCoordinator and WorkflowSerialCoordinator.

This should not be implemented as a user Quartz schedule. This is an internal maintenance task and belongs to the master control plane.

4.4 Module split

dolphinscheduler-service
  • cleanup business logic
  • candidate selection
  • family expansion
  • delete graph orchestration
  • cleanup summary generation
dolphinscheduler-api
  • permission checks
  • request / response DTOs
  • policy CRUD APIs
  • manual preview / manual run APIs
  • audit / operation logging for manual actions
dolphinscheduler-master
  • leader-only scheduled trigger
  • config binding for cleanup scheduler
  • cleanup metrics emission
  • structured run summary logs
dolphinscheduler-dao
  • retention policy entity / mapper / repository
  • batch delete methods for missing cleanup tables
  • new retention-related indexes

5. Cleanup unit and semantics

5.1 Cleanup unit = workflow-instance family

The scheduled cleanup unit should be a root workflow instance plus all descendant sub-workflow instances.

For scheduled cleanup, candidate selection should start from root workflow instances only:

  • is_sub_workflow = 0
  • final state only
  • end_time is not null

The cleanup service then expands the family using t_ds_relation_workflow_instance.
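
The expansion step can be sketched as a breadth-first traversal over parent-to-child edges. This is only an illustration under stated assumptions: the class name `FamilyExpansionSketch` and the in-memory `childrenByParent` map are hypothetical stand-ins for rows loaded from t_ds_relation_workflow_instance through a repository.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not an existing DolphinScheduler class.
class FamilyExpansionSketch {

    /**
     * Expands a root workflow instance id into the ids of the whole family.
     * childrenByParent stands in for rows of t_ds_relation_workflow_instance,
     * keyed by parent workflow instance id.
     */
    static Set<Integer> expandFamily(int rootId, Map<Integer, List<Integer>> childrenByParent) {
        Set<Integer> family = new LinkedHashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(rootId);
        while (!queue.isEmpty()) {
            int current = queue.poll();
            // The visited check also guards against accidental cycles in relation data.
            if (!family.add(current)) {
                continue;
            }
            for (int child : childrenByParent.getOrDefault(current, List.of())) {
                queue.add(child);
            }
        }
        return family;
    }
}
```

The returned set contains the root first, followed by descendants in breadth-first order, which gives the later delete phase one id list for the whole family.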

5.2 Family eligibility rule

A family is deletable only when all family members satisfy:

  • final state
  • end_time is not null
  • end_time < cutoff

If a root instance is old enough but any descendant sub-workflow is still too new or not in a final state, the whole family must be skipped for that run.

This is safer than partial family cleanup.
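
A minimal sketch of the all-or-nothing eligibility rule follows. The `Member` type and the `isDeletable` method are hypothetical names used for illustration; the real check would read state and end_time from workflow instance rows.

```java
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical sketch: names below are illustrative, not existing DolphinScheduler types.
class FamilyEligibilitySketch {

    /** Minimal view of one workflow instance in a family. */
    static final class Member {
        final boolean finalState;
        final LocalDateTime endTime; // may be null for unfinished instances

        Member(boolean finalState, LocalDateTime endTime) {
            this.finalState = finalState;
            this.endTime = endTime;
        }
    }

    /** A family is deletable only if every member is final, has an end_time, and ended before the cutoff. */
    static boolean isDeletable(List<Member> family, LocalDateTime cutoff) {
        if (family.isEmpty()) {
            return false; // nothing was resolved, so skip rather than delete
        }
        for (Member member : family) {
            if (!member.finalState || member.endTime == null || !member.endTime.isBefore(cutoff)) {
                return false; // one ineligible member blocks the whole family
            }
        }
        return true;
    }
}
```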

5.3 Hard delete for MVP

The MVP should use hard delete, not soft delete.

Reasons:

  • current manual delete is already hard delete
  • soft delete would require pervasive filtering across workflow, task, alert, and query APIs
  • soft delete would increase long-term storage and index cost instead of solving the retention problem
  • recycle-bin and restore semantics require UI and API redesign far beyond the scope of this DSIP

To reduce risk, hard delete is combined with strong safety controls:

  • feature disabled by default
  • project opt-in policy
  • dry-run support
  • final-state only
  • safety lag beyond retention cutoff
  • bounded family count per run

Archive / export support can be added later without blocking the MVP.

6. Policy model

6.1 Global operator config in master

Add a new nested config in MasterConfig, for example master.workflow-instance-cleanup.*.

Suggested config keys:

| Key | Default | Description |
| --- | --- | --- |
| master.workflow-instance-cleanup.enabled | false | Enables the cleanup coordinator |
| master.workflow-instance-cleanup.scan-interval | 1h | Fixed delay between cleanup runs |
| master.workflow-instance-cleanup.default-retention-days | 30 | Default retention used by project policy |
| master.workflow-instance-cleanup.minimum-retention-days | 7 | Lower bound allowed for project policy or manual override |
| master.workflow-instance-cleanup.safety-lag | 1d | Additional delay after retention cutoff |
| master.workflow-instance-cleanup.max-families-per-run | 100 | Upper bound of workflow families per scheduled run |
| master.workflow-instance-cleanup.delete-task-logs | true | Whether scheduled cleanup should attempt physical task-log deletion |
| master.workflow-instance-cleanup.dry-run | false | Global dry-run mode for safe rollout |

Global config is controlled by operators and is not exposed as a user-facing API in the MVP.
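
How retention days and the safety lag might combine into one cutoff can be sketched as below. The formula is an assumption, since this DSIP does not spell it out: an instance becomes eligible once its end_time is older than the retention period plus the safety lag. `RetentionCutoffSketch` is a hypothetical name.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical sketch: the DSIP does not define this formula explicitly.
class RetentionCutoffSketch {

    /**
     * Assumed interpretation: the cutoff is "now" minus the retention period
     * minus the configured safety lag. Instances with end_time before the
     * returned value would be candidates.
     */
    static LocalDateTime cutoff(LocalDateTime now, int retentionDays, Duration safetyLag) {
        return now.minusDays(retentionDays).minus(safetyLag);
    }
}
```

Under this assumption and the suggested defaults (30 days retention, 1d safety lag), a run at 2026-03-01T00:00 would only consider instances that ended before 2026-01-29T00:00.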

6.2 Project opt-in policy table

Add a dedicated table instead of reusing t_ds_project_preference.

t_ds_project_preference currently stores opaque string preferences and is not a good operational table for structured cleanup policies.

Suggested new table:

CREATE TABLE `t_ds_workflow_instance_retention_policy` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `project_code` bigint(20) NOT NULL,
  `enabled` tinyint(4) NOT NULL DEFAULT '0',
  `retention_days` int(11) NOT NULL,
  `delete_task_logs` tinyint(4) NOT NULL DEFAULT '1',
  `create_user_id` int(11) DEFAULT NULL,
  `update_user_id` int(11) DEFAULT NULL,
  `create_time` datetime NOT NULL,
  `update_time` datetime NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_project_code` (`project_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Project policy is opt-in. If no policy row exists for a project, scheduled cleanup does not run for that project.

Project policy rules:

  • enabled = true turns on scheduled cleanup for that project
  • retention_days >= master.workflow-instance-cleanup.minimum-retention-days
  • delete_task_logs can override the global default for that project

7. Index and DAO changes

High-volume cleanup needs better scan and delete support.

7.1 New indexes

Recommended additions:

ALTER TABLE `t_ds_workflow_instance`
  ADD KEY `idx_retention_scan` (`project_code`, `is_sub_workflow`, `state`, `end_time`, `id`);

ALTER TABLE `t_ds_command`
  ADD KEY `idx_workflow_instance_id` (`workflow_instance_id`);

ALTER TABLE `t_ds_error_command`
  ADD KEY `idx_workflow_instance_id` (`workflow_instance_id`);

Notes:

  • t_ds_serial_command already has an index on workflow_instance_id
  • t_ds_relation_workflow_instance already has parent and child indexes
  • t_ds_relation_sub_workflow already has parent and child indexes
  • t_ds_task_instance_context already has a unique key including task_instance_id

7.2 New DAO methods

The cleanup service needs missing batch operations. Add DAO / mapper methods such as:

  • CommandMapper.deleteByWorkflowInstanceIds(List<Integer>)
  • ErrorCommandMapper.deleteByWorkflowInstanceIds(List<Integer>)
  • SerialCommandMapper.deleteByWorkflowInstanceIds(List<Integer>)
  • WorkflowInstanceRelationMapper.deleteByParentWorkflowInstanceIds(List<Integer>)
  • WorkflowInstanceRelationMapper.deleteByWorkflowInstanceIds(List<Integer>)
  • RelationSubWorkflowMapper.deleteByParentWorkflowInstanceIds(List<Integer>)
  • RelationSubWorkflowMapper.deleteBySubWorkflowInstanceIds(List<Integer>)
  • TaskInstanceContextMapper.deleteByTaskInstanceIds(List<Integer>)

The cleanup service should prefer repository-layer batch methods instead of calling raw mappers directly from master or API logic.

8. Cleanup graph and delete order

For a workflow-instance family, the cleanup service should own one canonical delete graph.

8.1 Resolution phase

  1. start from one root workflow instance
  2. expand all descendant workflow instances from t_ds_relation_workflow_instance
  3. collect all workflow instance ids in the family
  4. query all task instances belonging to the family
  5. collect all task instance ids and alert ids

8.2 Delete order

Recommended order:

  1. remove physical task logs via ILogService (best effort)
  2. delete t_ds_task_group_queue by workflow instance ids
  3. delete t_ds_task_instance_context by task instance ids
  4. delete t_ds_task_instance by workflow instance ids
  5. delete t_ds_alert_send_status by alert ids
  6. delete t_ds_alert by workflow instance ids
  7. delete t_ds_serial_command by workflow instance ids
  8. delete t_ds_command by workflow instance ids
  9. delete t_ds_error_command by workflow instance ids
  10. delete t_ds_relation_sub_workflow by parent workflow instance ids and by sub-workflow instance ids
  11. delete t_ds_relation_workflow_instance by parent workflow instance ids and by child workflow instance ids
  12. delete t_ds_workflow_instance by ids

This delete graph explicitly closes the orphan-data gaps in the current implementation.
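
One way to keep the order canonical is to encode it once and let every caller walk the same sequence. The sketch below is illustrative only: `DeleteGraphSketch`, its `Step` enum, and the `executor` callback are hypothetical, and each step would in practice delegate to the matching repository batch-delete method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: step names mirror the tables in the recommended order, not real classes.
class DeleteGraphSketch {

    /** Steps declared in the recommended delete order. */
    enum Step {
        TASK_LOGS,
        TASK_GROUP_QUEUE,
        TASK_INSTANCE_CONTEXT,
        TASK_INSTANCE,
        ALERT_SEND_STATUS,
        ALERT,
        SERIAL_COMMAND,
        COMMAND,
        ERROR_COMMAND,
        RELATION_SUB_WORKFLOW,
        RELATION_WORKFLOW_INSTANCE,
        WORKFLOW_INSTANCE
    }

    /**
     * Walks the steps in declaration order and returns them.
     * In dry-run mode the executor is never called, so nothing is deleted
     * and the returned list is only the plan.
     */
    static List<Step> run(boolean dryRun, Consumer<Step> executor) {
        List<Step> visited = new ArrayList<>();
        for (Step step : Step.values()) {
            if (!dryRun) {
                executor.accept(step);
            }
            visited.add(step);
        }
        return visited;
    }
}
```

Because manual delete, manual run, and scheduled cleanup would all call the same method, a table added to the enum later is covered by every entry point at once.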

8.3 Log deletion policy

Task log deletion should remain best effort.

If a worker or master host is unavailable, the cleanup service should:

  • record the failure in metrics / logs
  • continue deleting metadata rows

Log cleanup failure should not block retention forever.
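
The best-effort behavior can be sketched as a loop that counts failures instead of propagating them. The `logDeleter` callback is a hypothetical stand-in for the real log removal call, and `BestEffortLogDeletionSketch` is not an existing class.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: logDeleter stands in for the real remote log removal call.
class BestEffortLogDeletionSketch {

    /**
     * Attempts to delete every log path and returns how many attempts failed.
     * A failure for one path never prevents the remaining paths, or the later
     * metadata deletion, from being processed.
     */
    static int deleteLogs(List<String> logPaths, Consumer<String> logDeleter) {
        int failures = 0;
        for (String path : logPaths) {
            try {
                logDeleter.accept(path);
            } catch (RuntimeException e) {
                // A real implementation would also log the error and bump a metric here.
                failures++;
            }
        }
        return failures;
    }
}
```

The returned count maps naturally onto a field such as taskLogDeleteFailureCount in the run summary.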

9. API design

9.1 Policy APIs

Get policy

GET /projects/{projectCode}/workflow-instance-retention-policy

Response example:

{
  "enabled": true,
  "retentionDays": 30,
  "deleteTaskLogs": true,
  "minimumRetentionDays": 7,
  "defaultRetentionDays": 30
}

Update policy

PUT /projects/{projectCode}/workflow-instance-retention-policy

Request example:

{
  "enabled": true,
  "retentionDays": 30,
  "deleteTaskLogs": true
}

Validation rules:

  • retentionDays > 0
  • retentionDays >= minimumRetentionDays
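
The two validation rules can be sketched as a small pure function. `RetentionPolicyValidationSketch` and its error messages are hypothetical; the real API would likely map failures onto existing DolphinScheduler status codes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class name and messages are illustrative only.
class RetentionPolicyValidationSketch {

    /** Returns an empty list when the request is valid, otherwise one message per violated rule. */
    static List<String> validate(int retentionDays, int minimumRetentionDays) {
        List<String> errors = new ArrayList<>();
        if (retentionDays <= 0) {
            errors.add("retentionDays must be greater than 0");
        }
        if (retentionDays < minimumRetentionDays) {
            errors.add("retentionDays must be at least " + minimumRetentionDays);
        }
        return errors;
    }
}
```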

9.2 Manual preview API

POST /projects/{projectCode}/workflow-instances/cleanup/preview

Request example:

{
  "retentionDays": 30,
  "limit": 100,
  "deleteTaskLogs": true
}

Response example:

{
  "candidateFamilyCount": 12,
  "candidateWorkflowInstanceCount": 37,
  "candidateTaskInstanceCount": 824,
  "oldestEndTime": "2026-01-01T00:00:00",
  "sampleRootWorkflowInstanceIds": [101, 102, 103],
  "skippedFamilies": {
    "NON_FINAL_MEMBER": 1,
    "RETENTION_NOT_REACHED": 3
  }
}

9.3 Manual run API

POST /projects/{projectCode}/workflow-instances/cleanup/run

Request example:

{
  "dryRun": false,
  "retentionDays": 30,
  "limit": 100,
  "deleteTaskLogs": true
}

Response example:

{
  "dryRun": false,
  "deletedFamilyCount": 10,
  "deletedWorkflowInstanceCount": 31,
  "deletedTaskInstanceCount": 721,
  "taskLogDeleteFailureCount": 2,
  "skippedFamilyCount": 2,
  "durationMillis": 8123
}

9.4 Existing APIs

Keep existing workflow instance delete APIs unchanged from the outside, but re-implement them using WorkflowInstanceCleanupService.

10. Permission and control model

10.1 Global operator control

Operators keep final control because scheduled cleanup only works when:

  • master.workflow-instance-cleanup.enabled=true
  • a project policy explicitly enables cleanup for that project

10.2 Project-side permissions

For the MVP, reuse the existing project read and write permission checks:

  • get policy: project read permission is acceptable
  • update policy: project write permission
  • preview cleanup: project write permission
  • manual run: project write permission

This aligns with current ProjectServiceImpl permission patterns and avoids introducing a new permission matrix in the MVP.

10.3 Existing instance delete permission

Keep current INSTANCE_DELETE permission for the existing single-instance and batch-instance delete endpoints.

11. Scheduler behavior

11.1 Trigger model

WorkflowInstanceCleanupCoordinator runs only on the active master.

It should be started in the same active/standby lifecycle where MasterCoordinator already starts:

  • TaskGroupCoordinator
  • WorkflowSerialCoordinator

11.2 Run algorithm

Each scheduled run should:

  1. load enabled project policies
  2. iterate projects in stable order
  3. for each project, scan root workflow instances ordered by end_time asc, id asc
  4. expand each root into a workflow-instance family
  5. validate family eligibility
  6. execute delete or dry-run summary
  7. stop when max-families-per-run is reached
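
A simplified, single-project sketch of the bounded scan is shown below. It assumes the bound counts families selected for deletion (or dry-run reporting), not families scanned; the DSIP does not settle that detail. `ScheduledRunSketch`, `Summary`, and the `familyEligible` predicate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: covers one project and leaves out expansion and deletion.
class ScheduledRunSketch {

    /** Minimal run summary: which roots were selected and how many families were skipped. */
    static final class Summary {
        final List<Integer> selectedRoots = new ArrayList<>();
        int skippedFamilyCount;
    }

    /**
     * Scans root ids that are already ordered by end_time asc, id asc.
     * Assumption: maxFamiliesPerRun limits selected families, not scanned ones.
     */
    static Summary selectFamilies(List<Integer> rootIdsOldestFirst,
                                  Predicate<Integer> familyEligible,
                                  int maxFamiliesPerRun) {
        Summary summary = new Summary();
        for (Integer rootId : rootIdsOldestFirst) {
            if (summary.selectedRoots.size() >= maxFamiliesPerRun) {
                break; // bound reached, remaining roots wait for the next run
            }
            if (familyEligible.test(rootId)) {
                summary.selectedRoots.add(rootId);
            } else {
                summary.skippedFamilyCount++;
            }
        }
        return summary;
    }
}
```

Roots left unprocessed when the bound is hit stay at the front of the oldest-first order, so the next run picks them up without extra bookkeeping.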

11.3 Transaction boundary

Use one transaction per workflow-instance family.

That gives a good balance:

  • the family remains the atomic cleanup unit
  • a large global transaction is avoided
  • partial failure only affects one family

11.4 Failover behavior

If the active master dies during cleanup:

  • the in-flight family transaction is rolled back by the database if not committed
  • the new leader resumes on the next scheduled run

No extra distributed lease table is required for the MVP because the coordinator is already leader-only.

12. Safety rules

Mandatory safeguards:

  • disabled by default
  • project opt-in only
  • final-state only
  • end_time must be present
  • retentionDays >= minimumRetentionDays
  • configurable safety lag
  • dry-run support
  • one transaction per family
  • bounded families per run
  • eligibility re-check before delete

Optional extensions after the MVP:

  • maintenance-window support
  • cleanup pause switch in UI
  • persistent cleanup run history table

13. Observability

Add metrics in master, for example:

  • ds.workflow.cleanup.run.count
  • ds.workflow.cleanup.run.failure.count
  • ds.workflow.cleanup.run.duration
  • ds.workflow.cleanup.candidate.family.count
  • ds.workflow.cleanup.deleted.workflow.count
  • ds.workflow.cleanup.deleted.task.count
  • ds.workflow.cleanup.log.delete.failure.count
  • ds.workflow.cleanup.skipped.family.count

Each run should also emit one structured summary log containing:

  • trigger type (SCHEDULED / MANUAL)
  • project code
  • retention days
  • dry-run flag
  • deleted counts
  • skipped counts
  • error summary

Persistent run-history storage is not required in the MVP.

14. Detailed implementation plan

Phase 1: internal cleanup refactor

  • add WorkflowInstanceCleanupService in dolphinscheduler-service
  • move canonical delete graph out of WorkflowInstanceServiceImpl
  • add missing DAO batch delete methods
  • refactor existing delete APIs to use the new service

Phase 2: policy and manual APIs

  • add t_ds_workflow_instance_retention_policy
  • add policy service, repository, mapper, entity
  • add preview and manual-run APIs
  • add validation and permission checks

Phase 3: master scheduled cleanup

  • add WorkflowInstanceCleanupCoordinator
  • add cleanup config to MasterConfig
  • integrate with MasterCoordinator
  • add metrics and structured logs

Phase 4: UI and operator visibility

  • project policy UI
  • cleanup preview UI entry
  • operation result display

UI can be delivered in separate PRs after the backend behavior is stable.

15. Alternatives considered

15.1 External scripts only

Rejected.

That is the current situation and is exactly the gap this DSIP aims to close.

15.2 Direct database purge only

Rejected as the core solution.

Database-only purge cannot safely clean DolphinScheduler-managed task logs, runtime relations, and cross-table workflow family metadata.

15.3 Soft delete first, hard delete later

Rejected for the MVP.

This would require broad query-path and UI changes and would turn a retention feature into a much larger product redesign.

15.4 Archive before purge

Valuable, but out of scope for the MVP. The cleanup service should be designed so that an archive/export hook can be added later.

16. Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| Missing a related runtime table | Centralize cleanup graph and add explicit batch delete methods with tests |
| Cleanup puts too much load on DB | Add scan indexes, oldest-first scan, per-family transaction, bounded families per run |
| Hard delete is considered risky | Disable by default, require project opt-in, enforce minimum retention days, support dry-run |
| Leader failover during cleanup | Leader-only coordinator + per-family transaction + next-run retry |
| Worker log deletion may fail | Make log deletion best effort and observable |

Compatibility, Deprecation, and Migration Plan

  • Fully backward compatible by default:
    • existing delete endpoints remain
    • no existing API is removed
    • no workflow execution behavior changes when the feature is disabled
  • Schema changes are additive:
    • new t_ds_workflow_instance_retention_policy table
    • new indexes on existing tables
    • new mapper / repository methods
  • No cross-service RPC compatibility break is required
  • No data migration is required for existing workflow instances
  • Scheduled cleanup is disabled by default and project opt-in, so existing deployments keep the current behavior until operators enable the feature
  • Existing orphan rows from historical manual scripts are not automatically repaired globally, but future cleanup of a workflow family will clean all artifacts covered by the new delete graph

Test Plan

1. Unit tests

  • family expansion from t_ds_relation_workflow_instance
  • root-only candidate selection
  • family eligibility validation
  • minimum retention day validation
  • dry-run summary generation
  • cleanup delete-order orchestration

2. DAO tests

  • new policy table CRUD
  • batch delete by workflow instance ids for command / error command / serial command
  • relation-table delete methods
  • task-instance-context batch delete methods

3. API tests

  • get / update retention policy
  • preview cleanup
  • manual cleanup run
  • permission validation for project read/write access
  • compatibility of existing instance delete APIs after refactor

4. Integration tests

  • scheduled cleanup only runs on active master
  • standby master does not run cleanup
  • active-master failover during cleanup does not corrupt data
  • metrics and structured logs are emitted correctly

5. End-to-end data scenarios

Prepare workflow families that contain:

  • plain workflow instances
  • nested sub-workflows
  • dynamic sub-workflows
  • task logs
  • alerts
  • serial command rows
  • dependent-task contexts in t_ds_task_instance_context

Verify that after cleanup:

  • all workflow-instance rows are deleted
  • all related task / queue / context rows are deleted
  • command / error command / serial command rows are deleted
  • relation-table rows are deleted from both parent and child perspectives
  • task logs are removed when enabled
  • no unexpected rows remain in the covered runtime tables
