[BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model#527
[BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model#527mkuchenbecker wants to merge 23 commits into
Conversation
Introduces the optimizer service module with: - MySQL/H2 schema for table_operations, table_stats, table_stats_history, and table_operations_history - JPA entities with JSON column support (vladmihalcea hibernate-types) - All model/DTO/enum types: OperationType, OperationStatus, TableStats, CompleteOperationRequest, JobResult, OperationMetrics, etc. - JPA AttributeConverters for JobResult and OperationMetrics JSON columns - MapStruct mapper (OptimizerMapper) for entity→DTO conversion - Spring Boot application shell and build wiring (settings.gradle, build.gradle dockerPrereqs) No repositories, controllers, or service layer yet — those follow in subsequent PRs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove OperationMetrics class and converter; stats are read directly from table_stats instead of duplicating into operations - Remove orphanFilesDeleted/orphanBytesDeleted from history entity, DTO, and schema; operation-specific data belongs in the result JSON - Add addedSizeBytes to CommitDelta for tracking write volume - Fix OperationType javadoc to describe current state, not roadmap - Fix TableOperationsHistoryRow javadoc: written on operation complete, not by Spark app directly - Add field comments to all DTOs and request objects Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These fields never belonged in the data model — remove them at the source rather than adding then deleting in a later PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
||
| /** Terminal states for a completed Spark maintenance job. */ | ||
| public enum OperationHistoryStatus { | ||
| SUCCESS, |
There was a problem hiding this comment.
We should have keep the existing status such as canceled, queued etc. These are valid status as some times jobs could not be submitted due to GGW/Yarn issue etc.
There was a problem hiding this comment.
Happy to add these, but it's unclear how they would be populated given the current lifecycle. My general position is to add states as they are actually used.
| private String jobId; | ||
|
|
||
| /** Reserved for future per-operation metadata; currently unused. */ | ||
| private String metrics; |
There was a problem hiding this comment.
Can we have a class instead to capture more info? Or do we plan to capture json string here?
There was a problem hiding this comment.
Claude: Removed the metrics field from the DTO, entity, and schema since it is unused. A -- TODO block comment in the schema notes that per-operation metric columns will be added as operations are onboarded.
| /** Same UUID as the originating {@code table_operations.id}. Set by the caller; not generated. */ | ||
| @Id | ||
| @Column(name = "id", nullable = false, length = 36) | ||
| private String id; |
There was a problem hiding this comment.
Looks like this UUID and generated as part of job submission?
There was a problem hiding this comment.
Claude: Yes, it is a UUID, but it is not generated at submission. It is the same UUID as the originating table_operations.id, set by the Analyzer when the recommendation row is first created. The history row reuses that UUID when the complete endpoint is called, so each cycle is traceable end-to-end by a single id.
| -- Optimizer Service Schema | ||
| -- Compatible with MySQL (production) and H2 in MySQL mode (tests). | ||
| CREATE TABLE IF NOT EXISTS table_operations ( | ||
| id VARCHAR(36) NOT NULL, |
There was a problem hiding this comment.
Can we consider adding indexes for these tables too?
There was a problem hiding this comment.
Plan is to add indexes when query patterns require them — current PKs cover the access paths used by the analyzer and scheduler today. All key columns are VARCHAR/TIMESTAMP, so secondary indexes can be added cheaply once a real query path needs one.
- Widen-to-tighten: VARCHAR(255) -> VARCHAR(128) for database_name and table_name across all entities and the schema, aligning with prod conventions (can always be widened later, not tightened). - Rename databaseId -> databaseName in TableStatsRow, TableStatsHistoryRow, TableStatsDto, TableStatsHistoryDto, and UpsertTableStatsRequest for consistency with the operations entities and DTOs. - Drop the unused metrics field from TableOperationsRow, TableOperationsDto, and the schema. Add a TODO note in the schema that per-operation metric columns will be added as operations are onboarded. - Rename submittedAt -> completedAt in TableOperationsHistoryRow, TableOperationsHistoryDto, and the schema (column submitted_at -> completed_at, index idx_submitted_at -> idx_completed_at). The history row is written when the complete endpoint is called, so the timestamp captures completion; submission time is already on table_operations.scheduled_at. - Change TableStatsHistoryRow.id from BIGINT auto-increment to VARCHAR(36) UUID, set by the caller, matching the other id-bearing entities. - Add @JsonIgnoreProperties(ignoreUnknown = true) to CommitDelta for consistency with TableStats and SnapshotMetrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le_name) Add a composite secondary index on (database_name, table_name) to table_operations_history at the schema and entity layers. This backs a new name-based history-lookup endpoint added on optimizer-2; without the index, the query degrades to a full scan on a table that grows with every operation completion. The other three optimizer tables get no new indexes — no new query patterns on them this round. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…at JPA boundary Forward fix in response to review feedback that data-model decisions belong in this PR (optimizer-0), not in downstream stack layers. Brings the data-model end-state to where optimizer-1+ already are, so the optimizer-0..optimizer-1 diff is just repositories + wiring. - Rename api/model/OperationHistoryStatus → HistoryStatus. - Move api/model/TableStats → model/TableStats (the in-memory stats domain type is used by both entities and DTOs, so it lives in a neutral package rather than under api/model/). - Delete config/JobResultConverter. Entities now store the JobResult as a JSON String column directly; serialization happens at the wire-API boundary via OptimizerMapper helpers. - Switch the operation/status columns on TableOperationsRow and TableOperationsHistoryRow from JPA-bound enums to String. Keeps the entity layer decoupled from wire-API enum identity. - Add String↔OperationType, String↔OperationStatus, String↔HistoryStatus, and String↔JobResult default helpers to OptimizerMapper so MapStruct can bridge entity (String) and DTO (typed) columns. - Update DTO/entity imports to follow the renamed/moved types.
Per-layer types: wire-API enums (api/model/), DB-side String at JPA boundary, and an internal in-memory model layer that is what the analyzer and scheduler operate on. The wire and DB sides convert at their boundary; consumers of the optimizer library work in the internal types. - model/HistoryStatus, model/OperationStatus, model/OperationType: internal enums mirroring the wire-API counterparts. Decoupled so the analyzer/scheduler can evolve their state machines without churning the wire or DB shapes. - model/Table: an OpenHouse table enriched with stats + properties. Built from a TableStatsRow. - model/TableOperation: analyzer's decision-to-schedule + scheduler's unit of work. Constructed from TableOperationsRow or from a Table; converts back via toRow().
| @Builder | ||
| @NoArgsConstructor | ||
| @AllArgsConstructor | ||
| public class TableOperation { |
There was a problem hiding this comment.
This is effectively a duplicate of the API TableOperation, but its kept separate so we can evolve internally without immediately exposing internal details to the customer.
Add idx_toph_optype_uuid_completed on table_operations_history. Drives the correlated MAX(completed_at) subquery in TableOperationsHistoryRepository.findLatestPerTable (introduced in optimizer-1), turning it into an index-only lookup per (operation_type, table_uuid) instead of an O(N²) scan. Lands with the schema in optimizer-0 since the index is part of the data model definition; the query that depends on it lands with the repository in optimizer-1.
Make data types in api/ and model/ self-contained — no cross-layer imports between them and no references into the DB layer. The internal model layer owns conversion to the api edge via a new mapper sub-package. api/ changes: - Add api/model/TableStats (api-layer copy of the JSON payload). - Drop cross-layer imports from TableStatsDto, TableStatsHistoryDto, UpsertTableStatsRequest; they pick up TableStats from the same package. model/ changes: - Add model/JobResult (internal copy of the result payload). - Add model/TableOperationsHistory (internal container mirroring the history-row field set in typed form). - Remove cross-layer factory methods Table.from(TableStatsRow), TableOperation.from(TableOperationsRow), and TableOperation.toRow(). Construction at the DB boundary moves to a future model/mapper/ ModelDbMapper that ships with the db/ rename on optimizer-1. - Add model/mapper/ApiModelMapper — converts api/ DTOs ↔ model/ types. Only place inside model/ where api/ types appear. Per-PR ownership: - api/ and model/ live on this PR. - db/ (currently entity/) and its boundary-side mapper (model/mapper/ModelDbMapper) land on optimizer-1. - The existing api/mapper/OptimizerMapper still references entity/ on this branch; it gets retired on optimizer-2 once the service routes through the new mappers.
The DB layer (entities + api↔db mapper) belongs to optimizer-1, not optimizer-0. optimizer-0 owns only the wire-API surface and the internal model. Delete from this PR: - entity/ package (TableOperationsRow, TableOperationsHistoryRow, TableStatsRow, TableStatsHistoryRow, package-info). - api/mapper/OptimizerMapper — was the api↔entity bridge. With the entity files moving out of this PR and the new model/mapper/ taking over conversion duties, this mapper is no longer needed here. optimizer-1 will re-introduce these as db/ (renamed) with db-side per-layer types and a model/mapper/ModelDbMapper.
The DDL is part of the db/ layer's ownership (optimizer-1). Move the schema file and its schema-init properties out of optimizer-0 so this PR is purely api/ + model/. Delete: - src/main/resources/db/optimizer-schema.sql. - spring.sql.init.mode, spring.sql.init.schema-locations, and spring.jpa.defer-datasource-initialization from application.properties (they reference the deleted schema file). optimizer-1 re-introduces these alongside the db/ entities and repositories.
DB-layer dependencies belong to optimizer-1. With entities, schema, and the api/mapper deleted from this PR, the JPA + MySQL stack is unused — remove the dependency declarations and configuration that referenced them. build.gradle: - Drop spring-boot-starter-data-jpa, mysql-connector-java, the vladmihalcea hibernate-types JSON serializer, and the h2 test runtime. application.properties: - Drop spring.jpa.* and spring.datasource.* lines. Delete services/optimizer/src/test/resources/application-test.properties (H2 test datasource config — re-introduced on optimizer-1 alongside the repositories and repo tests).
No external system creates table operations — operations are written by the in-process analyzer directly through the model layer. The request type has no wire consumer and no internal consumer, so it's dead code. Delete services/optimizer/.../api/model/UpsertTableOperationsRequest.java.
JobResult is removed from the optimizer API. CompleteOperationRequest (user-edited) now carries only operationId + status — the failure detail abstraction has been retired. The internal model and DTOs no longer carry it either, and the type itself is deleted from both api/ and model/. CompleteOperationRequest: - operationId moved from path to body (user manual edit). - jobId field removed. - result field removed. api/model/TableOperationsHistoryDto: - Drop jobId and result fields. model/TableOperationsHistory: - Drop jobId and result fields. model/mapper/ApiModelMapper: - Remove toModelJobResult / toApiJobResult helpers + JobResult import. - toHistory()/toDto() no longer touch jobId or result. Delete: - services/optimizer/.../api/model/JobResult.java - services/optimizer/.../model/JobResult.java Downstream propagation: opt-2's service signature changes (completeOperation now takes only the request body); db/HistoryStatus remains needed on opt-1 but db/JobResult no longer is. See memory/tasks/mkuchenb-optimizer-3-fixes.md for the full propagation list.
Add tableUuid, databaseName, tableName, and operationType to the complete request body. They're debug-only — the server keys lookup off operationId — but preserving them on logs and traces helps an operator diagnose a failing complete call without joining back to the operation row.
Every line in application.properties is run-time config (server.port, spring.application.name, actuator endpoints). optimizer-0 has no controllers and no endpoint to serve — the file is doing nothing here. The first PR that actually runs a web service is optimizer-2. Delete the file from this PR. optimizer-2 will re-introduce it alongside the REST controllers. The OptimizerServiceApplication @SpringBootApplication shell stays on this branch — optimizer-1's repository tests use @SpringBootTest and need an application class to discover.
Prepare model/ for a service-layer rewrite that returns only model/ types (no api/ DTO leakage into the service interface). - model/Table: add `Instant updatedAt`. The service stamps it on every upsert; controllers read it when assembling the wire DTO. - model/TableStatsHistory: new internal-model counterpart to db.TableStatsHistoryRow. Fields mirror the row in internal types (id, tableUuid, databaseName, tableName, stats, recordedAt). - ApiModelMapper: add the missing api↔model conversions that controllers will own once the service drops api/ knowledge — Table ↔ TableStatsDto, TableStatsHistory ↔ TableStatsHistoryDto, and toTable(tableUuid, UpsertTableStatsRequest).
Several fields under api/model/ and model/ were left undocumented in the earlier per-layer-types passes. Audit + fill them in: api/model/TableOperationsHistoryDto: databaseName, tableName, operationType — add display/role docs. api/model/HistoryStatus: SUCCESS, FAILED — add enum-value docs. api/model/TableStats inner classes: - SnapshotMetrics: clusterId, tableVersion, tableLocation, tableSizeBytes — add field docs. - CommitDelta: numFilesAdded, numFilesDeleted, addedSizeBytes, deletedSizeBytes — add field docs. model/Table: tableUuid, databaseName, tableId, tableProperties, stats — add field docs. model/TableStats: same field-doc additions on SnapshotMetrics and CommitDelta as the api/ counterpart. model/OperationStatus: PENDING, SCHEDULING, SCHEDULED, CANCELED — add enum-value docs. model/OperationType: ORPHAN_FILES_DELETION — add enum-value doc. model/HistoryStatus: SUCCESS, FAILED — add enum-value docs. model/TableStatsHistory: id, tableUuid, databaseName, tableName — add field docs.
clusterId is per-table-immutable in OpenHouse — it never changes after the table is created — so persisting and transmitting it on every snapshot is dead weight. Remove from the wire and internal representations. - api/model/TableStats.SnapshotMetrics: drop clusterId. - model/TableStats.SnapshotMetrics: drop clusterId. - model/mapper/ApiModelMapper: drop the clusterId hop in toModelSnapshot and toApiSnapshot.
… ApiModelMapper Replace the api/model boundary mapper with conversion methods on the types themselves. The api layer now imports model/ directly via to/from methods — controllers and other api-edge callers no longer inject a mapper bean. The dependency direction is a strict downward chain: api → model → db api types know about model types (and call model methods); model types know about db types (next round). db remains import-free. No central mapper, no risk of a cycle through a hub class. api/model/* changes (each gets a `toModel()` instance method + a static `fromModel(...)` factory): - TableOperationsDto ↔ model.TableOperation. - TableOperationsHistoryDto ↔ model.TableOperationsHistory. - TableStatsDto ↔ model.Table. - TableStatsHistoryDto ↔ model.TableStatsHistory. - UpsertTableStatsRequest → model.Table (one-way; takes the path-var tableUuid; updatedAt is server-stamped). - TableStats (+ SnapshotMetrics + CommitDelta inner) ↔ model.TableStats. - OperationType / OperationStatus / HistoryStatus (api enums) ↔ model enums. CompleteOperationRequest keeps its fields plain — callers extract `operationId` and `status.toModel()` directly; no wrapper needed. Delete services/optimizer/.../model/mapper/ApiModelMapper.java.
… to TableStats model.TableStats now carries its own identity (tableUuid, databaseName, tableName) and metadata (tableProperties, updatedAt) alongside the snapshot + delta payload. Consumers no longer need an outer wrapper to know which table the stats belong to. api.TableStatsDto.toModel() and api.UpsertTableStatsRequest.toModel() now return model.TableStats (was model.Table). The two types only happened to have the same shape — semantically a DTO for stats is stats, not a table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ationsHistory Moved down from opt-2. The service-layer code (opt-2) uses .toBuilder() on both types; the lombok annotation that enables it belongs on the PR that owns model/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optimizer Stack
Summary
PR 0 of N in the optimizer stack.
Overall Project
Service Design doc.
Introduces the optimizer service API and internal model
Changes
Testing Done
This PR contains only the data model (entities, DTOs, converters). Repository tests follow in PR 1. Verified:
./gradlew :services:optimizer:compileJavapasses./gradlew compileJava(full project) passes with no regressionsAdditional Information