[BDP-102028] feat(optimizer): [1/N] Optimizer Database by mkuchenbecker · Pull Request #530 · linkedin/openhouse

mkuchenbecker · 2026-04-06T18:01:58Z

Optimizer Stack

PR	Content
#527	Data Model
#530 (this)	Database Repos
#531	REST service
#533	Analyzer app
#534	Scheduler app
#tbd	Spark BatchedOFD app
#tbd	Infra, docker-compose, smoke test

Summary

PR 1 of N in the optimizer stack.
Overall Project
Service Design doc.

Spring Data JPA repositories for all four optimizer tables with filtered query support, plus tests exercising save/find, filtered queries, upsert semantics, and append-only history.

Changes

Repositories: TableOperationsRepository, TableOperationsHistoryRepository, TableStatsRepository, TableStatsHistoryRepository — each with JPQL filtered query methods.

Tests: Repository tests for all four tables plus OptimizerServiceContextTest verifying the Spring context loads.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

./gradlew :services:optimizer:test — all tests pass (H2 in MySQL mode).

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

Spring Data JPA repositories for all four optimizer tables with filtered query support. Includes tests exercising save/find, filtered queries, upsert semantics, and append-only history. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address PR review comments: rename findFiltered → find across all repos, remove redundant findByTableUuid/findByTableUuidSince from history repos, add explicit assertion to context test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Shared JPA entities and repositories for optimizer apps (analyzer, scheduler). All repos expose a single find method with optional filters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Propagate CompleteOperationRequest orphan field removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abhisheknath2011 · 2026-04-11T01:47:21Z

+
+dependencies {
+  implementation 'org.springframework.boot:spring-boot-starter-data-jpa:2.7.8'
+  implementation 'com.vladmihalcea:hibernate-types-55:2.21.1'


Where exactly this dependency is used? Seems like new dependency.

Claude: It is used by TableStatsRow in this shared module. The @TypeDef(name = "json", typeClass = JsonStringType.class) and @Type(type = "json") annotations on TableStatsRow.stats and TableStatsRow.tableProperties resolve to com.vladmihalcea.hibernate.type.json.JsonStringType, which lives in this dependency. It is the same library already used by services/optimizer for the same purpose.

- Repositories: update JPQL and parameter names to match the renamed entity fields (databaseName, completedAt). Change TableOperationsHistoryRepository and TableStatsHistoryRepository ID type parameter from Long to String to match the entity PK (UUID set by the caller, not auto-generated). - Tests: update builders and getters to use the renamed fields (databaseName, completedAt). Replace the autoIncrementId test with callerSetIdIsPreserved which verifies the caller-set UUID round-trips through save/findById. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y row Address PR #530 review feedback: the lightweight read-side TableOperationHistoryRow in the apps/optimizer shared module did not surface the denormalized database_name and table_name columns, even though the underlying schema carries them. Add them so analyst-style queries from the analyzer/scheduler side can read operation history without joining back to table_operations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

abhisheknath2011 · 2026-05-13T04:55:10Z

+  public static class CommitDelta {
+    private Long numFilesAdded;
+    private Long numFilesDeleted;
+    private Long deletedSizeBytes;


Should we capture additional stats like added file size, data file size etc?

Agreed, yes. I didn't want to add all stats at this time, mainly because the list should be driven from what we need to make decisions for each operation.

Do you think we should get to comprehensive set now or are you fine deferring until they are being populated / used?

If I fast-forward a year, and we are onboarding a new operation, we might need to ingest a new stat, and consume it from a new analyzer, bin pack on a new stat with the scheduler. My main thought was to do the same for the operations we onboard in the short term so we can make the onboarding plan well-tread.

The metrics you mention will absolutely be used so I'm happy to add them now though.

abhisheknath2011 · 2026-05-13T05:01:21Z

+ * (same UUID as the originating {@code table_operations.id}).
+ */
+@Repository
+public interface TableOperationsHistoryRepository


Is this duplicate of TableOperationHistoryRepository?

It was renamed, but claude made teh change at the wrong pr layer and left the old one. I have updated.

Add idx_toph_optype_uuid_completed (operation_type, table_uuid, completed_at) on table_operations_history. TableOperationHistoryRepository.findLatestPerTable uses a correlated MAX(completed_at) subquery; without this index it degenerates to O(N²) and does not complete at 1M-row history scale. With it the inner subquery becomes an index-only lookup per outer row. Update the repo method's javadoc to point at the new index by name and drop the resolved TODO. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Enables per-database iteration in the analyzer. Returns the bounded set of database_name values present in table_stats; the analyzer uses it to drive the outer loop when no specific databaseName filter is supplied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move Table, TableOperation, OperationType, OperationStatus, HistoryStatus from the analyzer-internal package into the shared apps/optimizer module. The scheduler will consume the same domain types as the analyzer. Per-layer types still hold (wire-API, internal model, DB each define their own representation); this just consolidates the internal layer so multiple internal consumers (analyzer, scheduler) share one set of classes. TableOperation gains a nullable, non-persisted fileCount field. Consumers that need it (OFD bin-packing) populate it at read time from table_stats; the DB row does not carry it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…; add TableStatsHistory Aligns apps/optimizer with the SQL table names (table_operations, table_operations_history) and the existing services/optimizer convention: - TableOperationRow → TableOperationsRow - TableOperationHistoryRow → TableOperationsHistoryRow - TableOperationHistoryRepository → TableOperationsHistoryRepository Adds the missing TableStatsHistoryRow + TableStatsHistoryRepository so apps/optimizer is a complete entity set covering all four optimizer DB tables. services/optimizer will consume these in a follow-up commit on optimizer-2 (the services-side duplicates will be deleted). Adds an explanatory javadoc on TableOperationsRow.version documenting the application-level optimistic-concurrency-control role used by the scheduler's CAS transitions (resolves PR #530 thread 3231557313). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

apps/optimizer was a misplaced shared library duplicating the JPA layer. services/optimizer is the canonical optimizer module — schedulers and analyzers embed it directly as a library. This commit consolidates by moving the JPA entities, repositories, and in-memory domain model from apps/optimizer into services/optimizer, deleting the apps module, and updating the service-side wiring + tests accordingly. - git mv 13 files (entities/repos/model) from apps/optimizer to services/optimizer; preserves history. - Delete services-side pre-R7 duplicates: 4 entities, 4 repos, the duplicate api/model/TableStats DTO, the now-unneeded JobResultConverter. - Rename services-side wire-API enum OperationHistoryStatus → HistoryStatus. - Drop the apps/optimizer module entry from settings.gradle. - OptimizerMapper: add String↔OperationType, String↔OperationStatus, String↔HistoryStatus, String↔JobResult default helpers so MapStruct can bridge the entity (String at JPA boundary) and the wire DTOs. - Update DTOs that import TableStats/HistoryStatus to the new package locations. - Rewrite repo tests against the simplified history repo shape and fix a stale find(...) positional-arg signature in the operations repo test.

@EqualsAndHashCode

R7-1 imported the looser apps-side variant of TableStatsRow, TableStatsHistoryRow, and model/TableStats into the services-side paths, regressing the locked-down shape that optimizer-0 had. R8-1's git mv carried the regression forward. This commit makes optimizer-1's HEAD match optimizer-0's canonical shape so the optimizer-0..optimizer-1 diff no longer shows ghost model edits. - TableStatsRow: restore @EqualsAndHashCode, AccessLevel.PROTECTED on NoArgsConstructor + AllArgsConstructor, and toBuilder=true on @builder. Drop @Setter (no callers — repo tests and downstream consumers use the builder). - TableStatsHistoryRow: restore the dropped "can" in the javadoc. - model/TableStats: restore @JsonIgnoreProperties(ignoreUnknown = true) on the outer class + both inner classes, and restore the CommitDelta.addedSizeBytes field that R7-1 dropped.

@Enumerated

optimizer-0 retired entity/, the schema, JPA/MySQL deps, and the api/mapper. This PR brings the DB layer back as db/ with its own self-contained types and a model↔db boundary mapper. db/ package: - TableOperationsRow, TableOperationsHistoryRow, TableStatsRow, TableStatsHistoryRow — JPA entities (same field set as the pre-deletion entity/ versions, with two exceptions: enum fields on the operations rows are now typed db/-side enums via @Enumerated(STRING), and TableOperationsHistoryRow loses the jobId/result columns since they were removed from the wire on optimizer-0). - OperationType, OperationStatus, HistoryStatus — db-layer enums. - TableStats (+ inner SnapshotMetrics, CommitDelta) — db-layer JSON payload, mirrors the model/ + api/ counterparts in shape but is its own class. model/mapper/ModelDbMapper: - Translates between model/ domain objects and db/ rows. - Lives in model/ per the boundary rule (model/ owns conversions to both edges; api/, model/, db/ data types are self-contained). Repositories: imports switched to db/; find() and findLatestPerTable take typed db enums instead of String. Repository tests: builders pass typed db enums; remove jobId/result fields no longer on TableOperationsHistoryRow. Schema (db/optimizer-schema.sql): restored. table_operations_history no longer has job_id / result columns. The idx_toph_optype_uuid_completed index for findLatestPerTable is preserved. build.gradle: restore spring-boot-starter-data-jpa, hibernate-types, mysql-connector-java, h2 dependencies. application-test.properties: restored (H2 test datasource).

…columns The DB layer no longer mirrors the wire-side TableStats JSON envelope. Instead the two structurally-separate concepts inside it — point-in-time snapshot metrics and per-commit delta counters — are persisted as two independent JSON columns. Per-layer decoupling: the api/ envelope can evolve without forcing the DB column shape to change in lockstep. Tables and class names are unchanged: table_stats / table_stats_history on the SQL side; TableStatsRow / TableStatsHistoryRow on the Java side. Changes: - Delete db/TableStats (the envelope wrapper is no longer needed). - Add db/SnapshotMetrics (plain POJO; serialized into the `snapshot` JSON column). - Add db/CommitDeltaMetrics (plain POJO; serialized into the `delta` JSON column). - TableStatsRow: replace `stats: TableStats` with `snapshot: SnapshotMetrics` and `delta: CommitDeltaMetrics`. - TableStatsHistoryRow: same split. - Schema: replace `stats TEXT` with `snapshot TEXT` and `delta TEXT` on both tables. - ModelDbMapper: split/join at the boundary. New helpers `toDbSnapshot`, `toDbDelta`, `joinStats` translate between the single model-layer TableStats and the two DB columns. `toStatsHistoryRow` projects a TableStats into the two-column row. - Repository tests: build rows with the new two-field shape.

table_stats is the current-state row (one per table). Per-commit deltas are an append-only history concern and belong only to TableStatsHistoryRow. Storing a delta on the current-state row implied an aggregation that isn't actually performed. - TableStatsRow: remove the `delta` field. - table_stats schema: drop the `delta` column. - ModelDbMapper.toTable: project only snapshot to model.TableStats; history-only deltas remain in TableStatsHistoryRow. - TableStatsRepositoryTest: drop .delta(...) builder usage.

Round out the model↔db boundary for the upcoming service-layer rewrite that returns only internal-model types: - toTable: stamp model.Table.updatedAt from the row's updated_at column so the model carries the freshness needed by callers without leaking the row. - toStatsHistory: new — db.TableStatsHistoryRow → model.TableStatsHistory. Joins the row's snapshot + delta columns into the model's single TableStats payload.

The optimizer-1 db/ rewrite accidentally dropped the batch CAS helpers used by the scheduler. Restore them with db/-typed enum parameters and JPQL queries that compare against fully-qualified db.OperationStatus constants. - markSchedulingBatch(ids, scheduledAt): PENDING → SCHEDULING. - markScheduledBatch(ids, jobId): SCHEDULING → SCHEDULED. - markPendingBatch(ids): SCHEDULING → PENDING (job-launch failure retry). - cancelDuplicatePendingBatch(operationType, keepIds): drop dupe PENDING rows for an operation type, keeping the supplied IDs.

Two cleanups on the DB layer, plus a doc audit. clusterId removal: - db/SnapshotMetrics: drop clusterId. - model/mapper/ModelDbMapper: drop clusterId from toModelSnapshot and toDbSnapshot. - Repository tests: drop .clusterId("cl1") from builders. (The api/ and model/ copies were retired in the prior optimizer-0 commit; this completes the removal at the db edge.) version removal: - db/TableOperationsRow: drop the `version` field. The batch CAS pattern's atomicity comes from filtering on `status` (PENDING → SCHEDULING is unambiguous on status alone); the version bump was decorative. - table_operations schema: drop the `version BIGINT` column. - TableOperationsRepository: remove `r.version = r.version + 1` from markSchedulingBatch / markScheduledBatch / markPendingBatch query strings. - model/mapper/ModelDbMapper.toRow: stop initializing version on the row builder. Doc audit on db/: - db/SnapshotMetrics, db/CommitDeltaMetrics: doc every field. - db/HistoryStatus, db/OperationStatus, db/OperationType: doc every enum value. - db/TableOperationsRow, db/TableOperationsHistoryRow, db/TableStatsRow, db/TableStatsHistoryRow: doc every field.

…e ModelDbMapper Replace the model/db boundary mapper with conversion methods on the model types themselves. Same pattern that opt-0 just applied at the api↔model boundary — each layer's type carries the to/from methods for the layer below. The dependency chain after this commit: api → model → db api/* → model/* (added on opt-0). model/* → db/* (this commit). db/* still imports nothing — bottom of the chain. model/* changes (each gets a `toRow()` instance method + a static `fromRow(...)` factory): - Table ↔ db.TableStatsRow (current-state row; snapshot only, delta lives on history rows). - TableOperation ↔ db.TableOperationsRow. - TableOperationsHistory ↔ db.TableOperationsHistoryRow. - TableStatsHistory ↔ db.TableStatsHistoryRow (joins/splits the snapshot + delta columns). - TableStats inner: SnapshotMetrics ↔ db.SnapshotMetrics, CommitDelta ↔ db.CommitDeltaMetrics. TableStats itself exposes toSnapshotRow() / toDeltaRow() for the split-write side and a static fromRows(snapshot, delta) for the join-read side. - OperationType / OperationStatus / HistoryStatus (model enums) ↔ db enums. Delete services/optimizer/.../model/mapper/ModelDbMapper.java.

Enriches model.TableStats with identity (tableUuid, databaseName, tableName) and metadata (tableProperties, updatedAt), and reroutes the api DTOs' toModel/fromModel pair to model.TableStats. opt-1's existing toSnapshotRow / toDeltaRow / fromRows helpers are preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TableStats.toRow() / fromRow() let the service operate purely on the self-describing model.TableStats type instead of going through Table. Existing toSnapshotRow / toDeltaRow / fromRows helpers are preserved for the history path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mkuchenbecker mentioned this pull request Apr 6, 2026

[BDP-102028] feat(optimizer): [2/N] Optimizer REST Service and Controller #531

Open

17 tasks

mkuchenbecker marked this pull request as draft April 6, 2026 18:02

mkuchenbecker mentioned this pull request Apr 6, 2026

[BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model #527

Open

17 tasks