Skip to content

[BDP-102028] feat(optimizer): [1/N] Optimizer Database#530

Open
mkuchenbecker wants to merge 37 commits into
mkuchenb/optimizer-0from
mkuchenb/optimizer-1
Open

[BDP-102028] feat(optimizer): [1/N] Optimizer Database#530
mkuchenbecker wants to merge 37 commits into
mkuchenb/optimizer-0from
mkuchenb/optimizer-1

Conversation

@mkuchenbecker
Copy link
Copy Markdown
Collaborator

@mkuchenbecker mkuchenbecker commented Apr 6, 2026

Optimizer Stack

PR Content
#527 Data Model
#530 (this) Database Repos
#531 REST service
#533 Analyzer app
#534 Scheduler app
#tbd Spark BatchedOFD app
#tbd Infra, docker-compose, smoke test

Summary

PR 1 of N in the optimizer stack.
Overall Project
Service Design doc.

Spring Data JPA repositories for all four optimizer tables with filtered query support, plus tests exercising save/find, filtered queries, upsert semantics, and append-only history.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Repositories: TableOperationsRepository, TableOperationsHistoryRepository, TableStatsRepository, TableStatsHistoryRepository — each with JPQL filtered query methods.

Tests: Repository tests for all four tables plus OptimizerServiceContextTest verifying the Spring context loads.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

./gradlew :services:optimizer:test — all tests pass (H2 in MySQL mode).

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

Spring Data JPA repositories for all four optimizer tables with
filtered query support. Includes tests exercising save/find,
filtered queries, upsert semantics, and append-only history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mkuchenbecker and others added 3 commits April 6, 2026 11:35
Address PR review comments: rename findFiltered → find across all repos,
remove redundant findByTableUuid/findByTableUuidSince from history repos,
add explicit assertion to context test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Shared JPA entities and repositories for optimizer apps (analyzer,
scheduler). All repos expose a single find method with optional filters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Propagate CompleteOperationRequest orphan field removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mkuchenbecker mkuchenbecker changed the title feat(optimizer): add repositories and repository tests feat(optimizer): [1/N] Optimizer Repository Apr 6, 2026
@mkuchenbecker mkuchenbecker changed the title feat(optimizer): [1/N] Optimizer Repository feat(optimizer): [2/N] Optimizer Repository Apr 6, 2026
@mkuchenbecker mkuchenbecker marked this pull request as ready for review April 6, 2026 19:46
@mkuchenbecker mkuchenbecker changed the title feat(optimizer): [2/N] Optimizer Repository feat(optimizer): [1/N] Optimizer Repository Apr 6, 2026
Comment thread apps/optimizer/build.gradle Outdated

dependencies {
implementation 'org.springframework.boot:spring-boot-starter-data-jpa:2.7.8'
implementation 'com.vladmihalcea:hibernate-types-55:2.21.1'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where exactly this dependency is used? Seems like new dependency.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: It is used by TableStatsRow in this shared module. The @TypeDef(name = "json", typeClass = JsonStringType.class) and @Type(type = "json") annotations on TableStatsRow.stats and TableStatsRow.tableProperties resolve to com.vladmihalcea.hibernate.type.json.JsonStringType, which lives in this dependency. It is the same library already used by services/optimizer for the same purpose.

mkuchenbecker and others added 5 commits April 30, 2026 09:55
- Repositories: update JPQL and parameter names to match the
  renamed entity fields (databaseName, completedAt). Change
  TableOperationsHistoryRepository and TableStatsHistoryRepository
  ID type parameter from Long to String to match the entity PK
  (UUID set by the caller, not auto-generated).
- Tests: update builders and getters to use the renamed fields
  (databaseName, completedAt). Replace the autoIncrementId test
  with callerSetIdIsPreserved which verifies the caller-set UUID
  round-trips through save/findById.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y row

Address PR #530 review feedback: the lightweight read-side
TableOperationHistoryRow in the apps/optimizer shared module did
not surface the denormalized database_name and table_name columns,
even though the underlying schema carries them. Add them so
analyst-style queries from the analyzer/scheduler side can read
operation history without joining back to table_operations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
public static class CommitDelta {
private Long numFilesAdded;
private Long numFilesDeleted;
private Long deletedSizeBytes;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we capture additional stats like added file size, data file size etc?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, yes. I didn't want to add all stats at this time, mainly because the list should be driven from what we need to make decisions for each operation.

Do you think we should get to comprehensive set now or are you fine deferring until they are being populated / used?

If I fast-forward a year, and we are onboarding a new operation, we might need to ingest a new stat, and consume it from a new analyzer, bin pack on a new stat with the scheduler. My main thought was to do the same for the operations we onboard in the short term so we can make the onboarding plan well-tread.

The metrics you mention will absolutely be used so I'm happy to add them now though.

* (same UUID as the originating {@code table_operations.id}).
*/
@Repository
public interface TableOperationsHistoryRepository
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this duplicate of TableOperationHistoryRepository?

Copy link
Copy Markdown
Collaborator Author

@mkuchenbecker mkuchenbecker May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was renamed, but claude made teh change at the wrong pr layer and left the old one. I have updated.

Add idx_toph_optype_uuid_completed (operation_type, table_uuid, completed_at)
on table_operations_history. TableOperationHistoryRepository.findLatestPerTable
uses a correlated MAX(completed_at) subquery; without this index it degenerates
to O(N²) and does not complete at 1M-row history scale. With it the inner
subquery becomes an index-only lookup per outer row.

Update the repo method's javadoc to point at the new index by name and drop
the resolved TODO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mkuchenbecker mkuchenbecker changed the title feat(optimizer): [1/N] Optimizer Repository [BDP-102028] feat(optimizer): [1/N] Optimizer Repository May 13, 2026
mkuchenbecker and others added 11 commits May 13, 2026 11:46
Enables per-database iteration in the analyzer. Returns the bounded set of
database_name values present in table_stats; the analyzer uses it to drive
the outer loop when no specific databaseName filter is supplied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move Table, TableOperation, OperationType, OperationStatus, HistoryStatus
from the analyzer-internal package into the shared apps/optimizer module.
The scheduler will consume the same domain types as the analyzer.

Per-layer types still hold (wire-API, internal model, DB each define their
own representation); this just consolidates the internal layer so multiple
internal consumers (analyzer, scheduler) share one set of classes.

TableOperation gains a nullable, non-persisted fileCount field. Consumers
that need it (OFD bin-packing) populate it at read time from table_stats;
the DB row does not carry it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…; add TableStatsHistory

Aligns apps/optimizer with the SQL table names (table_operations,
table_operations_history) and the existing services/optimizer convention:

- TableOperationRow → TableOperationsRow
- TableOperationHistoryRow → TableOperationsHistoryRow
- TableOperationHistoryRepository → TableOperationsHistoryRepository

Adds the missing TableStatsHistoryRow + TableStatsHistoryRepository so
apps/optimizer is a complete entity set covering all four optimizer DB
tables. services/optimizer will consume these in a follow-up commit on
optimizer-2 (the services-side duplicates will be deleted).

Adds an explanatory javadoc on TableOperationsRow.version documenting the
application-level optimistic-concurrency-control role used by the
scheduler's CAS transitions (resolves PR #530 thread 3231557313).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apps/optimizer was a misplaced shared library duplicating the JPA layer.
services/optimizer is the canonical optimizer module — schedulers and
analyzers embed it directly as a library. This commit consolidates by
moving the JPA entities, repositories, and in-memory domain model from
apps/optimizer into services/optimizer, deleting the apps module, and
updating the service-side wiring + tests accordingly.

- git mv 13 files (entities/repos/model) from apps/optimizer to
  services/optimizer; preserves history.
- Delete services-side pre-R7 duplicates: 4 entities, 4 repos, the
  duplicate api/model/TableStats DTO, the now-unneeded
  JobResultConverter.
- Rename services-side wire-API enum OperationHistoryStatus →
  HistoryStatus.
- Drop the apps/optimizer module entry from settings.gradle.
- OptimizerMapper: add String↔OperationType, String↔OperationStatus,
  String↔HistoryStatus, String↔JobResult default helpers so MapStruct
  can bridge the entity (String at JPA boundary) and the wire DTOs.
- Update DTOs that import TableStats/HistoryStatus to the new
  package locations.
- Rewrite repo tests against the simplified history repo shape and
  fix a stale find(...) positional-arg signature in the operations
  repo test.
R7-1 imported the looser apps-side variant of TableStatsRow,
TableStatsHistoryRow, and model/TableStats into the services-side
paths, regressing the locked-down shape that optimizer-0 had. R8-1's
git mv carried the regression forward. This commit makes
optimizer-1's HEAD match optimizer-0's canonical shape so the
optimizer-0..optimizer-1 diff no longer shows ghost model edits.

- TableStatsRow: restore @EqualsAndHashCode, AccessLevel.PROTECTED
  on NoArgsConstructor + AllArgsConstructor, and toBuilder=true on
  @builder. Drop @Setter (no callers — repo tests and downstream
  consumers use the builder).
- TableStatsHistoryRow: restore the dropped "can" in the javadoc.
- model/TableStats: restore @JsonIgnoreProperties(ignoreUnknown =
  true) on the outer class + both inner classes, and restore the
  CommitDelta.addedSizeBytes field that R7-1 dropped.
@mkuchenbecker mkuchenbecker changed the title [BDP-102028] feat(optimizer): [1/N] Optimizer Repository [BDP-102028] feat(optimizer): [1/N] Optimizer Database May 14, 2026
mkuchenbecker and others added 14 commits May 14, 2026 14:04
optimizer-0 retired entity/, the schema, JPA/MySQL deps, and the
api/mapper. This PR brings the DB layer back as db/ with its own
self-contained types and a model↔db boundary mapper.

db/ package:
- TableOperationsRow, TableOperationsHistoryRow, TableStatsRow,
  TableStatsHistoryRow — JPA entities (same field set as the
  pre-deletion entity/ versions, with two exceptions: enum fields
  on the operations rows are now typed db/-side enums via
  @Enumerated(STRING), and TableOperationsHistoryRow loses the
  jobId/result columns since they were removed from the wire on
  optimizer-0).
- OperationType, OperationStatus, HistoryStatus — db-layer enums.
- TableStats (+ inner SnapshotMetrics, CommitDelta) — db-layer
  JSON payload, mirrors the model/ + api/ counterparts in shape
  but is its own class.

model/mapper/ModelDbMapper:
- Translates between model/ domain objects and db/ rows.
- Lives in model/ per the boundary rule (model/ owns conversions
  to both edges; api/, model/, db/ data types are self-contained).

Repositories: imports switched to db/; find() and findLatestPerTable
take typed db enums instead of String.

Repository tests: builders pass typed db enums; remove jobId/result
fields no longer on TableOperationsHistoryRow.

Schema (db/optimizer-schema.sql): restored. table_operations_history
no longer has job_id / result columns. The
idx_toph_optype_uuid_completed index for findLatestPerTable is
preserved.

build.gradle: restore spring-boot-starter-data-jpa, hibernate-types,
mysql-connector-java, h2 dependencies.

application-test.properties: restored (H2 test datasource).
…columns

The DB layer no longer mirrors the wire-side TableStats JSON envelope.
Instead the two structurally-separate concepts inside it — point-in-time
snapshot metrics and per-commit delta counters — are persisted as two
independent JSON columns. Per-layer decoupling: the api/ envelope can
evolve without forcing the DB column shape to change in lockstep.

Tables and class names are unchanged: table_stats / table_stats_history
on the SQL side; TableStatsRow / TableStatsHistoryRow on the Java side.

Changes:
- Delete db/TableStats (the envelope wrapper is no longer needed).
- Add db/SnapshotMetrics (plain POJO; serialized into the `snapshot`
  JSON column).
- Add db/CommitDeltaMetrics (plain POJO; serialized into the `delta`
  JSON column).
- TableStatsRow: replace `stats: TableStats` with `snapshot: SnapshotMetrics`
  and `delta: CommitDeltaMetrics`.
- TableStatsHistoryRow: same split.
- Schema: replace `stats TEXT` with `snapshot TEXT` and `delta TEXT` on
  both tables.
- ModelDbMapper: split/join at the boundary. New helpers `toDbSnapshot`,
  `toDbDelta`, `joinStats` translate between the single model-layer
  TableStats and the two DB columns. `toStatsHistoryRow` projects a
  TableStats into the two-column row.
- Repository tests: build rows with the new two-field shape.
table_stats is the current-state row (one per table). Per-commit
deltas are an append-only history concern and belong only to
TableStatsHistoryRow. Storing a delta on the current-state row
implied an aggregation that isn't actually performed.

- TableStatsRow: remove the `delta` field.
- table_stats schema: drop the `delta` column.
- ModelDbMapper.toTable: project only snapshot to model.TableStats;
  history-only deltas remain in TableStatsHistoryRow.
- TableStatsRepositoryTest: drop .delta(...) builder usage.
Round out the model↔db boundary for the upcoming service-layer
rewrite that returns only internal-model types:
- toTable: stamp model.Table.updatedAt from the row's updated_at
  column so the model carries the freshness needed by callers
  without leaking the row.
- toStatsHistory: new — db.TableStatsHistoryRow → model.TableStatsHistory.
  Joins the row's snapshot + delta columns into the model's single
  TableStats payload.
The optimizer-1 db/ rewrite accidentally dropped the batch CAS
helpers used by the scheduler. Restore them with db/-typed enum
parameters and JPQL queries that compare against fully-qualified
db.OperationStatus constants.

- markSchedulingBatch(ids, scheduledAt): PENDING → SCHEDULING.
- markScheduledBatch(ids, jobId): SCHEDULING → SCHEDULED.
- markPendingBatch(ids): SCHEDULING → PENDING (job-launch failure
  retry).
- cancelDuplicatePendingBatch(operationType, keepIds): drop dupe
  PENDING rows for an operation type, keeping the supplied IDs.
Two cleanups on the DB layer, plus a doc audit.

clusterId removal:
- db/SnapshotMetrics: drop clusterId.
- model/mapper/ModelDbMapper: drop clusterId from toModelSnapshot
  and toDbSnapshot.
- Repository tests: drop .clusterId("cl1") from builders.
(The api/ and model/ copies were retired in the prior optimizer-0
commit; this completes the removal at the db edge.)

version removal:
- db/TableOperationsRow: drop the `version` field. The batch CAS
  pattern's atomicity comes from filtering on `status` (PENDING →
  SCHEDULING is unambiguous on status alone); the version bump was
  decorative.
- table_operations schema: drop the `version BIGINT` column.
- TableOperationsRepository: remove `r.version = r.version + 1`
  from markSchedulingBatch / markScheduledBatch / markPendingBatch
  query strings.
- model/mapper/ModelDbMapper.toRow: stop initializing version on
  the row builder.

Doc audit on db/:
- db/SnapshotMetrics, db/CommitDeltaMetrics: doc every field.
- db/HistoryStatus, db/OperationStatus, db/OperationType: doc every
  enum value.
- db/TableOperationsRow, db/TableOperationsHistoryRow,
  db/TableStatsRow, db/TableStatsHistoryRow: doc every field.
…e ModelDbMapper

Replace the model/db boundary mapper with conversion methods on the
model types themselves. Same pattern that opt-0 just applied at the
api↔model boundary — each layer's type carries the to/from methods
for the layer below.

The dependency chain after this commit:

  api  →  model  →  db

api/* → model/* (added on opt-0).
model/* → db/* (this commit).
db/* still imports nothing — bottom of the chain.

model/* changes (each gets a `toRow()` instance method + a static
`fromRow(...)` factory):
- Table ↔ db.TableStatsRow (current-state row; snapshot only,
  delta lives on history rows).
- TableOperation ↔ db.TableOperationsRow.
- TableOperationsHistory ↔ db.TableOperationsHistoryRow.
- TableStatsHistory ↔ db.TableStatsHistoryRow (joins/splits the
  snapshot + delta columns).
- TableStats inner: SnapshotMetrics ↔ db.SnapshotMetrics,
  CommitDelta ↔ db.CommitDeltaMetrics. TableStats itself exposes
  toSnapshotRow() / toDeltaRow() for the split-write side and a
  static fromRows(snapshot, delta) for the join-read side.
- OperationType / OperationStatus / HistoryStatus (model enums) ↔
  db enums.

Delete services/optimizer/.../model/mapper/ModelDbMapper.java.
Enriches model.TableStats with identity (tableUuid, databaseName, tableName)
and metadata (tableProperties, updatedAt), and reroutes the api DTOs'
toModel/fromModel pair to model.TableStats. opt-1's existing toSnapshotRow /
toDeltaRow / fromRows helpers are preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TableStats.toRow() / fromRow() let the service operate purely on the
self-describing model.TableStats type instead of going through Table.
Existing toSnapshotRow / toDeltaRow / fromRows helpers are preserved
for the history path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants