ci: centralize config computation to fix env.* limitation #113
Closed

jackluo923 wants to merge 20 commits into release-0.293-clp-connector from
Conversation
…ilds

Change Docker image tagging from static `:dev` tags to dynamic branch-based tags, and update all workflow conditions to accept any branch with the `release-0.293-clp-connector-snapshot` prefix.

Changes:
- Docker image tags now use the actual branch name via `type=ref,event=branch`
  * Before: ghcr.io/repo/prestissimo-worker:dev
  * After: ghcr.io/repo/prestissimo-worker:release-0.293-clp-connector-snapshot-nov29
- Push conditions changed from exact match to prefix match using `startsWith()`
- Cancel-in-progress logic updated to exclude all snapshot-prefix branches
- PR title checks now accept the `release-0.293-clp-connector-snapshot*` pattern

Example: Pushing to branch "release-0.293-clp-connector-snapshot-nov29" will:
- Build and push images tagged with `:release-0.293-clp-connector-snapshot-nov29`
- Run all CI jobs to completion without cancellation
- Enable PR title validation for PRs targeting this branch

This allows multiple snapshot branches to coexist with unique image tags.
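A minimal sketch of the tagging and push-condition changes described above, as they might appear in a workflow. The step layout and image name are illustrative, not taken verbatim from the PR; `docker/metadata-action` with `type=ref,event=branch` is the standard way to derive a tag from the branch name.

```yaml
# Hypothetical workflow excerpt: dynamic branch-based image tags
- uses: docker/metadata-action@v5
  with:
    images: ghcr.io/${{ github.repository }}/prestissimo-worker
    tags: |
      # tag = branch name, e.g. release-0.293-clp-connector-snapshot-nov29
      type=ref,event=branch

# Prefix match instead of exact branch match
- if: startsWith(github.ref_name, 'release-0.293-clp-connector-snapshot')
  run: echo "push enabled for snapshot branch ${{ github.ref_name }}"
```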
Add alternative provider implementations for metadata and split discovery,
enabling flexible deployment configurations without MySQL dependency.
New providers:
- ClpYamlMetadataProvider: Reads table/column metadata from YAML config files
instead of querying MySQL database. Supports nested schemas and polymorphic
types through hierarchical YAML structure.
- ClpPinotSplitProvider: Queries Apache Pinot database for archive file paths
via HTTP/JSON API instead of MySQL. Determines split types based on file
extensions (.clp.zst for IR, otherwise ARCHIVE).
Configuration:
- Add clp.metadata-provider-type=YAML option for YAML-based metadata
- Add clp.split-provider-type=PINOT option for Pinot-based split discovery
- Add clp.metadata-yaml-path config for YAML metadata file location
The YAML metadata provider parses a two-level structure:
1. Main metadata file: maps connector → schema → table → schema_file_path
2. Individual table schema files: define column names and types
This provides deployment flexibility - users can now choose between:
- Database-driven (MySQL for both metadata and splits)
- Hybrid (YAML metadata + Pinot splits)
- Or any combination based on their infrastructure
Dependencies: Adds jackson-dataformat-yaml and snakeyaml 2.1
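As a sketch of the second level of the structure described above, an individual table schema file referenced by the main metadata file might look like the following. The key names (`columns`, `name`, `type`) are illustrative assumptions; the PR text does not show the file format verbatim.

```yaml
# Hypothetical table schema file, e.g. the file a schema_file_path points at
columns:
  - name: timestamp
    type: timestamp
  - name: level
    type: varchar
  - name: attrs.request_id   # nested field via hierarchical naming
    type: varchar
```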
Ensure all timestamp literals are converted to nanoseconds before generating KQL and SQL filters for pushdown, regardless of their original precision (milliseconds, microseconds, etc.).

Problem: Presto supports multiple timestamp precisions (TIMESTAMP with millisecond precision, TIMESTAMP_MICROSECONDS, etc.), but CLP's query engine expects timestamps in a consistent nanosecond format. Previously, timestamp literals were passed through without precision normalization, causing incorrect results when different timestamp types were used in queries.

Solution: Add timestamp normalization logic in ClpFilterToKqlConverter:
- tryEnsureNanosecondTimestamp(): Detects TIMESTAMP and TIMESTAMP_MICROSECONDS types
- ensureNanosecondTimestamp(): Converts timestamp values to nanoseconds using:
  1. Extract epoch seconds via TimestampType.getEpochSecond()
  2. Extract the nanosecond fraction via TimestampType.getNanos()
  3. Convert to total nanoseconds: SECONDS.toNanos(seconds) + nanosFraction

Applied to:
- BETWEEN operator: Both lower and upper bounds
- Comparison operators: All timestamp literal values (=, !=, <, >, <=, >=)

This ensures a consistent timestamp representation in both KQL filters (sent to the CLP engine) and SQL filters (used for metadata filtering), regardless of the timestamp precision used in the original Presto query.

Example:
- Query: WHERE timestamp BETWEEN TIMESTAMP '2025-01-01' AND TIMESTAMP '2025-01-02'
- Before: Literal values passed as milliseconds or microseconds
- After: Both bounds normalized to nanosecond representation
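The three-step conversion above can be sketched as plain Java. This is a self-contained stand-in, not the connector's actual code: the real implementation reads `epochSeconds` and `nanosFraction` from Presto's `TimestampType` helpers, while here they are passed in directly.

```java
import java.util.concurrent.TimeUnit;

public final class NanosecondNormalization {
    // total nanos = SECONDS.toNanos(epochSeconds) + nanosecond fraction,
    // mirroring the ensureNanosecondTimestamp() conversion described above.
    static long toEpochNanos(long epochSeconds, int nanosFraction) {
        return TimeUnit.SECONDS.toNanos(epochSeconds) + nanosFraction;
    }

    public static void main(String[] args) {
        // 2025-01-01T00:00:00 UTC = 1735689600 epoch seconds
        long nanos = toEpochNanos(1735689600L, 123456789);
        System.out.println(nanos); // 1735689600123456789
    }
}
```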
Implement a plan optimizer that pushes down LIMIT + ORDER BY operations to reduce split scanning when querying on timestamp metadata columns. The optimizer detects TopN patterns in query plans and uses archive metadata (timestamp bounds, message counts) to select only the minimum set of archives needed to guarantee correct results.

Key features:
- Detects and rewrites TopN → [Project] → [Filter] → TableScan patterns
- Pushes TopN specifications through ClpTableLayoutHandle
- Handles overlapping archives by merging them into non-overlapping groups
- Supports both ASC (earliest-first) and DESC (latest-first) ordering
- Uses worst-case analysis to ensure correctness when archive ranges overlap

Implementation:
- ClpTopNSpec: Data structure for TopN specifications (limit + orderings)
- ClpComputePushDown: Enhanced plan optimizer with TopN detection
- ClpMySqlSplitProvider: Archive selection logic using metadata queries
- Comprehensive test coverage for various TopN scenarios

Performance: For queries like "SELECT * FROM logs ORDER BY timestamp DESC LIMIT 100", this eliminates scanning of archives that cannot contain the top results, significantly reducing I/O.
Implement TopN pushdown optimization in ClpPinotSplitProvider to minimize archive scanning when executing LIMIT + ORDER BY queries on timestamp metadata columns. The optimization uses archive metadata (timestamp bounds and message counts) to select only the minimum set of archives needed to guarantee correct TopN results, significantly reducing I/O for time-range queries.

Implementation details:

Archive selection algorithm:
- Fetches archive metadata (file path, creation time, modification time, message count)
- Merges overlapping archives into non-overlapping groups by timestamp ranges
- For DESC ordering: includes the newest group + older groups until the limit is covered
- For ASC ordering: includes the oldest group + newer groups until the limit is covered
- Uses worst-case analysis to ensure correctness when archives overlap

Code structure:
- ClpPinotSplitProvider.listSplits(): Detects TopN specs and routes to the optimized path
- selectTopNArchives(): Implements the archive selection algorithm
- toArchiveGroups(): Merges overlapping archives into logical components
- ArchiveMeta: Represents archive metadata with validation
- ArchiveGroup: Represents merged archive groups for overlap handling

ClpTopNSpec enhancements:
- Made fields private with proper encapsulation
- Added @JsonProperty annotations to getters for correct serialization
- Maintains immutability for thread safety

Code quality improvements:
- Proper exception handling with context-aware error messages
- Input validation in constructors (bounds checking, null checks)
- Extracted a determineSplitType() helper to eliminate duplication
- Made fields final where immutable (pinotDatabaseUrl, ArchiveMeta fields)
- Improved logging with table names and SQL queries for debugging
- Better encapsulation: private fields, getter-only access

Performance impact: For queries like "SELECT * FROM logs ORDER BY timestamp DESC LIMIT 100", this eliminates scanning of archives outside the time range of interest, providing substantial I/O reduction for large archive sets.
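The grouping and selection steps above can be sketched in plain Java. This is a simplified illustration of the idea only — the names (`Archive`, `toGroups`, `selectForDesc`) and data shapes are assumptions, not the PR's actual API, and it shows only the DESC path.

```java
import java.util.*;

final class TopNArchiveSelection {
    // Hypothetical stand-in for ArchiveMeta: a timestamp range plus message count.
    record Archive(String path, long beginTs, long endTs, long messageCount) {}

    // Merge overlapping timestamp ranges into non-overlapping groups (oldest first),
    // analogous to the toArchiveGroups() step described above.
    static List<List<Archive>> toGroups(List<Archive> archives) {
        List<Archive> sorted = new ArrayList<>(archives);
        sorted.sort(Comparator.comparingLong(Archive::beginTs));
        List<List<Archive>> groups = new ArrayList<>();
        long groupEnd = Long.MIN_VALUE;
        for (Archive a : sorted) {
            if (groups.isEmpty() || a.beginTs() > groupEnd) {
                groups.add(new ArrayList<>());   // gap found: start a new group
            }
            groups.get(groups.size() - 1).add(a);
            groupEnd = Math.max(groupEnd, a.endTs());
        }
        return groups;
    }

    // DESC ordering: take newest groups until their message counts cover the limit.
    static List<Archive> selectForDesc(List<Archive> archives, long limit) {
        List<List<Archive>> groups = toGroups(archives);
        List<Archive> selected = new ArrayList<>();
        long covered = 0;
        for (int i = groups.size() - 1; i >= 0 && covered < limit; i--) {
            for (Archive a : groups.get(i)) {
                selected.add(a);
                covered += a.messageCount();
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Archive> archives = List.of(
                new Archive("a1", 0, 10, 50),
                new Archive("a2", 5, 15, 50),      // overlaps a1 -> same group
                new Archive("a3", 100, 110, 200)); // newest, non-overlapping group
        // limit 100 is covered by the newest group alone, so only a3 is selected
        System.out.println(selectForDesc(archives, 100).size());
    }
}
```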
Implements dynamic schema discovery from YAML metadata files, allowing the CLP connector
to support multiple schemas beyond just "default" (e.g., clp.dev.table, clp.prod.table).
Key changes:
- Add default listSchemaNames() method to ClpMetadataProvider interface for DRY principle
- Implement dynamic schema discovery in ClpYamlMetadataProvider with thread-safe caching
- Update ClpMetadata to delegate schema listing to metadata providers
- Add comprehensive error handling with graceful fallback to default schema
- Optimize performance with double-checked locking and ObjectMapper reuse
- Add 15+ unit tests covering all edge cases and error scenarios
- Fix checkstyle violations in UberClpPinotSplitProvider and ClpPinotSplitProvider
The implementation is backward compatible - existing single-schema setups continue to work
without configuration changes. Multi-schema support is activated automatically when the
YAML metadata file contains multiple schema entries.
Example YAML structure:
```yaml
clp:
  default:
    logs: /path/to/default/logs.yaml
  dev:
    test_logs: /path/to/dev/logs.yaml
  prod:
    production_logs: /path/to/prod/logs.yaml
```
Performance optimizations:
- Thread-safe caching prevents repeated YAML parsing
- Double-checked locking pattern for lazy initialization
- Reused ObjectMapper instances reduce object creation overhead
- Synchronized table schema map updates ensure thread safety
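The double-checked locking pattern mentioned above can be sketched as follows. The class and method names are illustrative, not the connector's actual code; the essential details are the `volatile` field and the second null check under the lock, which together make the lazy initialization thread-safe while keeping the fast path lock-free.

```java
import java.util.List;

final class SchemaNameCache {
    private volatile List<String> schemaNames; // volatile is required for DCL

    List<String> getSchemaNames() {
        List<String> names = schemaNames;      // one volatile read on the fast path
        if (names == null) {
            synchronized (this) {
                names = schemaNames;
                if (names == null) {           // second check under the lock
                    names = loadFromYaml();    // expensive parse happens once
                    schemaNames = names;
                }
            }
        }
        return names;
    }

    private List<String> loadFromYaml() {
        // stand-in for parsing the YAML metadata file
        return List.of("default", "dev", "prod");
    }

    public static void main(String[] args) {
        System.out.println(new SchemaNameCache().getSchemaNames());
    }
}
```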
Testing:
- All 11 tests in TestClpYamlMetadataProvider pass
- All 4 tests in TestClpMultiSchema pass
- Fixed testListSchemaNamesNoCatalogField to handle missing catalog gracefully
…restodb#26254)

## Description

Fix flaky tests that use `experimental.spiller-spill-path` as `/tmp/presto/spills/` in the query runner.

## Motivation and Context

prestodb#25890

## Release Notes

```
== NO RELEASE NOTE ==
```
Upgrade joda-time to 2.13.1 to support new timezone data including America/Coyhaique. Update tests to use dates where timezone gaps still exist in the updated tzdata: - testLocallyUnrepresentableDateLiterals: 1970-01-01 -> 1932-04-01 - testLocallyUnrepresentableTimeLiterals: use 2017-04-02 or 2012-04-01 Cherry-picked from upstream: prestodb/presto@0a504d20e6
… runners

Key changes:
- Add comprehensive CI architecture documentation explaining:
  - Terminology (presto, prestocpp, prestissimo)
  - Unified builder image strategy for ephemeral runners
  - Job dependency graph
  - Comparison with upstream (prestodb/presto) CI
  - Performance benefits of pre-warmed ccache and Maven deps
- Consolidate presto-java8/java17 jobs into a single matrix-based `presto` job
  - ARTIFACT_JAVA_VERSION controls which version uploads artifacts/images (default: '8')
- Add prestissimo image building to the prestocpp workflow
  - Downloads artifacts from the build job, packages them into a runtime image
  - Uses the same tagging strategy as the presto image (immutable + SNAPSHOT tags)
- Centralize IMAGE_VERSION_TYPE configuration (set to 'BETA')
  - Applied to both presto and prestissimo images
- Document artifacts (presto-server, presto-cli, presto-native-build) and Docker images (unified-builder, presto, prestissimo)
Disable the CLP integration test due to a timestamp filtering issue in the CLP native reader where the query returns empty results when filtering by timestamp. A separate PR will be raised to fix this unit test.
Restructure header comments to explain key design decisions:
- Caching strategy: why caches are baked into Docker image layers
- Image tagging: dual tags (immutable + SNAPSHOT) and version streams
- Builder image tag: auto-computed dependency hash

Also update the comparison table with split image-tag rows (builder vs runtime).
Fixes CI workflow failure caused by using `${{ env.* }}` in `with:`
blocks when calling reusable workflows (GitHub Actions limitation).
Changes:
- Add inline `config` job that computes builder-image and runtime tags
- Delete compute-builder-tag.yml (merged into ci.yml)
- Update presto-build.yml and prestocpp to use passed runtime tags
- Update create-builder-image.yml to only require builder-image input
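A minimal sketch of the limitation and the fix, with illustrative job and output names: `${{ env.* }}` is not available in the `with:` block of a reusable-workflow call, but outputs of a regular job are.

```yaml
# Broken: the env context cannot be used when calling a reusable workflow
#   with:
#     image-version-type: ${{ env.IMAGE_VERSION_TYPE }}

jobs:
  config:
    runs-on: ubuntu-latest
    outputs:
      runtime-version-tag: ${{ steps.compute.outputs.tag }}
    steps:
      - id: compute
        run: echo "tag=0.293-${IMAGE_VERSION_TYPE}-$(date +%Y%m%d%H%M%S)" >> "$GITHUB_OUTPUT"
        env:
          IMAGE_VERSION_TYPE: BETA

  presto-build:
    needs: config
    uses: ./.github/workflows/presto-build.yml
    with:
      runtime-version-tag: ${{ needs.config.outputs.runtime-version-tag }}  # allowed
```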
Caution: Review failed. The pull request is closed.

Walkthrough

This PR refactors the CI/CD pipeline into modular reusable workflows, introduces a unified YScope Presto builder Docker image, adds YAML-based metadata support to the CLP plugin with Top-N query pushdown, implements Pinot and Pinot Uber split providers, updates dependencies, and standardizes test configurations with UUID-based spill paths.
Sequence Diagram(s):

```mermaid
sequenceDiagram
    actor PR as Pull Request Trigger
    participant Config as config job
    participant BuilderImg as create-builder-image job
    participant PrestoBuild as presto-build job
    participant PrestoTest as tests job
    participant PrestoCppBuild as prestocpp-linux-build-and-unit-test job
    participant Integration as integration-tests job
    participant GHCR as GHCR Registry
    PR->>Config: workflow triggered
    Config->>Config: compute builder-image hash<br/>compute runtime-version-tag<br/>compute runtime-snapshot-tag
    Config-->>BuilderImg: outputs: builder-image, tags
    BuilderImg->>GHCR: check image exists
    alt image not found
        BuilderImg->>GHCR: build & push builder image
    end
    BuilderImg-->>PrestoBuild: builder-image ready
    BuilderImg-->>PrestoCppBuild: builder-image ready
    BuilderImg-->>PrestoTest: builder-image ready
    BuilderImg-->>Integration: builder-image ready
    par Java & C++ Builds
        PrestoBuild->>PrestoBuild: checkout, setup Java<br/>Maven build (multi-version)
        PrestoBuild->>PrestoBuild: conditionally upload artifacts
        PrestoBuild->>GHCR: build Presto Docker image
        PrestoCppBuild->>PrestoCppBuild: checkout, setup builder image<br/>cmake/ninja build<br/>upload prestocpp artifacts
        PrestoCppBuild->>GHCR: build prestissimo runtime image
    end
    par Tests & Integration
        PrestoTest->>PrestoTest: matrix test execution<br/>presto-tests across Java versions
        Integration->>Integration: download Presto artifacts<br/>run integration-e2e-java<br/>run integration-storage-java<br/>run integration-sidecar-java
    end
    PrestoBuild-->>PrestoTest: artifact outputs
    PrestoCppBuild-->>Integration: artifact outputs
```
Summary
Centralizes all configuration computation in a single `config` job inlined in `ci.yml`. This fixes the CI failure and eliminates duplicated version-tag computation across workflows.

Problem
CI workflow fails immediately (0s) with error:
Root cause: GitHub Actions does not allow `${{ env.* }}` in `with:` sections when calling reusable workflows.

Example of problematic code:
Additional issue: Runtime version-tag computation was duplicated in `presto-build.yml` and `prestocpp-linux-build-and-unit-test.yml`.

Solution
- Replace `env.*` (not allowed in `with:` blocks) with `needs.config.outputs.*`
- Merge the `compute-builder-tag.yml` workflow into an inline `config` job

Key Design Decisions
Centralized Config Job
A single `config` job computes all shared configuration upfront:

| Output | Example |
|---|---|
| `builder-image` | ghcr.io/y-scope/presto/unified-builder:a1b2c3d4-... |
| `runtime-version-tag` | 0.293-BETA-20250529140509-484b00e |
| `runtime-snapshot-tag` | 0.293-BETA-SNAPSHOT |

Single Source of Truth
`IMAGE_VERSION_TYPE` is defined once in `ci.yml`'s `env:` block. Changing the version stream (RELEASE/BETA/DEV) requires editing only one line.

Implementation Details
Job Dependency Graph
Before:
- `compute-builder-tag` was a separate reusable workflow
- `presto` and `prestocpp` each computed their own runtime version tags
- `env.IMAGE_VERSION_TYPE` was passed via `with:` blocks (broken)

After:
- `config` is an inline job in `ci.yml`
- No `env.*` usage in `with:` blocks

Files Changed
| File | Change |
|---|---|
| `ci.yml` | add `config` job with 3 outputs, update all job references |
| `compute-builder-tag.yml` | deleted (merged into `ci.yml`) |
| `create-builder-image.yml` | only requires the `builder-tag` input, simplify concurrency group |
| `presto-build.yml` | replace `image-version-type` with `runtime-version-tag` + `runtime-snapshot-tag` |
| `prestocpp-linux-build-and-unit-test.yml` | replace `image-version-type` with `runtime-version-tag` + `runtime-snapshot-tag` |

Test Plan
Contributor Checklist