From faef46822f76574e1a0bf75a6a258d6d7f7f623e Mon Sep 17 00:00:00 2001 From: Mark Wolters Date: Fri, 12 Jun 2026 13:05:09 -0400 Subject: [PATCH 1/5] adding skeleton for release notes documentation --- .github/workflows/pr_checklist.md | 1 + docs/release notes/4.0.0-RC.8/BugFixes.md | 44 +++++++++++++++++ .../4.0.0-RC.8/DocumentationAndTutorials.md | 8 ++++ docs/release notes/4.0.0-RC.8/FusedPQ.md | 36 ++++++++++++++ .../4.0.0-RC.8/ImprovedDatasetHandling.md | 12 +++++ .../ParallelGraphIndexConstruction.md | 48 +++++++++++++++++++ .../4.0.0-RC.8/TestingEnhancements.md | 21 ++++++++ docs/release notes/README.md | 16 +++++++ 8 files changed, 186 insertions(+) create mode 100644 docs/release notes/4.0.0-RC.8/BugFixes.md create mode 100644 docs/release notes/4.0.0-RC.8/DocumentationAndTutorials.md create mode 100644 docs/release notes/4.0.0-RC.8/FusedPQ.md create mode 100644 docs/release notes/4.0.0-RC.8/ImprovedDatasetHandling.md create mode 100644 docs/release notes/4.0.0-RC.8/ParallelGraphIndexConstruction.md create mode 100644 docs/release notes/4.0.0-RC.8/TestingEnhancements.md create mode 100644 docs/release notes/README.md diff --git a/.github/workflows/pr_checklist.md b/.github/workflows/pr_checklist.md index ee9db22d3..e17f6cf0e 100644 --- a/.github/workflows/pr_checklist.md +++ b/.github/workflows/pr_checklist.md @@ -9,5 +9,6 @@ __Before you submit for review:__ - [ ] Did you adhere to the code formatting guidelines (TBD) - [ ] Did you group your changes for easy review, providing meaningful descriptions for each commit? - [ ] Did you ensure that all files contain the correct copyright header? +- [ ] Did you add documentation for this feature to the release notes directory? If you did not complete any of these, then please explain below. 
diff --git a/docs/release notes/4.0.0-RC.8/BugFixes.md b/docs/release notes/4.0.0-RC.8/BugFixes.md new file mode 100644 index 000000000..145aae28b --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/BugFixes.md @@ -0,0 +1,44 @@ +## Bug Fixes and Issue Resolutions + +### Fix: NullPointerException in `OnDiskGraphIndex#ramBytesUsed` + +**Problem** +Now that we lazily load the inMemoryNeighbors and the inMemoryFeatures, we need to handle the case in `OnDiskGraphIndex` where they are null or have values that are null when the `ramBytesUsed()` method is called. + +**Resolution** +Added appropriate null checks and safeguards to ensure `ramBytesUsed()` can be safely invoked in all valid states. + +**Related Issues** +- [#586](https://github.com/datastax/jvector/issues/586) + +--- + +### Fix: Protection Against Invalid Ordinal Mappings + +**Problem** +JVector relies on the calling source code to pass in ordinal maps constructed outside of the JVector library. Improper or inconsistent ordinal mappings can lead to failures when the Graph is built or incorrect indexing or search results. + +**Resolution** +Added safeguards to detect invalid ordinal mappings. + +**Notes** +Full validation of ordinal mapping requires iterating over the entire set of ordinals and can be a costly operation. This safeguard will only be activated if debug logging is enabled or if `System.getProperties().containsKey("VECTOR_DEBUG")` + +**Related Issues** +- [568](https://github.com/datastax/jvector/issues/568) + +--- + + +### Fix: extractTrainingVectors may produce more than MAX_PQ_TRAINING_SET_SIZE vectors + +**Problem** +`extractTrainingVectors` could return more vectors than the intended maximum (`MAX_PQ_TRAINING_SET_SIZE`), leading to excessive memory usage during PQ training. + +**Resolution** +Uses floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases. The first is to select MAX_PQ_TRAINING_SET_SIZE random ordinals. 
Then, it maps those ordinals to vectors. + +**Related Issues** +- [590](https://github.com/datastax/jvector/issues/590) + +--- diff --git a/docs/release notes/4.0.0-RC.8/DocumentationAndTutorials.md b/docs/release notes/4.0.0-RC.8/DocumentationAndTutorials.md new file mode 100644 index 000000000..31ae74735 --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/DocumentationAndTutorials.md @@ -0,0 +1,8 @@ +### Documentation and Tutorials + +**Description** +Detailed javadoc added for all JVector components. Quickstart [tutorials](https://github.com/datastax/jvector/tree/main/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial) added to jvector-examples including documentation and code. + +**Purpose / Impact** +- Provide better documentation for JVector and its components +- Give new users an entry point showing how to use JVector diff --git a/docs/release notes/4.0.0-RC.8/FusedPQ.md b/docs/release notes/4.0.0-RC.8/FusedPQ.md new file mode 100644 index 000000000..bdbb3cf70 --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/FusedPQ.md @@ -0,0 +1,36 @@ +### Fused Product Quantization (Fused PQ) + +**Description** +Fused Product Quantization (Fused PQ) is a performance optimization that embeds compressed Product Quantization (PQ) codes directly into the graph index structure alongside each node's neighbor lists. This eliminates the need for separate lookups to retrieve compressed vectors during graph traversal, significantly improving query performance by reducing memory access overhead. The feature stores PQ-encoded neighbor vectors inline with the graph edges, enabling faster approximate similarity scoring during search operations. 
By embedding the compressed neighbor vectors into the graph index itself, Fused PQ eliminates the need to maintain a separate in-memory data structure for PQ-encoded vectors, reducing heap memory usage while maintaining fast approximate similarity search performance. + +**Purpose / Impact** +- Reduces memory usage for large-scale vector datasets +- Improves cache locality during graph traversal +- Enables higher writer scalability for large / high-dimensional vector workloads + +**How to Enable** +To enable Fused PQ when writing an on-disk graph index: + +1. **Create the FusedPQ feature** by passing your graph's max degree and a ProductQuantization compressor to the constructor: + ```java + var fusedPQFeature = new FusedPQ(graph.maxDegree(), pq); + ``` + +2. **Add it to your OnDiskGraphIndexWriter builder**: + ```java + var writer = new OnDiskGraphIndexWriter.Builder(graph, outputPath) + .with(fusedPQFeature) + .build(); + ``` + +3. **Provide a state supplier during the write phase** that includes the graph view and PQ vectors: + ```java + Map<FeatureId, IntFunction<Feature.State>> writeSuppliers = new EnumMap<>(FeatureId.class); + writeSuppliers.put(FeatureId.FUSED_PQ, ordinal -> new FusedPQ.State(view, pqVectors, ordinal)); + writer.write(writeSuppliers); + ``` + +**Notes** +- Fused PQ requires a 256-cluster ProductQuantization compressor. The feature automatically embeds compressed neighbor vectors inline with the graph structure during the write operation. +- To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices. + diff --git a/docs/release notes/4.0.0-RC.8/ImprovedDatasetHandling.md b/docs/release notes/4.0.0-RC.8/ImprovedDatasetHandling.md new file mode 100644 index 000000000..b59bb5d53 --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/ImprovedDatasetHandling.md @@ -0,0 +1,12 @@ +### Improved DataSet Handling + +**Description** +Revision of how datasets are handled internally within JVector for acquisition and representation of vector data. 
Includes an overhaul of the loading process, better logging and error handling, and virtualization and metadata handling. + +**Purpose / Impact** +- Make datasets easier to find, work with, and define through metadata for internal testing +- Resolves several issues with regression and inter-release comparisons + +**Notes** +- Used internally by JVector Bench classes, no client impact + diff --git a/docs/release notes/4.0.0-RC.8/ParallelGraphIndexConstruction.md b/docs/release notes/4.0.0-RC.8/ParallelGraphIndexConstruction.md new file mode 100644 index 000000000..a5775cf01 --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/ParallelGraphIndexConstruction.md @@ -0,0 +1,48 @@ +### Parallel Graph Index Construction + +**Description** +OnDiskParallelGraphIndexWriter significantly accelerates graph index construction by addressing the disk I/O bottleneck that limits the serial OnDiskGraphIndexWriter. This implementation uses asynchronous file I/O with multiple worker threads to write graph records in parallel, with parallelism automatically determined by available system resources (or configurable via builder options). By parallelizing both record building and disk writes while maintaining correct ordering, this approach dramatically reduces the time required to persist large graph indexes to disk. + +**Purpose / Impact** +- Eliminates the I/O bottleneck in on-disk graph construction +- Maintains backwards compatibility for existing clients of the JVector library + +**How to Enable** +To enable parallel graph index writes, simply use `OnDiskParallelGraphIndexWriter.Builder` instead of `OnDiskGraphIndexWriter.Builder`: + +**Basic usage (uses default parallelism based on available processors):** +```java +try (var writer = new OnDiskParallelGraphIndexWriter.Builder(graph, outputPath) + .with(features...) 
+ .build()) { + writer.write(featureSuppliers); +} +``` + +**Advanced configuration:** +```java +try (var writer = new OnDiskParallelGraphIndexWriter.Builder(graph, outputPath) + .with(features...) + .withParallelWorkerThreads(8) // Optional: specify thread count (0 = auto) + .withParallelDirectBuffers(true) // Optional: use direct ByteBuffers for better performance + .build()) { + writer.write(featureSuppliers); +} +``` + +The parallel writer is a drop-in replacement for the standard writer with the same API, automatically leveraging multiple threads and asynchronous I/O to accelerate the write process. + +**Notes** +- Currently still marked as @experimental +- Includes deprecation of method +```java +public synchronized void writeInline(int ordinal, Map stateMap) +``` +in favor of the more descriptive method +```java +public synchronized void writeFeaturesInline(int ordinal, Map stateMap) +``` + +**Related Issues** +- [579](https://github.com/datastax/jvector/issues/579) + diff --git a/docs/release notes/4.0.0-RC.8/TestingEnhancements.md b/docs/release notes/4.0.0-RC.8/TestingEnhancements.md new file mode 100644 index 000000000..16ca36716 --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/TestingEnhancements.md @@ -0,0 +1,21 @@ +### Testing Enhancements + +**Description** +Enhancements to the JVector testing infrastructure: +- On disk index cache added for Grid benchmark harness +- Logging subsystem overhaul +- New JMH tests +- Test results now include metrics for `nodes visited`, `heap usage`, `disk usage`, `PQ Distance` + +**Purpose / Impact** +- Faster testing cycle +- Better comprehension of test results +- new metrics to compare inter-release + +**Notes** +- Used internally by JVector, no client impact + +**Related Issues** +- [615](https://github.com/datastax/jvector/issues/615) +- [616](https://github.com/datastax/jvector/issues/616) + diff --git a/docs/release notes/README.md b/docs/release notes/README.md new file mode 100644 index 000000000..d0daa691f 
--- /dev/null +++ b/docs/release notes/README.md @@ -0,0 +1,16 @@ +## Release Notes + +This directory collects and aggregates release notes for each feature added to the JVector library on a release by release basis. + +### Guidelines +* Structure + * Each JVector release has its own sub-directory within this directory, named according to the release version as specified in the `pom.xml` file as `revision`, e.g. `4.0.0-RC.9` + * Within the sub-directory for each release each feature is represented by its own independent file containing the release details for that feature. +* Content + * Each feature file should contain a concise but informative description of the feature, including any relevant details such as the motivation for the feature, how it works, and any important implications or considerations for users, including any known risks. + * If applicable, the feature file should also include links to relevant documentation, code examples, or other resources that can help users understand and utilize the feature effectively. + * The documentation for every feature *must* at a minimum include details on how that feature is enabled and configured, or, if no explicit enablement and / or configuration is necessary, this must be stated. + * Documentation for each feature should also include reference to any existing issues that are related to the feature. +* Usage + * When a PR for a new feature is created, the author of the pull request should create a new file for that feature in the appropriate release sub-directory and populate it with the relevant release notes content as described above. + * At the time when the release is cut all of the release notes for that release will be aggregated into a single release notes document that will become a release artifact publicly available in github under the releases section of the JVector repository. 
\ No newline at end of file From c578d13f32b3341c7f3114371cfb698a3a6ff6e6 Mon Sep 17 00:00:00 2001 From: Mark Wolters Date: Fri, 12 Jun 2026 13:17:50 -0400 Subject: [PATCH 2/5] add gha for aggregration --- .github/workflows/generate-release-notes.yml | 98 ++++++++++ docs/release notes/4.0.0-RC.8/4.0.0-RC.8.md | 190 +++++++++++++++++++ 2 files changed, 288 insertions(+) create mode 100644 .github/workflows/generate-release-notes.yml create mode 100644 docs/release notes/4.0.0-RC.8/4.0.0-RC.8.md diff --git a/.github/workflows/generate-release-notes.yml b/.github/workflows/generate-release-notes.yml new file mode 100644 index 000000000..9972d07f1 --- /dev/null +++ b/.github/workflows/generate-release-notes.yml @@ -0,0 +1,98 @@ +name: Generate Release Notes + +on: + workflow_dispatch: + inputs: + version: + description: 'Release version to generate notes for (e.g. 4.0.0-RC.9)' + required: true + type: string + previous_version: + description: 'Previous release version for the version-range header (e.g. 4.0.0-RC.8) — leave blank to omit' + required: false + type: string + default: '' + +jobs: + generate-release-notes: + runs-on: ubuntu-latest + permissions: + contents: write + + steps: + - uses: actions/checkout@v4 + + - name: Validate inputs and directory + run: | + NOTES_DIR="docs/release notes/${{ inputs.version }}" + + if [ ! -d "${NOTES_DIR}" ]; then + echo "::error::Release notes directory '${NOTES_DIR}' does not exist." + exit 1 + fi + + # Count source files (exclude the output file if it already exists) + COUNT=$(find "${NOTES_DIR}" -maxdepth 1 -name "*.md" \ + -not -name "${{ inputs.version }}.md" | wc -l) + if [ "${COUNT}" -eq 0 ]; then + echo "::error::No .md source files found in '${NOTES_DIR}'." 
+ exit 1 + fi + + echo "Found ${COUNT} source file(s) in '${NOTES_DIR}'" + + - name: Assemble release notes + run: | + VERSION="${{ inputs.version }}" + PREV="${{ inputs.previous_version }}" + NOTES_DIR="docs/release notes/${VERSION}" + OUTPUT="${NOTES_DIR}/${VERSION}.md" + + # Build the version-range header line + if [ -n "${PREV}" ]; then + VERSION_LINE="## Version: ${PREV} - ${VERSION}" + else + VERSION_LINE="## Version: ${VERSION}" + fi + + # Write the preamble + printf '# JVector Release Notes\n\n' > "${OUTPUT}" + printf '%s\n\n' "${VERSION_LINE}" >> "${OUTPUT}" + printf -- '---\n\n' >> "${OUTPUT}" + printf '## New Features and Enhancements\n\n' >> "${OUTPUT}" + + # Append feature files in alphabetical order. + # BugFixes.md is excluded here and appended last because it carries + # its own ## section header. + while IFS= read -r -d '' f; do + cat "$f" >> "${OUTPUT}" + # Add a trailing separator if the file does not already end with one + if [ "$(tail -c 5 "$f" | tr -d '[:space:]')" != "---" ]; then + printf '\n\n---\n\n' >> "${OUTPUT}" + fi + done < <(find "${NOTES_DIR}" -maxdepth 1 -name "*.md" \ + -not -name "BugFixes.md" \ + -not -name "${VERSION}.md" \ + -print0 | sort -z) + + # Append BugFixes.md last — it owns its ## section header + BUGFIXES="${NOTES_DIR}/BugFixes.md" + if [ -f "${BUGFIXES}" ]; then + printf '\n' >> "${OUTPUT}" + cat "${BUGFIXES}" >> "${OUTPUT}" + fi + + echo "Generated ${OUTPUT}:" + cat "${OUTPUT}" + + - name: Commit and push + run: | + git config user.name "github-actions[bot]" + git config user.email "github-actions[bot]@users.noreply.github.com" + git add "docs/release notes/${{ inputs.version }}/${{ inputs.version }}.md" + if git diff --cached --quiet; then + echo "Output file is unchanged — nothing to commit." 
+ else + git commit -m "Generate release notes for ${{ inputs.version }}" + git push + fi diff --git a/docs/release notes/4.0.0-RC.8/4.0.0-RC.8.md b/docs/release notes/4.0.0-RC.8/4.0.0-RC.8.md new file mode 100644 index 000000000..d49159234 --- /dev/null +++ b/docs/release notes/4.0.0-RC.8/4.0.0-RC.8.md @@ -0,0 +1,190 @@ +# JVector Release Notes + +## Version: 4.0.0-rc.7 - 4.0.0-rc.8 + +--- + +## New Features and Enhancements + +### Fused Product Quantization (Fused PQ) + +**Description** +Fused Product Quantization (Fused PQ) is a performance optimization that embeds compressed Product Quantization (PQ) codes directly into the graph index structure alongside each node's neighbor lists. This eliminates the need for separate lookups to retrieve compressed vectors during graph traversal, significantly improving query performance by reducing memory access overhead. The feature stores PQ-encoded neighbor vectors inline with the graph edges, enabling faster approximate similarity scoring during search operations. By embedding the compressed neighbor vectors into the graph index itself, Fused PQ eliminates the need to maintain a separate in-memory data structure for PQ-encoded vectors, reducing heap memory usage while maintaining fast approximate similarity search performance.​ + +**Purpose / Impact** +- Reduces memory usage for large-scale vector datasets +- Improves cache locality during graph traversal +- Enables higher writer scalability for large / high-dimensional vector workloads + +**How to Enable** +To enable Fused PQ when writing an on-disk graph index: + +1. **Create the FusedPQ feature** by passing your graph's max degree and a ProductQuantization compressor to the constructor: + ```java + var fusedPQFeature = new FusedPQ(graph.maxDegree(), pq); + ``` + +2. **Add it to your OnDiskGraphIndexWriter builder**: + ```java + var writer = new OnDiskGraphIndexWriter.Builder(graph, outputPath) + .with(fusedPQFeature) + .build(); + ``` + +3. 
**Provide a state supplier during the write phase** that includes the graph view and PQ vectors: + ```java + Map<FeatureId, IntFunction<Feature.State>> writeSuppliers = new EnumMap<>(FeatureId.class); + writeSuppliers.put(FeatureId.FUSED_PQ, ordinal -> new FusedPQ.State(view, pqVectors, ordinal)); + writer.write(writeSuppliers); + ``` + +**Notes** +- Fused PQ requires a 256-cluster ProductQuantization compressor. The feature automatically embeds compressed neighbor vectors inline with the graph structure during the write operation. +- To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices. + +--- + +### Parallel Graph Index Construction + +**Description** +OnDiskParallelGraphIndexWriter significantly accelerates graph index construction by addressing the disk I/O bottleneck that limits the serial OnDiskGraphIndexWriter. This implementation uses asynchronous file I/O with multiple worker threads to write graph records in parallel, with parallelism automatically determined by available system resources (or configurable via builder options). By parallelizing both record building and disk writes while maintaining correct ordering, this approach dramatically reduces the time required to persist large graph indexes to disk. + +**Purpose / Impact** +- Eliminates the I/O bottleneck in on-disk graph construction +- Maintains backwards compatibility for existing clients of the JVector library + +**How to Enable** +To enable parallel graph index writes, simply use `OnDiskParallelGraphIndexWriter.Builder` instead of `OnDiskGraphIndexWriter.Builder`: + +**Basic usage (uses default parallelism based on available processors):** +```java +try (var writer = new OnDiskParallelGraphIndexWriter.Builder(graph, outputPath) + .with(features...) + .build()) { + writer.write(featureSuppliers); +} +``` + +**Advanced configuration:** +```java +try (var writer = new OnDiskParallelGraphIndexWriter.Builder(graph, outputPath) + .with(features...) 
+ .withParallelWorkerThreads(8) // Optional: specify thread count (0 = auto) + .withParallelDirectBuffers(true) // Optional: use direct ByteBuffers for better performance + .build()) { + writer.write(featureSuppliers); +} +``` + +The parallel writer is a drop-in replacement for the standard writer with the same API, automatically leveraging multiple threads and asynchronous I/O to accelerate the write process. + +**Notes** +- Currently still marked as @experimental +- Includes deprecation of method +```java +public synchronized void writeInline(int ordinal, Map stateMap) +``` +in favor of the more descriptive method +```java +public synchronized void writeFeaturesInline(int ordinal, Map stateMap) +``` + +**Related Issues** +- [579](https://github.com/datastax/jvector/issues/579) + +--- + +### Documentation and Tutorials + +**Description** +Detailed javadoc added for all JVector components. Quickstart [tutorials](https://github.com/datastax/jvector/tree/main/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial) added to jvector-examples including documentation and code. + +**Purpose / Impact** +- Provide better documentation for JVector and its components +- Give new users an entry point showing how to use JVector + +--- + +### Improved DataSet Handling + +**Description** +Revision of how datasets are handled internally within JVector for acquisition and representation of vector data. Includes overhaul of the loading process, better logging and error handling, and virtualization and metadata handling. 
+ +**Purpose / Impact** +- Make datasets easier to find, work with, and define through metadata for internal testing +- Resolves several issues with regression and inter-release comparisons + +**Notes** +- Used internally by JVector Bench classes, no client impact + +--- + +### Testing Enhancements + +**Description** +Enhancements to the JVector testing infrastructure: +- On disk index cache added for Grid benchmark harness +- Logging subsystem overhaul +- New JMH tests +- Test results now include metrics for `nodes visited`, `heap usage`, `disk usage`, `PQ Distance` + +**Purpose / Impact** +- Faster testing cycle +- Better comprehension of test results +- new metrics to compare inter-release + +**Notes** +- Used internally by JVector, no client impact + +**Related Issues** +- [615](https://github.com/datastax/jvector/issues/615) +- [616](https://github.com/datastax/jvector/issues/616) + +--- + +## Bug Fixes and Issue Resolutions + +### Fix: NullPointerException in `OnDiskGraphIndex#ramBytesUsed` + +**Problem** +Now that we lazily load the inMemoryNeighbors and the inMemoryFeatures, we need to handle the case in `OnDiskGraphIndex` where they are null or have values that are null when the `ramBytesUsed()` method is called. + +**Resolution** +Added appropriate null checks and safeguards to ensure `ramBytesUsed()` can be safely invoked in all valid states. + +**Related Issues** +- [#586](https://github.com/datastax/jvector/issues/586) + +--- + +### Fix: Protection Against Invalid Ordinal Mappings + +**Problem** +JVector relies on the calling source code to pass in ordinal maps constructed outside of the JVector library. Improper or inconsistent ordinal mappings can lead to failures when the Graph is built or incorrect indexing or search results. + +**Resolution** +Added safeguards to detect invalid ordinal mappings. + +**Notes** +Full validation of ordinal mapping requires iterating over the entire set of ordinals and can be a costly operation. 
This safeguard will only be activated if debug logging is enabled or if `System.getProperties().containsKey("VECTOR_DEBUG")` + +**Related Issues** +- [568](https://github.com/datastax/jvector/issues/568) + +--- + + +### Fix: extractTrainingVectors may produce more than MAX_PQ_TRAINING_SET_SIZE vectors + +**Problem** +`extractTrainingVectors` could return more vectors than the intended maximum (`MAX_PQ_TRAINING_SET_SIZE`), leading to excessive memory usage during PQ training. + +**Resolution** +Uses floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases. The first is to select MAX_PQ_TRAINING_SET_SIZE random ordinals. Then, it maps those ordinals to vectors. + +**Related Issues** +- [590](https://github.com/datastax/jvector/issues/590) + +--- + + From 27c749ee18eb6a4ec0420cc8a139f1b6fd384c85 Mon Sep 17 00:00:00 2001 From: Mark Wolters Date: Fri, 12 Jun 2026 14:49:16 -0400 Subject: [PATCH 3/5] exclude workflow from rat checks --- rat-excludes.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/rat-excludes.txt b/rat-excludes.txt index 436c97822..5beb78c18 100644 --- a/rat-excludes.txt +++ b/rat-excludes.txt @@ -4,6 +4,7 @@ CONTRIBUTIONS.md .github/workflows/pr_checklist.md .github/workflows/unit-tests.yaml .github/workflows/generate-changelog.yaml +.github/workflows/generate-release-notes.yml package.json .github/workflows/tag-release.yml .github/workflows/run-bench.yml From 46b4f9fdeee1703dde7996fef7c894684bd42b82 Mon Sep 17 00:00:00 2001 From: Mark Wolters Date: Wed, 17 Jun 2026 10:08:01 -0400 Subject: [PATCH 4/5] arrange by PR with tags for sections --- .github/workflows/generate-release-notes.yml | 129 +++++++++++++++---- docs/release notes/README.md | 75 +++++++++-- 2 files changed, 170 insertions(+), 34 deletions(-) diff --git a/.github/workflows/generate-release-notes.yml b/.github/workflows/generate-release-notes.yml index 9972d07f1..95afe7701 100644 --- 
a/.github/workflows/generate-release-notes.yml +++ b/.github/workflows/generate-release-notes.yml @@ -31,7 +31,6 @@ jobs: exit 1 fi - # Count source files (exclude the output file if it already exists) COUNT=$(find "${NOTES_DIR}" -maxdepth 1 -name "*.md" \ -not -name "${{ inputs.version }}.md" | wc -l) if [ "${COUNT}" -eq 0 ]; then @@ -47,40 +46,126 @@ jobs: PREV="${{ inputs.previous_version }}" NOTES_DIR="docs/release notes/${VERSION}" OUTPUT="${NOTES_DIR}/${VERSION}.md" + GITHUB_REPO="datastax/jvector" - # Build the version-range header line + # ----------------------------------------------------------------------- + # Section definitions — controls grouping and output order. + # Files named <PR-number>.<tag>.md are automatically grouped under the matching + # header. Add a new tag here to introduce a new section; reorder the + # TAG_ORDER array to change the order in which sections appear. + # Note: multiple tags may share a header (e.g. bugfix and fix both map to + # "Bug Fixes") — the first tag in TAG_ORDER that matches wins the header + # slot, and all aliases are collected into that same section. 
+ # ----------------------------------------------------------------------- + TAG_ORDER=("feature" "enhancement" "performance" "bugfix" "fix" "docs" "testing") + declare -A TAG_HEADERS + TAG_HEADERS["feature"]="## New Features" + TAG_HEADERS["enhancement"]="## Enhancements" + TAG_HEADERS["performance"]="## Performance Improvements" + TAG_HEADERS["bugfix"]="## Bug Fixes and Issue Resolutions" + TAG_HEADERS["fix"]="## Bug Fixes and Issue Resolutions" + TAG_HEADERS["docs"]="## Documentation and Tutorials" + TAG_HEADERS["testing"]="## Testing Enhancements" + + # ----------------------------------------------------------------------- + # Preamble + # ----------------------------------------------------------------------- if [ -n "${PREV}" ]; then VERSION_LINE="## Version: ${PREV} - ${VERSION}" else VERSION_LINE="## Version: ${VERSION}" fi + printf '# JVector Release Notes\n\n%s\n\n---\n\n' "${VERSION_LINE}" > "${OUTPUT}" - # Write the preamble - printf '# JVector Release Notes\n\n' > "${OUTPUT}" - printf '%s\n\n' "${VERSION_LINE}" >> "${OUTPUT}" - printf -- '---\n\n' >> "${OUTPUT}" - printf '## New Features and Enhancements\n\n' >> "${OUTPUT}" + # ----------------------------------------------------------------------- + # Bucket source files into new-style (<PR-number>.<tag>.md) or legacy bins. + # Legacy files (e.g. BugFixes.md, old free-form names) are appended + # verbatim at the end; they are expected to carry their own ## headers. + # ----------------------------------------------------------------------- + declare -A tag_to_files # tag -> newline-separated list of absolute paths + legacy_files=() - # Append feature files in alphabetical order. - # BugFixes.md is excluded here and appended last because it carries - # its own ## section header. 
while IFS= read -r -d '' f; do - cat "$f" >> "${OUTPUT}" - # Add a trailing separator if the file does not already end with one - if [ "$(tail -c 5 "$f" | tr -d '[:space:]')" != "---" ]; then - printf '\n\n---\n\n' >> "${OUTPUT}" + fname=$(basename "$f") + [[ "$fname" == "${VERSION}.md" ]] && continue + + if [[ "$fname" =~ ^([0-9]+)\.([a-z]+)\.md$ ]]; then + tag="${BASH_REMATCH[2]}" + tag_to_files["$tag"]+="${f}"$'\n' + else + legacy_files+=("$f") fi done < <(find "${NOTES_DIR}" -maxdepth 1 -name "*.md" \ - -not -name "BugFixes.md" \ - -not -name "${VERSION}.md" \ - -print0 | sort -z) + -not -name "${VERSION}.md" -print0 | sort -z) + + # ----------------------------------------------------------------------- + # Write new-style sections in TAG_ORDER sequence. + # Each unique section header is written exactly once even when multiple + # tag aliases (e.g. bugfix + fix) share the same header. + # Entries within a section are sorted by PR number ascending. + # ----------------------------------------------------------------------- + declare -A written_headers + + for tag in "${TAG_ORDER[@]}"; do + [[ -z "${tag_to_files[$tag]}" ]] && continue + + header="${TAG_HEADERS[$tag]}" + + # Write each shared header once; later aliases (e.g. fix after bugfix) are + # skipped because their files were already emitted with the first alias. + [[ -n "${written_headers[$header]+x}" ]] && continue + written_headers["$header"]=1 + printf '%s\n\n' "${header}" >> "${OUTPUT}" + + # Collect all files whose tag aliases share this header + all_files_for_section=() + for alias_tag in "${TAG_ORDER[@]}"; do + [[ "${TAG_HEADERS[$alias_tag]}" != "$header" ]] && continue + [[ -z "${tag_to_files[$alias_tag]}" ]] && continue + while IFS= read -r file; do + [[ -n "$file" ]] && all_files_for_section+=("$file") + done <<< "${tag_to_files[$alias_tag]}" + done + + # Sort entries by PR number (numeric prefix of the filename) + IFS=$'\n' sorted_section=($( + for f in 
"${all_files_for_section[@]}"; do + fname=$(basename "$f") + [[ "$fname" =~ ^([0-9]+)\. ]] && echo "${BASH_REMATCH[1]} $f" + done | sort -n | cut -d' ' -f2- + )) + unset IFS + + for f in "${sorted_section[@]}"; do + fname=$(basename "$f") + [[ "$fname" =~ ^([0-9]+)\. ]] && pr_num="${BASH_REMATCH[1]}" || pr_num="" + + # Copy the file line-by-line, inserting the PR link immediately after + # the opening ### heading so it appears directly under the entry title. + inserted_pr=false + while IFS= read -r line; do + printf '%s\n' "$line" >> "${OUTPUT}" + if ! $inserted_pr && [ -n "$pr_num" ] && [[ "$line" =~ ^### ]]; then + printf '\n**PR:** [#%s](https://github.com/%s/pull/%s)\n' \ + "$pr_num" "$GITHUB_REPO" "$pr_num" >> "${OUTPUT}" + inserted_pr=true + fi + done < "$f" + + # Ensure a trailing separator between entries + if [ "$(tail -c 5 "$f" | tr -d '[:space:]')" != "---" ]; then + printf '\n\n---\n\n' >> "${OUTPUT}" + fi + done + done + + # ----------------------------------------------------------------------- + # Append legacy files verbatim (carry their own ## section headers). + # ----------------------------------------------------------------------- + for f in "${legacy_files[@]}"; do printf '\n' >> "${OUTPUT}" - cat "${BUGFIXES}" >> "${OUTPUT}" - fi + cat "$f" >> "${OUTPUT}" + done echo "Generated ${OUTPUT}:" cat "${OUTPUT}" diff --git a/docs/release notes/README.md b/docs/release notes/README.md index d0daa691f..dc81f90de 100644 --- a/docs/release notes/README.md +++ b/docs/release notes/README.md @@ -2,15 +2,66 @@ This directory collects and aggregates release notes for each feature added to the JVector library on a release by release basis. -### Guidelines -* Structure - * Each JVector release has its own sub-directory within this directory, named according to the release version as specified in the `pom.xml` file as `revision`, e.g. 
`4.0.0-RC.9` - * Within the sub-directory for each release each feature is represented by its own independent file containing the release details for that feature. -* Content - * Each feature file should contain a concise but informative description of the feature, including any relevant details such as the motivation for the feature, how it works, and any important implications or considerations for users, including any known risks. - * If applicable, the feature file should also include links to relevant documentation, code examples, or other resources that can help users understand and utilize the feature effectively. - * The documentation for every feature *must* at a minimum include details on how that feature is enabled and configured, or, if no explicit enablement and / or configuration is necessary, this must be stated. - * Documentation for each feature should also include reference to any existing issues that are related to the feature. -* Usage - * When a PR for a new feature is created, the author of the pull request should create a new file for that feature in the appropriate release sub-directory and populate it with the relevant release notes content as described above. - * At the time when the release is cut all of the release notes for that release will be aggregated into a single release notes document that will become a release artifact publicly available in github under the releases section of the JVector repository. \ No newline at end of file +### Guidelines + +#### Structure + +* Each JVector release has its own sub-directory named after its version as specified in `pom.xml` (`revision`), e.g. `4.0.0-RC.9`. +* Within that sub-directory, each PR that warrants a release note has its own file. 
+
+#### File naming
+
+Files must follow the convention **`<PR-number>.<tag>.md`**, for example:
+
+| Filename | PR | Section in release notes |
+|---|---|---|
+| `668.performance.md` | [#668](https://github.com/datastax/jvector/pull/668) | Performance Improvements |
+| `659.feature.md` | [#659](https://github.com/datastax/jvector/pull/659) | New Features |
+| `672.bugfix.md` | [#672](https://github.com/datastax/jvector/pull/672) | Bug Fixes and Issue Resolutions |
+
+**Valid tags** (the tag controls which section the entry appears in):
+
+| Tag | Section header |
+|---|---|
+| `feature` | New Features |
+| `enhancement` | Enhancements |
+| `performance` | Performance Improvements |
+| `bugfix` or `fix` | Bug Fixes and Issue Resolutions |
+| `docs` | Documentation and Tutorials |
+| `testing` | Testing Enhancements |
+
+The workflow groups entries by tag, orders sections as listed above, and inserts a PR link automatically under each entry's heading, so you do not need to add the link manually.
+
+#### File content
+
+Each file should contain **only the `###`-level entry content** for that PR, with no `##` section header (that is generated from the tag). Example:
+
+```markdown
+### My Feature Title
+
+**Description**
+A concise but informative description of the feature: motivation, how it works,
+and any implications or risks for users.
+
+**How to Enable**
+...
+
+**Notes**
+...
+```
+
+Content requirements:
+* Include the motivation for the change and how it works.
+* State how the feature is enabled/configured, or explicitly state that no configuration is required.
+* Reference any related issues.
+* Include code examples or links to documentation where relevant.
+
+#### Usage
+
+When a PR for a new feature is opened, the author creates a file in the appropriate release sub-directory (e.g. `docs/release notes/4.0.0-RC.9/668.performance.md`) and populates it with the entry content.
+
+At release time, the **Generate Release Notes** GitHub Actions workflow is triggered manually. It assembles all entries into a single `<version>.md` file (grouping by tag, sorting by PR number within each group, and injecting PR links), which becomes a publicly available release artifact on GitHub.
+
+#### Backward compatibility
+
+Files that do not match the `<PR-number>.<tag>.md` pattern (e.g. the legacy `BugFixes.md` from earlier releases) are appended verbatim at the end of the generated notes. They are expected to carry their own `##` section header.

From 4f810fef8643aec43926dc67cc85d6ea7224c0fe Mon Sep 17 00:00:00 2001
From: Mark Wolters
Date: Thu, 18 Jun 2026 12:18:59 -0400
Subject: [PATCH 5/5] renaming example for prs and not feature

---
 docs/release notes/4.0.0-RC.8/BugFixes.md     | 44 -------------------
 .../{FusedPQ.md => pr561.feature.md}          |  0
 docs/release notes/4.0.0-RC.8/pr588.bugfix.md | 12 +++++
 docs/release notes/4.0.0-RC.8/pr602.bugfix.md | 15 +++++++
 ...exConstruction.md => pr608.enhancement.md} |  0
 docs/release notes/4.0.0-RC.8/pr610.bugfix.md | 12 +++++
 ...estingEnhancements.md => pr612.testing.md} |  3 +-
 ...vedDatasetHandling.md => pr613.testing.md} |  0
 ...mentationAndTutorials.md => pr617.docs.md} |  0
 9 files changed, 40 insertions(+), 46 deletions(-)
 delete mode 100644 docs/release notes/4.0.0-RC.8/BugFixes.md
 rename docs/release notes/4.0.0-RC.8/{FusedPQ.md => pr561.feature.md} (100%)
 create mode 100644 docs/release notes/4.0.0-RC.8/pr588.bugfix.md
 create mode 100644 docs/release notes/4.0.0-RC.8/pr602.bugfix.md
 rename docs/release notes/4.0.0-RC.8/{ParallelGraphIndexConstruction.md => pr608.enhancement.md} (100%)
 create mode 100644 docs/release notes/4.0.0-RC.8/pr610.bugfix.md
 rename docs/release notes/4.0.0-RC.8/{TestingEnhancements.md => pr612.testing.md} (90%)
 rename docs/release notes/4.0.0-RC.8/{ImprovedDatasetHandling.md => pr613.testing.md} (100%)
 rename docs/release notes/4.0.0-RC.8/{DocumentationAndTutorials.md => pr617.docs.md} (100%)

diff --git a/docs/release notes/4.0.0-RC.8/BugFixes.md b/docs/release notes/4.0.0-RC.8/BugFixes.md
deleted file mode 100644
index 145aae28b..000000000
--- a/docs/release notes/4.0.0-RC.8/BugFixes.md
+++ /dev/null
@@ -1,44 +0,0 @@
-## Bug Fixes and Issue Resolutions
-
-### Fix: NullPointerException in `OnDiskGraphIndex#ramBytesUsed`
-
-**Problem**
-Now that we lazily load the inMemoryNeighbors and the inMemoryFeatures, we need to handle the case in `OnDiskGraphIndex` where they are null or have values that are null when the `ramBytesUsed()` method is called.
-
-**Resolution**
-Added appropriate null checks and safeguards to ensure `ramBytesUsed()` can be safely invoked in all valid states.
-
-**Related Issues**
-- [#586](https://github.com/datastax/jvector/issues/586)
-
----
-
-### Fix: Protection Against Invalid Ordinal Mappings
-
-**Problem**
-JVector relies on the calling source code to pass in ordinal maps constructed outside of the JVector library. Improper or inconsistent ordinal mappings can lead to failures when the Graph is built or incorrect indexing or search results.
-
-**Resolution**
-Added safeguards to detect invalid ordinal mappings.
-
-**Notes**
-Full validation of ordinal mapping requires iterating over the entire set of ordinals and can be a costly operation. This safeguard will only be activated if debug logging is enabled or if `System.getProperties().containsKey("VECTOR_DEBUG")`
-
-**Related Issues**
-- [568](https://github.com/datastax/jvector/issues/568)
-
----
-
-
-### Fix: extractTrainingVectors may produce more than MAX_PQ_TRAINING_SET_SIZE vectors
-
-**Problem**
-`extractTrainingVectors` could return more vectors than the intended maximum (`MAX_PQ_TRAINING_SET_SIZE`), leading to excessive memory usage during PQ training.
-
-**Resolution**
-Uses floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases. The first is to select MAX_PQ_TRAINING_SET_SIZE random ordinals. Then, it maps those ordinals to vectors.
-
-**Related Issues**
-- [590](https://github.com/datastax/jvector/issues/590)
-
----
diff --git a/docs/release notes/4.0.0-RC.8/FusedPQ.md b/docs/release notes/4.0.0-RC.8/pr561.feature.md
similarity index 100%
rename from docs/release notes/4.0.0-RC.8/FusedPQ.md
rename to docs/release notes/4.0.0-RC.8/pr561.feature.md
diff --git a/docs/release notes/4.0.0-RC.8/pr588.bugfix.md b/docs/release notes/4.0.0-RC.8/pr588.bugfix.md
new file mode 100644
index 000000000..c53e2da62
--- /dev/null
+++ b/docs/release notes/4.0.0-RC.8/pr588.bugfix.md
@@ -0,0 +1,12 @@
+### Fix: NullPointerException in `OnDiskGraphIndex#ramBytesUsed`
+
+**Problem**
+Now that we lazily load the inMemoryNeighbors and the inMemoryFeatures, we need to handle the case in `OnDiskGraphIndex` where they are null or have values that are null when the `ramBytesUsed()` method is called.
+
+**Resolution**
+Added appropriate null checks and safeguards to ensure `ramBytesUsed()` can be safely invoked in all valid states.
+
+**Related Issues**
+- [#586](https://github.com/datastax/jvector/issues/586)
+
+---
diff --git a/docs/release notes/4.0.0-RC.8/pr602.bugfix.md b/docs/release notes/4.0.0-RC.8/pr602.bugfix.md
new file mode 100644
index 000000000..28400a855
--- /dev/null
+++ b/docs/release notes/4.0.0-RC.8/pr602.bugfix.md
@@ -0,0 +1,15 @@
+### Fix: Protection Against Invalid Ordinal Mappings
+
+**Problem**
+JVector relies on the calling source code to pass in ordinal maps constructed outside of the JVector library. Improper or inconsistent ordinal mappings can lead to failures when the graph is built, or to incorrect indexing or search results.
+
+**Resolution**
+Added safeguards to detect invalid ordinal mappings.
+
+**Notes**
+Full validation of ordinal mapping requires iterating over the entire set of ordinals and can be a costly operation. This safeguard will only be activated if debug logging is enabled or if `System.getProperties().containsKey("VECTOR_DEBUG")` returns true.
+
+**Related Issues**
+- [568](https://github.com/datastax/jvector/issues/568)
+
+---
diff --git a/docs/release notes/4.0.0-RC.8/ParallelGraphIndexConstruction.md b/docs/release notes/4.0.0-RC.8/pr608.enhancement.md
similarity index 100%
rename from docs/release notes/4.0.0-RC.8/ParallelGraphIndexConstruction.md
rename to docs/release notes/4.0.0-RC.8/pr608.enhancement.md
diff --git a/docs/release notes/4.0.0-RC.8/pr610.bugfix.md b/docs/release notes/4.0.0-RC.8/pr610.bugfix.md
new file mode 100644
index 000000000..5096451c0
--- /dev/null
+++ b/docs/release notes/4.0.0-RC.8/pr610.bugfix.md
@@ -0,0 +1,12 @@
+### Fix: extractTrainingVectors may produce more than MAX_PQ_TRAINING_SET_SIZE vectors
+
+**Problem**
+`extractTrainingVectors` could return more vectors than the intended maximum (`MAX_PQ_TRAINING_SET_SIZE`), leading to excessive memory usage during PQ training.
+
+**Resolution**
+Uses Floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases. The first is to select MAX_PQ_TRAINING_SET_SIZE random ordinals. Then, it maps those ordinals to vectors.
+
+**Related Issues**
+- [590](https://github.com/datastax/jvector/issues/590)
+
+---
diff --git a/docs/release notes/4.0.0-RC.8/TestingEnhancements.md b/docs/release notes/4.0.0-RC.8/pr612.testing.md
similarity index 90%
rename from docs/release notes/4.0.0-RC.8/TestingEnhancements.md
rename to docs/release notes/4.0.0-RC.8/pr612.testing.md
index 16ca36716..8ad2dcbf4 100644
--- a/docs/release notes/4.0.0-RC.8/TestingEnhancements.md
+++ b/docs/release notes/4.0.0-RC.8/pr612.testing.md
@@ -17,5 +17,4 @@ Enhancements to the JVector testing infrastructure:
 
 **Related Issues**
 - [615](https://github.com/datastax/jvector/issues/615)
-- [616](https://github.com/datastax/jvector/issues/616)
-
+- [616](https://github.com/datastax/jvector/issues/616)
\ No newline at end of file
diff --git a/docs/release notes/4.0.0-RC.8/ImprovedDatasetHandling.md b/docs/release notes/4.0.0-RC.8/pr613.testing.md
similarity index 100%
rename from docs/release notes/4.0.0-RC.8/ImprovedDatasetHandling.md
rename to docs/release notes/4.0.0-RC.8/pr613.testing.md
diff --git a/docs/release notes/4.0.0-RC.8/DocumentationAndTutorials.md b/docs/release notes/4.0.0-RC.8/pr617.docs.md
similarity index 100%
rename from docs/release notes/4.0.0-RC.8/DocumentationAndTutorials.md
rename to docs/release notes/4.0.0-RC.8/pr617.docs.md
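
Review note: a minimal sketch of how the workflow's filename pattern sorts release-note files into tagged entries and legacy files. It assumes the regular expression `^([0-9]+)\.([a-z]+)\.md$` from the workflow change earlier in this series is still the one in effect; `classify_note` is a hypothetical helper written for illustration and is not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): classify a release-note filename
# using the same pattern the workflow applies to each file's basename.
# Prints "<pr-number> <tag>" for new-style names, "legacy" for anything else.
classify_note() {
  local fname
  fname=$(basename "$1")
  if [[ "$fname" =~ ^([0-9]+)\.([a-z]+)\.md$ ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    printf 'legacy\n'
  fi
}

classify_note "docs/release notes/4.0.0-RC.9/668.performance.md"  # prints "668 performance"
classify_note "BugFixes.md"                                       # prints "legacy"
classify_note "pr588.bugfix.md"                                   # prints "legacy"
```

Under that assumption, a name with a `pr` prefix such as `pr588.bugfix.md` does not start with digits, so it would fall through to the legacy branch and be appended verbatim rather than grouped under its tag.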