183 changes: 183 additions & 0 deletions .github/workflows/generate-release-notes.yml
@@ -0,0 +1,183 @@
name: Generate Release Notes

on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Release version to generate notes for (e.g. 4.0.0-RC.9)'
        required: true
        type: string
      previous_version:
        description: 'Previous release version for the version-range header (e.g. 4.0.0-RC.8); leave blank to omit'
        required: false
        type: string
        default: ''

jobs:
  generate-release-notes:
    runs-on: ubuntu-latest
    permissions:
      contents: write

    steps:
      - uses: actions/checkout@v4

      - name: Validate inputs and directory
        run: |
          NOTES_DIR="docs/release notes/${{ inputs.version }}"

          if [ ! -d "${NOTES_DIR}" ]; then
            echo "::error::Release notes directory '${NOTES_DIR}' does not exist."
            exit 1
          fi

          COUNT=$(find "${NOTES_DIR}" -maxdepth 1 -name "*.md" \
            -not -name "${{ inputs.version }}.md" | wc -l)
          if [ "${COUNT}" -eq 0 ]; then
            echo "::error::No .md source files found in '${NOTES_DIR}'."
            exit 1
          fi

          echo "Found ${COUNT} source file(s) in '${NOTES_DIR}'"

      - name: Assemble release notes
        run: |
          VERSION="${{ inputs.version }}"
          PREV="${{ inputs.previous_version }}"
          NOTES_DIR="docs/release notes/${VERSION}"
          OUTPUT="${NOTES_DIR}/${VERSION}.md"
          GITHUB_REPO="datastax/jvector"

          # -----------------------------------------------------------------------
          # Section definitions: these control grouping and output order.
          # Files named <PR#>.<tag>.md are automatically grouped under the matching
          # header. Add a new tag here to introduce a new section; reorder the
          # TAG_ORDER array to change the order in which sections appear.
          # Note: multiple tags may share a header (e.g. bugfix and fix both map to
          # "Bug Fixes and Issue Resolutions"). The first tag in TAG_ORDER that
          # matches wins the header slot, and all aliases are collected into that
          # same section.
          # -----------------------------------------------------------------------
          TAG_ORDER=("feature" "enhancement" "performance" "bugfix" "fix" "docs" "testing")
          declare -A TAG_HEADERS
          TAG_HEADERS["feature"]="## New Features"
          TAG_HEADERS["enhancement"]="## Enhancements"
          TAG_HEADERS["performance"]="## Performance Improvements"
          TAG_HEADERS["bugfix"]="## Bug Fixes and Issue Resolutions"
          TAG_HEADERS["fix"]="## Bug Fixes and Issue Resolutions"
          TAG_HEADERS["docs"]="## Documentation and Tutorials"
          TAG_HEADERS["testing"]="## Testing Enhancements"

          # -----------------------------------------------------------------------
          # Preamble
          # -----------------------------------------------------------------------
          if [ -n "${PREV}" ]; then
            VERSION_LINE="## Version: ${PREV} - ${VERSION}"
          else
            VERSION_LINE="## Version: ${VERSION}"
          fi
          printf '# JVector Release Notes\n\n%s\n\n---\n\n' "${VERSION_LINE}" > "${OUTPUT}"

          # -----------------------------------------------------------------------
          # Bucket source files into new-style (<PR#>.<tag>.md) or legacy bins.
          # Legacy files (e.g. BugFixes.md, old free-form names) are appended
          # verbatim at the end; they are expected to carry their own ## headers.
          # -----------------------------------------------------------------------
          declare -A tag_to_files   # tag -> newline-separated list of file paths
          legacy_files=()

          while IFS= read -r -d '' f; do
            fname=$(basename "$f")
            [[ "$fname" == "${VERSION}.md" ]] && continue

            if [[ "$fname" =~ ^([0-9]+)\.([a-z]+)\.md$ ]]; then
              tag="${BASH_REMATCH[2]}"
              tag_to_files["$tag"]+="${f}"$'\n'
            else
              legacy_files+=("$f")
            fi
          done < <(find "${NOTES_DIR}" -maxdepth 1 -name "*.md" \
            -not -name "${VERSION}.md" -print0 | sort -z)

          # -----------------------------------------------------------------------
          # Write new-style sections in TAG_ORDER sequence.
          # Each unique section (header plus entries) is written exactly once even
          # when multiple tag aliases (e.g. bugfix + fix) share the same header.
          # Entries within a section are sorted by PR number ascending.
          # -----------------------------------------------------------------------
          declare -A written_headers

          for tag in "${TAG_ORDER[@]}"; do
            [[ -z "${tag_to_files[$tag]}" ]] && continue

            header="${TAG_HEADERS[$tag]}"

            # Skip aliases whose shared section has already been written;
            # otherwise its entries would be emitted a second time.
            [[ -n "${written_headers[$header]+x}" ]] && continue
            written_headers["$header"]=1
            printf '%s\n\n' "${header}" >> "${OUTPUT}"

            # Collect all files whose tag aliases share this header
            all_files_for_section=()
            for alias_tag in "${TAG_ORDER[@]}"; do
              [[ "${TAG_HEADERS[$alias_tag]}" != "$header" ]] && continue
              [[ -z "${tag_to_files[$alias_tag]}" ]] && continue
              while IFS= read -r file; do
                [[ -n "$file" ]] && all_files_for_section+=("$file")
              done <<< "${tag_to_files[$alias_tag]}"
            done

            # Sort entries by PR number (numeric prefix of the filename).
            # cut keeps everything after the first space, so paths that contain
            # spaces (such as "docs/release notes/...") survive intact.
            IFS=$'\n' sorted_section=($(
              for f in "${all_files_for_section[@]}"; do
                fname=$(basename "$f")
                [[ "$fname" =~ ^([0-9]+)\. ]] && echo "${BASH_REMATCH[1]} $f"
              done | sort -n | cut -d' ' -f2-
            ))
            unset IFS

            for f in "${sorted_section[@]}"; do
              fname=$(basename "$f")
              [[ "$fname" =~ ^([0-9]+)\. ]] && pr_num="${BASH_REMATCH[1]}" || pr_num=""

              # Copy the file line-by-line, inserting the PR link immediately after
              # the opening ### heading so it appears directly under the entry title.
              inserted_pr=false
              while IFS= read -r line; do
                printf '%s\n' "$line" >> "${OUTPUT}"
                if ! $inserted_pr && [ -n "$pr_num" ] && [[ "$line" =~ ^### ]]; then
                  printf '\n**PR:** [#%s](https://github.com/%s/pull/%s)\n' \
                    "$pr_num" "$GITHUB_REPO" "$pr_num" >> "${OUTPUT}"
                  inserted_pr=true
                fi
              done < "$f"

              # Ensure a trailing separator between entries
              if [ "$(tail -c 5 "$f" | tr -d '[:space:]')" != "---" ]; then
                printf '\n\n---\n\n' >> "${OUTPUT}"
              fi
            done
          done

          # -----------------------------------------------------------------------
          # Append legacy files verbatim (carry their own ## section headers).
          # -----------------------------------------------------------------------
          for f in "${legacy_files[@]}"; do
            printf '\n' >> "${OUTPUT}"
            cat "$f" >> "${OUTPUT}"
          done

          echo "Generated ${OUTPUT}:"
          cat "${OUTPUT}"

      - name: Commit and push
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add "docs/release notes/${{ inputs.version }}/${{ inputs.version }}.md"
          if git diff --cached --quiet; then
            echo "Output file is unchanged; nothing to commit."
          else
            git commit -m "Generate release notes for ${{ inputs.version }}"
            git push
          fi
1 change: 1 addition & 0 deletions .github/workflows/pr_checklist.md
@@ -9,5 +9,6 @@ __Before you submit for review:__
- [ ] Did you adhere to the code formatting guidelines (TBD)
- [ ] Did you group your changes for easy review, providing meaningful descriptions for each commit?
- [ ] Did you ensure that all files contain the correct copyright header?
- [ ] Did you add documentation for this feature to the release notes directory?

If you did not complete any of these, then please explain below.
190 changes: 190 additions & 0 deletions docs/release notes/4.0.0-RC.8/4.0.0-RC.8.md
@@ -0,0 +1,190 @@
# JVector Release Notes

## Version: 4.0.0-rc.7 - 4.0.0-rc.8

---

## New Features and Enhancements

### Fused Product Quantization (Fused PQ)

**Description**
Fused Product Quantization (Fused PQ) is a performance optimization that embeds compressed Product Quantization (PQ) codes directly into the graph index, alongside each node's neighbor list. Because the PQ-encoded neighbor vectors are stored inline with the graph edges, graph traversal no longer needs a separate lookup to retrieve compressed vectors, which reduces memory access overhead and speeds up approximate similarity scoring during search. It also removes the need to keep a separate in-memory data structure for PQ-encoded vectors, reducing heap usage while keeping approximate similarity search fast.

**Purpose / Impact**
- Reduces memory usage for large-scale vector datasets
- Improves cache locality during graph traversal
- Enables higher writer scalability for large / high-dimensional vector workloads

**How to Enable**
To enable Fused PQ when writing an on-disk graph index:

1. **Create the FusedPQ feature** by passing your graph's max degree and a ProductQuantization compressor to the constructor:
```java
var fusedPQFeature = new FusedPQ(graph.maxDegree(), pq);
```

2. **Add it to your OnDiskGraphIndexWriter builder**:
```java
var writer = new OnDiskGraphIndexWriter.Builder(graph, outputPath)
.with(fusedPQFeature)
.build();
```

3. **Provide a state supplier during the write phase** that includes the graph view and PQ vectors:
```java
Map<FeatureId, IntFunction<Feature.State>> writeSuppliers = new EnumMap<>(FeatureId.class);
writeSuppliers.put(FeatureId.FUSED_PQ, ordinal -> new FusedPQ.State(view, pqVectors, ordinal));
writer.write(writeSuppliers);
```

**Notes**
- Fused PQ requires a 256-cluster ProductQuantization compressor. The feature automatically embeds compressed neighbor vectors inline with the graph structure during the write operation.
- To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices.
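
The 256-cluster requirement means each PQ subspace code fits in a single byte, which makes it easy to estimate how much the inline codes add to the index. The following is a rough sketch only: the class and method names are hypothetical, and it assumes one byte per subspace per neighbor slot while ignoring any headers, padding, or other per-node data in the actual file format.

```java
final class FusedPqSizeSketch {
    /** Bytes of inline PQ codes per node: one code (subspaceCount bytes) per neighbor slot. */
    static long inlineCodeBytesPerNode(int maxDegree, int subspaceCount) {
        return (long) maxDegree * subspaceCount;
    }

    /** Total inline PQ code bytes for the whole graph; throws ArithmeticException on overflow. */
    static long totalInlineCodeBytes(long nodeCount, int maxDegree, int subspaceCount) {
        return Math.multiplyExact(nodeCount, inlineCodeBytesPerNode(maxDegree, subspaceCount));
    }
}
```

For example, with a max degree of 32 and 64 subspaces, `inlineCodeBytesPerNode(32, 64)` gives 2048 bytes of inline codes per node under these assumptions.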

---

### Parallel Graph Index Construction

**Description**
OnDiskParallelGraphIndexWriter significantly accelerates graph index construction by addressing the disk I/O bottleneck that limits the serial OnDiskGraphIndexWriter. This implementation uses asynchronous file I/O with multiple worker threads to write graph records in parallel, with parallelism automatically determined by available system resources (or configurable via builder options). By parallelizing both record building and disk writes while maintaining correct ordering, this approach dramatically reduces the time required to persist large graph indexes to disk.

**Purpose / Impact**
- Eliminates the I/O bottleneck in on-disk graph construction
- Maintains backwards compatibility for existing clients of the JVector library

**How to Enable**
To enable parallel graph index writes, simply use `OnDiskParallelGraphIndexWriter.Builder` instead of `OnDiskGraphIndexWriter.Builder`:

**Basic usage (uses default parallelism based on available processors):**
```java
try (var writer = new OnDiskParallelGraphIndexWriter.Builder(graph, outputPath)
.with(features...)
.build()) {
writer.write(featureSuppliers);
}
```

**Advanced configuration:**
```java
try (var writer = new OnDiskParallelGraphIndexWriter.Builder(graph, outputPath)
.with(features...)
.withParallelWorkerThreads(8) // Optional: specify thread count (0 = auto)
.withParallelDirectBuffers(true) // Optional: use direct ByteBuffers for better performance
.build()) {
writer.write(featureSuppliers);
}
```

The parallel writer is a drop-in replacement for the standard writer with the same API, automatically leveraging multiple threads and asynchronous I/O to accelerate the write process.
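
To illustrate how parallel writes can still produce a correctly ordered file, here is a minimal, self-contained sketch. It is not the JVector implementation: the class name, the fixed 8-byte record layout, and the offset scheme are all assumptions made for illustration. The idea shown is that when each record's file offset can be computed from its ordinal, worker threads can write records in any order and the resulting layout is still deterministic.

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRecordWriteSketch {
    // Hypothetical fixed-size record: the ordinal followed by a 4-byte payload.
    static final int RECORD_BYTES = Integer.BYTES * 2;

    static void writeAll(Path file, int recordCount, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            List<Future<?>> pending = new ArrayList<>();
            for (int ordinal = 0; ordinal < recordCount; ordinal++) {
                final int o = ordinal;
                pending.add(pool.submit(() -> {
                    // Build the record on a worker thread.
                    ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES);
                    buf.putInt(o).putInt(o * 10).flip();
                    // The offset depends only on the ordinal, so completion order is irrelevant.
                    long position = (long) o * RECORD_BYTES;
                    while (buf.hasRemaining()) {
                        position += channel.write(buf, position).get();
                    }
                    return null;
                }));
            }
            // Wait for every record before the channel is closed.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Real graph records vary with the enabled features, so an actual writer would need to derive offsets from the index layout rather than from a constant record size.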

**Notes**
- Currently still marked as `@experimental`
- Deprecates the method
```java
public synchronized void writeInline(int ordinal, Map<FeatureId, Feature.State> stateMap)
```
in favor of the more descriptive
```java
public synchronized void writeFeaturesInline(int ordinal, Map<FeatureId, Feature.State> stateMap)
```

**Related Issues**
- [#579](https://github.com/datastax/jvector/issues/579)

---

### Documentation and Tutorials

**Description**
Added detailed Javadoc for all JVector components. Added quickstart [tutorials](https://github.com/datastax/jvector/tree/main/jvector-examples/src/main/java/io/github/jbellis/jvector/example/tutorial) to jvector-examples, including documentation and code.

**Purpose / Impact**
- Provide better documentation for JVector and its components
- Give new users an entry point showing how to use JVector

---

### Improved DataSet Handling

**Description**
Revised how JVector internally acquires and represents vector datasets. The change overhauls the loading process, improves logging and error handling, and includes virtualization and metadata handling.

**Purpose / Impact**
- Make datasets easier to find, work with, and define through metadata for internal testing
- Resolves several issues with regression and inter-release comparisons

**Notes**
- Used internally by JVector Bench classes, no client impact

---

### Testing Enhancements

**Description**
Enhancements to the JVector testing infrastructure:
- On-disk index cache added for the Grid benchmark harness
- Logging subsystem overhaul
- New JMH tests
- Test results now include metrics for `nodes visited`, `heap usage`, `disk usage`, `PQ Distance`

**Purpose / Impact**
- Faster testing cycle
- Better comprehension of test results
- New metrics for comparing results across releases

**Notes**
- Used internally by JVector, no client impact

**Related Issues**
- [#615](https://github.com/datastax/jvector/issues/615)
- [#616](https://github.com/datastax/jvector/issues/616)

---

## Bug Fixes and Issue Resolutions

### Fix: NullPointerException in `OnDiskGraphIndex#ramBytesUsed`

**Problem**
Because `inMemoryNeighbors` and `inMemoryFeatures` are now loaded lazily, they may be null, or contain null values, when `ramBytesUsed()` is called on an `OnDiskGraphIndex`. The method did not handle these cases, which could result in a NullPointerException.

**Resolution**
Added appropriate null checks and safeguards to ensure `ramBytesUsed()` can be safely invoked in all valid states.
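
The kind of guard described here can be sketched as follows. This is an illustration only: the class, the field types, and the byte accounting are hypothetical stand-ins, not the actual `OnDiskGraphIndex` internals. It shows a size estimate that tolerates both an unloaded (null) cache and null entries inside a loaded one.

```java
import java.util.Map;

final class LazyRamAccountingSketch {
    // Hypothetical lazily populated caches; both stay null until first loaded.
    Map<Integer, int[]> inMemoryNeighbors;
    Map<Integer, float[]> inMemoryFeatures;

    long ramBytesUsed() {
        long total = 0;
        if (inMemoryNeighbors != null) {
            for (int[] neighbors : inMemoryNeighbors.values()) {
                // Individual entries may also be null if they were never materialized.
                if (neighbors != null) {
                    total += (long) neighbors.length * Integer.BYTES;
                }
            }
        }
        if (inMemoryFeatures != null) {
            for (float[] features : inMemoryFeatures.values()) {
                if (features != null) {
                    total += (long) features.length * Float.BYTES;
                }
            }
        }
        return total;
    }
}
```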

**Related Issues**
- [#586](https://github.com/datastax/jvector/issues/586)

---

### Fix: Protection Against Invalid Ordinal Mappings

**Problem**
JVector relies on the caller to pass in ordinal maps that are constructed outside the JVector library. Improper or inconsistent ordinal mappings can cause failures when the graph is built, or produce incorrect indexing or search results.

**Resolution**
Added safeguards to detect invalid ordinal mappings.

**Notes**
Full validation of an ordinal mapping requires iterating over the entire set of ordinals, which can be costly. The safeguard is therefore only active when debug logging is enabled or when `System.getProperties().containsKey("VECTOR_DEBUG")` returns true.
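
As a rough sketch of what such an opt-in check could look like, the example below validates a mapping only when the `VECTOR_DEBUG` system property is present. The class name, the `Map<Integer, Integer>` representation, and the specific validity rules (targets must be in range and unique) are assumptions for illustration; the checks JVector actually performs may differ.

```java
import java.util.BitSet;
import java.util.Map;

final class OrdinalMappingCheckSketch {
    /** Mirrors the opt-in condition quoted above; debug-log detection is omitted here. */
    static boolean validationEnabled() {
        return System.getProperties().containsKey("VECTOR_DEBUG");
    }

    /** Returns null when the mapping looks valid, otherwise a description of the first problem found. */
    static String findProblem(Map<Integer, Integer> oldToNew, int maxNewOrdinal) {
        BitSet seen = new BitSet(maxNewOrdinal + 1);
        for (Map.Entry<Integer, Integer> entry : oldToNew.entrySet()) {
            Integer target = entry.getValue();
            if (target == null || target < 0 || target > maxNewOrdinal) {
                return "ordinal " + entry.getKey() + " maps to out-of-range target " + target;
            }
            if (seen.get(target)) {
                return "target ordinal " + target + " is assigned more than once";
            }
            seen.set(target);
        }
        return null;
    }

    /** Runs the full O(n) scan only when validation is enabled. */
    static void validateIfEnabled(Map<Integer, Integer> oldToNew, int maxNewOrdinal) {
        if (!validationEnabled()) {
            return;
        }
        String problem = findProblem(oldToNew, maxNewOrdinal);
        if (problem != null) {
            throw new IllegalArgumentException("Invalid ordinal mapping: " + problem);
        }
    }
}
```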

**Related Issues**
- [#568](https://github.com/datastax/jvector/issues/568)

---


### Fix: extractTrainingVectors may produce more than MAX_PQ_TRAINING_SET_SIZE vectors

**Problem**
`extractTrainingVectors` could return more vectors than the intended maximum (`MAX_PQ_TRAINING_SET_SIZE`), leading to excessive memory usage during PQ training.

**Resolution**
Uses Floyd's random sampling algorithm to select random training vectors from the `RandomAccessVectorValues`. The fix works in two phases: it first selects `MAX_PQ_TRAINING_SET_SIZE` random ordinals, then maps those ordinals to vectors.
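
For readers unfamiliar with Floyd's sampling algorithm, the two phases can be sketched as below. This is a standalone illustration, not the code from the fix: the class and method names are hypothetical, and an `IntFunction<float[]>` stands in for vector lookup by ordinal. Floyd's algorithm draws exactly `k` distinct values from `[0, n)` using `k` random numbers and no retries.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.IntFunction;

final class FloydSamplingSketch {
    /** Phase 1: choose k distinct ordinals from [0, n) with Floyd's algorithm. */
    static Set<Integer> sampleOrdinals(int n, int k, Random rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= k <= n");
        }
        Set<Integer> chosen = new LinkedHashSet<>();
        for (int j = n - k; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            // On a collision, take j instead; j cannot be in the set yet
            // because all earlier picks are at most j - 1.
            if (!chosen.add(t)) {
                chosen.add(j);
            }
        }
        return chosen;
    }

    /** Phase 2: map the sampled ordinals to vectors, capped at maxSize. */
    static List<float[]> extract(int n, int maxSize, IntFunction<float[]> vectorAt, Random rng) {
        int k = Math.min(n, maxSize);
        List<float[]> vectors = new ArrayList<>(k);
        for (int ordinal : sampleOrdinals(n, k, rng)) {
            vectors.add(vectorAt.apply(ordinal));
        }
        return vectors;
    }
}
```

Because each loop iteration adds exactly one new ordinal, the result can never exceed the requested size, which is the property the fix relies on.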

**Related Issues**
- [#590](https://github.com/datastax/jvector/issues/590)

---

