KIN-11: Add vocabulary learning, clustering, and aggregate creation flow#20
KIN-11: Add vocabulary learning, clustering, and aggregate creation flow#20brandonkindred wants to merge 1 commit into
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Free Tier Details
You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| } | ||
| } | ||
| return vocabularyRepository.save(vocabulary); | ||
| } |
There was a problem hiding this comment.
Duplicate words on upsert of existing vocabulary
High Severity
When upsertVocabulary loads an existing Vocabulary from Neo4j via findByLabel, the transient wordToIndexMap field is an empty HashMap (initialized by the default constructor but never rebuilt from valueList, since Neo4j OGM typically hydrates @Property fields via reflection, bypassing the setValueList setter that calls initializeMappings()). Calling addWord on this vocabulary fails to detect existing words, appending duplicates to valueList on every training call. The method needs to call vocabulary.initializeMappings() after loading from the repository.
| childIds.add(sourceVocabulary.getId()); | ||
| childIds.addAll(matching.stream().map(Vocabulary::getId).collect(Collectors.toList())); | ||
| vocabularyRepository.attachChildVocabularies(savedAggregate.getId(), childIds); | ||
| } |
There was a problem hiding this comment.
Unbounded aggregate vocabulary creation on every training call
Medium Severity
createAggregateVocabularyIfSimilar creates a new aggregate vocabulary node on every training invocation that meets the similarity threshold, because the label includes System.currentTimeMillis() making it always unique. There is no check for an existing aggregate. Over repeated training calls, this produces unbounded growth of aggregate nodes in the graph store, and previously-created aggregates themselves become candidates for future similarity matches, compounding the growth.
| - [x] Run targeted tests and fix any regressions. | ||
|
|
||
| ## Execution Notes | ||
| This plan has been fully implemented in this branch. |
There was a problem hiding this comment.
Implementation plan committed as permanent project file
Low Severity
KIN-11_VOCABULARY_PLAN.md is a task-tracking implementation plan with completed checkboxes and the note "This plan has been fully implemented in this branch." This is a project-management artifact that has served its purpose and does not belong as a permanent file in the repository root alongside actual documentation like README.md and API_SPEC.md.


Motivation
Description
VocabularyServicewhich normalizes labels, upserts vocabularies from training features, clusters strongly-related features usingFeatureWeightthresholds, computes Jaccard similarity between vocabularies, and creates aggregate vocabularies when similarity exceeds the threshold.attachChildVocabularies(...)Cypher-backed method toVocabularyRepositoryto persist(:Vocabulary)-[:PART_OF_VOCABULARY]->(:Vocabulary)relationships for aggregate nodes.vocabularyService.learnFromTrainingFeatures(...)in the/rl/trainendpoint before the existingbrain.train(...)call.VocabularyServiceTestscovering Jaccard similarity, clustering behavior, and aggregate-creation + child-attachment call paths, and saved an implementation plan inKIN-11_VOCABULARY_PLAN.md.Testing
src/test/java/com/deepthought/models/services/VocabularyServiceTests.javathat exercise similarity calculation, clustering, and aggregate creation; these tests were written and committed.mvn -q -Dtest=VocabularyTests,VocabularyServiceTests test, but the test run failed in this environment due to Maven Central/plugin resolution returning HTTP 403 forspring-boot-maven-plugin:2.2.6.RELEASE, so tests could not be executed here.Codex Task
Note
Medium Risk
Adds additional Neo4j writes and relationship creation on the
/rl/trainhot path, which could impact training performance and graph growth if thresholds/labels aren’t tuned. Logic is new but covered by unit tests for clustering, similarity scoring, and aggregate attachment.Overview
Introduces automatic vocabulary learning during
/rl/trainby calling a newVocabularyServicebeforebrain.train(...).The service upserts a base vocabulary from training features, clusters strongly-related features using
FeatureWeightedges, computes Jaccard similarity against existing vocabularies, and (when similar enough) creates an aggregate vocabulary and links child vocabularies via(:Vocabulary)-[:PART_OF_VOCABULARY]->(:Vocabulary).Adds
VocabularyRepository.attachChildVocabularies(...)Cypher query support, plus new unit tests for clustering/similarity/aggregate creation and a short implementation plan doc (KIN-11_VOCABULARY_PLAN.md).Written by Cursor Bugbot for commit c322004. This will update automatically on new commits. Configure here.