Feature/pr analysis rate limiting #105
base: 1.2.1-rc
Conversation
- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed the token limitation from the AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of the token limitation.
- Enhanced the Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded (an illustrative sketch of this gate follows the commit list below).
- Updated command processors to use project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to the VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to the token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
…Exception in webhook processors
… entities from async contexts
…ies without re-fetching in async contexts
… lazy loading of associations
…cy across transaction contexts
…oading of associations
…ansaction management in async context
…mproved job management
…ervice for direct deletion
…nt in webhook processing
…n for RAG context
- Added an AST-based code splitter using Tree-sitter for accurate code parsing.
- Introduced TreeSitterParser for dynamic language loading and caching.
- Created a scoring configuration for RAG query result reranking with configurable boost factors and priority patterns.
- Refactored RAGQueryService to use the new scoring configuration for enhanced result ranking.
- Improved metadata extraction and handling for better context in scoring.
…rove code referencing in prompts
… target branch results
…emental updates in RAG operations
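The following sketch is not part of the PR. It is a minimal illustration of the project-level token-limit gate described in the commit messages above, under assumed names: estimateTokens, assertWithinLimit, and analyzeOrSkip are hypothetical, and only DiffTooLargeException and maxAnalysisTokenLimit are mentioned elsewhere in this PR.

```java
// Hypothetical sketch of the described gate; names and signatures are illustrative,
// not the actual CodeCrow implementation.
import java.util.function.Consumer;

public class TokenLimitGateSketch {

    /** Thrown when the PR diff exceeds the configured analysis token limit. */
    static class DiffTooLargeException extends RuntimeException {
        DiffTooLargeException(String message) { super(message); }
    }

    /** Rough token estimate; assumes ~4 characters per token on average. */
    static int estimateTokens(String diff) {
        return diff == null ? 0 : diff.length() / 4;
    }

    /** Project-level limit check, replacing the removed per-connection tokenLimitation. */
    static void assertWithinLimit(String diff, Integer maxAnalysisTokenLimit) {
        if (maxAnalysisTokenLimit == null || maxAnalysisTokenLimit <= 0) {
            return; // no limit configured for the project
        }
        int estimated = estimateTokens(diff);
        if (estimated > maxAnalysisTokenLimit) {
            throw new DiffTooLargeException(
                    "Estimated " + estimated + " tokens exceeds project limit of " + maxAnalysisTokenLimit);
        }
    }

    /** Webhook-side handling: skip analysis and post an informative message instead of failing. */
    static void analyzeOrSkip(String diff, Integer limit,
                              Consumer<String> postVcsComment, Runnable runAnalysis) {
        try {
            assertWithinLimit(diff, limit);
            runAnalysis.run();
        } catch (DiffTooLargeException e) {
            postVcsComment.accept("Analysis skipped: " + e.getMessage());
        }
    }
}
```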
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI.
/codecrow analyze
Comment commands are not enabled for this project. Check the job logs in CodeCrow for detailed error information.
| Metric | Value |
|---|---|
| Status | FAIL |
| Risk Level | CRITICAL |
| Review Coverage | 81 files analyzed in depth |
| Confidence | HIGH |
Executive Summary
This PR implements a cross-ecosystem rate limiting and token estimation framework for the analysis pipeline, involving significant database migrations and changes to the RAG pipeline logic. While the architecture for AST-based splitting is sound, the current implementation contains critical breaking changes in the service layer and mismatched constructor signatures in the Python RAG engine. The transition of token limitation settings from AI connections to project configurations is incomplete, presenting a high risk to system stability.
Recommendation
Decision: FAIL
The PR cannot be merged in its current state due to several blockers that will cause immediate compilation and runtime failures in both the Java backend and the Python RAG service. Please address the interface mismatches and the incomplete service layer refactoring before resubmission.
Key Review Pillars
🏗️ Architecture & Integration
- Breaking Changes: The removal of the tokenLimitation fields from DTOs and entities was not synchronized with the service layer, resulting in direct reference errors.
- Interface Mismatch: The RAG IndexManager attempts to pass unsupported arguments to the ASTCodeSplitter, which will trigger a TypeError at runtime.
🛡️ System Integrity
- Concurrency & Locking: Potential dangling locks identified in the analysis processor if the caller fails to release pre-acquired locks.
- Data Validation: Missing constraints on the new numeric limit (maxAnalysisTokenLimit) could lead to downstream resource exhaustion or logic errors.
⚡ Performance & Scalability
- Efficiency: The webhook deduplication service performs synchronous cleanup on the hot path; shifting this to a background task is recommended for high-load scenarios.
- Granularity: The increase of RAG chunk sizes to 8,000 characters may exceed standard embedding model token limits, potentially degrading semantic accuracy.
Issues Overview
| Severity | Count | Description |
|---|---|---|
| 🔴 High | 3 | Critical issues requiring immediate attention |
| 🟡 Medium | 12 | Issues that should be addressed |
| 🔵 Low | 8 | Minor issues and improvements |
| ℹ️ Info | 1 | Informational notes and suggestions |
Analysis completed on 2026-01-29 00:08:03 | View Full Report | Pull Request
📋 Detailed Issues (24)
🔴 High Severity Issues
Id on Platform: 1550
Category: 🏗️ Architecture
File: .../ai/AIConnection.java:98
Issue: Removing 'setTokenLimitation' and 'getTokenLimitation' will break existing consumers visible in the codebase context, specifically 'AIConnectionService.updateAiConnection' and 'AIConnectionDTO.fromAiConnection'.
💡 Suggested Fix
Ensure all callers (AIConnectionService, AIConnectionDTO, AIConnectionBuilder, and tests) are updated to use the new ProjectConfig-based limitation before removing these methods, or keep them as deprecated delegates if backward compatibility is required during the transition.
No suggested fix provided
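To make the deprecated-delegate option above concrete, here is a minimal hypothetical sketch (not code from this PR; the field and accessor shapes are assumptions) of transitional accessors that keep existing callers compiling until they migrate to the project-level limit.

```java
// Hypothetical compatibility shim, illustrative only: keep the old accessors as
// deprecated members so AIConnectionService and AIConnectionDTO keep compiling
// until they are migrated to the project-level maxAnalysisTokenLimit.
public class AIConnectionCompatSketch {

    @Deprecated
    private Integer tokenLimitation;

    /** @deprecated superseded by the project-level maxAnalysisTokenLimit. */
    @Deprecated
    public Integer getTokenLimitation() {
        return tokenLimitation;
    }

    /** @deprecated superseded by the project-level maxAnalysisTokenLimit. */
    @Deprecated
    public void setTokenLimitation(Integer tokenLimitation) {
        this.tokenLimitation = tokenLimitation;
    }
}
```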
Id on Platform: 1561
Category: 🐛 Bug Risk
File: .../service/AIConnectionService.java:33
Issue: Removing 'tokenLimitation' from 'CreateAIConnectionRequest' (and 'UpdateAiConnectionRequest') will cause a compilation error or runtime failure in 'AIConnectionService.updateAiConnection' because it attempts to access 'request.tokenLimitation'.
💡 Suggested Fix
Either retain the field in the DTOs or update 'AIConnectionService.java' to remove the logic that processes 'tokenLimitation'. Given that the logic exists in the service, the field should likely be kept in the DTO or handled appropriately.
--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/CreateAIConnectionRequest.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/CreateAIConnectionRequest.java
@@ -13,4 +13,6 @@
public String aiModel;
@NotBlank(message = "API key is required")
public String apiKey;
+ @NotBlank(message = "Please specify max token limit")
+ public String tokenLimitation;
}
Id on Platform: 1562
Category: 🐛 Bug Risk
File: .../service/AIConnectionService.java:33
Issue: The 'AIConnectionService' explicitly references 'request.tokenLimitation' inside 'updateAiConnection'. Removing this field from the DTO will break the service layer code seen in the context.
💡 Suggested Fix
Restore the 'tokenLimitation' field to 'UpdateAiConnectionRequest' to maintain compatibility with 'AIConnectionService'.
--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/UpdateAiConnectionRequest.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/UpdateAiConnectionRequest.java
@@ -9,5 +9,6 @@
public AIProviderKey providerKey;
public String aiModel;
public String apiKey;
+ public String tokenLimitation;
}
🟡 Medium Severity Issues
Id on Platform: 1552
Category: ✨ Best Practices
File: .../analysis/PullRequestAnalysisProcessor.java:231
Issue: When 'isPreAcquired' is true, the processor skips the 'releaseLock' call in the 'finally' block. This assumes the caller will ALWAYS handle the release. If the caller fails to do so (e.g., due to an unexpected exception before this method returns), the lock will remain held until it naturally expires, potentially blocking subsequent analyses.
💡 Suggested Fix
While the logic correctly avoids double-releasing if the 'releaseLock' method isn't idempotent, it is safer to ensure the caller has a robust mechanism for release. If the lock service supports idempotent releases, the 'finally' block should be simplified to always release.
No suggested fix provided
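The review attaches no code for this finding; the following is a hedged sketch of the "always release in finally" shape it describes, assuming a hypothetical LockService whose releaseLock is idempotent (none of these names come from the PR).

```java
// Illustrative sketch only: if releaseLock is idempotent, the processor can always
// release in finally instead of special-casing pre-acquired locks.
public class LockReleaseSketch {

    interface LockService {
        boolean tryAcquire(String key);
        void releaseLock(String key); // assumed safe to call more than once
    }

    private final LockService lockService;

    public LockReleaseSketch(LockService lockService) {
        this.lockService = lockService;
    }

    public void process(String lockKey, boolean isPreAcquired, Runnable analysis) {
        if (!isPreAcquired && !lockService.tryAcquire(lockKey)) {
            return; // another analysis already holds the lock
        }
        try {
            analysis.run();
        } finally {
            // Always release: an idempotent release makes a double-release harmless
            // and avoids dangling locks when the caller forgets to clean up.
            lockService.releaseLock(lockKey);
        }
    }
}
```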
Id on Platform: 1553
Category: ⚡ Performance
File: .../service/WebhookDeduplicationService.java:71
Issue: The cleanup of the ConcurrentHashMap is performed synchronously inside the isDuplicateCommitAnalysis method. As the map grows or under high load, iterating through the entry set on every check can introduce latency in the webhook processing pipeline.
💡 Suggested Fix
Move the cleanup logic to a scheduled background task using @Scheduled to keep the hot path (checking duplicates) as fast as possible.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
@@ -3,6 +3,7 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
@@ -67,17 +68,14 @@
// Record this analysis
recentCommitAnalyses.put(key, now);
- // Cleanup old entries
- cleanupOldEntries(now);
-
return false;
}
/**
* Remove entries older than the dedup window to prevent memory growth.
*/
- private void cleanupOldEntries(Instant now) {
+ @Scheduled(fixedDelay = 60000)
+ public void cleanupOldEntries() {
+ Instant now = Instant.now();
recentCommitAnalyses.entrySet().removeIf(entry -> {
long age = now.getEpochSecond() - entry.getValue().getEpochSecond();
Id on Platform: 1554
Category: 🐛 Bug Risk
File: .../service/ProjectService.java:383
Issue: The code uses project.getWorkspace().getId() to verify ownership. If the Project entity is loaded via findByIdWithConnections, the 'workspace' association might be null if not eagerly fetched or if the join is not defined correctly in the repository, potentially causing a NullPointerException.
💡 Suggested Fix
Ensure the workspace is part of the fetch graph or use a repository method that joins the workspace to safely check the ID.
--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/service/ProjectService.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/service/ProjectService.java
@@ -383,7 +383,7 @@
.orElseThrow(() -> new NoSuchElementException("Project not found"));
// Verify workspace ownership
- if (!project.getWorkspace().getId().equals(workspaceId)) {
+ if (project.getWorkspace() == null || !project.getWorkspace().getId().equals(workspaceId)) {
throw new NoSuchElementException("Project not found in workspace");
}
Id on Platform: 1555
Category: ✨ Best Practices
File: .../controller/ProjectController.java:597
Issue: The new field 'maxAnalysisTokenLimit' is an Integer which could be negative or excessively large. It is best practice to validate such numeric limits to prevent logic errors or resource exhaustion downstream.
💡 Suggested Fix
Add a validation annotation such as @Min to the maxAnalysisTokenLimit field in the UpdateAnalysisSettingsRequest record.
--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
@@ -595,5 +595,6 @@
Boolean branchAnalysisEnabled,
String installationMethod,
+ @jakarta.validation.constraints.Min(0)
Integer maxAnalysisTokenLimit
) {}
Id on Platform: 1557
Category: 🐛 Bug Risk
File: .../service/RagOperationsServiceImpl.java:813
Issue: The code uses 'indexedCommit.substring(0, 7)' and 'currentCommit.substring(0, 7)' without verifying that the commit hashes are at least 7 characters long. While commit hashes are usually 40 chars, short strings (e.g. 'none' or dev aliases) will cause a StringIndexOutOfBoundsException.
💡 Suggested Fix
Use Math.min to safely calculate the substring end index, similar to the logic previously removed in this diff.
--- a/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
+++ b/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
@@ -812,2 +812,2 @@
- log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}",
- targetBranch, indexedCommit.substring(0, 7), currentCommit.substring(0, 7),
+ log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}",
+ targetBranch, indexedCommit.substring(0, Math.min(7, indexedCommit.length())), currentCommit.substring(0, Math.min(7, currentCommit.length())),
Id on Platform: 1558
Category: ✨ Best Practices
File: .../processor/WebhookAsyncProcessor.java
Issue: Circular reference risk. While using @Lazy with @Autowired for self-injection is a common workaround for Spring proxying issues, it can complicate the bean lifecycle and is often considered a code smell.
💡 Suggested Fix
Consider moving the @Transactional logic into a separate dedicated service (e.g., WebhookProcessingService) rather than using self-injection within the same class.
No suggested fix provided
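No code accompanies this finding either; below is a hypothetical sketch of the suggested refactoring, with the @Transactional work moved into a dedicated bean so the async processor calls it through a normal Spring proxy and no self-injection is needed. The class and method names (WebhookProcessingService, processInTransaction) are illustrative assumptions, not the project's actual API.

```java
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Illustrative sketch only: the transactional boundary lives on a separate bean.
@Service
class WebhookProcessingService { // hypothetical name

    @Transactional
    public void processInTransaction(String payload) {
        // persist entities, update job state, etc.
    }
}

@Service
class WebhookAsyncProcessorSketch {

    private final WebhookProcessingService processingService;

    WebhookAsyncProcessorSketch(WebhookProcessingService processingService) {
        this.processingService = processingService;
    }

    @Async
    public void handle(String payload) {
        // Calling another bean means the @Transactional proxy is applied,
        // without @Lazy self-injection inside the processor.
        processingService.processInTransaction(payload);
    }
}
```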
Id on Platform: 1560
Category: ✨ Best Practices
File: .../config/AsyncConfig.java
Issue: The manual implementation of a RejectedExecutionHandler that calls 'r.run()' is functionally equivalent to the standard 'ThreadPoolExecutor.CallerRunsPolicy'. Using the built-in policy is cleaner and less error-prone.
💡 Suggested Fix
Replace the manual lambda with the standard Spring/JDK CallerRunsPolicy.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/config/AsyncConfig.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/config/AsyncConfig.java
@@ -50,14 +50,7 @@
executor.setThreadNamePrefix("webhook-");
executor.setWaitForTasksToCompleteOnShutdown(true);
executor.setAwaitTerminationSeconds(120);
- // Log when tasks are rejected due to full queue
- executor.setRejectedExecutionHandler((r, e) -> {
- log.error("WEBHOOK EXECUTOR REJECTED TASK! Queue is full. Pool size: {}, Active: {}, Queue size: {}",
- e.getPoolSize(), e.getActiveCount(), e.getQueue().size());
- // Try to run in caller thread as fallback
- if (!e.isShutdown()) {
- r.run();
- }
- });
+ executor.setRejectedExecutionHandler(new java.util.concurrent.ThreadPoolExecutor.CallerRunsPolicy());
executor.initialize();
Id on Platform: 1564
Category: ⚡ Performance
File: .../index_manager/indexer.py:82
Issue: The repository size estimation uses random.sample on the file list. Sampling on every index check adds overhead when the filesystem is slow or the file list is massive, and although the current code guards against a sample size larger than the remaining files, that guard is easy to break if the logic changes. More importantly, the estimation performs a full load and split for each sampled file, which is heavy.
💡 Suggested Fix
Consider estimating based on file metadata (size) rather than full content loading and splitting for a sample of 100 files, as loading/splitting is the most expensive part of the pipeline.
No suggested fix provided
Id on Platform: 1568
Category: 🐛 Bug Risk
File: .../splitter/query_runner.py:212
Issue: cursor.matches(tree.root_node) returns an iterator. Converting it to a list with list() can cause high memory consumption for very large files with thousands of AST nodes and matches.
💡 Suggested Fix
Iterate directly over the cursor matches instead of casting to a list to reduce memory overhead.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -218,1 +218,1 @@
- raw_matches = list(cursor.matches(tree.root_node))
+ raw_matches = cursor.matches(tree.root_node)
Id on Platform: 1570
Category: 🐛 Bug Risk
File: .../models/scoring_config.py
Issue: The priority matching logic uses 'any(p in path_lower for p in self.high)'. This can lead to false positives. For example, a file named 'base_component_test.py' would match 'component' (MEDIUM) and 'test' (LOW). Since 'HIGH' is checked first, it will return the first match it finds, but if a path contains both a high and low keyword, the behavior might be unexpected depending on keyword order.
💡 Suggested Fix
Consider using regex boundaries or splitting the path into segments to ensure 'auth' doesn't match 'author_profile.jpg' and that 'test' takes precedence if it's a test file.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
@@ -102,10 +102,11 @@
path_lower = file_path.lower()
-
+ import re
+ def matches(patterns): return any(re.search(rf'\b{re.escape(p)}\b|[/_]{re.escape(p)}[/_.]', path_lower) for p in patterns)
- if any(p in path_lower for p in self.high):
+ if matches(self.high):
return ('HIGH', self.high_boost)
Id on Platform: 1572
Category: ✨ Best Practices
File: .../tests/test_rag.py:63
Issue: The chunk_size was increased from 800 to 8000. Most common embedding models (like OpenAI's text-embedding-ada-002 or various HuggingFace models) have a token limit that translates roughly to 1500-3000 characters. 8000 characters will likely lead to truncation or loss of semantic granularity.
💡 Suggested Fix
Lower the default chunk size to a range compatible with standard embedding models (e.g., 2000-4000) or ensure the downstream components can handle large chunks without significant performance degradation.
--- a/python-ecosystem/rag-pipeline/tests/test_rag.py
+++ b/python-ecosystem/rag-pipeline/tests/test_rag.py
@@ -62,1 +62,1 @@
- assert config.chunk_size == 8000 # Increased to fit most semantic units
+ assert config.chunk_size == 2000 # Increased to fit most semantic units while remaining within embedding limits
Id on Platform: 1573
Category: ✨ Best Practices
File: .../core/index_manager.py:64
Issue: The new initialization of ASTCodeSplitter includes an enrich_embedding_text argument. However, as seen in the codebase context for src/rag_pipeline/core/ast_splitter.py, the __init__ method only accepts max_chunk_size, min_chunk_size, chunk_overlap, and parser_threshold. Passing an unknown keyword argument will raise a TypeError at runtime.
💡 Suggested Fix
Remove the unsupported enrich_embedding_text parameter from the constructor call to prevent a TypeError.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
@@ -63,5 +63,4 @@
chunk_overlap=config.chunk_overlap,
- parser_threshold=3, # Low threshold - AST benefits even small files
- enrich_embedding_text=True # Prepend semantic context for better embeddings
+ parser_threshold=3 # Low threshold - AST benefits even small files
)
🔵 Low Severity Issues
Id on Platform: 1551
Category: ✨ Best Practices
File: .../project/Project.java:230
Issue: Creating a 'new ProjectConfig()' every time getEffectiveConfig() is called on a null configuration might be slightly inefficient if called in tight loops, though it provides safety.
💡 Suggested Fix
Consider initializing the 'configuration' field with a default value at the declaration site or in the constructor to avoid repeated allocations and null checks.
--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/project/Project.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/project/Project.java
@@ -229,1 +229,12 @@
- public org.rostilos.codecrow.core.model.project.config.ProjectConfig getEffectiveConfig() {
- return configuration != null ? configuration : new org.rostilos.codecrow.core.model.project.config.ProjectConfig();
- }
+ public org.rostilos.codecrow.core.model.project.config.ProjectConfig getEffectiveConfig() {
+ if (configuration == null) {
+ configuration = new org.rostilos.codecrow.core.model.project.config.ProjectConfig();
+ }
+ return configuration;
+ }
Id on Platform: 1556
Category: ✨ Best Practices
File: .../service/JobService.java:228
Issue: Repeated use of 'jobRepository.findById(job.getId()).orElse(job)' across multiple methods (startJob, completeJob, cancelJob, skipJob) leads to code duplication.
💡 Suggested Fix
Extract the re-fetching logic into a private helper method to improve maintainability.
--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/JobService.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/JobService.java
@@ -224,3 +224,3 @@
- // Re-fetch the job in case it was passed from a different transaction context
- job = jobRepository.findById(job.getId()).orElse(job);
+ job = ensureManaged(job);
@@ -320,0 +320,4 @@
+ private Job ensureManaged(Job job) {
+ return jobRepository.findById(job.getId()).orElse(job);
+ }
+
Id on Platform: 1559
Category: 🛡️ Error Handling
File: .../processor/WebhookAsyncProcessor.java
Issue: Potential for duplicate job failure state updates. If self.processWebhookInTransaction fails and has already attempted to fail the job within its own try-catch or transaction, the outer catch block attempts to fail it again. While likely idempotent in the DB, it causes unnecessary error logs.
💡 Suggested Fix
Add a check to see if the job is already in a terminal state before calling failJob in the outer catch block.
--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/WebhookAsyncProcessor.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/WebhookAsyncProcessor.java
@@ -73,3 +73,3 @@
- try {
- jobService.failJob(job, "Async processing failed: " + e.getMessage());
- } catch (Exception failError) {
+ try {
+ if (!job.isTerminal()) {
+ jobService.failJob(job, "Async processing failed: " + e.getMessage());
+ }
+ } catch (Exception failError) {
Id on Platform: 1563
Category: 🔒 Security
File: .../llm/llm_factory.py:143
Issue: The 'ai_api_key' is not logged here; only 'provider', 'model', and 'temperature' are. While these are not secrets, ensure that the 'ai_api_key' variable (which is available in the scope) is never accidentally added to this log line in future iterations, to prevent secret leakage in logs.
💡 Suggested Fix
Maintain the current logging but ensure PII or credentials never enter the log stream.
No suggested fix provided
Id on Platform: 1565
Category: 🛡️ Error Handling
File: .../index_manager/indexer.py:247
Issue: In the finally block of index_repository, existing_other_branch_points is deleted, but if the variable was never initialized due to an early exception before the assignment, an UnboundLocalError could occur.
💡 Suggested Fix
Initialize existing_other_branch_points to an empty list at the start of the method.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -124,11 +124,11 @@
temp_collection_name = self.collection_manager.create_versioned_collection(alias_name)
# Check existing collection and preserve other branch data
+ existing_other_branch_points = []
old_collection_exists = self.collection_manager.alias_exists(alias_name)
if not old_collection_exists:
old_collection_exists = self.collection_manager.collection_exists(alias_name)
- existing_other_branch_points = []
if old_collection_exists:
actual_collection = self.collection_manager.resolve_alias(alias_name) or alias_name
existing_other_branch_points = self.branch_manager.preserve_other_branch_points(
Id on Platform: 1567
Category: 🧹 Code Quality
File: .../splitter/metadata.py
Issue: The extract_inheritance method for Python uses a regex that assumes simple parentheticals for inheritance, but Python allows complex expressions, keyword arguments (metaclasses), and trailing commas which might break this simplified split logic.
💡 Suggested Fix
Use a more robust regex or manual parsing for Python inheritance to handle potential trailing commas and keyword arguments in class definitions.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -226,1 +226,1 @@
- result['extends'] = [e.strip() for e in extends.split(',') if e.strip()]
+ result['extends'] = [e.strip() for e in extends.split(',') if e.strip() and '=' not in e]
Id on Platform: 1569
Category: ✨ Best Practices
File: .../splitter/tree_parser.py:109
Issue: The parse method converts the entire source string to bytes using UTF-8 for every call. For very large files, repeated conversion can be inefficient.
💡 Suggested Fix
If the caller already has bytes, allow passing bytes directly, or cache the byte-encoded version of the source if multiple operations are performed.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
@@ -107,2 +107,4 @@
- parser = Parser(language)
- tree = parser.parse(bytes(source_code, "utf8"))
+ parser = Parser(language)
+ source_bytes = source_code if isinstance(source_code, bytes) else source_code.encode("utf8")
+ tree = parser.parse(source_bytes)
return tree
Id on Platform: 1571
Category: 🐛 Bug Risk
File: .../services/query_service.py:112
Issue: The chunk_id generation uses hash(text[:100]). Using only the first 100 characters for identity might lead to collisions for files that share common headers, license blocks, or imports, causing the deduplication logic to incorrectly skip unique chunks.
💡 Suggested Fix
Increase the prefix length for the hash or use a hash of the full text to ensure uniqueness of the chunk identity.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py
@@ -111,3 +111,3 @@
# Create chunk identity (path + start of content)
- chunk_id = f"{path}:{branch}:{hash(result.get('text', '')[:100])}"
+ chunk_id = f"{path}:{branch}:{hash(result.get('text', ''))}"
ℹ️ Informational Notes
Id on Platform: 1566
Category: 🧹 Code Quality
File: .../index_manager/stats_manager.py:106
Issue: The parameter alias_checker is defined in the method signature but not utilized within the loop or the logic that determines which collections to process.
💡 Suggested Fix
Either use the alias_checker to skip aliased collections or remove the unused parameter from the method signature if it's not needed for the current logic.
--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
@@ -106,1 +106,1 @@
- def list_all_indices(self, alias_checker) -> List[IndexStats]:
+ def list_all_indices(self) -> List[IndexStats]:
Files Affected
- .../processor/WebhookAsyncProcessor.java: 2 issues
- .../index_manager/indexer.py: 2 issues
- .../service/AIConnectionService.java: 2 issues
- .../splitter/metadata.py: 1 issue
- .../analysis/PullRequestAnalysisProcessor.java: 1 issue
- .../core/index_manager.py: 1 issue
- .../splitter/tree_parser.py: 1 issue
- .../index_manager/stats_manager.py: 1 issue
- .../service/WebhookDeduplicationService.java: 1 issue
- .../service/ProjectService.java: 1 issue