
Conversation

@rostilos
Owner

No description provided.

- Updated JobService to use REQUIRES_NEW transaction propagation for deleting ignored jobs, ensuring fresh entity retrieval and preventing issues with the calling transaction.
- Removed token limitation from AI connection model and related DTOs, transitioning to project-level configuration for token limits.
- Adjusted AIConnectionDTO tests to reflect the removal of token limitation.
- Enhanced Bitbucket, GitHub, and GitLab AI client services to check token limits before analysis, throwing DiffTooLargeException when limits are exceeded (see the sketch after this list).
- Updated command processors to utilize project-level token limits instead of AI connection-specific limits.
- Modified webhook processing to handle diff size issues gracefully, posting informative messages to VCS when analysis is skipped due to large diffs.
- Cleaned up integration tests to remove references to token limitation in AI connection creation and updates.
…sis processing. Project PR analysis max analysis token limit implementation
- Added AST-based code splitter using Tree-sitter for accurate code parsing.
- Introduced TreeSitterParser for dynamic language loading and caching.
- Created scoring configuration for RAG query result reranking with configurable boost factors and priority patterns.
- Refactored RAGQueryService to utilize the new scoring configuration for enhanced result ranking.
- Improved metadata extraction and handling for better context in scoring.
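
A minimal sketch of the token-limit gate described above (the class layout, the roughly-4-characters-per-token heuristic, and the method names are illustrative assumptions, not the actual CodeCrow implementation):

public final class TokenLimitGate {

    /** Thrown when a diff is too large for the configured analysis limit. */
    public static class DiffTooLargeException extends RuntimeException {
        public DiffTooLargeException(String message) { super(message); }
    }

    // Rough heuristic: about 4 characters per token for typical source text.
    private static int estimateTokens(String diff) {
        return (int) Math.ceil(diff.length() / 4.0);
    }

    // Skips the check when no project-level limit is configured; otherwise throws,
    // so the webhook layer can skip the analysis and post an informative message
    // to the VCS instead of failing mid-run.
    public static void check(String diff, Integer maxAnalysisTokenLimit) {
        if (maxAnalysisTokenLimit == null || maxAnalysisTokenLimit <= 0) {
            return;
        }
        int estimated = estimateTokens(diff);
        if (estimated > maxAnalysisTokenLimit) {
            throw new DiffTooLargeException("Estimated " + estimated
                    + " tokens exceeds the project limit of " + maxAnalysisTokenLimit);
        }
    }
}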
@coderabbitai

coderabbitai bot commented Jan 29, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@rostilos
Owner Author

/codecrow analyze

@codecrow-ai

codecrow-ai bot commented Jan 29, 2026

⚠️ CodeCrow Command Failed

Comment commands are not enabled for this project


Check the job logs in CodeCrow for detailed error information.

@codecrow-local

⚠️ Code Analysis Results

Summary

Pull Request Review: Feature/pr analysis rate limiting

Status: FAIL
Risk Level: CRITICAL
Review Coverage: 81 files analyzed in depth
Confidence: HIGH

Executive Summary

This PR implements a cross-ecosystem rate limiting and token estimation framework for the analysis pipeline, involving significant database migrations and RAG pipeline logic. While the architecture for AST-based splitting is sound, the current implementation contains critical breaking changes in the service layer and mismatched constructor signatures in the Python RAG engine. The transition of token limitation settings from AI connections to project configurations is currently incomplete, presenting a high risk to system stability.

Recommendation

Decision: FAIL

The PR cannot be merged in its current state due to several blockers that will cause immediate compilation and runtime failures in both the Java backend and the Python RAG service. Please address the interface mismatches and the incomplete service layer refactoring before resubmission.


Key Review Pillars

🏗️ Architecture & Integration

  • Breaking Changes: The removal of tokenLimitation fields from DTOs and entities was not synchronized with the service layer, resulting in direct reference errors.
  • Interface Mismatch: The RAG IndexManager attempts to pass unsupported arguments to the ASTCodeSplitter, which will trigger a TypeError at runtime.

🛡️ System Integrity

  • Concurrency & Locking: Potential dangling locks identified in the analysis processor if the caller fails to release pre-acquired locks.
  • Data Validation: Missing constraints on new numeric limits (maxAnalysisTokenLimit) could lead to downstream resource exhaustion or logic errors.

⚡ Performance & Scalability

  • Efficiency: The webhook deduplication service performs synchronous cleanup on the hot path; shifting this to a background task is recommended for high-load scenarios.
  • Granularity: The increase of RAG chunk sizes to 8,000 characters may exceed standard embedding model token limits, potentially degrading semantic accuracy.

Issues Overview

Severity counts:
🔴 High: 3 (critical issues requiring immediate attention)
🟡 Medium: 12 (issues that should be addressed)
🔵 Low: 8 (minor issues and improvements)
ℹ️ Info: 1 (informational notes and suggestions)

Analysis completed on 2026-01-29 00:08:03 | View Full Report | Pull Request


📋 Detailed Issues (24)

🔴 High Severity Issues

Id on Platform: 1550

Category: 🏗️ Architecture

File: .../ai/AIConnection.java:98

Issue: Removing 'setTokenLimitation' and 'getTokenLimitation' will break existing consumers visible in the codebase context, specifically 'AIConnectionService.updateAiConnection' and 'AIConnectionDTO.fromAiConnection'.

💡 Suggested Fix

Ensure all callers (AIConnectionService, AIConnectionDTO, AIConnectionBuilder, and tests) are updated to use the new ProjectConfig-based limitation before removing these methods, or keep them as deprecated delegates if backward compatibility is required during the transition.

No suggested fix provided
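
As no patch accompanies the note above, a rough sketch of the deprecated-delegate option could look like this (purely illustrative; it assumes AIConnection can reach its owning Project and that the getEffectiveConfig()/maxAnalysisTokenLimit accessors mentioned elsewhere in this report exist):

    // Inside AIConnection (sketch): keep the old accessors as deprecated delegates
    // so AIConnectionService and AIConnectionDTO keep compiling during the transition.
    @Deprecated
    public Integer getTokenLimitation() {
        // Assumes the connection holds a reference to its owning project.
        return project != null
                ? project.getEffectiveConfig().getMaxAnalysisTokenLimit()
                : null;
    }

    @Deprecated
    public void setTokenLimitation(Integer tokenLimitation) {
        // Intentionally a no-op: the limit is now owned by the project configuration.
    }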

View Issue Details


Id on Platform: 1561

Category: 🐛 Bug Risk

File: .../service/AIConnectionService.java:33

Issue: Removing 'tokenLimitation' from 'CreateAIConnectionRequest' (and 'UpdateAiConnectionRequest') will cause a compilation error or runtime failure in 'AIConnectionService.updateAiConnection' because it attempts to access 'request.tokenLimitation'.

💡 Suggested Fix

Either retain the field in the DTOs or update 'AIConnectionService.java' to remove the logic that processes 'tokenLimitation'. Given that the logic exists in the service, the field should likely be kept in the DTO or handled appropriately.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/CreateAIConnectionRequest.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/CreateAIConnectionRequest.java
@@ -13,4 +13,6 @@
     public String aiModel;
     @NotBlank(message = "API key is required")
     public String apiKey;
+    @NotBlank(message = "Please specify max token limit")
+    public String tokenLimitation;
 }

View Issue Details


Id on Platform: 1562

Category: 🐛 Bug Risk

File: .../service/AIConnectionService.java:33

Issue: The 'AIConnectionService' explicitly references 'request.tokenLimitation' inside 'updateAiConnection'. Removing this field from the DTO will break the service layer code seen in the context.

💡 Suggested Fix

Restore the 'tokenLimitation' field to 'UpdateAiConnectionRequest' to maintain compatibility with 'AIConnectionService'.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/UpdateAiConnectionRequest.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/ai/dto/request/UpdateAiConnectionRequest.java
@@ -9,5 +9,6 @@
     public AIProviderKey providerKey;
     public String aiModel;
     public String apiKey;
+    public String tokenLimitation;
 }

View Issue Details


🟡 Medium Severity Issues

Id on Platform: 1552

Category: ✨ Best Practices

File: .../analysis/PullRequestAnalysisProcessor.java:231

Issue: When 'isPreAcquired' is true, the processor skips the 'releaseLock' call in the 'finally' block. This assumes the caller will ALWAYS handle the release. If the caller fails to do so (e.g., due to an unexpected exception before this method returns), the lock will remain held until it naturally expires, potentially blocking subsequent analyses.

💡 Suggested Fix

While the logic correctly avoids double-releasing if the 'releaseLock' method isn't idempotent, it is safer to ensure the caller has a robust mechanism for release. If the lock service supports idempotent releases, the 'finally' block should be simplified to always release.

No suggested fix provided
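
If the lock service does offer (or can be given) an idempotent release, the simplified always-release shape could look like this sketch (lockService, lockKey, and runAnalysis are illustrative names, not the actual processor API):

    // Sketch: release unconditionally instead of branching on isPreAcquired.
    boolean lockAcquired = isPreAcquired || lockService.tryAcquire(lockKey);
    if (!lockAcquired) {
        return; // another analysis already holds the lock for this pull request
    }
    try {
        runAnalysis();
    } finally {
        // Safe only if releaseLock is idempotent: releasing an already-released or
        // expired lock must be a no-op, so neither the caller nor this block can
        // leave a dangling lock or trigger a harmful double release.
        lockService.releaseLock(lockKey);
    }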

View Issue Details


Id on Platform: 1553

Category: ⚡ Performance

File: .../service/WebhookDeduplicationService.java:71

Issue: The cleanup of the ConcurrentHashMap is performed synchronously inside the isDuplicateCommitAnalysis method. As the map grows or under high load, iterating through the entry set on every check can introduce latency in the webhook processing pipeline.

💡 Suggested Fix

Move the cleanup logic to a scheduled background task using @Scheduled to keep the hot path (checking duplicates) as fast as possible.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/service/WebhookDeduplicationService.java
@@ -3,6 +3,7 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.springframework.scheduling.annotation.Scheduled;
 import org.springframework.stereotype.Service;
 
@@ -67,17 +68,14 @@
         // Record this analysis
         recentCommitAnalyses.put(key, now);
         
-        // Cleanup old entries
-        cleanupOldEntries(now);
-        
         return false;
     }
     
     /**
      * Remove entries older than the dedup window to prevent memory growth.
      */
-    private void cleanupOldEntries(Instant now) {
+    @Scheduled(fixedDelay = 60000)
+    public void cleanupOldEntries() {
+        Instant now = Instant.now();
         recentCommitAnalyses.entrySet().removeIf(entry -> {
             long age = now.getEpochSecond() - entry.getValue().getEpochSecond();

View Issue Details


Id on Platform: 1554

Category: 🐛 Bug Risk

File: .../service/ProjectService.java:383

Issue: The code uses project.getWorkspace().getId() to verify ownership. If the Project entity is loaded via findByIdWithConnections, the 'workspace' association might be null if not eagerly fetched or if the join is not defined correctly in the repository, potentially causing a NullPointerException.

💡 Suggested Fix

Ensure the workspace is part of the fetch graph or use a repository method that joins the workspace to safely check the ID.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/service/ProjectService.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/service/ProjectService.java
@@ -383,7 +383,7 @@
                 .orElseThrow(() -> new NoSuchElementException("Project not found"));
         
         // Verify workspace ownership
-        if (!project.getWorkspace().getId().equals(workspaceId)) {
+        if (project.getWorkspace() == null || !project.getWorkspace().getId().equals(workspaceId)) {
             throw new NoSuchElementException("Project not found in workspace");
         }

View Issue Details


Id on Platform: 1555

Category: ✨ Best Practices

File: .../controller/ProjectController.java:597

Issue: The new field 'maxAnalysisTokenLimit' is an Integer which could be negative or excessively large. It is best practice to validate such numeric limits to prevent logic errors or resource exhaustion downstream.

💡 Suggested Fix

Add a validation annotation such as @Min(1) to the maxAnalysisTokenLimit field in the UpdateAnalysisSettingsRequest record.

--- a/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
+++ b/java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/project/controller/ProjectController.java
@@ -595,5 +595,6 @@
             Boolean branchAnalysisEnabled,
             String installationMethod,
+            @jakarta.validation.constraints.Min(0) 
             Integer maxAnalysisTokenLimit
     ) {}

View Issue Details


Id on Platform: 1557

Category: 🐛 Bug Risk

File: .../service/RagOperationsServiceImpl.java:813

Issue: The code uses 'indexedCommit.substring(0, 7)' and 'currentCommit.substring(0, 7)' without verifying that the commit hashes are at least 7 characters long. While commit hashes are usually 40 chars, short strings (e.g. 'none' or dev aliases) will cause a StringIndexOutOfBoundsException.

💡 Suggested Fix

Use Math.min to safely calculate the substring end index, similar to the logic previously removed in this diff.

--- a/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
+++ b/java-ecosystem/libs/rag-engine/src/main/java/org/rostilos/codecrow/ragengine/service/RagOperationsServiceImpl.java
@@ -812,2 +812,2 @@
-        log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}", 
-                targetBranch, indexedCommit.substring(0, 7), currentCommit.substring(0, 7), 
+        log.info("Incremental diff for branch '{}' ({} -> {}): bytes={}",
+                targetBranch, indexedCommit.substring(0, Math.min(7, indexedCommit.length())), currentCommit.substring(0, Math.min(7, currentCommit.length())),

View Issue Details


Id on Platform: 1558

Category: ✨ Best Practices

File: .../processor/WebhookAsyncProcessor.java

Issue: Circular reference risk. While using @Lazy with @Autowired on self-injection is a common workaround for Spring proxying issues, it can complicate the bean lifecycle and is often considered a code smell.

💡 Suggested Fix

Consider moving the @Transactional logic into a separate dedicated service (e.g., WebhookProcessingService) rather than using self-injection within the same class.

No suggested fix provided

View Issue Details


Id on Platform: 1560

Category: ✨ Best Practices

File: .../config/AsyncConfig.java

Issue: The manual implementation of a RejectedExecutionHandler that calls 'r.run()' is functionally equivalent to the standard 'ThreadPoolExecutor.CallerRunsPolicy'. Using the built-in policy is cleaner and less error-prone.

💡 Suggested Fix

Replace the manual lambda with the standard Spring/JDK CallerRunsPolicy.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/config/AsyncConfig.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/config/AsyncConfig.java
@@ -50,14 +50,7 @@
         executor.setThreadNamePrefix("webhook-");
         executor.setWaitForTasksToCompleteOnShutdown(true);
         executor.setAwaitTerminationSeconds(120);
-        // Log when tasks are rejected due to full queue
-        executor.setRejectedExecutionHandler((r, e) -> {
-            log.error("WEBHOOK EXECUTOR REJECTED TASK! Queue is full. Pool size: {}, Active: {}, Queue size: {}",
-                    e.getPoolSize(), e.getActiveCount(), e.getQueue().size());
-            // Try to run in caller thread as fallback
-            if (!e.isShutdown()) {
-                r.run();
-            }
-        });
+        executor.setRejectedExecutionHandler(new java.util.concurrent.ThreadPoolExecutor.CallerRunsPolicy());
         executor.initialize();

View Issue Details


Id on Platform: 1564

Category: ⚡ Performance

File: .../index_manager/indexer.py:82

Issue: The repository size estimation uses random.sample on the file list. If sampling runs on every index check and the filesystem is slow or the file list is massive, this adds overhead, and the guard against a sample size larger than the remaining file list only holds as long as the surrounding logic is unchanged. More importantly, the estimation fully loads and splits every sampled file, which is heavy.

💡 Suggested Fix

Consider estimating based on file metadata (size) rather than full content loading and splitting for a sample of 100 files, as loading/splitting is the most expensive part of the pipeline.

No suggested fix provided

View Issue Details


Id on Platform: 1568

Category: 🐛 Bug Risk

File: .../splitter/query_runner.py:212

Issue: The use of cursor.matches(tree.root_node) returns an iterator. Converting it to a list using list() can cause high memory consumption for very large files with thousands of AST nodes and matches.

💡 Suggested Fix

Iterate directly over the cursor matches instead of casting to a list to reduce memory overhead.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/query_runner.py
@@ -218,1 +218,1 @@
-            raw_matches = list(cursor.matches(tree.root_node))
+            raw_matches = cursor.matches(tree.root_node)

View Issue Details


Id on Platform: 1570

Category: 🐛 Bug Risk

File: .../models/scoring_config.py

Issue: The priority matching logic uses 'any(p in path_lower for p in self.high)', which can produce false positives from substring matches. For example, a file named 'base_component_test.py' matches both 'component' (MEDIUM) and 'test' (LOW). Because the tiers are checked in order (HIGH first), the method returns the first tier that matches, so a path containing keywords from multiple tiers may be classified unexpectedly depending on keyword order.

💡 Suggested Fix

Consider using regex boundaries or splitting the path into segments to ensure 'auth' doesn't match 'author_profile.jpg' and that 'test' takes precedence if it's a test file.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
@@ -102,10 +102,11 @@
         path_lower = file_path.lower()
-        
+        import re
+        def matches(patterns): return any(re.search(rf'\b{re.escape(p)}\b|[\\/_]{re.escape(p)}[\\/_.]', path_lower) for p in patterns)
         if any(p in path_lower for p in self.high):
             return ('HIGH', self.high_boost)

View Issue Details


Id on Platform: 1572

Category: ✨ Best Practices

File: .../tests/test_rag.py:63

Issue: The chunk_size was increased from 800 to 8000. Most common embedding models (like OpenAI's text-embedding-ada-002 or various HuggingFace models) have a token limit that translates roughly to 1500-3000 characters. 8000 characters will likely lead to truncation or loss of semantic granularity.

💡 Suggested Fix

Lower the default chunk size to a range compatible with standard embedding models (e.g., 2000-4000) or ensure the downstream components can handle large chunks without significant performance degradation.

--- a/python-ecosystem/rag-pipeline/tests/test_rag.py
+++ b/python-ecosystem/rag-pipeline/tests/test_rag.py
@@ -62,1 +62,1 @@
-    assert config.chunk_size == 8000  # Increased to fit most semantic units
+    assert config.chunk_size == 2000  # Increased to fit most semantic units while remaining within embedding limits

View Issue Details


Id on Platform: 1573

Category: ✨ Best Practices

File: .../core/index_manager.py:64

Issue: The new initialization of ASTCodeSplitter includes an enrich_embedding_text argument. However, as seen in the codebase context for src/rag_pipeline/core/ast_splitter.py, the __init__ method only accepts max_chunk_size, min_chunk_size, chunk_overlap, and parser_threshold. Passing an unknown keyword argument will raise a TypeError at runtime.

💡 Suggested Fix

Remove the unsupported enrich_embedding_text parameter from the constructor call to prevent a TypeError.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager.py
@@ -63,5 +63,4 @@
             chunk_overlap=config.chunk_overlap,
-            parser_threshold=3,  # Low threshold - AST benefits even small files
-            enrich_embedding_text=True  # Prepend semantic context for better embeddings
+            parser_threshold=3  # Low threshold - AST benefits even small files
         )

View Issue Details


🔵 Low Severity Issues

Id on Platform: 1551

Category: ✨ Best Practices

File: .../project/Project.java:230

Issue: Creating a 'new ProjectConfig()' every time getEffectiveConfig() is called on a null configuration might be slightly inefficient if called in tight loops, though it provides safety.

💡 Suggested Fix

Consider initializing the 'configuration' field with a default value at the declaration site or in the constructor to avoid repeated allocations and null checks.

--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/project/Project.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/project/Project.java
@@ -229,1 +229,12 @@
-    public org.rostilos.codecrow.core.model.project.config.ProjectConfig getEffectiveConfig() {
-        return configuration != null ? configuration : new org.rostilos.codecrow.core.model.project.config.ProjectConfig();
-    }
+    public org.rostilos.codecrow.core.model.project.config.ProjectConfig getEffectiveConfig() {
+        if (configuration == null) {
+            configuration = new org.rostilos.codecrow.core.model.project.config.ProjectConfig();
+        }
+        return configuration;
+    }

View Issue Details


Id on Platform: 1556

Category: ✨ Best Practices

File: .../service/JobService.java:228

Issue: Repeated use of 'jobRepository.findById(job.getId()).orElse(job)' across multiple methods (startJob, completeJob, cancelJob, skipJob) leads to code duplication.

💡 Suggested Fix

Extract the re-fetching logic into a private helper method to improve maintainability.

--- a/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/JobService.java
+++ b/java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/JobService.java
@@ -224,3 +224,3 @@
-        // Re-fetch the job in case it was passed from a different transaction context
-        job = jobRepository.findById(job.getId()).orElse(job);
+        job = ensureManaged(job);
@@ -320,0 +320,4 @@
+    private Job ensureManaged(Job job) {
+        return jobRepository.findById(job.getId()).orElse(job);
+    }
+

View Issue Details


Id on Platform: 1559

Category: 🛡️ Error Handling

File: .../processor/WebhookAsyncProcessor.java

Issue: Potential for duplicate job failure state updates. If self.processWebhookInTransaction fails and has already attempted to fail the job within its own try-catch or transaction, the outer catch block attempts to fail it again. While likely idempotent in the DB, it causes unnecessary error logs.

💡 Suggested Fix

Add a check to see if the job is already in a terminal state before calling failJob in the outer catch block.

--- a/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/WebhookAsyncProcessor.java
+++ b/java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/generic/processor/WebhookAsyncProcessor.java
@@ -73,3 +73,3 @@
-            try {
-                jobService.failJob(job, "Async processing failed: " + e.getMessage());
-            } catch (Exception failError) {
+            try {
+                if (!job.isTerminal()) {
+                    jobService.failJob(job, "Async processing failed: " + e.getMessage());
+                }
+            } catch (Exception failError) {

View Issue Details


Id on Platform: 1563

Category: 🔒 Security

File: .../llm/llm_factory.py:143

Issue: The 'ai_api_key' itself is not logged here; only 'provider', 'model', and 'temperature' are. While these are not secrets, ensure that the 'ai_api_key' variable (which is available in the scope) is never accidentally added to this log line in future iterations, to prevent secret leakage in logs.

💡 Suggested Fix

Maintain the current logging but ensure PII or credentials never enter the log stream.

No suggested fix provided

View Issue Details


Id on Platform: 1565

Category: 🛡️ Error Handling

File: .../index_manager/indexer.py:247

Issue: In the finally block of index_repository, existing_other_branch_points is deleted, but if the variable was never initialized due to an early exception before the assignment, an UnboundLocalError could occur.

💡 Suggested Fix

Initialize existing_other_branch_points to an empty list at the start of the method.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/indexer.py
@@ -124,11 +124,11 @@
         temp_collection_name = self.collection_manager.create_versioned_collection(alias_name)
 
         # Check existing collection and preserve other branch data
+        existing_other_branch_points = []
         old_collection_exists = self.collection_manager.alias_exists(alias_name)
         if not old_collection_exists:
             old_collection_exists = self.collection_manager.collection_exists(alias_name)
         
-        existing_other_branch_points = []
         if old_collection_exists:
             actual_collection = self.collection_manager.resolve_alias(alias_name) or alias_name
             existing_other_branch_points = self.branch_manager.preserve_other_branch_points(

View Issue Details


Id on Platform: 1567

Category: 🧹 Code Quality

File: .../splitter/metadata.py

Issue: The extract_inheritance method for Python uses a regex that assumes simple parentheticals for inheritance, but Python allows complex expressions, keyword arguments (metaclasses), and trailing commas which might break this simplified split logic.

💡 Suggested Fix

Use a more robust regex or manual parsing for Python inheritance to handle potential trailing commas and keyword arguments in class definitions.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
@@ -226,1 +226,1 @@
-                result['extends'] = [e.strip() for e in extends.split(',') if e.strip()]
+                result['extends'] = [e.strip() for e in extends.split(',') if e.strip() and '=' not in e]

View Issue Details


Id on Platform: 1569

Category: ✨ Best Practices

File: .../splitter/tree_parser.py:109

Issue: The parse method converts the entire source string to bytes using UTF-8 for every call. For very large files, repeated conversion can be inefficient.

💡 Suggested Fix

If the caller already has bytes, allow passing bytes directly, or cache the byte-encoded version of the source if multiple operations are performed.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/tree_parser.py
@@ -107,2 +107,4 @@
-            parser = Parser(language)
-            tree = parser.parse(bytes(source_code, "utf8"))
+            parser = Parser(language)
+            source_bytes = source_code if isinstance(source_code, bytes) else source_code.encode("utf8")
+            tree = parser.parse(source_bytes)
             return tree

View Issue Details


Id on Platform: 1571

Category: 🐛 Bug Risk

File: .../services/query_service.py:112

Issue: The chunk_id generation uses hash(text[:100]). Using only the first 100 characters for identity might lead to collisions for files that share common headers, license blocks, or imports, causing the deduplication logic to incorrectly skip unique chunks.

💡 Suggested Fix

Increase the prefix length for the hash or use a hash of the full text to ensure uniqueness of the chunk identity.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/services/query_service.py
@@ -111,3 +111,3 @@
             # Create chunk identity (path + start of content)
-            chunk_id = f"{path}:{branch}:{hash(result.get('text', '')[:100])}"
+            chunk_id = f"{path}:{branch}:{hash(result.get('text', ''))}"

View Issue Details


ℹ️ Informational Notes

Id on Platform: 1566

Category: 🧹 Code Quality

File: .../index_manager/stats_manager.py:106

Issue: The parameter alias_checker is defined in the method signature but not utilized within the loop or the logic that determines which collections to process.

💡 Suggested Fix

Either use the alias_checker to skip aliased collections or remove the unused parameter from the method signature if it's not needed for the current logic.

--- a/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
+++ b/python-ecosystem/rag-pipeline/src/rag_pipeline/core/index_manager/stats_manager.py
@@ -106,1 +106,1 @@
-    def list_all_indices(self, alias_checker) -> List[IndexStats]:
+    def list_all_indices(self) -> List[IndexStats]:

View Issue Details


Files Affected

  • .../processor/WebhookAsyncProcessor.java: 2 issues
  • .../index_manager/indexer.py: 2 issues
  • .../service/AIConnectionService.java: 2 issues
  • .../splitter/metadata.py: 1 issue
  • .../analysis/PullRequestAnalysisProcessor.java: 1 issue
  • .../core/index_manager.py: 1 issue
  • .../splitter/tree_parser.py: 1 issue
  • .../index_manager/stats_manager.py: 1 issue
  • .../service/WebhookDeduplicationService.java: 1 issue
  • .../service/ProjectService.java: 1 issue
