Adopt Turso-inspired performance patterns: batch inserts, lazy schema, and LRU cache#489
Conversation
- Implement batch insertion for symbols, chunks, and embeddings in CodebaseDb using transactions. - Add lazy schema initialization in CodebaseDb and ConversationsDb using tokio::sync::OnceCell. - Refactor ConnectionManager to use moka::future::Cache for LRU connection pooling and libsql for async database connections. - Offload heavy CPU-bound tasks (embedding serialization and FTS population) to tokio::task::spawn_blocking. - Add unit tests for batch inserts and connection management.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
|
Warning Review limit reached
More reviews will be available in 51 minutes and 19 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request migrates the ConnectionManager from r2d2 and DashMap to moka::future::Cache and libsql::Connection, making connection management fully asynchronous. It also introduces lazy schema initialization using OnceCell in both ConversationsDb and CodebaseDb, ensuring schemas are created on-demand before any database operations. Additionally, CodebaseDb is enhanced with batch insertion capabilities for symbols, chunks, and embeddings, along with FTS index population. Feedback on the changes highlights several performance and correctness improvements in CodebaseDb: using INSERT OR REPLACE instead of INSERT for batch embedding inserts to ensure idempotency, avoiding unnecessary string cloning in symbol and chunk batch inserts by using .as_str() and .as_deref(), and avoiding the anti-pattern of using spawn_blocking combined with block_on for the asynchronous execute_batch call in populate_fts.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for (id, blob) in embeddings_with_blobs { | ||
| tx.execute( | ||
| "INSERT INTO code_embeddings (id, embedding) VALUES (?1, ?2)", | ||
| params![id, blob], | ||
| ).await?; | ||
| } |
There was a problem hiding this comment.
Using INSERT instead of INSERT OR REPLACE for batch inserting embeddings will cause unique constraint violations and abort the transaction if any chunk's embedding is updated or re-inserted. Using INSERT OR REPLACE ensures idempotency.
| for (id, blob) in embeddings_with_blobs { | |
| tx.execute( | |
| "INSERT INTO code_embeddings (id, embedding) VALUES (?1, ?2)", | |
| params![id, blob], | |
| ).await?; | |
| } | |
| for (id, blob) in embeddings_with_blobs { | |
| tx.execute( | |
| "INSERT OR REPLACE INTO code_embeddings (id, embedding) VALUES (?1, ?2)", | |
| params![id, blob], | |
| ).await?; | |
| } |
| for s in symbols { | ||
| tx.execute( | ||
| "INSERT OR REPLACE INTO symbols | ||
| (id, name, kind, file_path, line_start, line_end, signature, visibility, doc_comment, language, module_path, complexity) | ||
| VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12)", | ||
| params![ | ||
| s.id.clone(), s.name.clone(), s.kind.clone(), s.file_path.clone(), | ||
| s.line_start, s.line_end, s.signature.clone(), s.visibility.clone(), | ||
| s.doc_comment.clone(), s.language.clone(), s.module_path.clone(), s.complexity | ||
| ], | ||
| ).await?; | ||
| } |
There was a problem hiding this comment.
Cloning all strings (s.id.clone(), s.name.clone(), etc.) inside the loop for every symbol in a batch is inefficient and allocates memory unnecessarily. Since libsql's params! macro accepts references, you can use .as_str() and .as_deref() to avoid these allocations.
| for s in symbols { | |
| tx.execute( | |
| "INSERT OR REPLACE INTO symbols | |
| (id, name, kind, file_path, line_start, line_end, signature, visibility, doc_comment, language, module_path, complexity) | |
| VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12)", | |
| params![ | |
| s.id.clone(), s.name.clone(), s.kind.clone(), s.file_path.clone(), | |
| s.line_start, s.line_end, s.signature.clone(), s.visibility.clone(), | |
| s.doc_comment.clone(), s.language.clone(), s.module_path.clone(), s.complexity | |
| ], | |
| ).await?; | |
| } | |
| for s in symbols { | |
| tx.execute( | |
| "INSERT OR REPLACE INTO symbols | |
| (id, name, kind, file_path, line_start, line_end, signature, visibility, doc_comment, language, module_path, complexity) | |
| VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12)", | |
| params![ | |
| s.id.as_str(), s.name.as_str(), s.kind.as_str(), s.file_path.as_str(), | |
| s.line_start, s.line_end, s.signature.as_deref(), s.visibility.as_deref(), | |
| s.doc_comment.as_deref(), s.language.as_str(), s.module_path.as_deref(), s.complexity | |
| ], | |
| ).await?; | |
| } |
| for c in chunks { | ||
| tx.execute( | ||
| "INSERT OR REPLACE INTO code_chunks (id, path, content, language, symbol_id, tokens) | ||
| VALUES (?1, ?2, ?3, ?4, ?5, ?6)", | ||
| params![c.id.clone(), c.path.clone(), c.content.clone(), c.language.clone(), c.symbol_id.clone(), c.tokens], | ||
| ).await?; | ||
| } |
There was a problem hiding this comment.
Cloning c.content and other strings inside the loop for every chunk in a batch is highly inefficient, especially since code chunks can contain very large file contents. Passing references using .as_str() and .as_deref() avoids these expensive allocations.
| for c in chunks { | |
| tx.execute( | |
| "INSERT OR REPLACE INTO code_chunks (id, path, content, language, symbol_id, tokens) | |
| VALUES (?1, ?2, ?3, ?4, ?5, ?6)", | |
| params![c.id.clone(), c.path.clone(), c.content.clone(), c.language.clone(), c.symbol_id.clone(), c.tokens], | |
| ).await?; | |
| } | |
| for c in chunks { | |
| tx.execute( | |
| "INSERT OR REPLACE INTO code_chunks (id, path, content, language, symbol_id, tokens) | |
| VALUES (?1, ?2, ?3, ?4, ?5, ?6)", | |
| params![c.id.as_str(), c.path.as_str(), c.content.as_str(), c.language.as_deref(), c.symbol_id.as_deref(), c.tokens], | |
| ).await?; | |
| } |
| let conn = self.conn.clone(); | ||
| tokio::task::spawn_blocking(move || { | ||
| let rt = tokio::runtime::Handle::current(); | ||
| rt.block_on(async { | ||
| conn.execute_batch(&populate_fts_from_chunks_sql()) | ||
| .await | ||
| .context("failed to populate FTS index")?; | ||
| Ok::<(), anyhow::Error>(()) | ||
| }) | ||
| }).await.context("spawn_blocking for FTS population failed")??; | ||
| Ok(()) |
There was a problem hiding this comment.
Spawning a blocking task with spawn_blocking only to immediately block on an asynchronous operation (conn.execute_batch) using block_on is an anti-pattern. It wastes a thread pool thread and adds unnecessary overhead. Since execute_batch is already asynchronous, you can directly .await it in the main async context.
self.conn.execute_batch(&populate_fts_from_chunks_sql())
.await
.context("failed to populate FTS index")?;
Ok(())- Resolved conflicts in connection_manager.rs, conversations_db.rs, db.rs - Preserved PR #489's batch inserts (insert_symbols_batch, insert_chunks_batch, insert_embeddings_batch) - Preserved lazy schema initialization via OnceCell - Adapted all methods to use ConnectionManager::with_conn (r2d2) instead of libsql async - Updated tests to use ConnectionManager API
This change adopts several performance patterns inspired by Turso to improve Xavier's scalability and responsiveness.
Key improvements:
insert_symbols_batch,insert_chunks_batch, andinsert_embeddings_batchtoCodebaseDb. These methods use libSQL transactions to drastically reduce I/O overhead during initial repository indexing.OnceCell, preventing unnecessary table creation for inactive projects.ConnectionManagernow usesmoka::future::Cacheto manage active database connections with a capacity limit and idle timeout, replacing the previous manual eviction logic.libsqlfor asynchronous operations, and heavy CPU-bound tasks are offloaded tospawn_blockingto avoid blocking the Tokio executor.Verified with unit tests in
src/codebase/db.rs,src/codebase/conversations_db.rs, andsrc/codebase/connection_manager.rs.Fixes #427
PR created automatically by Jules for task 18222215058586446253 started by @iberi22