Skip to content

Optimize app and implement vector db#2

Open
hawkh wants to merge 1 commit intomainfrom
cursor/optimize-app-and-implement-vector-db-8289
Open

Optimize app and implement vector db#2
hawkh wants to merge 1 commit intomainfrom
cursor/optimize-app-and-implement-vector-db-8289

Conversation

@hawkh
Copy link
Copy Markdown
Owner

@hawkh hawkh commented Aug 14, 2025

Add ChromaDB as a persistent, configurable vector store with incremental indexing, making FAISS an optional fallback.


Open in Cursor Open in Web
Open with Devin

Co-authored-by: kommi.avanthi <kommi.avanthi@gmail.com>
@hawkh hawkh self-assigned this Aug 14, 2025
Repository owner deleted a comment from cursor bot Aug 14, 2025
Copy link
Copy Markdown
Owner Author

@hawkh hawkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@hawkh hawkh marked this pull request as ready for review August 14, 2025 16:05
Copy link
Copy Markdown
Owner Author

@hawkh hawkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View issues and 4 additional flags in Devin Review.

Open in Devin Review

python-dotenv==1.0.0
faiss-cpu==1.7.4
langchain-google-genai==0.0.6
chromadb==0.4.22
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing faiss-cpu dependency causes runtime error when VECTOR_DB is not CHROMA

The faiss-cpu package was removed from requirements.txt (line 6 was deleted), but the code in main.py, chat_with_pdf.py, and vectorstore_utils.py still supports FAISS as a fallback when VECTOR_DB environment variable is not set to "CHROMA".

When a user sets VECTOR_DB=FAISS (or any value other than "CHROMA"), the code at main.py:57 will execute from langchain_community.vectorstores import FAISS, which will fail with ModuleNotFoundError: No module named 'faiss' because the dependency is no longer installed.

The default value for VECTOR_DB is "CHROMA" so this won't affect default usage, but the code explicitly supports FAISS as an option (as mentioned in the About page: "Efficient similarity search using Chroma (persistent) or FAISS (local)") without providing the necessary dependency.

Recommendation: Either add faiss-cpu==1.7.4 back to requirements.txt, or make it an optional dependency with clear documentation, or remove FAISS support from the code entirely.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@@ -22,17 +29,50 @@ def get_pdf_text(pdf_docs):
text += page.extract_text()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 TypeError when PDF page returns None from extract_text()

In main.py:29, the code does text += page.extract_text() without handling the case where extract_text() returns None. According to PyPDF2 documentation, extract_text() can return None for pages that have no extractable text (e.g., scanned images without OCR, or corrupted pages).

When processing such a PDF, this will cause:

TypeError: can only concatenate str (not "NoneType") to str

Notably, chat_with_pdf.py:42 handles this correctly with text += page.extract_text() or "", but main.py does not have this fix.

Recommendation: Change text += page.extract_text() to text += page.extract_text() or "" to handle None values, consistent with the implementation in chat_with_pdf.py.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +49 to +55
vs = Chroma.from_texts(
texts=text_chunks,
embedding=embeddings,
collection_name="pdf_chunks",
persist_directory=PERSIST_DIR,
)
vs.persist()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Chroma.from_texts overwrites existing data instead of implementing incremental indexing

The PR description claims to implement "incremental indexing", and the FAISS branch correctly implements this by loading the existing index, adding new texts, and saving. However, the Chroma branch at main.py:49-55 uses Chroma.from_texts() which creates a new collection, overwriting any existing data with the same collection name.

When a user:

  1. Uploads and processes PDF A (data stored in Chroma)
  2. Uploads and processes PDF B (Chroma.from_texts is called again)

The data from PDF A is lost because from_texts() creates a fresh collection. This is inconsistent with the FAISS behavior and the stated goal of incremental indexing.

The same issue exists in chat_with_pdf.py:73-79 and vectorstore_utils.py:22-29.

Recommendation: For incremental indexing with Chroma, load the existing collection first using Chroma(collection_name=..., persist_directory=..., embedding_function=...), then call add_texts() to add new documents, similar to the FAISS implementation.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants