Optimize app and implement vector db by hawkh · Pull Request #2 · hawkh/chatwithpdf

hawkh · 2025-08-14T15:49:01Z

Add ChromaDB as a persistent, configurable vector store with incremental indexing, making FAISS an optional fallback.

Co-authored-by: kommi.avanthi <kommi.avanthi@gmail.com>

hawkh

done

hawkh

ok

devin-ai-integration

Devin Review found 3 potential issues.

View issues and 4 additional flags in Devin Review.

devin-ai-integration · 2026-01-21T18:10:13Z

requirements.txt

 python-dotenv==1.0.0
-faiss-cpu==1.7.4
 langchain-google-genai==0.0.6
+chromadb==0.4.22


🔴 Missing faiss-cpu dependency causes runtime error when VECTOR_DB is not CHROMA

The faiss-cpu package was removed from requirements.txt (line 6 was deleted), but the code in main.py, chat_with_pdf.py, and vectorstore_utils.py still supports FAISS as a fallback when VECTOR_DB environment variable is not set to "CHROMA".

When a user sets VECTOR_DB=FAISS (or any value other than "CHROMA"), the code at main.py:57 will execute from langchain_community.vectorstores import FAISS, which will fail with ModuleNotFoundError: No module named 'faiss' because the dependency is no longer installed.

The default value for VECTOR_DB is "CHROMA" so this won't affect default usage, but the code explicitly supports FAISS as an option (as mentioned in the About page: "Efficient similarity search using Chroma (persistent) or FAISS (local)") without providing the necessary dependency.

Recommendation: Either add faiss-cpu==1.7.4 back to requirements.txt, or make it an optional dependency with clear documentation, or remove FAISS support from the code entirely.
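If FAISS is kept as an optional dependency, one way to surface the problem early is to check for the package when the backend is selected. The following is a minimal sketch, not the repository's code: `resolve_vector_backend` and its `has_module` parameter are hypothetical names, and normalizing the value with `strip().upper()` is an assumption about how `VECTOR_DB` should be compared.

```python
import importlib.util
import os


def resolve_vector_backend(env=None, has_module=None):
    """Return "CHROMA" or "FAISS", failing early if FAISS is not installed.

    Hypothetical helper: `env` and `has_module` are injectable so the
    check can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    if has_module is None:
        # find_spec returns None when a top-level module cannot be found.
        has_module = lambda name: importlib.util.find_spec(name) is not None

    # Default mirrors the review: VECTOR_DB falls back to "CHROMA".
    backend = env.get("VECTOR_DB", "CHROMA").strip().upper()
    if backend == "CHROMA":
        return "CHROMA"

    # Any other value selects the FAISS fallback, which needs the faiss package.
    if not has_module("faiss"):
        raise RuntimeError(
            "VECTOR_DB selects FAISS but the faiss package is not installed; "
            "install faiss-cpu or set VECTOR_DB=CHROMA."
        )
    return "FAISS"
```

Calling this once at startup would turn a late import failure into a single, readable configuration error.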

devin-ai-integration · 2026-01-21T18:10:14Z

main.py

@@ -22,17 +29,50 @@ def get_pdf_text(pdf_docs):
            text += page.extract_text()


🔴 TypeError when PDF page returns None from extract_text()

In main.py:29, the code does text += page.extract_text() without handling the case where extract_text() returns None. According to PyPDF2 documentation, extract_text() can return None for pages that have no extractable text (e.g., scanned images without OCR, or corrupted pages).

When processing such a PDF, this will cause:

TypeError: can only concatenate str (not "NoneType") to str

Notably, chat_with_pdf.py:42 handles this correctly with text += page.extract_text() or "", but main.py does not have this fix.

Recommendation: Change text += page.extract_text() to text += page.extract_text() or "" to handle None values, consistent with the implementation in chat_with_pdf.py.

devin-ai-integration · 2026-01-21T18:10:15Z

main.py

+        vs = Chroma.from_texts(
+            texts=text_chunks,
+            embedding=embeddings,
+            collection_name="pdf_chunks",
+            persist_directory=PERSIST_DIR,
+        )
+        vs.persist()


🔴 Chroma.from_texts overwrites existing data instead of implementing incremental indexing

The PR description claims to implement "incremental indexing", and the FAISS branch correctly implements this by loading the existing index, adding new texts, and saving. However, the Chroma branch at main.py:49-55 uses Chroma.from_texts() which creates a new collection, overwriting any existing data with the same collection name.

When a user:

1. Uploads and processes PDF A (data stored in Chroma)
2. Uploads and processes PDF B (Chroma.from_texts is called again)

The data from PDF A is lost because from_texts() creates a fresh collection. This is inconsistent with the FAISS behavior and the stated goal of incremental indexing.

The same issue exists in chat_with_pdf.py:73-79 and vectorstore_utils.py:22-29.

Recommendation: For incremental indexing with Chroma, load the existing collection first using Chroma(collection_name=..., persist_directory=..., embedding_function=...), then call add_texts() to add new documents, similar to the FAISS implementation.
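The load-then-add pattern in the recommendation could look roughly like the sketch below. It is written against any store object that exposes `add_texts(texts=..., ids=...)`, so it runs without Chroma installed; `add_chunks_incrementally`, `chunk_id`, and `InMemoryStore` are hypothetical names, and deriving ids from a content hash is an added assumption intended to keep re-uploaded chunks from being stored twice.

```python
import hashlib


def chunk_id(text):
    """Derive a stable id from chunk content (assumption: content-addressed ids)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def add_chunks_incrementally(store, text_chunks):
    """Add chunks to an already-loaded store and return the ids that were sent."""
    unique = {}
    for chunk in text_chunks:
        # Drop duplicates within this batch while preserving first-seen order.
        unique.setdefault(chunk_id(chunk), chunk)
    if unique:
        store.add_texts(texts=list(unique.values()), ids=list(unique.keys()))
    return list(unique.keys())


class InMemoryStore:
    """Hypothetical stand-in for a vector store; keeps texts keyed by id."""

    def __init__(self):
        self.docs = {}

    def add_texts(self, texts, ids=None):
        for doc_id, text in zip(ids, texts):
            self.docs[doc_id] = text  # same id overwrites, so no duplicates


store = InMemoryStore()
add_chunks_incrementally(store, ["alpha", "beta"])            # PDF A
add_chunks_incrementally(store, ["beta", "gamma", "gamma"])   # PDF B
```

With a real Chroma store, `store` would be the object built from the existing collection name and persist directory, as the recommendation describes, rather than the stand-in used here.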

Add configurable vector store backend with Chroma and FAISS support

d211086

Co-authored-by: kommi.avanthi <kommi.avanthi@gmail.com>

hawkh self-assigned this Aug 14, 2025

Repository owner deleted a comment from cursor bot Aug 14, 2025

hawkh commented Aug 14, 2025

hawkh marked this pull request as ready for review August 14, 2025 16:05

hawkh commented Aug 20, 2025

devin-ai-integration bot reviewed Jan 21, 2026

hawkh wants to merge 1 commit into main from cursor/optimize-app-and-implement-vector-db-8289