Co-authored-by: kommi.avanthi <kommi.avanthi@gmail.com>
```diff
 python-dotenv==1.0.0
-faiss-cpu==1.7.4
 langchain-google-genai==0.0.6
 chromadb==0.4.22
```
🔴 Missing faiss-cpu dependency causes runtime error when VECTOR_DB is not CHROMA
The `faiss-cpu` package was removed from `requirements.txt` (line 6 was deleted), but the code in `main.py`, `chat_with_pdf.py`, and `vectorstore_utils.py` still supports FAISS as a fallback when the `VECTOR_DB` environment variable is not set to `"CHROMA"`.
When a user sets `VECTOR_DB=FAISS` (or any value other than `"CHROMA"`), the code at `main.py:57` will execute `from langchain_community.vectorstores import FAISS`, which will fail with `ModuleNotFoundError: No module named 'faiss'` because the dependency is no longer installed.
The default value of `VECTOR_DB` is `"CHROMA"`, so default usage is unaffected, but the code explicitly advertises FAISS as an option (the About page says: "Efficient similarity search using Chroma (persistent) or FAISS (local)") without providing the necessary dependency.
Recommendation: either add `faiss-cpu==1.7.4` back to `requirements.txt`, make it an optional dependency with clear documentation, or remove FAISS support from the code entirely.
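One way to implement the optional-dependency option is a guarded, lazy import that fails with an actionable message instead of a raw `ModuleNotFoundError`. This is a sketch, not code from this PR; the function name and message wording are assumptions:

```python
def load_faiss_vectorstore_class():
    """Import the FAISS vector store lazily so Chroma-only installs still work."""
    try:
        # Deferred import: only attempted when VECTOR_DB actually requests FAISS.
        from langchain_community.vectorstores import FAISS
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            "VECTOR_DB requested FAISS, but the optional dependency is missing. "
            "Install it with 'pip install faiss-cpu' or set VECTOR_DB=CHROMA."
        ) from exc
    return FAISS
```

Because the import is deferred into the function body, users who only ever use Chroma never pay for (or need) `faiss-cpu`.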
```diff
@@ -22,17 +29,50 @@ def get_pdf_text(pdf_docs):
             text += page.extract_text()
```
🔴 TypeError when PDF page returns None from extract_text()
In `main.py:29`, the code does `text += page.extract_text()` without handling the case where `extract_text()` returns `None`. According to the PyPDF2 documentation, `extract_text()` can return `None` for pages that have no extractable text (e.g., scanned images without OCR, or corrupted pages).
Processing such a PDF will raise:
`TypeError: can only concatenate str (not "NoneType") to str`
Notably, `chat_with_pdf.py:42` handles this correctly with `text += page.extract_text() or ""`, but `main.py` does not.
Recommendation: change `text += page.extract_text()` to `text += page.extract_text() or ""` so that `None` is treated as an empty string, consistent with the implementation in `chat_with_pdf.py`.
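The fix can be exercised without a real PDF by stubbing a page object. This is a sketch; `FakePage` is a test double and not part of the PR:

```python
def get_pdf_text(pages):
    """Concatenate page text, treating a None return from extract_text() as ""."""
    text = ""
    for page in pages:
        text += page.extract_text() or ""  # None falls back to the empty string
    return text

class FakePage:
    """Minimal stand-in for a PyPDF2 page; only extract_text() is needed here."""
    def __init__(self, content):
        self._content = content
    def extract_text(self):
        return self._content

# A page whose extract_text() yields None no longer raises TypeError:
print(get_pdf_text([FakePage("hello "), FakePage(None), FakePage("world")]))  # hello world
```

The `or ""` pattern is safe here because the only falsy values `extract_text()` plausibly returns (`None` or `""`) both legitimately contribute nothing to the output.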
```python
vs = Chroma.from_texts(
    texts=text_chunks,
    embedding=embeddings,
    collection_name="pdf_chunks",
    persist_directory=PERSIST_DIR,
)
vs.persist()
```
🔴 Chroma.from_texts overwrites existing data instead of implementing incremental indexing
The PR description claims to implement "incremental indexing", and the FAISS branch does so correctly by loading the existing index, adding the new texts, and saving. However, the Chroma branch at `main.py:49-55` uses `Chroma.from_texts()`, which creates a new collection, overwriting any existing data under the same collection name.
When a user:
- uploads and processes PDF A (data stored in Chroma), then
- uploads and processes PDF B (`Chroma.from_texts` is called again),

the data from PDF A is lost, because `from_texts()` creates a fresh collection. This is inconsistent with both the FAISS behavior and the stated goal of incremental indexing.
The same issue exists in `chat_with_pdf.py:73-79` and `vectorstore_utils.py:22-29`.
Recommendation: for incremental indexing with Chroma, load the existing collection first using `Chroma(collection_name=..., persist_directory=..., embedding_function=...)`, then call `add_texts()` to append the new documents, mirroring the FAISS implementation.
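The recommended pattern could look roughly like the sketch below, written against the `langchain_community` Chroma wrapper. The function name and parameters are assumptions, and the import is deferred into the function so the snippet does not require `chromadb` just to be defined:

```python
def index_chunks_incrementally(text_chunks, embeddings, persist_dir):
    """Append new chunks to an existing Chroma collection instead of recreating it."""
    # Deferred import: only needed when indexing actually runs.
    from langchain_community.vectorstores import Chroma

    # Opening the store this way loads the existing collection if one is
    # already persisted under persist_dir, rather than starting fresh.
    vs = Chroma(
        collection_name="pdf_chunks",
        persist_directory=persist_dir,
        embedding_function=embeddings,
    )
    vs.add_texts(text_chunks)  # adds to the collection; existing vectors are kept
    return vs
```

Note that `add_texts()` appends rather than replaces, which is what makes this incremental; with recent `chromadb` versions the data is persisted automatically, so the explicit `vs.persist()` call from the diff may only be needed on older versions.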
Add ChromaDB as a persistent, configurable vector store with incremental indexing, making FAISS an optional fallback.