Skip to content

Auto-build FAISS index on startup when absent#1

Draft
Copilot wants to merge 2 commits into
masterfrom
copilot/automatically-fix-deployment-failure
Draft

Auto-build FAISS index on startup when absent#1
Copilot wants to merge 2 commits into
masterfrom
copilot/automatically-fix-deployment-failure

Conversation

Copilot AI commented Jun 2, 2026

Copy link
Copy Markdown

Fresh deployments (e.g. Streamlit Cloud) only have committed repo files; data/faiss_index/ is gitignored, so the app immediately hit st.error("Index not found…") / st.stop() on every cold start.

Changes

ingest.py

  • Extracted ingestion logic into build_index(data_dir, index_dir) — importable by the app, CLI usage (python ingest.py) unchanged.
  • Paths now resolved relative to __file__ rather than CWD.

app/main.py

  • Dropped the hard st.error / st.stop() guard.
  • Paths resolved via Path(__file__).parent.parent — correct regardless of Streamlit's working directory.
  • Repo root injected into sys.path so ingest is importable from the app/ subdirectory.
  • load_vectorstore() now auto-calls build_index inside a spinner when the index is missing; surfaces a actionable error message if it fails.
@st.cache_resource(show_spinner="Loading knowledge base...")
def load_vectorstore():
    if not INDEX_DIR.exists():
        with st.spinner("Knowledge base not found -- building it now from source PDFs..."):
            try:
                from ingest import build_index
                build_index(data_dir=DATA_DIR, index_dir=INDEX_DIR)
            except Exception as exc:
                st.error(f"Failed to build the knowledge base automatically: {exc}\n\n"
                         "Please run `python ingest.py` from the repository root and redeploy.")
                st.stop()
    emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return FAISS.load_local(str(INDEX_DIR), emb, allow_dangerous_deserialization=True)

README.md (new)

  • Documents self-healing startup, local setup, and Streamlit Cloud deployment.
Original prompt

Automatically fix the Streamlit deployment failure in riiseup08/RAG_Streamlit where the app shows Index not found. Run python ingest.py first. on startup.

Context:

  • The app in app/main.py currently expects data/faiss_index to already exist and stops with an error if it does not.
  • The repository includes the source PDFs in data/ and an ingest.py script that can build the FAISS index from those PDFs.
  • In fresh deployment environments such as Streamlit Cloud, generated artifacts like data/faiss_index are not present unless committed, so startup fails.

Goal:

  • Modify the app so it works automatically in a fresh environment without requiring a separate manual python ingest.py step.

Requirements:

  1. Update the app startup flow so that if data/faiss_index is missing, the app builds the index automatically from the existing PDFs instead of stopping with an error.
  2. Reuse or factor shared logic appropriately rather than duplicating complex ingestion logic if possible.
  3. Preserve current behavior when the index already exists: it should continue to load the existing FAISS index.
  4. Keep the app user-friendly:
    • show a clear spinner/status message while the knowledge base is being built
    • surface a useful error if automatic index creation fails
  5. Ensure paths work correctly from the repository layout in deployment.
  6. Update documentation if needed so setup/deployment behavior is clear.
  7. If tests are feasible in this repo, add or adjust them; otherwise keep the change minimal and focused.

Deliverable:

  • Open a pull request with the code changes needed to make the app self-heal by generating the FAISS index automatically when absent.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Index not found. Run python ingest.py first." msg that i get from streAMLIT

Assistant: The direct cause is: your Streamlit app is looking for a local folder at data/faiss_index, and that folder does not exist when the app starts.

In this repo:

INDEX_DIR = Path("data/faiss_index")

@st.cache_resource(show_spinner="Loading knowledge base...")
def load_vectorstore():
    emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return FAISS.load_local(str(INDEX_DIR), emb, allow_dangerous_deserialization=True)

if not INDEX_DIR.exists():
    st.error("Index not found. Run python ingest.py first.")
    st.stop()

And ingest.py is what creates that directory:

DATA_DIR = Path("data")
INDEX_DIR = DATA_DIR / "faiss_index"

DOCS = {
    "policy_on_results_EN.pdf": "en",
    "policy_on_results_FR.pdf": "fr",
}

...

vectorstore = FAISS.from_documents(all_docs, embeddings)
vectorstore.save_local(str(INDEX_DIR))
print(f"Done! Index saved at {INDEX_DIR}")

What this means

Your repo already contains the source PDFs in data/:

  • data/policy_on_results_EN.pdf
  • data/policy_on_results_FR.pdf

But it does not contain the generated FAISS index folder data/faiss_index. So when Streamlit runs, it fails the existence check and shows that message.

Why this happens on Streamlit

On Streamlit Cloud or a fresh environment:

  • only committed repo files are present at startup
  • generated files like data/faiss_index are usually not there unless:
    1. you committed them to GitHub, or
    2. you build them during app startup/deployment

Right now your app expects the index to already exist, but your deployment does not create it automatically.

Fastest fixes

Option 1: Run ingest locally, then commit the generated index

Run:

python ingest.py

That should create:

  • data/faiss_index/index.faiss
  • data/faiss_index/index.pkl

Then commit and push that folder.

This is the quickest fix, though committing generated vector indexes is not always ideal.

Option 2: Generate the index automatically if missing

A better app behavior is:

  • if data/faiss_index exists, load it
  • otherwise build it from the PDFs automatically

That avoids needing a separate manual python ingest.py step in deployment.

Root cause

The violated assumption is: app/main.py assumes the FAISS index already exists in the deployed filesystem.
That assumption is not enforced anywhere in deployment, and the repo currently only includes the PDFs, not the built index.

Confidence

High. The code explicitly checks for data/faiss_index, and the repository contents you shared show the PDFs in data/ but no faiss_index directory.

If you want, I can do one of these next

  1. show you the exact commands to generate and push data/faiss_index
  2. give you the code change to auto-build the index in Streamlit
  3. create a PR for riiseup08/RAG_Streamlit that fixes this automatically

User: create a...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Update app to build FAISS index automatically on startup Auto-build FAISS index on startup when absent Jun 2, 2026
Copilot AI requested a review from riiseup08 June 2, 2026 12:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants