diff --git a/.env.example b/.env.example
new file mode 100644
index 0000000..eaf4bb2
--- /dev/null
+++ b/.env.example
@@ -0,0 +1,13 @@
+PHI_URL=...
+QWEN_URL=...
+
+EMBEDS_URL=...
+DEFAULT_MODEL=microsoft/Phi-4-mini-instruct
+DEFAULT_EMBEDDING=intfloat/multilingual-e5-large-instruct-modal
+
+API_KEY=...
+EMBEDS_API_KEY=...
+
+GROBID_URL=...
+GROBID_QUANTITIES_URL=...
+GROBID_MATERIALS_URL=...
\ No newline at end of file
diff --git a/README.md b/README.md
index a1155fe..1ea14bf 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,8 @@ Additionally, this frontend provides the visualisation of named entities on LLM
 
 ## Documentation
 
+**For full technical documentation** of the `document-qa-engine` library, see **[`docs/README.md`](docs/README.md)**.
+
 ### Embedding selection
 
 In the latest version, there is the possibility to select both embedding functions and LLMs. There are some limitations, OpenAI embeddings cannot be used with open source models, and vice-versa.
@@ -93,7 +95,7 @@ To release a new version:
 
 To use docker:
 
-- docker run `lfoppiano/document-insights-qa:{latest_version)`
+- docker run `lfoppiano/document-insights-qa:{latest_version}`
 
 - docker run `lfoppiano/document-insights-qa:latest-develop` for the latest development version
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..78a83c7
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,249 @@
+# πŸ“ document-qa-engine documentation
+
+> **Version**: 0.5.1 Β· **License**: Apache 2.0 Β· **PyPI**: `pip install document-qa-engine`
+
+A Python library and Streamlit application for **Question/Answering on scientific PDF documents** using Retrieval-Augmented Generation (RAG). It uses [GROBID](https://github.com/kermitt2/grobid) for structured text extraction, [ChromaDB](https://www.trychroma.com/) for vector storage, and any OpenAI-compatible LLM for answering.
+
+## Overview
+
+Most PDF Q/A tools feed raw extracted text to an LLM, which is noisy and loses document structure. **document-qa-engine** takes a different approach (see the sketch after this list):
+
+1. **Structured extraction**: Sends the PDF to a GROBID server, which returns TEI-XML with separate sections (title, abstract, body paragraphs, figures, back matter) and precise bounding-box coordinates for every paragraph.
+2. **Smart chunking**: Paragraphs can be kept as-is or merged into larger chunks using token-aware merging, while preserving coordinate metadata.
+3. **Vector embeddings**: Each chunk is embedded (via a remote API or local model) and stored in an in-memory ChromaDB collection.
+4. **Retrieval + LLM answering**: User questions are embedded, the most similar chunks are retrieved, and an LLM generates an answer from that context.
+5. **PDF highlighting**: The Streamlit frontend highlights the exact PDF regions the LLM used, with a color gradient (orange = most relevant, blue = least relevant).
+6. **NER post-processing** *(optional)*: LLM responses are scanned for physical quantities (via grobid-quantities) and materials mentions (via grobid-superconductors), then annotated inline.
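+
+As a minimal sketch, steps 1–4 condense to a few calls. The endpoints, keys, and the PDF path below are placeholders; see the Quick Start sections further down for a full walkthrough:
+
+```python
+from langchain_openai import ChatOpenAI
+from document_qa.custom_embeddings import ModalEmbeddings
+from document_qa.document_qa_engine import DocumentQAEngine, DataStorage
+
+# Steps 1-3: parse via GROBID, chunk, embed, and store in memory
+storage = DataStorage(ModalEmbeddings(
+    url="http://localhost:1234/v1",
+    model_name="intfloat/multilingual-e5-large-instruct",
+    api_key="your-embedding-api-key",
+))
+engine = DocumentQAEngine(
+    llm=ChatOpenAI(model="microsoft/Phi-4-mini-instruct",
+                   base_url="http://localhost:1234/v1",
+                   api_key="your-llm-api-key"),
+    data_storage=storage,
+    grobid_url="http://localhost:8070",
+)
+doc_id = engine.create_memory_embeddings("paper.pdf")
+
+# Step 4: retrieve the most similar chunks and ask the LLM
+_, answer, coordinates = engine.query_document("What was measured?", doc_id)
+print(answer)
+```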
+ + +## Installation + +### Option 1: PyPI (library only) + +```bash +pip install document-qa-engine +``` + +### Option 2: From source (full app) + +```bash +git clone https://github.com/lfoppiano/document-qa.git +cd document-qa +pip install -r requirements.txt +``` + +### Option 3: Docker + +```bash +# Latest stable release +docker run -p 8501:8501 lfoppiano/document-insights-qa:latest + +# Latest development build +docker run -p 8501:8501 lfoppiano/document-insights-qa:latest-develop +``` + +### Prerequisites + +You need access to: + +| Service | Required? | Purpose | +|---------|-----------|---------| +| **GROBID server** | βœ… Yes | Parses PDFs into structured text | +| **Embedding API** | βœ… Yes | Converts text to vectors | +| **LLM API** (OpenAI-compatible) | βœ… Yes | Answers questions | +| **grobid-quantities** | ❌ Optional | NER for measurements | +| **grobid-superconductors** | ❌ Optional | NER for materials | + + + +## Configuration + +All configuration is through environment variables. Create a `.env` file in the project root: + +```env +# ── LLM Endpoints ──────────────────────────────────────── +# Each key in API_MODELS maps a model name to its base URL. +PHI_URL=http://localhost:1234/v1 # Phi-4-mini-instruct endpoint +QWEN_URL=http://localhost:1234/v1 # Qwen3-0.6B endpoint +API_KEY=your-llm-api-key # Auth key for LLM APIs + +# ── Embedding Endpoint ─────────────────────────────────── +EMBEDS_URL=http://127.0.0.1:1234/v1 # Embedding service URL +EMBEDS_API_KEY=your-embedding-api-key # Auth key for embedding API + +# ── Defaults ───────────────────────────────────────────── +DEFAULT_MODEL=microsoft/Phi-4-mini-instruct +DEFAULT_EMBEDDING=intfloat/multilingual-e5-large-instruct-modal + +# ── GROBID Services ────────────────────────────────────── +GROBID_URL=https://your-grobid-url +GROBID_QUANTITIES_URL=https://your-grobid-quantities-url/ +GROBID_MATERIALS_URL=https://your-grobid-superconductors-url/ +``` + +### Variable Reference + +| Variable | Description | +|----------|-------------| +| `PHI_URL` | Base URL for the Phi-4-mini-instruct vLLM server (OpenAI-compatible) | +| `QWEN_URL` | Base URL for the Qwen3-0.6B vLLM server (OpenAI-compatible) | +| `API_KEY` | Bearer token for authenticating with the LLM endpoints | +| `EMBEDS_URL` | Base URL for the embedding service (must expose `/embeddings` endpoint) | +| `EMBEDS_API_KEY` | Bearer token for authenticating with the embedding service | +| `DEFAULT_MODEL` | Model name pre-selected in the UI dropdown | +| `DEFAULT_EMBEDDING` | Embedding name pre-selected in the UI dropdown | +| `GROBID_URL` | Full URL to a running GROBID server | +| `GROBID_QUANTITIES_URL` | URL to a grobid-quantities server (for measurement NER) | +| `GROBID_MATERIALS_URL` | URL to a grobid-superconductors server (for materials NER) | + +--- + +## Quick Start β€” Streamlit App + +```bash +# 1. Set up environment +cp .env.example .env # Edit with your endpoints + +# 2. Run the app +streamlit run streamlit_app.py +``` + +Then open `http://localhost:8501`, upload a PDF, and ask questions. + +--- + +## Quick Start β€” As a Python Library + +```python +from langchain_openai import ChatOpenAI +from document_qa.custom_embeddings import ModalEmbeddings +from document_qa.document_qa_engine import DocumentQAEngine, DataStorage + +# 1. Set up the LLM +llm = ChatOpenAI( + model="microsoft/Phi-4-mini-instruct", + temperature=0.0, + base_url="http://localhost:1234/v1", + api_key="your-api-key" +) + +# 2. 
Set up embeddings +embeddings = ModalEmbeddings( + url="http://localhost:1234/v1", + model_name="intfloat/multilingual-e5-large-instruct", + api_key="your-embedding-key" +) + +# 3. Create the storage and engine +storage = DataStorage(embeddings) +engine = DocumentQAEngine( + llm=llm, + data_storage=storage, + grobid_url="https://lfoppiano-grobid.hf.space/" +) + +# 4. Load a PDF (creates in-memory embeddings) +doc_id = engine.create_memory_embeddings( + pdf_path="path/to/paper.pdf", + chunk_size=500 # tokens per chunk (-1 = keep paragraphs) +) + +# 5. Ask a question +_, answer, coordinates = engine.query_document( + query="What is the main contribution of this paper?", + doc_id=doc_id, + context_size=10 # number of chunks to use as context +) +print(answer) + +# 6. Or just retrieve relevant passages (no LLM) +passages, coordinates = engine.query_storage( + query="What materials were studied?", + doc_id=doc_id, + context_size=5 +) +for p in passages: + print(p) +``` + + +## Streamlit App Features + +### Query Modes + +| Mode | What It Does | When to Use | +|------|-------------|-------------| +| **LLM Q/A** | Retrieves context β†’ sends to LLM β†’ returns a natural language answer | Default β€” for asking questions | +| **Embeddings** | Returns the raw text passages most similar to your question | Debugging β€” to see what context the LLM would receive | +| **Question Coefficient** | Computes `min_similarity - mean_similarity` as a quality estimate | Experimental β€” to predict answer reliability | + +### Settings + +| Setting | Default | Description | +|---------|---------|-------------| +| Chunk size | `-1` (paragraphs) | Token count per text chunk. `-1` keeps GROBID paragraphs intact. | +| Context size | `10` (paragraphs) / `4` (chunks) | Number of chunks sent to the LLM as context | +| Scroll to context | Off | Auto-scroll the PDF viewer to the most relevant passage | +| NER processing | Off | Run grobid-quantities + grobid-superconductors on LLM responses | + +### PDF Annotations + +After each query, the PDF viewer highlights the passages used as context: +- **Orange** (warm) = most relevant passage +- **Blue** (cold) = least relevant passage +- **Dotted border** = the single most relevant passage + + + +## Troubleshooting + +### SQLite version error + +``` +streamlit: Your system has an unsupported version of sqlite3. +Chroma requires sqlite3 >= 3.35.0. +``` + +**Linux fix**: See [this StackOverflow answer](https://stackoverflow.com/questions/76958817/streamlit-your-system-has-an-unsupported-version-of-sqlite3-chroma-requires-sq). +**More info**: [Chroma troubleshooting docs](https://docs.trychroma.com/troubleshooting#sqlite). + +### "The information is not provided in the given context" + +The LLM couldn't find the answer in the retrieved passages. Try: +1. **Increase context size** β€” use the sidebar slider to retrieve more passages +2. **Decrease chunk size** β€” smaller chunks may match more precisely +3. **Use Embeddings mode** β€” switch to "Embeddings" query mode to see what passages are being retrieved and verify they contain the answer + +### MissingSchema error on embeddings + +``` +requests.exceptions.MissingSchema: Invalid URL +``` + +Ensure `EMBEDS_URL` in your `.env` starts with `https://` or `http://`. 
Example:
+```env
+EMBEDS_URL=https://your-modal-endpoint.modal.run/v1
+```
+
+### GROBID connection errors
+
+Make sure your GROBID server is running and accessible:
+```bash
+curl https://grobid.hf.space/api/isalive
+```
+
+If using a local GROBID instance:
+```bash
+docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.0
+# Then set GROBID_URL=http://localhost:8070
+```
+
+### Embedding API returning empty results
+
+- Verify the API is running: `curl {EMBEDS_URL}/embeddings`
+- Check that `EMBEDS_API_KEY` matches the server's expected key
+- Ensure the URL does **not** have a trailing `/embeddings` (the client appends it automatically)
+
+---
+
diff --git a/document_qa/custom_embeddings.py b/document_qa/custom_embeddings.py
index 50c5910..07bca33 100644
--- a/document_qa/custom_embeddings.py
+++ b/document_qa/custom_embeddings.py
@@ -1,48 +1,97 @@
+"""Custom LangChain-compatible embedding client.
+
+Provides :class:`ModalEmbeddings`, a drop-in ``Embeddings`` implementation
+that calls any service exposing an ``/embeddings`` endpoint (OpenAI,
+vLLM, Modal, LM Studio, etc.).
+
+"""
+
 from typing import List
 
 import requests
 from langchain_core.embeddings import Embeddings
 
 
 class ModalEmbeddings(Embeddings):
+    """LangChain ``Embeddings`` backed by an OpenAI-compatible HTTP API.
+
+    The service must expose a ``POST /embeddings`` endpoint that accepts
+    ``{"model": "…", "input": ["…"]}`` and returns the standard OpenAI
+    response shape.
+
+    Args:
+        url: Base URL of the embedding service (e.g. ``"http://localhost:1234/v1"``).
+        model_name: Model identifier (e.g. ``"intfloat/multilingual-e5-large-instruct"``).
+        api_key: Optional bearer token for authenticated endpoints.
+    """
+
     def __init__(self, url: str, model_name: str, api_key: str = None):
         self.url = url
         self.model_name = model_name
         self.api_key = api_key
 
-    def embed(self, text: List[str]) -> List[List[str]]:
-        # We remove newlines from the text to avoid issues with the embedding model.
+    def embed(self, text: List[str]) -> List[List[float]]:
+        """Embed a list of texts via the configured API.
+
+        Newlines are replaced with spaces before sending, since most
+        embedding models treat them as noise.
+
+        Args:
+            text: Strings to embed.
+
+        Returns:
+            list[list[float]]: One embedding vector per input string.
+
+        Raises:
+            requests.HTTPError: If the API returns a non-2xx status.
+        """
+        # Newlines degrade embedding quality for most models
         cleaned_text = [t.replace("\n", " ") for t in text]
 
-        payload = {'text': "\n".join(cleaned_text)}
+        headers = {
+            "Content-Type": "application/json"
+        }
 
-        headers = {}
         if self.api_key:
-            headers = {'x-api-key': self.api_key}
+            headers["Authorization"] = f"Bearer {self.api_key}"
 
         response = requests.post(
-            self.url,
-            data=payload,
-            files=[],
+            f"{self.url}/embeddings",
+            json={
+                "model": self.model_name,
+                "input": cleaned_text
+            },
             headers=headers
         )
+        response.raise_for_status()
 
-        # print(response.text)
-        return response.json()
+        data = response.json()["data"]
+        return [item["embedding"] for item in data]
 
-    def embed_documents(self, text: List[str]) -> List[List[str]]:
-        """
-        Embed a list of documents using the embedding model.
+    def embed_documents(self, text: List[str]) -> List[List[float]]:
+        """Embed multiple documents (LangChain interface).
+
+        Args:
+            text: Document strings to embed.
+
+        Returns:
+            list[list[float]]: One embedding vector per document.
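+
+        Example (illustrative; assumes a reachable endpoint and a valid key)::
+
+            embedder = ModalEmbeddings(
+                url="http://localhost:1234/v1",
+                model_name="intfloat/multilingual-e5-large-instruct",
+                api_key="your-embedding-api-key",
+            )
+            vectors = embedder.embed_documents(["first doc", "second doc"])
+            assert len(vectors) == 2  # one vector per input document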
""" return self.embed(text) - def embed_query(self, text: str) -> List[str]: - """ - Embed a query + def embed_query(self, text: str) -> List[float]: + """Embed a single query string (LangChain interface). + + Args: + text: The query string. + + Returns: + list[float]: The embedding vector for *text*. """ return self.embed([text])[0] def get_model_name(self) -> str: + """Return the model identifier used for embedding requests.""" return self.model_name diff --git a/document_qa/document_qa_engine.py b/document_qa/document_qa_engine.py index 7560ecf..b2aea27 100644 --- a/document_qa/document_qa_engine.py +++ b/document_qa/document_qa_engine.py @@ -1,3 +1,9 @@ +"""Core Q/A engine for scientific PDF documents. + +This module provides the main classes for building a Retrieval-Augmented +Generation (RAG) pipeline over scientific PDFs. +""" + import copy import os from pathlib import Path @@ -20,9 +26,17 @@ class TextMerger: - """ - This class tries to replicate the RecursiveTextSplitter from LangChain, to preserve and merge the - coordinate information from the PDF document. + """Token-aware text merger that preserves PDF coordinate metadata. + + Unlike LangChain's ``RecursiveTextSplitter``, this merger keeps the + bounding-box coordinates extracted by GROBID so that downstream + consumers (e.g. the PDF viewer) can highlight the exact regions. + + Args: + model_name: A tiktoken model name (e.g. ``"gpt-4"``). When given, + the tokenizer for that model is used. + encoding_name: A tiktoken encoding name (default ``"gpt2"``). + Ignored when *model_name* is provided. """ def __init__(self, model_name=None, encoding_name="gpt2"): @@ -32,6 +46,19 @@ def __init__(self, model_name=None, encoding_name="gpt2"): self.enc = tiktoken.get_encoding(encoding_name) def encode(self, text, allowed_special=set(), disallowed_special="all"): + """Tokenize *text* and return a list of token IDs. + + Thin wrapper around ``tiktoken.Encoding.encode`` that exposes the + same special-token controls. + + Args: + text: The string to tokenize. + allowed_special: Set of special tokens allowed in *text*. + disallowed_special: Special-token handling policy. + + Returns: + list[int]: Token IDs produced by the configured tokenizer. + """ return self.enc.encode( text, allowed_special=allowed_special, @@ -39,6 +66,24 @@ def encode(self, text, allowed_special=set(), disallowed_special="all"): ) def merge_passages(self, passages, chunk_size, tolerance=0.2): + """Merge consecutive passages into chunks of approximately *chunk_size* tokens. + + Args: + passages: List of dicts, each with ``"text"`` (str) and + ``"coordinates"`` (str) keys β€” as returned by + method:`GrobidProcessor.process_structure`. + chunk_size: Target number of tokens per merged chunk. + tolerance: Fraction of *chunk_size* allowed as overflow + (default ``0.2``). + + Returns: + list[dict]: Merged passages. Each dict has: + + - ``"text"`` β€” concatenated paragraph texts. + - ``"coordinates"`` β€” semicolon-joined coordinate strings. + - ``"type"`` β€” always ``"aggregated chunks"``. + - ``"section"`` / ``"subSection"`` β€” always ``"mixed"``. + """ new_passages = [] new_coordinates = [] current_texts = [] @@ -94,6 +139,8 @@ def merge_passages(self, passages, chunk_size, tolerance=0.2): class BaseRetrieval: + """Abstract base for retrieval backends. + """ def __init__( self, @@ -119,6 +166,19 @@ class NER_Retrival(VectorStore): class DataStorage: + """Manages per-document vector-store collections. 
+
+    Each uploaded PDF gets its own ChromaDB collection,
+    keyed by a document ID (typically an MD5 hash). Collections can live
+    in memory or be persisted to disk.
+
+    Args:
+        embedding_function: A LangChain-compatible ``Embeddings`` instance.
+        root_path: Optional directory for persisted embeddings.
+        engine: The vector-store class to use.
+
+    """
+
     embeddings_dict = {}
     embeddings_map_from_md5 = {}
     embeddings_map_to_md5 = {}
@@ -167,15 +227,25 @@ def load_embeddings(self, embeddings_root_path: Union[str, Path]) -> None:
         print("Embedding loaded: ", len(self.embeddings_dict.keys()))
 
     def get_loaded_embeddings_ids(self):
+        """Return the document IDs (MD5 hashes) of all loaded collections."""
         return list(self.embeddings_dict.keys())
 
     def get_md5_from_filename(self, filename):
+        """Look up the MD5 document ID for a given original *filename*."""
         return self.embeddings_map_to_md5[filename]
 
     def get_filename_from_md5(self, md5):
+        """Look up the original filename for a given *md5* document ID."""
         return self.embeddings_map_from_md5[md5]
 
     def embed_document(self, doc_id, texts, metadatas):
+        """Create (or replace) an in-memory vector collection for a document.
+
+        Args:
+            doc_id: Unique identifier for the document.
+            texts: List of text chunks to embed.
+            metadatas: List of metadata dicts (one per chunk).
+        """
         if doc_id not in self.embeddings_dict.keys():
             self.embeddings_dict[doc_id] = self.engine.from_texts(
                 texts,
@@ -195,6 +265,24 @@
 
 
 class DocumentQAEngine:
+    """End-to-end RAG engine for scientific PDF documents.
+
+    Orchestrates the full pipeline:
+
+    1. **PDF parsing** via a GROBID server (structured text + coordinates).
+    2. **Chunking** β€” paragraphs kept as-is or merged with :class:`TextMerger`.
+    3. **Embedding and storage** β€” chunks are embedded and stored.
+    4. **Retrieval + LLM** β€” relevant chunks are retrieved and fed to an LLM
+       to produce an answer.
+
+    Args:
+        llm: A LangChain chat model (e.g. ``ChatOpenAI``).
+        data_storage: A :class:`DataStorage` instance for managing embeddings.
+        grobid_url: URL of the GROBID server.
+        memory: Optional ``ConversationBufferMemory`` for multi-turn context.
+
+    """
+
     llm = None
     qa_chain_type = None
 
@@ -229,7 +317,30 @@ def query_document(
             context_size=4,
             extraction_schema=None,
             verbose=False
-    ) -> (Any, str):
+    ) -> tuple[Any, str, list]:
+        """Ask a question and get an LLM-generated answer.
+
+        Retrieves the most relevant chunks from the vector store, feeds
+        them as context to the LLM, and returns the response.
+
+        Args:
+            query: The natural-language question.
+            doc_id: Document identifier returned by :meth:`create_memory_embeddings`.
+            output_parser: Optional LangChain output parser. If provided, the
+                raw LLM response is re-processed into structured output.
+            context_size: Number of chunks to retrieve as context (default 4).
+            extraction_schema: Optional extraction schema.
+            verbose: Print debug information.
+
+        Returns:
+            tuple: ``(parsed_output | None, raw_text_response, coordinates)``
+
+            - *parsed_output* β€” structured data if a parser/schema was given,
+              otherwise ``None``.
+            - *raw_text_response* β€” the LLM's raw text answer.
+            - *coordinates* β€” list of lists of coordinate strings for each
+              retrieved chunk (for PDF highlighting).
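+
+        Example (illustrative; assumes ``engine`` and a ``doc_id`` from
+        :meth:`create_memory_embeddings`)::
+
+            _, answer, coords = engine.query_document(
+                "What is the main contribution?", doc_id, context_size=10
+            )
+            print(answer)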
+ """ # self.load_embeddings(self.embeddings_root_path) if verbose: @@ -258,9 +369,22 @@ def query_document( else: return None, response, coordinates - def query_storage(self, query: str, doc_id, context_size=4) -> (List[Document], list): - """ - Returns the context related to a given query + def query_storage(self, query: str, doc_id, context_size=4) -> tuple[List[Document], list]: + """Retrieve relevant text passages without calling the LLM. + + Useful for debugging which chunks would be used as context, or for + building custom pipelines on top of the retrieval step. + + Args: + query: The natural-language question. + doc_id: Document identifier. + context_size: Number of chunks to retrieve (default 4). + + Returns: + tuple: ``(texts, coordinates)`` + + - *texts* β€” list of passage strings. + - *coordinates* β€” list of lists of coordinate strings. """ documents, coordinates = self._get_context(doc_id, query, context_size) @@ -268,8 +392,21 @@ def query_storage(self, query: str, doc_id, context_size=4) -> (List[Document], return context_as_text, coordinates def query_storage_and_embeddings(self, query: str, doc_id, context_size=4) -> List[Document]: - """ - Returns both the context and the embedding information from a given query + """Retrieve passages with their similarity scores and raw embeddings. + + Each returned ``Document`` has extra metadata keys: + + - ``__similarity`` β€” cosine distance to the query. + - ``__embeddings`` β€” the chunk's embedding vector. + + Args: + query: The natural-language question. + doc_id: Document identifier. + context_size: Number of chunks to retrieve (default 4). + + Returns: + list[Document]: Retrieved documents enriched with similarity and + embedding metadata. """ db = self.data_storage.embeddings_dict[doc_id] retriever = db.as_retriever( @@ -281,6 +418,20 @@ def query_storage_and_embeddings(self, query: str, doc_id, context_size=4) -> Li return relevant_documents def analyse_query(self, query, doc_id, context_size=4): + """Compute a relevance coefficient for *query* against *doc_id*. + + The coefficient is ``min_similarity - mean_similarity`` over the + top-k retrieved chunks. A value close to zero suggests the + question matches multiple passages equally well. + + Args: + query: The natural-language question. + doc_id: Document identifier. + context_size: Number of chunks to consider (default 4). 
+
+        Returns:
+            tuple: ``(summary_string, coordinates)``
+        """
         db = self.data_storage.embeddings_dict[doc_id]
         # retriever = db.as_retriever(
         #     search_kwargs={"k": context_size, 'score_threshold': 0.0},
@@ -329,12 +480,12 @@ def _parse_json(self, response, output_parser):
 
         return parsed_output
 
-    def _run_query(self, doc_id, query, context_size=4) -> (List[Document], list):
+    def _run_query(self, doc_id, query, context_size=4) -> tuple[Any, list]:
         relevant_documents, relevant_document_coordinates = self._get_context(doc_id, query, context_size)
         response = self.chain.invoke({"context": relevant_documents, "question": query})
         return response, relevant_document_coordinates
 
-    def _get_context(self, doc_id, query, context_size=4) -> (List[Document], list):
+    def _get_context(self, doc_id, query, context_size=4) -> tuple[List[Document], list]:
         db = self.data_storage.embeddings_dict[doc_id]
         retriever = db.as_retriever(search_kwargs={"k": context_size})
         relevant_documents = retriever.invoke(query)
@@ -366,10 +517,28 @@ def _get_context_multiquery(self, doc_id, query, context_size=4):
         return relevant_documents
 
     def get_text_from_document(self, pdf_file_path, chunk_size=-1, perc_overlap=0.1, verbose=False):
-        """
-        Extract text from documents using Grobid.
-        - if chunk_size is < 0, keeps each paragraph separately
-        - if chunk_size > 0, aggregate all paragraphs and split them again using an approximate chunk size
+        """Extract and chunk text from a PDF via GROBID.
+
+        Sends the PDF to the configured GROBID server, parses the returned
+        TEI-XML into passages with coordinate metadata, and optionally
+        merges passages into larger token-based chunks.
+
+        Args:
+            pdf_file_path: Path to the PDF file on disk.
+            chunk_size: Target tokens per chunk. ``-1`` (default) keeps
+                GROBID paragraphs as-is; a positive value merges them.
+            perc_overlap: Reserved for future overlap support.
+            verbose: Print debug information.
+
+        Returns:
+            tuple: ``(texts, metadatas, ids)``
+
+            - *texts* β€” list of passage strings.
+            - *metadatas* β€” list of metadata dicts (coordinates, section, …).
+            - *ids* β€” list of integer chunk IDs.
+
+        Raises:
+            AttributeError: If ``grobid_url`` was not provided at init time.
         """
         if verbose:
             print("File", pdf_file_path)
@@ -416,6 +585,22 @@ def create_memory_embeddings(
             chunk_size=500,
             perc_overlap=0.1
     ):
+        """Parse a PDF and create an in-memory vector collection.
+
+        This is the main entry-point for ingesting a new document. It
+        calls GROBID, chunks the text, embeds it, and stores everything
+        in ``data_storage``.
+
+        Args:
+            pdf_path: Path to the PDF file.
+            doc_id: Optional explicit document ID. When ``None``, the
+                MD5 hash extracted by GROBID is used.
+            chunk_size: Token count per chunk (default 500). Use ``-1``
+                to keep GROBID paragraphs intact.
+            perc_overlap: Reserved for future overlap support.
+
+        Returns:
+            str: The document ID.
+        """
         texts, metadata, ids = self.get_text_from_document(
             pdf_path,
             chunk_size=chunk_size,
@@ -436,6 +621,18 @@ def create_embeddings(
             perc_overlap=0.1,
             include_biblio=False
     ):
+        """Batch-process a directory of PDFs and persist their embeddings.
+
+        Walks *pdfs_dir_path*, processes each ``.pdf`` file through GROBID,
+        creates embeddings, and persists the resulting ChromaDB collection
+        to a subdirectory named after the file's MD5.
+
+        Args:
+            pdfs_dir_path: Directory containing PDF files.
+            chunk_size: Token count per chunk (default 500).
+            perc_overlap: Reserved for future overlap support.
+            include_biblio: Reserved flag (currently unused).
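+
+        Example (illustrative; ``./papers`` is a hypothetical directory of
+        PDFs, and assumes the engine's :class:`DataStorage` was given a
+        ``root_path`` for persistence)::
+
+            engine.create_embeddings("./papers", chunk_size=500)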
+ """ input_files = [] for root, dirs, files in os.walk(pdfs_dir_path, followlinks=False): for file_ in files: @@ -470,6 +667,8 @@ def create_embeddings( @staticmethod def calculate_md5(input_file: Union[Path, str]): + """Return the uppercase hex MD5 digest of *input_file*.""" + import hashlib md5_hash = hashlib.md5() with open(input_file, 'rb') as fi: diff --git a/document_qa/grobid_processors.py b/document_qa/grobid_processors.py index 0aae0ee..be63839 100644 --- a/document_qa/grobid_processors.py +++ b/document_qa/grobid_processors.py @@ -1,3 +1,18 @@ +"""GROBID-based processors for scientific text extraction. + +This module provides processors that interact with GROBID services to: + +- **Extract structured text** from scientific PDFs (:class:`GrobidProcessor`) + β€” parses TEI-XML into passages with section labels and PDF coordinates. +- **Annotate physical quantities** (:class:`GrobidQuantitiesProcessor`) + β€” identifies measurements via the grobid-quantities service. +- **Annotate materials** (:class:`GrobidMaterialsProcessor`) + β€” identifies material mentions via grobid-superconductors. +- **Aggregate NER results** (:class:`GrobidAggregationProcessor`) + β€” combines quantity and material annotations with overlap pruning. + +""" + import re from collections import OrderedDict from html import escape @@ -10,6 +25,7 @@ def get_span_start(type, title=None): + """Return an opening ```` tag for an annotation of the given *type*.""" title_ = ' title="' + title + '"' if title is not None else "" return '' @@ -31,10 +47,19 @@ def has_space_between_value_and_unit(quantity): def decorate_text_with_annotations(text, spans, tag="span"): - """ - Decorate a text using spans, using two style defined by the tag: - - "span" generated HTML like annotated text - - "rs" generate XML like annotated text (format SuperMat) + """Wrap recognised entity spans in markup tags. + + Produces either HTML (````) or TEI-XML + (````) depending on *tag*. + + Args: + text: The original plain-text string. + spans: List of span dicts with at least ``offset_start``, + ``offset_end``, and ``type`` keys. + tag: ``"span"`` (default) for HTML output, ``"rs"`` for XML. + + Returns: + str: The text with inline annotation markup. """ sorted_spans = list(sorted(spans, key=lambda item: item['offset_start'])) annotated_text = "" @@ -60,15 +85,26 @@ def get_parsed_value_type(quantity): class BaseProcessor(object): - # def __init__(self, grobid_superconductors_client=None, grobid_quantities_client=None): - # self.grobid_superconductors_client = grobid_superconductors_client - # self.grobid_quantities_client = grobid_quantities_client + """Shared post-processing logic for all GROBID-derived processors. + + Fixes common character-encoding artefacts produced by PDF extraction + (e.g. ``Γ€`` β†’ ``-``, ``ΒΌ`` β†’ ``=``). All processor subclasses + inherit :meth:`post_process` from here. + """ patterns = [ r'\d+e\d+' ] def post_process(self, text): + """Clean encoding artefacts and normalise special characters. + + Args: + text: Raw extracted text from GROBID. + + Returns: + str: Cleaned text. + """ output = text.replace('Γ€', '-') output = output.replace('ΒΌ', '=') output = output.replace('ΓΎ', '+') @@ -84,8 +120,25 @@ def post_process(self, text): class GrobidProcessor(BaseProcessor): + """Extract structured text and coordinates from PDFs via GROBID. 
+
+    Sends a PDF to a running GROBID server, parses the returned TEI-XML,
+    and produces a list of passage dicts with text content, section labels,
+    and bounding-box coordinates for each paragraph.
+
+    Args:
+        grobid_url: Full URL of the GROBID server
+            (e.g. ``"https://grobid.example.com"``).
+        ping_server: If ``True`` (default), verify the server is alive
+            on init.
+
+    Raises:
+        ServerUnavailableException: If *ping_server* is ``True`` and the
+            GROBID server does not respond.
+
+    """
+
     def __init__(self, grobid_url, ping_server=True):
-        # super().__init__()
         grobid_client = GrobidClient(
             grobid_server=grobid_url,
             batch_size=5,
@@ -97,6 +150,24 @@
         self.grobid_client = grobid_client
 
     def process_structure(self, input_path, coordinates=False):
+        """Send a PDF to GROBID and return structured content.
+
+        Args:
+            input_path: Path to the PDF file.
+            coordinates: If ``True``, include bounding-box coordinate
+                strings in each passage (needed for PDF highlighting).
+
+        Returns:
+            dict or None: A dict with keys:
+
+            - ``"biblio"`` β€” bibliographic metadata (title, authors, DOI, …).
+            - ``"passages"`` β€” list of passage dicts, each containing
+              ``text``, ``type``, ``section``, ``subSection``,
+              ``passage_id``, and ``coordinates``.
+            - ``"filename"`` β€” stem of the PDF filename.
+
+            Returns ``None`` if GROBID returns a non-200 status.
+        """
         pdf_file, status, text = self.grobid_client.process_pdf("processFulltextDocument",
                                                                 input_path,
                                                                 consolidate_header=True,
@@ -125,6 +196,19 @@ def process_single(self, input_file):
         return doc
 
     def parse_grobid_xml(self, text, coordinates=False):
+        """Parse GROBID TEI-XML into a structured passage dict.
+
+        Extracts title, abstract, body paragraphs, back-matter, and
+        figure descriptions from the XML, post-processes encoding
+        artefacts, and attaches coordinate metadata.
+
+        Args:
+            text: Raw TEI-XML string returned by GROBID.
+            coordinates: Whether to extract ``coords`` attributes.
+
+        Returns:
+            dict: ``{"biblio": {…}, "passages": […]}``
+        """
         output_data = OrderedDict()
 
         doc_biblio = grobid_tei_xml.parse_document_xml(text)
@@ -256,10 +340,29 @@
 
 
 class GrobidQuantitiesProcessor(BaseProcessor):
+    """NER processor for physical quantities (measurements, units).
+
+    Wraps the grobid-quantities service to identify and normalise
+    measurements in text.
+
+    Args:
+        grobid_quantities_client: A configured quantities API client.
+    """
+
     def __init__(self, grobid_quantities_client):
         self.grobid_quantities_client = grobid_quantities_client
 
     def process(self, text) -> list:
+        """Extract quantity spans from *text*.
+
+        Args:
+            text: Plain text to analyse.
+
+        Returns:
+            list[dict]: Span dicts with ``offset_start``, ``offset_end``,
+            ``type`` (``"property"``), and optional ``unit_type`` /
+            ``quantified`` keys.
+        """
         status, result = self.grobid_quantities_client.process_text(text.strip())
 
         if status != 200:
@@ -428,10 +531,29 @@
 
 
 class GrobidMaterialsProcessor(BaseProcessor):
+    """NER processor for material mentions (chemical compounds, etc.).
+
+    Wraps the grobid-superconductors service.
+
+    Args:
+        grobid_superconductors_client: A configured
+            :class:`~document_qa.ner_client_generic.NERClientGeneric` instance.
+    """
+
     def __init__(self, grobid_superconductors_client):
         self.grobid_superconductors_client = grobid_superconductors_client
 
     def process(self, text):
+        """Extract material-mention spans from *text*.
+
+        Args:
+            text: Plain text to analyse.
+ + Returns: + list[dict]: Span dicts with ``offset_start``, ``offset_end``, + ``type`` (``"material"``), and optional ``formula`` keys. + """ preprocessed_text = text.strip() status, result = self.grobid_superconductors_client.process_text(preprocessed_text, "processText_disable_linking") @@ -528,6 +650,20 @@ def parse_superconductors_output(result, original_text): class GrobidAggregationProcessor(GrobidQuantitiesProcessor, GrobidMaterialsProcessor): + """Combined NER processor that merges quantity and material annotations. + + Runs both :class:`GrobidQuantitiesProcessor` and + :class:`GrobidMaterialsProcessor`, then prunes overlapping spans so + that the output is clean and non-overlapping. + + Args: + grobid_quantities_client: Optional quantities API client. + grobid_superconductors_client: Optional materials NER client. + + Either or both clients may be ``None``; only the provided services + will be called. + """ + def __init__(self, grobid_quantities_client=None, grobid_superconductors_client=None): if grobid_quantities_client: self.gqp = GrobidQuantitiesProcessor(grobid_quantities_client) @@ -535,6 +671,14 @@ def __init__(self, grobid_quantities_client=None, grobid_superconductors_client= self.gmp = GrobidMaterialsProcessor(grobid_superconductors_client) def process_single_text(self, text): + """Run both NER services on *text* and return merged, deduplicated spans. + + Args: + text: Plain text to process. + + Returns: + list[dict]: Non-overlapping span dicts sorted by offset. + """ extracted_quantities_spans = self.process_properties(text) extracted_materials_spans = self.process_materials(text) all_entities = extracted_quantities_spans + extracted_materials_spans @@ -555,6 +699,18 @@ def process_materials(self, text): @staticmethod def box_to_dict(box, color=None, type=None, border=None): + """Convert a GROBID coordinate list into an annotation dict. + + Args: + box: List or tuple of ``[page, x, y, width, height]``. + color: Optional hex colour string for the annotation. + type: Optional annotation type label. + border: Optional border style (e.g. ``"dotted"``). + + Returns: + dict: Annotation dict suitable for ``streamlit-pdf-viewer``, + or empty dict if *box* is invalid. + """ if box is None or box == "" or len(box) < 5: return {} @@ -573,6 +729,18 @@ def box_to_dict(box, color=None, type=None, border=None): @staticmethod def prune_overlapping_annotations(entities: list) -> list: + """Remove overlapping spans, keeping the most informative one. + + When two spans overlap, the longer span is preferred. Adjacent + spans of the same type may be merged (e.g. a split decimal number). + + Args: + entities: List of span dicts with ``offset_start``, + ``offset_end``, ``type``, and ``text`` keys. + + Returns: + list[dict]: Pruned, non-overlapping spans sorted by offset. + """ # Sorting by offsets sorted_entities = sorted(entities, key=lambda d: d['offset_start']) diff --git a/document_qa/langchain.py b/document_qa/langchain.py index 8bb19a5..4056dfd 100644 --- a/document_qa/langchain.py +++ b/document_qa/langchain.py @@ -1,3 +1,12 @@ +"""LangChain vector store extensions for document-qa. + +Extends ChromaDB with support for returning similarity scores **and** +raw embedding vectors alongside retrieved documents. This enables +the Streamlit frontend to compute relevance gradients and the +``question_coefficient`` analysis mode. 
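+
+Example (illustrative sketch; assumes ``db`` is a :class:`ChromaAdvancedRetrieval`
+instance built over an already-embedded document)::
+
+    retriever = db.as_retriever(
+        search_type="similarity_with_embeddings",
+        search_kwargs={"k": 4},
+    )
+    docs = retriever.invoke("What was measured?")
+    print(docs[0].metadata["__similarity"])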
+ +""" + from typing import Any, Optional, List, Dict, Tuple, ClassVar, Collection from langchain.schema import Document @@ -8,6 +17,14 @@ class AdvancedVectorStoreRetriever(VectorStoreRetriever): + """Retriever that can enrich documents with similarity scores and embeddings. + + Extends LangChain's ``VectorStoreRetriever`` with a + ``"similarity_with_embeddings"`` search type. When used, each + returned document's ``metadata`` dict gains ``__similarity`` (float) + and ``__embeddings`` (list[float]) keys. + """ + allowed_search_types: ClassVar[Collection[str]] = ( "similarity", "similarity_score_threshold", @@ -18,6 +35,20 @@ class AdvancedVectorStoreRetriever(VectorStoreRetriever): def _get_relevant_documents( self, query: str, *, run_manager: CallbackManagerForRetrieverRun ) -> List[Document]: + """Fetch relevant documents for the configured search type. + + Supports all standard search types plus + ``"similarity_with_embeddings"`` which attaches score and + embedding vector metadata to each document. + + Args: + query: The search query string. + run_manager: LangChain callback manager. + + Returns: + list[Document]: Retrieved documents, optionally enriched + with similarity scores and embeddings. + """ if self.search_type == "similarity_with_embeddings": docs_scores_and_embeddings = ( @@ -51,13 +82,29 @@ def _get_relevant_documents( class AdvancedVectorStore(VectorStore): + """ + Extension of LangChain's VectorStore that returns a custom retriever + supporting advanced search features. + """ + def as_retriever(self, **kwargs: Any) -> AdvancedVectorStoreRetriever: + """Create a retriever supporting ``similarity_with_embeddings``. + + Accepts the same keyword arguments as the base ``as_retriever``. + """ tags = kwargs.pop("tags", None) or [] tags.extend(self._get_retriever_tags()) return AdvancedVectorStoreRetriever(vectorstore=self, **kwargs, tags=tags) class ChromaAdvancedRetrieval(Chroma, AdvancedVectorStore): + """Chroma vector store with support for embeddings + similarity scores. + + Extends the standard LangChain ``Chroma`` store with + `advanced_similarity_search` which returns ``(Document, score, + embedding)`` triples. + """ + def __init__(self, **kwargs): super().__init__(**kwargs) @@ -94,7 +141,18 @@ def advanced_similarity_search( k: int = DEFAULT_K, filter: Optional[Dict[str, str]] = None, **kwargs: Any, - ) -> [List[Document], float, List[float]]: + ) -> List[Tuple[Document, float, List[float]]]: + """Return documents, similarity scores, and embeddings for *query*. + + Args: + query: The search query. + k: Number of results to return. + filter: Optional Chroma metadata filter. + + Returns: + list[tuple[Document, float, list[float]]]: Triples of + (document, distance, embedding_vector). + """ docs_scores_and_embeddings = self.similarity_search_with_scores_and_embeddings(query, k, filter=filter) return docs_scores_and_embeddings @@ -106,6 +164,21 @@ def similarity_search_with_scores_and_embeddings( where_document: Optional[Dict[str, str]] = None, **kwargs: Any, ) -> List[Tuple[Document, float, List[float]]]: + """Low-level search returning docs with scores and embeddings. + + Queries the Chroma collection requesting ``distances`` and + ``embeddings`` in addition to the usual documents and metadata. + + Args: + query: The search query. + k: Number of results. + filter: Optional metadata filter. + where_document: Optional document-content filter. + + Returns: + list[tuple[Document, float, list[float]]]: Triples of + (document, distance, embedding_vector). 
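+
+        Example (illustrative; assumes ``store`` is an initialised
+        ``ChromaAdvancedRetrieval`` instance)::
+
+            triples = store.similarity_search_with_scores_and_embeddings(
+                "critical temperature", k=3
+            )
+            for doc, distance, vector in triples:
+                print(distance, len(vector))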
+ """ if self._embedding_function is None: results = self.__query_collection( @@ -129,6 +202,15 @@ def similarity_search_with_scores_and_embeddings( def _results_to_docs_scores_and_embeddings(results: Any) -> List[Tuple[Document, float, List[float]]]: + """Unpack raw Chroma query results into ``(Document, score, embedding)`` tuples. + + Args: + results: Dict returned by ``Collection.query()`` with + ``include=['documents', 'metadatas', 'distances', 'embeddings']``. + + Returns: + list[tuple[Document, float, list[float]]]: One tuple per result. + """ return [ (Document(page_content=result[0], metadata=result[1] or {}), result[2], result[3]) for result in zip( diff --git a/streamlit_app.py b/streamlit_app.py index c01ad2b..3976a83 100644 --- a/streamlit_app.py +++ b/streamlit_app.py @@ -1,3 +1,14 @@ +"""Streamlit frontend for the Document Q/A system. + +This module implements the web UI for uploading scientific PDFs, +asking questions via an LLM-powered RAG pipeline, and viewing +highlighted PDF passages. It is the main entry-point when running:: + + streamlit run streamlit_app.py + +Configuration is loaded from environment variables (see ``.env.example``). +""" + import os import re from hashlib import blake2b @@ -110,6 +121,11 @@ def new_file(): + """Reset session state when a new file is uploaded. + + Clears previous embeddings, annotations, and conversation memory + so the pipeline starts fresh for the new document. + """ st.session_state['loaded_embeddings'] = None st.session_state['doc_id'] = None st.session_state['uploaded'] = True @@ -119,11 +135,22 @@ def new_file(): def clear_memory(): + """Clear the conversation buffer memory (chat history).""" st.session_state['memory'].clear() # @st.cache_resource def init_qa(model_name, embeddings_name): + """Initialise the Q/A engine with the selected LLM and embedding models. + + Args: + model_name: Key from ``API_MODELS`` selecting the LLM. + embeddings_name: Key from ``API_EMBEDDINGS`` selecting the + embedding model. + + Returns: + DocumentQAEngine: Ready-to-use engine instance. + """ st.session_state['memory'] = ConversationBufferMemory( memory_key="chat_history", return_messages=True @@ -147,6 +174,14 @@ def init_qa(model_name, embeddings_name): @st.cache_resource def init_ner(): + """Initialise the NER aggregation processor (quantities + materials). + + Uses ``GROBID_QUANTITIES_URL`` and ``GROBID_MATERIALS_URL`` from + the environment. Results are cached across Streamlit reruns. + + Returns: + GrobidAggregationProcessor: Configured processor instance. + """ quantities_client = QuantitiesAPI(os.environ['GROBID_QUANTITIES_URL'], check_server=True) materials_client = NERClientGeneric(ping=True) @@ -173,6 +208,10 @@ def init_ner(): def get_file_hash(fname): + """Compute a BLAKE2b hex digest for the file at *fname*. + + Used to generate deterministic document IDs from file content. + """ hash_md5 = blake2b() with open(fname, "rb") as f: for chunk in iter(lambda: f.read(4096), b""): @@ -181,6 +220,11 @@ def get_file_hash(fname): def play_old_messages(container): + """Re-render previous chat messages into *container*. + + Called on Streamlit reruns to restore the visible conversation + history from ``st.session_state['messages']``. 
+ """ if st.session_state['messages']: for message in st.session_state['messages']: if message['role'] == 'user': @@ -330,10 +374,22 @@ def play_old_messages(container): def rgb_to_hex(rgb): + """Convert an ``(R, G, B)`` tuple to a ``#rrggbb`` hex string.""" return "#{:02x}{:02x}{:02x}".format(*rgb) def generate_color_gradient(num_elements): + """Generate a warm-to-cold hex colour gradient for annotation ranking. + + The first colour (most relevant passage) is orange; the last (least + relevant) is blue. Intermediate colours are linearly interpolated. + + Args: + num_elements: Number of gradient stops to produce. + + Returns: + list[str]: Hex colour strings, e.g. ``['#ffa500', …, '#0000ff']``. + """ # Define warm and cold colors in RGB format warm_color = (255, 165, 0) # Orange cold_color = (0, 0, 255) # Blue