From 371d883f2d51a9797742923304f6b9ac999dc969 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Tue, 10 Mar 2026 22:37:47 -0400
Subject: [PATCH 01/55] Introduce release branch 26.03 with version 26.3.0-RC1

Update all hardcoded version references from 26.1.2 to 26.3.0-RC1
across helm charts, docker-compose, FastAPI, docs, and examples.

Made-with: Cursor
---
 docker-compose.yaml                             |  2 +-
 docs/docs/extraction/content-metadata.md        |  2 +-
 docs/docs/extraction/helm.md                    |  2 +-
 docs/docs/extraction/quickstart-guide.md        |  2 +-
 docs/docs/extraction/quickstart-library-mode.md |  2 +-
 docs/docs/extraction/releasenotes-nv-ingest.md  | 12 ++++++------
 docs/docs/extraction/user-defined-functions.md  |  2 +-
 examples/building_vdb_operator.ipynb            | 14 +++++++-------
 helm/Chart.yaml                                 |  2 +-
 helm/README.md                                  |  8 ++++----
 helm/README.md.gotmpl                           |  6 +++---
 helm/values.yaml                                |  2 +-
 src/nv_ingest/api/main.py                       |  2 +-
 tools/harness/test_configs.yaml                 |  2 +-
 14 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 264df7d7f..4ea6d58ea 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -262,7 +262,7 @@ services:
       - audio
 
   nv-ingest-ms-runtime:
-    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.1.2
+    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.3.0-RC1
     shm_size: 40gb # Should be at minimum 30% of assigned memory per Ray documentation
     build:
       context: ${NV_INGEST_ROOT:-.}
diff --git a/docs/docs/extraction/content-metadata.md b/docs/docs/extraction/content-metadata.md
index ae384b6fc..c02aa8c46 100644
--- a/docs/docs/extraction/content-metadata.md
+++ b/docs/docs/extraction/content-metadata.md
@@ -43,7 +43,7 @@ These fields apply to all content types including text, images, and tables.
 | Subtype | The type of the content for structured data types, such as table or chart. | — |
 | Content | Content extracted from the source.  | Extracted |
 | Description | A text description of the content object. | Generated |
-| Page \# | The page \# of the content in the source. Prior to 26.1.2, this field was 0-indexed. Beginning with 26.1.2, this field is 1-indexed. | Extracted |
+| Page \# | The page \# of the content in the source. Prior to 26.3.0-RC1, this field was 0-indexed. Beginning with 26.3.0-RC1, this field is 1-indexed. | Extracted |
 | Hierarchy | The location or order of the content within the source.  | Extracted |
 
 
diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index 76cbd2a53..40fcd8ec9 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -3,4 +3,4 @@
 <!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.1.2/helm/README.md).
+refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md).
diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 97bcfb578..74cc92824 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -84,7 +84,7 @@ h. Run the command `docker ps`. You should see output similar to the following.
     CONTAINER ID  IMAGE                                            COMMAND                 CREATED         STATUS                  PORTS            NAMES
 uv venv --python 3.12 nv-ingest-dev
 source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2
+uv pip install nv-ingest==26.3.0-RC1 nv-ingest-api==26.3.0-RC1 nv-ingest-client==26.3.0-RC1
 ```
 
 !!! tip
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index e65b4fac1..e4810b8e9 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -34,7 +34,7 @@ Use the following procedure to prepare your environment.
     ```
        uv venv --python 3.12 nvingest && \
          source nvingest/bin/activate && \
-         uv pip install nemo-retriever==26.1.2 milvus-lite==2.4.12
+         uv pip install nemo-retriever==26.3.0-RC1 milvus-lite==2.4.12
     ```
 
     !!! tip
diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 40c9db021..d824af15b 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -8,11 +8,11 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 
 
-## Release 26.01 (26.1.2)
+## Release 26.01 (26.3.0-RC1)
 
 The NeMo Retriever Library 26.01 release adds new hardware and software support, and other improvements.
 
-To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.1.2/helm/README.md).
+To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md).
 
 
 ### Highlights 
@@ -20,7 +20,7 @@ To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library He
 This release contains the following key changes:
 
 - Added functional support for [H200 NVL](https://www.nvidia.com/en-us/data-center/h200/). For details, refer to [Support Matrix](support-matrix.md).
-- All Helm deployments for Kubernetes now use [NVIDIA NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html). For details, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.1.2/helm/README.md). 
+- All Helm deployments for Kubernetes now use [NVIDIA NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html). For details, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md). 
 - Updated RIVA NIM to version 1.4.0. For details, refer to [Extract Speech](audio.md).
 - Updated VLM NIM to [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard). For details, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images).
 - Added VLM caption prompt customization parameters, including reasoning control. For details, refer to [Caption Images and Control Reasoning](python-api-reference.md#caption-images-and-control-reasoning).
@@ -33,7 +33,7 @@ This release contains the following key changes:
 - Large PDFs are now automatically split into chunks and processed in parallel, delivering faster ingestion for long documents. For details, refer to [PDF Pre-Splitting](v2-api-guide.md).
 - Issues maintaining extraction quality while processing very large files are now resolved with the V2 API. For details, refer to [V2 API Guide](v2-api-guide.md).
 - Updated the embedding task to support embedding on custom content fields like the results of summarization functions. For details, refer to [Use the Python API](python-api-reference.md).
-- User-defined function summarization is now using `nemotron-mini-4b-instruct` which provides significant speed improvements. For details, refer to [User-defined Functions](user-defined-functions.md) and [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.1.2/examples/udfs/README.md).
+- User-defined function summarization is now using `nemotron-mini-4b-instruct` which provides significant speed improvements. For details, refer to [User-defined Functions](user-defined-functions.md) and [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/examples/udfs/README.md).
 - In the `Ingestor.extract` method, the defaults for `extract_text` and `extract_images` are now set to `true` for consistency with `extract_tables` and `extract_charts`. For details, refer to [Use the Python API](python-api-reference.md).
 - The `table-structure` profile is no longer available. The table-structure profile is now part of the default profile. For details, refer to [Profile Information](quickstart-guide.md#profile-information).
 - New documentation [Why Throughput Is Dataset-Dependent](throughput-is-dataset-dependent.md).
@@ -49,8 +49,8 @@ This release contains the following key changes:
 
 The following are the known issues that are fixed in this version:
 
-- A10G support is restored. To use A10G hardware, use release 26.1.2 or later. For details, refer to [Support Matrix](support-matrix.md).
-- L40S support is restored. To use L40S hardware, use release 26.1.2 or later. For details, refer to [Support Matrix](support-matrix.md).
+- A10G support is restored. To use A10G hardware, use release 26.3.0-RC1 or later. For details, refer to [Support Matrix](support-matrix.md).
+- L40S support is restored. To use L40S hardware, use release 26.3.0-RC1 or later. For details, refer to [Support Matrix](support-matrix.md).
 - The page number field in the content metadata now starts at 1 instead of 0 so each page number is no longer off by one from what you would expect. For details, refer to [Content Metadata](content-metadata.md).
 - Support for batches that include individual files greater than approximately 400MB is restored. This includes audio files and pdfs.
 
diff --git a/docs/docs/extraction/user-defined-functions.md b/docs/docs/extraction/user-defined-functions.md
index d5f2b72c8..62013d1d8 100644
--- a/docs/docs/extraction/user-defined-functions.md
+++ b/docs/docs/extraction/user-defined-functions.md
@@ -941,6 +941,6 @@ def debug_udf(control_message: IngestControlMessage) -> IngestControlMessage:
 
 ## Related Topics
 
-- [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.1.2/examples/udfs/README.md)
+- [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/examples/udfs/README.md)
 - [User-Defined Stages for NeMo Retriever Library](user-defined-stages.md)
 - [NimClient Usage](nimclient.md)
diff --git a/examples/building_vdb_operator.ipynb b/examples/building_vdb_operator.ipynb
index 11e7a6759..a00923a96 100644
--- a/examples/building_vdb_operator.ipynb
+++ b/examples/building_vdb_operator.ipynb
@@ -486,7 +486,7 @@
     "    self.write_to_index(records)\n",
     "```\n",
     "\n",
-    "This method is called by the NV-Ingest Ingestor class during the ingestion pipeline. For more information on how operators are integrated into NV-Ingest, refer to the [interface implementation](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/client/src/nv_ingest_client/client/interface.py#L324).\n",
+    "This method is called by the NV-Ingest Ingestor class during the ingestion pipeline. For more information on how operators are integrated into NV-Ingest, refer to the [interface implementation](https://github.com/NVIDIA/nv-ingest/blob/release/26.3.0-RC1/client/src/nv_ingest_client/client/interface.py#L324).\n",
     "\n",
     "The simplicity of this method belies its importance - it ensures that indexes are properly configured before data ingestion begins."
    ]
@@ -728,12 +728,12 @@
     "\n",
     "This implementation includes all the features covered in this tutorial:\n",
     "\n",
-    "- ✅ Complete OpenSearch integration with k-NN vector search\n",
-    "- ✅ Configurable connection parameters and index settings\n",
-    "- ✅ Robust data validation and content filtering\n",
-    "- ✅ Efficient batch processing and error handling\n",
-    "- ✅ NVIDIA embedding model integration for query vectorization\n",
-    "- ✅ Optimized response formatting and payload management\n",
+    "- \u2705 Complete OpenSearch integration with k-NN vector search\n",
+    "- \u2705 Configurable connection parameters and index settings\n",
+    "- \u2705 Robust data validation and content filtering\n",
+    "- \u2705 Efficient batch processing and error handling\n",
+    "- \u2705 NVIDIA embedding model integration for query vectorization\n",
+    "- \u2705 Optimized response formatting and payload management\n",
     "\n",
     "### Getting Started with the OpenSearch Operator\n",
     "\n",
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 12f044ec9..aa2201109 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: nv-ingest
 description: NV-Ingest Microservice
 type: application
-version: 26.1.2
+version: 26.3.0-RC1
 maintainers:
   - name: NVIDIA Corporation
     url: https://www.nvidia.com/
diff --git a/helm/README.md b/helm/README.md
index 6bc711a8b..f21832866 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -45,7 +45,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.1.2.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC1.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -54,7 +54,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.1.2"
+    --set image.tag="26.3.0-RC1"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -105,7 +105,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.1.2
+pip install nv-ingest-client==26.3.0-RC1
 ```
 
 #### Rest Endpoint Ingress
@@ -347,7 +347,7 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | fullnameOverride | string | `""` |  |
 | image.pullPolicy | string | `"IfNotPresent"` |  |
 | image.repository | string | `"nvcr.io/nvidia/nemo-microservices/nv-ingest"` |  |
-| image.tag | string | `"26.1.2"` |  |
+| image.tag | string | `"26.3.0-RC1"` |  |
 | imagePullSecrets[0].name | string | `"ngc-api"` |  |
 | imagePullSecrets[1].name | string | `"ngc-secret"` |  |
 | ingress.annotations | object | `{}` |  |
diff --git a/helm/README.md.gotmpl b/helm/README.md.gotmpl
index 6080e6d71..37877e69a 100644
--- a/helm/README.md.gotmpl
+++ b/helm/README.md.gotmpl
@@ -46,7 +46,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.1.2.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC1.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -55,7 +55,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.1.2"
+    --set image.tag="26.3.0-RC1"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -107,7 +107,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.1.2
+pip install nv-ingest-client==26.3.0-RC1
 ```
 
 #### Rest Endpoint Ingress
diff --git a/helm/values.yaml b/helm/values.yaml
index 20323f68a..fb6526ef3 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -28,7 +28,7 @@ nameOverride: ""
 image:
   pullPolicy: IfNotPresent
   repository: "nvcr.io/nvidia/nemo-microservices/nv-ingest"
-  tag: "26.1.2"
+  tag: "26.3.0-RC1"
 
 ## @section Pod Configuration
 ## @param podAnnotations [object] Sets additional annotations on the main deployment pods
diff --git a/src/nv_ingest/api/main.py b/src/nv_ingest/api/main.py
index 98524800f..40fcd67a3 100644
--- a/src/nv_ingest/api/main.py
+++ b/src/nv_ingest/api/main.py
@@ -23,7 +23,7 @@
 app = FastAPI(
     title="NV-Ingest Microservice",
     description="Service for ingesting heterogenous datatypes",
-    version="26.1.2",
+    version="26.3.0-RC1",
     contact={
         "name": "NVIDIA Corporation",
         "url": "https://nvidia.com",
diff --git a/tools/harness/test_configs.yaml b/tools/harness/test_configs.yaml
index 00bfb374d..8481f64a8 100644
--- a/tools/harness/test_configs.yaml
+++ b/tools/harness/test_configs.yaml
@@ -28,7 +28,7 @@ active:
     kubectl_bin: microk8s kubectl  # kubectl binary command (e.g., "kubectl", "microk8s kubectl")
     kubectl_sudo: null  # Prepend sudo to kubectl commands (null = same as helm_sudo)
     chart: nemo-microservices/nv-ingest  # Remote chart reference (set to null to use local chart from ./helm)
-    chart_version: 26.1.2  # Chart version (required for remote charts)
+    chart_version: 26.3.0-RC1  # Chart version (required for remote charts)
     release: nv-ingest
     namespace: nv-ingest
     values_file: .helm-env  # Optional: path to values file

From 72173fc28e7a43f4a1b9f076cdf8db4face74fd3 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Wed, 11 Mar 2026 14:10:38 -0400
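Subject: [PATCH 02/55] Release prep: Update version to 26.03.0-RC1 (#1574)

Change the version string from 26.3.0-RC1 to the zero-padded 26.03.0-RC1
in docker-compose, the Helm chart and READMEs, the docs, the FastAPI app,
and the test harness config. Pin nv-ingest, nv-ingest-api, and
nv-ingest-client to 26.03.0-RC1 in nemo_retriever/pyproject.toml and
tools/harness/pyproject.toml.
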
---
 docker-compose.yaml                             | 2 +-
 docs/docs/extraction/helm.md                    | 2 +-
 docs/docs/extraction/quickstart-guide.md        | 2 +-
 docs/docs/extraction/quickstart-library-mode.md | 2 +-
 helm/Chart.yaml                                 | 2 +-
 helm/README.md                                  | 8 ++++----
 helm/README.md.gotmpl                           | 6 +++---
 helm/values.yaml                                | 2 +-
 nemo_retriever/pyproject.toml                   | 6 +++---
 src/nv_ingest/api/main.py                       | 2 +-
 tools/harness/pyproject.toml                    | 6 +++---
 tools/harness/test_configs.yaml                 | 2 +-
 12 files changed, 21 insertions(+), 21 deletions(-)
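
Reviewer note: the zero-padded `26.03.0-RC1` string pinned in the two
`pyproject.toml` files is not in PEP 440 normalized form. PEP 440
normalization drops leading zeros in release segments and spells a release
candidate as `rcN`, so Python packaging tools are expected to read both
`26.3.0-RC1` and `26.03.0-RC1` as `26.3.0rc1`. The sketch below is a
simplified, stdlib-only approximation for this one version shape. The helper
name `normalize_release_version` is hypothetical, and the pattern is an
illustrative subset of PEP 440, not a full parser.

```python
import re

# Simplified pattern for versions shaped like "26.03.0-RC1" or "26.1.2".
# Assumption: only dotted numeric releases with an optional "rc" suffix
# are handled; this is not a complete PEP 440 implementation.
_VERSION_RE = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)"                  # dotted numeric release segments
    r"(?:[-_.]?(?P<pre>rc)[-_.]?(?P<num>\d+))?$",   # optional release-candidate suffix
    re.IGNORECASE,
)


def normalize_release_version(version: str) -> str:
    """Return an approximate PEP 440 normalized form (hypothetical helper)."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unsupported version shape: {version!r}")
    # int() drops leading zeros, e.g. "03" becomes 3.
    release = ".".join(str(int(part)) for part in match.group("release").split("."))
    if match.group("pre") is None:
        return release
    return f"{release}rc{int(match.group('num'))}"


if __name__ == "__main__":
    for raw in ("26.3.0-RC1", "26.03.0-RC1", "26.1.2"):
        print(raw, "->", normalize_release_version(raw))
```

Docker image tags, Git branch names, and URLs are plain strings and are not
normalized this way, so `26.03.0-RC1` and `26.3.0-RC1` remain distinct in
those places.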

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 4ea6d58ea..94ddf7f70 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -262,7 +262,7 @@ services:
       - audio
 
   nv-ingest-ms-runtime:
-    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.3.0-RC1
+    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.03.0-RC1
     shm_size: 40gb # Should be at minimum 30% of assigned memory per Ray documentation
     build:
       context: ${NV_INGEST_ROOT:-.}
diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index 40fcd8ec9..952a44065 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -3,4 +3,4 @@
 <!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md).
+refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.03.0-RC1/helm/README.md).
diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 74cc92824..43ddce8ed 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -84,7 +84,7 @@ h. Run the command `docker ps`. You should see output similar to the following.
     CONTAINER ID  IMAGE                                            COMMAND                 CREATED         STATUS                  PORTS            NAMES
 uv venv --python 3.12 nv-ingest-dev
 source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.3.0-RC1 nv-ingest-api==26.3.0-RC1 nv-ingest-client==26.3.0-RC1
+uv pip install nv-ingest==26.03.0-RC1 nv-ingest-api==26.03.0-RC1 nv-ingest-client==26.03.0-RC1
 ```
 
 !!! tip
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index e4810b8e9..f193305b9 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -34,7 +34,7 @@ Use the following procedure to prepare your environment.
     ```
        uv venv --python 3.12 nvingest && \
          source nvingest/bin/activate && \
-         uv pip install nemo-retriever==26.3.0-RC1 milvus-lite==2.4.12
+         uv pip install nemo-retriever==26.03.0-RC1 milvus-lite==2.4.12
     ```
 
     !!! tip
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index aa2201109..1b0a3a7e8 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: nv-ingest
 description: NV-Ingest Microservice
 type: application
-version: 26.3.0-RC1
+version: 26.03.0-RC1
 maintainers:
   - name: NVIDIA Corporation
     url: https://www.nvidia.com/
diff --git a/helm/README.md b/helm/README.md
index f21832866..18cea4235 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -45,7 +45,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC1.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC1.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -54,7 +54,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.3.0-RC1"
+    --set image.tag="26.03.0-RC1"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -105,7 +105,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.3.0-RC1
+pip install nv-ingest-client==26.03.0-RC1
 ```
 
 #### Rest Endpoint Ingress
@@ -347,7 +347,7 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | fullnameOverride | string | `""` |  |
 | image.pullPolicy | string | `"IfNotPresent"` |  |
 | image.repository | string | `"nvcr.io/nvidia/nemo-microservices/nv-ingest"` |  |
-| image.tag | string | `"26.3.0-RC1"` |  |
+| image.tag | string | `"26.03.0-RC1"` |  |
 | imagePullSecrets[0].name | string | `"ngc-api"` |  |
 | imagePullSecrets[1].name | string | `"ngc-secret"` |  |
 | ingress.annotations | object | `{}` |  |
diff --git a/helm/README.md.gotmpl b/helm/README.md.gotmpl
index 37877e69a..743ed3610 100644
--- a/helm/README.md.gotmpl
+++ b/helm/README.md.gotmpl
@@ -46,7 +46,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC1.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC1.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -55,7 +55,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.3.0-RC1"
+    --set image.tag="26.03.0-RC1"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -107,7 +107,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.3.0-RC1
+pip install nv-ingest-client==26.03.0-RC1
 ```
 
 #### Rest Endpoint Ingress
diff --git a/helm/values.yaml b/helm/values.yaml
index fb6526ef3..a5fae1aef 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -28,7 +28,7 @@ nameOverride: ""
 image:
   pullPolicy: IfNotPresent
   repository: "nvcr.io/nvidia/nemo-microservices/nv-ingest"
-  tag: "26.3.0-RC1"
+  tag: "26.03.0-RC1"
 
 ## @section Pod Configuration
 ## @param podAnnotations [object] Sets additional annotations on the main deployment pods
diff --git a/nemo_retriever/pyproject.toml b/nemo_retriever/pyproject.toml
index ea22099fb..953a2f014 100644
--- a/nemo_retriever/pyproject.toml
+++ b/nemo_retriever/pyproject.toml
@@ -30,9 +30,9 @@ dependencies = [
   "typer>=0.12.0",
   "pyyaml>=6.0",
   "lancedb",
-  "nv-ingest",
-  "nv-ingest-api",
-  "nv-ingest-client",
+  "nv-ingest==26.03.0-RC1",
+  "nv-ingest-api==26.03.0-RC1",
+  "nv-ingest-client==26.03.0-RC1",
   "fastapi>=0.114.0",
   "uvicorn[standard]>=0.30.0",
   "httpx>=0.27.0",
diff --git a/src/nv_ingest/api/main.py b/src/nv_ingest/api/main.py
index 40fcd67a3..762865766 100644
--- a/src/nv_ingest/api/main.py
+++ b/src/nv_ingest/api/main.py
@@ -23,7 +23,7 @@
 app = FastAPI(
     title="NV-Ingest Microservice",
     description="Service for ingesting heterogenous datatypes",
-    version="26.3.0-RC1",
+    version="26.03.0-RC1",
     contact={
         "name": "NVIDIA Corporation",
         "url": "https://nvidia.com",
diff --git a/tools/harness/pyproject.toml b/tools/harness/pyproject.toml
index 9b2721ba3..07cb085a5 100644
--- a/tools/harness/pyproject.toml
+++ b/tools/harness/pyproject.toml
@@ -10,9 +10,9 @@ dependencies = [
     "pyyaml>=6.0",
     "requests>=2.32.5",
     "pynvml>=11.5.0",
-    "nv-ingest",
-    "nv-ingest-api",
-    "nv-ingest-client",
+    "nv-ingest==26.03.0-RC1",
+    "nv-ingest-api==26.03.0-RC1",
+    "nv-ingest-client==26.03.0-RC1",
     "milvus-lite==2.4.12",
     "pypdfium2>=4.30.0,<5.0.0",
     "nemotron-page-elements-v3>=0.dev0",
diff --git a/tools/harness/test_configs.yaml b/tools/harness/test_configs.yaml
index 8481f64a8..d49a25ffb 100644
--- a/tools/harness/test_configs.yaml
+++ b/tools/harness/test_configs.yaml
@@ -28,7 +28,7 @@ active:
     kubectl_bin: microk8s kubectl  # kubectl binary command (e.g., "kubectl", "microk8s kubectl")
     kubectl_sudo: null  # Prepend sudo to kubectl commands (null = same as helm_sudo)
     chart: nemo-microservices/nv-ingest  # Remote chart reference (set to null to use local chart from ./helm)
-    chart_version: 26.3.0-RC1  # Chart version (required for remote charts)
+    chart_version: 26.03.0-RC1  # Chart version (required for remote charts)
     release: nv-ingest
     namespace: nv-ingest
     values_file: .helm-env  # Optional: path to values file

From 852910c0f502aae0f917d5945c89061762908943 Mon Sep 17 00:00:00 2001
From: Edward Kim <109497216+edknv@users.noreply.github.com>
Date: Wed, 11 Mar 2026 11:47:51 -0700
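Subject: [PATCH 03/55] (retriever) Add .split() for text chunking by token
 count (#1547) (#1576)

Add --text-chunk, --text-chunk-max-tokens, and --text-chunk-overlap-tokens
to the batch and in-process example pipelines. In the batch pipeline, when
any of these options is set, ingestor.split() re-chunks extracted text by
token count (defaults: 1024 max tokens, 150 overlap tokens) before
embedding. The txt and html extractors now take the same chunk parameters
instead of the hardcoded max_tokens=512, overlap_tokens=0.
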
---
 .../nemo_retriever/examples/batch_pipeline.py |  56 +++--
 .../examples/inprocess_pipeline.py            | 219 ++++++++----------
 .../src/nemo_retriever/ingest_modes/batch.py  |  21 ++
 .../nemo_retriever/ingest_modes/inprocess.py  |  15 ++
 .../ingest_modes/lancedb_utils.py             |   7 +
 nemo_retriever/src/nemo_retriever/ingestor.py |   4 +-
 .../src/nemo_retriever/params/models.py       |   2 +-
 .../src/nemo_retriever/txt/ray_data.py        |  23 ++
 .../src/nemo_retriever/txt/split.py           |  80 ++++++-
 9 files changed, 277 insertions(+), 150 deletions(-)
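
Reviewer note: the option handling and the overlap semantics added by this
patch can be sketched as follows. This is a minimal illustration under stated
assumptions, not the code in `txt/split.py`: it operates on an already
tokenized list, whereas the real splitter presumably uses a model tokenizer.
The names `resolve_chunk_options` and `chunk_tokens` are hypothetical; only
the defaults (1024 and 150) and the "either numeric option implies chunking"
rule come from the diff.

```python
from typing import List, Optional, Sequence, Tuple

DEFAULT_MAX_TOKENS = 1024     # default stated in the --text-chunk help text
DEFAULT_OVERLAP_TOKENS = 150  # default stated in the --text-chunk help text


def resolve_chunk_options(
    text_chunk: bool,
    max_tokens: Optional[int],
    overlap_tokens: Optional[int],
) -> Tuple[bool, int, int]:
    """Mirror the flag handling in the batch pipeline (hypothetical helper)."""
    # Passing either numeric option implies --text-chunk.
    enabled = text_chunk or max_tokens is not None or overlap_tokens is not None
    # Same expressions as the patch: `or` for max, `is not None` for overlap.
    resolved_max = max_tokens or DEFAULT_MAX_TOKENS
    resolved_overlap = (
        overlap_tokens if overlap_tokens is not None else DEFAULT_OVERLAP_TOKENS
    )
    return enabled, resolved_max, resolved_overlap


def chunk_tokens(
    tokens: Sequence[str], max_tokens: int, overlap_tokens: int
) -> List[List[str]]:
    """Split tokens into windows of max_tokens that share overlap_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    step = max_tokens - overlap_tokens  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


if __name__ == "__main__":
    enabled, max_tokens, overlap = resolve_chunk_options(False, 4, 1)
    if enabled:
        for chunk in chunk_tokens([str(i) for i in range(10)], max_tokens, overlap):
            print(chunk)
```

One detail worth checking in review: `text_chunk_max_tokens or 1024` treats an
explicit `0` as unset, while an explicit overlap of `0` is honored because
that option is tested with `is not None`.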

diff --git a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
index 5aef13f6d..b7e96ac93 100644
--- a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
@@ -462,6 +462,24 @@ def main(
             "(used when --table-output-format=markdown)."
         ),
     ),
+    text_chunk: bool = typer.Option(
+        False,
+        "--text-chunk",
+        help=(
+            "Re-chunk extracted page text by token count before embedding. "
+            "Uses --text-chunk-max-tokens and --text-chunk-overlap-tokens (defaults: 1024, 150)."
+        ),
+    ),
+    text_chunk_max_tokens: Optional[int] = typer.Option(
+        None,
+        "--text-chunk-max-tokens",
+        help="Max tokens per text chunk (default: 1024). Implies --text-chunk.",
+    ),
+    text_chunk_overlap_tokens: Optional[int] = typer.Option(
+        None,
+        "--text-chunk-overlap-tokens",
+        help="Token overlap between consecutive text chunks (default: 150). Implies --text-chunk.",
+    ),
 ) -> None:
     log_handle, original_stdout, original_stderr = _configure_logging(log_file, debug=bool(debug))
     try:
@@ -643,33 +661,31 @@ def _extract_params(batch_tuning: dict, **overrides: Any) -> ExtractParams:
                 batch_tuning={**batch_tuning, **overrides},
             )
 
+        _text_chunk_params = TextChunkParams(
+            max_tokens=text_chunk_max_tokens or 1024,
+            overlap_tokens=text_chunk_overlap_tokens if text_chunk_overlap_tokens is not None else 150,
+        )
+
         if input_type == "txt":
-            ingestor = (
-                ingestor.files(file_patterns)
-                .extract_txt(TextChunkParams(max_tokens=512, overlap_tokens=0))
-                .embed(embed_params)
-            )
+            ingestor = ingestor.files(file_patterns).extract_txt(_text_chunk_params)
         elif input_type == "html":
-            ingestor = (
-                ingestor.files(file_patterns)
-                .extract_html(TextChunkParams(max_tokens=512, overlap_tokens=0))
-                .embed(embed_params)
-            )
+            ingestor = ingestor.files(file_patterns).extract_html(_text_chunk_params)
         elif input_type == "image":
-            ingestor = (
-                ingestor.files(file_patterns)
-                .extract_image_files(_extract_params(_detection_batch_tuning))
-                .embed(embed_params)
-            )
+            ingestor = ingestor.files(file_patterns).extract_image_files(_extract_params(_detection_batch_tuning))
         elif input_type == "doc":
-            ingestor = ingestor.files(file_patterns).extract(_extract_params(_pdf_batch_tuning)).embed(embed_params)
+            ingestor = ingestor.files(file_patterns).extract(_extract_params(_pdf_batch_tuning))
         else:
-            ingestor = (
-                ingestor.files(file_patterns)
-                .extract(_extract_params(_pdf_batch_tuning, inference_batch_size=page_elements_batch_size))
-                .embed(embed_params)
+            ingestor = ingestor.files(file_patterns).extract(
+                _extract_params(_pdf_batch_tuning, inference_batch_size=page_elements_batch_size)
             )
 
+        enable_text_chunk = text_chunk or text_chunk_max_tokens is not None or text_chunk_overlap_tokens is not None
+        if enable_text_chunk:
+            ingestor = ingestor.split(_text_chunk_params)
+
+        ingestor = ingestor.embed(embed_params)
+
+        logger.info("Running extraction...")
         ingest_start = time.perf_counter()
 
         ingest_results = (
diff --git a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
index d275c2068..6030e90d1 100644
--- a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
@@ -168,6 +168,24 @@ def main(
         "--graphic-elements-invoke-url",
         help="Optional remote endpoint URL for graphic-elements model inference.",
     ),
+    text_chunk: bool = typer.Option(
+        False,
+        "--text-chunk",
+        help=(
+            "Re-chunk extracted page text by token count before embedding. "
+            "Uses --text-chunk-max-tokens and --text-chunk-overlap-tokens (defaults: 1024, 150)."
+        ),
+    ),
+    text_chunk_max_tokens: Optional[int] = typer.Option(
+        None,
+        "--text-chunk-max-tokens",
+        help="Max tokens per text chunk (default: 1024). Implies --text-chunk.",
+    ),
+    text_chunk_overlap_tokens: Optional[int] = typer.Option(
+        None,
+        "--text-chunk-overlap-tokens",
+        help="Token overlap between consecutive text chunks (default: 150). Implies --text-chunk.",
+    ),
 ) -> None:
     if gpu_devices is not None and num_gpus is not None:
         raise typer.BadParameter("--gpu-devices and --num-gpus are mutually exclusive.")
@@ -194,146 +212,93 @@ def main(
 
     ingestor = create_ingestor(run_mode="inprocess")
     if input_type == "txt":
-        ingestor = (
-            ingestor.files(file_patterns)
-            .extract_txt(TextChunkParams(max_tokens=512, overlap_tokens=0))
-            .embed(
-                EmbedParams(
-                    model_name=str(embed_model_name),
-                    embed_invoke_url=embed_invoke_url,
-                    embed_modality=embed_modality,
-                    text_elements_modality=text_elements_modality,
-                    structured_elements_modality=structured_elements_modality,
-                    embed_granularity=embed_granularity,
-                )
-            )
-            .vdb_upload(
-                VdbUploadParams(
-                    lancedb={
-                        "lancedb_uri": LANCEDB_URI,
-                        "table_name": LANCEDB_TABLE,
-                        "overwrite": True,
-                        "create_index": True,
-                    }
-                )
+        ingestor = ingestor.files(file_patterns).extract_txt(
+            TextChunkParams(
+                max_tokens=text_chunk_max_tokens or 1024,
+                overlap_tokens=text_chunk_overlap_tokens if text_chunk_overlap_tokens is not None else 150,
             )
         )
     elif input_type == "html":
-        ingestor = (
-            ingestor.files(file_patterns)
-            .extract_html(TextChunkParams(max_tokens=512, overlap_tokens=0))
-            .embed(
-                EmbedParams(
-                    model_name=str(embed_model_name),
-                    embed_invoke_url=embed_invoke_url,
-                    embed_modality=embed_modality,
-                    text_elements_modality=text_elements_modality,
-                    structured_elements_modality=structured_elements_modality,
-                    embed_granularity=embed_granularity,
-                )
-            )
-            .vdb_upload(
-                VdbUploadParams(
-                    lancedb={
-                        "lancedb_uri": LANCEDB_URI,
-                        "table_name": LANCEDB_TABLE,
-                        "overwrite": True,
-                        "create_index": True,
-                    }
-                )
+        ingestor = ingestor.files(file_patterns).extract_html(
+            TextChunkParams(
+                max_tokens=text_chunk_max_tokens or 1024,
+                overlap_tokens=text_chunk_overlap_tokens if text_chunk_overlap_tokens is not None else 150,
             )
         )
     elif input_type == "doc":
-        ingestor = (
-            ingestor.files(file_patterns)
-            .extract(
-                ExtractParams(
-                    method=method,
-                    extract_text=True,
-                    extract_tables=True,
-                    extract_charts=True,
-                    extract_infographics=False,
-                    use_graphic_elements=use_graphic_elements,
-                    graphic_elements_invoke_url=graphic_elements_invoke_url,
-                    use_table_structure=use_table_structure,
-                    table_output_format=table_output_format,
-                    table_structure_invoke_url=table_structure_invoke_url,
-                    page_elements_invoke_url=page_elements_invoke_url,
-                    ocr_invoke_url=ocr_invoke_url,
-                    batch_tuning={
-                        "nemotron_parse_workers": float(nemotron_parse_actors),
-                        "gpu_nemotron_parse": float(nemotron_parse_gpus_per_actor),
-                        "nemotron_parse_batch_size": float(nemotron_parse_ray_batch_size),
-                    },
-                )
-            )
-            .embed(
-                EmbedParams(
-                    model_name=str(embed_model_name),
-                    embed_invoke_url=embed_invoke_url,
-                    embed_modality=embed_modality,
-                    text_elements_modality=text_elements_modality,
-                    structured_elements_modality=structured_elements_modality,
-                    embed_granularity=embed_granularity,
-                )
-            )
-            .vdb_upload(
-                VdbUploadParams(
-                    lancedb={
-                        "lancedb_uri": LANCEDB_URI,
-                        "table_name": LANCEDB_TABLE,
-                        "overwrite": True,
-                        "create_index": True,
-                    }
-                )
+        ingestor = ingestor.files(file_patterns).extract(
+            ExtractParams(
+                method=method,
+                extract_text=True,
+                extract_tables=True,
+                extract_charts=True,
+                extract_infographics=False,
+                use_graphic_elements=use_graphic_elements,
+                graphic_elements_invoke_url=graphic_elements_invoke_url,
+                use_table_structure=use_table_structure,
+                table_output_format=table_output_format,
+                table_structure_invoke_url=table_structure_invoke_url,
+                page_elements_invoke_url=page_elements_invoke_url,
+                ocr_invoke_url=ocr_invoke_url,
+                batch_tuning={
+                    "nemotron_parse_workers": float(nemotron_parse_actors),
+                    "gpu_nemotron_parse": float(nemotron_parse_gpus_per_actor),
+                    "nemotron_parse_batch_size": float(nemotron_parse_ray_batch_size),
+                },
             )
         )
     else:
-        ingestor = (
-            ingestor.files(file_patterns)
-            .extract(
-                ExtractParams(
-                    method=method,
-                    extract_text=True,
-                    extract_tables=True,
-                    extract_charts=True,
-                    extract_infographics=False,
-                    use_graphic_elements=use_graphic_elements,
-                    graphic_elements_invoke_url=graphic_elements_invoke_url,
-                    use_table_structure=use_table_structure,
-                    table_output_format=table_output_format,
-                    table_structure_invoke_url=table_structure_invoke_url,
-                    page_elements_invoke_url=page_elements_invoke_url,
-                    ocr_invoke_url=ocr_invoke_url,
-                    batch_tuning={
-                        "nemotron_parse_workers": float(nemotron_parse_actors),
-                        "gpu_nemotron_parse": float(nemotron_parse_gpus_per_actor),
-                        "nemotron_parse_batch_size": float(nemotron_parse_ray_batch_size),
-                    },
-                )
-            )
-            .embed(
-                EmbedParams(
-                    model_name=str(embed_model_name),
-                    embed_invoke_url=embed_invoke_url,
-                    embed_modality=embed_modality,
-                    text_elements_modality=text_elements_modality,
-                    structured_elements_modality=structured_elements_modality,
-                    embed_granularity=embed_granularity,
-                )
+        ingestor = ingestor.files(file_patterns).extract(
+            ExtractParams(
+                method=method,
+                extract_text=True,
+                extract_tables=True,
+                extract_charts=True,
+                extract_infographics=False,
+                use_graphic_elements=use_graphic_elements,
+                graphic_elements_invoke_url=graphic_elements_invoke_url,
+                use_table_structure=use_table_structure,
+                table_output_format=table_output_format,
+                table_structure_invoke_url=table_structure_invoke_url,
+                page_elements_invoke_url=page_elements_invoke_url,
+                ocr_invoke_url=ocr_invoke_url,
+                batch_tuning={
+                    "nemotron_parse_workers": float(nemotron_parse_actors),
+                    "gpu_nemotron_parse": float(nemotron_parse_gpus_per_actor),
+                    "nemotron_parse_batch_size": float(nemotron_parse_ray_batch_size),
+                },
             )
-            .vdb_upload(
-                VdbUploadParams(
-                    lancedb={
-                        "lancedb_uri": LANCEDB_URI,
-                        "table_name": LANCEDB_TABLE,
-                        "overwrite": True,
-                        "create_index": True,
-                    }
-                )
+        )
+
+    enable_text_chunk = text_chunk or text_chunk_max_tokens is not None or text_chunk_overlap_tokens is not None
+    if enable_text_chunk:
+        ingestor = ingestor.split(
+            TextChunkParams(
+                max_tokens=text_chunk_max_tokens or 1024,
+                overlap_tokens=text_chunk_overlap_tokens if text_chunk_overlap_tokens is not None else 150,
             )
         )
 
+    ingestor = ingestor.embed(
+        EmbedParams(
+            model_name=str(embed_model_name),
+            embed_invoke_url=embed_invoke_url,
+            embed_modality=embed_modality,
+            text_elements_modality=text_elements_modality,
+            structured_elements_modality=structured_elements_modality,
+            embed_granularity=embed_granularity,
+        )
+    ).vdb_upload(
+        VdbUploadParams(
+            lancedb={
+                "lancedb_uri": LANCEDB_URI,
+                "table_name": LANCEDB_TABLE,
+                "overwrite": True,
+                "create_index": True,
+            }
+        )
+    )
+
     print("Running extraction...")
     ingest_start = time.perf_counter()
     ingestor.ingest(
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
index 49e568770..84c13fe5f 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
@@ -659,6 +659,27 @@ def extract_image_files(self, params: ExtractParams | None = None, **kwargs: Any
 
         return self
 
+    def split(self, params: TextChunkParams | None = None, **kwargs: Any) -> "BatchIngestor":
+        """
+        Re-chunk the ``text`` column by token count (post-extraction transform).
+
+        Adds a ``map_batches(TextChunkActor, ...)`` stage to the Ray Dataset so
+        already-extracted text is re-chunked before embedding.
+        """
+        from nemo_retriever.txt.ray_data import TextChunkActor
+
+        resolved = _coerce_params(params, TextChunkParams, kwargs)
+        self._tasks.append(("split", resolved.model_dump(mode="python")))
+
+        self._rd_dataset = self._rd_dataset.map_batches(
+            TextChunkActor,
+            batch_size=4,
+            batch_format="pandas",
+            num_cpus=1,
+            fn_constructor_kwargs={"params": resolved},
+        )
+        return self
+
     def extract_txt(self, params: TextChunkParams | None = None, **kwargs: Any) -> "BatchIngestor":
         """
         Configure txt-only pipeline: read_binary_files -> TxtSplitActor (bytes -> chunk rows).
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
index 35f8e5185..34eaf7ed5 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
@@ -1267,6 +1267,21 @@ def extract_image_files(self, params: ExtractParams | None = None, **kwargs: Any
         self._append_detection_tasks(kwargs, use_nemotron_parse_only=use_nemotron_parse_only)
         return self
 
+    def split(self, params: TextChunkParams | None = None, **kwargs: Any) -> "InProcessIngestor":
+        """
+        Re-chunk the ``text`` column by token count (post-extraction transform).
+
+        Appends :func:`~nemo_retriever.txt.split.split_df` as a GPU-category
+        task so it runs in sequence after extraction and before embedding.
+        """
+        from nemo_retriever.txt.split import split_df
+
+        resolved = _coerce_params(params, TextChunkParams, kwargs)
+        split_kwargs = resolved.model_dump(mode="python")
+        split_kwargs.pop("encoding", None)
+        self._tasks.append((split_df, split_kwargs))
+        return self
+
     def extract_txt(self, params: TextChunkParams | None = None, **kwargs: Any) -> "InProcessIngestor":
         """
         Configure txt ingestion: tokenizer-based chunking only (no PDF extraction).
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py b/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py
index 1e3a98069..e82c45d17 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py
@@ -128,6 +128,13 @@ def build_lancedb_row(
         metadata_obj["pdf_page"] = pdf_page
     metadata_obj.update(_build_detection_metadata(row))
 
+    # Preserve split metadata (chunk_index, chunk_count) from the original row.
+    orig_meta = getattr(row, "metadata", None)
+    if isinstance(orig_meta, dict):
+        for k in ("chunk_index", "chunk_count"):
+            if k in orig_meta:
+                metadata_obj[k] = orig_meta[k]
+
     source_obj: Dict[str, Any] = {"source_id": str(path)}
 
     row_out: Dict[str, Any] = {
diff --git a/nemo_retriever/src/nemo_retriever/ingestor.py b/nemo_retriever/src/nemo_retriever/ingestor.py
index 9e64f259f..7bbc19486 100644
--- a/nemo_retriever/src/nemo_retriever/ingestor.py
+++ b/nemo_retriever/src/nemo_retriever/ingestor.py
@@ -22,6 +22,7 @@
 from nemo_retriever.application.modes.factory import create_runmode_ingestor
 from nemo_retriever.params import EmbedParams
 from nemo_retriever.params import ExtractParams
+from nemo_retriever.params import TextChunkParams
 from nemo_retriever.params import IngestExecuteParams
 from nemo_retriever.params import IngestorCreateParams
 from nemo_retriever.params import RunMode
@@ -132,8 +133,9 @@ def filter(self) -> "ingestor":
         """Record a filter task configuration."""
         self._not_implemented("filter")
 
-    def split(self) -> "ingestor":
+    def split(self, params: TextChunkParams | None = None, **kwargs: Any) -> "ingestor":
         """Record a split task configuration."""
+        _ = _merge_params(params, kwargs)
         self._not_implemented("split")
 
     def store(self) -> "ingestor":
diff --git a/nemo_retriever/src/nemo_retriever/params/models.py b/nemo_retriever/src/nemo_retriever/params/models.py
index d6f407cd0..66e925162 100644
--- a/nemo_retriever/src/nemo_retriever/params/models.py
+++ b/nemo_retriever/src/nemo_retriever/params/models.py
@@ -65,7 +65,7 @@ class PdfSplitParams(_ParamsModel):
 
 
 class TextChunkParams(_ParamsModel):
-    max_tokens: int = 512
+    max_tokens: int = 1024
     overlap_tokens: int = 0
     tokenizer_model_id: Optional[str] = None
     encoding: str = "utf-8"
diff --git a/nemo_retriever/src/nemo_retriever/txt/ray_data.py b/nemo_retriever/src/nemo_retriever/txt/ray_data.py
index 1cb86970e..f01191814 100644
--- a/nemo_retriever/src/nemo_retriever/txt/ray_data.py
+++ b/nemo_retriever/src/nemo_retriever/txt/ray_data.py
@@ -17,6 +17,29 @@
 from .split import txt_bytes_to_chunks_df
 
 
+class TextChunkActor:
+    """
+    Ray Data map_batches callable: re-chunk existing ``text`` column by token count.
+
+    This is the batch-mode equivalent of :func:`~nemo_retriever.txt.split.split_df`.
+    Constructor takes :class:`TextChunkParams`; ``__call__`` receives a pandas batch
+    and returns the split result.
+    """
+
+    def __init__(self, params: TextChunkParams | None = None) -> None:
+        self._params = params or TextChunkParams()
+
+    def __call__(self, batch_df: pd.DataFrame) -> pd.DataFrame:
+        from .split import split_df
+
+        if not isinstance(batch_df, pd.DataFrame) or batch_df.empty:
+            return batch_df
+
+        kw = self._params.model_dump(mode="python")
+        kw.pop("encoding", None)
+        return split_df(batch_df, **kw)
+
+
 class TxtSplitActor:
     """
     Ray Data map_batches callable: DataFrame with bytes, path -> DataFrame of chunks.
diff --git a/nemo_retriever/src/nemo_retriever/txt/split.py b/nemo_retriever/src/nemo_retriever/txt/split.py
index d47b8dfd3..b94dba30d 100644
--- a/nemo_retriever/src/nemo_retriever/txt/split.py
+++ b/nemo_retriever/src/nemo_retriever/txt/split.py
@@ -18,7 +18,7 @@
 from nemo_retriever.params import TextChunkParams
 
 DEFAULT_TOKENIZER_MODEL_ID = "nvidia/llama-nemotron-embed-1b-v2"
-DEFAULT_MAX_TOKENS = 512
+DEFAULT_MAX_TOKENS = 1024
 DEFAULT_OVERLAP_TOKENS = 0
 
 
@@ -91,6 +91,84 @@ def split_text_by_tokens(
     return chunks if chunks else [text]
 
 
+def split_df(
+    df: pd.DataFrame,
+    *,
+    max_tokens: int = DEFAULT_MAX_TOKENS,
+    overlap_tokens: int = DEFAULT_OVERLAP_TOKENS,
+    tokenizer_model_id: Optional[str] = None,
+    tokenizer_cache_dir: Optional[str] = None,
+    encoding: str = "utf-8",
+) -> pd.DataFrame:
+    """
+    Re-chunk a DataFrame's ``text`` column by token count.
+
+    This is a **post-extraction** transform: it takes rows that already have a
+    ``text`` column (produced by ``extract`` / ``extract_txt`` / etc.) and
+    splits long texts into multiple rows using :func:`split_text_by_tokens`.
+    Other columns (``path``, ``metadata``, …) are copied to every output row.
+    For rows split into several chunks, ``page_number`` is overwritten with the
+    1-based chunk index and ``metadata`` gains ``chunk_index``/``chunk_count``.
+
+    Rows whose ``text`` is empty or missing are passed through unchanged.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        Input DataFrame with at least a ``text`` column.
+    max_tokens, overlap_tokens, tokenizer_model_id, tokenizer_cache_dir, encoding
+        Forwarded to :func:`split_text_by_tokens` / :func:`_get_tokenizer`.
+
+    Returns
+    -------
+    pd.DataFrame
+        Expanded DataFrame (one row per chunk).
+    """
+    if df.empty:
+        return df.copy()
+
+    model_id = tokenizer_model_id or DEFAULT_TOKENIZER_MODEL_ID
+    tokenizer = _get_tokenizer(model_id, cache_dir=tokenizer_cache_dir)
+
+    out_rows: List[Dict[str, Any]] = []
+    for _, row in df.iterrows():
+        row_dict = row.to_dict()
+        text = row_dict.get("text")
+        if not isinstance(text, str) or not text.strip():
+            out_rows.append(row_dict)
+            continue
+
+        chunks = split_text_by_tokens(
+            text,
+            tokenizer=tokenizer,
+            max_tokens=max_tokens,
+            overlap_tokens=overlap_tokens,
+        )
+        if len(chunks) <= 1:
+            out_rows.append(row_dict)
+            continue
+
+        import copy
+
+        for i, chunk in enumerate(chunks):
+            new_row = {k: copy.deepcopy(v) if isinstance(v, (dict, list)) else v for k, v in row_dict.items()}
+            new_row["text"] = chunk
+            if "content" in new_row:
+                new_row["content"] = chunk
+            meta = new_row.get("metadata")
+            if isinstance(meta, dict):
+                meta["chunk_index"] = i
+                meta["chunk_count"] = len(chunks)
+                meta["content"] = chunk
+            new_row["page_number"] = i + 1
+            out_rows.append(new_row)
+
+    if not out_rows:
+        return df.iloc[:0].copy()
+
+    return pd.DataFrame(out_rows)
+
+
 def txt_file_to_chunks_df(
     path: str,
     params: TextChunkParams | None = None,
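Reviewer note (not part of the patch): the row expansion performed by `split_df` above can be sketched without pandas or a real tokenizer. This is a minimal illustration under stated assumptions, not the library's implementation: whitespace-separated words stand in for tokenizer tokens, plain dicts stand in for DataFrame rows, and `chunk_tokens` and `expand_row` are hypothetical names that do not exist in `nemo_retriever`.

```python
import copy
from typing import Any, Dict, List


def chunk_tokens(tokens: List[str], max_tokens: int, overlap_tokens: int) -> List[List[str]]:
    """Slide a window of max_tokens over tokens, stepping by max_tokens - overlap_tokens."""
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap_tokens < max_tokens")
    step = max_tokens - overlap_tokens
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


def expand_row(row: Dict[str, Any], max_tokens: int, overlap_tokens: int) -> List[Dict[str, Any]]:
    """Return one row per chunk; rows with empty or short text pass through unchanged."""
    text = row.get("text")
    if not isinstance(text, str) or not text.strip():
        return [row]
    # Assumption: whitespace words approximate tokens (the patch uses a real tokenizer).
    chunks = chunk_tokens(text.split(), max_tokens, overlap_tokens)
    if len(chunks) <= 1:
        return [row]
    out: List[Dict[str, Any]] = []
    for i, chunk in enumerate(chunks):
        new_row = copy.deepcopy(row)  # keep nested metadata independent per chunk
        new_row["text"] = " ".join(chunk)
        meta = new_row.get("metadata")
        if isinstance(meta, dict):
            meta["chunk_index"] = i
            meta["chunk_count"] = len(chunks)
        out.append(new_row)
    return out
```

With `max_tokens=4` and `overlap_tokens=1`, a seven-word text yields two chunks that share one word, and each output row carries its own `chunk_index` and `chunk_count`.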

From 64c694b440709639c2ac5dcf3bed16e024b938fc Mon Sep 17 00:00:00 2001
From: Edward Kim <109497216+edknv@users.noreply.github.com>
Date: Wed, 11 Mar 2026 11:48:16 -0700
Subject: [PATCH 04/55] (retriever) add documentation for image file support
 (#1571) (#1577)

Co-authored-by: Kurt Heiss <kheiss@nvidia.com>
---
 nemo_retriever/README.md | 45 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/nemo_retriever/README.md b/nemo_retriever/README.md
index dd4d29eff..c3292ba69 100644
--- a/nemo_retriever/README.md
+++ b/nemo_retriever/README.md
@@ -165,6 +165,51 @@ uv run python -m nemo_retriever.examples.batch_pipeline /datasets/nemo-retriever
 ```
 This uses the module form of the NeMo Retriever Library batch pipeline example and points it at a sample dataset directory, verifying both ingestion and OCR under CUDA 13.
 
+7. Ingest image files
+
+NeMo Retriever Library can ingest standalone image files through the same detection, OCR, and embedding pipeline used for PDFs. Supported formats are PNG, JPEG, BMP, TIFF, and SVG. SVG support requires the optional `cairosvg` package. Each image is treated as a single page.
+
+To run the batch pipeline on a directory of images, use `--input-type image` to match all supported formats at once.
+
+```bash
+uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/images \
+  --input-type image
+```
+
+You can also pass a single-format shortcut to restrict which files are picked up.
+
+```bash
+uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/images \
+  --input-type png
+```
+
+Valid single-format values are `png`, `jpg`, `jpeg`, `bmp`, `tiff`, `tif`, and `svg`.
+
+For in-process mode, build the ingestor chain with `extract_image_files` instead of `extract`.
+
+```python
+from nemo_retriever import create_ingestor
+from nemo_retriever.params import ExtractParams, EmbedParams
+
+ingestor = (
+    create_ingestor(run_mode="inprocess")
+    .files("images/*.png")
+    .extract_image_files(
+        ExtractParams(
+            extract_text=True,
+            extract_tables=True,
+            extract_charts=True,
+            extract_infographics=True,
+        )
+    )
+    .embed()
+    .vdb_upload()
+    .ingest()
+)
+```
+
+The `ExtractParams` content flags `extract_text`, `extract_tables`, `extract_charts`, and `extract_infographics` all apply to image ingestion.
+
 ### Render one document as markdown
 
 If you want a readable page-by-page markdown view of a single in-process result, pass the

From d38abb2ebfaf40a811f16866fde71cdbb60a4925 Mon Sep 17 00:00:00 2001
From: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Date: Wed, 11 Mar 2026 14:59:18 -0400
Subject: [PATCH 05/55] [26.03] Refactor get_*_model_name to avoid caching
 fallback model name (#1578)

---
 .../primitives/nim/model_interface/ocr.py     | 25 +++++---
 .../primitives/nim/model_interface/yolox.py   | 61 ++++++++++++++++---
 2 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/api/src/nv_ingest_api/internal/primitives/nim/model_interface/ocr.py b/api/src/nv_ingest_api/internal/primitives/nim/model_interface/ocr.py
index f06f80d29..ef64c8f4c 100644
--- a/api/src/nv_ingest_api/internal/primitives/nim/model_interface/ocr.py
+++ b/api/src/nv_ingest_api/internal/primitives/nim/model_interface/ocr.py
@@ -16,7 +16,8 @@
 import tritonclient.grpc as grpcclient
 
 from nv_ingest_api.internal.primitives.nim import ModelInterface
-from nv_ingest_api.internal.primitives.nim.model_interface.decorators import multiprocessing_cache
+from nv_ingest_api.internal.primitives.nim.model_interface.decorators import global_cache
+from nv_ingest_api.internal.primitives.nim.model_interface.decorators import lock
 from nv_ingest_api.internal.primitives.nim.model_interface.helpers import preprocess_image_for_paddle
 from nv_ingest_api.util.image_processing.transforms import base64_to_numpy
 
@@ -752,12 +753,11 @@ def _format_single_batch(
             raise ValueError("Invalid protocol specified. Must be 'grpc' or 'http'.")
 
 
-@multiprocessing_cache(max_calls=100)  # Cache results first to avoid redundant retries from backoff
 @backoff.on_predicate(backoff.expo, max_time=30)
 def get_ocr_model_name(ocr_grpc_endpoint=None, default_model_name=DEFAULT_OCR_MODEL_NAME):
     """
     Determines the OCR model name by checking the environment, querying the gRPC endpoint,
-    or falling back to a default.
+    or falling back to a default. Only caches when the repository is successfully queried.
     """
     # 1. Check for an explicit override from the environment variable first.
     ocr_model_name = os.getenv("OCR_MODEL_NAME", None)
@@ -769,14 +769,25 @@ def get_ocr_model_name(ocr_grpc_endpoint=None, default_model_name=DEFAULT_OCR_MO
         logger.debug(f"No OCR gRPC endpoint provided. Falling back to default model name '{default_model_name}'.")
         return default_model_name
 
-    # 3. Attempt to query the gRPC endpoint to discover the model name.
+    # 3. Check cache (only populated on successful repository query).
+    key = (
+        "get_ocr_model_name",
+        (ocr_grpc_endpoint,),
+        frozenset({"default_model_name": default_model_name}.items()),
+    )
+    with lock:
+        if key in global_cache:
+            return global_cache[key]
+
+    # 4. Attempt to query the gRPC endpoint to discover the model name.
     try:
         client = grpcclient.InferenceServerClient(ocr_grpc_endpoint)
         model_index = client.get_model_repository_index(as_json=True)
         model_names = [x["name"] for x in model_index.get("models", [])]
         ocr_model_name = model_names[0]
+        with lock:
+            global_cache[key] = ocr_model_name
+        return ocr_model_name
     except Exception:
         logger.warning(f"Failed to get ocr model name after 30 seconds. Falling back to '{default_model_name}'.")
-        ocr_model_name = default_model_name
-
-    return ocr_model_name
+        return default_model_name
diff --git a/api/src/nv_ingest_api/internal/primitives/nim/model_interface/yolox.py b/api/src/nv_ingest_api/internal/primitives/nim/model_interface/yolox.py
index 0b1084905..ff93cb953 100644
--- a/api/src/nv_ingest_api/internal/primitives/nim/model_interface/yolox.py
+++ b/api/src/nv_ingest_api/internal/primitives/nim/model_interface/yolox.py
@@ -20,6 +20,8 @@
 
 from nv_ingest_api.internal.primitives.nim import ModelInterface
 import tritonclient.grpc as grpcclient
+from nv_ingest_api.internal.primitives.nim.model_interface.decorators import global_cache
+from nv_ingest_api.internal.primitives.nim.model_interface.decorators import lock
 from nv_ingest_api.internal.primitives.nim.model_interface.decorators import multiprocessing_cache
 from nv_ingest_api.internal.primitives.nim.model_interface.helpers import get_model_name
 from nv_ingest_api.util.image_processing import scale_image_to_encoding_size
@@ -135,10 +137,36 @@ def __init__(
         self.class_labels = class_labels
 
         if endpoints:
-            self.model_name = get_yolox_model_name(endpoints[0], default_model_name="yolox_ensemble")
-            self._grpc_uses_bls = self.model_name == "pipeline"
+            self._yolox_grpc_endpoint = endpoints[0]
+            self._model_name = None
+            self._grpc_uses_bls_value = None  # Resolved on first use
         else:
-            self._grpc_uses_bls = False
+            self._yolox_grpc_endpoint = None
+            self._model_name = None
+            self._grpc_uses_bls_value = False
+
+    def _resolve_yolox_model_name_if_needed(self) -> None:
+        """Resolve model name and BLS flag from the gRPC endpoint on first use. Cached on the instance."""
+        if self._yolox_grpc_endpoint is None:
+            return
+        if self._model_name is not None:
+            return
+        self._model_name = get_yolox_model_name(self._yolox_grpc_endpoint, default_model_name="yolox_ensemble")
+        self._grpc_uses_bls_value = self._model_name == "pipeline"
+
+    @property
+    def model_name(self) -> Optional[str]:
+        self._resolve_yolox_model_name_if_needed()
+        return self._model_name
+
+    @model_name.setter
+    def model_name(self, value: Optional[str]) -> None:
+        self._model_name = value
+
+    @property
+    def _grpc_uses_bls(self) -> bool:
+        self._resolve_yolox_model_name_if_needed()
+        return bool(self._grpc_uses_bls_value)
 
     def prepare_data_for_inference(self, data: Dict[str, Any]) -> Dict[str, Any]:
         """
@@ -2117,7 +2145,6 @@ def postprocess_included_texts(boxes, confs, labels, classes):
     return boxes, labels, confs
 
 
-@multiprocessing_cache(max_calls=100)  # Cache results first to avoid redundant retries from backoff
 @backoff.on_predicate(backoff.expo, max_time=30)
 def get_yolox_model_name(yolox_grpc_endpoint, default_model_name="yolox"):
     # If a gRPC endpoint isn't provided (common when using HTTP-only NIM endpoints),
@@ -2131,6 +2158,15 @@ def get_yolox_model_name(yolox_grpc_endpoint, default_model_name="yolox"):
     ):
         return default_model_name
 
+    key = (
+        "get_yolox_model_name",
+        (yolox_grpc_endpoint,),
+        frozenset({"default_model_name": default_model_name}.items()),
+    )
+    with lock:
+        if key in global_cache:
+            return global_cache[key]
+
     try:
         client = grpcclient.InferenceServerClient(yolox_grpc_endpoint)
         model_index = client.get_model_repository_index(as_json=True)
@@ -2148,14 +2184,23 @@ def get_yolox_model_name(yolox_grpc_endpoint, default_model_name="yolox"):
             "nemoretriever-page-elements-v2",
         ):
             if preferred in model_names:
-                return preferred
+                result = preferred
+                with lock:
+                    global_cache[key] = result
+                return result
 
         # Otherwise pick a best-effort match for newer model names.
         candidates = [m for m in model_names if isinstance(m, str) and ("yolox" in m or "page-elements" in m)]
         if candidates:
-            return sorted(candidates)[0]
-
-        return default_model_name
+            result = sorted(candidates)[0]
+            with lock:
+                global_cache[key] = result
+            return result
+
+        result = default_model_name
+        with lock:
+            global_cache[key] = result
+        return result
     except Exception as e:
         logger.warning(
             "Failed to inspect YOLOX model repository at '%s' (%s). Falling back to '%s'.",

From fbd2e28139a64a934ee2fbc393cb8a6cbfee64da Mon Sep 17 00:00:00 2001
From: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Date: Wed, 11 Mar 2026 16:06:51 -0400
Subject: [PATCH 06/55] [26.03] (helm) More nemotron rebranding (#1581)

---
 ci/scripts/validate_deployment_configs.py     |  4 +-
 helm/README.md                                | 82 +++++++++----------
 helm/mig/nv-ingest-mig-values-25x.yaml        |  6 +-
 helm/mig/nv-ingest-mig-values.yaml            |  6 +-
 helm/overrides/values-a100-40gb.yaml          |  4 +-
 helm/overrides/values-a10g.yaml               |  4 +-
 helm/overrides/values-l40s.yaml               |  4 +-
 .../llama-3.2-nv-rerankqa-1b-v2.yaml          | 47 -----------
 ...2.yaml => llama-nemotron-embed-1b-v2.yaml} |  2 +-
 .../llama-nemotron-rerank-1b-v2.yaml          | 47 +++++++++++
 helm/templates/nemoretriever-ocr-v1.yaml      | 41 ----------
 ...yaml => nemotron-graphic-elements-v1.yaml} |  0
 helm/templates/nemotron-ocr-v1.yaml           | 41 ++++++++++
 ...v3.yaml => nemotron-page-elements-v3.yaml} |  0
 ....yaml => nemotron-table-structure-v1.yaml} |  0
 helm/values.yaml                              | 42 +++++-----
 tools/harness/test_configs.yaml               |  2 +-
 17 files changed, 166 insertions(+), 166 deletions(-)
 delete mode 100644 helm/templates/llama-3.2-nv-rerankqa-1b-v2.yaml
 rename helm/templates/{llama-3.2-nv-embedqa-1b-v2.yaml => llama-nemotron-embed-1b-v2.yaml} (98%)
 create mode 100644 helm/templates/llama-nemotron-rerank-1b-v2.yaml
 delete mode 100644 helm/templates/nemoretriever-ocr-v1.yaml
 rename helm/templates/{nemoretriever-graphic-elements-v1.yaml => nemotron-graphic-elements-v1.yaml} (100%)
 create mode 100644 helm/templates/nemotron-ocr-v1.yaml
 rename helm/templates/{nemoretriever-page-elements-v3.yaml => nemotron-page-elements-v3.yaml} (100%)
 rename helm/templates/{nemoretriever-table-structure-v1.yaml => nemotron-table-structure-v1.yaml} (100%)
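
Reviewer note: this patch renames two keys under `nimOperator` in the Helm values (`nemoretriever_ocr_v1` to `ocr`, and `llama_3_2_nv_rerankqa_1b_v2` to `rerankqa`), so override files written against the old names will presumably need the same rename. A minimal sketch of that migration on an already-parsed values mapping follows; `migrate_values` and `RENAMED_NIM_OPERATOR_KEYS` are hypothetical helpers for illustration and are not part of the chart or this patch.

```python
from copy import deepcopy

# Old -> new key names under `nimOperator`, as renamed in this patch.
RENAMED_NIM_OPERATOR_KEYS = {
    "nemoretriever_ocr_v1": "ocr",
    "llama_3_2_nv_rerankqa_1b_v2": "rerankqa",
}


def migrate_values(values: dict) -> dict:
    """Return a copy of `values` with the renamed nimOperator keys applied."""
    migrated = deepcopy(values)
    operator = migrated.get("nimOperator")
    if not isinstance(operator, dict):
        return migrated  # nothing to migrate

    for old, new in RENAMED_NIM_OPERATOR_KEYS.items():
        if old in operator:
            if new in operator:
                # Refuse to guess which of the two blocks should win.
                raise ValueError(f"both {old!r} and {new!r} are set under nimOperator")
            operator[new] = operator.pop(old)
    return migrated


overrides = {"nimOperator": {"llama_3_2_nv_rerankqa_1b_v2": {"enabled": True}}}
print(migrate_values(overrides))  # {'nimOperator': {'rerankqa': {'enabled': True}}}
```

The sketch does not cover the service rename from `llama-32-nv-embedqa-1b-v2` to `llama-nemotron-embed-1b-v2`; any custom `EMBEDDING_NIM_ENDPOINT` that points at the old service name would need a separate update.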

diff --git a/ci/scripts/validate_deployment_configs.py b/ci/scripts/validate_deployment_configs.py
index 1cb528949..14c8b3d43 100755
--- a/ci/scripts/validate_deployment_configs.py
+++ b/ci/scripts/validate_deployment_configs.py
@@ -49,9 +49,9 @@ def __str__(self) -> str:
     "page-elements": "page_elements",
     "graphic-elements": "graphic_elements",
     "table-structure": "table_structure",
-    "ocr": "nemoretriever_ocr_v1",
+    "ocr": "ocr",
     "embedding": "embedqa",
-    "reranker": "llama_3_2_nv_rerankqa_1b_v2",
+    "reranker": "rerankqa",
     "nemotron-parse": "nemotron_parse",
     "vlm": "nemotron_nano_12b_v2_vl",
     "audio": "audio",
diff --git a/helm/README.md b/helm/README.md
index 18cea4235..c446f3e3c 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -298,7 +298,7 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | envVars.AUDIO_GRPC_ENDPOINT | string | `"audio:50051"` |  |
 | envVars.AUDIO_INFER_PROTOCOL | string | `"grpc"` |  |
 | envVars.COMPONENTS_TO_READY_CHECK | string | `"ALL"` |  |
-| envVars.EMBEDDING_NIM_ENDPOINT | string | `"http://llama-32-nv-embedqa-1b-v2:8000/v1"` |  |
+| envVars.EMBEDDING_NIM_ENDPOINT | string | `"http://llama-nemotron-embed-1b-v2:8000/v1"` |  |
 | envVars.EMBEDDING_NIM_MODEL_NAME | string | `"nvidia/llama-nemotron-embed-1b-v2"` |  |
 | envVars.IMAGE_STORAGE_PUBLIC_BASE_URL | string | `""` |  |
 | envVars.IMAGE_STORAGE_URI | string | `"s3://nv-ingest/artifacts/store/images"` |  |
@@ -465,46 +465,46 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | nimOperator.graphic_elements.storage.pvc.create | bool | `true` |  |
 | nimOperator.graphic_elements.storage.pvc.size | string | `"25Gi"` |  |
 | nimOperator.graphic_elements.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.authSecret | string | `"ngc-api"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.enabled | bool | `false` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.env[0].value | string | `"8000"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.env[1].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.env[1].value | string | `"1"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.expose.service.grpcPort | int | `8001` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.expose.service.port | int | `8000` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.expose.service.type | string | `"ClusterIP"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.pullPolicy | string | `"IfNotPresent"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.pullSecrets[0] | string | `"ngc-secret"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.repository | string | `"nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.tag | string | `"1.10.0"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.replicas | int | `1` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.resources.limits."nvidia.com/gpu" | int | `1` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.create | bool | `true` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.size | string | `"50Gi"` |  |
-| nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
-| nimOperator.nemoretriever_ocr_v1.authSecret | string | `"ngc-api"` |  |
-| nimOperator.nemoretriever_ocr_v1.enabled | bool | `true` |  |
-| nimOperator.nemoretriever_ocr_v1.env[0].name | string | `"OMP_NUM_THREADS"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[0].value | string | `"8"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[1].name | string | `"NIM_HTTP_API_PORT"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[1].value | string | `"8000"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[2].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[2].value | string | `"1"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[3].name | string | `"NIM_TRITON_MAX_BATCH_SIZE"` |  |
-| nimOperator.nemoretriever_ocr_v1.env[3].value | string | `"32"` |  |
-| nimOperator.nemoretriever_ocr_v1.expose.service.grpcPort | int | `8001` |  |
-| nimOperator.nemoretriever_ocr_v1.expose.service.port | int | `8000` |  |
-| nimOperator.nemoretriever_ocr_v1.expose.service.type | string | `"ClusterIP"` |  |
-| nimOperator.nemoretriever_ocr_v1.image.pullPolicy | string | `"IfNotPresent"` |  |
-| nimOperator.nemoretriever_ocr_v1.image.pullSecrets[0] | string | `"ngc-secret"` |  |
-| nimOperator.nemoretriever_ocr_v1.image.repository | string | `"nvcr.io/nim/nvidia/nemotron-ocr-v1"` |  |
-| nimOperator.nemoretriever_ocr_v1.image.tag | string | `"1.3.0"` |  |
-| nimOperator.nemoretriever_ocr_v1.replicas | int | `1` |  |
-| nimOperator.nemoretriever_ocr_v1.resources.limits."nvidia.com/gpu" | int | `1` |  |
-| nimOperator.nemoretriever_ocr_v1.storage.pvc.create | bool | `true` |  |
-| nimOperator.nemoretriever_ocr_v1.storage.pvc.size | string | `"25Gi"` |  |
-| nimOperator.nemoretriever_ocr_v1.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
+| nimOperator.rerankqa.authSecret | string | `"ngc-api"` |  |
+| nimOperator.rerankqa.enabled | bool | `false` |  |
+| nimOperator.rerankqa.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
+| nimOperator.rerankqa.env[0].value | string | `"8000"` |  |
+| nimOperator.rerankqa.env[1].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
+| nimOperator.rerankqa.env[1].value | string | `"1"` |  |
+| nimOperator.rerankqa.expose.service.grpcPort | int | `8001` |  |
+| nimOperator.rerankqa.expose.service.port | int | `8000` |  |
+| nimOperator.rerankqa.expose.service.type | string | `"ClusterIP"` |  |
+| nimOperator.rerankqa.image.pullPolicy | string | `"IfNotPresent"` |  |
+| nimOperator.rerankqa.image.pullSecrets[0] | string | `"ngc-secret"` |  |
+| nimOperator.rerankqa.image.repository | string | `"nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2"` |  |
+| nimOperator.rerankqa.image.tag | string | `"1.10.0"` |  |
+| nimOperator.rerankqa.replicas | int | `1` |  |
+| nimOperator.rerankqa.resources.limits."nvidia.com/gpu" | int | `1` |  |
+| nimOperator.rerankqa.storage.pvc.create | bool | `true` |  |
+| nimOperator.rerankqa.storage.pvc.size | string | `"50Gi"` |  |
+| nimOperator.rerankqa.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
+| nimOperator.ocr.authSecret | string | `"ngc-api"` |  |
+| nimOperator.ocr.enabled | bool | `true` |  |
+| nimOperator.ocr.env[0].name | string | `"OMP_NUM_THREADS"` |  |
+| nimOperator.ocr.env[0].value | string | `"8"` |  |
+| nimOperator.ocr.env[1].name | string | `"NIM_HTTP_API_PORT"` |  |
+| nimOperator.ocr.env[1].value | string | `"8000"` |  |
+| nimOperator.ocr.env[2].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
+| nimOperator.ocr.env[2].value | string | `"1"` |  |
+| nimOperator.ocr.env[3].name | string | `"NIM_TRITON_MAX_BATCH_SIZE"` |  |
+| nimOperator.ocr.env[3].value | string | `"32"` |  |
+| nimOperator.ocr.expose.service.grpcPort | int | `8001` |  |
+| nimOperator.ocr.expose.service.port | int | `8000` |  |
+| nimOperator.ocr.expose.service.type | string | `"ClusterIP"` |  |
+| nimOperator.ocr.image.pullPolicy | string | `"IfNotPresent"` |  |
+| nimOperator.ocr.image.pullSecrets[0] | string | `"ngc-secret"` |  |
+| nimOperator.ocr.image.repository | string | `"nvcr.io/nim/nvidia/nemotron-ocr-v1"` |  |
+| nimOperator.ocr.image.tag | string | `"1.3.0"` |  |
+| nimOperator.ocr.replicas | int | `1` |  |
+| nimOperator.ocr.resources.limits."nvidia.com/gpu" | int | `1` |  |
+| nimOperator.ocr.storage.pvc.create | bool | `true` |  |
+| nimOperator.ocr.storage.pvc.size | string | `"25Gi"` |  |
+| nimOperator.ocr.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
 | nimOperator.nemotron_nano_12b_v2_vl.authSecret | string | `"ngc-api"` |  |
 | nimOperator.nemotron_nano_12b_v2_vl.enabled | bool | `false` |  |
 | nimOperator.nemotron_nano_12b_v2_vl.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
diff --git a/helm/mig/nv-ingest-mig-values-25x.yaml b/helm/mig/nv-ingest-mig-values-25x.yaml
index d1b108e2e..5f9757518 100644
--- a/helm/mig/nv-ingest-mig-values-25x.yaml
+++ b/helm/mig/nv-ingest-mig-values-25x.yaml
@@ -38,7 +38,7 @@ nemotron-table-structure-v1:
       nvidia.com/gpu: 0
       nvidia.com/mig-1g.10gb: 1
 
-nvidia-nim-llama-32-nv-embedqa-1b-v2:
+nvidia-nim-llama-nemotron-embed-1b-v2:
   resources:
     limits:
       nvidia.com/gpu: 0
@@ -75,8 +75,8 @@ text-embedding-nim:
       nvidia.com/gpu: 0
       nvidia.com/mig-1g.10gb: 1
 
-# If you want to deploy llama-32-nv-rerankqa-1b-v2
-llama-32-nv-rerankqa-1b-v2:
+# If you want to deploy llama-nemotron-rerank-1b-v2
+llama-nemotron-rerank-1b-v2:
   resources:
     limits:
       nvidia.com/gpu: 0
diff --git a/helm/mig/nv-ingest-mig-values.yaml b/helm/mig/nv-ingest-mig-values.yaml
index 8ae0e8c83..97707a5da 100644
--- a/helm/mig/nv-ingest-mig-values.yaml
+++ b/helm/mig/nv-ingest-mig-values.yaml
@@ -39,7 +39,7 @@ nimOperator:
         nvidia.com/gpu: "0"
         nvidia.com/mig-1g.10gb: 1
 
-  nemoretriever_ocr_v1:
+  ocr:
     resources:
       limits:
         nvidia.com/gpu: "0"
@@ -48,8 +48,8 @@ nimOperator:
         nvidia.com/gpu: "0"
         nvidia.com/mig-1g.20gb: 1
 
-  # If you want to deploy llama-32-nv-rerankqa-1b-v2
-  llama_3_2_nv_rerankqa_1b_v2:
+  # If you want to deploy llama-nemotron-rerank-1b-v2
+  rerankqa:
     enabled: true
     resources:
       limits:
diff --git a/helm/overrides/values-a100-40gb.yaml b/helm/overrides/values-a100-40gb.yaml
index 003c234ba..7fe15de12 100644
--- a/helm/overrides/values-a100-40gb.yaml
+++ b/helm/overrides/values-a100-40gb.yaml
@@ -64,7 +64,7 @@ nimOperator:
       - name: OMP_NUM_THREADS
         value: "1"
 
-  nemoretriever_ocr_v1:
+  ocr:
     env:
       - name: OMP_NUM_THREADS
         value: "8"
@@ -75,7 +75,7 @@ nimOperator:
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
 
-  llama_3_2_nv_rerankqa_1b_v2:
+  rerankqa:
     env:
       - name: NIM_HTTP_API_PORT
         value: "8000"
diff --git a/helm/overrides/values-a10g.yaml b/helm/overrides/values-a10g.yaml
index 0ad99584c..36a9bed4d 100644
--- a/helm/overrides/values-a10g.yaml
+++ b/helm/overrides/values-a10g.yaml
@@ -70,7 +70,7 @@ nimOperator:
       - name: OMP_NUM_THREADS
         value: "1"
 
-  nemoretriever_ocr_v1:
+  ocr:
     env:
       - name: OMP_NUM_THREADS
         value: "8"
@@ -81,7 +81,7 @@ nimOperator:
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
 
-  llama_3_2_nv_rerankqa_1b_v2:
+  rerankqa:
     env:
       - name: NIM_HTTP_API_PORT
         value: "8000"
diff --git a/helm/overrides/values-l40s.yaml b/helm/overrides/values-l40s.yaml
index 7f4e3a680..85e941485 100644
--- a/helm/overrides/values-l40s.yaml
+++ b/helm/overrides/values-l40s.yaml
@@ -64,7 +64,7 @@ nimOperator:
       - name: OMP_NUM_THREADS
         value: "1"
 
-  nemoretriever_ocr_v1:
+  ocr:
     env:
       - name: OMP_NUM_THREADS
         value: "8"
@@ -75,7 +75,7 @@ nimOperator:
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
 
-  llama_3_2_nv_rerankqa_1b_v2:
+  rerankqa:
     env:
       - name: NIM_HTTP_API_PORT
         value: "8000"
diff --git a/helm/templates/llama-3.2-nv-rerankqa-1b-v2.yaml b/helm/templates/llama-3.2-nv-rerankqa-1b-v2.yaml
deleted file mode 100644
index 12e69da27..000000000
--- a/helm/templates/llama-3.2-nv-rerankqa-1b-v2.yaml
+++ /dev/null
@@ -1,47 +0,0 @@
-{{ if and (.Capabilities.APIVersions.Has "apps.nvidia.com/v1alpha1") (eq .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.enabled true) -}}
-apiVersion: apps.nvidia.com/v1alpha1
-kind: NIMCache
-metadata:
-  name: llama-nemotron-rerank-1b-v2
-  annotations:
-    helm.sh/resource-policy: keep
-spec:
-  source:
-    ngc:
-      modelPuller: "{{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.repository }}:{{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.tag }}"
-      pullSecret: "{{ index .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.pullSecrets 0 }}"
-      authSecret: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.authSecret }}
-  storage:
-    pvc:
-      create: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.create }}
-      storageClass: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.storageClass }}
-      size: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.size }}
-      volumeAccessMode: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.storage.pvc.volumeAccessMode }}
----
-apiVersion: apps.nvidia.com/v1alpha1
-kind: NIMService
-metadata:
-  name: llama-32-nv-rerankqa-1b-v2
-spec:
-  image:
-    repository: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.repository }}
-    tag: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.tag }}
-    pullPolicy: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.pullPolicy }}
-    pullSecrets:
-{{ toYaml .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.image.pullSecrets | nindent 6 }}
-  authSecret: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.authSecret }}
-  storage:
-    nimCache:
-      name: llama-nemotron-rerank-1b-v2
-  replicas: {{ .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.replicas }}
-  nodeSelector:
-{{ toYaml .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.nodeSelector | nindent 4 }}
-  resources:
-{{ toYaml .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.resources | nindent 4 }}
-  tolerations:
-{{ toYaml .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.tolerations | nindent 4 }}
-  expose:
-{{ toYaml .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.expose | nindent 4 }}
-  env:
-{{ toYaml .Values.nimOperator.llama_3_2_nv_rerankqa_1b_v2.env | nindent 4 }}
-{{- end }}
diff --git a/helm/templates/llama-3.2-nv-embedqa-1b-v2.yaml b/helm/templates/llama-nemotron-embed-1b-v2.yaml
similarity index 98%
rename from helm/templates/llama-3.2-nv-embedqa-1b-v2.yaml
rename to helm/templates/llama-nemotron-embed-1b-v2.yaml
index e9376ced7..199bcdc9c 100644
--- a/helm/templates/llama-3.2-nv-embedqa-1b-v2.yaml
+++ b/helm/templates/llama-nemotron-embed-1b-v2.yaml
@@ -21,7 +21,7 @@ spec:
 apiVersion: apps.nvidia.com/v1alpha1
 kind: NIMService
 metadata:
-  name: llama-32-nv-embedqa-1b-v2
+  name: llama-nemotron-embed-1b-v2
 spec:
   image:
     repository: {{ .Values.nimOperator.embedqa.image.repository }}
diff --git a/helm/templates/llama-nemotron-rerank-1b-v2.yaml b/helm/templates/llama-nemotron-rerank-1b-v2.yaml
new file mode 100644
index 000000000..6cfc2fcfc
--- /dev/null
+++ b/helm/templates/llama-nemotron-rerank-1b-v2.yaml
@@ -0,0 +1,47 @@
+{{ if and (.Capabilities.APIVersions.Has "apps.nvidia.com/v1alpha1") (eq .Values.nimOperator.rerankqa.enabled true) -}}
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NIMCache
+metadata:
+  name: llama-nemotron-rerank-1b-v2
+  annotations:
+    helm.sh/resource-policy: keep
+spec:
+  source:
+    ngc:
+      modelPuller: "{{ .Values.nimOperator.rerankqa.image.repository }}:{{ .Values.nimOperator.rerankqa.image.tag }}"
+      pullSecret: "{{ index .Values.nimOperator.rerankqa.image.pullSecrets 0 }}"
+      authSecret: {{ .Values.nimOperator.rerankqa.authSecret }}
+  storage:
+    pvc:
+      create: {{ .Values.nimOperator.rerankqa.storage.pvc.create }}
+      storageClass: {{ .Values.nimOperator.rerankqa.storage.pvc.storageClass }}
+      size: {{ .Values.nimOperator.rerankqa.storage.pvc.size }}
+      volumeAccessMode: {{ .Values.nimOperator.rerankqa.storage.pvc.volumeAccessMode }}
+---
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NIMService
+metadata:
+  name: llama-nemotron-rerank-1b-v2
+spec:
+  image:
+    repository: {{ .Values.nimOperator.rerankqa.image.repository }}
+    tag: {{ .Values.nimOperator.rerankqa.image.tag }}
+    pullPolicy: {{ .Values.nimOperator.rerankqa.image.pullPolicy }}
+    pullSecrets:
+{{ toYaml .Values.nimOperator.rerankqa.image.pullSecrets | nindent 6 }}
+  authSecret: {{ .Values.nimOperator.rerankqa.authSecret }}
+  storage:
+    nimCache:
+      name: llama-nemotron-rerank-1b-v2
+  replicas: {{ .Values.nimOperator.rerankqa.replicas }}
+  nodeSelector:
+{{ toYaml .Values.nimOperator.rerankqa.nodeSelector | nindent 4 }}
+  resources:
+{{ toYaml .Values.nimOperator.rerankqa.resources | nindent 4 }}
+  tolerations:
+{{ toYaml .Values.nimOperator.rerankqa.tolerations | nindent 4 }}
+  expose:
+{{ toYaml .Values.nimOperator.rerankqa.expose | nindent 4 }}
+  env:
+{{ toYaml .Values.nimOperator.rerankqa.env | nindent 4 }}
+{{- end }}
diff --git a/helm/templates/nemoretriever-ocr-v1.yaml b/helm/templates/nemoretriever-ocr-v1.yaml
deleted file mode 100644
index 6606d12f5..000000000
--- a/helm/templates/nemoretriever-ocr-v1.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-{{ if and (.Capabilities.APIVersions.Has "apps.nvidia.com/v1alpha1") (eq .Values.nimOperator.nemoretriever_ocr_v1.enabled true) -}}
-apiVersion: apps.nvidia.com/v1alpha1
-kind: NIMCache
-metadata:
-  name: nemotron-ocr-v1
-  annotations:
-    helm.sh/resource-policy: keep
-spec:
-  source:
-    ngc:
-      modelPuller: "{{ .Values.nimOperator.nemoretriever_ocr_v1.image.repository }}:{{ .Values.nimOperator.nemoretriever_ocr_v1.image.tag }}"
-      pullSecret: "{{ index .Values.nimOperator.nemoretriever_ocr_v1.image.pullSecrets 0 }}"
-      authSecret: {{ .Values.nimOperator.nemoretriever_ocr_v1.authSecret }}
-  storage:
-    pvc:
-      create: {{ .Values.nimOperator.nemoretriever_ocr_v1.storage.pvc.create }}
-      storageClass: {{ .Values.nimOperator.nemoretriever_ocr_v1.storage.pvc.storageClass }}
-      size: {{ .Values.nimOperator.nemoretriever_ocr_v1.storage.pvc.size }}
-      volumeAccessMode: {{ .Values.nimOperator.nemoretriever_ocr_v1.storage.pvc.volumeAccessMode }}
----
-apiVersion: apps.nvidia.com/v1alpha1
-kind: NIMService
-metadata:
-  name: nemotron-ocr-v1
-spec:
-  image:
-    repository: {{ .Values.nimOperator.nemoretriever_ocr_v1.image.repository }}
-    tag: {{ .Values.nimOperator.nemoretriever_ocr_v1.image.tag }}
-    pullPolicy: {{ .Values.nimOperator.nemoretriever_ocr_v1.image.pullPolicy }}
-    pullSecrets: {{ toYaml .Values.nimOperator.nemoretriever_ocr_v1.image.pullSecrets | nindent 6 }}
-  authSecret: {{ .Values.nimOperator.nemoretriever_ocr_v1.authSecret }}
-  storage:
-    nimCache:
-      name: nemotron-ocr-v1
-  replicas: {{ .Values.nimOperator.nemoretriever_ocr_v1.replicas }}
-  nodeSelector: {{ toYaml .Values.nimOperator.nemoretriever_ocr_v1.nodeSelector | nindent 4 }}
-  resources: {{ toYaml .Values.nimOperator.nemoretriever_ocr_v1.resources | nindent 4 }}
-  tolerations: {{ toYaml .Values.nimOperator.nemoretriever_ocr_v1.tolerations | nindent 4 }}
-  expose: {{ toYaml .Values.nimOperator.nemoretriever_ocr_v1.expose | nindent 4 }}
-  env: {{ toYaml .Values.nimOperator.nemoretriever_ocr_v1.env | nindent 4 }}
-{{- end }}
diff --git a/helm/templates/nemoretriever-graphic-elements-v1.yaml b/helm/templates/nemotron-graphic-elements-v1.yaml
similarity index 100%
rename from helm/templates/nemoretriever-graphic-elements-v1.yaml
rename to helm/templates/nemotron-graphic-elements-v1.yaml
diff --git a/helm/templates/nemotron-ocr-v1.yaml b/helm/templates/nemotron-ocr-v1.yaml
new file mode 100644
index 000000000..7ae0f2dea
--- /dev/null
+++ b/helm/templates/nemotron-ocr-v1.yaml
@@ -0,0 +1,41 @@
+{{ if and (.Capabilities.APIVersions.Has "apps.nvidia.com/v1alpha1") (eq .Values.nimOperator.ocr.enabled true) -}}
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NIMCache
+metadata:
+  name: nemotron-ocr-v1
+  annotations:
+    helm.sh/resource-policy: keep
+spec:
+  source:
+    ngc:
+      modelPuller: "{{ .Values.nimOperator.ocr.image.repository }}:{{ .Values.nimOperator.ocr.image.tag }}"
+      pullSecret: "{{ index .Values.nimOperator.ocr.image.pullSecrets 0 }}"
+      authSecret: {{ .Values.nimOperator.ocr.authSecret }}
+  storage:
+    pvc:
+      create: {{ .Values.nimOperator.ocr.storage.pvc.create }}
+      storageClass: {{ .Values.nimOperator.ocr.storage.pvc.storageClass }}
+      size: {{ .Values.nimOperator.ocr.storage.pvc.size }}
+      volumeAccessMode: {{ .Values.nimOperator.ocr.storage.pvc.volumeAccessMode }}
+---
+apiVersion: apps.nvidia.com/v1alpha1
+kind: NIMService
+metadata:
+  name: nemotron-ocr-v1
+spec:
+  image:
+    repository: {{ .Values.nimOperator.ocr.image.repository }}
+    tag: {{ .Values.nimOperator.ocr.image.tag }}
+    pullPolicy: {{ .Values.nimOperator.ocr.image.pullPolicy }}
+    pullSecrets: {{ toYaml .Values.nimOperator.ocr.image.pullSecrets | nindent 6 }}
+  authSecret: {{ .Values.nimOperator.ocr.authSecret }}
+  storage:
+    nimCache:
+      name: nemotron-ocr-v1
+  replicas: {{ .Values.nimOperator.ocr.replicas }}
+  nodeSelector: {{ toYaml .Values.nimOperator.ocr.nodeSelector | nindent 4 }}
+  resources: {{ toYaml .Values.nimOperator.ocr.resources | nindent 4 }}
+  tolerations: {{ toYaml .Values.nimOperator.ocr.tolerations | nindent 4 }}
+  expose: {{ toYaml .Values.nimOperator.ocr.expose | nindent 4 }}
+  env: {{ toYaml .Values.nimOperator.ocr.env | nindent 4 }}
+{{- end }}
diff --git a/helm/templates/nemoretriever-page-elements-v3.yaml b/helm/templates/nemotron-page-elements-v3.yaml
similarity index 100%
rename from helm/templates/nemoretriever-page-elements-v3.yaml
rename to helm/templates/nemotron-page-elements-v3.yaml
diff --git a/helm/templates/nemoretriever-table-structure-v1.yaml b/helm/templates/nemotron-table-structure-v1.yaml
similarity index 100%
rename from helm/templates/nemoretriever-table-structure-v1.yaml
rename to helm/templates/nemotron-table-structure-v1.yaml
diff --git a/helm/values.yaml b/helm/values.yaml
index a5fae1aef..2bbbea67c 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -170,7 +170,7 @@ envVars:
   AUDIO_GRPC_ENDPOINT: "audio:50051"
   AUDIO_INFER_PROTOCOL: "grpc"
 
-  EMBEDDING_NIM_ENDPOINT: "http://llama-32-nv-embedqa-1b-v2:8000/v1"
+  EMBEDDING_NIM_ENDPOINT: "http://llama-nemotron-embed-1b-v2:8000/v1"
   EMBEDDING_NIM_MODEL_NAME: "nvidia/llama-nemotron-embed-1b-v2"
 
   NEMOTRON_PARSE_HTTP_ENDPOINT: http://nemotron-parse:8000/v1/chat/completions
@@ -828,16 +828,16 @@ nimOperator:
       - name: NIM_TRITON_PERFORMANCE_MODE
         value: "throughput"
 
-  ## @param nemoretriever_ocr_v1 [object] Configuration for NemoRetriever OCR v1 NIM
-  ## @param nemoretriever_ocr_v1.enabled [bool] Enable the NEMORetriever OCR v1 service
-  ## @param nemoretriever_ocr_v1.image.* [various] Image settings for NEMORetriever OCR v1
-  ## @param nemoretriever_ocr_v1.authSecret [string] Secret for authentication
-  ## @param nemoretriever_ocr_v1.storage.* [object] Storage/PVC configuration
-  ## @param nemoretriever_ocr_v1.replicas [int] Number of service replicas
-  ## @param nemoretriever_ocr_v1.resources [object] Limits/requests for compute resources
-  ## @param nemoretriever_ocr_v1.expose.* [object] Ports and service config
-  ## @param nemoretriever_ocr_v1.env [array] Additional environment variables
-  nemoretriever_ocr_v1:
+  ## @param ocr [object] Configuration for Nemotron OCR v1 NIM
+  ## @param ocr.enabled [bool] Enable the Nemotron OCR v1 service
+  ## @param ocr.image.* [various] Image settings for Nemotron OCR v1
+  ## @param ocr.authSecret [string] Secret for authentication
+  ## @param ocr.storage.* [object] Storage/PVC configuration
+  ## @param ocr.replicas [int] Number of service replicas
+  ## @param ocr.resources [object] Limits/requests for compute resources
+  ## @param ocr.expose.* [object] Ports and service config
+  ## @param ocr.env [array] Additional environment variables
+  ocr:
     enabled: true
     image:
       repository: nvcr.io/nim/nvidia/nemotron-ocr-v1
@@ -870,16 +870,16 @@ nimOperator:
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "32"
 
-  ## @param llama_3_2_nv_rerankqa_1b_v2 [object] Configuration for LLaMA-3.2 NV RerankQA 1B v2 NIM
-  ## @param llama_3_2_nv_rerankqa_1b_v2.enabled [bool] Enable this NIM
-  ## @param llama_3_2_nv_rerankqa_1b_v2.image.* [various] Image repository/tag for this NIM
-  ## @param llama_3_2_nv_rerankqa_1b_v2.authSecret [string] Authentication secret for the NIM
-  ## @param llama_3_2_nv_rerankqa_1b_v2.storage.* [various] Storage/PVC configuration
-  ## @param llama_3_2_nv_rerankqa_1b_v2.replicas [int] Number of replicas
-  ## @param llama_3_2_nv_rerankqa_1b_v2.resources [object] Limits/requests for resources
-  ## @param llama_3_2_nv_rerankqa_1b_v2.expose.* [object] Port/service configuration
-  ## @param llama_3_2_nv_rerankqa_1b_v2.env [array] Additional environment variables
-  llama_3_2_nv_rerankqa_1b_v2:
+  ## @param rerankqa [object] Configuration for Llama Nemotron Rerank 1B v2 NIM
+  ## @param rerankqa.enabled [bool] Enable this NIM
+  ## @param rerankqa.image.* [various] Image repository/tag for this NIM
+  ## @param rerankqa.authSecret [string] Authentication secret for the NIM
+  ## @param rerankqa.storage.* [various] Storage/PVC configuration
+  ## @param rerankqa.replicas [int] Number of replicas
+  ## @param rerankqa.resources [object] Limits/requests for resources
+  ## @param rerankqa.expose.* [object] Port/service configuration
+  ## @param rerankqa.env [array] Additional environment variables
+  rerankqa:
     enabled: false
     image:
       repository: nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2
diff --git a/tools/harness/test_configs.yaml b/tools/harness/test_configs.yaml
index d49a25ffb..1db4646ea 100644
--- a/tools/harness/test_configs.yaml
+++ b/tools/harness/test_configs.yaml
@@ -53,7 +53,7 @@ active:
         local_port: 8020
         remote_port: 8000
     values:  # inline Helm values
-      nimOperator.llama_3_2_nv_rerankqa_1b_v2.enabled: true
+      nimOperator.rerankqa.enabled: true
 
   # Runtime configuration
   sparse: false  # Use sparse embeddings (Milvus only)

From 1835ba778697ff0365ee8ba29712520e13f3d33a Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Wed, 11 Mar 2026 20:10:16 -0400
Subject: [PATCH 07/55] Add source_id column back to lancedb

---
 .../src/nemo_retriever/ingest_modes/batch.py           | 10 ++++++++++
 .../src/nemo_retriever/ingest_modes/lancedb_utils.py   |  4 +++-
 .../src/nemo_retriever/utils/hf_model_registry.py      |  1 +
 nemo_retriever/tests/test_lancedb_utils.py             |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)
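
Reviewer note: besides restoring `source_id`, this patch makes `extract()` delegate to `extract_txt()` when every input path ends in `.txt`. The check can be sketched in isolation as below; `all_txt` is a hypothetical name, since the patch inlines the expression rather than defining a helper.

```python
from typing import Sequence


def all_txt(paths: Sequence[str]) -> bool:
    """True only when `paths` is non-empty and every entry ends in .txt (case-insensitive)."""
    # `all()` over an empty sequence is True, so the emptiness guard is required;
    # the patch achieves the same with `self._input_documents and all(...)`.
    return bool(paths) and all(p.lower().endswith(".txt") for p in paths)


print(all_txt(["notes.txt", "README.TXT"]))  # True
print(all_txt(["notes.txt", "report.pdf"]))  # False
print(all_txt([]))                           # False
```

A single non-`.txt` input therefore keeps the whole batch on the regular extraction path.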

diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
index 84c13fe5f..b2886991a 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
@@ -295,6 +295,9 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "BatchI
         This does not run extraction yet; it records configuration so the batch
         executor can build a concrete pipeline later.
 
+        If every input file has a ``.txt`` extension, the pipeline delegates to :meth:`extract_txt`,
+        building :class:`TextChunkParams` from ``max_tokens`` (default 1024) and ``overlap_tokens`` (default 0).
+
         Resource-tuning kwargs (auto-detected from available resources if omitted):
 
         - ``pdf_split_batch_size``: Batch size for PDF split stage (default 1).
@@ -308,6 +311,13 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "BatchI
         - ``ocr_cpus_per_actor``: CPUs reserved per OCR actor (default 1).
         """
 
+        if self._input_documents and all(f.lower().endswith(".txt") for f in self._input_documents):
+            txt_params = TextChunkParams(
+                max_tokens=kwargs.pop("max_tokens", 1024),
+                overlap_tokens=kwargs.pop("overlap_tokens", 0),
+            )
+            return self.extract_txt(params=txt_params)
+
         resolved = _coerce_params(params, ExtractParams, kwargs)
         if (
             any(
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py b/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py
index e82c45d17..41fd24378 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/lancedb_utils.py
@@ -197,7 +197,9 @@ def lancedb_schema(vector_dim: int = 2048) -> Any:
             pa.field("pdf_basename", pa.string()),
             pa.field("page_number", pa.int32()),
             pa.field("source", pa.string()),
-            pa.field("source_id", pa.string()),
+            pa.field(
+                "source_id", pa.string()
+            ),  # Distinct from "source": holds path+page_number, used for aggregation tasks
             pa.field("path", pa.string()),
             pa.field("text", pa.string()),
             pa.field("metadata", pa.string()),
diff --git a/nemo_retriever/src/nemo_retriever/utils/hf_model_registry.py b/nemo_retriever/src/nemo_retriever/utils/hf_model_registry.py
index 46022b03f..2589e198a 100644
--- a/nemo_retriever/src/nemo_retriever/utils/hf_model_registry.py
+++ b/nemo_retriever/src/nemo_retriever/utils/hf_model_registry.py
@@ -28,6 +28,7 @@
     "nvidia/llama-nemotron-embed-vl-1b-v2": "859e1f2dac29c56c37a5279cf55f53f3e74efc6b",
     "meta-llama/Llama-3.2-1B": "4e20de362430cd3b72f300e6b0f18e50e7166e08",
     "intfloat/e5-large-unsupervised": "15af9288f69a6291f37bfb89b47e71abc747b206",
+    "nvidia/llama-nemotron-rerank-1b-v2": "aee9a1be0bbd89489f8bd0ec5763614c8bb85878",
 }
 
 
diff --git a/nemo_retriever/tests/test_lancedb_utils.py b/nemo_retriever/tests/test_lancedb_utils.py
index cc0541195..9fd6734f3 100644
--- a/nemo_retriever/tests/test_lancedb_utils.py
+++ b/nemo_retriever/tests/test_lancedb_utils.py
@@ -198,6 +198,7 @@ def test_returns_schema_with_correct_fields(self):
         assert "text" in names
         assert "metadata" in names
         assert "source" in names
+        assert "source_id" in names
         assert len(names) == 10
 
 

From db03ed7c0946dbca82e74739330973bbe46e4927 Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Wed, 11 Mar 2026 17:43:46 -0400
Subject: [PATCH 08/55] upmerge

---
 nemo_retriever/pyproject.toml                 |   1 +
 .../nemo_retriever/examples/batch_pipeline.py |   9 +
 .../nemo_retriever/model/local/__init__.py    |   5 +
 .../model/local/nemotron_rerank_v2.py         | 210 ++++++
 .../src/nemo_retriever/recall/core.py         | 126 +---
 .../src/nemo_retriever/rerank/__init__.py     |  24 +
 .../src/nemo_retriever/rerank/rerank.py       | 377 +++++++++++
 .../src/nemo_retriever/retriever.py           | 123 +++-
 .../utils/benchmark/audio_extract_actor.py    | 126 ++--
 .../vector_store/lancedb_store.py             |  21 -
 nemo_retriever/tests/test_audio_benchmark.py  |  29 +-
 .../tests/test_audio_pipeline_batch.py        |   2 +
 nemo_retriever/tests/test_html_convert.py     |   6 +-
 .../tests/test_nemotron_rerank_v2.py          | 608 ++++++++++++++++++
 .../tests/test_retriever_queries.py           | 372 +++++++++++
 15 files changed, 1846 insertions(+), 193 deletions(-)
 create mode 100644 nemo_retriever/src/nemo_retriever/model/local/nemotron_rerank_v2.py
 create mode 100644 nemo_retriever/src/nemo_retriever/rerank/__init__.py
 create mode 100644 nemo_retriever/src/nemo_retriever/rerank/rerank.py
 create mode 100644 nemo_retriever/tests/test_nemotron_rerank_v2.py
 create mode 100644 nemo_retriever/tests/test_retriever_queries.py
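
Note on the rerank flow added in rerank/rerank.py: the remote path fills
scores in input-document order (anything the server omits stays at -inf),
and hits are then sorted by score, highest first. The sketch below restates
that logic without a model or server. `align_scores`, `sort_hits`, and the
fake response are hypothetical names and data made up for illustration; they
are not part of this patch, and the response shape is assumed from how
`_rerank_via_endpoint` reads it ("results" items carrying "index" and
"relevance_score").

```python
from typing import Any, Dict, List, Optional


def align_scores(num_documents: int, response: Dict[str, Any]) -> List[float]:
    """Map a /rerank-style response back onto the input document order."""
    # Documents the server did not return keep a score of -inf.
    scores = [float("-inf")] * num_documents
    for item in response.get("results", []):
        idx = item.get("index")
        score = item.get("relevance_score")
        if idx is not None and score is not None:
            scores[idx] = float(score)
    return scores


def sort_hits(
    hits: List[Dict[str, Any]],
    scores: List[float],
    top_n: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Attach a score to each hit and sort highest-score first."""
    ranked = sorted(
        ({"_rerank_score": s, **h} for s, h in zip(scores, hits)),
        key=lambda h: h["_rerank_score"],
        reverse=True,
    )
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical data: three hits, and a server response covering only two.
hits = [
    {"text": "Paris is the capital of France."},
    {"text": "Machine learning is a field of AI."},
    {"text": "Unrelated passage."},
]
fake_response = {
    "results": [
        {"index": 1, "relevance_score": 0.92},
        {"index": 0, "relevance_score": 0.11},
    ]
}

scores = align_scores(len(hits), fake_response)
ranked = sort_hits(hits, scores, top_n=2)
```

Because omitted documents score -inf, they sort last and are the first to be
dropped when `top_n` is applied.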

diff --git a/nemo_retriever/pyproject.toml b/nemo_retriever/pyproject.toml
index 953a2f014..b9f84b79d 100644
--- a/nemo_retriever/pyproject.toml
+++ b/nemo_retriever/pyproject.toml
@@ -63,6 +63,7 @@ dependencies = [
   "nemotron-ocr>=0.dev0",
   "markitdown",
   "timm==1.0.22",
+  "tqdm",
   "accelerate==1.12.0",
   "albumentations==2.0.8",
   "scikit-learn>=1.6.0",
diff --git a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
index b7e96ac93..a66137660 100644
--- a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
@@ -425,6 +425,14 @@ def main(
         "--runtime-metrics-prefix",
         help="Optional filename prefix for per-run metrics artifacts.",
     ),
+    reranker: Optional[bool] = typer.Option(
+        False, "--reranker/--no-reranker", help="Enable a re-ranking stage with a cross-encoder model."
+    ),
+    reranker_model_name: str = typer.Option(
+        "nvidia/llama-nemotron-rerank-1b-v2",
+        "--reranker-model-name",
+        help="Cross-encoder model name for the re-ranking stage (used only when --reranker is set).",
+    ),
     structured_elements_modality: Optional[str] = typer.Option(
         None,
         "--structured-elements-modality",
@@ -782,6 +790,7 @@ def _extract_params(batch_tuning: dict, **overrides: Any) -> ExtractParams:
             ks=(1, 5, 10),
             hybrid=hybrid,
             match_mode=recall_match_mode,
+            reranker=reranker_model_name if reranker else None,
         )
 
         # Capture recall only times.
diff --git a/nemo_retriever/src/nemo_retriever/model/local/__init__.py b/nemo_retriever/src/nemo_retriever/model/local/__init__.py
index 7fa66d3f7..791df4daa 100644
--- a/nemo_retriever/src/nemo_retriever/model/local/__init__.py
+++ b/nemo_retriever/src/nemo_retriever/model/local/__init__.py
@@ -17,6 +17,7 @@
     "NemotronTableStructureV1",
     "NemotronGraphicElementsV1",
     "NemotronParseV12",
+    "NemotronRerankV2",
     "ParakeetCTC1B1ASR",
 ]
 
@@ -42,6 +43,10 @@ def __getattr__(name: str):
         from .nemotron_parse_v1_2 import NemotronParseV12
 
         return NemotronParseV12
+    if name == "NemotronRerankV2":
+        from .nemotron_rerank_v2 import NemotronRerankV2
+
+        return NemotronRerankV2
     if name == "ParakeetCTC1B1ASR":
         from .parakeet_ctc_1_1b_asr import ParakeetCTC1B1ASR
 
diff --git a/nemo_retriever/src/nemo_retriever/model/local/nemotron_rerank_v2.py b/nemo_retriever/src/nemo_retriever/model/local/nemotron_rerank_v2.py
new file mode 100644
index 000000000..eca0ee674
--- /dev/null
+++ b/nemo_retriever/src/nemo_retriever/model/local/nemotron_rerank_v2.py
@@ -0,0 +1,210 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Local wrapper for nvidia/llama-nemotron-rerank-1b-v2 cross-encoder reranker."""
+
+from __future__ import annotations
+
+from typing import List, Optional
+
+from nemo_retriever.utils.hf_cache import configure_global_hf_cache_base
+from ..model import BaseModel, RunMode
+
+
+_DEFAULT_MODEL = "nvidia/llama-nemotron-rerank-1b-v2"
+_DEFAULT_MAX_LENGTH = 512
+_DEFAULT_BATCH_SIZE = 32
+
+
+def _prompt_template(query: str, passage: str) -> str:
+    """Format a (query, passage) pair as the model expects."""
+    return f"question:{query} \n \n passage:{passage}"
+
+
+class NemotronRerankV2(BaseModel):
+    """
+    Local cross-encoder reranker wrapping nvidia/llama-nemotron-rerank-1b-v2.
+
+    The model scores (query, document) pairs and returns raw logits; higher
+    values indicate greater relevance.  It is fine-tuned from
+    meta-llama/Llama-3.2-1B with bi-directional attention and supports 26
+    languages with sequences up to 8 192 tokens.
+
+    Example::
+
+        reranker = NemotronRerankV2()
+        scores = reranker.score("What is ML?", ["Machine learning is…", "Paris is…"])
+        # scores -> [20.6, -23.1]  (higher = more relevant)
+    """
+
+    def __init__(
+        self,
+        model_name: str = _DEFAULT_MODEL,
+        device: Optional[str] = None,
+        hf_cache_dir: Optional[str] = None,
+    ) -> None:
+        super().__init__()
+        import torch
+        from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+        configure_global_hf_cache_base()
+
+        self._model_name = model_name
+        self._device = device or ("cuda" if torch.cuda.is_available() else "cpu")
+
+        kwargs: dict = {"trust_remote_code": True}
+        if hf_cache_dir:
+            kwargs["cache_dir"] = hf_cache_dir
+
+        self._tokenizer = AutoTokenizer.from_pretrained(
+            model_name,
+            padding_side="left",
+            **kwargs,
+        )
+        if self._tokenizer.pad_token is None:
+            self._tokenizer.pad_token = self._tokenizer.eos_token
+
+        self._model = (
+            AutoModelForSequenceClassification.from_pretrained(
+                model_name,
+                torch_dtype=torch.bfloat16,
+                **kwargs,
+            )
+            .eval()
+            .to(self._device)
+        )
+
+        if self._model.config.pad_token_id is None:
+            self._model.config.pad_token_id = self._tokenizer.eos_token_id
+
+    # ------------------------------------------------------------------
+    # BaseModel abstract properties
+    # ------------------------------------------------------------------
+
+    @property
+    def model_name(self) -> str:
+        return self._model_name
+
+    @property
+    def model_type(self) -> str:
+        return "reranker"
+
+    @property
+    def model_runmode(self) -> RunMode:
+        return "local"
+
+    @property
+    def input(self):
+        return "List[Tuple[str, str]]"
+
+    @property
+    def output(self):
+        return "List[float]"
+
+    @property
+    def input_batch_size(self) -> int:
+        return _DEFAULT_BATCH_SIZE
+
+    # ------------------------------------------------------------------
+    # Public API
+    # ------------------------------------------------------------------
+
+    def score(
+        self,
+        query: str,
+        documents: List[str],
+        *,
+        max_length: int = _DEFAULT_MAX_LENGTH,
+        batch_size: int = _DEFAULT_BATCH_SIZE,
+    ) -> List[float]:
+        """
+        Score relevance of *documents* to *query*.
+
+        Parameters
+        ----------
+        query:
+            The search query.
+        documents:
+            Candidate passages/documents to score.
+        max_length:
+            Tokenizer truncation length (default 512; max supported 8 192).
+        batch_size:
+            Number of (query, doc) pairs to process per GPU forward pass.
+
+        Returns
+        -------
+        List[float]
+            Raw logit scores aligned with *documents* (higher = more relevant).
+        """
+        import torch
+
+        if not documents:
+            return []
+
+        texts = [_prompt_template(query, d) for d in documents]
+        all_scores: List[float] = []
+
+        with torch.inference_mode():
+            for start in range(0, len(texts), batch_size):
+                chunk = texts[start : start + batch_size]
+                batch = self._tokenizer(
+                    chunk,
+                    padding=True,
+                    truncation=True,
+                    return_tensors="pt",
+                    max_length=max_length,
+                )
+                batch = {k: v.to(self._device) for k, v in batch.items()}
+                logits = self._model(**batch).logits
+                all_scores.extend(logits.view(-1).cpu().tolist())
+
+        return all_scores
+
+    def score_pairs(
+        self,
+        pairs: List[tuple],
+        *,
+        max_length: int = _DEFAULT_MAX_LENGTH,
+        batch_size: int = _DEFAULT_BATCH_SIZE,
+    ) -> List[float]:
+        """
+        Score a list of (query, document) pairs.
+
+        Parameters
+        ----------
+        pairs:
+            Sequence of ``(query, document)`` tuples.
+        max_length:
+            Tokenizer truncation length.
+        batch_size:
+            GPU forward-pass batch size.
+
+        Returns
+        -------
+        List[float]
+            Raw logit scores (higher = more relevant).
+        """
+        import torch
+
+        if not pairs:
+            return []
+
+        texts = [_prompt_template(q, d) for q, d in pairs]
+        all_scores: List[float] = []
+
+        with torch.inference_mode():
+            for start in range(0, len(texts), batch_size):
+                chunk = texts[start : start + batch_size]
+                batch = self._tokenizer(
+                    chunk,
+                    padding=True,
+                    truncation=True,
+                    return_tensors="pt",
+                    max_length=max_length,
+                )
+                batch = {k: v.to(self._device) for k, v in batch.items()}
+                logits = self._model(**batch).logits
+                all_scores.extend(logits.view(-1).cpu().tolist())
+
+        return all_scores
diff --git a/nemo_retriever/src/nemo_retriever/recall/core.py b/nemo_retriever/src/nemo_retriever/recall/core.py
index d5174b968..882e3722b 100644
--- a/nemo_retriever/src/nemo_retriever/recall/core.py
+++ b/nemo_retriever/src/nemo_retriever/recall/core.py
@@ -9,6 +9,7 @@
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Dict, List, Optional, Sequence, Tuple
+from nemo_retriever.retriever import Retriever
 import json
 
 logger = logging.getLogger(__name__)
@@ -48,6 +49,10 @@ class RecallConfig:
     # - pdf_page: compare on "{pdf}_{page}" keys
     # - pdf_only: compare on "{pdf}" document keys
     match_mode: str = "pdf_page"
+    reranker: Optional[str] = None
+    reranker_endpoint: Optional[str] = None
+    reranker_api_key: str = ""
+    reranker_batch_size: int = 32
 
 
 def _normalize_pdf_name(value: str) -> str:
@@ -179,81 +184,21 @@ def _embed_queries_local_hf(
     return vecs.detach().to("cpu").tolist()
 
 
-def _search_lancedb(
-    *,
-    lancedb_uri: str,
-    table_name: str,
-    query_vectors: List[List[float]],
-    top_k: int,
-    vector_column_name: str = "vector",
-    nprobes: int = 0,
-    refine_factor: int = 10,
-    query_texts: Optional[List[str]] = None,
-    hybrid: bool = False,
-) -> List[List[Dict[str, Any]]]:
-    import lancedb  # type: ignore
-
-    db = lancedb.connect(lancedb_uri)
-    table = db.open_table(table_name)
-
-    # Determine nprobes: 0 means "search all partitions" for exhaustive ANN search.
-    # Read the actual partition count from the index so we don't hard-code it.
-    effective_nprobes = nprobes
-    if effective_nprobes <= 0:
-        try:
-            indices = table.list_indices()
-            for idx in indices:
-                np_ = getattr(idx, "num_partitions", None)
-                if np_ and int(np_) > 0:
-                    effective_nprobes = int(np_)
-                    break
-        except Exception:
-            pass
-        if effective_nprobes <= 0:
-            effective_nprobes = 16  # safe fallback matching default index config
-
-    results: List[List[Dict[str, Any]]] = []
-    for i, v in enumerate(query_vectors):
-        q = np.asarray(v, dtype="float32")
-
-        if hybrid and query_texts is not None:
-            from lancedb.rerankers import RRFReranker  # type: ignore
-
-            text = query_texts[i]
-            hits = (
-                table.search(query_type="hybrid")
-                .vector(q)
-                .text(text)
-                .nprobes(effective_nprobes)
-                .refine_factor(refine_factor)
-                .select(["text", "metadata", "source", "page_number"])
-                .limit(top_k)
-                .rerank(RRFReranker())
-                .to_list()
-            )
-        else:
-            hits = (
-                table.search(q, vector_column_name=vector_column_name)
-                .nprobes(effective_nprobes)
-                .refine_factor(refine_factor)
-                .select(["text", "metadata", "source", "page_number", "_distance"])
-                .limit(top_k)
-                .to_list()
-            )
-
-        results.append(hits)
-    return results
-
-
 def _hits_to_keys(raw_hits: List[List[Dict[str, Any]]]) -> List[List[str]]:
     retrieved_keys: List[List[str]] = []
     for hits in raw_hits:
         keys: List[str] = []
         for h in hits:
+            page_number = h["page_number"]
+            source = h["source"]
             page_number = h["page_number"]
             source = h["source"]
             # Prefer explicit `pdf_page` column; fall back to derived form.
             # if res.get("page_number") is not None and source.get("source_id"):
+            if page_number is not None and source:
+                filename = Path(source).stem
+                keys.append(f"{filename}_{str(page_number)}")
+            # if res.get("page_number") is not None and source.get("source_id"):
             if page_number is not None and source:
                 filename = Path(source).stem
                 keys.append(f"{filename}_{str(page_number)}")
@@ -359,35 +304,34 @@ def retrieve_and_score(
 
     queries = df_query["query"].astype(str).tolist()
     gold = df_query["golden_answer"].astype(str).tolist()
-
     endpoint, use_grpc = _resolve_embedding_endpoint(cfg)
-    if endpoint is not None and use_grpc is not None:
-        vectors = _embed_queries_nim(
-            queries,
-            endpoint=endpoint,
-            model=cfg.embedding_model,
-            api_key=cfg.embedding_api_key,
-            grpc=bool(use_grpc),
-        )
-    else:
-        vectors = _embed_queries_local_hf(
-            queries,
-            device=cfg.local_hf_device,
-            cache_dir=cfg.local_hf_cache_dir,
-            batch_size=int(cfg.local_hf_batch_size),
-            model_name=cfg.embedding_model,
-        )
-    raw_hits = _search_lancedb(
+    retriever = Retriever(
         lancedb_uri=cfg.lancedb_uri,
-        table_name=cfg.lancedb_table,
-        query_vectors=vectors,
-        top_k=int(cfg.top_k),
-        vector_column_name=vector_column_name,
-        nprobes=int(cfg.nprobes),
-        refine_factor=int(cfg.refine_factor),
-        query_texts=queries,
+        lancedb_table=cfg.lancedb_table,
+        embedder=cfg.embedding_model or "nvidia/llama-nemotron-embed-1b-v2",
+        embedding_http_endpoint=cfg.embedding_http_endpoint,
+        embedding_api_key=cfg.embedding_api_key,
+        top_k=cfg.top_k,
+        nprobes=cfg.nprobes,
+        refine_factor=cfg.refine_factor,
         hybrid=bool(cfg.hybrid),
+        local_hf_device=cfg.local_hf_device,
+        local_hf_cache_dir=cfg.local_hf_cache_dir,
+        local_hf_batch_size=cfg.local_hf_batch_size,
+        reranker=cfg.reranker,
+        reranker_endpoint=cfg.reranker_endpoint,
+        reranker_api_key=cfg.reranker_api_key,
+        reranker_batch_size=cfg.reranker_batch_size,
     )
+    start = time.time()
+    raw_hits = retriever.queries(queries)
+    end_queries = time.time() - start
+    print(
+        f"Retrieval time for {len(queries)} "
+        f"queries: {end_queries:.2f} seconds "
+        f"(average {len(queries)/end_queries:.2f} queries/second)"
+    )
+
     retrieved_keys = _hits_to_keys(raw_hits)
     metrics = {
         f"recall@{k}": _recall_at_k(gold, retrieved_keys, int(k), match_mode=str(cfg.match_mode)) for k in cfg.ks
diff --git a/nemo_retriever/src/nemo_retriever/rerank/__init__.py b/nemo_retriever/src/nemo_retriever/rerank/__init__.py
new file mode 100644
index 000000000..988355cdd
--- /dev/null
+++ b/nemo_retriever/src/nemo_retriever/rerank/__init__.py
@@ -0,0 +1,24 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Reranking stage using nvidia/llama-nemotron-rerank-1b-v2.
+
+Exports
+-------
+NemotronRerankActor
+    Ray Data-compatible stateful actor that initialises the cross-encoder once
+    per worker and scores (query, document) pairs in batch DataFrames.
+rerank_hits
+    Convenience function to rerank a list of LanceDB hit dicts for a single
+    query string, using either a local ``NemotronRerankV2`` model or a remote
+    vLLM / NIM ``/rerank`` endpoint.
+"""
+
+from .rerank import NemotronRerankActor, rerank_hits
+
+__all__ = [
+    "NemotronRerankActor",
+    "rerank_hits",
+]
diff --git a/nemo_retriever/src/nemo_retriever/rerank/rerank.py b/nemo_retriever/src/nemo_retriever/rerank/rerank.py
new file mode 100644
index 000000000..189b56a89
--- /dev/null
+++ b/nemo_retriever/src/nemo_retriever/rerank/rerank.py
@@ -0,0 +1,377 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Reranking stage using nvidia/llama-nemotron-rerank-1b-v2.
+
+Provides:
+  - ``rerank_hits``         – rerank a list of LanceDB hits for a single query
+  - ``NemotronRerankActor`` – Ray Data-compatible stateful actor for batch DataFrames
+
+Remote endpoint
+---------------
+When ``invoke_url`` is set the actor/function calls a vLLM (>=0.14) or NIM
+server that exposes a Jina/Cohere-style ``/rerank`` REST API::
+
+    POST /rerank
+    {
+      "model": "nvidia/llama-nemotron-rerank-1b-v2",
+      "query": "...",
+      "documents": ["...", "..."],
+      "top_n": N
+    }
+
+Local model
+-----------
+When no endpoint is configured the model is loaded directly from HuggingFace
+(or ``hf_cache_dir``) using ``NemotronRerankV2``.
+
+Ray Data actor usage::
+
+    import ray
+    ds = ds.map_batches(
+        NemotronRerankActor,
+        batch_size=64,
+        batch_format="pandas",
+        num_gpus=1,
+        compute=ray.data.ActorPoolStrategy(size=4),
+        fn_constructor_kwargs={
+            "model_name": "nvidia/llama-nemotron-rerank-1b-v2",
+            "query_column": "query",
+            "text_column": "text",
+            "score_column": "rerank_score",
+            "max_length": 512,
+            "batch_size": 32,
+        },
+    )
+"""
+
+from __future__ import annotations
+
+import traceback
+from typing import Any, Dict, List, Optional
+
+import pandas as pd
+
+
+_DEFAULT_MODEL = "nvidia/llama-nemotron-rerank-1b-v2"
+_DEFAULT_MAX_LENGTH = 512
+_DEFAULT_BATCH_SIZE = 32
+_SCORE_COLUMN = "rerank_score"
+
+
+# ---------------------------------------------------------------------------
+# Remote endpoint helper
+# ---------------------------------------------------------------------------
+
+
+def _rerank_via_endpoint(
+    query: str,
+    documents: List[str],
+    *,
+    endpoint: str,
+    model_name: str = _DEFAULT_MODEL,
+    api_key: str = "",
+    top_n: Optional[int] = None,
+) -> List[float]:
+    """
+    Call a vLLM / NIM ``/rerank`` REST endpoint and return per-document scores.
+
+    The server must expose the Jina/Cohere-style rerank API available in
+    vLLM >= 0.14.0::
+
+        POST {endpoint}/rerank
+        {"model": ..., "query": ..., "documents": [...], "top_n": N}
+
+    Returns
+    -------
+    List[float]
+        Scores aligned with *documents* (higher = more relevant).
+        Documents not returned by ``top_n`` truncation receive ``-inf``.
+    """
+    import requests
+
+    url = endpoint.rstrip("/") + "/rerank"
+    headers: Dict[str, str] = {"Content-Type": "application/json"}
+    if api_key:
+        headers["Authorization"] = f"Bearer {api_key}"
+
+    payload: Dict[str, Any] = {
+        "model": model_name,
+        "query": query,
+        "documents": documents,
+    }
+    if top_n is not None:
+        payload["top_n"] = top_n
+
+    response = requests.post(url, json=payload, headers=headers, timeout=120)
+    response.raise_for_status()
+    data = response.json()
+
+    # Build score list aligned with input document order.
+    scores = [float("-inf")] * len(documents)
+    for item in data.get("results", []):
+        idx = item.get("index")
+        score = item.get("relevance_score")
+        if idx is not None and score is not None:
+            scores[idx] = float(score)
+    return scores
+
+
+# ---------------------------------------------------------------------------
+# Public helper: rerank LanceDB hits for a single query
+# ---------------------------------------------------------------------------
+
+
+def rerank_hits(
+    query: str,
+    hits: List[Dict[str, Any]],
+    *,
+    model: Optional[Any] = None,
+    invoke_url: Optional[str] = None,
+    model_name: str = _DEFAULT_MODEL,
+    api_key: str = "",
+    max_length: int = _DEFAULT_MAX_LENGTH,
+    batch_size: int = _DEFAULT_BATCH_SIZE,
+    top_n: Optional[int] = None,
+    text_key: str = "text",
+) -> List[Dict[str, Any]]:
+    """
+    Rerank *hits* (list of LanceDB result dicts) by relevance to *query*.
+
+    Every hit is scored; a hit with no ``text_key`` field is scored as an
+    empty document.  The returned list is sorted highest-score first and
+    each dict gains a ``"_rerank_score"`` field.
+
+    Parameters
+    ----------
+    query:
+        The search query.
+    hits:
+        LanceDB result dicts (as returned by ``Retriever.queries()``).
+    model:
+        A ``NemotronRerankV2`` instance (local GPU inference).  Ignored when
+        *invoke_url* is set.
+    invoke_url:
+        Base URL of a vLLM / NIM ``/rerank`` endpoint.  Takes priority over
+        *model*.
+    model_name:
+        Model identifier sent to the remote endpoint (default
+        ``"nvidia/llama-nemotron-rerank-1b-v2"``).
+    api_key:
+        Bearer token for the remote endpoint.
+    max_length:
+        Tokenizer truncation length for local inference (max 8 192).
+    batch_size:
+        GPU forward-pass batch size for local inference.
+    top_n:
+        If set, only the top-N results (after reranking) are returned.
+    text_key:
+        Dict key used to extract document text from each hit (default
+        ``"text"``).
+
+    Returns
+    -------
+    List[dict]
+        Hits sorted by ``"_rerank_score"`` descending.  Each dict has a new
+        ``"_rerank_score"`` key with the raw logit (local) or relevance score
+        (remote).
+    """
+    if not hits:
+        return hits
+
+    documents = [str(h.get(text_key) or "") for h in hits]
+
+    if invoke_url:
+        scores = _rerank_via_endpoint(
+            query,
+            documents,
+            endpoint=invoke_url,
+            model_name=model_name,
+            api_key=api_key,
+        )
+    elif model is not None:
+        scores = model.score(query, documents, max_length=max_length, batch_size=batch_size)
+    else:
+        raise ValueError("Either 'model' (NemotronRerankV2 instance) or 'invoke_url' must be provided.")
+
+    ranked = sorted(
+        [{"_rerank_score": s, **h} for s, h in zip(scores, hits)],
+        key=lambda x: x["_rerank_score"],
+        reverse=True,
+    )
+
+    if top_n is not None:
+        ranked = ranked[:top_n]
+
+    return ranked
+
+
+# ---------------------------------------------------------------------------
+# Error payload helper (mirrors other actors in this project)
+# ---------------------------------------------------------------------------
+
+
+def _error_payload(*, stage: str, exc: BaseException) -> Dict[str, Any]:
+    return {
+        "status": "error",
+        "stage": stage,
+        "error_message": str(exc),
+        "traceback": traceback.format_exc(),
+    }
+
+
+# ---------------------------------------------------------------------------
+# Ray Data actor
+# ---------------------------------------------------------------------------
+
+
+class NemotronRerankActor:
+    """
+    Ray Data-compatible stateful actor for cross-encoder reranking.
+
+    Initialises ``nvidia/llama-nemotron-rerank-1b-v2`` **once** per actor
+    instance and reuses it across batches, avoiding repeated model loads.
+
+    Each row in the input DataFrame is expected to have a *query* column and a
+    *text* (document) column.  The actor appends a ``rerank_score`` column
+    (name configurable) with the raw logit score.
+
+    Usage with Ray Data::
+
+        import ray
+        ds = ds.map_batches(
+            NemotronRerankActor,
+            batch_size=64,
+            batch_format="pandas",
+            num_gpus=1,
+            compute=ray.data.ActorPoolStrategy(size=4),
+            fn_constructor_kwargs={
+                "model_name": "nvidia/llama-nemotron-rerank-1b-v2",
+                "query_column": "query",
+                "text_column": "text",
+                "score_column": "rerank_score",
+                "max_length": 512,
+                "batch_size": 32,
+            },
+        )
+
+    Parameters
+    ----------
+    model_name:
+        HuggingFace model ID (default ``"nvidia/llama-nemotron-rerank-1b-v2"``).
+    invoke_url:
+        Base URL of a vLLM / NIM ``/rerank`` endpoint.  When set the actor
+        skips local model creation and delegates all scoring to the endpoint.
+        Also accepted as ``rerank_invoke_url``.
+    api_key:
+        Bearer token for the remote endpoint.
+    device:
+        Torch device string (default: ``"cuda"`` if available, else ``"cpu"``).
+    hf_cache_dir:
+        Directory for HuggingFace model cache.
+    query_column:
+        DataFrame column containing query strings (default ``"query"``).
+    text_column:
+        DataFrame column containing document/passage text (default ``"text"``).
+    score_column:
+        Output column name for rerank scores (default ``"rerank_score"``).
+    max_length:
+        Tokenizer truncation length (default 512).
+    batch_size:
+        GPU forward-pass micro-batch size (default 32).
+    sort_results:
+        If ``True`` (default) rows in each batch are sorted by score descending.
+    """
+
+    __slots__ = ("_kwargs", "_model")
+
+    def __init__(self, **kwargs: Any) -> None:
+        self._kwargs = dict(kwargs)
+
+        invoke_url = str(self._kwargs.get("rerank_invoke_url") or self._kwargs.get("invoke_url") or "").strip()
+        if invoke_url and "invoke_url" not in self._kwargs:
+            self._kwargs["invoke_url"] = invoke_url
+
+        if invoke_url:
+            self._model = None
+        else:
+            from nemo_retriever.model.local import NemotronRerankV2
+
+            self._model = NemotronRerankV2(
+                model_name=str(self._kwargs.get("model_name", _DEFAULT_MODEL)),
+                device=self._kwargs.get("device") or None,
+                hf_cache_dir=str(self._kwargs["hf_cache_dir"]) if self._kwargs.get("hf_cache_dir") else None,
+            )
+
+    def __call__(self, batch_df: Any, **override_kwargs: Any) -> Any:
+        try:
+            return _rerank_batch(batch_df, model=self._model, **self._kwargs, **override_kwargs)
+        except BaseException as exc:
+            if isinstance(batch_df, pd.DataFrame):
+                out = batch_df.copy()
+                payload = _error_payload(stage="actor_call", exc=exc)
+                score_col = str(self._kwargs.get("score_column", _SCORE_COLUMN))
+                out[score_col] = [payload for _ in range(len(out.index))]
+                return out
+            return [{"rerank_score": _error_payload(stage="actor_call", exc=exc)}]
+
+
+# ---------------------------------------------------------------------------
+# Batch processing function (called by actor and usable standalone)
+# ---------------------------------------------------------------------------
+
+
+def _rerank_batch(
+    batch_df: pd.DataFrame,
+    *,
+    model: Optional[Any] = None,
+    invoke_url: Optional[str] = None,
+    model_name: str = _DEFAULT_MODEL,
+    api_key: str = "",
+    query_column: str = "query",
+    text_column: str = "text",
+    score_column: str = _SCORE_COLUMN,
+    max_length: int = _DEFAULT_MAX_LENGTH,
+    batch_size: int = _DEFAULT_BATCH_SIZE,
+    sort_results: bool = True,
+    **_ignored: Any,
+) -> pd.DataFrame:
+    """
+    Score each (query, document) row in *batch_df* and append *score_column*.
+
+    When *sort_results* is ``True`` the returned DataFrame is sorted by score
+    descending within the batch.
+    """
+    if not isinstance(batch_df, pd.DataFrame):
+        raise TypeError(f"Expected a pandas DataFrame, got {type(batch_df)}")
+
+    queries = batch_df[query_column].tolist()
+    texts = batch_df[text_column].tolist()
+    pairs = list(zip(queries, texts))
+
+    if invoke_url:
+        # Remote endpoint: score pair-by-pair (each row may have a different query).
+        scores: List[float] = []
+        for q, d in pairs:
+            row_scores = _rerank_via_endpoint(
+                q,
+                [d],
+                endpoint=invoke_url,
+                model_name=model_name,
+                api_key=api_key,
+            )
+            scores.append(row_scores[0])
+    elif model is not None:
+        scores = model.score_pairs(pairs, max_length=max_length, batch_size=batch_size)
+    else:
+        raise ValueError("Either 'model' or 'invoke_url' must be provided to NemotronRerankActor.")
+
+    out = batch_df.copy()
+    out[score_column] = scores
+
+    if sort_results:
+        out = out.sort_values(score_column, ascending=False).reset_index(drop=True)
+
+    return out
diff --git a/nemo_retriever/src/nemo_retriever/retriever.py b/nemo_retriever/src/nemo_retriever/retriever.py
index bffd35cf0..aa9203783 100644
--- a/nemo_retriever/src/nemo_retriever/retriever.py
+++ b/nemo_retriever/src/nemo_retriever/retriever.py
@@ -4,14 +4,39 @@
 
 from __future__ import annotations
 
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any, Optional, Sequence
+from tqdm import tqdm
 
 
 @dataclass
 class Retriever:
-    """Simple query helper over LanceDB with configurable embedders."""
+    """Simple query helper over LanceDB with configurable embedders.
+
+    Retrieval pipeline
+    ------------------
+    1. Embed query strings (NIM endpoint or local HuggingFace model).
+    2. Search LanceDB (vector or hybrid vector+BM25).
+    3. Optionally rerank the results with ``nvidia/llama-nemotron-rerank-1b-v2``
+       (NIM/vLLM endpoint or local HuggingFace model).
+
+    Reranking
+    ---------
+    Set ``reranker`` to a model name (e.g.
+    ``"nvidia/llama-nemotron-rerank-1b-v2"``) to enable post-retrieval
+    reranking.  Results are re-sorted by the cross-encoder score and a
+    ``"_rerank_score"`` key is added to each hit dict.
+
+    Use ``reranker_endpoint`` to delegate to a running vLLM (>=0.14) or NIM
+    server instead of loading the model locally::
+
+        retriever = Retriever(
+            reranker="nvidia/llama-nemotron-rerank-1b-v2",
+            reranker_endpoint="http://localhost:8000",
+        )
+        results = retriever.query("What is machine learning?")
+    """
 
     lancedb_uri: str = "lancedb"
     lancedb_table: str = "nv-ingest"
@@ -27,6 +52,23 @@ class Retriever:
     local_hf_device: Optional[str] = None
     local_hf_cache_dir: Optional[Path] = None
     local_hf_batch_size: int = 64
+    # Reranking -----------------------------------------------------------
+    reranker: Optional[str] = "nvidia/llama-nemotron-rerank-1b-v2"
+    """HuggingFace model ID for local reranking (e.g. 'nvidia/llama-nemotron-rerank-1b-v2').
+    Reranking is enabled by default; set to None to skip it."""
+    reranker_endpoint: Optional[str] = None
+    """Base URL of a vLLM or NIM server exposing ``/rerank``.  Takes priority over the local model."""
+    reranker_api_key: str = ""
+    """Bearer token for the remote rerank endpoint."""
+    reranker_max_length: int = 512
+    """Tokenizer truncation length for local reranking (max 8192)."""
+    reranker_batch_size: int = 32
+    """GPU micro-batch size for local reranking."""
+    reranker_refine_factor: int = 4
+    """Number of candidates to rerank = top_k * reranker_refine_factor.
+    Set to 1 to rerank only the top_k results."""
+    # Internal cache for the local rerank model (not part of the public API).
+    _reranker_model: Any = field(default=None, init=False, repr=False, compare=False)
 
     def _resolve_embedding_endpoint(self) -> Optional[str]:
         http_ep = self.embedding_http_endpoint.strip() if isinstance(self.embedding_http_endpoint, str) else None
@@ -107,6 +149,8 @@ def _search_lancedb(
         results: list[list[dict[str, Any]]] = []
         for i, vector in enumerate(query_vectors):
             q = np.asarray(vector, dtype="float32")
+            # With a reranker, over-fetch top_k * reranker_refine_factor candidates for hybrid and dense search.
+            top_k = self.top_k if not self.reranker else self.top_k * self.reranker_refine_factor
             if self.hybrid:
                 from lancedb.rerankers import RRFReranker  # type: ignore
 
@@ -116,8 +160,8 @@ def _search_lancedb(
                     .text(query_texts[i])
                     .nprobes(effective_nprobes)
                     .refine_factor(int(self.refine_factor))
-                    .select(["text", "metadata", "source"])
-                    .limit(int(self.top_k))
+                    .select(["text", "metadata", "source", "page_number"])
+                    .limit(int(top_k))
                     .rerank(RRFReranker())
                     .to_list()
                 )
@@ -126,13 +170,62 @@ def _search_lancedb(
                     table.search(q, vector_column_name=self.vector_column_name)
                     .nprobes(effective_nprobes)
                     .refine_factor(int(self.refine_factor))
-                    .select(["text", "metadata", "source", "_distance"])
-                    .limit(int(self.top_k))
+                    .select(["text", "metadata", "source", "page_number", "_distance"])
+                    .limit(int(top_k))
                     .to_list()
                 )
             results.append(hits)
         return results
 
+    # ------------------------------------------------------------------
+    # Reranking helpers
+    # ------------------------------------------------------------------
+
+    def _get_reranker_model(self) -> Any:
+        """Lazily load and cache the local NemotronRerankV2 model."""
+        if self._reranker_model is None:
+            from nemo_retriever.model.local import NemotronRerankV2
+
+            cache_dir = str(self.local_hf_cache_dir) if self.local_hf_cache_dir else None
+            self._reranker_model = NemotronRerankV2(
+                model_name=str(self.reranker),
+                device=self.local_hf_device,
+                hf_cache_dir=cache_dir,
+            )
+        return self._reranker_model
+
+    def _rerank_results(
+        self,
+        query_texts: list[str],
+        results: list[list[dict[str, Any]]],
+    ) -> list[list[dict[str, Any]]]:
+        """Rerank each per-query result list using the configured reranker."""
+        from nemo_retriever.rerank import rerank_hits
+
+        reranker_endpoint = (self.reranker_endpoint or "").strip() or None
+        model = None if reranker_endpoint else self._get_reranker_model()
+
+        reranked: list[list[dict[str, Any]]] = []
+        for query, hits in tqdm(zip(query_texts, results), desc="Reranking", unit="query", total=len(query_texts)):
+            reranked.append(
+                rerank_hits(
+                    query,
+                    hits,
+                    model=model,
+                    invoke_url=reranker_endpoint,
+                    model_name=str(self.reranker),
+                    api_key=(self.reranker_api_key or "").strip(),
+                    max_length=int(self.reranker_max_length),
+                    batch_size=int(self.reranker_batch_size),
+                    top_n=int(self.top_k),
+                )
+            )
+        return reranked
+
+    # ------------------------------------------------------------------
+    # Public query API
+    # ------------------------------------------------------------------
+
     def query(
         self,
         query: str,
@@ -157,7 +250,13 @@ def queries(
         lancedb_uri: Optional[str] = None,
         lancedb_table: Optional[str] = None,
     ) -> list[list[dict[str, Any]]]:
-        """Run retrieval for multiple query strings."""
+        """Run retrieval for multiple query strings.
+
+        If ``reranker`` is set on this instance, the initial search results
+        are re-scored with the configured reranker model (loaded locally or
+        served by ``reranker_endpoint``), sorted by cross-encoder score, and
+        truncated to ``top_k``.  Each hit gains a ``"_rerank_score"`` key.
+        """
         query_texts = [str(q) for q in queries]
         if not query_texts:
             return []
@@ -179,13 +278,21 @@ def queries(
                 model_name=resolved_embedder,
             )
 
-        return self._search_lancedb(
+        results = self._search_lancedb(
             lancedb_uri=resolved_lancedb_uri,
             lancedb_table=resolved_lancedb_table,
             query_vectors=vectors,
             query_texts=query_texts,
         )
 
+        if self.reranker:
+            assert self.top_k * self.reranker_refine_factor == len(
+                results[0]
+            ), "Expected top_k * reranker_refine_factor hits per query; the table may contain fewer rows."
+            results = self._rerank_results(query_texts, results)
+
+        return results
+
 
 # Backward compatibility alias.
 retriever = Retriever
diff --git a/nemo_retriever/src/nemo_retriever/utils/benchmark/audio_extract_actor.py b/nemo_retriever/src/nemo_retriever/utils/benchmark/audio_extract_actor.py
index 748a00975..8ade02b36 100644
--- a/nemo_retriever/src/nemo_retriever/utils/benchmark/audio_extract_actor.py
+++ b/nemo_retriever/src/nemo_retriever/utils/benchmark/audio_extract_actor.py
@@ -55,6 +55,73 @@ def __call__(self, batch_df: pd.DataFrame) -> pd.DataFrame:
 app = typer.Typer(help="Benchmark audio extraction (MediaChunkActor + ASRActor) throughput (chunk rows/sec).")
 
 
+def run_benchmark(
+    audio_path: Path,
+    rows: int = 16,
+    workers: str = "1,2",
+    batch_sizes: str = "2,4,8",
+    mock_asr: bool = True,
+    split_type: str = "size",
+    split_interval: int = 450,
+    ray_address: Optional[str] = None,
+    output_json: Optional[Path] = None,
+) -> None:
+    if not is_media_available():
+        raise typer.BadParameter("Audio benchmark requires ffmpeg on PATH.")
+
+    if split_type not in ("size", "time", "frame"):
+        raise typer.BadParameter("--split-type must be one of: size, time, frame")
+
+    maybe_init_ray(ray_address)
+    worker_grid = parse_csv_ints(workers, name="workers")
+    batch_grid = parse_csv_ints(batch_sizes, name="batch_sizes")
+    seed_row = make_seed_audio_row(audio_path)
+
+    chunk_params = AudioChunkParams(
+        split_type=split_type,
+        split_interval=split_interval,
+    )
+
+    def _map(ds: rd.Dataset, worker_count: int, batch_size: int) -> rd.Dataset:
+        chunk_actor = MediaChunkActor(params=chunk_params)
+        if mock_asr:
+            asr_actor = MockASRActor()
+        else:
+            asr_actor = ASRActor(params=asr_params_from_env())
+
+        ds = ds.map_batches(
+            chunk_actor,
+            batch_size=int(batch_size),
+            batch_format="pandas",
+            num_cpus=1,
+            num_gpus=0,
+            compute=rd.TaskPoolStrategy(size=int(worker_count)),
+        )
+        ds = ds.map_batches(
+            asr_actor,
+            batch_size=int(batch_size),
+            batch_format="pandas",
+            num_cpus=1,
+            num_gpus=0.25 if not mock_asr else 0,
+            compute=rd.TaskPoolStrategy(size=int(worker_count)),
+        )
+        return ds
+
+    best, results = benchmark_sweep(
+        stage_name="audio_extract",
+        seed_row=seed_row,
+        rows=int(rows),
+        workers=worker_grid,
+        batch_sizes=batch_grid,
+        map_builder=_map,
+    )
+    typer.echo(
+        f"BEST audio_extract: workers={best.workers} batch_size={best.batch_size} "
+        f"chunk_rows={best.rows} elapsed={best.elapsed_seconds:.3f}s rows_per_second={best.rows_per_second:.2f}"
+    )
+    maybe_write_results_json(output_json, best=best, results=results)
+
+
 @app.command("run")
 def run(
     audio_path: Path = typer.Option(
@@ -108,57 +175,14 @@ def run(
         help="Optional output JSON summary path.",
     ),
 ) -> None:
-    if not is_media_available():
-        raise typer.BadParameter("Audio benchmark requires ffmpeg on PATH.")
-
-    if split_type not in ("size", "time", "frame"):
-        raise typer.BadParameter("--split-type must be one of: size, time, frame")
-
-    maybe_init_ray(ray_address)
-    worker_grid = parse_csv_ints(workers, name="workers")
-    batch_grid = parse_csv_ints(batch_sizes, name="batch_sizes")
-    seed_row = make_seed_audio_row(audio_path)
-
-    chunk_params = AudioChunkParams(
+    run_benchmark(
+        audio_path=audio_path,
+        rows=rows,
+        workers=workers,
+        batch_sizes=batch_sizes,
+        mock_asr=mock_asr,
         split_type=split_type,
         split_interval=split_interval,
+        ray_address=ray_address,
+        output_json=output_json,
     )
-
-    def _map(ds: rd.Dataset, worker_count: int, batch_size: int) -> rd.Dataset:
-        chunk_actor = MediaChunkActor(params=chunk_params)
-        if mock_asr:
-            asr_actor = MockASRActor()
-        else:
-            asr_actor = ASRActor(params=asr_params_from_env())
-
-        ds = ds.map_batches(
-            chunk_actor,
-            batch_size=int(batch_size),
-            batch_format="pandas",
-            num_cpus=1,
-            num_gpus=0,
-            compute=rd.TaskPoolStrategy(size=int(worker_count)),
-        )
-        ds = ds.map_batches(
-            asr_actor,
-            batch_size=int(batch_size),
-            batch_format="pandas",
-            num_cpus=1,
-            num_gpus=0.25 if not mock_asr else 0,
-            compute=rd.TaskPoolStrategy(size=int(worker_count)),
-        )
-        return ds
-
-    best, results = benchmark_sweep(
-        stage_name="audio_extract",
-        seed_row=seed_row,
-        rows=int(rows),
-        workers=worker_grid,
-        batch_sizes=batch_grid,
-        map_builder=_map,
-    )
-    typer.echo(
-        f"BEST audio_extract: workers={best.workers} batch_size={best.batch_size} "
-        f"chunk_rows={best.rows} elapsed={best.elapsed_seconds:.3f}s rows_per_second={best.rows_per_second:.2f}"
-    )
-    maybe_write_results_json(output_json, best=best, results=results)
diff --git a/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py b/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py
index ebe460204..c3acb2d8f 100644
--- a/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py
+++ b/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py
@@ -171,7 +171,6 @@ def _build_lancedb_rows_from_df(rows: List[Dict[str, Any]]) -> List[Dict[str, An
                 "pdf_basename": pdf_basename,
                 "page_number": int(page_number),
                 "source": source_id,
-                "source_id": source_id,
                 "path": path,
                 "text": row.get("text", ""),
                 "metadata": str(meta),
@@ -295,26 +294,6 @@ def write_text_embeddings_dir_to_lancedb(
 
     lancedb.run(results)
 
-    # all_rows: List[Dict[str, Any]] = []
-    # for p in files:
-    #     try:
-    #         df = _read_text_embeddings_json_df(p)
-    #         if df.empty:
-    #             skipped += 1
-    #             continue
-    #         rows = _build_lancedb_rows_from_df(df)
-    #         if not rows:
-    #             skipped += 1
-    #             continue
-    #         all_rows.extend(rows)
-    #         processed += 1
-    #     except Exception:
-    #         failed += 1
-    #         logger.exception("Failed reading embeddings from %s", p)
-
-    # # Write once so --overwrite behaves as expected.
-    # _write_rows_to_lancedb(all_rows, cfg=cfg)
-
     return {
         "input_dir": str(input_dir),
         "n_files": len(files),
diff --git a/nemo_retriever/tests/test_audio_benchmark.py b/nemo_retriever/tests/test_audio_benchmark.py
index 3862d3a83..a6b67f092 100644
--- a/nemo_retriever/tests/test_audio_benchmark.py
+++ b/nemo_retriever/tests/test_audio_benchmark.py
@@ -31,25 +31,12 @@ def test_audio_benchmark_run_mock_asr(tmp_path: Path):
     wav = tmp_path / "tiny.wav"
     _make_small_wav(wav, duration_sec=0.3)
 
-    from typer.testing import CliRunner
-
-    from nemo_retriever.utils.benchmark.audio_extract_actor import app
-
-    runner = CliRunner()
-    result = runner.invoke(
-        app,
-        [
-            "run",
-            "--audio-path",
-            str(wav),
-            "--rows",
-            "2",
-            "--workers",
-            "1",
-            "--batch-sizes",
-            "2",
-            "--mock-asr",
-        ],
+    from nemo_retriever.utils.benchmark.audio_extract_actor import run_benchmark
+
+    run_benchmark(
+        audio_path=wav,
+        rows=2,
+        workers="1",
+        batch_sizes="2",
+        mock_asr=True,
     )
-    assert result.exit_code == 0, (result.stdout, result.stderr)
-    assert "audio_extract" in result.stdout or "BEST" in result.stdout
diff --git a/nemo_retriever/tests/test_audio_pipeline_batch.py b/nemo_retriever/tests/test_audio_pipeline_batch.py
index 09ce1a7ad..ae0d3a136 100644
--- a/nemo_retriever/tests/test_audio_pipeline_batch.py
+++ b/nemo_retriever/tests/test_audio_pipeline_batch.py
@@ -96,6 +96,7 @@ def test_batch_audio_pipeline_with_mocked_asr(tmp_path: Path):
                 runtime_env={"working_dir": str(_nv_ingest_root)},
             )
             results = ingestor.ingest()
+            results = results._rd_dataset.take_all() if results is not None else None
         finally:
             try:
                 ray.shutdown()
@@ -219,6 +220,7 @@ def test_fused_audio_pipeline_with_mocked_asr(tmp_path: Path):
                 runtime_env={"working_dir": str(_nv_ingest_root)},
             )
             results = ingestor.ingest()
+            results = results._rd_dataset.take_all() if results is not None else None
         finally:
             try:
                 ray.shutdown()
diff --git a/nemo_retriever/tests/test_html_convert.py b/nemo_retriever/tests/test_html_convert.py
index e558a4b29..399ae9091 100644
--- a/nemo_retriever/tests/test_html_convert.py
+++ b/nemo_retriever/tests/test_html_convert.py
@@ -11,7 +11,11 @@
 import pandas as pd
 import pytest
 
-from nemo_retriever.html.convert import html_bytes_to_chunks_df, html_file_to_chunks_df, html_to_markdown
+from nemo_retriever.html.convert import (
+    html_bytes_to_chunks_df,
+    html_file_to_chunks_df,
+    html_to_markdown,
+)
 
 
 def test_html_to_markdown_str():
diff --git a/nemo_retriever/tests/test_nemotron_rerank_v2.py b/nemo_retriever/tests/test_nemotron_rerank_v2.py
new file mode 100644
index 000000000..4c6761a5b
--- /dev/null
+++ b/nemo_retriever/tests/test_nemotron_rerank_v2.py
@@ -0,0 +1,608 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Unit tests for NemotronRerankV2 and the rerank module helpers.
+
+All heavy dependencies (torch, transformers, nemo_retriever.utils.hf_cache)
+are stubbed via sys.modules injection so no GPU or model download is required.
+"""
+
+from __future__ import annotations
+
+import sys
+from types import ModuleType
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+
+# ---------------------------------------------------------------------------
+# Helpers to build lightweight torch / transformers stubs
+# ---------------------------------------------------------------------------
+
+
+def _make_tensor_stub(values: list[float]) -> MagicMock:
+    """Return a mock supporting the ``.view(-1).cpu().tolist()`` chain of a 1-D torch.Tensor."""
+    t = MagicMock()
+    t.view.return_value = t
+    t.cpu.return_value = t
+    t.tolist.return_value = values
+    return t
+
+
+def _make_model_output_stub(logits_values: list[float]) -> MagicMock:
+    out = MagicMock()
+    out.logits = _make_tensor_stub(logits_values)
+    return out
+
+
+def _build_torch_stub() -> MagicMock:
+    torch_mod = MagicMock()
+    torch_mod.cuda.is_available.return_value = False
+    torch_mod.bfloat16 = "bfloat16"
+    torch_mod.inference_mode.return_value.__enter__ = lambda s: None
+    torch_mod.inference_mode.return_value.__exit__ = MagicMock(return_value=False)
+    return torch_mod
+
+
+def _build_transformers_stub(model_output_values: list[float]) -> tuple[MagicMock, MagicMock, MagicMock]:
+    """Return (transformers_mod, tokenizer_instance, model_instance)."""
+    tokenizer_inst = MagicMock()
+    tokenizer_inst.pad_token = "pad"
+    tokenizer_inst.eos_token_id = 0
+    # __call__ on the tokenizer returns a dict of tensors
+    tokenizer_inst.return_value = {"input_ids": MagicMock(), "attention_mask": MagicMock()}
+
+    model_inst = MagicMock()
+    model_inst.eval.return_value = model_inst
+    model_inst.to.return_value = model_inst
+    model_inst.config.pad_token_id = 1
+    model_inst.return_value = _make_model_output_stub(model_output_values)
+
+    AutoTokenizer = MagicMock()
+    AutoTokenizer.from_pretrained.return_value = tokenizer_inst
+
+    AutoModelForSequenceClassification = MagicMock()
+    AutoModelForSequenceClassification.from_pretrained.return_value = model_inst
+
+    transformers_mod = MagicMock()
+    transformers_mod.AutoTokenizer = AutoTokenizer
+    transformers_mod.AutoModelForSequenceClassification = AutoModelForSequenceClassification
+
+    return transformers_mod, tokenizer_inst, model_inst
+
+
+@pytest.fixture()
+def _patch_heavy_deps(monkeypatch):
+    """Inject torch + transformers stubs and disable hf_cache setup."""
+    torch_stub = _build_torch_stub()
+    transformers_stub, tok, mdl = _build_transformers_stub([1.5, -0.3])
+
+    monkeypatch.setitem(sys.modules, "torch", torch_stub)
+    monkeypatch.setitem(sys.modules, "transformers", transformers_stub)
+
+    # Stub hf_cache so configure_global_hf_cache_base() is a no-op.
+    hf_cache_mod = ModuleType("nemo_retriever.utils.hf_cache")
+    hf_cache_mod.configure_global_hf_cache_base = MagicMock()
+    monkeypatch.setitem(sys.modules, "nemo_retriever.utils.hf_cache", hf_cache_mod)
+
+    # The parent model module is not stubbed here; tests instead import
+    # NemotronRerankV2 after the patches above are in place.
+    yield torch_stub, transformers_stub, tok, mdl
+
+
+# ---------------------------------------------------------------------------
+# _prompt_template
+# ---------------------------------------------------------------------------
+
+
+def test_prompt_template_format():
+    from nemo_retriever.rerank.rerank import _rerank_via_endpoint  # noqa: F401 — just ensure importable
+    from nemo_retriever.model.local.nemotron_rerank_v2 import _prompt_template
+
+    result = _prompt_template("What is ML?", "Machine learning is a branch of AI.")
+    assert "question:What is ML?" in result
+    assert "passage:Machine learning is a branch of AI." in result
+
+
+# ---------------------------------------------------------------------------
+# NemotronRerankV2 — properties & initialisation
+# ---------------------------------------------------------------------------
+
+
+class TestNemotronRerankV2Properties:
+    """Test BaseModel properties without loading real weights."""
+
+    def _make_instance(self, model_name: str = "nvidia/llama-nemotron-rerank-1b-v2") -> object:
+        """Instantiate NemotronRerankV2 with all heavy ops mocked out."""
+        from nemo_retriever.model.local import nemotron_rerank_v2 as mod
+
+        with (
+            patch.object(mod, "configure_global_hf_cache_base"),
+            patch("torch.cuda.is_available", return_value=False),
+            patch("transformers.AutoTokenizer") as MockTok,
+            patch("transformers.AutoModelForSequenceClassification") as MockModel,
+        ):
+            tok = MockTok.from_pretrained.return_value
+            tok.pad_token = "pad"
+            tok.eos_token_id = 0
+            mdl = MockModel.from_pretrained.return_value
+            mdl.eval.return_value = mdl
+            mdl.to.return_value = mdl
+            mdl.config.pad_token_id = 1
+            obj = mod.NemotronRerankV2(model_name=model_name)
+        return obj
+
+    def test_model_name(self):
+        obj = self._make_instance()
+        assert obj.model_name == "nvidia/llama-nemotron-rerank-1b-v2"
+
+    def test_model_type(self):
+        obj = self._make_instance()
+        assert obj.model_type == "reranker"
+
+    def test_model_runmode(self):
+        obj = self._make_instance()
+        assert obj.model_runmode == "local"
+
+    def test_input_batch_size(self):
+        obj = self._make_instance()
+        assert obj.input_batch_size == 32
+
+    def test_custom_model_name_stored(self):
+        obj = self._make_instance("my-org/my-reranker")
+        assert obj.model_name == "my-org/my-reranker"
+
+    def test_device_defaults_to_cpu_when_no_cuda(self):
+        obj = self._make_instance()
+        assert obj._device == "cpu"
+
+
+# ---------------------------------------------------------------------------
+# NemotronRerankV2 — score() logic (batch chunking, empty input)
+# ---------------------------------------------------------------------------
+
+
+class TestNemotronRerankV2Score:
+    """Test score() and score_pairs() without real model weights."""
+
+    @pytest.fixture()
+    def reranker(self):
+        from nemo_retriever.model.local import nemotron_rerank_v2 as mod
+
+        with (
+            patch.object(mod, "configure_global_hf_cache_base"),
+            patch("torch.cuda.is_available", return_value=False),
+            patch("transformers.AutoTokenizer") as MockTok,
+            patch("transformers.AutoModelForSequenceClassification") as MockModel,
+        ):
+            tok_inst = MockTok.from_pretrained.return_value
+            tok_inst.pad_token = "pad"
+            tok_inst.eos_token_id = 0
+            mdl_inst = MockModel.from_pretrained.return_value
+            mdl_inst.eval.return_value = mdl_inst
+            mdl_inst.to.return_value = mdl_inst
+            mdl_inst.config.pad_token_id = 1
+            obj = mod.NemotronRerankV2()
+
+        return obj
+
+    def test_score_empty_documents_returns_empty(self, reranker):
+        assert reranker.score("q", []) == []
+
+    def test_score_pairs_empty_returns_empty(self, reranker):
+        assert reranker.score_pairs([]) == []
+
+    def test_score_calls_model_and_returns_flat_list(self, reranker):
+        """score() should return one float per document."""
+        logit_tensor = MagicMock()
+        logit_tensor.view.return_value = logit_tensor
+        logit_tensor.cpu.return_value = logit_tensor
+        logit_tensor.tolist.return_value = [3.5, -1.2]
+
+        model_out = MagicMock()
+        model_out.logits = logit_tensor
+
+        reranker._tokenizer.return_value = {"input_ids": MagicMock(), "attention_mask": MagicMock()}
+        reranker._model.return_value = model_out
+
+        with patch("torch.inference_mode") as inf_mode:
+            inf_mode.return_value.__enter__ = lambda s: None
+            inf_mode.return_value.__exit__ = MagicMock(return_value=False)
+            scores = reranker.score("What is ML?", ["Machine learning is…", "Paris is…"])
+
+        assert len(scores) == 2
+        assert scores == [3.5, -1.2]
+
+    def test_score_prompts_are_formatted_correctly(self, reranker):
+        """The tokenizer must receive the templated text, not the raw document."""
+        captured_texts = []
+
+        def fake_tokenizer(texts, **kwargs):
+            captured_texts.extend(texts)
+            m = MagicMock()
+            m.items.return_value = []
+            return m
+
+        reranker._tokenizer.side_effect = fake_tokenizer
+
+        logit_tensor = MagicMock()
+        logit_tensor.view.return_value = logit_tensor
+        logit_tensor.cpu.return_value = logit_tensor
+        logit_tensor.tolist.return_value = [0.0]
+
+        model_out = MagicMock()
+        model_out.logits = logit_tensor
+        reranker._model.return_value = model_out
+
+        with patch("torch.inference_mode") as inf_mode:
+            inf_mode.return_value.__enter__ = lambda s: None
+            inf_mode.return_value.__exit__ = MagicMock(return_value=False)
+            reranker.score("my query", ["my document"])
+
+        assert len(captured_texts) == 1
+        assert "question:my query" in captured_texts[0]
+        assert "passage:my document" in captured_texts[0]
+
+    def test_score_splits_into_batches(self, reranker):
+        """With batch_size=2 and 5 documents, model should be called 3 times."""
+        call_count = [0]
+
+        def fake_tokenizer(texts, **kwargs):
+            m = MagicMock()
+            m.items.return_value = [("input_ids", MagicMock())]
+            return m
+
+        reranker._tokenizer.side_effect = fake_tokenizer
+
+        def fake_model(**kwargs):
+            # Count forward passes; each call corresponds to one tokenized batch.
+            call_count[0] += 1
+            logit_tensor = MagicMock()
+            logit_tensor.view.return_value = logit_tensor
+            logit_tensor.cpu.return_value = logit_tensor
+            logit_tensor.tolist.return_value = [1.0] * 2  # Return 2 scores per call
+            out = MagicMock()
+            out.logits = logit_tensor
+            return out
+
+        reranker._model.side_effect = fake_model
+
+        with patch("torch.inference_mode") as inf_mode:
+            inf_mode.return_value.__enter__ = lambda s: None
+            inf_mode.return_value.__exit__ = MagicMock(return_value=False)
+            # 5 documents, batch_size=2 → ceil(5/2) = 3 forward passes
+            reranker.score("q", ["d1", "d2", "d3", "d4", "d5"], batch_size=2)
+
+        assert call_count[0] == 3
+
+    def test_score_pairs_uses_query_per_pair(self, reranker):
+        """score_pairs() must use each pair's own query, not a shared one."""
+        captured = []
+
+        def fake_tokenizer(texts, **kwargs):
+            captured.extend(texts)
+            m = MagicMock()
+            m.items.return_value = []
+            return m
+
+        reranker._tokenizer.side_effect = fake_tokenizer
+
+        logit_tensor = MagicMock()
+        logit_tensor.view.return_value = logit_tensor
+        logit_tensor.cpu.return_value = logit_tensor
+        logit_tensor.tolist.return_value = [0.0, 0.0]
+
+        model_out = MagicMock()
+        model_out.logits = logit_tensor
+        reranker._model.return_value = model_out
+
+        with patch("torch.inference_mode") as inf_mode:
+            inf_mode.return_value.__enter__ = lambda s: None
+            inf_mode.return_value.__exit__ = MagicMock(return_value=False)
+            reranker.score_pairs([("q1", "doc A"), ("q2", "doc B")])
+
+        assert any("question:q1" in t for t in captured)
+        assert any("question:q2" in t for t in captured)
+
+
+# ---------------------------------------------------------------------------
+# rerank_hits() — standalone helper
+# ---------------------------------------------------------------------------
+
+
+class TestRerankHits:
+    """Test the public rerank_hits() convenience function."""
+
+    def _make_hits(self, n: int, prefix: str = "doc") -> list[dict]:
+        return [{"text": f"{prefix}{i}", "_distance": float(i)} for i in range(n)]
+
+    def test_empty_hits_returns_empty(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        model = MagicMock()
+        assert rerank_hits("q", [], model=model) == []
+
+    def test_results_sorted_by_score_descending(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        hits = self._make_hits(3)
+        model = MagicMock()
+        model.score.return_value = [0.1, 5.0, -1.0]
+
+        out = rerank_hits("q", hits, model=model)
+
+        scores = [h["_rerank_score"] for h in out]
+        assert scores == sorted(scores, reverse=True)
+
+    def test_rerank_score_added_to_each_hit(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        hits = [{"text": "hello"}, {"text": "world"}]
+        model = MagicMock()
+        model.score.return_value = [2.0, 3.0]
+
+        out = rerank_hits("q", hits, model=model)
+        assert all("_rerank_score" in h for h in out)
+
+    def test_top_n_truncates_output(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        hits = self._make_hits(5)
+        model = MagicMock()
+        model.score.return_value = [5.0, 4.0, 3.0, 2.0, 1.0]
+
+        out = rerank_hits("q", hits, model=model, top_n=3)
+        assert len(out) == 3
+
+    def test_model_score_called_with_query_and_texts(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        hits = [{"text": "first"}, {"text": "second"}]
+        model = MagicMock()
+        model.score.return_value = [1.0, 2.0]
+
+        rerank_hits("my query", hits, model=model)
+
+        model.score.assert_called_once_with("my query", ["first", "second"], max_length=512, batch_size=32)
+
+    def test_raises_without_model_or_endpoint(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        with pytest.raises(ValueError, match="model.*invoke_url"):
+            rerank_hits("q", [{"text": "doc"}])
+
+    def test_custom_text_key(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        hits = [{"content": "alpha"}, {"content": "beta"}]
+        model = MagicMock()
+        model.score.return_value = [1.0, 2.0]
+
+        out = rerank_hits("q", hits, model=model, text_key="content")
+        assert len(out) == 2
+
+    def test_original_hit_keys_preserved(self):
+        from nemo_retriever.rerank import rerank_hits
+
+        hits = [{"text": "t", "metadata": "m", "_distance": 0.5}]
+        model = MagicMock()
+        model.score.return_value = [7.0]
+
+        out = rerank_hits("q", hits, model=model)
+        assert out[0]["metadata"] == "m"
+        assert out[0]["_distance"] == 0.5
+
+
+# ---------------------------------------------------------------------------
+# _rerank_via_endpoint()
+# ---------------------------------------------------------------------------
+
+
+class TestRerankViaEndpoint:
+    def test_posts_to_rerank_url(self):
+        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
+
+        mock_resp = MagicMock()
+        mock_resp.json.return_value = {
+            "results": [
+                {"index": 0, "relevance_score": 0.9},
+                {"index": 1, "relevance_score": 0.3},
+            ]
+        }
+        mock_resp.raise_for_status = MagicMock()
+
+        with patch("requests.post", return_value=mock_resp) as mock_post:
+            scores = _rerank_via_endpoint(
+                "What is ML?",
+                ["Machine learning is…", "Paris is…"],
+                endpoint="http://localhost:8000",
+                model_name="nvidia/llama-nemotron-rerank-1b-v2",
+            )
+
+        mock_post.assert_called_once()
+        call_kwargs = mock_post.call_args
+        assert call_kwargs[0][0] == "http://localhost:8000/rerank"
+        assert call_kwargs[1]["json"]["query"] == "What is ML?"
+        assert len(call_kwargs[1]["json"]["documents"]) == 2
+
+        assert scores == [0.9, 0.3]
+
+    def test_scores_aligned_with_input_order(self):
+        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
+
+        # Server returns results out of input order (indices 2, 0, 1)
+        mock_resp = MagicMock()
+        mock_resp.json.return_value = {
+            "results": [
+                {"index": 2, "relevance_score": 0.1},
+                {"index": 0, "relevance_score": 0.8},
+                {"index": 1, "relevance_score": 0.5},
+            ]
+        }
+        mock_resp.raise_for_status = MagicMock()
+
+        with patch("requests.post", return_value=mock_resp):
+            scores = _rerank_via_endpoint(
+                "q",
+                ["d0", "d1", "d2"],
+                endpoint="http://localhost:8000",
+            )
+
+        assert scores[0] == 0.8  # index 0
+        assert scores[1] == 0.5  # index 1
+        assert scores[2] == 0.1  # index 2
+
+    def test_authorization_header_sent_when_api_key_provided(self):
+        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
+
+        mock_resp = MagicMock()
+        mock_resp.json.return_value = {"results": [{"index": 0, "relevance_score": 1.0}]}
+        mock_resp.raise_for_status = MagicMock()
+
+        with patch("requests.post", return_value=mock_resp) as mock_post:
+            _rerank_via_endpoint(
+                "q",
+                ["d"],
+                endpoint="http://localhost:8000",
+                api_key="my-secret-key",
+            )
+
+        headers = mock_post.call_args[1]["headers"]
+        assert headers["Authorization"] == "Bearer my-secret-key"
+
+    def test_trailing_slash_on_endpoint_normalized(self):
+        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
+
+        mock_resp = MagicMock()
+        mock_resp.json.return_value = {"results": [{"index": 0, "relevance_score": 0.5}]}
+        mock_resp.raise_for_status = MagicMock()
+
+        with patch("requests.post", return_value=mock_resp) as mock_post:
+            _rerank_via_endpoint("q", ["d"], endpoint="http://localhost:8000/")
+
+        url = mock_post.call_args[0][0]
+        assert url == "http://localhost:8000/rerank"
+
+    def test_top_n_sent_in_payload_when_specified(self):
+        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
+
+        mock_resp = MagicMock()
+        mock_resp.json.return_value = {"results": [{"index": 0, "relevance_score": 0.5}]}
+        mock_resp.raise_for_status = MagicMock()
+
+        with patch("requests.post", return_value=mock_resp) as mock_post:
+            _rerank_via_endpoint("q", ["d"], endpoint="http://localhost:8000", top_n=5)
+
+        payload = mock_post.call_args[1]["json"]
+        assert payload["top_n"] == 5
+
+    def test_top_n_not_in_payload_when_not_specified(self):
+        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
+
+        mock_resp = MagicMock()
+        mock_resp.json.return_value = {"results": [{"index": 0, "relevance_score": 0.5}]}
+        mock_resp.raise_for_status = MagicMock()
+
+        with patch("requests.post", return_value=mock_resp) as mock_post:
+            _rerank_via_endpoint("q", ["d"], endpoint="http://localhost:8000")
+
+        payload = mock_post.call_args[1]["json"]
+        assert "top_n" not in payload
+
+
+# ---------------------------------------------------------------------------
+# NemotronRerankActor
+# ---------------------------------------------------------------------------
+
+
+class TestNemotronRerankActor:
+    """Test the Ray Data-compatible actor."""
+
+    def test_actor_with_invoke_url_skips_local_model(self):
+        from nemo_retriever.rerank.rerank import NemotronRerankActor
+
+        actor = NemotronRerankActor(invoke_url="http://localhost:8000")
+        assert actor._model is None
+
+    def test_actor_with_rerank_invoke_url_alias(self):
+        from nemo_retriever.rerank.rerank import NemotronRerankActor
+
+        actor = NemotronRerankActor(rerank_invoke_url="http://localhost:8000")
+        assert actor._model is None
+        assert actor._kwargs.get("invoke_url") == "http://localhost:8000"
+
+    def test_actor_call_scores_dataframe(self):
+        import pandas as pd
+        from nemo_retriever.rerank.rerank import NemotronRerankActor
+
+        actor = NemotronRerankActor(invoke_url="http://localhost:8000")
+
+        df = pd.DataFrame({"query": ["q1", "q2"], "text": ["doc A", "doc B"]})
+
+        mock_resp = MagicMock()
+        mock_resp.raise_for_status = MagicMock()
+        mock_resp.json.side_effect = [
+            {"results": [{"index": 0, "relevance_score": 0.9}]},
+            {"results": [{"index": 0, "relevance_score": 0.4}]},
+        ]
+
+        with patch("requests.post", return_value=mock_resp):
+            out = actor(df)
+
+        assert "rerank_score" in out.columns
+        assert len(out) == 2
+
+    def test_actor_call_sorts_descending_by_default(self):
+        import pandas as pd
+        from nemo_retriever.rerank.rerank import NemotronRerankActor
+
+        actor = NemotronRerankActor(invoke_url="http://localhost:8000")
+        df = pd.DataFrame({"query": ["q", "q"], "text": ["low relevance", "high relevance"]})
+
+        mock_resp = MagicMock()
+        mock_resp.raise_for_status = MagicMock()
+        mock_resp.json.side_effect = [
+            {"results": [{"index": 0, "relevance_score": 0.1}]},
+            {"results": [{"index": 0, "relevance_score": 0.9}]},
+        ]
+
+        with patch("requests.post", return_value=mock_resp):
+            out = actor(df)
+
+        scores = out["rerank_score"].tolist()
+        assert scores == sorted(scores, reverse=True)
+
+    def test_actor_call_returns_error_payload_on_exception(self):
+        import pandas as pd
+        from nemo_retriever.rerank.rerank import NemotronRerankActor
+
+        actor = NemotronRerankActor(invoke_url="http://localhost:8000")
+        df = pd.DataFrame({"query": ["q"], "text": ["doc"]})
+
+        with patch("requests.post", side_effect=RuntimeError("connection failed")):
+            out = actor(df)
+
+        # Should not raise; should return a DataFrame with error payload
+        assert isinstance(out, pd.DataFrame)
+        assert "rerank_score" in out.columns
+        payload = out["rerank_score"].iloc[0]
+        assert payload["status"] == "error"
+
+    def test_actor_custom_score_column_name(self):
+        import pandas as pd
+        from nemo_retriever.rerank.rerank import NemotronRerankActor
+
+        actor = NemotronRerankActor(invoke_url="http://localhost:8000", score_column="my_score")
+        df = pd.DataFrame({"query": ["q"], "text": ["doc"]})
+
+        mock_resp = MagicMock()
+        mock_resp.raise_for_status = MagicMock()
+        mock_resp.json.return_value = {"results": [{"index": 0, "relevance_score": 0.7}]}
+
+        with patch("requests.post", return_value=mock_resp):
+            out = actor(df)
+
+        assert "my_score" in out.columns
diff --git a/nemo_retriever/tests/test_retriever_queries.py b/nemo_retriever/tests/test_retriever_queries.py
new file mode 100644
index 000000000..fa3b7e8c8
--- /dev/null
+++ b/nemo_retriever/tests/test_retriever_queries.py
@@ -0,0 +1,372 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+Unit tests for Retriever.queries() and Retriever.query().
+
+All external I/O (LanceDB, embedders, requests) is mocked so the tests run
+without any GPU, network, or database dependency.
+"""
+
+from __future__ import annotations
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+_EMBED_DIM = 4
+_DUMMY_VECTOR = [0.1, 0.2, 0.3, 0.4]
+
+
+def _make_hits(n: int, base_score: float = 0.5) -> list[dict]:
+    return [
+        {
+            "text": f"passage {i}",
+            "metadata": "{}",
+            "source": "{}",
+            "page_number": i,
+            "_distance": base_score + i * 0.01,
+        }
+        for i in range(n)
+    ]
+
+
+def _make_retriever(**overrides):
+    """Return a Retriever with reranker disabled by default and sane test values."""
+    from nemo_retriever.retriever import Retriever
+
+    defaults = dict(
+        reranker=None,
+        top_k=5,
+        nprobes=16,
+    )
+    defaults.update(overrides)
+    return Retriever(**defaults)
+
+
+# ---------------------------------------------------------------------------
+# Retriever._resolve_embedding_endpoint
+# ---------------------------------------------------------------------------
+
+
+class TestResolveEmbeddingEndpoint:
+    def test_returns_none_when_no_endpoints_set(self):
+        r = _make_retriever()
+        assert r._resolve_embedding_endpoint() is None
+
+    def test_http_endpoint_takes_priority(self):
+        r = _make_retriever(
+            embedding_http_endpoint="http://embed.example.com",
+            embedding_endpoint="http://other.example.com",
+        )
+        assert r._resolve_embedding_endpoint() == "http://embed.example.com"
+
+    def test_single_endpoint_returned_when_http(self):
+        r = _make_retriever(embedding_endpoint="http://embed.example.com")
+        assert r._resolve_embedding_endpoint() == "http://embed.example.com"
+
+    def test_grpc_endpoint_raises(self):
+        r = _make_retriever(embedding_endpoint="grpc://embed.example.com")
+        with pytest.raises(ValueError, match="gRPC"):
+            r._resolve_embedding_endpoint()
+
+    def test_whitespace_only_endpoint_treated_as_none(self):
+        r = _make_retriever(embedding_http_endpoint="   ")
+        assert r._resolve_embedding_endpoint() is None
+
+
+# ---------------------------------------------------------------------------
+# Retriever.queries() — basic (no reranking)
+# ---------------------------------------------------------------------------
+
+
+class TestQueriesNoReranking:
+    def _run_queries(self, retriever, query_texts, fake_vectors, fake_hits):
+        """Patch embed + search helpers and call queries()."""
+        with (
+            patch.object(retriever, "_embed_queries_local_hf", return_value=fake_vectors),
+            patch.object(retriever, "_search_lancedb", return_value=fake_hits),
+        ):
+            return retriever.queries(query_texts)
+
+    def test_empty_queries_returns_empty(self):
+        r = _make_retriever()
+        assert r.queries([]) == []
+
+    def test_single_query_returns_one_result_list(self):
+        r = _make_retriever()
+        hits = [_make_hits(5)]
+        result = self._run_queries(r, ["What is ML?"], [_DUMMY_VECTOR], hits)
+        assert len(result) == 1
+        assert result[0] is hits[0]
+
+    def test_multiple_queries_return_matching_result_count(self):
+        r = _make_retriever()
+        n_queries = 3
+        fake_hits = [_make_hits(5)] * n_queries
+        result = self._run_queries(
+            r,
+            [f"query {i}" for i in range(n_queries)],
+            [_DUMMY_VECTOR] * n_queries,
+            fake_hits,
+        )
+        assert len(result) == n_queries
+
+    def test_embed_local_hf_called_with_query_texts(self):
+        r = _make_retriever()
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=[_DUMMY_VECTOR]) as mock_embed,
+            patch.object(r, "_search_lancedb", return_value=[_make_hits(5)]),
+        ):
+            r.queries(["hello world"])
+
+        mock_embed.assert_called_once_with(["hello world"], model_name=r.embedder)
+
+    def test_embed_nim_called_when_endpoint_set(self):
+        r = _make_retriever(embedding_http_endpoint="http://nim.example.com")
+        with (
+            patch.object(r, "_embed_queries_nim", return_value=[_DUMMY_VECTOR]) as mock_nim,
+            patch.object(r, "_search_lancedb", return_value=[_make_hits(5)]),
+        ):
+            r.queries(["hello"])
+
+        mock_nim.assert_called_once()
+        call_kwargs = mock_nim.call_args[1]
+        assert call_kwargs["endpoint"] == "http://nim.example.com"
+
+    def test_search_lancedb_receives_vectors_and_texts(self):
+        r = _make_retriever()
+        vecs = [[0.1, 0.2, 0.3, 0.4]]
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=vecs),
+            patch.object(r, "_search_lancedb", return_value=[_make_hits(5)]) as mock_search,
+        ):
+            r.queries(["my query"])
+
+        kwargs = mock_search.call_args[1]
+        assert kwargs["query_vectors"] == vecs
+        assert kwargs["query_texts"] == ["my query"]
+
+    def test_embedder_override_forwarded(self):
+        r = _make_retriever()
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=[_DUMMY_VECTOR]) as mock_embed,
+            patch.object(r, "_search_lancedb", return_value=[_make_hits(5)]),
+        ):
+            r.queries(["q"], embedder="custom/embedder")
+
+        assert mock_embed.call_args[1]["model_name"] == "custom/embedder"
+
+    def test_lancedb_uri_and_table_overrides_forwarded(self):
+        r = _make_retriever()
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=[_DUMMY_VECTOR]),
+            patch.object(r, "_search_lancedb", return_value=[_make_hits(5)]) as mock_search,
+        ):
+            r.queries(["q"], lancedb_uri="/tmp/db", lancedb_table="my-table")
+
+        kwargs = mock_search.call_args[1]
+        assert kwargs["lancedb_uri"] == "/tmp/db"
+        assert kwargs["lancedb_table"] == "my-table"
+
+
+# ---------------------------------------------------------------------------
+# Retriever.query() — single-query convenience wrapper
+# ---------------------------------------------------------------------------
+
+
+class TestQuerySingleConvenience:
+    def test_query_delegates_to_queries_and_returns_first_element(self):
+        r = _make_retriever()
+        expected = _make_hits(5)
+        with patch.object(r, "queries", return_value=[expected]) as mock_queries:
+            result = r.query("find something")
+
+        mock_queries.assert_called_once_with(
+            ["find something"],
+            embedder=None,
+            lancedb_uri=None,
+            lancedb_table=None,
+        )
+        assert result is expected
+
+    def test_query_passes_through_overrides(self):
+        r = _make_retriever()
+        with patch.object(r, "queries", return_value=[[]]) as mock_queries:
+            r.query("q", embedder="e", lancedb_uri="u", lancedb_table="t")
+
+        mock_queries.assert_called_once_with(["q"], embedder="e", lancedb_uri="u", lancedb_table="t")
+
+
+# ---------------------------------------------------------------------------
+# Retriever.queries() — with reranking via remote endpoint
+# ---------------------------------------------------------------------------
+
+
+class TestQueriesWithEndpointReranking:
+    def _retriever_with_endpoint(self, top_k: int = 3, refine: int = 2) -> object:
+        return _make_retriever(
+            reranker="nvidia/llama-nemotron-rerank-1b-v2",
+            reranker_endpoint="http://rerank.example.com",
+            top_k=top_k,
+            reranker_refine_factor=refine,
+        )
+
+    def _fake_search_results(self, retriever) -> list[list[dict]]:
+        """Return one result list of top_k * reranker_refine_factor hits, the count the assertion check expects."""
+        n = retriever.top_k * retriever.reranker_refine_factor
+        return [_make_hits(n)]
+
+    def test_rerank_results_called_when_reranker_set(self):
+        r = self._retriever_with_endpoint()
+        fake_results = self._fake_search_results(r)
+
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=[_DUMMY_VECTOR]),
+            patch.object(r, "_search_lancedb", return_value=fake_results),
+            patch.object(r, "_rerank_results", return_value=[_make_hits(3)]) as mock_rerank,
+        ):
+            r.queries(["q"])
+
+        mock_rerank.assert_called_once_with(["q"], fake_results)
+
+    def test_rerank_not_called_when_reranker_is_none(self):
+        r = _make_retriever(reranker=None)
+        fake_results = [_make_hits(5)]
+
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=[_DUMMY_VECTOR]),
+            patch.object(r, "_search_lancedb", return_value=fake_results),
+            patch.object(r, "_rerank_results") as mock_rerank,
+        ):
+            r.queries(["q"])
+
+        mock_rerank.assert_not_called()
+
+    def test_reranked_results_are_returned(self):
+        r = self._retriever_with_endpoint()
+        fake_results = self._fake_search_results(r)
+        reranked = [_make_hits(3)]
+
+        with (
+            patch.object(r, "_embed_queries_local_hf", return_value=[_DUMMY_VECTOR]),
+            patch.object(r, "_search_lancedb", return_value=fake_results),
+            patch.object(r, "_rerank_results", return_value=reranked),
+        ):
+            out = r.queries(["q"])
+
+        assert out is reranked
+
+    def test_rerank_results_uses_endpoint_not_local_model(self):
+        r = self._retriever_with_endpoint()
+        fake_hits = self._fake_search_results(r)[0]
+
+        mock_resp = MagicMock()
+        mock_resp.raise_for_status = MagicMock()
+        # Scores decrease with index, so index 0 is the most relevant
+        mock_resp.json.return_value = {
+            "results": [{"index": i, "relevance_score": float(len(fake_hits) - i)} for i in range(len(fake_hits))]
+        }
+
+        with patch("requests.post", return_value=mock_resp) as mock_post:
+            out = r._rerank_results(["q"], [fake_hits])
+
+        mock_post.assert_called()
+        # Results should be sorted descending
+        scores = [h["_rerank_score"] for h in out[0]]
+        assert scores == sorted(scores, reverse=True)
+
+
+# ---------------------------------------------------------------------------
+# Retriever.queries() — with local reranking model
+# ---------------------------------------------------------------------------
+
+
+class TestQueriesWithLocalReranking:
+
+    def test_rerank_results_with_local_model(self):
+        r = _make_retriever(reranker="nvidia/llama-nemotron-rerank-1b-v2")
+        hits = _make_hits(4)
+        fake_model = MagicMock()
+        fake_model.score.return_value = [0.1, 0.9, 0.5, 0.3]
+
+        with patch.object(r, "_get_reranker_model", return_value=fake_model):
+            out = r._rerank_results(["q"], [hits])
+
+        scores = [h["_rerank_score"] for h in out[0]]
+        assert scores == sorted(scores, reverse=True)
+        assert max(scores) == 0.9
+
+    def test_rerank_results_respects_top_k(self):
+        r = _make_retriever(reranker="nvidia/llama-nemotron-rerank-1b-v2", top_k=2)
+        hits = _make_hits(4)
+        fake_model = MagicMock()
+        fake_model.score.return_value = [0.1, 0.9, 0.5, 0.3]
+
+        with patch.object(r, "_get_reranker_model", return_value=fake_model):
+            out = r._rerank_results(["q"], [hits])
+
+        assert len(out[0]) == 2
+
+    def test_rerank_results_multiple_queries(self):
+        r = _make_retriever(reranker="nvidia/llama-nemotron-rerank-1b-v2", top_k=2)
+        hits_a = _make_hits(2)
+        hits_b = _make_hits(2)
+        fake_model = MagicMock()
+        fake_model.score.side_effect = [[0.2, 0.8], [0.6, 0.4]]
+
+        with patch.object(r, "_get_reranker_model", return_value=fake_model):
+            out = r._rerank_results(["q1", "q2"], [hits_a, hits_b])
+
+        assert len(out) == 2
+        # Each per-query list should be sorted descending
+        for per_query in out:
+            scores = [h["_rerank_score"] for h in per_query]
+            assert scores == sorted(scores, reverse=True)
+
+
+# ---------------------------------------------------------------------------
+# Retriever defaults: reranker field behaviour
+# ---------------------------------------------------------------------------
+
+
+class TestRetrieverDefaults:
+    def test_default_reranker_is_nemotron_model(self):
+        from nemo_retriever.retriever import Retriever
+
+        r = Retriever()
+        assert r.reranker == "nvidia/llama-nemotron-rerank-1b-v2"
+
+    def test_reranker_can_be_disabled(self):
+        r = _make_retriever(reranker=None)
+        assert r.reranker is None
+
+    def test_reranker_refine_factor_default(self):
+        from nemo_retriever.retriever import Retriever
+
+        r = Retriever()
+        assert r.reranker_refine_factor == 4
+
+    def test_reranker_max_length_default(self):
+        from nemo_retriever.retriever import Retriever
+
+        r = Retriever()
+        assert r.reranker_max_length == 512
+
+    def test_reranker_model_not_initialized_at_construction(self):
+        from nemo_retriever.retriever import Retriever
+
+        r = Retriever()
+        # Should be None until first use
+        assert r._reranker_model is None
+
+    def test_retriever_alias_is_retriever_class(self):
+        from nemo_retriever.retriever import retriever, Retriever
+
+        assert retriever is Retriever

From 5cbf38e280dbcf06bf583bbab819a713044a533e Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:14:10 -0400
Subject: [PATCH 09/55] fix reranker in inproc (#1588)

---
 nemo_retriever/src/nemo_retriever/retriever.py | 8 +++++---
 nemo_retriever/tests/test_retriever_queries.py | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)
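
Reviewer note (not part of the commit): the change below splits the old
string-valued `reranker` field into a boolean switch plus a separate model
name, and only loads the model when the switch is on. A minimal sketch of
that pattern follows. `LazyReranker`, `loader`, `_stub_loader`, and
`get_model` are hypothetical names used for illustration only; the real code
constructs `NemotronRerankV2` inside `Retriever._get_reranker_model`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


def _stub_loader(model_name: str) -> Any:
    """Hypothetical stand-in for constructing the real reranker model."""
    return {"model_name": model_name}


@dataclass
class LazyReranker:
    """Illustrative subset of the reranker fields on Retriever."""

    reranker: bool = False
    reranker_model_name: Optional[str] = "nvidia/llama-nemotron-rerank-1b-v2"
    # Assumption: an injectable factory, so the sketch runs without a GPU.
    loader: Callable[[str], Any] = _stub_loader
    _model: Any = field(default=None, init=False, repr=False)

    def get_model(self) -> Any:
        # Load at most once, and only when reranking is switched on.
        if self._model is None and self.reranker:
            self._model = self.loader(self.reranker_model_name)
        return self._model
```

With the flag off, `get_model()` returns None and nothing is loaded; with it
on, the first call loads the model and later calls reuse the cached instance.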

diff --git a/nemo_retriever/src/nemo_retriever/retriever.py b/nemo_retriever/src/nemo_retriever/retriever.py
index aa9203783..aab11b519 100644
--- a/nemo_retriever/src/nemo_retriever/retriever.py
+++ b/nemo_retriever/src/nemo_retriever/retriever.py
@@ -53,7 +53,9 @@ class Retriever:
     local_hf_cache_dir: Optional[Path] = None
     local_hf_batch_size: int = 64
     # Reranking -----------------------------------------------------------
-    reranker: Optional[str] = "nvidia/llama-nemotron-rerank-1b-v2"
+    reranker: Optional[bool] = False
+    """Set to True to enable reranking. The HF model named by reranker_model_name is used."""
+    reranker_model_name: Optional[str] = "nvidia/llama-nemotron-rerank-1b-v2"
     """HuggingFace model ID for local reranking (e.g. 'nvidia/llama-nemotron-rerank-1b-v2').
     Set to None to skip reranking (default)."""
     reranker_endpoint: Optional[str] = None
@@ -183,12 +185,12 @@ def _search_lancedb(
 
     def _get_reranker_model(self) -> Any:
         """Lazily load and cache the local NemotronRerankV2 model."""
-        if self._reranker_model is None:
+        if self._reranker_model is None and self.reranker:
             from nemo_retriever.model.local import NemotronRerankV2
 
             cache_dir = str(self.local_hf_cache_dir) if self.local_hf_cache_dir else None
             self._reranker_model = NemotronRerankV2(
-                model_name=str(self.reranker),
+                model_name=self.reranker_model_name,
                 device=self.local_hf_device,
                 hf_cache_dir=cache_dir,
             )
diff --git a/nemo_retriever/tests/test_retriever_queries.py b/nemo_retriever/tests/test_retriever_queries.py
index fa3b7e8c8..b398c48ac 100644
--- a/nemo_retriever/tests/test_retriever_queries.py
+++ b/nemo_retriever/tests/test_retriever_queries.py
@@ -341,7 +341,7 @@ def test_default_reranker_is_nemotron_model(self):
         from nemo_retriever.retriever import Retriever
 
         r = Retriever()
-        assert r.reranker == "nvidia/llama-nemotron-rerank-1b-v2"
+        assert r.reranker_model_name == "nvidia/llama-nemotron-rerank-1b-v2"
 
     def test_reranker_can_be_disabled(self):
         r = _make_retriever(reranker=None)

From 6459e60e00f1e613ca9fcb44bbb54f3aaaa10bd7 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Wed, 11 Mar 2026 21:34:31 -0400
Subject: [PATCH 10/55] Add source_id to output columns

---
 nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py b/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py
index c3acb2d8f..2b46ecbb5 100644
--- a/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py
+++ b/nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py
@@ -171,6 +171,7 @@ def _build_lancedb_rows_from_df(rows: List[Dict[str, Any]]) -> List[Dict[str, An
                 "pdf_basename": pdf_basename,
                 "page_number": int(page_number),
                 "source": source_id,
+                "source_id": source_id,
                 "path": path,
                 "text": row.get("text", ""),
                 "metadata": str(meta),

From ed95c440b4165e104a6b63b4566d591e1e35f581 Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Wed, 11 Mar 2026 21:35:21 -0400
Subject: [PATCH 11/55] fix in process extract to handle txt (#1589)

Co-authored-by: Jeremy Dyer <jdye64@gmail.com>
---
 .../src/nemo_retriever/html/ray_data.py         |  6 ++++--
 .../nemo_retriever/ingest_modes/inprocess.py    | 17 ++++++++++++++++-
 .../src/nemo_retriever/txt/ray_data.py          |  6 ++++--
 3 files changed, 24 insertions(+), 5 deletions(-)
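
Reviewer note (not part of the commit): two ideas in this change can be
sketched in isolation. The split actors now accept a row that carries either
raw `bytes` or already-decoded `text`, and `extract()` routes to the txt or
html pipeline when every input file has the matching extension. The helpers
`select_payload` and `pick_pipeline` below are hypothetical names written
for this note, and the explicit `is not None` checks are an assumption of
the sketch rather than a quote of the patch.

```python
from typing import Optional, Sequence


def select_payload(raw: Optional[bytes], text: Optional[str]) -> Optional[bytes]:
    """Prefer raw bytes; otherwise encode the text column as UTF-8."""
    if raw is not None:
        return raw
    if text is not None:
        return text.encode("utf-8")
    return None  # neither column is present, so the caller skips the row


def pick_pipeline(paths: Sequence[str]) -> str:
    """Route on file extension only when every input agrees."""
    if paths and all(p.lower().endswith(".txt") for p in paths):
        return "txt"
    if paths and all(p.lower().endswith(".html") for p in paths):
        return "html"
    return "default"  # mixed or other inputs fall through to the normal path
```

Checking `is not None` rather than truthiness keeps an empty `bytes` value
from falling through to the `text` column; whether that is the desired
behaviour for empty files is a design choice for the authors.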

diff --git a/nemo_retriever/src/nemo_retriever/html/ray_data.py b/nemo_retriever/src/nemo_retriever/html/ray_data.py
index f2dafbd80..1c87a18b7 100644
--- a/nemo_retriever/src/nemo_retriever/html/ray_data.py
+++ b/nemo_retriever/src/nemo_retriever/html/ray_data.py
@@ -35,12 +35,14 @@ def __call__(self, batch_df: pd.DataFrame) -> pd.DataFrame:
         out_dfs: List[pd.DataFrame] = []
         for _, row in batch_df.iterrows():
             raw = row.get("bytes")
+            text = row.get("text")
             path = row.get("path")
-            if raw is None or path is None:
+            if (raw is None and text is None) or path is None:
                 continue
             path_str = str(path) if path is not None else ""
             try:
-                chunk_df = html_bytes_to_chunks_df(raw, path_str, params=params)
+                payload = raw if raw is not None else text.encode("utf-8")
+                chunk_df = html_bytes_to_chunks_df(payload, path_str, params=params)
                 if not chunk_df.empty:
                     out_dfs.append(chunk_df)
             except Exception:
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
index 34eaf7ed5..90e230cba 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
@@ -1001,7 +1001,12 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "InProc
         # NOTE: `kwargs` passed to `.extract()` are intended primarily for PDF extraction
         # (e.g. `extract_text`, `dpi`, etc). Downstream model stages do NOT necessarily
         # accept the same keyword arguments. Keep per-stage kwargs isolated.
-
+        if self._input_documents and all(f.lower().endswith(".txt") for f in self._input_documents):
+            txt_params = TextChunkParams()
+            return self.extract_txt(params=txt_params)
+        if self._input_documents and all(f.lower().endswith(".html") for f in self._input_documents):
+            html_params = HtmlChunkParams()
+            return self.extract_html(params=html_params)
         resolved = _coerce_params(params, ExtractParams, kwargs)
         if (
             any(
@@ -1289,9 +1294,13 @@ def extract_txt(self, params: TextChunkParams | None = None, **kwargs: Any) -> "
         Use with .files("*.txt").extract_txt(...).embed().vdb_upload().ingest().
         Do not call .extract() when using .extract_txt().
         """
+        from nemo_retriever.txt.ray_data import TxtSplitActor
+
         self._pipeline_type = "txt"
         resolved = _coerce_params(params, TextChunkParams, kwargs)
         self._extract_txt_kwargs = resolved.model_dump(mode="python")
+        text_split = TxtSplitActor(params=TextChunkParams(**self._extract_txt_kwargs))
+        self._tasks.append((text_split, {}))
         return self
 
     def extract_html(self, params: HtmlChunkParams | None = None, **kwargs: Any) -> "InProcessIngestor":
@@ -1301,9 +1310,15 @@ def extract_html(self, params: HtmlChunkParams | None = None, **kwargs: Any) ->
         Use with .files("*.html").extract_html(...).embed().vdb_upload().ingest().
         Do not call .extract() when using .extract_html().
         """
+        from nemo_retriever.html.ray_data import HtmlSplitActor
+
         self._pipeline_type = "html"
         resolved = _coerce_params(params, HtmlChunkParams, kwargs)
         self._extract_html_kwargs = resolved.model_dump(mode="python")
+        html_split = HtmlSplitActor(
+            params=HtmlChunkParams(**self._extract_html_kwargs),
+        )
+        self._tasks.append((html_split, {}))
         return self
 
     def extract_audio(
diff --git a/nemo_retriever/src/nemo_retriever/txt/ray_data.py b/nemo_retriever/src/nemo_retriever/txt/ray_data.py
index f01191814..b74f482cd 100644
--- a/nemo_retriever/src/nemo_retriever/txt/ray_data.py
+++ b/nemo_retriever/src/nemo_retriever/txt/ray_data.py
@@ -58,12 +58,14 @@ def __call__(self, batch_df: pd.DataFrame) -> pd.DataFrame:
         out_dfs: List[pd.DataFrame] = []
         for _, row in batch_df.iterrows():
             raw = row.get("bytes")
+            text = row.get("text")
             path = row.get("path")
-            if raw is None or path is None:
+            if (raw is None and text is None) or path is None:
                 continue
             path_str = str(path) if path is not None else ""
             try:
-                chunk_df = txt_bytes_to_chunks_df(raw, path_str, params=params)
+                payload = raw if raw is not None else text.encode("utf-8")
+                chunk_df = txt_bytes_to_chunks_df(payload, path_str, params=params)
                 if not chunk_df.empty:
                     out_dfs.append(chunk_df)
             except Exception:

From 9568b50baaf208c4d5d2a868827c27d6fe293ca0 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Wed, 11 Mar 2026 22:01:30 -0400
Subject: [PATCH 12/55] Release prep: 26.03.0-RC2 (#1591)

---
 docker-compose.yaml                           |  2 +-
 docs/docs/extraction/helm.md                  |  2 +-
 docs/docs/extraction/quickstart-guide.md      |  2 +-
 .../extraction/quickstart-library-mode.md     |  2 +-
 helm/Chart.yaml                               |  2 +-
 helm/README.md                                | 88 +++++++++----------
 helm/README.md.gotmpl                         |  6 +-
 helm/values.yaml                              |  2 +-
 nemo_retriever/pyproject.toml                 | 14 +--
 src/nv_ingest/api/main.py                     |  2 +-
 tools/harness/nemotron-nightly.txt            |  8 +-
 tools/harness/pyproject.toml                  | 14 +--
 tools/harness/test_configs.yaml               |  2 +-
 13 files changed, 73 insertions(+), 73 deletions(-)

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 94ddf7f70..6ad589efc 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -262,7 +262,7 @@ services:
       - audio
 
   nv-ingest-ms-runtime:
-    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.03.0-RC1
+    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.03.0-RC2
     shm_size: 40gb # Should be at minimum 30% of assigned memory per Ray documentation
     build:
       context: ${NV_INGEST_ROOT:-.}
diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index 952a44065..f5891a772 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -3,4 +3,4 @@
 <!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.03.0-RC1/helm/README.md).
+refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.03.0-RC2/helm/README.md).
diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 43ddce8ed..a996d4f21 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -84,7 +84,7 @@ h. Run the command `docker ps`. You should see output similar to the following.
     CONTAINER ID  IMAGE                                            COMMAND                 CREATED         STATUS                  PORTS            NAMES
 uv venv --python 3.12 nv-ingest-dev
 source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.03.0-RC1 nv-ingest-api==26.03.0-RC1 nv-ingest-client==26.03.0-RC1
+uv pip install nv-ingest==26.03.0-RC2 nv-ingest-api==26.03.0-RC2 nv-ingest-client==26.03.0-RC2
 ```
 
 !!! tip
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index f193305b9..b9e6ca371 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -34,7 +34,7 @@ Use the following procedure to prepare your environment.
     ```
        uv venv --python 3.12 nvingest && \
          source nvingest/bin/activate && \
-         uv pip install nemo-retriever==26.03.0-RC1 milvus-lite==2.4.12
+         uv pip install nemo-retriever==26.03.0-RC2 milvus-lite==2.4.12
     ```
 
     !!! tip
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 1b0a3a7e8..9891b9555 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: nv-ingest
 description: NV-Ingest Microservice
 type: application
-version: 26.03.0-RC1
+version: 26.03.0-RC2
 maintainers:
   - name: NVIDIA Corporation
     url: https://www.nvidia.com/
diff --git a/helm/README.md b/helm/README.md
index c446f3e3c..3860dfce9 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -45,7 +45,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC1.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC2.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -54,7 +54,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.03.0-RC1"
+    --set image.tag="26.03.0-RC2"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -105,7 +105,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.03.0-RC1
+pip install nv-ingest-client==26.03.0-RC2
 ```
 
 #### Rest Endpoint Ingress
@@ -347,7 +347,7 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | fullnameOverride | string | `""` |  |
 | image.pullPolicy | string | `"IfNotPresent"` |  |
 | image.repository | string | `"nvcr.io/nvidia/nemo-microservices/nv-ingest"` |  |
-| image.tag | string | `"26.03.0-RC1"` |  |
+| image.tag | string | `"26.03.0-RC2"` |  |
 | imagePullSecrets[0].name | string | `"ngc-api"` |  |
 | imagePullSecrets[1].name | string | `"ngc-secret"` |  |
 | ingress.annotations | object | `{}` |  |
@@ -465,46 +465,6 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | nimOperator.graphic_elements.storage.pvc.create | bool | `true` |  |
 | nimOperator.graphic_elements.storage.pvc.size | string | `"25Gi"` |  |
 | nimOperator.graphic_elements.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
-| nimOperator.rerankqa.authSecret | string | `"ngc-api"` |  |
-| nimOperator.rerankqa.enabled | bool | `false` |  |
-| nimOperator.rerankqa.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
-| nimOperator.rerankqa.env[0].value | string | `"8000"` |  |
-| nimOperator.rerankqa.env[1].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
-| nimOperator.rerankqa.env[1].value | string | `"1"` |  |
-| nimOperator.rerankqa.expose.service.grpcPort | int | `8001` |  |
-| nimOperator.rerankqa.expose.service.port | int | `8000` |  |
-| nimOperator.rerankqa.expose.service.type | string | `"ClusterIP"` |  |
-| nimOperator.rerankqa.image.pullPolicy | string | `"IfNotPresent"` |  |
-| nimOperator.rerankqa.image.pullSecrets[0] | string | `"ngc-secret"` |  |
-| nimOperator.rerankqa.image.repository | string | `"nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2"` |  |
-| nimOperator.rerankqa.image.tag | string | `"1.10.0"` |  |
-| nimOperator.rerankqa.replicas | int | `1` |  |
-| nimOperator.rerankqa.resources.limits."nvidia.com/gpu" | int | `1` |  |
-| nimOperator.rerankqa.storage.pvc.create | bool | `true` |  |
-| nimOperator.rerankqa.storage.pvc.size | string | `"50Gi"` |  |
-| nimOperator.rerankqa.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
-| nimOperator.ocr.authSecret | string | `"ngc-api"` |  |
-| nimOperator.ocr.enabled | bool | `true` |  |
-| nimOperator.ocr.env[0].name | string | `"OMP_NUM_THREADS"` |  |
-| nimOperator.ocr.env[0].value | string | `"8"` |  |
-| nimOperator.ocr.env[1].name | string | `"NIM_HTTP_API_PORT"` |  |
-| nimOperator.ocr.env[1].value | string | `"8000"` |  |
-| nimOperator.ocr.env[2].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
-| nimOperator.ocr.env[2].value | string | `"1"` |  |
-| nimOperator.ocr.env[3].name | string | `"NIM_TRITON_MAX_BATCH_SIZE"` |  |
-| nimOperator.ocr.env[3].value | string | `"32"` |  |
-| nimOperator.ocr.expose.service.grpcPort | int | `8001` |  |
-| nimOperator.ocr.expose.service.port | int | `8000` |  |
-| nimOperator.ocr.expose.service.type | string | `"ClusterIP"` |  |
-| nimOperator.ocr.image.pullPolicy | string | `"IfNotPresent"` |  |
-| nimOperator.ocr.image.pullSecrets[0] | string | `"ngc-secret"` |  |
-| nimOperator.ocr.image.repository | string | `"nvcr.io/nim/nvidia/nemotron-ocr-v1"` |  |
-| nimOperator.ocr.image.tag | string | `"1.3.0"` |  |
-| nimOperator.ocr.replicas | int | `1` |  |
-| nimOperator.ocr.resources.limits."nvidia.com/gpu" | int | `1` |  |
-| nimOperator.ocr.storage.pvc.create | bool | `true` |  |
-| nimOperator.ocr.storage.pvc.size | string | `"25Gi"` |  |
-| nimOperator.ocr.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
 | nimOperator.nemotron_nano_12b_v2_vl.authSecret | string | `"ngc-api"` |  |
 | nimOperator.nemotron_nano_12b_v2_vl.enabled | bool | `false` |  |
 | nimOperator.nemotron_nano_12b_v2_vl.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
@@ -547,6 +507,28 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | nimOperator.nimCache.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
 | nimOperator.nimService.namespaces | list | `[]` |  |
 | nimOperator.nimService.resources | object | `{}` |  |
+| nimOperator.ocr.authSecret | string | `"ngc-api"` |  |
+| nimOperator.ocr.enabled | bool | `true` |  |
+| nimOperator.ocr.env[0].name | string | `"OMP_NUM_THREADS"` |  |
+| nimOperator.ocr.env[0].value | string | `"8"` |  |
+| nimOperator.ocr.env[1].name | string | `"NIM_HTTP_API_PORT"` |  |
+| nimOperator.ocr.env[1].value | string | `"8000"` |  |
+| nimOperator.ocr.env[2].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
+| nimOperator.ocr.env[2].value | string | `"1"` |  |
+| nimOperator.ocr.env[3].name | string | `"NIM_TRITON_MAX_BATCH_SIZE"` |  |
+| nimOperator.ocr.env[3].value | string | `"32"` |  |
+| nimOperator.ocr.expose.service.grpcPort | int | `8001` |  |
+| nimOperator.ocr.expose.service.port | int | `8000` |  |
+| nimOperator.ocr.expose.service.type | string | `"ClusterIP"` |  |
+| nimOperator.ocr.image.pullPolicy | string | `"IfNotPresent"` |  |
+| nimOperator.ocr.image.pullSecrets[0] | string | `"ngc-secret"` |  |
+| nimOperator.ocr.image.repository | string | `"nvcr.io/nim/nvidia/nemotron-ocr-v1"` |  |
+| nimOperator.ocr.image.tag | string | `"1.3.0"` |  |
+| nimOperator.ocr.replicas | int | `1` |  |
+| nimOperator.ocr.resources.limits."nvidia.com/gpu" | int | `1` |  |
+| nimOperator.ocr.storage.pvc.create | bool | `true` |  |
+| nimOperator.ocr.storage.pvc.size | string | `"25Gi"` |  |
+| nimOperator.ocr.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
 | nimOperator.page_elements.authSecret | string | `"ngc-api"` |  |
 | nimOperator.page_elements.enabled | bool | `true` |  |
 | nimOperator.page_elements.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
@@ -589,6 +571,24 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | nimOperator.page_elements.storage.pvc.create | bool | `true` |  |
 | nimOperator.page_elements.storage.pvc.size | string | `"25Gi"` |  |
 | nimOperator.page_elements.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
+| nimOperator.rerankqa.authSecret | string | `"ngc-api"` |  |
+| nimOperator.rerankqa.enabled | bool | `false` |  |
+| nimOperator.rerankqa.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
+| nimOperator.rerankqa.env[0].value | string | `"8000"` |  |
+| nimOperator.rerankqa.env[1].name | string | `"NIM_TRITON_LOG_VERBOSE"` |  |
+| nimOperator.rerankqa.env[1].value | string | `"1"` |  |
+| nimOperator.rerankqa.expose.service.grpcPort | int | `8001` |  |
+| nimOperator.rerankqa.expose.service.port | int | `8000` |  |
+| nimOperator.rerankqa.expose.service.type | string | `"ClusterIP"` |  |
+| nimOperator.rerankqa.image.pullPolicy | string | `"IfNotPresent"` |  |
+| nimOperator.rerankqa.image.pullSecrets[0] | string | `"ngc-secret"` |  |
+| nimOperator.rerankqa.image.repository | string | `"nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2"` |  |
+| nimOperator.rerankqa.image.tag | string | `"1.10.0"` |  |
+| nimOperator.rerankqa.replicas | int | `1` |  |
+| nimOperator.rerankqa.resources.limits."nvidia.com/gpu" | int | `1` |  |
+| nimOperator.rerankqa.storage.pvc.create | bool | `true` |  |
+| nimOperator.rerankqa.storage.pvc.size | string | `"50Gi"` |  |
+| nimOperator.rerankqa.storage.pvc.volumeAccessMode | string | `"ReadWriteOnce"` |  |
 | nimOperator.table_structure.authSecret | string | `"ngc-api"` |  |
 | nimOperator.table_structure.enabled | bool | `true` |  |
 | nimOperator.table_structure.env[0].name | string | `"NIM_HTTP_API_PORT"` |  |
diff --git a/helm/README.md.gotmpl b/helm/README.md.gotmpl
index 743ed3610..b16dab04c 100644
--- a/helm/README.md.gotmpl
+++ b/helm/README.md.gotmpl
@@ -46,7 +46,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC1.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC2.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -55,7 +55,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.03.0-RC1"
+    --set image.tag="26.03.0-RC2"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -107,7 +107,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.03.0-RC1
+pip install nv-ingest-client==26.03.0-RC2
 ```
 
 #### Rest Endpoint Ingress
diff --git a/helm/values.yaml b/helm/values.yaml
index 2bbbea67c..243c1e740 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -28,7 +28,7 @@ nameOverride: ""
 image:
   pullPolicy: IfNotPresent
   repository: "nvcr.io/nvidia/nemo-microservices/nv-ingest"
-  tag: "26.03.0-RC1"
+  tag: "26.03.0-RC2"
 
 ## @section Pod Configuration
 ## @param podAnnotations [object] Sets additional annotations on the main deployment pods
diff --git a/nemo_retriever/pyproject.toml b/nemo_retriever/pyproject.toml
index b9f84b79d..01de8f640 100644
--- a/nemo_retriever/pyproject.toml
+++ b/nemo_retriever/pyproject.toml
@@ -30,9 +30,9 @@ dependencies = [
   "typer>=0.12.0",
   "pyyaml>=6.0",
   "lancedb",
-  "nv-ingest==26.03.0-RC1",
-  "nv-ingest-api==26.03.0-RC1",
-  "nv-ingest-client==26.03.0-RC1",
+  "nv-ingest==26.03.0rc2",
+  "nv-ingest-api==26.03.0rc2",
+  "nv-ingest-client==26.03.0rc2",
   "fastapi>=0.114.0",
   "uvicorn[standard]>=0.30.0",
   "httpx>=0.27.0",
@@ -57,10 +57,10 @@ dependencies = [
   "einops",
   "easydict",
   "addict",
-  "nemotron-page-elements-v3>=0.dev0",
-  "nemotron-graphic-elements-v1>=0.dev0",
-  "nemotron-table-structure-v1>=0.dev0",
-  "nemotron-ocr>=0.dev0",
+  "nemotron-page-elements-v3==3.0.1",
+  "nemotron-graphic-elements-v1==1.0.0",
+  "nemotron-table-structure-v1==1.0.0",
+  "nemotron-ocr==1.0.1",
   "markitdown",
   "timm==1.0.22",
   "tqdm",
diff --git a/src/nv_ingest/api/main.py b/src/nv_ingest/api/main.py
index 762865766..ae72b3fdf 100644
--- a/src/nv_ingest/api/main.py
+++ b/src/nv_ingest/api/main.py
@@ -23,7 +23,7 @@
 app = FastAPI(
     title="NV-Ingest Microservice",
     description="Service for ingesting heterogenous datatypes",
-    version="26.03.0-RC1",
+    version="26.03.0-RC2",
     contact={
         "name": "NVIDIA Corporation",
         "url": "https://nvidia.com",
diff --git a/tools/harness/nemotron-nightly.txt b/tools/harness/nemotron-nightly.txt
index bddd02119..0ce660471 100644
--- a/tools/harness/nemotron-nightly.txt
+++ b/tools/harness/nemotron-nightly.txt
@@ -3,7 +3,7 @@
 # Usage: pip install -r tools/harness/nemotron-nightly.txt --force-reinstall --no-deps
 --index-url https://test.pypi.org/simple/
 
-nemotron-page-elements-v3>=0.dev0
-nemotron-graphic-elements-v1>=0.dev0
-nemotron-table-structure-v1>=0.dev0
-nemotron-ocr>=0.dev0
+nemotron-page-elements-v3==3.0.1
+nemotron-graphic-elements-v1==1.0.0
+nemotron-table-structure-v1==1.0.0
+nemotron-ocr==1.0.1
diff --git a/tools/harness/pyproject.toml b/tools/harness/pyproject.toml
index 07cb085a5..c04a4638c 100644
--- a/tools/harness/pyproject.toml
+++ b/tools/harness/pyproject.toml
@@ -10,15 +10,15 @@ dependencies = [
     "pyyaml>=6.0",
     "requests>=2.32.5",
     "pynvml>=11.5.0",
-    "nv-ingest==26.03.0-RC1",
-    "nv-ingest-api==26.03.0-RC1",
-    "nv-ingest-client==26.03.0-RC1",
+    "nv-ingest==26.03.0rc2",
+    "nv-ingest-api==26.03.0rc2",
+    "nv-ingest-client==26.03.0rc2",
     "milvus-lite==2.4.12",
     "pypdfium2>=4.30.0,<5.0.0",
-    "nemotron-page-elements-v3>=0.dev0",
-    "nemotron-graphic-elements-v1>=0.dev0",
-    "nemotron-table-structure-v1>=0.dev0",
-    "nemotron-ocr>=0.dev0",
+    "nemotron-page-elements-v3==3.0.1",
+    "nemotron-graphic-elements-v1==1.0.0",
+    "nemotron-table-structure-v1==1.0.0",
+    "nemotron-ocr==1.0.1",
 ]
 
 [project.scripts]
diff --git a/tools/harness/test_configs.yaml b/tools/harness/test_configs.yaml
index 1db4646ea..f2a214681 100644
--- a/tools/harness/test_configs.yaml
+++ b/tools/harness/test_configs.yaml
@@ -28,7 +28,7 @@ active:
     kubectl_bin: microk8s kubectl  # kubectl binary command (e.g., "kubectl", "microk8s kubectl")
     kubectl_sudo: null  # Prepend sudo to kubectl commands (null = same as helm_sudo)
     chart: nemo-microservices/nv-ingest  # Remote chart reference (set to null to use local chart from ./helm)
-    chart_version: 26.03.0-RC1  # Chart version (required for remote charts)
+    chart_version: 26.03.0-RC2  # Chart version (required for remote charts)
     release: nv-ingest
     namespace: nv-ingest
     values_file: .helm-env  # Optional: path to values file

From 4a8301ebd8745ea15bab6b1581738b003d68a0c6 Mon Sep 17 00:00:00 2001
From: Jacob Ioffe <jioffe@nvidia.com>
Date: Wed, 11 Mar 2026 14:40:31 -0400
Subject: [PATCH 13/55] Increase default Redis TTL from 1-2h to 48h to prevent
 job expiry during long VLM captioning

Large PDFs with VLM captioning enabled can take 2-22+ hours depending on hardware.
The previous defaults (STATE_TTL=7200s, RESULT_DATA_TTL=3600s) caused job state to
expire mid-processing, resulting in 404 "Job ID not found or state has expired" errors
even though the pipeline completed successfully.

This change raises both defaults to 172800s (48 hours), which provides sufficient
headroom for all observed workloads. Users can still override them via the
RESULT_DATA_TTL_SECONDS and STATE_TTL_SECONDS environment variables.
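
A minimal sketch of how the two TTL variables resolve, for illustration only.
`read_ttl` and the `env` mapping are hypothetical helpers and are not part of
the service; only the variable names and the 172800 default come from this patch.

```python
import os

DEFAULT_TTL_SECONDS = 172800  # 48 hours, the default introduced by this patch


def read_ttl(name, default=DEFAULT_TTL_SECONDS, env=None):
    """Return the TTL in seconds from `env` (os.environ if omitted), else `default`."""
    env = os.environ if env is None else env
    return int(env.get(name, str(default)))


# Hypothetical deployment: shorten result retention to 6 hours and leave the
# state TTL at its default.
overrides = {"RESULT_DATA_TTL_SECONDS": str(6 * 3600)}
result_data_ttl = read_ttl("RESULT_DATA_TTL_SECONDS", env=overrides)
state_ttl = read_ttl("STATE_TTL_SECONDS", env=overrides)
```

With these inputs `result_data_ttl` is 21600 and `state_ttl` stays at 172800.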

Fixes: Customer bug 5914605

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 .../util/service/impl/ingest/redis_ingest_service.py          | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/nv_ingest/framework/util/service/impl/ingest/redis_ingest_service.py b/src/nv_ingest/framework/util/service/impl/ingest/redis_ingest_service.py
index c96977ec2..f40dbeca6 100644
--- a/src/nv_ingest/framework/util/service/impl/ingest/redis_ingest_service.py
+++ b/src/nv_ingest/framework/util/service/impl/ingest/redis_ingest_service.py
@@ -64,8 +64,8 @@ def get_instance() -> "RedisIngestService":
             redis_task_queue: str = os.getenv("REDIS_INGEST_TASK_QUEUE", "ingest_task_queue")
 
             fetch_mode: "FetchMode" = get_fetch_mode_from_env()
-            result_data_ttl: int = int(os.getenv("RESULT_DATA_TTL_SECONDS", "3600"))
-            state_ttl: int = int(os.getenv("STATE_TTL_SECONDS", "7200"))
+            result_data_ttl: int = int(os.getenv("RESULT_DATA_TTL_SECONDS", "172800"))
+            state_ttl: int = int(os.getenv("STATE_TTL_SECONDS", "172800"))
 
             cache_config: Dict[str, Any] = {
                 "directory": os.getenv("FETCH_CACHE_DIR", "./.fetch_cache"),

From 4f4e5125e604b1d6d5cc13131009117ef9a435c1 Mon Sep 17 00:00:00 2001
From: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Date: Thu, 12 Mar 2026 13:38:16 -0400
Subject: [PATCH 14/55] Add Helm RTX PRO 4500 override, extend obj-det warmup
 batch size override (#1592) (#1597)

---
 docker-compose.a100-40gb.yaml           |  3 +
 docker-compose.l40s.yaml                |  3 +
 docker-compose.rtx-pro-4500.yaml        |  3 +
 helm/overrides/values-a100-40gb.yaml    |  6 ++
 helm/overrides/values-l40s.yaml         |  6 ++
 helm/overrides/values-rtx-pro-4500.yaml | 91 +++++++++++++++++++++++++
 6 files changed, 112 insertions(+)
 create mode 100644 helm/overrides/values-rtx-pro-4500.yaml

diff --git a/docker-compose.a100-40gb.yaml b/docker-compose.a100-40gb.yaml
index a717d7a3a..cbe16ebff 100644
--- a/docker-compose.a100-40gb.yaml
+++ b/docker-compose.a100-40gb.yaml
@@ -6,14 +6,17 @@ services:
   page-elements:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   graphic-elements:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   table-structure:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   ocr:
     environment:
diff --git a/docker-compose.l40s.yaml b/docker-compose.l40s.yaml
index 8f8414e5a..55da32ca1 100644
--- a/docker-compose.l40s.yaml
+++ b/docker-compose.l40s.yaml
@@ -6,14 +6,17 @@ services:
   page-elements:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   graphic-elements:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   table-structure:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   ocr:
     environment:
diff --git a/docker-compose.rtx-pro-4500.yaml b/docker-compose.rtx-pro-4500.yaml
index a717d7a3a..cbe16ebff 100644
--- a/docker-compose.rtx-pro-4500.yaml
+++ b/docker-compose.rtx-pro-4500.yaml
@@ -6,14 +6,17 @@ services:
   page-elements:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   graphic-elements:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   table-structure:
     environment:
       - NIM_TRITON_MAX_BATCH_SIZE=1
+      - NIM_TRITON_DATA_MAX_BATCH_SIZE=1
 
   ocr:
     environment:
diff --git a/helm/overrides/values-a100-40gb.yaml b/helm/overrides/values-a100-40gb.yaml
index 7fe15de12..828bfdd33 100644
--- a/helm/overrides/values-a100-40gb.yaml
+++ b/helm/overrides/values-a100-40gb.yaml
@@ -13,6 +13,8 @@ nimOperator:
         value: "1"
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
       - name: NIM_TRITON_CPU_THREADS_PRE_PROCESSOR
         value: "2"
       - name: OMP_NUM_THREADS
@@ -44,6 +46,8 @@ nimOperator:
         value: "3"
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
       - name: NIM_TRITON_CUDA_MEMORY_POOL_MB
         value: "2048"
       - name: OMP_NUM_THREADS
@@ -59,6 +63,8 @@ nimOperator:
         value: "3"
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
       - name: NIM_TRITON_CUDA_MEMORY_POOL_MB
         value: "2048"
       - name: OMP_NUM_THREADS
diff --git a/helm/overrides/values-l40s.yaml b/helm/overrides/values-l40s.yaml
index 85e941485..d430e39f1 100644
--- a/helm/overrides/values-l40s.yaml
+++ b/helm/overrides/values-l40s.yaml
@@ -13,6 +13,8 @@ nimOperator:
         value: "1"
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
       - name: NIM_TRITON_CPU_THREADS_PRE_PROCESSOR
         value: "2"
       - name: OMP_NUM_THREADS
@@ -44,6 +46,8 @@ nimOperator:
         value: "3"
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
       - name: NIM_TRITON_CUDA_MEMORY_POOL_MB
         value: "2048"
       - name: OMP_NUM_THREADS
@@ -59,6 +63,8 @@ nimOperator:
         value: "3"
       - name: NIM_TRITON_MAX_BATCH_SIZE
         value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
       - name: NIM_TRITON_CUDA_MEMORY_POOL_MB
         value: "2048"
       - name: OMP_NUM_THREADS
diff --git a/helm/overrides/values-rtx-pro-4500.yaml b/helm/overrides/values-rtx-pro-4500.yaml
new file mode 100644
index 000000000..55a38482e
--- /dev/null
+++ b/helm/overrides/values-rtx-pro-4500.yaml
@@ -0,0 +1,91 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# GPU-specific overrides for RTX Pro 4500 (loaded by harness when --deployment-type helm --sku rtx-pro-4500).
+# Sets NIM_TRITON_MAX_BATCH_SIZE=1 per NIM (plus NIM_TRITON_DATA_MAX_BATCH_SIZE=1 for the object-detection NIMs) to match docker-compose.rtx-pro-4500.yaml.
+
+nimOperator:
+  page_elements:
+    env:
+      - name: NIM_HTTP_API_PORT
+        value: "8000"
+      - name: NIM_TRITON_LOG_VERBOSE
+        value: "1"
+      - name: NIM_TRITON_MAX_BATCH_SIZE
+        value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
+      - name: NIM_TRITON_CPU_THREADS_PRE_PROCESSOR
+        value: "2"
+      - name: OMP_NUM_THREADS
+        value: "2"
+      - name: NIM_TRITON_CPU_THREADS_POST_PROCESSOR
+        value: "1"
+      - name: NIM_ENABLE_OTEL
+        value: "true"
+      - name: NIM_OTEL_SERVICE_NAME
+        value: "page-elements"
+      - name: NIM_OTEL_TRACES_EXPORTER
+        value: "otlp"
+      - name: NIM_OTEL_METRICS_EXPORTER
+        value: "console"
+      - name: NIM_OTEL_EXPORTER_OTLP_ENDPOINT
+        value: "http://otel-collector:4318"
+      - name: TRITON_OTEL_URL
+        value: "http://otel-collector:4318/v1/traces"
+      - name: TRITON_OTEL_RATE
+        value: "1"
+
+  graphic_elements:
+    env:
+      - name: NIM_HTTP_API_PORT
+        value: "8000"
+      - name: NIM_TRITON_LOG_VERBOSE
+        value: "1"
+      - name: NIM_TRITON_RATE_LIMIT
+        value: "3"
+      - name: NIM_TRITON_MAX_BATCH_SIZE
+        value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
+      - name: NIM_TRITON_CUDA_MEMORY_POOL_MB
+        value: "2048"
+      - name: OMP_NUM_THREADS
+        value: "1"
+
+  table_structure:
+    env:
+      - name: NIM_HTTP_API_PORT
+        value: "8000"
+      - name: NIM_TRITON_LOG_VERBOSE
+        value: "1"
+      - name: NIM_TRITON_RATE_LIMIT
+        value: "3"
+      - name: NIM_TRITON_MAX_BATCH_SIZE
+        value: "1"
+      - name: NIM_TRITON_DATA_MAX_BATCH_SIZE
+        value: "1"
+      - name: NIM_TRITON_CUDA_MEMORY_POOL_MB
+        value: "2048"
+      - name: OMP_NUM_THREADS
+        value: "1"
+
+  ocr:
+    env:
+      - name: OMP_NUM_THREADS
+        value: "8"
+      - name: NIM_HTTP_API_PORT
+        value: "8000"
+      - name: NIM_TRITON_LOG_VERBOSE
+        value: "1"
+      - name: NIM_TRITON_MAX_BATCH_SIZE
+        value: "1"
+
+  rerankqa:
+    env:
+      - name: NIM_HTTP_API_PORT
+        value: "8000"
+      - name: NIM_TRITON_LOG_VERBOSE
+        value: "1"
+      - name: NIM_TRITON_MAX_BATCH_SIZE
+        value: "1"

From be533061fd071863b340e36ad837c9d560c3c282 Mon Sep 17 00:00:00 2001
From: Edward Kim <109497216+edknv@users.noreply.github.com>
Date: Thu, 12 Mar 2026 13:44:03 -0700
Subject: [PATCH 15/55] (retriever) update nemotron_parse extraction method
 (#1599) (#1604)

---
 .../nemo_retriever/examples/batch_pipeline.py |  2 +-
 .../examples/inprocess_pipeline.py            | 30 +------------------
 .../src/nemo_retriever/ingest_modes/batch.py  | 28 +++++++++++++++++
 .../nemo_retriever/ingest_modes/inprocess.py  | 20 ++-----------
 4 files changed, 32 insertions(+), 48 deletions(-)
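
Reviewer note (not part of the commit): the selection and override logic that
`_apply_nemotron_parse_overrides` adds to batch.py can be sketched as a
standalone function. `plan_nemotron_parse` is a hypothetical name, and it returns
the computed values instead of updating `_requested_plan`; the keys mirror those
used in the batch.py hunk of this patch.

```python
from typing import Any


def plan_nemotron_parse(kwargs: dict[str, Any]) -> tuple[bool, dict[str, Any]]:
    """Return (use_parse_only, plan_overrides) for the given extract kwargs."""
    workers = float(kwargs.get("nemotron_parse_workers", 0.0) or 0.0)
    gpus = float(kwargs.get("gpu_nemotron_parse", 0.0) or 0.0)
    batch = float(kwargs.get("nemotron_parse_batch_size", 0.0) or 0.0)

    # Parse-only is chosen explicitly via method, or implicitly when all
    # three resource knobs are positive (the pre-existing behaviour).
    parse_only = kwargs.get("method") == "nemotron_parse" or (
        workers > 0.0 and gpus > 0.0 and batch > 0.0
    )

    # Only knobs the caller actually set are forwarded as overrides.
    overrides: dict[str, Any] = {}
    if workers > 0.0:
        count = int(workers)
        overrides["nemotron_parse_initial_actors"] = count
        overrides["nemotron_parse_min_actors"] = count
        overrides["nemotron_parse_max_actors"] = count
    if gpus > 0.0:
        overrides["nemotron_parse_gpus_per_actor"] = gpus
    if batch > 0.0:
        overrides["nemotron_parse_batch_size"] = int(batch)
    return parse_only, overrides


by_method = plan_nemotron_parse({"method": "nemotron_parse"})
partial = plan_nemotron_parse({"nemotron_parse_workers": 2.0, "gpu_nemotron_parse": 0.5})
```

Here `by_method` selects the parse-only path with no overrides, while `partial`
forwards the worker and GPU overrides without enabling parse-only, because the
batch size is unset.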

diff --git a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
index a66137660..556bf38fc 100644
--- a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
@@ -284,7 +284,7 @@ def main(
     method: str = typer.Option(
         "pdfium",
         "--method",
-        help="PDF text extraction method: 'pdfium' (native only), 'pdfium_hybrid' (native + OCR for scanned), or 'ocr' (OCR all pages).",  # noqa: E501
+        help="PDF text extraction method: 'pdfium' (native only), 'pdfium_hybrid' (native + OCR for scanned), 'ocr' (OCR all pages), or 'nemotron_parse' (Nemotron Parse only, auto-configured).",  # noqa: E501
     ),
     log_file: Optional[Path] = typer.Option(
         None,
diff --git a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
index 6030e90d1..6c4f27ba9 100644
--- a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
@@ -95,25 +95,7 @@ def main(
     method: str = typer.Option(
         "pdfium",
         "--method",
-        help="PDF text extraction method: 'pdfium' (native only), 'pdfium_hybrid' (native + OCR for scanned), or 'ocr' (OCR all pages).",  # noqa: E501
-    ),
-    nemotron_parse_actors: float = typer.Option(
-        0.0,
-        "--nemotron-parse-actors",
-        min=0.0,
-        help="Enable Parse-only extraction path when > 0.0 with parse GPU/batch-size.",
-    ),
-    nemotron_parse_gpus_per_actor: float = typer.Option(
-        0.0,
-        "--nemotron-parse-gpus-per-actor",
-        min=0.0,
-        help="GPU allocation hint for Parse-only extraction path.",
-    ),
-    nemotron_parse_ray_batch_size: float = typer.Option(
-        0.0,
-        "--nemotron-parse-ray-batch-size",
-        min=0.0,
-        help="Parse stage batch size (enables Parse-only path when > 0.0 with parse workers/GPU).",
+        help="PDF text extraction method: 'pdfium' (native only), 'pdfium_hybrid' (native + OCR for scanned), 'ocr' (OCR all pages), or 'nemotron_parse' (Nemotron Parse only).",  # noqa: E501
     ),
     embed_modality: str = typer.Option(
         "text",
@@ -240,11 +222,6 @@ def main(
                 table_structure_invoke_url=table_structure_invoke_url,
                 page_elements_invoke_url=page_elements_invoke_url,
                 ocr_invoke_url=ocr_invoke_url,
-                batch_tuning={
-                    "nemotron_parse_workers": float(nemotron_parse_actors),
-                    "gpu_nemotron_parse": float(nemotron_parse_gpus_per_actor),
-                    "nemotron_parse_batch_size": float(nemotron_parse_ray_batch_size),
-                },
             )
         )
     else:
@@ -262,11 +239,6 @@ def main(
                 table_structure_invoke_url=table_structure_invoke_url,
                 page_elements_invoke_url=page_elements_invoke_url,
                 ocr_invoke_url=ocr_invoke_url,
-                batch_tuning={
-                    "nemotron_parse_workers": float(nemotron_parse_actors),
-                    "gpu_nemotron_parse": float(nemotron_parse_gpus_per_actor),
-                    "nemotron_parse_batch_size": float(nemotron_parse_ray_batch_size),
-                },
             )
         )
 
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
index b2886991a..87e3c546c 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
@@ -394,10 +394,37 @@ def _endpoint_count(raw: Any) -> int:
             compute=rd.TaskPoolStrategy(size=self._requested_plan.get_pdf_extract_tasks()),
         )
 
+        self._apply_nemotron_parse_overrides(kwargs)
+
         self._append_detection_stages(kwargs)
 
         return self
 
+    def _apply_nemotron_parse_overrides(self, kwargs: dict[str, Any]) -> None:
+        """Update ``_requested_plan`` with user-provided Nemotron Parse resource overrides
+        and set ``_use_nemotron_parse_only``."""
+        nemotron_parse_workers = float(kwargs.get("nemotron_parse_workers", 0.0) or 0.0)
+        gpu_nemotron_parse = float(kwargs.get("gpu_nemotron_parse", 0.0) or 0.0)
+        nemotron_parse_batch_size = float(kwargs.get("nemotron_parse_batch_size", 0.0) or 0.0)
+        self._use_nemotron_parse_only = kwargs.get("method") == "nemotron_parse" or (
+            nemotron_parse_workers > 0.0 and gpu_nemotron_parse > 0.0 and nemotron_parse_batch_size > 0.0
+        )
+
+        # Forward CLI overrides into the RequestedPlan so that downstream Ray
+        # actor pools (batch size, GPU fraction, pool size) honour them.
+        overrides: dict[str, Any] = {}
+        if nemotron_parse_workers > 0.0:
+            workers = int(nemotron_parse_workers)
+            overrides["nemotron_parse_initial_actors"] = workers
+            overrides["nemotron_parse_min_actors"] = workers
+            overrides["nemotron_parse_max_actors"] = workers
+        if gpu_nemotron_parse > 0.0:
+            overrides["nemotron_parse_gpus_per_actor"] = gpu_nemotron_parse
+        if nemotron_parse_batch_size > 0.0:
+            overrides["nemotron_parse_batch_size"] = int(nemotron_parse_batch_size)
+        if overrides:
+            self._requested_plan = self._requested_plan.model_copy(update=overrides)
+
     def _append_detection_stages(self, kwargs: dict[str, Any]) -> None:
         """Append downstream GPU detection stages (page elements, OCR, table/chart/infographic).
 
@@ -665,6 +692,7 @@ def extract_image_files(self, params: ExtractParams | None = None, **kwargs: Any
         )
 
+        self._apply_nemotron_parse_overrides(kwargs)
         # Downstream detection stages (page elements, OCR, table/chart/infographic).
         self._append_detection_stages(kwargs)
 
         return self
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
index 90e230cba..529814853 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
@@ -1022,13 +1022,7 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "InProc
         ):
             resolved = resolved.model_copy(update={"api_key": resolve_remote_api_key()})
         kwargs = resolved.model_dump(mode="python")
-        batch_tuning = kwargs.get("batch_tuning") if isinstance(kwargs.get("batch_tuning"), dict) else {}
-        nemotron_parse_workers = float(batch_tuning.get("nemotron_parse_workers", 0.0) or 0.0)
-        gpu_nemotron_parse = float(batch_tuning.get("gpu_nemotron_parse", 0.0) or 0.0)
-        nemotron_parse_batch_size = float(batch_tuning.get("nemotron_parse_batch_size", 0.0) or 0.0)
-        use_nemotron_parse_only = (
-            nemotron_parse_workers > 0.0 and gpu_nemotron_parse > 0.0 and nemotron_parse_batch_size > 0.0
-        )
+        use_nemotron_parse_only = kwargs.get("method") == "nemotron_parse"
         extract_kwargs = dict(kwargs)
         # Downstream in-process stages (page elements / table / chart / infographic) assume
         # `page_image.image_b64` exists. Ensure PDF extraction emits a page image unless
@@ -1056,9 +1050,6 @@ def _append_detection_tasks(
 
         Shared by ``extract()`` (PDF) and ``extract_image_files()`` (standalone images).
         """
-        batch_tuning = kwargs.get("batch_tuning") if isinstance(kwargs.get("batch_tuning"), dict) else {}
-        nemotron_parse_batch_size = float(batch_tuning.get("nemotron_parse_batch_size", 0.0) or 0.0)
-
         # Common, optional knobs shared by our detect_* helpers.
         detect_passthrough_keys = {
             "inference_batch_size",
@@ -1104,7 +1095,6 @@ def _detect_kwargs_with_model(model_obj: Any, *, stage_name: str, allow_remote:
                 parse_flags["extract_charts"] = True
             if kwargs.get("extract_infographics") is True:
                 parse_flags["extract_infographics"] = True
-            parse_flags["inference_batch_size"] = int(nemotron_parse_batch_size)
             parse_flags.update(_stage_remote_kwargs("nemotron_parse"))
             parse_invoke_url = kwargs.get(
                 "nemotron_parse_invoke_url", kwargs.get("ocr_invoke_url", kwargs.get("invoke_url", ""))
@@ -1261,13 +1251,7 @@ def extract_image_files(self, params: ExtractParams | None = None, **kwargs: Any
         ):
             resolved = resolved.model_copy(update={"api_key": resolve_remote_api_key()})
         kwargs = resolved.model_dump(mode="python")
-        batch_tuning = kwargs.get("batch_tuning") if isinstance(kwargs.get("batch_tuning"), dict) else {}
-        nemotron_parse_workers = float(batch_tuning.get("nemotron_parse_workers", 0.0) or 0.0)
-        gpu_nemotron_parse = float(batch_tuning.get("gpu_nemotron_parse", 0.0) or 0.0)
-        nemotron_parse_batch_size = float(batch_tuning.get("nemotron_parse_batch_size", 0.0) or 0.0)
-        use_nemotron_parse_only = (
-            nemotron_parse_workers > 0.0 and gpu_nemotron_parse > 0.0 and nemotron_parse_batch_size > 0.0
-        )
+        use_nemotron_parse_only = kwargs.get("method") == "nemotron_parse"
         self._pipeline_type = "image"
         self._append_detection_tasks(kwargs, use_nemotron_parse_only=use_nemotron_parse_only)
         return self

From 491aed0b32bf6cf290706adb4f5737d7f91818a9 Mon Sep 17 00:00:00 2001
From: Edward Kim <109497216+edknv@users.noreply.github.com>
Date: Thu, 12 Mar 2026 13:47:35 -0700
Subject: [PATCH 16/55] (retriever) auto-route image files in .extract() for both inprocess a… (#1605)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

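With this change, `.extract()` hands off to `extract_image_files()` when
every input document has a supported image extension, in both the batch
and in-process ingestors. The check can be sketched roughly as below. The
extension set is an assumption taken from the glob patterns in the example
script, not the actual contents of `SUPPORTED_IMAGE_EXTENSIONS`, and
`is_image_only` is a hypothetical helper name used only for illustration.

```python
import os

# Assumed for illustration; the real SUPPORTED_IMAGE_EXTENSIONS lives in
# nemo_retriever.image.load and may differ.
ASSUMED_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".svg"}


def is_image_only(input_documents: list[str]) -> bool:
    """Return True when the list is non-empty and every path has an image extension."""
    return bool(input_documents) and all(
        os.path.splitext(path)[1].lower() in ASSUMED_IMAGE_EXTENSIONS for path in input_documents
    )


print(is_image_only(["scan.PNG", "photo.jpeg"]))  # extension match is case-insensitive
print(is_image_only(["scan.png", "report.pdf"]))  # mixed inputs do not qualify
print(is_image_only([]))  # empty input does not qualify
```

Mixed inputs (for example one PDF among images) and empty inputs do not
match, so they continue down the regular `extract()` path.
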
---
 .../examples/inprocess_pipeline.py            | 20 ++++++++++++++++++-
 .../src/nemo_retriever/ingest_modes/batch.py  |  6 ++++++
 .../nemo_retriever/ingest_modes/inprocess.py  |  5 +++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
index 6c4f27ba9..e5b9ad117 100644
--- a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
@@ -44,7 +44,7 @@ def main(
     input_type: str = typer.Option(
         "pdf",
         "--input-type",
-        help="Input format: 'pdf', 'txt', 'html', or 'doc'. Use 'txt' for .txt, 'html' for .html (markitdown -> chunks), 'doc' for .docx/.pptx (converted to PDF via LibreOffice).",  # noqa: E501
+        help="Input format: 'pdf', 'txt', 'html', 'doc', or 'image'. Use 'txt' for .txt, 'html' for .html (markitdown -> chunks), 'doc' for .docx/.pptx (converted to PDF via LibreOffice), 'image' for standalone image files (PNG, JPEG, BMP, TIFF, SVG).",  # noqa: E501
     ),
     query_csv: Path = typer.Option(
         "bo767_query_gt.csv",
@@ -186,6 +186,7 @@ def main(
             "txt": ["*.txt"],
             "html": ["*.html"],
             "doc": ["*.docx", "*.pptx"],
+            "image": ["*.png", "*.jpg", "*.jpeg", "*.bmp", "*.tiff", "*.tif", "*.svg"],
         }
         exts = ext_map.get(input_type, ["*.pdf"])
         file_patterns = [str(input_path / e) for e in exts]
@@ -207,6 +208,23 @@ def main(
                 overlap_tokens=text_chunk_overlap_tokens if text_chunk_overlap_tokens is not None else 150,
             )
         )
+    elif input_type == "image":
+        ingestor = ingestor.files(file_patterns).extract_image_files(
+            ExtractParams(
+                method=method,
+                extract_text=True,
+                extract_tables=True,
+                extract_charts=True,
+                extract_infographics=False,
+                use_graphic_elements=use_graphic_elements,
+                graphic_elements_invoke_url=graphic_elements_invoke_url,
+                use_table_structure=use_table_structure,
+                table_output_format=table_output_format,
+                table_structure_invoke_url=table_structure_invoke_url,
+                page_elements_invoke_url=page_elements_invoke_url,
+                ocr_invoke_url=ocr_invoke_url,
+            )
+        )
     elif input_type == "doc":
         ingestor = ingestor.files(file_patterns).extract(
             ExtractParams(
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
index 87e3c546c..f7a909f29 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
@@ -38,6 +38,7 @@
 )
 from nemo_retriever.ingest_modes.inprocess import collapse_content_to_page_rows, explode_content_to_rows
 
+from ..image.load import SUPPORTED_IMAGE_EXTENSIONS
 from ..ingestor import Ingestor
 from ..params import ASRParams
 from ..params import AudioChunkParams
@@ -318,6 +319,11 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "BatchI
             )
             return self.extract_txt(params=txt_params)
 
+        if self._input_documents and all(
+            os.path.splitext(f)[1].lower() in SUPPORTED_IMAGE_EXTENSIONS for f in self._input_documents
+        ):
+            return self.extract_image_files(params=params, **kwargs)
+
         resolved = _coerce_params(params, ExtractParams, kwargs)
         if (
             any(
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
index 529814853..1f1d229a2 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/inprocess.py
@@ -45,6 +45,7 @@
     pdfium = None  # type: ignore[assignment]
     _PDFIUM_IMPORT_ERROR = e
 
+from ..image.load import SUPPORTED_IMAGE_EXTENSIONS
 from ..utils.convert import SUPPORTED_EXTENSIONS, convert_to_pdf_bytes
 from ..ingestor import Ingestor
 from ..params import ASRParams
@@ -1007,6 +1008,10 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "InProc
         if self._input_documents and all(f.lower().endswith(".html") for f in self._input_documents):
             html_params = HtmlChunkParams()
             return self.extract_html(params=html_params)
+        if self._input_documents and all(
+            os.path.splitext(f)[1].lower() in SUPPORTED_IMAGE_EXTENSIONS for f in self._input_documents
+        ):
+            return self.extract_image_files(params=params, **kwargs)
         resolved = _coerce_params(params, ExtractParams, kwargs)
         if (
             any(

From 82088d72a744be9ba8bafb6ea794fb15970ad678 Mon Sep 17 00:00:00 2001
From: Charles Blackmon-Luca <20627856+charlesbluca@users.noreply.github.com>
Date: Thu, 12 Mar 2026 16:50:16 -0400
Subject: [PATCH 17/55] Dump libfreetype source in release container (#1600)
 (#1606)

---
 Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Dockerfile b/Dockerfile
index 87f1e9304..f89038926 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -39,13 +39,13 @@ RUN chmod +x scripts/install_ffmpeg.sh \
 # For GPL-licensed components, we provide their source code in the container
 # via `apt-get source` below to satisfy GPL requirements.
 ARG GPL_LIBS="\
+    libfreetype6 \
     libltdl7 \
     libhunspell-1.7-0 \
     libhyphen0 \
     libdbus-1-3 \
 "
 ARG FORCE_REMOVE_PKGS="\
-    libfreetype6 \
     ucf \
     liblangtag-common \
     libjbig0 \

From 10c7435daf27e91f7b71fa07bb7fae2fca4dc114 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Thu, 12 Mar 2026 18:20:11 -0400
Subject: [PATCH 18/55] Unit test failure fixes (#1607)

---
 .github/workflows/retriever-unit-tests.yml  | 10 +++++++---
 nemo_retriever/pyproject.toml               |  8 ++++----
 nemo_retriever/tests/test_batch_pipeline.py | 17 -----------------
 nemo_retriever/tests/test_html_convert.py   |  8 ++++----
 nemo_retriever/tests/test_txt_split.py      |  8 ++++----
 5 files changed, 19 insertions(+), 32 deletions(-)

diff --git a/.github/workflows/retriever-unit-tests.yml b/.github/workflows/retriever-unit-tests.yml
index e26d93328..87fb7ee25 100644
--- a/.github/workflows/retriever-unit-tests.yml
+++ b/.github/workflows/retriever-unit-tests.yml
@@ -19,11 +19,15 @@ jobs:
         with:
           python-version: "3.12"
 
+      - name: Install uv
+        run: |
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+
       - name: Install unit test dependencies
         run: |
-          python -m pip install --upgrade pip
-          python -m pip install pytest pandas pydantic pyyaml typer scikit-learn
-          python -m pip install api/
+          uv pip install --system -e src/ -e api/ -e client/
+          uv pip install --system -e nemo_retriever
 
       - name: Run retriever unit tests
         env:
diff --git a/nemo_retriever/pyproject.toml b/nemo_retriever/pyproject.toml
index 01de8f640..17ccb9459 100644
--- a/nemo_retriever/pyproject.toml
+++ b/nemo_retriever/pyproject.toml
@@ -93,10 +93,10 @@ version = {attr = "nemo_retriever.version.get_build_version"}
 nv-ingest = { path = "../src/", editable = true }
 nv-ingest-api = { path = "../api/", editable = true }
 nv-ingest-client = { path = "../client/", editable = true }
-nemotron-page-elements-v3 = { index = "test-pypi" }
-nemotron-graphic-elements-v1 = { index = "test-pypi" }
-nemotron-table-structure-v1 = { index = "test-pypi" }
-nemotron-ocr = { index = "test-pypi" }
+#nemotron-page-elements-v3 = { index = "test-pypi" }
+#nemotron-graphic-elements-v1 = { index = "test-pypi" }
+#nemotron-table-structure-v1 = { index = "test-pypi" }
+#nemotron-ocr = { index = "test-pypi" }
 torch = { index = "torch-cuda"}
 torchvision = { index ="torch-cuda"}
 
diff --git a/nemo_retriever/tests/test_batch_pipeline.py b/nemo_retriever/tests/test_batch_pipeline.py
index 2d18d92bb..6dfc913a6 100644
--- a/nemo_retriever/tests/test_batch_pipeline.py
+++ b/nemo_retriever/tests/test_batch_pipeline.py
@@ -1,23 +1,6 @@
-import pytest
-
-pytest.importorskip("ray")
-
-from nemo_retriever.examples.batch_pipeline import _count_materialized_rows
 from nemo_retriever.utils.input_files import resolve_input_patterns
 
 
-class _DatasetWithoutLen:
-    def count(self) -> int:
-        return 42
-
-    def __len__(self) -> int:
-        raise AssertionError("__len__ should not be used")
-
-
-def test_count_materialized_rows_prefers_dataset_count() -> None:
-    assert _count_materialized_rows(_DatasetWithoutLen()) == 42
-
-
 def test_resolve_input_file_patterns_recurses_for_directory_inputs(tmp_path) -> None:
     dataset_dir = tmp_path / "earnings_consulting"
     dataset_dir.mkdir()
diff --git a/nemo_retriever/tests/test_html_convert.py b/nemo_retriever/tests/test_html_convert.py
index 399ae9091..646127830 100644
--- a/nemo_retriever/tests/test_html_convert.py
+++ b/nemo_retriever/tests/test_html_convert.py
@@ -16,6 +16,7 @@
     html_file_to_chunks_df,
     html_to_markdown,
 )
+from nemo_retriever.params.models import HtmlChunkParams
 
 
 def test_html_to_markdown_str():
@@ -53,8 +54,7 @@ def test_html_file_to_chunks_df(tmp_path: Path):
     )
     df = html_file_to_chunks_df(
         str(f),
-        max_tokens=512,
-        overlap_tokens=0,
+        params=HtmlChunkParams(max_tokens=512, overlap_tokens=0),
     )
     assert isinstance(df, pd.DataFrame)
     assert "text" in df.columns and "path" in df.columns and "page_number" in df.columns and "metadata" in df.columns
@@ -71,7 +71,7 @@ def test_html_file_to_chunks_df_empty_content(tmp_path: Path):
     pytest.importorskip("transformers")
     f = tmp_path / "empty.html"
     f.write_text("<html><body></body></html>", encoding="utf-8")
-    df = html_file_to_chunks_df(str(f), max_tokens=512)
+    df = html_file_to_chunks_df(str(f), params=HtmlChunkParams(max_tokens=512))
     assert isinstance(df, pd.DataFrame)
     assert list(df.columns) == ["text", "path", "page_number", "metadata"]
     assert len(df) == 0
@@ -82,7 +82,7 @@ def test_html_bytes_to_chunks_df(tmp_path: Path):
     pytest.importorskip("transformers")
     html_bytes = b"<html><body><p>Chunk content from bytes.</p></body></html>"
     path = str(tmp_path / "virtual.html")
-    df = html_bytes_to_chunks_df(html_bytes, path, max_tokens=512, overlap_tokens=0)
+    df = html_bytes_to_chunks_df(html_bytes, path, params=HtmlChunkParams(max_tokens=512, overlap_tokens=0))
     assert isinstance(df, pd.DataFrame)
     assert "text" in df.columns and "path" in df.columns and "page_number" in df.columns and "metadata" in df.columns
     assert len(df) >= 1
diff --git a/nemo_retriever/tests/test_txt_split.py b/nemo_retriever/tests/test_txt_split.py
index 212c94813..cc71bfc45 100644
--- a/nemo_retriever/tests/test_txt_split.py
+++ b/nemo_retriever/tests/test_txt_split.py
@@ -13,6 +13,7 @@
 import pytest
 
 from nemo_retriever.txt.split import split_text_by_tokens, txt_file_to_chunks_df
+from nemo_retriever.params.models import TextChunkParams
 
 
 class _MockTokenizer:
@@ -63,11 +64,10 @@ def test_txt_file_to_chunks_df(tmp_path: Path):
     f.write_text("First paragraph here. Second paragraph there.", encoding="utf-8")
     df = txt_file_to_chunks_df(
         str(f),
-        max_tokens=512,
-        overlap_tokens=0,
+        params=TextChunkParams(max_tokens=512, overlap_tokens=0),
     )
     assert isinstance(df, pd.DataFrame)
-    assert list(df.columns) == ["text", "path", "page_number", "metadata"]
+    assert list(df.columns) == ["text", "content", "path", "page_number", "metadata"]
     assert len(df) >= 1
     assert df["path"].iloc[0] == str(f.resolve())
     assert df["page_number"].iloc[0] >= 1
@@ -79,7 +79,7 @@ def test_txt_file_to_chunks_df_empty_file(tmp_path: Path):
     pytest.importorskip("transformers")
     f = tmp_path / "empty.txt"
     f.write_text("", encoding="utf-8")
-    df = txt_file_to_chunks_df(str(f), max_tokens=512)
+    df = txt_file_to_chunks_df(str(f), params=TextChunkParams(max_tokens=512))
     assert isinstance(df, pd.DataFrame)
     assert list(df.columns) == ["text", "path", "page_number", "metadata"]
     assert len(df) == 0

From 11662db9856e907ebd9859b07847d5ba09aa2c94 Mon Sep 17 00:00:00 2001
From: Jacob Ioffe <70251274+jioffe502@users.noreply.github.com>
Date: Thu, 12 Mar 2026 18:27:47 -0400
Subject: [PATCH 19/55] Fix markdown outputs for batch and inprocess. (#1601)

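`to_markdown` and `to_markdown_by_page` now group results by input
filename instead of expecting a single document. The new contract can be
sketched roughly as below. This is a simplified illustration, not the
code in `io/markdown.py`: the function names are hypothetical, the
filename is read only from a `path` field, and block deduplication is
omitted, whereas the real helpers check several metadata fields and
dedupe repeated blocks.

```python
from collections import defaultdict
from pathlib import Path

UNKNOWN_PAGE = -1  # mirrors the _UNKNOWN_PAGE sentinel in markdown.py


def sketch_markdown_by_page(records: list[dict]) -> dict[str, dict[int, str]]:
    """Group text by document filename, then by page number (simplified)."""
    grouped: dict[str, dict[int, list[str]]] = defaultdict(lambda: defaultdict(list))
    for record in records:
        # Hypothetical fallback name for records that carry no path.
        name = Path(record.get("path") or "document_1").name
        try:
            page = int(record.get("page_number"))
        except (TypeError, ValueError):
            page = UNKNOWN_PAGE
        grouped[name][page].append(record.get("text", ""))
    # Known pages come first in ascending order; the unknown page sorts last.
    return {
        name: {
            page: "\n\n".join(blocks)
            for page, blocks in sorted(pages.items(), key=lambda kv: (kv[0] == UNKNOWN_PAGE, kv[0]))
        }
        for name, pages in grouped.items()
    }


def sketch_markdown(records: list[dict]) -> dict[str, str]:
    """Collapse each document's pages into one string, skipping empty pages."""
    return {
        name: "\n\n".join(text for text in pages.values() if text)
        for name, pages in sketch_markdown_by_page(records).items()
    }


records = [
    {"path": "/tmp/alpha.pdf", "page_number": 1, "text": "First page"},
    {"path": "/tmp/alpha.pdf", "page_number": None, "text": "Unknown page"},
    {"path": "/tmp/alpha.pdf", "page_number": "2", "text": "Second page"},
    {"path": "/tmp/beta.pdf", "page_number": 1, "text": "Only page"},
]
print(sketch_markdown_by_page(records))
print(sketch_markdown(records))
```

Keying both helpers by filename means single-document and multi-document
runs return the same shape, and no per-page headers are emitted.
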
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
---
 nemo_retriever/README.md                      |  15 +-
 .../src/nemo_retriever/io/markdown.py         | 216 ++++++++++++++----
 nemo_retriever/tests/test_io_markdown.py      | 125 +++++++---
 3 files changed, 274 insertions(+), 82 deletions(-)

diff --git a/nemo_retriever/README.md b/nemo_retriever/README.md
index c3292ba69..f0af47d61 100644
--- a/nemo_retriever/README.md
+++ b/nemo_retriever/README.md
@@ -210,10 +210,12 @@ ingestor = (
 
 All `ExtractParams` options (`extract_text`, `extract_tables`, `extract_charts`, `extract_infographics`) apply to image ingestion.
 
-### Render one document as markdown
+### Render results as markdown
 
-If you want a readable page-by-page markdown view of a single in-process result, pass the
-single-document result from `results[0]` to `nemo_retriever.io.to_markdown`.
+To get a readable markdown view of extracted results, pass the full in-process result list to
+`nemo_retriever.io.to_markdown`. The helper returns a `dict[str, str]` keyed by input filename;
+each value is that document collapsed into one markdown string without per-page headers.
+Single-document and multi-document runs therefore follow the same contract.
 
 ```python
 from nemo_retriever import create_ingestor
@@ -230,11 +232,12 @@ ingestor = (
     )
 )
 results = ingestor.ingest()
-print(to_markdown(results[0]))
+markdown_docs = to_markdown(results)
+print(markdown_docs["multimodal_test.pdf"])
 ```
 
-Use `to_markdown_by_page(results[0])` when you want a `dict[int, str]` instead of one concatenated
-markdown document.
+Use `to_markdown_by_page(results)` when you want a nested
+`dict[str, dict[int, str]]` instead, where each filename maps to its per-page markdown strings.
 
 ## Benchmark harness
 
diff --git a/nemo_retriever/src/nemo_retriever/io/markdown.py b/nemo_retriever/src/nemo_retriever/io/markdown.py
index 366677e40..2a03e7e94 100644
--- a/nemo_retriever/src/nemo_retriever/io/markdown.py
+++ b/nemo_retriever/src/nemo_retriever/io/markdown.py
@@ -4,6 +4,7 @@
 
 from __future__ import annotations
 
+import json
 from collections import defaultdict
 from collections.abc import Iterable, Mapping
 from dataclasses import dataclass, field
@@ -14,7 +15,6 @@
 
 from .dataframe import read_dataframe
 
-_DOCUMENT_TITLE = "Extracted Content"
 _UNKNOWN_PAGE = -1
 _RECORD_LIST_KEYS = ("records", "df_records", "extracted_df_records", "primitives")
 _PAGE_CONTENT_COLUMNS = (
@@ -39,76 +39,202 @@ def next_index(self, label: str) -> int:
         return next_value
 
 
-def to_markdown_by_page(results: object) -> dict[int, str]:
-    """Render a single document result as markdown grouped by page."""
-    records = _coerce_records(results)
-    by_page: dict[int, _PageContent] = defaultdict(_PageContent)
+def to_markdown_by_page(results: object) -> dict[str, dict[int, str]]:
+    """Render results as markdown grouped by document, then by page."""
+    grouped_records = _coerce_documents(results)
+    rendered: dict[str, dict[int, str]] = {}
 
-    for record in records:
-        if "document_type" in record:
-            _collect_primitive_record(by_page, record)
-        else:
-            _collect_page_record(by_page, record)
+    for document_name, records in grouped_records.items():
+        by_page = _pages_for_records(records)
+        rendered[document_name] = {
+            page_number: _render_page_content(page_content)
+            for page_number, page_content in sorted(by_page.items(), key=_page_sort_key)
+        }
+
+    return rendered
 
-    rendered: dict[int, str] = {}
-    for page_number, page_content in sorted(by_page.items(), key=_page_sort_key):
-        blocks = _dedupe_blocks(page_content.text_blocks + page_content.sections)
-        header = f"## Page {page_number}" if page_number != _UNKNOWN_PAGE else "## Page Unknown"
-        rendered[page_number] = header + ("\n\n" + "\n\n".join(blocks) if blocks else "\n")
+
+def to_markdown(results: object) -> dict[str, str]:
+    """Render results as one collapsed markdown string per document."""
+    rendered: dict[str, str] = {}
+
+    for document_name, pages in to_markdown_by_page(results).items():
+        rendered[document_name] = "\n\n".join(page_markdown for page_markdown in pages.values() if page_markdown)
 
     return rendered
 
 
-def to_markdown(results: object) -> str:
-    """Render a single document result as one markdown document."""
-    pages = to_markdown_by_page(results)
-    if not pages:
-        return f"# {_DOCUMENT_TITLE}\n\n_No content found._"
-    return f"# {_DOCUMENT_TITLE}\n\n" + "\n\n".join(pages.values())
+def _coerce_documents(results: object) -> dict[str, list[dict[str, Any]]]:
+    grouped_records: dict[str, list[dict[str, Any]]] = {}
+    _extend_documents(grouped_records, results)
+    return grouped_records
 
 
-def _coerce_records(results: object) -> list[dict[str, Any]]:
+def _extend_documents(
+    grouped_records: dict[str, list[dict[str, Any]]],
+    results: object,
+    explicit_document_name: str | None = None,
+) -> None:
     if results is None:
-        return []
+        return
+
+    dataset = getattr(results, "_rd_dataset", None)
+    if dataset is not None:
+        _extend_documents(grouped_records, dataset, explicit_document_name)
+        return
+
+    take_all = getattr(results, "take_all", None)
+    if callable(take_all):
+        _extend_documents(grouped_records, take_all(), explicit_document_name)
+        return
+
     if isinstance(results, pd.DataFrame):
-        return results.to_dict(orient="records")
+        _add_records(grouped_records, results.to_dict(orient="records"), explicit_document_name)
+        return
+
     if isinstance(results, Path):
-        return read_dataframe(results).to_dict(orient="records")
+        payload, payload_document_name = _load_results_path(results)
+        _extend_documents(grouped_records, payload, explicit_document_name or payload_document_name)
+        return
+
     if isinstance(results, str):
         path = Path(results).expanduser()
         if path.exists():
-            return read_dataframe(path).to_dict(orient="records")
+            payload, payload_document_name = _load_results_path(path)
+            _extend_documents(grouped_records, payload, explicit_document_name or payload_document_name)
+            return
         raise TypeError("String inputs must point to a saved results file.")
+
     if isinstance(results, Mapping):
-        return _records_from_mapping(results)
+        extracted = _extract_records_from_mapping(results)
+        if extracted is not None:
+            records, mapping_document_name = extracted
+            _add_records(grouped_records, records, explicit_document_name or mapping_document_name)
+            return
+
+        for key, value in results.items():
+            _extend_documents(grouped_records, value, str(key))
+        return
+
     if isinstance(results, Iterable) and not isinstance(results, (bytes, bytearray)):
-        return _records_from_iterable(results)
+        items = list(results)
+        if not items:
+            return
+        if all(isinstance(item, Mapping) and _looks_like_record(item) for item in items):
+            _add_records(grouped_records, [dict(item) for item in items], explicit_document_name)
+            return
+        for item in items:
+            _extend_documents(grouped_records, item, explicit_document_name=None)
+        return
+
     raise TypeError(f"Unsupported results type for markdown rendering: {type(results)!r}")
 
 
-def _records_from_iterable(results: Iterable[Any]) -> list[dict[str, Any]]:
-    items = list(results)
-    if not items:
-        return []
-    if len(items) == 1:
-        first = items[0]
-        if not isinstance(first, Mapping):
-            return _coerce_records(first)
-        if not _looks_like_record(first):
-            return _records_from_mapping(first)
-    if all(isinstance(item, Mapping) for item in items):
-        return [dict(item) for item in items]
-    raise ValueError("Markdown rendering expects a single document result. Pass one document, such as results[0].")
+def _load_results_path(path: Path) -> tuple[object, str | None]:
+    suffix = path.suffix.lower()
+    if suffix == ".json":
+        payload = json.loads(path.read_text(encoding="utf-8"))
+        return payload, (_document_name_from_mapping(payload) if isinstance(payload, Mapping) else None)
+    if suffix == ".jsonl":
+        lines = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
+        if len(lines) == 1 and isinstance(lines[0], Mapping):
+            payload = lines[0]
+            return payload, _document_name_from_mapping(payload)
+        return lines, None
+    return read_dataframe(path), None
 
 
-def _records_from_mapping(results: Mapping[str, Any]) -> list[dict[str, Any]]:
+def _extract_records_from_mapping(results: Mapping[str, Any]) -> tuple[list[dict[str, Any]], str | None] | None:
     for key in _RECORD_LIST_KEYS:
         value = results.get(key)
         if isinstance(value, list) and all(isinstance(item, Mapping) for item in value):
-            return [dict(item) for item in value]
+            return [dict(item) for item in value], _document_name_from_mapping(results)
     if _looks_like_record(results):
-        return [dict(results)]
-    raise ValueError("Markdown rendering expects a document row, row list, or saved results payload.")
+        return [dict(results)], None
+    return None
+
+
+def _add_records(
+    grouped_records: dict[str, list[dict[str, Any]]],
+    records: list[dict[str, Any]],
+    explicit_document_name: str | None = None,
+) -> None:
+    fallback_document_name = explicit_document_name or _next_unknown_document_name(grouped_records)
+    for record in records:
+        document_name = explicit_document_name or _document_name_for_record(record) or fallback_document_name
+        grouped_records.setdefault(document_name, []).append(record)
+
+
+def _next_unknown_document_name(grouped_records: Mapping[str, list[dict[str, Any]]]) -> str:
+    index = 1
+    while f"document_{index}" in grouped_records:
+        index += 1
+    return f"document_{index}"
+
+
+def _pages_for_records(records: Iterable[Mapping[str, Any]]) -> dict[int, _PageContent]:
+    by_page: dict[int, _PageContent] = defaultdict(_PageContent)
+
+    for record in records:
+        if "document_type" in record:
+            _collect_primitive_record(by_page, record)
+        else:
+            _collect_page_record(by_page, record)
+
+    return by_page
+
+
+def _render_page_content(page_content: _PageContent) -> str:
+    return "\n\n".join(_dedupe_blocks(page_content.text_blocks + page_content.sections))
+
+
+def _document_name_from_mapping(results: Mapping[str, Any]) -> str | None:
+    metadata = results.get("metadata")
+    source_metadata = _nested_mapping(metadata, "source_metadata") if isinstance(metadata, Mapping) else {}
+    custom_content = _nested_mapping(metadata, "custom_content") if isinstance(metadata, Mapping) else {}
+
+    return _normalize_document_name(
+        results.get("filename"),
+        results.get("source_path"),
+        results.get("path"),
+        results.get("source_id"),
+        source_metadata.get("source_name"),
+        source_metadata.get("source_id"),
+        custom_content.get("path"),
+        custom_content.get("input_pdf"),
+        custom_content.get("pdf_path"),
+    )
+
+
+def _document_name_for_record(record: Mapping[str, Any]) -> str | None:
+    metadata = _metadata(record)
+    source_metadata = _nested_mapping(metadata, "source_metadata")
+    custom_content = _nested_mapping(metadata, "custom_content")
+
+    return _normalize_document_name(
+        record.get("filename"),
+        record.get("source_path"),
+        metadata.get("source_path"),
+        source_metadata.get("source_name"),
+        custom_content.get("path"),
+        custom_content.get("input_pdf"),
+        custom_content.get("pdf_path"),
+        record.get("path"),
+        source_metadata.get("source_id"),
+        record.get("source_id"),
+    )
+
+
+def _normalize_document_name(*candidates: Any) -> str | None:
+    for candidate in candidates:
+        if not isinstance(candidate, str):
+            continue
+        normalized = candidate.strip()
+        if not normalized:
+            continue
+        name = Path(normalized).name
+        return name or normalized
+    return None
 
 
 def _looks_like_record(record: Mapping[str, Any]) -> bool:
diff --git a/nemo_retriever/tests/test_io_markdown.py b/nemo_retriever/tests/test_io_markdown.py
index e2ce7ed52..f0d98edbd 100644
--- a/nemo_retriever/tests/test_io_markdown.py
+++ b/nemo_retriever/tests/test_io_markdown.py
@@ -2,7 +2,6 @@
 from pathlib import Path
 
 import pandas as pd
-import pytest
 
 from nemo_retriever.io import to_markdown, to_markdown_by_page
 
@@ -18,10 +17,24 @@ def __len__(self):
         return len(self._rows)
 
 
-def test_to_markdown_renders_page_dataframe() -> None:
+class _DatasetLike:
+    def __init__(self, rows):
+        self._rows = rows
+
+    def take_all(self):
+        return list(self._rows)
+
+
+class _BatchResults:
+    def __init__(self, rows):
+        self._rd_dataset = _DatasetLike(rows)
+
+
+def test_to_markdown_groups_page_dataframe_by_filename() -> None:
     df = pd.DataFrame(
         [
             {
+                "path": "/tmp/alpha.pdf",
                 "page_number": 1,
                 "text": "Executive summary",
                 "table": [{"text": "| Animal | Count |\n| --- | --- |\n| Cat | 2 |"}],
@@ -29,10 +42,19 @@ def test_to_markdown_renders_page_dataframe() -> None:
                 "infographic": [],
             },
             {
+                "path": "/tmp/alpha.pdf",
                 "page_number": 2,
                 "text": "Appendix",
                 "table": [],
                 "chart": [],
+                "infographic": [],
+            },
+            {
+                "path": "/tmp/beta.pdf",
+                "page_number": 1,
+                "text": "Appendix",
+                "table": [],
+                "chart": [],
                 "infographic": [{"text": "Icon legend and callouts."}],
             },
         ]
@@ -40,29 +62,33 @@ def test_to_markdown_renders_page_dataframe() -> None:
 
     markdown = to_markdown(df)
 
-    assert markdown.startswith("# Extracted Content")
-    assert "## Page 1" in markdown
-    assert "Executive summary" in markdown
-    assert "### Table 1" in markdown
-    assert "### Chart 1" in markdown
-    assert "## Page 2" in markdown
-    assert "### Infographic 1" in markdown
+    assert list(markdown) == ["alpha.pdf", "beta.pdf"]
+    assert markdown["alpha.pdf"].startswith("Executive summary")
+    assert "Executive summary" in markdown["alpha.pdf"]
+    assert "### Table 1" in markdown["alpha.pdf"]
+    assert "### Chart 1" in markdown["alpha.pdf"]
+    assert "Appendix" in markdown["alpha.pdf"]
+    assert "## Page 1" not in markdown["alpha.pdf"]
+    assert "## Page 2" not in markdown["alpha.pdf"]
+    assert "### Infographic 1" in markdown["beta.pdf"]
 
 
-def test_to_markdown_by_page_sorts_pages_and_groups_unknown() -> None:
+def test_to_markdown_by_page_sorts_pages_and_groups_unknown_per_document() -> None:
     pages = to_markdown_by_page(
         [
-            {"page_number": "2", "text": "Second page"},
-            {"page_number": None, "text": "Unknown page"},
-            {"page_number": 1, "text": "First page"},
-            {"page_number": 2, "text": "Second page"},
+            {"source_path": "/tmp/alpha.pdf", "page_number": "2", "text": "Second page"},
+            {"source_path": "/tmp/alpha.pdf", "page_number": None, "text": "Unknown page"},
+            {"source_path": "/tmp/alpha.pdf", "page_number": 1, "text": "First page"},
+            {"source_path": "/tmp/alpha.pdf", "page_number": 2, "text": "Second page"},
+            {"source_path": "/tmp/beta.pdf", "page_number": 1, "text": "Only page"},
         ]
     )
 
-    assert list(pages.keys()) == [1, 2, -1]
-    assert pages[1].startswith("## Page 1")
-    assert pages[2].count("Second page") == 1
-    assert pages[-1].startswith("## Page Unknown")
+    assert list(pages["alpha.pdf"].keys()) == [1, 2, -1]
+    assert pages["alpha.pdf"][1] == "First page"
+    assert pages["alpha.pdf"][2].count("Second page") == 1
+    assert pages["alpha.pdf"][-1] == "Unknown page"
+    assert pages["beta.pdf"][1] == "Only page"
 
 
 def test_to_markdown_supports_primitive_rows_from_lazy_iterable() -> None:
@@ -71,6 +97,7 @@ def test_to_markdown_supports_primitive_rows_from_lazy_iterable() -> None:
             {
                 "document_type": "text",
                 "metadata": {
+                    "source_path": "/tmp/alpha.pdf",
                     "content": "Page text",
                     "content_metadata": {"page_number": 1},
                 },
@@ -78,6 +105,7 @@ def test_to_markdown_supports_primitive_rows_from_lazy_iterable() -> None:
             {
                 "document_type": "structured",
                 "metadata": {
+                    "source_path": "/tmp/alpha.pdf",
                     "content_metadata": {"page_number": 1, "subtype": "table"},
                     "table_metadata": {"table_content": "| A |\n| --- |\n| 1 |"},
                 },
@@ -85,6 +113,7 @@ def test_to_markdown_supports_primitive_rows_from_lazy_iterable() -> None:
             {
                 "document_type": "image",
                 "metadata": {
+                    "source_path": "/tmp/beta.pdf",
                     "content_metadata": {"page_number": 2, "subtype": "page_image"},
                     "image_metadata": {"text": "OCR fallback"},
                 },
@@ -94,39 +123,73 @@ def test_to_markdown_supports_primitive_rows_from_lazy_iterable() -> None:
 
     pages = to_markdown_by_page(rows)
 
-    assert "Page text" in pages[1]
-    assert "### Table 1" in pages[1]
-    assert "### Page Image 1" in pages[2]
-    assert "OCR fallback" in pages[2]
+    assert "Page text" in pages["alpha.pdf"][1]
+    assert "### Table 1" in pages["alpha.pdf"][1]
+    assert "### Page Image 1" in pages["beta.pdf"][2]
+    assert "OCR fallback" in pages["beta.pdf"][2]
 
 
 def test_to_markdown_reads_saved_records_wrapper(tmp_path: Path) -> None:
     path = tmp_path / "results.json"
     payload = {
+        "source_path": "/tmp/example.pdf",
         "records": [
             {
                 "page_number": 1,
                 "text": "Saved result text",
                 "table": [{"text": "| H |\n| --- |\n| V |"}],
-                "metadata": {"source_path": "/tmp/example.pdf"},
             }
-        ]
+        ],
     }
     path.write_text(json.dumps(payload), encoding="utf-8")
 
     markdown = to_markdown(path)
 
-    assert "Saved result text" in markdown
-    assert "### Table 1" in markdown
+    assert list(markdown) == ["example.pdf"]
+    assert "Saved result text" in markdown["example.pdf"]
+    assert "### Table 1" in markdown["example.pdf"]
 
 
-def test_to_markdown_empty_results_returns_placeholder() -> None:
-    assert to_markdown([]) == "# Extracted Content\n\n_No content found._"
+def test_to_markdown_empty_results_returns_empty_dict() -> None:
+    assert to_markdown([]) == {}
 
 
-def test_to_markdown_rejects_multi_document_results() -> None:
+def test_to_markdown_groups_inprocess_multi_document_results() -> None:
     doc_a = pd.DataFrame([{"page_number": 1, "text": "A"}])
+    doc_a["path"] = "/tmp/a.pdf"
     doc_b = pd.DataFrame([{"page_number": 1, "text": "B"}])
+    doc_b["path"] = "/tmp/b.pdf"
+
+    markdown = to_markdown([doc_a, doc_b])
+
+    assert set(markdown) == {"a.pdf", "b.pdf"}
+    assert "A" in markdown["a.pdf"]
+    assert "B" in markdown["b.pdf"]
+
+
+def test_to_markdown_supports_batch_dataset_like_results() -> None:
+    results = _BatchResults(
+        [
+            {
+                "document_type": "text",
+                "metadata": {
+                    "source_path": "/tmp/batch-a.pdf",
+                    "content": "Batch A page 1",
+                    "content_metadata": {"page_number": 1},
+                },
+            },
+            {
+                "document_type": "text",
+                "metadata": {
+                    "source_path": "/tmp/batch-b.pdf",
+                    "content": "Batch B page 2",
+                    "content_metadata": {"page_number": 2},
+                },
+            },
+        ]
+    )
+
+    pages = to_markdown_by_page(results)
 
-    with pytest.raises(ValueError, match="single document result"):
-        to_markdown([doc_a, doc_b])
+    assert pages["batch-a.pdf"][1] == "Batch A page 1"
+    assert pages["batch-b.pdf"][2] == "Batch B page 2"

From 02c2dcd88975e464b638167f5c7d7078b7e99e28 Mon Sep 17 00:00:00 2001
From: Edward Kim <109497216+edknv@users.noreply.github.com>
Date: Thu, 12 Mar 2026 15:32:27 -0700
Subject: [PATCH 20/55] (retriever) update pre/post-processing for improved
 recall (#1596) (#1608)

---
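Reviewer note (illustrative, not part of the change): the `fit_to_model`
scale added in `pdf/extract.py` can be checked in isolation. The sketch
below restates the arithmetic of `_compute_fit_to_model_scale` using plain
page dimensions in points rather than a pdfium page object. The names
`fit_to_model_scale` and `POINTS_PER_INCH` are made up for this note.

```python
from typing import Tuple

# PDF user space is measured in points; 72 points make one inch.
POINTS_PER_INCH = 72.0


def fit_to_model_scale(
    page_w_pt: float,
    page_h_pt: float,
    target_wh: Tuple[int, int] = (1024, 1024),
    max_dpi: int = 300,
) -> float:
    """Return a render scale that fits the page inside target_wh pixels.

    The scale is capped at max_dpi / 72 so small pages are never rendered
    above the requested DPI.
    """
    target_w, target_h = target_wh
    base_scale = max(float(max_dpi) / POINTS_PER_INCH, 0.01)
    # Degenerate page or target sizes fall back to the plain DPI scale.
    if page_w_pt <= 0 or page_h_pt <= 0 or target_w <= 0 or target_h <= 0:
        return base_scale
    fit_scale = max(min(target_w / page_w_pt, target_h / page_h_pt), 1e-3)
    return min(base_scale, fit_scale)


# US Letter is 612 x 792 pt; the height is the limiting side.
scale = fit_to_model_scale(612.0, 792.0)
effective_dpi = scale * POINTS_PER_INCH
print(f"scale={scale:.3f}, effective DPI={effective_dpi:.1f}")
```

For US Letter this gives a scale of about 1.293 (roughly 93 DPI), which is
the figure quoted in the docstring and in `ingest-config.yaml`.
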
 .../nemo_retriever/chart/chart_detection.py   | 17 +++++-
 .../nemo_retriever/examples/batch_pipeline.py |  8 +++
 .../examples/inprocess_pipeline.py            |  7 +++
 .../src/nemo_retriever/ingest-config.yaml     |  7 +++
 .../src/nemo_retriever/ingest_modes/batch.py  | 10 ++--
 .../src/nemo_retriever/ingest_modes/fused.py  |  2 +-
 .../model/local/nemotron_page_elements_v3.py  | 15 ++++-
 nemo_retriever/src/nemo_retriever/ocr/ocr.py  | 57 ++++++++++++++-----
 .../page_elements/page_elements.py            | 36 +++++++++++-
 .../src/nemo_retriever/params/models.py       |  1 +
 .../src/nemo_retriever/pdf/extract.py         | 54 ++++++++++++++++--
 .../src/nemo_retriever/pdf/stage.py           | 12 ++++
 .../nemo_retriever/table/table_detection.py   | 11 +++-
 nemo_retriever/tests/test_pdf_render_scale.py |  2 +-
 14 files changed, 206 insertions(+), 33 deletions(-)
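
Reviewer note (illustrative, not part of the change): the switch to
pixel-space row clustering in `_blocks_to_pseudo_markdown` can be pictured
with the dependency-free sketch below. It is not the code in `ocr.py`: it
replaces DBSCAN with a simple gap rule, which for 1-D data and
`min_samples=1` should group rows the same way, and the `| a | b |` row
format and the name `rows_from_blocks` are assumptions made for this note.

```python
from typing import Any, Dict, List


def rows_from_blocks(
    blocks: List[Dict[str, Any]], crop_h: int, eps_px: int = 10
) -> List[str]:
    """Group OCR blocks into table rows by pixel y-distance.

    sort_y is normalised to [0, 1], so it is scaled by the crop height
    before comparing gaps against eps_px.
    """
    valid = [b for b in blocks if b.get("text")]
    ordered = sorted(valid, key=lambda b: b["sort_y"])

    rows: List[List[Dict[str, Any]]] = []
    prev_y = None
    for block in ordered:
        y_px = int(block["sort_y"] * crop_h)
        # Start a new row when the gap to the previous block exceeds eps_px.
        if prev_y is None or y_px - prev_y > eps_px:
            rows.append([])
        rows[-1].append(block)
        prev_y = y_px

    # Within each row, order cells left to right and join with pipes.
    return [
        "| " + " | ".join(b["text"] for b in sorted(row, key=lambda b: b["sort_x"])) + " |"
        for row in rows
    ]


blocks = [
    {"text": "Count", "sort_y": 0.26, "sort_x": 0.6},
    {"text": "Animal", "sort_y": 0.25, "sort_x": 0.1},
    {"text": "2", "sort_y": 0.51, "sort_x": 0.6},
    {"text": "Cat", "sort_y": 0.50, "sort_x": 0.1},
]
table_rows = rows_from_blocks(blocks, crop_h=400)
print("\n".join(table_rows))
```

With a 400 px crop, the two header cells are about 4 px apart and land in
one row. In a 4000 px crop the same normalised offsets are about 40 px
apart and split into separate rows, which is why the crop height has to be
passed in.
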

diff --git a/nemo_retriever/src/nemo_retriever/chart/chart_detection.py b/nemo_retriever/src/nemo_retriever/chart/chart_detection.py
index 1e5a9bb18..23e1d8798 100644
--- a/nemo_retriever/src/nemo_retriever/chart/chart_detection.py
+++ b/nemo_retriever/src/nemo_retriever/chart/chart_detection.py
@@ -30,6 +30,13 @@
 except Exception:  # pragma: no cover
     Image = None  # type: ignore[assignment]
 
+try:
+    from nv_ingest_api.internal.primitives.nim.model_interface.yolox import (
+        YOLOX_GRAPHIC_MIN_SCORE,
+    )
+except ImportError:
+    YOLOX_GRAPHIC_MIN_SCORE = 0.1  # type: ignore[assignment]
+
 
 def _error_payload(*, stage: str, exc: BaseException) -> Dict[str, Any]:
     return {
@@ -443,7 +450,13 @@ def graphic_elements_ocr_page_elements(
                 if len(response_items) != len(crops):
                     raise RuntimeError(f"Expected {len(crops)} GE responses, got {len(response_items)}")
                 for resp in response_items:
-                    ge_results.append(_remote_response_to_ge_detections(resp))
+                    ge_results.append(
+                        [
+                            d
+                            for d in _remote_response_to_ge_detections(resp)
+                            if (d.get("score") or 0.0) >= YOLOX_GRAPHIC_MIN_SCORE
+                        ]
+                    )
             else:
                 # Local batched inference.
                 for _, _, crop_array in crops:
@@ -458,7 +471,7 @@ def graphic_elements_ocr_page_elements(
                         pre = pre.unsqueeze(0)
                     pred = graphic_elements_model.invoke(pre, (h, w))
                     ge_dets = _prediction_to_detections(pred, label_names=label_names)
-                    ge_results.append(ge_dets)
+                    ge_results.append([d for d in ge_dets if (d.get("score") or 0.0) >= YOLOX_GRAPHIC_MIN_SCORE])
 
             # --- Run OCR on all crops ---
             ocr_results: List[Any] = []
diff --git a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
index 556bf38fc..b36958b22 100644
--- a/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py
@@ -189,6 +189,12 @@ def main(
         "--debug/--no-debug",
         help="Enable debug-level logging for this full pipeline run.",
     ),
+    dpi: int = typer.Option(
+        300,
+        "--dpi",
+        min=72,
+        help="Render DPI for PDF page images (default: 300).",
+    ),
     input_path: Path = typer.Argument(
         ...,
         help="File or directory containing PDFs, .txt, .html, or .doc/.pptx files to ingest.",
@@ -617,6 +623,7 @@ def main(
                 "embed_workers": embed_actors,
                 "embed_batch_size": int(embed_batch_size),
                 "embed_cpus_per_actor": float(embed_cpus_per_actor),
+                "gpu_embed": float(embed_gpus_per_actor),
             },
         )
         # txt/html don't use embed_granularity from batch_tuning the same way,
@@ -653,6 +660,7 @@ def main(
         def _extract_params(batch_tuning: dict, **overrides: Any) -> ExtractParams:
             return ExtractParams(
                 method=method,
+                dpi=int(dpi),
                 extract_text=True,
                 extract_tables=True,
                 extract_charts=True,
diff --git a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
index e5b9ad117..b4bdb34ef 100644
--- a/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
+++ b/nemo_retriever/src/nemo_retriever/examples/inprocess_pipeline.py
@@ -150,6 +150,11 @@ def main(
         "--graphic-elements-invoke-url",
         help="Optional remote endpoint URL for graphic-elements model inference.",
     ),
+    hybrid: bool = typer.Option(
+        False,
+        "--hybrid/--no-hybrid",
+        help="Enable LanceDB hybrid mode (dense + FTS text).",
+    ),
     text_chunk: bool = typer.Option(
         False,
         "--text-chunk",
@@ -285,6 +290,7 @@ def main(
                 "table_name": LANCEDB_TABLE,
                 "overwrite": True,
                 "create_index": True,
+                "hybrid": hybrid,
             }
         )
     )
@@ -330,6 +336,7 @@ def main(
         embedding_http_endpoint=embed_invoke_url,
         top_k=10,
         ks=(1, 5, 10),
+        hybrid=hybrid,
     )
 
     _df_query, _gold, _raw_hits, _retrieved_keys, metrics = retrieve_and_score(query_csv=query_csv, cfg=cfg)
diff --git a/nemo_retriever/src/nemo_retriever/ingest-config.yaml b/nemo_retriever/src/nemo_retriever/ingest-config.yaml
index 1d8acd6c1..25f10997c 100644
--- a/nemo_retriever/src/nemo_retriever/ingest-config.yaml
+++ b/nemo_retriever/src/nemo_retriever/ingest-config.yaml
@@ -42,6 +42,13 @@ pdf:
       http: null
       model_name: null
 
+  # PDF rendering mode for page-element detection images:
+  #   full_dpi     – render at `dpi` (default 300), then resize_pad down to 1024×1024.
+  #                  Higher source resolution, but bilinear downscale may differ from NIM.
+  #   fit_to_model – render directly at the scale that fits within 1024×1024 (~93 DPI
+  #                  for US Letter), matching the nv-ingest/NIM container rasterization.
+  render_mode: fit_to_model
+
   extract:
     text: true
     # Text depth: page | document
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
index f7a909f29..dc2e14ea7 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/batch.py
@@ -338,6 +338,7 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "BatchI
             and not resolved.api_key
         ):
             resolved = resolved.model_copy(update={"api_key": resolve_remote_api_key()})
+
         kwargs = {
             **resolved.model_dump(mode="python", exclude={"remote_retry", "batch_tuning"}, exclude_none=True),
             **resolved.remote_retry.model_dump(mode="python", exclude_none=True),
@@ -357,10 +358,9 @@ def _endpoint_count(raw: Any) -> int:
 
         # 200 DPI is sufficient for both detection and OCR.  YOLOX resizes to
         # 1024x1024 internally, and NemotronOCR also resizes crops to 1024x1024,
-        # so resolution above ~1200px per side is wasted.  200 DPI (Letter =
-        # 1700x2200) gives enough detail while reducing extraction time and
-        # memory usage by ~30-40% vs 300 DPI.
-        kwargs.setdefault("dpi", 200)
+        # but the nv-ingest NIM uses 300 DPI for page-element detection; match
+        # that default here so local-model recall matches the container path.
+        kwargs.setdefault("dpi", 300)
         kwargs.setdefault("image_format", "jpeg")
         kwargs.setdefault("jpeg_quality", 100)
         self._pipeline_type = "pdf"
@@ -680,6 +680,7 @@ def extract_image_files(self, params: ExtractParams | None = None, **kwargs: Any
             and not resolved.api_key
         ):
             resolved = resolved.model_copy(update={"api_key": resolve_remote_api_key()})
+
         kwargs = {
             **resolved.model_dump(mode="python", exclude={"remote_retry", "batch_tuning"}, exclude_none=True),
             **resolved.remote_retry.model_dump(mode="python", exclude_none=True),
@@ -850,6 +851,7 @@ def embed(
         resolved = _coerce_params(params, EmbedParams, kwargs)
         if any((resolved.embedding_endpoint, resolved.embed_invoke_url)) and not resolved.api_key:
             resolved = resolved.model_copy(update={"api_key": resolve_remote_api_key()})
+
         kwargs = build_embed_kwargs(resolved, include_batch_tuning=True)
 
         # Remaining kwargs are forwarded to the actor constructor.
diff --git a/nemo_retriever/src/nemo_retriever/ingest_modes/fused.py b/nemo_retriever/src/nemo_retriever/ingest_modes/fused.py
index 6df62b656..7fd35373a 100644
--- a/nemo_retriever/src/nemo_retriever/ingest_modes/fused.py
+++ b/nemo_retriever/src/nemo_retriever/ingest_modes/fused.py
@@ -200,7 +200,7 @@ def extract(self, params: ExtractParams | None = None, **kwargs: Any) -> "FusedI
         pdf_extract_workers = int(kwargs.pop("pdf_extract_workers", max(1, self._num_cpus // 2)))
 
         kwargs.setdefault("extract_page_as_image", True)
-        kwargs.setdefault("dpi", 200)
+        kwargs.setdefault("dpi", 300)
 
         self._tasks.append(("extract", dict(kwargs)))
         self._fused_extract_flags = {
diff --git a/nemo_retriever/src/nemo_retriever/model/local/nemotron_page_elements_v3.py b/nemo_retriever/src/nemo_retriever/model/local/nemotron_page_elements_v3.py
index 9e2cd1074..21c9077da 100644
--- a/nemo_retriever/src/nemo_retriever/model/local/nemotron_page_elements_v3.py
+++ b/nemo_retriever/src/nemo_retriever/model/local/nemotron_page_elements_v3.py
@@ -67,6 +67,8 @@ def preprocess(self, tensor: Union[torch.Tensor, np.ndarray]) -> torch.Tensor:
                     raise TypeError(f"resize_pad returned non-tensor: {type(y)!r}")
                 if y.ndim != 3:
                     raise ValueError(f"Expected CHW from resize_pad, got {tuple(y.shape)}")
+                # Match NIM preprocessing: quantize to uint8 after interpolation
+                y = torch.clamp(y, 0, 255).to(torch.uint8).float()
                 return y.unsqueeze(0)
 
             outs: List[torch.Tensor] = []
@@ -74,6 +76,8 @@ def preprocess(self, tensor: Union[torch.Tensor, np.ndarray]) -> torch.Tensor:
                 y = resize_pad_page_elements(x[i], self.input_shape)
                 if not isinstance(y, torch.Tensor) or y.ndim != 3:
                     raise ValueError(f"resize_pad produced unexpected output for batch item {i}: {type(y)!r}")
+                # Match NIM preprocessing: quantize to uint8 after interpolation
+                y = torch.clamp(y, 0, 255).to(torch.uint8).float()
                 outs.append(y)
             return torch.stack(outs, dim=0)
 
@@ -83,6 +87,8 @@ def preprocess(self, tensor: Union[torch.Tensor, np.ndarray]) -> torch.Tensor:
                 raise TypeError(f"resize_pad returned non-tensor: {type(y)!r}")
             if y.ndim != 3:
                 raise ValueError(f"Expected CHW from resize_pad, got {tuple(y.shape)}")
+            # Match NIM preprocessing: quantize to uint8 after interpolation
+            y = torch.clamp(y, 0, 255).to(torch.uint8).float()
             return y.unsqueeze(0)
 
         raise ValueError(f"Expected CHW or BCHW tensor, got shape {tuple(x.shape)}")
@@ -133,8 +139,13 @@ def postprocess(self, preds: Union[Dict[str, torch.Tensor], Sequence[Dict[str, t
         # may pass a *list* of per-image preds for batched inference, so handle both cases
         # and always return torch tensors (or lists of torch tensors).
 
+        # Use a zero threshold so all NMS survivors reach WBF before per-class
+        # filtering.  The real per-class gate is _apply_final_score_filter (after WBF),
+        # matching the NIM pipeline ordering.
+        passthrough_thresholds = {k: 0.0 for k in self._model.thresholds_per_class}
+
         def _one(p: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
-            b_np, l_np, s_np = postprocess_preds_page_element(p, self._model.thresholds_per_class, self._model.labels)
+            b_np, l_np, s_np = postprocess_preds_page_element(p, passthrough_thresholds, self._model.labels)
             b = torch.as_tensor(b_np, dtype=torch.float32)
             l = torch.as_tensor(l_np, dtype=torch.int64)  # noqa: E741
             s = torch.as_tensor(s_np, dtype=torch.float32)
@@ -212,7 +223,7 @@ def output(self) -> Any:
                 "labels": "List[str] - class names",
                 "scores": "np.ndarray[N] - confidence scores",
             },
-            "classes": ["table", "chart", "infographic", "title", "text", "header_footer"],
+            "classes": ["table", "chart", "title", "infographic", "text", "header_footer"],
             "post_processing": {"conf_thresh": 0.01, "iou_thresh": 0.5},
         }
 
diff --git a/nemo_retriever/src/nemo_retriever/ocr/ocr.py b/nemo_retriever/src/nemo_retriever/ocr/ocr.py
index 8ef24ebfe..34ae7258b 100644
--- a/nemo_retriever/src/nemo_retriever/ocr/ocr.py
+++ b/nemo_retriever/src/nemo_retriever/ocr/ocr.py
@@ -337,16 +337,29 @@ def _parse_ocr_result(preds: Any) -> List[Dict[str, Any]]:
 
 
 def _blocks_to_text(blocks: List[Dict[str, Any]]) -> str:
-    """Sort text blocks by reading order (y then x) and join with newlines."""
+    """Sort text blocks by reading order (y then x) and join with whitespace."""
     blocks.sort(key=lambda b: (b.get("sort_y", 0.0), b.get("sort_x", 0.0)))
-    return "\n".join(b["text"] for b in blocks if b.get("text"))
+    return " ".join(b["text"] for b in blocks if b.get("text"))
 
 
-def _blocks_to_pseudo_markdown(blocks: List[Dict[str, Any]]) -> str:
+def _blocks_to_pseudo_markdown(
+    blocks: List[Dict[str, Any]],
+    crop_hw: Tuple[int, int] = (0, 0),
+) -> str:
     """Convert OCR text blocks into pseudo-markdown table format.
 
-    Uses DBSCAN clustering on y-coordinates to identify rows, then
+    Uses DBSCAN clustering on pixel y-coordinates to identify rows, then
     sorts within each row by x-coordinate and joins with pipe separators.
+
+    Parameters
+    ----------
+    blocks : list of dict
+        OCR text blocks with ``sort_y`` (normalised [0,1]) and ``sort_x``.
+    crop_hw : (height, width)
+        Pixel dimensions of the crop image.  When provided the normalised
+        ``sort_y`` values are scaled to pixels and clustered with
+        ``eps=10`` (matching nv-ingest behaviour).  Falls back to the old
+        normalised-space heuristic when the height is unavailable.
     """
     if not blocks:
         return ""
@@ -358,19 +371,27 @@ def _blocks_to_pseudo_markdown(blocks: List[Dict[str, Any]]) -> str:
     from sklearn.cluster import DBSCAN
 
     df = pd.DataFrame(valid)
+    df = df.sort_values("sort_y")
 
-    # Normalize y-coordinates to [0,1] for scale-invariant clustering.
     y_vals = df["sort_y"].values
-    y_range = y_vals.max() - y_vals.min()
-    if y_range > 0:
-        y_norm = (y_vals - y_vals.min()) / y_range
-        eps = 0.03  # ~3% of bbox height ≈ one text line
+    crop_h = crop_hw[0] if crop_hw else 0
+
+    if crop_h > 0:
+        # Pixel-space clustering (matches nv-ingest eps=10).
+        y_pixels = (y_vals * crop_h).astype(int)
+        eps = 10
     else:
-        y_norm = y_vals
-        eps = 0.1
+        # Fallback: normalise to [0,1] when pixel dims are unknown.
+        y_range = y_vals.max() - y_vals.min()
+        if y_range > 0:
+            y_pixels = (y_vals - y_vals.min()) / y_range
+            eps = 0.03
+        else:
+            y_pixels = y_vals
+            eps = 0.1
 
     dbscan = DBSCAN(eps=eps, min_samples=1)
-    dbscan.fit(y_norm.reshape(-1, 1))
+    dbscan.fit(y_pixels.reshape(-1, 1))
     df["cluster"] = dbscan.labels_
 
     df = df.sort_values(["cluster", "sort_x"])
@@ -574,7 +595,15 @@ def ocr_page_elements(
 
                         blocks = _parse_ocr_result(preds)
                         if label_name == "table":
-                            text = _blocks_to_pseudo_markdown(blocks) or _blocks_to_text(blocks)
+                            crop_hw_table: Tuple[int, int] = (0, 0)
+                            try:
+                                _raw = base64.b64decode(crop_b64s[i])
+                                with Image.open(io.BytesIO(_raw)) as _cim:
+                                    _cw, _ch = _cim.size
+                                    crop_hw_table = (_ch, _cw)
+                            except Exception:
+                                pass
+                            text = _blocks_to_pseudo_markdown(blocks, crop_hw=crop_hw_table) or _blocks_to_text(blocks)
                         else:
                             text = _blocks_to_text(blocks)
                         entry = {"bbox_xyxy_norm": bbox, "text": text}
@@ -615,7 +644,7 @@ def _append_local_result(
                                 return
                     blocks = _parse_ocr_result(preds)
                     if label_name == "table":
-                        text = _blocks_to_pseudo_markdown(blocks)
+                        text = _blocks_to_pseudo_markdown(blocks, crop_hw=crop_hw)
                         if not text:
                             text = _blocks_to_text(blocks)
                     else:
diff --git a/nemo_retriever/src/nemo_retriever/page_elements/page_elements.py b/nemo_retriever/src/nemo_retriever/page_elements/page_elements.py
index fc9052ddf..09a3179f9 100644
--- a/nemo_retriever/src/nemo_retriever/page_elements/page_elements.py
+++ b/nemo_retriever/src/nemo_retriever/page_elements/page_elements.py
@@ -34,10 +34,12 @@
     from nv_ingest_api.internal.primitives.nim.model_interface.yolox import (
         postprocess_page_elements_v3,
         YOLOX_PAGE_V3_CLASS_LABELS,
+        YOLOX_PAGE_V3_FINAL_SCORE,
     )
 except ImportError:
     postprocess_page_elements_v3 = None  # type: ignore[assignment,misc]
     YOLOX_PAGE_V3_CLASS_LABELS = None  # type: ignore[assignment]
+    YOLOX_PAGE_V3_FINAL_SCORE = {}  # type: ignore[assignment]
 
 from nemo_retriever.nim.nim import invoke_page_elements_batches
 
@@ -123,6 +125,10 @@ def _decode_b64_image_to_np_array(image_b64: str) -> Tuple["np.array", Tuple[int
         im = im0.convert("RGB")
         w, h = im.size
         arr = np.array(im)
+        # The NIM container receives BGR images (PNG encoded from BGR numpy
+        # arrays) and decodes the raw channels as-is, so the model effectively
+        # runs on BGR input.  Match that here by reversing the channel order.
+        arr = arr[:, :, ::-1].copy()
 
     return arr, (int(h), int(w))
 
@@ -339,6 +345,25 @@ def _bounding_boxes_to_detections(
     return dets
 
 
+def _apply_final_score_filter(
+    dets: List[Dict[str, Any]],
+) -> List[Dict[str, Any]]:
+    """Filter detections by per-class final score thresholds (YOLOX_PAGE_V3_FINAL_SCORE).
+
+    This should be applied **after** WBF post-processing to match the NIM pipeline ordering.
+    Maps retriever label "text" to API label "paragraph" for threshold lookup.
+    """
+    if not YOLOX_PAGE_V3_FINAL_SCORE or not dets:
+        return dets
+    filtered: List[Dict[str, Any]] = []
+    for d in dets:
+        api_name = _RETRIEVER_TO_API.get(d["label_name"], d["label_name"])
+        threshold = YOLOX_PAGE_V3_FINAL_SCORE.get(api_name, 0.0)
+        if d.get("score") is not None and d["score"] >= threshold:
+            filtered.append(d)
+    return filtered
+
+
 def _apply_page_elements_v3_postprocess(
     dets: List[Dict[str, Any]],
 ) -> List[Dict[str, Any]]:
@@ -495,7 +520,11 @@ def detect_page_elements_v3(
     if model is not None and hasattr(model, "thresholds_per_class"):
         thresholds_per_class = getattr(model, "thresholds_per_class")
     else:
-        thresholds_per_class = [0.0 for _ in label_names]
+        # Use the same per-class thresholds as the yolox pipeline.
+        # label_names uses "text" where yolox uses "paragraph"; _RETRIEVER_TO_API maps between them.
+        thresholds_per_class = [
+            YOLOX_PAGE_V3_FINAL_SCORE.get(_RETRIEVER_TO_API.get(name, name), 0.0) for name in label_names
+        ]
 
     for _, row in pages_df.iterrows():
         try:
@@ -671,6 +700,7 @@ def detect_page_elements_v3(
                     labels_list.append(torch.as_tensor(l_np, dtype=torch.int64))
                     scores_list.append(torch.as_tensor(s_np, dtype=torch.float32))
                 boxes, labels, scores = boxes_list, labels_list, scores_list
+
             per_image_dets = _postprocess_to_per_image_detections(
                 boxes=boxes,
                 labels=labels,
@@ -678,8 +708,10 @@ def detect_page_elements_v3(
                 batch_size=len(pre_list),
                 label_names=label_names,
             )
-            # Apply v3 postprocessing (box fusion, title matching, expansion, overlap removal)
+            # Apply v3 postprocessing (box fusion via WBF at iou=0.01, title matching, expansion, overlap removal)
             per_image_dets = [_apply_page_elements_v3_postprocess(dets) for dets in per_image_dets]
+            # Apply per-class final score filtering AFTER WBF (matches NIM pipeline ordering)
+            per_image_dets = [_apply_final_score_filter(dets) for dets in per_image_dets]
             for local_i, row_i in enumerate(chunk_idx):
                 dets = per_image_dets[local_i] if local_i < len(per_image_dets) else []
                 row_payloads[row_i] = {
diff --git a/nemo_retriever/src/nemo_retriever/params/models.py b/nemo_retriever/src/nemo_retriever/params/models.py
index 66e925162..1f81e38e0 100644
--- a/nemo_retriever/src/nemo_retriever/params/models.py
+++ b/nemo_retriever/src/nemo_retriever/params/models.py
@@ -167,6 +167,7 @@ class ExtractParams(_ParamsModel):
     dpi: int = 200
     image_format: str = "jpeg"
     jpeg_quality: int = 100
+    render_mode: Literal["full_dpi", "fit_to_model"] = "fit_to_model"
     inference_batch_size: int = 8
     ocr_model_dir: Optional[str] = None
 
diff --git a/nemo_retriever/src/nemo_retriever/pdf/extract.py b/nemo_retriever/src/nemo_retriever/pdf/extract.py
index af2a92a8c..992c18ebe 100644
--- a/nemo_retriever/src/nemo_retriever/pdf/extract.py
+++ b/nemo_retriever/src/nemo_retriever/pdf/extract.py
@@ -6,7 +6,7 @@
 
 from io import BytesIO
 from dataclasses import dataclass
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Literal, Optional, Tuple
 
 import base64
 import traceback
@@ -33,6 +33,37 @@
 except Exception:  # pragma: no cover
     np = None  # type: ignore[assignment]
 
+# Default model input size used by nv-ingest for page-element detection.
+_MODEL_INPUT_SIZE: Tuple[int, int] = (1024, 1024)
+
+# Allowed render-mode values.
+RenderMode = Literal["full_dpi", "fit_to_model"]
+
+
+def _compute_fit_to_model_scale(
+    page: Any,
+    target_wh: Tuple[int, int] = _MODEL_INPUT_SIZE,
+    max_dpi: int = 300,
+) -> float:
+    """Compute a pdfium render scale that fits the page within *target_wh* pixels.
+
+    This mirrors the logic in ``nv_ingest_api.util.pdf.pdfium._compute_render_scale_to_fit``
+    combined with the ``min(base_scale, fit_scale)`` cap applied in
+    ``pdfium_pages_to_numpy`` when ``scale_tuple`` is provided.
+
+    For a US-Letter page (612×792 pt) fitting into 1024×1024 the result is
+    ``min(300/72, min(1024/612, 1024/792)) ≈ 1.293`` → ~93 effective DPI.
+    """
+    target_w, target_h = target_wh
+    page_w = float(page.get_width())
+    page_h = float(page.get_height())
+    if page_w <= 0 or page_h <= 0 or target_w <= 0 or target_h <= 0:
+        return max(float(max_dpi) / 72.0, 0.01)
+
+    fit_scale = max(min(target_w / page_w, target_h / page_h), 1e-3)
+    base_scale = max(float(max_dpi) / 72.0, 0.01)
+    return min(base_scale, fit_scale)
+
 
 def _render_page_to_base64(
     page: Any,
@@ -40,17 +71,28 @@ def _render_page_to_base64(
     dpi: int = 200,
     image_format: str = "jpeg",
     jpeg_quality: int = 100,
+    render_mode: RenderMode = "fit_to_model",
 ) -> Dict[str, Any]:
-    """
-    Render a page at full DPI and encode as JPEG or PNG.
+    """Render a page and encode as JPEG or PNG.
+
+    Parameters
+    ----------
+    render_mode:
+        ``"full_dpi"`` – render at *dpi* (e.g. 300 → 2550×3300 px for US Letter).
+        ``"fit_to_model"`` – render at the nv-ingest fit-to-1024 scale (~93 DPI
+        for US Letter) so the raster is already close to the model's input size,
+        avoiding a large bilinear down-scale in ``resize_pad``.
 
     Returns dict with:
     - image_b64: str
     - encoding: str ("jpeg" or "png")
     - orig_shape_hw: tuple[int,int] (H,W) of the rendered raster
     """
-    base_scale = max(float(dpi) / 72.0, 0.01)
-    bitmap = page.render(scale=base_scale)
+    if render_mode == "fit_to_model":
+        render_scale = _compute_fit_to_model_scale(page, _MODEL_INPUT_SIZE, max_dpi=dpi)
+    else:
+        render_scale = max(float(dpi) / 72.0, 0.01)
+    bitmap = page.render(scale=render_scale)
 
     arr = convert_bitmap_to_corrected_numpy(bitmap)
 
@@ -144,6 +186,7 @@ def pdf_extraction(
     jpeg_quality: int = 100,
     text_extraction_method: str = "pdfium_hybrid",
     text_depth: str = "page",
+    render_mode: RenderMode = "fit_to_model",
     **kwargs: Any,
 ) -> Any:
     """
@@ -250,6 +293,7 @@ def pdf_extraction(
                             dpi=dpi,
                             image_format=image_format,
                             jpeg_quality=jpeg_quality,
+                            render_mode=render_mode,
                         )
 
                     page_record: Dict[str, Any] = {
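
Reviewer note: the docstring arithmetic for the fit-to-model scale can be checked with a small standalone sketch. It mirrors the rule in `_compute_fit_to_model_scale` but takes plain page dimensions in PDF points instead of a pdfium page object. The name `fit_to_model_scale` and the `(1024, 1024)` target are assumptions for illustration; the value of `_MODEL_INPUT_SIZE` is not shown in this hunk.

```python
def fit_to_model_scale(page_wh, target_wh=(1024, 1024), max_dpi=300):
    """Return the render scale: fit the page inside target_wh, capped at max_dpi.

    page_wh is in PDF points (1/72 inch); target_wh is in pixels (assumed 1024x1024).
    """
    page_w, page_h = page_wh
    target_w, target_h = target_wh
    base_scale = max(float(max_dpi) / 72.0, 0.01)
    # Degenerate sizes fall back to the plain DPI scale, as in the patch.
    if page_w <= 0 or page_h <= 0 or target_w <= 0 or target_h <= 0:
        return base_scale
    fit_scale = max(min(target_w / page_w, target_h / page_h), 1e-3)
    return min(base_scale, fit_scale)


# US Letter is 612 x 792 points.
letter = (612.0, 792.0)
scale = fit_to_model_scale(letter)
effective_dpi = scale * 72.0
approx_raster_wh = (round(letter[0] * scale), round(letter[1] * scale))

# A small page hits the DPI cap instead of the fit scale.
small_page_scale = fit_to_model_scale((200.0, 200.0))
```

For US Letter the long side (792 pt) is the limiting dimension, so the scale is 1024/792 and the raster is roughly 791×1024 pixels. The exact pixel size produced by the renderer may differ by a pixel depending on how it rounds.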
diff --git a/nemo_retriever/src/nemo_retriever/pdf/stage.py b/nemo_retriever/src/nemo_retriever/pdf/stage.py
index d34313697..be39cb332 100644
--- a/nemo_retriever/src/nemo_retriever/pdf/stage.py
+++ b/nemo_retriever/src/nemo_retriever/pdf/stage.py
@@ -145,6 +145,7 @@ def _normalize_page_elements_config(raw: Dict[str, Any]) -> Dict[str, Any]:
         outputs, "json_output_dir", "json-output-dir"
     )
     out["limit"] = _cfg_get(raw, "limit")
+    out["render_mode"] = _cfg_get(raw, "render_mode")
 
     # Drop Nones so "not specified" stays not specified.
     return {k: v for k, v in out.items() if v is not None}
@@ -522,6 +523,14 @@ def render_page_elements(
         "--text-depth",
         help="Text depth for extracted text primitives: 'page' or 'document'.",
     ),
+    render_mode: str = typer.Option(
+        "fit_to_model",
+        "--render-mode",
+        help=(
+            "Page rendering mode: 'full_dpi' (render at DPI then resize_pad) or "
+            "'fit_to_model' (render at nv-ingest fit-to-1024 scale, ~93 DPI for US Letter)."
+        ),
+    ),
     write_json_outputs: bool = typer.Option(
         True,
         "--write-json-outputs/--no-write-json-outputs",
@@ -583,6 +592,9 @@ def render_page_elements(
     if not _argv_has_any(["--text-depth"]):
         text_depth = str(cfg_raw.get("text_depth", text_depth))
 
+    if not _argv_has_any(["--render-mode"]):
+        render_mode = str(cfg_raw.get("render_mode", render_mode))
+
     if not _argv_has_any(["--write-json-outputs", "--no-write-json-outputs"]):
         write_json_outputs = bool(cfg_raw.get("write_json_outputs", write_json_outputs))
     if not _argv_has_any(["--json-output-dir"]):
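
Reviewer note: the precedence this hunk sets up is "explicit CLI flag, then config file, then the Typer default". A minimal sketch of that rule follows. The helper `argv_has_any` here is a hypothetical stand-in written for this note; the real `_argv_has_any` is not shown in the diff and may behave differently.

```python
def argv_has_any(flags, argv):
    """Hypothetical stand-in: True if any flag appears as '--flag' or '--flag=value'."""
    return any(arg == flag or arg.startswith(flag + "=") for arg in argv for flag in flags)


def resolve_render_mode(cli_value, cfg_raw, argv):
    """Pick render_mode: an explicit CLI flag wins, else config, else the CLI default."""
    if not argv_has_any(["--render-mode"], argv):
        return str(cfg_raw.get("render_mode", cli_value))
    return cli_value


# Flag given on the command line: the config value is ignored.
from_cli = resolve_render_mode("full_dpi", {"render_mode": "fit_to_model"}, ["--render-mode", "full_dpi"])
# No flag: the config value replaces the Typer default.
from_cfg = resolve_render_mode("fit_to_model", {"render_mode": "full_dpi"}, ["--text-depth", "page"])
# No flag and no config key: the Typer default survives.
from_default = resolve_render_mode("fit_to_model", {}, [])
```

Because `_normalize_page_elements_config` drops `None` values, a config file that omits `render_mode` leaves the key absent, and the default is used.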
diff --git a/nemo_retriever/src/nemo_retriever/table/table_detection.py b/nemo_retriever/src/nemo_retriever/table/table_detection.py
index bfb82a187..841526431 100644
--- a/nemo_retriever/src/nemo_retriever/table/table_detection.py
+++ b/nemo_retriever/src/nemo_retriever/table/table_detection.py
@@ -17,6 +17,13 @@
 except Exception:  # pragma: no cover
     torch = None  # type: ignore[assignment]
 
+try:
+    from nv_ingest_api.internal.primitives.nim.model_interface.yolox import (
+        YOLOX_TABLE_MIN_SCORE,
+    )
+except ImportError:
+    YOLOX_TABLE_MIN_SCORE = 0.1  # type: ignore[assignment]
+
 _DEFAULT_TABLE_STRUCTURE_LABELS: List[str] = ["cell", "row", "column"]
 
 
@@ -351,7 +358,7 @@ def table_structure_ocr_page_elements(
                     if not parsed:
                         pred_item = _extract_remote_pred_item(resp)
                         parsed = _prediction_to_detections(pred_item, label_names=label_names)
-                    structure_results.append(parsed)
+                    structure_results.append([d for d in parsed if (d.get("score") or 0.0) >= YOLOX_TABLE_MIN_SCORE])
             else:
                 # Local batched inference.
                 for _, _, crop_array in crops:
@@ -366,7 +373,7 @@ def table_structure_ocr_page_elements(
                         pre = pre.unsqueeze(0)
                     pred = table_structure_model.invoke(pre, (h, w))
                     dets = _prediction_to_detections(pred, label_names=label_names)
-                    structure_results.append(dets)
+                    structure_results.append([d for d in dets if (d.get("score") or 0.0) >= YOLOX_TABLE_MIN_SCORE])
 
             # --- Pass 3: Run OCR on all crops ---
             ocr_results: List[Any] = []
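
Reviewer note: both call sites now apply the same score filter. The sketch below isolates that expression to show how missing or `None` scores are handled. The sample detections and the name `filter_detections` are made up for illustration, and `0.1` is the `ImportError` fallback from this patch, not necessarily the value exported by `nv_ingest_api`.

```python
MIN_SCORE = 0.1  # assumed: the fallback value used when nv_ingest_api is unavailable


def filter_detections(dets, min_score=MIN_SCORE):
    """Keep detections whose score is at least min_score; None or missing counts as 0.0."""
    return [d for d in dets if (d.get("score") or 0.0) >= min_score]


detections = [
    {"label": "cell", "score": 0.92},
    {"label": "row", "score": 0.05},     # below the threshold, dropped
    {"label": "column", "score": None},  # None is treated as 0.0, dropped
    {"label": "cell"},                   # missing key is treated as 0.0, dropped
    {"label": "row", "score": 0.1},      # equal to the threshold, kept
]
kept = filter_detections(detections)
```

The `or 0.0` guard means a detection without a usable score is discarded rather than raising a `TypeError` on comparison.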
diff --git a/nemo_retriever/tests/test_pdf_render_scale.py b/nemo_retriever/tests/test_pdf_render_scale.py
index 5c2bba5ec..6344c5d3b 100644
--- a/nemo_retriever/tests/test_pdf_render_scale.py
+++ b/nemo_retriever/tests/test_pdf_render_scale.py
@@ -54,7 +54,7 @@ def test_renders_at_full_dpi(self):
         dpi = 200
         base_scale = dpi / 72.0
 
-        _extract._render_page_to_base64(page, dpi=dpi)
+        _extract._render_page_to_base64(page, dpi=dpi, render_mode="full_dpi")
 
         render_call = page.render.call_args
         actual_scale = render_call.kwargs.get("scale", render_call.args[0] if render_call.args else None)

From f55a733082b1f49b4c45ad66c178da0a8a3bedc8 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Fri, 13 Mar 2026 10:51:56 -0400
Subject: [PATCH 21/55] Remove get_hf_revision logic from code not inside the
 nemo_retriever directory

---
 .../internal/transform/split_text.py          |  8 +-----
 docker/scripts/post_build_triggers.py         | 26 +------------------
 2 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/api/src/nv_ingest_api/internal/transform/split_text.py b/api/src/nv_ingest_api/internal/transform/split_text.py
index 9b88ec5ce..9d099ec7f 100644
--- a/api/src/nv_ingest_api/internal/transform/split_text.py
+++ b/api/src/nv_ingest_api/internal/transform/split_text.py
@@ -56,14 +56,8 @@ def _get_tokenizer(
         if cache_key in _tokenizer_cache:
             return _tokenizer_cache[cache_key]
 
-        from nemo_retriever.utils.hf_model_registry import get_hf_revision
-
         logger.info("Loading and caching tokenizer: %s", tokenizer_identifier)
-        tokenizer = AutoTokenizer.from_pretrained(
-            tokenizer_identifier,
-            revision=get_hf_revision(tokenizer_identifier),
-            token=token,
-        )
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_identifier, token=token)
         _tokenizer_cache[cache_key] = tokenizer
         return tokenizer
 
diff --git a/docker/scripts/post_build_triggers.py b/docker/scripts/post_build_triggers.py
index 1488e6339..8eb26f301 100644
--- a/docker/scripts/post_build_triggers.py
+++ b/docker/scripts/post_build_triggers.py
@@ -4,30 +4,6 @@
 
 from transformers import AutoTokenizer
 
-try:
-    from nemo_retriever.utils.hf_model_registry import get_hf_revision
-except ModuleNotFoundError:
-    # Fallback for Docker build stages where nemo_retriever isn't installed yet.
-    _REVISIONS = {
-        "meta-llama/Llama-3.2-1B": "4e20de362430cd3b72f300e6b0f18e50e7166e08",
-        "intfloat/e5-large-unsupervised": "15af9288f69a6291f37bfb89b47e71abc747b206",
-    }
-
-    def get_hf_revision(model_id, *, strict=True):  # type: ignore[misc]
-        revision = _REVISIONS.get(model_id)
-        if revision is not None:
-            return revision
-        msg = (
-            f"No pinned HuggingFace revision for model '{model_id}'. "
-            "Add an entry to _REVISIONS in post_build_triggers.py (and "
-            "HF_MODEL_REVISIONS in hf_model_registry.py) to pin it."
-        )
-        if strict:
-            raise ValueError(msg)
-        print(f"WARNING: {msg} Falling back to the default (main) branch.")
-        return None
-
-
 MAX_RETRIES = 5
 
 
@@ -36,7 +12,7 @@ def download_tokenizer(model_name, save_path, token=None):
 
     for attempt in range(MAX_RETRIES):
         try:
-            tokenizer = AutoTokenizer.from_pretrained(model_name, revision=get_hf_revision(model_name), token=token)
+            tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
             tokenizer.save_pretrained(save_path)
             return
         except Exception as e:

From 83a936c29d0c92e291a9ec453beba49d57204866 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Fri, 13 Mar 2026 11:08:59 -0700
Subject: [PATCH 22/55] Added air gap instructions to helm file (#1616)

---
 docs/docs/extraction/helm.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index f5891a772..ae6af0066 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -4,3 +4,7 @@
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
 refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.03.0-RC2/helm/README.md).
+
+!!! note "Air-gapped environments"
+   
+    For deploying in an air-gapped environment, refer to the [NVIDIA NIM Operator documentation on Air-Gapped Environments](https://docs.nvidia.com/nim-operator/latest/air-gap.html), which explains how to deploy NIMs when your cluster has no internet or NGC registry access.

From 4d9ce5fb69a2e8867f8eccf0cbccccffb589323b Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Fri, 13 Mar 2026 15:49:37 -0400
Subject: [PATCH 23/55] fix for network call reranking (#1619)

---
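
Reviewer note: the request and response shapes changed in this patch (`documents` to `passages`, `results` to `rankings`, `relevance_score` to `logit`). The sketch below reproduces both sides without making a network call. The function names and the model name `example/reranker` are placeholders for illustration; the field names are taken from the diff.

```python
from typing import Any, Dict, List


def build_ranking_payload(query: str, documents: List[str], model_name: str, truncate: str = "END") -> Dict[str, Any]:
    """Build the request body in the shape this patch sends to the ranking endpoint."""
    return {
        "model": model_name,
        "query": {"text": query},
        "passages": [{"text": d} for d in documents],
        "truncate": truncate,
    }


def scores_from_response(data: Dict[str, Any], n_docs: int) -> List[float]:
    """Align returned logits with input order; documents the server omits stay at -inf."""
    scores = [float("-inf")] * n_docs
    for item in data.get("rankings", []):
        idx = item.get("index")
        score = item.get("logit")
        if idx is not None and score is not None:
            scores[idx] = float(score)
    return scores


docs = ["doc a", "doc b", "doc c"]
payload = build_ranking_payload("What is ML?", docs, model_name="example/reranker")

# A response in server order (not input order), as in the updated unit test.
response = {"rankings": [{"index": 2, "logit": 0.1}, {"index": 0, "logit": 0.8}, {"index": 1, "logit": 0.5}]}
scores = scores_from_response(response, len(docs))
partial = scores_from_response({"rankings": [{"index": 1, "logit": 2.5}]}, len(docs))
```

Note that `top_n` is no longer sent, so any truncation of the ranked list has to happen on the caller's side.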
 .../src/nemo_retriever/rerank/rerank.py       | 22 ++++++------
 .../src/nemo_retriever/retriever.py           |  2 +-
 .../tests/test_nemotron_rerank_v2.py          | 35 ++++++-------------
 3 files changed, 22 insertions(+), 37 deletions(-)

diff --git a/nemo_retriever/src/nemo_retriever/rerank/rerank.py b/nemo_retriever/src/nemo_retriever/rerank/rerank.py
index 189b56a89..6d7bd7a6f 100644
--- a/nemo_retriever/src/nemo_retriever/rerank/rerank.py
+++ b/nemo_retriever/src/nemo_retriever/rerank/rerank.py
@@ -73,7 +73,7 @@ def _rerank_via_endpoint(
     endpoint: str,
     model_name: str = _DEFAULT_MODEL,
     api_key: str = "",
-    top_n: Optional[int] = None,
+    truncate: str = "END",
 ) -> List[float]:
     """
     Call a vLLM / NIM ``/rerank`` REST endpoint and return per-document scores.
@@ -92,18 +92,16 @@ def _rerank_via_endpoint(
     """
     import requests
 
-    url = endpoint.rstrip("/") + "/rerank"
+    cleaned_endpoint = endpoint.rstrip("/")
     headers: Dict[str, str] = {"Content-Type": "application/json"}
+    if not cleaned_endpoint.endswith(("/reranking", "/v1/ranking")):
+        cleaned_endpoint += "/v1/ranking"
+    url = cleaned_endpoint
+    headers = {"accept": "application/json", "Content-Type": "application/json"}
     if api_key:
         headers["Authorization"] = f"Bearer {api_key}"
-
-    payload: Dict[str, Any] = {
-        "model": model_name,
-        "query": query,
-        "documents": documents,
-    }
-    if top_n is not None:
-        payload["top_n"] = top_n
+    texts = [{"text": d} for d in documents]
+    payload = {"model": model_name, "query": {"text": query}, "passages": texts, "truncate": truncate}
 
     response = requests.post(url, json=payload, headers=headers, timeout=120)
     response.raise_for_status()
@@ -111,9 +109,9 @@ def _rerank_via_endpoint(
 
     # Build score list aligned with input document order.
     scores = [float("-inf")] * len(documents)
-    for item in data.get("results", []):
+    for item in data.get("rankings", []):
         idx = item.get("index")
-        score = item.get("relevance_score")
+        score = item.get("logit")
         if idx is not None and score is not None:
             scores[idx] = float(score)
     return scores
diff --git a/nemo_retriever/src/nemo_retriever/retriever.py b/nemo_retriever/src/nemo_retriever/retriever.py
index aab11b519..5d3458ce8 100644
--- a/nemo_retriever/src/nemo_retriever/retriever.py
+++ b/nemo_retriever/src/nemo_retriever/retriever.py
@@ -215,7 +215,7 @@ def _rerank_results(
                     hits,
                     model=model,
                     invoke_url=reranker_endpoint,
-                    model_name=str(self.reranker),
+                    model_name=str(self.reranker_model_name),
                     api_key=(self.reranker_api_key or "").strip(),
                     max_length=int(self.reranker_max_length),
                     batch_size=int(self.reranker_batch_size),
diff --git a/nemo_retriever/tests/test_nemotron_rerank_v2.py b/nemo_retriever/tests/test_nemotron_rerank_v2.py
index 4c6761a5b..6ba3c6fa9 100644
--- a/nemo_retriever/tests/test_nemotron_rerank_v2.py
+++ b/nemo_retriever/tests/test_nemotron_rerank_v2.py
@@ -407,9 +407,9 @@ def test_posts_to_rerank_url(self):
 
         mock_resp = MagicMock()
         mock_resp.json.return_value = {
-            "results": [
-                {"index": 0, "relevance_score": 0.9},
-                {"index": 1, "relevance_score": 0.3},
+            "rankings": [
+                {"index": 0, "logit": 0.9},
+                {"index": 1, "logit": 0.3},
             ]
         }
         mock_resp.raise_for_status = MagicMock()
@@ -424,9 +424,9 @@ def test_posts_to_rerank_url(self):
 
         mock_post.assert_called_once()
         call_kwargs = mock_post.call_args
-        assert call_kwargs[0][0] == "http://localhost:8000/rerank"
-        assert call_kwargs[1]["json"]["query"] == "What is ML?"
-        assert len(call_kwargs[1]["json"]["documents"]) == 2
+        assert call_kwargs[0][0] == "http://localhost:8000/v1/ranking"
+        assert call_kwargs[1]["json"]["query"] == {"text": "What is ML?"}
+        assert len(call_kwargs[1]["json"]["passages"]) == 2
 
         assert scores == [0.9, 0.3]
 
@@ -436,10 +436,10 @@ def test_scores_aligned_with_input_order(self):
         # Server returns results in reversed order
         mock_resp = MagicMock()
         mock_resp.json.return_value = {
-            "results": [
-                {"index": 2, "relevance_score": 0.1},
-                {"index": 0, "relevance_score": 0.8},
-                {"index": 1, "relevance_score": 0.5},
+            "rankings": [
+                {"index": 2, "logit": 0.1},
+                {"index": 0, "logit": 0.8},
+                {"index": 1, "logit": 0.5},
             ]
         }
         mock_resp.raise_for_status = MagicMock()
@@ -484,20 +484,7 @@ def test_trailing_slash_on_endpoint_normalized(self):
             _rerank_via_endpoint("q", ["d"], endpoint="http://localhost:8000/")
 
         url = mock_post.call_args[0][0]
-        assert url == "http://localhost:8000/rerank"
-
-    def test_top_n_sent_in_payload_when_specified(self):
-        from nemo_retriever.rerank.rerank import _rerank_via_endpoint
-
-        mock_resp = MagicMock()
-        mock_resp.json.return_value = {"results": [{"index": 0, "relevance_score": 0.5}]}
-        mock_resp.raise_for_status = MagicMock()
-
-        with patch("requests.post", return_value=mock_resp) as mock_post:
-            _rerank_via_endpoint("q", ["d"], endpoint="http://localhost:8000", top_n=5)
-
-        payload = mock_post.call_args[1]["json"]
-        assert payload["top_n"] == 5
+        assert url == "http://localhost:8000/v1/ranking"
 
     def test_top_n_not_in_payload_when_not_specified(self):
         from nemo_retriever.rerank.rerank import _rerank_via_endpoint

From 0a60c1aa0f6e480fea62eda9e8e3a52b55a6b832 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Fri, 13 Mar 2026 16:02:05 -0400
Subject: [PATCH 24/55] Release prep: Update versions to 26.3.0-RC4 (#1620)

---
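
Reviewer note: this patch writes the same version in two spellings, `26.3.0-RC4` for image tags, chart versions, and docs, and `26.3.0rc4` in the `pyproject.toml` pins. The second is the PEP 440 normalized form of the first. The sketch below is a deliberately narrow converter for release-candidate tags only; the function name `tag_to_pep440` is an assumption for this note, and the `packaging` library is the place to look for the full normalization rules.

```python
import re

# Matches tags such as "26.3.0-RC4" or "26.03.0rc2": a dotted release, an
# optional separator, "rc" in any case, and a number.
_RC_TAG = re.compile(r"^(\d+(?:\.\d+)*)[-._]?rc(\d+)$", re.IGNORECASE)


def tag_to_pep440(tag: str) -> str:
    """Convert a release-candidate tag to its PEP 440 normalized spelling."""
    match = _RC_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a release-candidate tag: {tag!r}")
    release, rc_number = match.groups()
    # PEP 440 compares release segments as integers, so leading zeros are dropped.
    normalized_release = ".".join(str(int(part)) for part in release.split("."))
    return f"{normalized_release}rc{int(rc_number)}"


new_pin = tag_to_pep440("26.3.0-RC4")
old_pin = tag_to_pep440("26.03.0-RC2")
```

This also shows why the earlier `26.03.0-RC2` spelling and a `26.3.0rc2` pin refer to the same Python package version even though the strings differ.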
 docker-compose.yaml                             | 2 +-
 docs/docs/extraction/helm.md                    | 2 +-
 docs/docs/extraction/quickstart-guide.md        | 2 +-
 docs/docs/extraction/quickstart-library-mode.md | 2 +-
 helm/Chart.yaml                                 | 2 +-
 helm/README.md                                  | 8 ++++----
 helm/README.md.gotmpl                           | 6 +++---
 helm/values.yaml                                | 2 +-
 nemo_retriever/pyproject.toml                   | 6 +++---
 src/nv_ingest/api/main.py                       | 2 +-
 tools/harness/pyproject.toml                    | 6 +++---
 tools/harness/test_configs.yaml                 | 2 +-
 12 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 6ad589efc..dc7dea85a 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -262,7 +262,7 @@ services:
       - audio
 
   nv-ingest-ms-runtime:
-    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.03.0-RC2
+    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.3.0-RC4
     shm_size: 40gb # Should be at minimum 30% of assigned memory per Ray documentation
     build:
       context: ${NV_INGEST_ROOT:-.}
diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index ae6af0066..5e5c787cd 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -3,7 +3,7 @@
 <!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.03.0-RC2/helm/README.md).
+refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC4/helm/README.md).
 
 !!! note "Air-gapped environments"
    
diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index a996d4f21..111638f1e 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -84,7 +84,7 @@ h. Run the command `docker ps`. You should see output similar to the following.
     CONTAINER ID  IMAGE                                            COMMAND                 CREATED         STATUS                  PORTS            NAMES
 uv venv --python 3.12 nv-ingest-dev
 source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.03.0-RC2 nv-ingest-api==26.03.0-RC2 nv-ingest-client==26.03.0-RC2
+uv pip install nv-ingest==26.3.0-RC4 nv-ingest-api==26.3.0-RC4 nv-ingest-client==26.3.0-RC4
 ```
 
 !!! tip
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index b9e6ca371..2e5042f10 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -34,7 +34,7 @@ Use the following procedure to prepare your environment.
     ```
        uv venv --python 3.12 nvingest && \
          source nvingest/bin/activate && \
-         uv pip install nemo-retriever==26.03.0-RC2 milvus-lite==2.4.12
+         uv pip install nemo-retriever==26.3.0-RC4 milvus-lite==2.4.12
     ```
 
     !!! tip
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 9891b9555..5b018b724 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: nv-ingest
 description: NV-Ingest Microservice
 type: application
-version: 26.03.0-RC2
+version: 26.3.0-RC4
 maintainers:
   - name: NVIDIA Corporation
     url: https://www.nvidia.com/
diff --git a/helm/README.md b/helm/README.md
index 3860dfce9..e1e43362e 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -45,7 +45,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC2.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC4.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -54,7 +54,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.03.0-RC2"
+    --set image.tag="26.3.0-RC4"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -105,7 +105,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.03.0-RC2
+pip install nv-ingest-client==26.3.0-RC4
 ```
 
 #### Rest Endpoint Ingress
@@ -347,7 +347,7 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | fullnameOverride | string | `""` |  |
 | image.pullPolicy | string | `"IfNotPresent"` |  |
 | image.repository | string | `"nvcr.io/nvidia/nemo-microservices/nv-ingest"` |  |
-| image.tag | string | `"26.03.0-RC2"` |  |
+| image.tag | string | `"26.3.0-RC4"` |  |
 | imagePullSecrets[0].name | string | `"ngc-api"` |  |
 | imagePullSecrets[1].name | string | `"ngc-secret"` |  |
 | ingress.annotations | object | `{}` |  |
diff --git a/helm/README.md.gotmpl b/helm/README.md.gotmpl
index b16dab04c..3686cd1fc 100644
--- a/helm/README.md.gotmpl
+++ b/helm/README.md.gotmpl
@@ -46,7 +46,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.03.0-RC2.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC4.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -55,7 +55,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.03.0-RC2"
+    --set image.tag="26.3.0-RC4"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -107,7 +107,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.03.0-RC2
+pip install nv-ingest-client==26.3.0-RC4
 ```
 
 #### Rest Endpoint Ingress
diff --git a/helm/values.yaml b/helm/values.yaml
index 243c1e740..d0cdd9364 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -28,7 +28,7 @@ nameOverride: ""
 image:
   pullPolicy: IfNotPresent
   repository: "nvcr.io/nvidia/nemo-microservices/nv-ingest"
-  tag: "26.03.0-RC2"
+  tag: "26.3.0-RC4"
 
 ## @section Pod Configuration
 ## @param podAnnotations [object] Sets additional annotations on the main deployment pods
diff --git a/nemo_retriever/pyproject.toml b/nemo_retriever/pyproject.toml
index 17ccb9459..fc713c8ce 100644
--- a/nemo_retriever/pyproject.toml
+++ b/nemo_retriever/pyproject.toml
@@ -30,9 +30,9 @@ dependencies = [
   "typer>=0.12.0",
   "pyyaml>=6.0",
   "lancedb",
-  "nv-ingest==26.03.0rc2",
-  "nv-ingest-api==26.03.0rc2",
-  "nv-ingest-client==26.03.0rc2",
+  "nv-ingest==26.3.0rc4",
+  "nv-ingest-api==26.3.0rc4",
+  "nv-ingest-client==26.3.0rc4",
   "fastapi>=0.114.0",
   "uvicorn[standard]>=0.30.0",
   "httpx>=0.27.0",
diff --git a/src/nv_ingest/api/main.py b/src/nv_ingest/api/main.py
index ae72b3fdf..a5a3f7cb3 100644
--- a/src/nv_ingest/api/main.py
+++ b/src/nv_ingest/api/main.py
@@ -23,7 +23,7 @@
 app = FastAPI(
     title="NV-Ingest Microservice",
     description="Service for ingesting heterogenous datatypes",
-    version="26.03.0-RC2",
+    version="26.3.0-RC4",
     contact={
         "name": "NVIDIA Corporation",
         "url": "https://nvidia.com",
diff --git a/tools/harness/pyproject.toml b/tools/harness/pyproject.toml
index c04a4638c..2e661b019 100644
--- a/tools/harness/pyproject.toml
+++ b/tools/harness/pyproject.toml
@@ -10,9 +10,9 @@ dependencies = [
     "pyyaml>=6.0",
     "requests>=2.32.5",
     "pynvml>=11.5.0",
-    "nv-ingest==26.03.0rc2",
-    "nv-ingest-api==26.03.0rc2",
-    "nv-ingest-client==26.03.0rc2",
+    "nv-ingest==26.3.0rc4",
+    "nv-ingest-api==26.3.0rc4",
+    "nv-ingest-client==26.3.0rc4",
     "milvus-lite==2.4.12",
     "pypdfium2>=4.30.0,<5.0.0",
     "nemotron-page-elements-v3==3.0.1",
diff --git a/tools/harness/test_configs.yaml b/tools/harness/test_configs.yaml
index f2a214681..a9278ef62 100644
--- a/tools/harness/test_configs.yaml
+++ b/tools/harness/test_configs.yaml
@@ -28,7 +28,7 @@ active:
     kubectl_bin: microk8s kubectl  # kubectl binary command (e.g., "kubectl", "microk8s kubectl")
     kubectl_sudo: null  # Prepend sudo to kubectl commands (null = same as helm_sudo)
     chart: nemo-microservices/nv-ingest  # Remote chart reference (set to null to use local chart from ./helm)
-    chart_version: 26.03.0-RC2  # Chart version (required for remote charts)
+    chart_version: 26.3.0-RC4  # Chart version (required for remote charts)
     release: nv-ingest
     namespace: nv-ingest
     values_file: .helm-env  # Optional: path to values file

From 86cda76b76b00bca5b419363c13119c538950371 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Fri, 13 Mar 2026 16:24:40 -0700
Subject: [PATCH 25/55] Updated RNs to show forthcoming changes (#1623)

---
 .../docs/extraction/releasenotes-nv-ingest.md | 62 +++----------------
 1 file changed, 10 insertions(+), 52 deletions(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index d824af15b..56c4c71f3 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -6,62 +6,20 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
     NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
+## 26.03 Release Notes (in progress)
 
+- NV-Ingest github repo renamed to NeMo-Retriever 
+- NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library
+- NeMo Retriever Library now support two deployment options:
+    - Load Hugging Face models locally on your GPU.
+    - Use locally deployed NeMo Retriever NIM endpoints for embedding and OCR.
+- Note on Air-gapped support 
+- Added support for RTX4500 Pro Blackwell SKU 
+- Added support for llama-nemotron-embed-vl-v2 ?
 
-## Release 26.01 (26.3.0-RC1)
 
-The NeMo Retriever Library 26.01 release adds new hardware and software support, and other improvements.
+NeMo Retriever Library currently does not support image captioning via VLM. It will be added in the next release.
 
-To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md).
-
-
-### Highlights 
-
-This release contains the following key changes:
-
-- Added functional support for [H200 NVL](https://www.nvidia.com/en-us/data-center/h200/). For details, refer to [Support Matrix](support-matrix.md).
-- All Helm deployments for Kubernetes now use [NVIDIA NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html). For details, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md). 
-- Updated RIVA NIM to version 1.4.0. For details, refer to [Extract Speech](audio.md).
-- Updated VLM NIM to [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard). For details, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images).
-- Added VLM caption prompt customization parameters, including reasoning control. For details, refer to [Caption Images and Control Reasoning](python-api-reference.md#caption-images-and-control-reasoning).
-- Added support for the [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse/modelcard) model which replaces the [nemoretriever-parse](https://build.nvidia.com/nvidia/nemoretriever-parse/modelcard) model. For details, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
-- Support is now deprecated for [paddleocr](https://build.nvidia.com/baidu/paddleocr/modelcard).
-- The `meta-llama/Llama-3.2-1B` tokenizer is now pre-downloaded so that you can run token-based splitting without making a network request. For details, refer to [Split Documents](chunking.md).
-- For scanned PDFs, added specialized extraction strategies. For details, refer to [PDF Extraction Strategies](python-api-reference.md#pdf-extraction-strategies).
-- Added support for [LanceDB](https://lancedb.com/). For details, refer to [Upload to a Custom Data Store](data-store.md).
-- The V2 API is now available and is the default processing pipeline. The response format remains backwards-compatible. You can enable the v2 API by using `message_client_kwargs={"api_version": "v2"}`. For details, refer to [V2 API Guide](v2-api-guide.md).
-- Large PDFs are now automatically split into chunks and processed in parallel, delivering faster ingestion for long documents. For details, refer to [PDF Pre-Splitting](v2-api-guide.md).
-- Issues maintaining extraction quality while processing very large files are now resolved with the V2 API. For details, refer to [V2 API Guide](v2-api-guide.md).
-- Updated the embedding task to support embedding on custom content fields like the results of summarization functions. For details, refer to [Use the Python API](python-api-reference.md).
-- User-defined function summarization is now using `nemotron-mini-4b-instruct` which provides significant speed improvements. For details, refer to [User-defined Functions](user-defined-functions.md) and [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/examples/udfs/README.md).
-- In the `Ingestor.extract` method, the defaults for `extract_text` and `extract_images` are now set to `true` for consistency with `extract_tables` and `extract_charts`. For details, refer to [Use the Python API](python-api-reference.md).
-- The `table-structure` profile is no longer available. The table-structure profile is now part of the default profile. For details, refer to [Profile Information](quickstart-guide.md#profile-information).
-- New documentation [Why Throughput Is Dataset-Dependent](throughput-is-dataset-dependent.md).
-- New documentation [Add User-defined Stages](user-defined-stages.md).
-- New documentation [Add User-defined Functions](user-defined-functions.md).
-- New documentation [Resource Scaling Modes](scaling-modes.md).
-- New documentation [NimClient Usage](nimclient.md).
-- New documentation [Use the API (V2)](v2-api-guide.md).
-
-
-
-### Fixed Known Issues
-
-The following are the known issues that are fixed in this version:
-
-- A10G support is restored. To use A10G hardware, use release 26.3.0-RC1 or later. For details, refer to [Support Matrix](support-matrix.md).
-- L40S support is restored. To use L40S hardware, use release 26.3.0-RC1 or later. For details, refer to [Support Matrix](support-matrix.md).
-- The page number field in the content metadata now starts at 1 instead of 0 so each page number is no longer off by one from what you would expect. For details, refer to [Content Metadata](content-metadata.md).
-- Support for batches that include individual files greater than approximately 400MB is restored. This includes audio files and pdfs.
-
-
-
-## All Known Issues
-
-The following are the known issues for NeMo Retriever Library:
-
-- Advanced visual parsing is not supported on RTX Pro 6000, B200, or H200 NVL. For details, refer to [Advanced Visual Parsing](nemoretriever-parse.md) and [Support Matrix](support-matrix.md).
-- The Page Elements NIM (`nemoretriever-page-elements-v3:1.7.0`) may intermittently fail during inference under high-concurrency workloads. This happens when Triton’s dynamic batching combines requests that exceed the model’s maximum batch size, a situation more commonly seen in multi-GPU setups or large ingestion runs. In these cases, extraction fails for the impacted documents. A correction is planned for `nemoretriever-page-elements-v3:1.7.1`.
 
 
 ## Release Notes for Previous Versions

From e5e3b36f5a7a2c21ad02f0043e644252dc27b14b Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Fri, 13 Mar 2026 16:41:32 -0700
Subject: [PATCH 26/55] update rns (#1624)

---
 docs/docs/extraction/releasenotes-nv-ingest.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 56c4c71f3..22aeac345 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -6,8 +6,16 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
     NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
+    
+
 ## 26.03 Release Notes (in progress)
 
+The NeMo Retriever Library 26.03 release adds new hardware and software support, and other improvements.
+
+To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md).
+
+Updates and enhancements in the 26.03 release include the following:
+
 - NV-Ingest github repo renamed to NeMo-Retriever 
 - NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library
 - NeMo Retriever Library now support two deployment options:
@@ -20,8 +28,6 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 NeMo Retriever Library currently does not support image captioning via VLM. It will be added in the next release.
 
-
-
 ## Release Notes for Previous Versions
 
 | [26.1.1](https://docs.nvidia.com/nemo/retriever/26.1.1/extraction/releasenotes-nv-ingest/)

From ce8133daf7efc25d07fb96888a49ef528dae3e91 Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Sat, 14 Mar 2026 10:20:55 -0400
Subject: [PATCH 27/55] Fix score (#1627)

---
 nemo_retriever/src/nemo_retriever/recall/core.py | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/nemo_retriever/src/nemo_retriever/recall/core.py b/nemo_retriever/src/nemo_retriever/recall/core.py
index 882e3722b..0f4fd4a2d 100644
--- a/nemo_retriever/src/nemo_retriever/recall/core.py
+++ b/nemo_retriever/src/nemo_retriever/recall/core.py
@@ -189,16 +189,10 @@ def _hits_to_keys(raw_hits: List[List[Dict[str, Any]]]) -> List[List[str]]:
     for hits in raw_hits:
         keys: List[str] = []
         for h in hits:
-            page_number = h["page_number"]
-            source = h["source"]
             page_number = h["page_number"]
             source = h["source"]
             # Prefer explicit `pdf_page` column; fall back to derived form.
             # if res.get("page_number") is not None and source.get("source_id"):
-            if page_number is not None and source:
-                filename = Path(source).stem
-                keys.append(f"{filename}_{str(page_number)}")
-            # if res.get("page_number") is not None and source.get("source_id"):
             if page_number is not None and source:
                 filename = Path(source).stem
                 keys.append(f"{filename}_{str(page_number)}")

From 8908e219171470bb7acbe74861ebea6bc5ed3652 Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Sat, 14 Mar 2026 13:24:24 -0400
Subject: [PATCH 28/55] rm assert on rerank and readme (#1628)

---
 README.md                                     | 465 ++++++++++++++++++
 .../src/nemo_retriever/retriever.py           |   3 -
 2 files changed, 465 insertions(+), 3 deletions(-)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 000000000..f769225f8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,465 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+**Important: The default branch is main, which tracks active development and may be ahead of the latest supported release.**
+
+For the latest stable release:
+
+Use the latest release/* branch (for example, release/26.1.2) from the branch dropdown.
+
+See the corresponding NeMo Retriever Library documentation at https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/
+
+# NeMo Retriever Library
+
+NeMo Retriever Library is a scalable, performance-oriented document content and metadata extraction microservice. It uses specialized NVIDIA NIM microservices 
+to find, contextualize, and extract text, tables, charts and infographics that you can use in downstream generative applications.
+
+> [!Note]
+> NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
+
+NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. From there, NeMo Retriever Library can optionally manage computation of embeddings for the extracted content, and optionally manage storage in a [Milvus](https://milvus.io/) vector database.
+
+> [!Note]
+> Cached and Deplot are deprecated. Instead, NeMo Retriever extraction now uses the yolox-graphic-elements NIM. With this change, you should now be able to run NeMo Retriever Extraction on a single 24GB A10G or better GPU. If you want to use the old pipeline, with Cached and Deplot, use the [NeMo Retriever Extraction 24.12.1 release](https://github.com/NVIDIA/nv-ingest/tree/24.12.1).
+
+
+The following diagram shows the NeMo Retriever Library pipeline.
+
+![Pipeline Overview](https://docs.nvidia.com/nemo/retriever/extraction/images/overview-extraction.png)
+
+## Table of Contents
+1. [NeMo Retriever Library](#nemo-retriever-library)
+2. [Prerequisites](#prerequisites)
+3. [Quickstart](#library-mode-quickstart)
+4. [Benchmarking](#benchmarking)
+5. [GitHub Repository Structure](#github-repository-structure)
+6. [Notices](#notices)
+
+
+## What is NeMo Retriever Library?
+
+The NeMo Retriever Library is a library and microservice framework designed to perform the following functions:
+
+- Accept a job specification that contains a document payload and a set of ingestion tasks to perform on that payload.
+- Store the result of each job to retrieve later. The result is a dictionary that contains a list of metadata that describes the objects extracted from the base document, and processing annotations and timing/trace data.
+- Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, extraction is performed by using pdfium, [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse), Unstructured.io, and Adobe Content Extraction Services.
+- Support various types of pre-processing and post-processing operations, including text splitting and chunking, transformation and filtering, embedding generation, and image offloading to storage.
+
+
+NeMo Retriever Extraction supports the following file types:
+
+- `avi` (early access)
+- `bmp`
+- `docx`
+- `html` (converted to markdown format)
+- `jpeg`
+- `json` (treated as text)
+- `md` (treated as text)
+- `mkv` (early access)
+- `mov` (early access)
+- `mp3`
+- `mp4` (early access)
+- `pdf`
+- `png`
+- `pptx`
+- `sh` (treated as text)
+- `tiff`
+- `txt`
+- `wav`
+
+
+### What NeMo Retriever Library Isn't
+
+NeMo Retriever Library does not do the following:
+
+- Run a static pipeline or fixed set of operations on every submitted document.
+- Act as a wrapper for any specific document parsing library.
+
+
+For more information, refer to the [NeMo Retriever Library documentation](https://docs.nvidia.com/nemo/retriever/extraction/overview/).
+
+## Documentation Resources
+
+- **[Official Documentation](https://docs.nvidia.com/nemo/retriever/extraction/)** - Complete user guides, API references, and deployment instructions
+- **[Getting Started Guide](https://docs.nvidia.com/nemo/retriever/extraction/overview/)** - Overview and prerequisites for production deployments
+- **[Benchmarking Guide](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/)** - Performance testing and recall evaluation framework
+- **[MIG Deployment](https://docs.nvidia.com/nemo/retriever/extraction/mig-benchmarking/)** - Multi-Instance GPU configurations for Kubernetes
+- **[API Documentation](https://docs.nvidia.com/nemo/retriever/extraction/api/)** - Python client and API reference
+
+
+## Prerequisites
+
+For production-level performance and scalability, we recommend that you deploy the pipeline and supporting NIMs by using Docker Compose or Kubernetes ([helm charts](helm)). For more information, refer to [prerequisites](https://docs.nvidia.com/nv-ingest/user-guide/getting-started/prerequisites).
+
+
+## Library Mode Quickstart
+
+For small-scale workloads, such as workloads of fewer than 100 PDFs, you can use the library mode setup. Library mode setup depends on NIMs that are already self-hosted, or, by default, NIMs that are hosted on build.nvidia.com.
+
+Library mode deployment of nv-ingest requires:
+
+- Linux operating systems (Ubuntu 22.04 or later recommended) or macOS
+- Python 3.12
+- We strongly advise using an isolated Python virtual env with [uv](https://docs.astral.sh/uv/getting-started/installation/).
+
+### Step 1: Prepare Your Environment
+
+Create a fresh Python environment to install nv-ingest and dependencies.
+
+```shell
+uv venv --python 3.12 nvingest && \
+  source nvingest/bin/activate && \
+  uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2 milvus-lite==2.4.12
+```
+
+Set your NVIDIA_API_KEY. If you don't have a key, you can get one on [build.nvidia.com](https://org.ngc.nvidia.com/setup/api-keys). For instructions, refer to [Generate Your NGC Keys](docs/docs/extraction/ngc-api-key.md).
+
+```
+export NVIDIA_API_KEY=nvapi-...
+```
+
+### Step 2: Ingest Documents
+
+You can submit jobs programmatically in Python.
+
+To confirm that you have activated your Python environment, run `which python` and confirm that you see `nvingest` in the result. You can do this before any Python command that you run.
+
+```
+which python
+/home/dev/projects/nv-ingest/nvingest/bin/python
+```
+
+If you have a very high number of CPUs, and see the process hang without progress, we recommend that you use `taskset` to limit the number of CPUs visible to the process. Use the following code.
+
+```
+taskset -c 0-3 python your_ingestion_script.py
+```
+
+On a low-end laptop with 4 CPU cores, the following code should take about 10 seconds.
+
+```python
+import time
+
+from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
+from nv_ingest_client.client import Ingestor, NvIngestClient
+from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
+from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob
+
+def main():
+    # Start the pipeline subprocess for library mode
+    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
+
+    client = NvIngestClient(
+        message_client_allocator=SimpleClient,
+        message_client_port=7671,
+        message_client_hostname="localhost",
+    )
+
+    # gpu_cagra accelerated indexing is not available in milvus-lite
+    # Provide a filename for milvus_uri to use milvus-lite
+    milvus_uri = "milvus.db"
+    collection_name = "test"
+    sparse = False
+
+    # do content extraction from files
+    ingestor = (
+        Ingestor(client=client)
+        .files("data/multimodal_test.pdf")
+        .extract(
+            extract_text=True,
+            extract_tables=True,
+            extract_charts=True,
+            extract_images=True,
+            table_output_format="markdown",
+            extract_infographics=True,
+            # extract_method="nemotron_parse", #Slower, but maximally accurate, especially for PDFs with pages that are scanned images
+            text_depth="page",
+        )
+        .embed()
+        .vdb_upload(
+            collection_name=collection_name,
+            milvus_uri=milvus_uri,
+            sparse=sparse,
+            # dense_dim is 2048 for the llama-3.2 embedder; use 1024 for e5-v5
+            dense_dim=2048,
+        )
+    )
+
+    print("Starting ingestion..")
+    t0 = time.time()
+
+    # Return both successes and failures
+    # Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
+    results, failures = ingestor.ingest(show_progress=True, return_failures=True)
+
+    # Return only successes
+    # results = ingestor.ingest(show_progress=True)
+
+    t1 = time.time()
+    print(f"Total time: {t1 - t0} seconds")
+
+    # results blob is directly inspectable
+    if results:
+        print(ingest_json_results_to_blob(results[0]))
+
+    # (optional) Review any failures that were returned
+    if failures:
+        print(f"There were {len(failures)} failures. Sample: {failures[0]}")
+
+if __name__ == "__main__":
+    main()
+```
+
+You can see the extracted text that represents the content of the ingested test document.
+
+```shell
+Starting ingestion..
+Total time: 9.243880033493042 seconds
+
+TestingDocument
+A sample document with headings and placeholder text
+Introduction
+This is a placeholder document that can be used for any purpose. It contains some 
+headings and some placeholder text to fill the space. The text is not important and contains 
+no real value, but it is useful for testing. Below, we will have some simple tables and charts 
+that we can use to confirm Ingest is working as expected.
+Table 1
+This table describes some animals, and some activities they might be doing in specific 
+locations.
+Animal Activity Place
+Gira@e Driving a car At the beach
+Lion Putting on sunscreen At the park
+Cat Jumping onto a laptop In a home o@ice
+Dog Chasing a squirrel In the front yard
+Chart 1
+This chart shows some gadgets, and some very fictitious costs.
+... document extract continues ...
+```
+
+### Step 3: Query Ingested Content
+
+To query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code.
+
+```python
+import os
+from openai import OpenAI
+from nv_ingest_client.util.milvus import nvingest_retrieval
+
+milvus_uri = "milvus.db"
+collection_name = "test"
+sparse = False
+
+queries = ["Which animal is responsible for the typos?"]
+
+retrieved_docs = nvingest_retrieval(
+    queries,
+    collection_name,
+    milvus_uri=milvus_uri,
+    hybrid=sparse,
+    top_k=1,
+)
+
+# simple generation example
+extract = retrieved_docs[0][0]["entity"]["text"]
+client = OpenAI(
+    base_url="https://integrate.api.nvidia.com/v1",
+    api_key=os.environ["NVIDIA_API_KEY"],
+)
+
+prompt = f"Using the following content: {extract}\n\n Answer the user query: {queries[0]}"
+print(f"Prompt: {prompt}")
+completion = client.chat.completions.create(
+    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
+    messages=[{"role": "user", "content": prompt}],
+)
+response = completion.choices[0].message.content
+
+print(f"Answer: {response}")
+```
+
+```shell
+Prompt: Using the following content: Table 1
+| This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. |
+| Animal | Activity | Place |
+| Giraffe | Driving a car | At the beach |
+| Lion | Putting on sunscreen | At the park |
+| Cat | Jumping onto a laptop | In a home office |
+| Dog | Chasing a squirrel | In the front yard |
+
+ Answer the user query: Which animal is responsible for the typos?
+Answer: A clever query!
+
+Based on the provided Table 1, I'd make an educated inference to answer your question. Since the activities listed are quite unconventional for the respective animals (e.g., a giraffe driving a car, a lion putting on sunscreen), it's likely that the table is using humor or hypothetical scenarios.
+
+Given this context, the question "Which animal is responsible for the typos?" is probably a tongue-in-cheek inquiry, as there's no direct information in the table about typos or typing activities.
+
+However, if we were to make a playful connection, we could look for an animal that's:
+
+1. Typically found in a setting where typing might occur (e.g., an office).
+2. Engaging in an activity that could potentially lead to typos (e.g., interacting with a typing device).
+
+Based on these loose criteria, I'd jokingly point to:
+
+**Cat** as the potential culprit, since it's:
+        * Located "In a home office"
+        * Engaged in "Jumping onto a laptop", which could theoretically lead to accidental keystrokes or typos if the cat were to start "walking" on the keyboard!
+
+Please keep in mind that this response is purely humorous and interpretative, as the table doesn't explicitly mention typos or provide a straightforward answer to the question.
+```
+
+> [!TIP]
+> Beyond inspecting the results, you can read them into things like [llama-index](examples/llama_index_multimodal_rag.ipynb) or [langchain](examples/langchain_multimodal_rag.ipynb) retrieval pipelines.
+>
+> Please also check out our [demo using a retrieval pipeline on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with NVIDIA Ingest.
+
+
+## Benchmarking
+
+nv-ingest includes a comprehensive testing framework for benchmarking performance and evaluating retrieval accuracy.
+
+### Quick Start
+
+```bash
+cd tools/harness
+
+uv sync
+
+# Run end-to-end benchmark
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
+
+# Evaluate retrieval accuracy
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767
+```
+
+### Available Benchmarks
+
+- **End-to-End Performance** - Measure ingestion throughput, latency, and resource utilization
+- **Retrieval Accuracy** - Evaluate recall@k metrics against ground truth datasets
+- **MIG Benchmarking** - Test performance with NVIDIA Multi-Instance GPU (MIG) configurations
+
+### Documentation
+
+- **[Testing Framework Guide](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/)** - Complete guide to benchmarking and testing nv-ingest (same as `tools/harness/README.md`)
+- **[MIG Benchmarking](https://docs.nvidia.com/nemo/retriever/extraction/mig-benchmarking/)** - GPU partitioning for multi-tenant deployments on Kubernetes/Helm
+
+### Benchmark Datasets
+
+- **bo767** - 767 PDF documents with ground truth for recall evaluation
+- **bo20** - 20 PDF documents for quick validation
+- **single** - a single multimodal PDF for quick validation
+- **earnings** - earnings reports dataset (PPT and PDF)
+- **financebench** - financial data
+- **Custom datasets** - Use your own datasets with the testing framework
+
+For more information, see the [benchmarking documentation](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/).
+
+
+## GitHub Repository Structure
+
+The following is a description of the folders in the GitHub repository.
+
+- [.devcontainer](https://github.com/NVIDIA/nv-ingest/tree/main/.devcontainer) — VSCode containers for local development
+- [.github](https://github.com/NVIDIA/nv-ingest/tree/main/.github) — GitHub repo configuration files
+- [api](https://github.com/NVIDIA/nv-ingest/tree/main/api) — Core API logic shared across python modules
+- [ci](https://github.com/NVIDIA/nv-ingest/tree/main/ci) — Scripts used to build the nv-ingest container and other packages
+- [client](https://github.com/NVIDIA/nv-ingest/tree/main/client) — Readme, examples, and source code for the nv-ingest-cli utility
+- [config](https://github.com/NVIDIA/nv-ingest/tree/main/config) — Various .yaml files defining configuration for OTEL, Prometheus
+- [data](https://github.com/NVIDIA/nv-ingest/tree/main/data) — Sample PDFs for testing
+- [deploy](https://github.com/NVIDIA/nv-ingest/tree/main/deploy) — Brev.dev-hosted launchable
+- [docker](https://github.com/NVIDIA/nv-ingest/tree/main/docker) — Scripts used by the nv-ingest docker container
+- [docs](https://github.com/NVIDIA/nv-ingest/tree/main/docs/docs) — Documentation for NV Ingest
+- [evaluation](https://github.com/NVIDIA/nv-ingest/tree/main/evaluation) — Notebooks that demonstrate how to test recall accuracy
+- [examples](https://github.com/NVIDIA/nv-ingest/tree/main/examples) — Notebooks, scripts, and tutorial content
+- [helm](https://github.com/NVIDIA/nv-ingest/tree/main/helm) — Documentation for deploying nv-ingest to a Kubernetes cluster via Helm chart
+- [skaffold](https://github.com/NVIDIA/nv-ingest/tree/main/skaffold) — Skaffold configuration
+- [src](https://github.com/NVIDIA/nv-ingest/tree/main/src) — Source code for the nv-ingest pipelines and service
+- [tests](https://github.com/NVIDIA/nv-ingest/tree/main/tests) — Unit tests for nv-ingest
+
+
+## Notices
+
+### Third Party License Notice:
+
+If configured to do so, this project will download and install additional third-party open source software projects.
+Review the license terms of these open source projects before use:
+
+https://pypi.org/project/pdfservices-sdk/
+
+- **`INSTALL_ADOBE_SDK`**:
+  - **Description**: If set to `true`, the Adobe SDK will be installed in the container at launch time. This is
+    required if you want to use the Adobe extraction service for PDF decomposition. Please review the
+    [license agreement](https://github.com/adobe/pdfservices-python-sdk?tab=License-1-ov-file) for the
+    pdfservices-sdk before enabling this option.
+- **Built With Llama**:
+  - **Description**: The NV-Ingest container comes with the `meta-llama/Llama-3.2-1B` tokenizer pre-downloaded so 
+    that the split task can use it for token-based splitting without making a network request. The [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt) governs your use of these Llama materials.
+    
+    If you're building the container yourself and want to pre-download this model, you'll first need to set 
+    `DOWNLOAD_LLAMA_TOKENIZER` to `True`. Because this is a gated model, you'll also need to 
+    [request access](https://huggingface.co/meta-llama/Llama-3.2-1B) and set `HF_ACCESS_TOKEN` to your HuggingFace 
+    access token in order to use it.
+
+
+### Contributing
+
+We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original
+work, or you have rights to submit it under the same license, or a compatible license.
+
+Any contribution that contains commits that are not signed off is not accepted.
+
+To sign off on a commit, use the `--signoff` (or `-s`) option when you commit your changes, as shown in the following example.
+
+```
+$ git commit --signoff --message "Add cool feature."
+```
+
+This appends the following text to your commit message.
+
+```
+Signed-off-by: Your Name <your@email.com>
+```
+
+#### Developer Certificate of Origin (DCO)
+
+The following is the full text of the Developer Certificate of Origin (DCO):
+
+```
+  Developer Certificate of Origin
+  Version 1.1
+
+  Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+  1 Letterman Drive
+  Suite D4700
+  San Francisco, CA, 94129
+
+  Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
+```
+
+```
+  Developer's Certificate of Origin 1.1
+
+  By making a contribution to this project, I certify that:
+
+  (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
+
+  (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
+
+  (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
+
+  (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
+```
+
+
+## Security Considerations
+
+- NeMo Retriever Extraction doesn't generate any code that may require sandboxing.
+- NeMo Retriever Extraction is shared as a reference and is provided "as is". The security in the production environment is the responsibility of the end users deploying it. When deploying in a production environment, please have security experts review any potential risks and threats; define the trust boundaries; implement logging and monitoring capabilities; secure the communication channels; integrate AuthN & AuthZ with appropriate access controls; keep the deployment up to date; and ensure the containers and source code are secure and free of known vulnerabilities.
+- A frontend that handles AuthN & AuthZ should be in place as missing AuthN & AuthZ could provide ungated access to customer models if directly exposed to e.g. the internet, resulting in either cost to the customer, resource exhaustion, or denial of service.
+- NeMo Retriever Extraction doesn't require any privileged access to the system.
+- The end users are responsible for ensuring the availability of their deployment.
+- The end users are responsible for building the container images and keeping them up to date.
+- The end users are responsible for ensuring that OSS packages used by the developer blueprint are current.
+- The logs from nginx proxy, backend, and demo app are printed to standard out. They can include input prompts and output completions for development purposes. The end users are advised to handle logging securely and avoid information leakage for production use cases.
diff --git a/nemo_retriever/src/nemo_retriever/retriever.py b/nemo_retriever/src/nemo_retriever/retriever.py
index 5d3458ce8..36b8daf3c 100644
--- a/nemo_retriever/src/nemo_retriever/retriever.py
+++ b/nemo_retriever/src/nemo_retriever/retriever.py
@@ -288,9 +288,6 @@ def queries(
         )
 
         if self.reranker:
-            assert self.top_k * self.reranker_refine_factor == len(
-                results[0]
-            ), "top_k must be at least 1/4 of the number of retrieved hits for reranking to work properly."
             results = self._rerank_results(query_texts, results)
 
         return results

From 7d112c3f3517749e1a709445d7f4cf90dcc47df6 Mon Sep 17 00:00:00 2001
From: Randy Gelhausen <rgelhausen@nvidia.com>
Date: Mon, 16 Mar 2026 12:31:00 -0400
Subject: [PATCH 29/55] cherry-pick 15b2bc05681599329276e46e83edfa0f15bb4318
 from main

---
 nemo_retriever/README.md | 466 +++++++++++++++------------------------
 1 file changed, 181 insertions(+), 285 deletions(-)

diff --git a/nemo_retriever/README.md b/nemo_retriever/README.md
index f0af47d61..6c6bb1c24 100644
--- a/nemo_retriever/README.md
+++ b/nemo_retriever/README.md
@@ -2,15 +2,19 @@
 
 NeMo Retriever Library is a retrieval-augmented generation (RAG) ingestion pipeline for documents that can parse text, tables, charts, and infographics. NeMo Retriever Library parses documents, creates embeddings, optionally stores embeddings in LanceDB, and performs recall evaluation.
 
-This quick start guide shows how to run NeMo Retriever Library in library mode, directly from your application, without Docker. In library mode, NeMo Retriever Library supports two deployment options:
-- Load Hugging Face models locally on your GPU.
-- Use locally deployed NeMo Retriever NIM endpoints for embedding and OCR.
+This quick start guide shows how to run NeMo Retriever Library as a library, entirely within local Python processes and without containers. NeMo Retriever Library supports two inference options:
+- Pull and run [Nemotron RAG models from Hugging Face](https://huggingface.co/collections/nvidia/nemotron-rag) on your local GPU(s).
+- Make inference calls over the network to NeMo Retriever NIM endpoints that are hosted on build.nvidia.com or deployed locally.
 
-You’ll set up a CUDA 13–compatible environment, install the library and its dependencies, and run GPU‑accelerated ingestion pipelines that convert PDFs, HTML, plain text, and audio into vector embeddings stored in LanceDB, with optional Ray‑based scaling and built‑in recall benchmarking.
+You’ll set up a CUDA 13–compatible environment, install the library and its dependencies, and run GPU‑accelerated ingestion pipelines that convert PDFs, HTML, plain text, audio, or video into vector embeddings stored in LanceDB (on local disk), with Ray‑based scaling and built‑in recall benchmarking.
 
 ## Prerequisites
 
+<<<<<<< HEAD
 Before you start, make sure your system meets the following requirements:
+=======
+Before starting, make sure your system meets the following requirements:
+>>>>>>> 15b2bc05 (Updating the nemo_retriever quickstart README (#1632))
 
 - The host is running CUDA 13.x so that `libcudart.so.13` is available.
 - Your GPUs are visible to the system and compatible with CUDA 13.x.
@@ -25,7 +29,7 @@ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
 
 ## Setup your environment
 
-Complete the following steps to setup your environment. You will create and activate isolated Python and project virtual environments, install the NeMo Retriever Library and its CUDA 13–compatible GPU dependencies, and then run the ingestion, benchmarking, and audio pipelines to validate the full setup.
+Complete the following steps to set up your environment. You will create and activate isolated Python and project virtual environments, install the NeMo Retriever Library and its dependencies, and then run the provided ingestion snippets to validate your setup.
 
 1. Create and activate the NeMo Retriever Library environment
 
@@ -34,28 +38,13 @@ Before installing NeMo Retriever Library, create an isolated Python environment
 In your terminal, run the following commands from any location.
 
 ```bash
-uv venv .nemotron-ocr-test --python 3.12
-source .nemotron-ocr-test/bin/activate
-uv pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple nemo-retriever
+uv venv retriever --python 3.12
+source retriever/bin/activate
+uv pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple nemo-retriever==26.3.0rc2 nv-ingest-client==26.3.0rc2 nv-ingest==26.3.0rc2 nv-ingest-api==26.3.0rc2
 ```
 This creates a dedicated Python environment and installs the `nemo-retriever` PyPI package, the canonical distribution for the NeMo Retriever Library.
 
-2. Install NeMo Retriever Library and Dependencies
-
-Install the latest nightly builds of the NeMo Retriever Library so you can test the most recent features and fixes before they are rolled into a stable release. 
-
-In this step, you install the core library, its API layer, and the client package, ensuring the ingestion pipeline and related tooling all come from a consistent, up‑to‑date version set.
-
-In your terminal, run the following commands from any location.
-
-
-```bash
-uv pip install -i https://test.pypi.org/simple nemo-retriever==2026.3.3.dev20260303 nemo-retriever-api==2026.3.3.dev20260303 nemo-retriever-client==2026.3.3.dev20260303 --no-deps
-uv pip install nemo-retriever nemo-retriever-api nemo-retriever-client
-```
-These packages provide the ingestion pipeline and APIs used by NeMo Retriever Library until everything is consolidated under the single `nemo-retriever` surface.
-
-3. Install CUDA 13 builds of Torch and Torchvision
+2. Install CUDA 13 builds of Torch and Torchvision
 
 To ensure NeMo Retriever Library’s OCR and GPU‑accelerated components run correctly on your system, you need PyTorch and TorchVision builds that are compiled for CUDA 13. In this step, you uninstall any existing Torch/TorchVision packages and reinstall them from a dedicated CUDA 13.0 wheel index so they link against the same CUDA runtime as the rest of your pipeline.
 
@@ -67,170 +56,186 @@ uv pip install torch==2.9.1 torchvision -i https://download.pytorch.org/whl/cu13
 ```
 This ensures the OCR and GPU‑accelerated components in NeMo Retriever Library run against the right CUDA runtime.
 
-4. Set up the NeMo Retriever Library project environment
+## Run the pipeline
 
-For local development, you need a project-scoped environment tied directly to the NeMo Retriever Library source tree. 
+The [test PDF](../data/multimodal_test.pdf) contains text, tables, charts, and images. Additional test data resides [here](../data/).
 
-In this step, you create a virtual environment in the repo itself and install the `nemo_retriever` package in editable mode so you can run examples, tweak the code, and pick up changes without reinstallation.
+> **Note:** `batch` is the primary intended run_mode of operation for this library. Other modes are experimental and subject to change or removal.
 
-Run the following code from the NeMo Retriever Library repo root (NVIDIA/NeMo-Retriever).
-
-```bash
-cd /path/to/NeMo-Retriever
-uv venv .retriever
-source .retriever/bin/activate
-uv pip install -e ./nemo_retriever
-```
-This creates a project-local environment and installs the `nemo_retriever` Python package in editable mode for running the examples.
-
-5. Run the batch pipeline on PDFs
+### Ingest a test PDF
+```python
+from nemo_retriever import create_ingestor
+from nemo_retriever.io import to_markdown, to_markdown_by_page
+from pathlib import Path
 
-In this procedure, you run the end‑to‑end NeMo Retriever Library batch pipeline to ingest a collection of PDFs and generate embeddings for them. Pointing the script at a directory of PDF files lets the pipeline handle parsing, OCR, embedding, optional LanceDB upload, and (if configured) recall evaluation in a single command.
+documents = [str(Path("../data/multimodal_test.pdf"))]
+ingestor = create_ingestor(run_mode="batch")
 
-Run the batch pipeline script and point it at the directory that contains your PDFs using the following command.
+# ingestion tasks are chainable and defined lazily
+ingestor = (
+  ingestor.files(documents)
+  .extract(
+    # below are the default values, but content types can be controlled
+    extract_text=True,
+    extract_charts=True,
+    extract_tables=True,
+    extract_infographics=True
+  )
+  .embed()
+  .vdb_upload()
+)
 
-```bash
-uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/pdfs
+# ingestor.ingest() actually executes the pipeline
+# results are returned as a ray dataset and inspectable as chunks
+ray_dataset = ingestor.ingest()
+chunks = ray_dataset.get_dataset().take_all()
 ```
 
-The first positional argument is the `input-dir`, the directory with the PDF files to ingest.
+### Inspect extracts
+You can inspect how each content type was extracted into text chunks that are optimized for recall accuracy:
+```python
+# page 1 raw text:
+>>> chunks[0]["text"]
+'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose...'
 
-For recall evaluation, the pipeline uses bo767_query_gt.csv from the current working directory by default; you can override this by running the following command.
+# markdown formatted table from the first page
+>>> chunks[1]["text"]
+'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
 
-```bash
-uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/pdfs \
-  --query-csv /path/to/custom_query_gt.csv
-```
+# a chart from the first page
+>>> chunks[2]["text"]
+'Chart 1\nThis chart shows some gadgets, and some very fictitious costs.\nGadgets and their cost\n$160.00\n$140.00\n$120.00\n$100.00\nDollars\n$80.00\n$60.00\n$40.00\n$20.00\n$-\nPowerdrill\nBluetooth speaker\nMinifridge\nPremium desk fan\nHammer\nCost'
 
-If the specified query CSV does not exist, recall evaluation is skipped automatically and only the ingestion process runs.
+# markdown formatting for full pages or documents:
+>>> to_markdown_by_page(chunks).keys()
+dict_keys([1, 2, 3])
 
-By default, the pipeline prints per‑query details (query text, gold answers, and hits); use `--no-recall-details` to show only the missed‑gold summary and overall recall metrics.
+>>> to_markdown_by_page(chunks)[1]
+'## Page 1\n\nTestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\nChart 1 This chart shows some gadgets, and some very fictitious costs...'
 
-To reuse an existing Ray cluster, append --ray-address using the following command.
-
-```bash
---ray-address auto
+# full document markdown
+>>> to_markdown(chunks)
 ```
 
-By doing this the pipeline connects to the running Ray deployment instead of starting a new one.
-
-6. Ingest HTML or plain text instead of PDFs
+Because the ingestion job automatically populated a LanceDB table with all of these chunks, you can query it to retrieve semantically relevant chunks and feed them directly into an LLM.
 
-If your documents aren't stored as PDFs, you can point the same NeMo Retriever Library batch pipeline to directories of HTML or plain text files instead. 
+### Run a recall query
 
-In this step, you either pass an input‑type flag to the batch example for a simple one‑shot run, or use a staged HTML CLI flow for more control over each phase of ingestion.
+```python
+from nemo_retriever.retriever import Retriever
+
+retriever = Retriever(
+  # default values
+  lancedb_uri="lancedb",
+  lancedb_table="nv-ingest",
+  embedder="nvidia/llama-3.2-nv-embedqa-1b-v2",
+  top_k=5,
+  reranker=False
+)
 
-To run the batch example directly on HTML or plain text, use one of the following commands in your terminal.
+query = "Given their activities, which animal is responsible for the typos in my documents?"
 
-```bash
-uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py <dir> --input-type html
+# you can also submit a list of queries with retriever.queries([...])
+hits = retriever.query(query)
 ```
-or
 
-```bash
-uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py <dir> --input-type txt
-```
-Pass the directory that contains your PDFs as the first argument (`input-dir`). For recall evaluation, the pipeline uses `bo767_query_gt.csv` in the current directory by default; override with `--query-csv <path>`. For document-level recall, use `--recall-match-mode pdf_only` with `query,expected_pdf` data. Recall is skipped if the query file does not exist. By default, per-query details (query, gold, hits) are printed; use `--no-recall-details` to print only the missed-gold summary and recall metrics. To use an existing Ray cluster, pass `--ray-address auto`. If OCR fails with a missing `libcudart.so.13`, install the CUDA 13 runtime and set `LD_LIBRARY_PATH` as shown above.
+```python
+# retrieved text from the first page
+>>> hits[0]
+{'text': 'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.', 'metadata': '{"page_number": 1, "pdf_page": "multimodal_test_1", "page_elements_v3_num_detections": 9, "page_elements_v3_counts_by_label": {"table": 1, "chart": 1, "title": 3, "text": 4}, "ocr_table_detections": 1, "ocr_chart_detections": 1, "ocr_infographic_detections": 0}', 'source': '{"source_id": "/home/dev/projects/NeMo-Retriever/data/multimodal_test.pdf"}', 'page_number': 1, '_distance': 1.5822279453277588}
 
-Use `--input-type html` for HTML files and `--input-type txt` for plain text.  HTML inputs are converted to markdown using the same tokenizer and chunking strategy used for `.txt` ingestion.
+# retrieved text of the table from the first page
+>>> hits[1]
+{'text': '| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |', 'metadata': '{"page_number": 1, "pdf_page": "multimodal_test_1", "page_elements_v3_num_detections": 9, "page_elements_v3_counts_by_label": {"table": 1, "chart": 1, "title": 3, "text": 4}, "ocr_table_detections": 1, "ocr_chart_detections": 1, "ocr_infographic_detections": 0}', 'source': '{"source_id": "/home/dev/projects/NeMo-Retriever/data/multimodal_test.pdf"}', 'page_number': 1, '_distance': 1.614684820175171}
+```
 
-For more step‑by‑step control with HTML, use the following staged HTML CLI flow commands instead.
+### Generate a query answer using an LLM
+The retrieval results above can often be fed directly to an LLM for answer generation.
 
+To do so, first install the openai client and set your [build.nvidia.com](https://build.nvidia.com/) API key:
 ```bash
-retriever html run --input-dir <dir>
-retriever local stage5 run --input-dir <dir> --pattern "*.html_extraction.json"
-retriever local stage6 run --input-dir <dir>
+uv pip install openai
+export NVIDIA_API_KEY=nvapi-...
 ```
-`retriever html run` parses the HTML and writes `*.html_extraction.json` sidecar files into the input directory. `retriever local stage5 run` performs downstream processing over those JSON files, and `retriever local stage6 run` completes the final ingestion stages, such as embedding and optional upload, using the same core extraction pipeline.
-
-- Config files:
-  - `nemo_retriever/harness/test_configs.yaml`
-  - `nemo_retriever/harness/nightly_config.yaml`
-- CLI entrypoint is nested under `retriever harness`.
-- First pass is LanceDB-only and enforces recall-required pass/fail by default.
-- Single-run artifact directories default to `<dataset>_<timestamp>`.
-- Dataset-specific recall adapters are supported via config:
-  - `recall_adapter: none` (default passthrough)
-  - `recall_adapter: page_plus_one` (convert zero-indexed `page` CSVs to `pdf_page`)
-  - `recall_adapter: financebench_json` (convert FinanceBench JSON to `query,expected_pdf`)
-  - `recall_match_mode: pdf_page|pdf_only` controls recall matching mode.
-- Dataset presets configured under `/datasets/nv-ingest/...` will fall back to `/raid/$USER/...` when the dataset is not present in `/datasets`.
-- Relative `query_csv` entries in harness YAML resolve from the config file directory first, then fall back to the repo root.
-- The default `financebench` dataset preset now points at `data/financebench_train.json` and enables recall out of the box.
-
-After you’ve finished installing and configuring NeMo Retriever Library, it's a good idea to validate the entire pipeline with a small, known dataset. In this step, you run the batch pipeline module against the sample `bo20` dataset to confirm that ingestion, OCR under CUDA 13, embedding, and any configured recall evaluation all run end‑to‑end without errors.
 
-```bash
-uv run python -m nemo_retriever.examples.batch_pipeline /datasets/nemo-retriever/bo20
-```
-This uses the module form of the NeMo Retriever Library batch pipeline example and points it at a sample dataset directory, verifying both ingestion and OCR under CUDA 13.
+```python
+from openai import OpenAI
+import os
 
-7. Ingest image files
+client = OpenAI(
+  base_url = "https://integrate.api.nvidia.com/v1",
+  api_key = os.environ.get("NVIDIA_API_KEY")
+)
 
-NeMo Retriever Library can ingest standalone image files through the same detection, OCR, and embedding pipeline used for PDFs. Supported formats are PNG, JPEG, BMP, TIFF, and SVG. SVG support requires the optional `cairosvg` package. Each image is treated as a single page.
+hit_texts = [hit["text"] for hit in hits]
+prompt = f"""
+Given the following retrieved documents, answer the question: {query}
 
-To run the batch pipeline on a directory of images, use `--input-type image` to match all supported formats at once.
+Documents:
+{hit_texts}
+"""
 
-```bash
-uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/images \
-  --input-type image
-```
+completion = client.chat.completions.create(
+  model="nvidia/nemotron-3-super-120b-a12b",
+  messages=[{"role":"user","content":prompt}],
+  stream=False
+)
 
-You can also pass a single-format shortcut to restrict which files are picked up.
+answer = completion.choices[0].message.content
+print(answer)
+```
 
-```bash
-uv run python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/images \
-  --input-type png
+Answer:
+```
+Cat is the animal whose activity (jumping onto a laptop) matches the location of the typos, so the cat is responsible for the typos in the documents.
 ```
 
-Valid single-format values are `png`, `jpg`, `jpeg`, `bmp`, `tiff`, `tif`, and `svg`.
+### Ingest other types of content
 
-For in-process mode, build the ingestor chain with `extract_image_files` instead of `extract`.
+For PowerPoint and DOCX files, ensure LibreOffice is installed using your system's package manager. It is required to render their pages as images for our [page-elements content classifier](https://huggingface.co/nvidia/nemotron-page-elements-v3).
 
-```python
-from nemo_retriever import create_ingestor
-from nemo_retriever.params import ExtractParams, EmbedParams
+For example, with apt on Ubuntu:
+```bash
+sudo apt install -y libreoffice
+```
 
+Example usage:
+```python
+# docx and pptx files
+documents = [str(Path(f"../data/*{ext}")) for ext in [".pptx", ".docx"]]
+# mixed types of images
+images = [str(Path(f"../data/*{ext}")) for ext in [".png", ".jpeg", ".bmp"]]
 ingestor = (
-    create_ingestor(run_mode="inprocess")
-    .files("images/*.png")
-    .extract_image_files(
-        ExtractParams(
-            extract_text=True,
-            extract_tables=True,
-            extract_charts=True,
-            extract_infographics=True,
-        )
-    )
-    .embed()
-    .vdb_upload()
-    .ingest()
+  # above file types can be combined into a single job
+  ingestor.files(documents + images)
+  .extract()
 )
 ```
 
-All `ExtractParams` options (`extract_text`, `extract_tables`, `extract_charts`, `extract_infographics`) apply to image ingestion.
+*Note:* the `split()` task uses a tokenizer to split text into chunks of at most `max_tokens` tokens.
 
+<<<<<<< HEAD
 ### Render results as markdown
 
 If you want a readable markdown view of extracted results, pass the full in-process result list
 to `nemo_retriever.io.to_markdown`. The helper now returns a `dict[str, str]` keyed by input
 filename, where each value is the document collapsed into one markdown string without per-page
 headers, so both single-document and multi-document runs follow the same contract.
+=======
+PDF text is split at the page level.
 
-```python
-from nemo_retriever import create_ingestor
-from nemo_retriever.io import to_markdown
+HTML and .txt files have no natural page delimiters, so they almost always need to be paired with the `.split()` task.
+>>>>>>> 15b2bc05 (Updating the nemo_retriever quickstart README (#1632))
 
+```python
+# html and text files - include a split task to prevent texts from exceeding the embedder's max sequence length
+documents = [str(Path(f"../data/*{ext}")) for ext in [".txt", ".html"]]
 ingestor = (
-    create_ingestor(run_mode="inprocess")
-    .files("data/multimodal_test.pdf")
-    .extract(
-        extract_text=True,
-        extract_tables=True,
-        extract_charts=True,
-        extract_infographics=True,
-    )
+  ingestor.files(documents)
+  .extract()
+  .split(max_tokens=5)  # 1024 by default; set low here to demonstrate chunking
 )
+<<<<<<< HEAD
 results = ingestor.ingest()
 markdown_docs = to_markdown(results)
 print(markdown_docs["multimodal_test.pdf"])
@@ -238,175 +243,66 @@ print(markdown_docs["multimodal_test.pdf"])
 
 Use `to_markdown_by_page(results)` when you want a nested
 `dict[str, dict[int, str]]` instead, where each filename maps to its per-page markdown strings.
-
-## Benchmark harness
-
-NeMo Retriever Library includes a lightweight benchmark harness that lets you run repeatable evaluations and sweeps without using Docker. [NeMo Retriever Library benchmarking documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/benchmarking/)
-
-1. Configuration
-
-The harness is configured using the following configuration files:
-
-- `nemo_retriever/harness/test_configs.yaml`  
-- `nemo_retriever/harness/nightly_config.yaml`  
-
-The CLI entrypoint is nested under `retriever harness`. The first pass is LanceDB‑only and enforces recall‑required pass/fail by default, and single‑run artifact directories default to `<dataset>_<timestamp>`. [NeMo Retriever Library benchmarking documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/benchmarking/)
-
-2. Single run
-
-You can run a single benchmark either from a preset dataset name or a direct path.
-
-Preset dataset name
-```bash
-# Dataset preset from test_configs.yaml (recall-required example)
-retriever harness run --dataset jp20 --preset single_gpu
-```
-or
-
-# Direct dataset path
-retriever harness run --dataset /datasets/nv-ingest/bo767 --preset single_gpu
-
-# Add repeatable run or session tags for later review
-retriever harness run --dataset jp20 --preset single_gpu --tag nightly --tag candidate
+=======
 ```
 
-3. Sweep runs
-
-To sweep multiple runs defined in a config file use the following command.
+For audio and video files, ensure ffmpeg is installed using your system's package manager.
+>>>>>>> 15b2bc05 (Updating the nemo_retriever quickstart README (#1632))
 
+For example, with apt on Ubuntu:
 ```bash
-retriever harness sweep --runs-config nemo_retriever/harness/nightly_config.yaml
+sudo apt install -y ffmpeg
 ```
 
-4. Nightly sessions
-
-To orchestrate a full nightly benchmark session use the following command.
-
-```bash
-export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."
-retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml
-retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml --skip-slack
-retriever harness nightly --dry-run
-retriever harness nightly --replay nemo_retriever/artifacts/nightly_20260305_010203_UTC
+```python
+ingestor = create_ingestor(run_mode="batch")
+ingestor = ingestor.files([str(INPUT_AUDIO)]).extract_audio()
 ```
 
-`nemo_retriever/harness/nightly_config.yaml` supports a small top-level `preset:` and `slack:`
-block alongside `runs:`. Keep the webhook secret out of YAML and source control; provide it only
-through the `SLACK_WEBHOOK_URL` environment variable. If the variable is missing, nightly still
-runs and writes artifacts but skips the Slack post. `--replay` lets you resend a previous session
-directory, run directory, or `results.json` file after fixing webhook access.
+### Explore different pipeline options
 
-For reusable box-local automation, the harness also includes shell entrypoints:
+You can use the [Nemotron RAG VL Embedder](https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2):
 
-```bash
-# One-shot nightly run using the repo-local .retriever env
-bash nemo_retriever/harness/run_nightly.sh
-
-# Forever loop that sleeps until the next UTC schedule window, then runs nightly
-tmux new-session -d -s retriever-nightly \
-  "cd /path/to/nv-ingest && export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/...' && \
-   bash nemo_retriever/harness/run_nightly_loop.sh"
+```python
+ingestor = (
+  ingestor.files(documents)
+  .extract()
+  .embed(
+    model_name="nvidia/llama-nemotron-embed-vl-1b-v2",
+    # accepts "text", "image", or "text_image" (paired) inputs
+    embed_modality="text_image"
+  )
+)
 ```
 
-`run_nightly_loop.sh` is intended as a pragmatic fallback for boxes where cron or timers are
-unreliable. It does not require an interactive SSH session once launched inside `tmux`, but it is
-still less robust than a real scheduler such as `systemd` or a cluster job scheduler.
-
-The `--dry-run` option lets you verify the planned runs without executing them. [NeMo Retriever Library benchmarking documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/benchmarking/)
-
-5. Harness artifacts
-
-Each harness run writes a compact artifact set (no full stdout/stderr log persistence):
-
-- `results.json` (normalized metrics + pass/fail + config snapshot + `run_metadata`)
-- `command.txt` (exact invoked command)
-- `runtime_metrics/` (Ray runtime summary + timeline files)
-
-Recall metrics in `results.json` are normalized as `recall_1`, `recall_5`, and `recall_10`.
-Nightly/sweep rollups intentionally focus on compact `summary_metrics`:
-
-- `pages`
-- `ingest_secs`
-- `pages_per_sec_ingest`
-- `recall_5`
-
-By default, detection totals are embedded into `results.json` under `detection_summary`.
-If you want a separate detection file for ad hoc inspection, set `write_detection_file: true` in
-`nemo_retriever/harness/test_configs.yaml`.
-When tags are supplied with `--tag`, they are persisted in `results.json` and in session rollups for sweep/nightly runs.
-
-`results.json` also includes a nested `run_metadata` block for lightweight environment context:
-
-- `host`
-- `gpu_count`
-- `cuda_driver`
-- `ray_version`
-- `python_version`
-
-These fields use best-effort discovery and fall back to `null` or `"unknown"` rather than failing a run.
-
-Sweep/nightly sessions additionally write:
-
-The `runtime_metrics/` directory contains:
-
-When Slack posting is enabled, the nightly summary is built from `session_summary.json` plus each
-run's `results.json`, so the on-disk artifacts remain the source of truth even if you need to replay
-or troubleshoot a failed post later.
-
-### Runtime metrics interpretation
-
-- **`run.runtime.summary.json`** - run totals (input files, pages, elapsed seconds)  
-- **`run.ray.timeline.json`** - detailed Ray execution timeline  
-- **`run.rd_dataset.stats.txt`** - Ray dataset stats dump  
-
-Use `results.json` for routine benchmark comparison, and use the files under `runtime_metrics/` when investigating throughput regressions or stage‑level behavior. [NeMo Retriever Library benchmarking documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/benchmarking/)
-
-6. Artifact size profile
-
-Current benchmark runs show that the LanceDB data dominates the artifact footprint:
-
-### Cron / timer setup
-
-For a simple machine-local schedule, run the nightly command from `cron` or a `systemd` timer on the
-GPU host that already has dataset access and the retriever environment installed.
-
-Example cron entry:
-
-```bash
-0 2 * * * cd /path/to/nv-ingest && source .retriever/bin/activate && \
-  export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..." && \
-  retriever harness nightly --runs-config nemo_retriever/harness/nightly_config.yaml \
-  >> nemo_retriever/artifacts/nightly_cron.log 2>&1
+You can use a different ingestion pipeline based on [Nemotron-Parse](https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.2) combined with the default embedder:
+```python
+ingestor = ingestor.files(documents).extract(method="nemotron_parse")
 ```
 
-If you prefer `systemd`, keep the same command in an `ExecStart=` line and move
-`SLACK_WEBHOOK_URL` into an environment file owned by the machine user so the secret stays out of
-the repo.
-
-### Artifact size profile
+## Run with remote inference, no local GPU required
 
-- **`bo20`** - ~9.0 MiB total, ~8.6 MiB LanceDB  
-- **`jp20`** - ~36.8 MiB total, ~36.2 MiB LanceDB 
-
-## Audio ingestion pipeline
-
-NeMo Retriever Library also supports audio ingestion alongside documents. Audio pipelines typically follow a chained pattern such as the following.  
+For inference hosted on build.nvidia.com, make sure `NVIDIA_API_KEY` is set as an environment variable.
 
 ```python
-.files("mp3/*.mp3").extract_audio(...).embed().vdb_upload().ingest()
+ingestor = (
+  ingestor.files(documents)
+  .extract(
+    # for self-hosted NIMs, your URLs will depend on your NIM container DNS settings
+    page_elements_invoke_url="https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-page-elements-v3",
+    graphic_elements_invoke_url="https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-graphic-elements-v1",
+    ocr_invoke_url="https://ai.api.nvidia.com/v1/cv/nvidia/nemoretriever-ocr-v1",
+    table_structure_invoke_url="https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-table-structure-v1"
+  )
+  .embed(
+    embed_invoke_url="https://integrate.api.nvidia.com/v1/embeddings",
+    model_name="nvidia/llama-nemotron-embed-1b-v2",
+    embed_modality="text",
+  )
+  .vdb_upload()
+)
 ```
 
-This can be run in batch, in‑process, or fused mode within NeMo Retriever Library. [NeMo Retriever Library audio extraction documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/audio/)
-
-### ASR options
-
-For automatic speech recognition (ASR), you have the following two options:
-
-- Local: When `audio_endpoints` are not set, the pipeline uses local HuggingFace ASR (`nvidia/parakeet-ctc-1.1b`) through Transformers with NeMo fallback; no NIM or gRPC endpoint is required. [Parakeet CTC 1.1B model on Hugging Face](https://huggingface.co/nvidia/parakeet-ctc-1.1b)
-- Remote: When `audio_endpoints` is set (for example, Parakeet NIM or self‑deployed Riva gRPC), the pipeline uses the remote client; set `AUDIO_GRPC_ENDPOINT`, `NGC_API_KEY`, and optionally `AUDIO_FUNCTION_ID`. [NeMo Retriever Library audio extraction documentation (25.6.3)](https://docs.nvidia.com/nemo/retriever/25.6.3/extraction/audio/)
-
-See `ingest-config.yaml` (sections `audio_chunk`, `audio_asr`) and audio scripts under `retriever/scripts/` for concrete configuration examples. [NeMo Retriever Library audio extraction documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/audio/)
-
 ## Ray cluster setup
 
 NeMo Retriever Library uses Ray Data for distributed ingestion and benchmarking. [NeMo Ray run guide](https://docs.nvidia.com/nemo/run/latest/guides/ray.html)

From 823775dbba07beb9fb201dfb6ebc1dba44075535 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Mon, 16 Mar 2026 14:26:58 -0400
Subject: [PATCH 30/55] Release prep: update version references to 26.3.0
 (#1638)

---
 docker-compose.yaml                             |  2 +-
 docs/docs/extraction/helm.md                    |  2 +-
 docs/docs/extraction/quickstart-guide.md        |  2 +-
 docs/docs/extraction/quickstart-library-mode.md |  2 +-
 docs/docs/extraction/releasenotes-nv-ingest.md  |  2 +-
 docs/docs/extraction/user-defined-functions.md  |  2 +-
 helm/Chart.yaml                                 |  2 +-
 helm/README.md                                  |  8 ++++----
 helm/README.md.gotmpl                           |  6 +++---
 helm/values.yaml                                |  2 +-
 nemo_retriever/pyproject.toml                   | 12 ++++++------
 src/nv_ingest/api/main.py                       |  2 +-
 tools/harness/pyproject.toml                    | 12 ++++++------
 tools/harness/test_configs.yaml                 |  2 +-
 14 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/docker-compose.yaml b/docker-compose.yaml
index dc7dea85a..080986646 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -262,7 +262,7 @@ services:
       - audio
 
   nv-ingest-ms-runtime:
-    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.3.0-RC4
+    image: nvcr.io/nvidia/nemo-microservices/nv-ingest:26.3.0
     shm_size: 40gb # Should be at minimum 30% of assigned memory per Ray documentation
     build:
       context: ${NV_INGEST_ROOT:-.}
diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index 5e5c787cd..bcc63b4c0 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -3,7 +3,7 @@
 <!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC4/helm/README.md).
+refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
 
 !!! note "Air-gapped environments"
    
diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 111638f1e..428d1c890 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -84,7 +84,7 @@ h. Run the command `docker ps`. You should see output similar to the following.
     CONTAINER ID  IMAGE                                            COMMAND                 CREATED         STATUS                  PORTS            NAMES
 uv venv --python 3.12 nv-ingest-dev
 source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.3.0-RC4 nv-ingest-api==26.3.0-RC4 nv-ingest-client==26.3.0-RC4
+uv pip install nv-ingest==26.3.0 nv-ingest-api==26.3.0 nv-ingest-client==26.3.0
 ```
 
 !!! tip
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index 2e5042f10..9dbbd3e3b 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -34,7 +34,7 @@ Use the following procedure to prepare your environment.
     ```
        uv venv --python 3.12 nvingest && \
          source nvingest/bin/activate && \
-         uv pip install nemo-retriever==26.3.0-RC4 milvus-lite==2.4.12
+         uv pip install nemo-retriever==26.3.0 milvus-lite==2.4.12
     ```
 
     !!! tip
diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 22aeac345..445a58e04 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -12,7 +12,7 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 The NeMo Retriever Library 26.03 release adds new hardware and software support, and other improvements.
 
-To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/helm/README.md).
+To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
 
 Updates and enhancements in the 26.03 release include the following:
 
diff --git a/docs/docs/extraction/user-defined-functions.md b/docs/docs/extraction/user-defined-functions.md
index 62013d1d8..eae1f5a4a 100644
--- a/docs/docs/extraction/user-defined-functions.md
+++ b/docs/docs/extraction/user-defined-functions.md
@@ -941,6 +941,6 @@ def debug_udf(control_message: IngestControlMessage) -> IngestControlMessage:
 
 ## Related Topics
 
-- [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0-RC1/examples/udfs/README.md)
+- [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/examples/udfs/README.md)
 - [User-Defined Stages for NeMo Retriever Library](user-defined-stages.md)
 - [NimClient Usage](nimclient.md)
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 5b018b724..616da93ef 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: nv-ingest
 description: NV-Ingest Microservice
 type: application
-version: 26.3.0-RC4
+version: 26.3.0
 maintainers:
   - name: NVIDIA Corporation
     url: https://www.nvidia.com/
diff --git a/helm/README.md b/helm/README.md
index e1e43362e..d1a752be3 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -45,7 +45,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC4.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -54,7 +54,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.3.0-RC4"
+    --set image.tag="26.3.0"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -105,7 +105,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.3.0-RC4
+pip install nv-ingest-client==26.3.0
 ```
 
 #### Rest Endpoint Ingress
@@ -347,7 +347,7 @@ You can also use NV-Ingest's Python client API to interact with the service runn
 | fullnameOverride | string | `""` |  |
 | image.pullPolicy | string | `"IfNotPresent"` |  |
 | image.repository | string | `"nvcr.io/nvidia/nemo-microservices/nv-ingest"` |  |
-| image.tag | string | `"26.3.0-RC4"` |  |
+| image.tag | string | `"26.3.0"` |  |
 | imagePullSecrets[0].name | string | `"ngc-api"` |  |
 | imagePullSecrets[1].name | string | `"ngc-secret"` |  |
 | ingress.annotations | object | `{}` |  |
diff --git a/helm/README.md.gotmpl b/helm/README.md.gotmpl
index 3686cd1fc..450c62b60 100644
--- a/helm/README.md.gotmpl
+++ b/helm/README.md.gotmpl
@@ -46,7 +46,7 @@ To install or upgrade the Helm chart, run the following code.
 helm upgrade \
     --install \
     nv-ingest \
-    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0-RC4.tgz \
+    https://helm.ngc.nvidia.com/nvidia/nemo-microservices/charts/nv-ingest-26.3.0.tgz \
     -n ${NAMESPACE} \
     --username '$oauthtoken' \
     --password "${NGC_API_KEY}" \
@@ -55,7 +55,7 @@ helm upgrade \
     --set ngcApiSecret.create=true \
     --set ngcApiSecret.password="${NGC_API_KEY}" \
     --set image.repository="nvcr.io/nvidia/nemo-microservices/nv-ingest" \
-    --set image.tag="26.3.0-RC4"
+    --set image.tag="26.3.0"
 ```
 
 Optionally you can create your own versions of the `Secrets` if you do not want to use the creation via the helm chart.
@@ -107,7 +107,7 @@ For more information, refer to [NV-Ingest-Client](https://github.com/NVIDIA/nv-i
 # Just to be cautious we remove any existing installation
 pip uninstall nv-ingest-client
 
-pip install nv-ingest-client==26.3.0-RC4
+pip install nv-ingest-client==26.3.0
 ```
 
 #### Rest Endpoint Ingress
diff --git a/helm/values.yaml b/helm/values.yaml
index d0cdd9364..49132b116 100644
--- a/helm/values.yaml
+++ b/helm/values.yaml
@@ -28,7 +28,7 @@ nameOverride: ""
 image:
   pullPolicy: IfNotPresent
   repository: "nvcr.io/nvidia/nemo-microservices/nv-ingest"
-  tag: "26.3.0-RC4"
+  tag: "26.3.0"
 
 ## @section Pod Configuration
 ## @param podAnnotations [object] Sets additional annotations on the main deployment pods
diff --git a/nemo_retriever/pyproject.toml b/nemo_retriever/pyproject.toml
index fc713c8ce..09607b165 100644
--- a/nemo_retriever/pyproject.toml
+++ b/nemo_retriever/pyproject.toml
@@ -30,9 +30,9 @@ dependencies = [
   "typer>=0.12.0",
   "pyyaml>=6.0",
   "lancedb",
-  "nv-ingest==26.3.0rc4",
-  "nv-ingest-api==26.3.0rc4",
-  "nv-ingest-client==26.3.0rc4",
+  "nv-ingest==26.3.0",
+  "nv-ingest-api==26.3.0",
+  "nv-ingest-client==26.3.0",
   "fastapi>=0.114.0",
   "uvicorn[standard]>=0.30.0",
   "httpx>=0.27.0",
@@ -90,9 +90,9 @@ retriever = "nemo_retriever.__main__:main"
 version = {attr = "nemo_retriever.version.get_build_version"}
 
 [tool.uv.sources]
-nv-ingest = { path = "../src/", editable = true }
-nv-ingest-api = { path = "../api/", editable = true }
-nv-ingest-client = { path = "../client/", editable = true }
+#nv-ingest = { path = "../src/", editable = true }
+#nv-ingest-api = { path = "../api/", editable = true }
+#nv-ingest-client = { path = "../client/", editable = true }
 #nemotron-page-elements-v3 = { index = "test-pypi" }
 #nemotron-graphic-elements-v1 = { index = "test-pypi" }
 #nemotron-table-structure-v1 = { index = "test-pypi" }
diff --git a/src/nv_ingest/api/main.py b/src/nv_ingest/api/main.py
index a5a3f7cb3..4635861db 100644
--- a/src/nv_ingest/api/main.py
+++ b/src/nv_ingest/api/main.py
@@ -23,7 +23,7 @@
 app = FastAPI(
     title="NV-Ingest Microservice",
     description="Service for ingesting heterogenous datatypes",
-    version="26.3.0-RC4",
+    version="26.3.0",
     contact={
         "name": "NVIDIA Corporation",
         "url": "https://nvidia.com",
diff --git a/tools/harness/pyproject.toml b/tools/harness/pyproject.toml
index 2e661b019..e87500a09 100644
--- a/tools/harness/pyproject.toml
+++ b/tools/harness/pyproject.toml
@@ -10,9 +10,9 @@ dependencies = [
     "pyyaml>=6.0",
     "requests>=2.32.5",
     "pynvml>=11.5.0",
-    "nv-ingest==26.3.0rc4",
-    "nv-ingest-api==26.3.0rc4",
-    "nv-ingest-client==26.3.0rc4",
+    "nv-ingest==26.3.0",
+    "nv-ingest-api==26.3.0",
+    "nv-ingest-client==26.3.0",
     "milvus-lite==2.4.12",
     "pypdfium2>=4.30.0,<5.0.0",
     "nemotron-page-elements-v3==3.0.1",
@@ -33,9 +33,9 @@ nv-ingest-harness-stats = "nv_ingest_harness.cli.stats:main"
 package = true
 
 [tool.uv.sources]
-nv-ingest = { path = "../../src", editable = true }
-nv-ingest-api = { path = "../../api/", editable = true }
-nv-ingest-client = { path = "../../client/", editable = true }
+#nv-ingest = { path = "../../src", editable = true }
+#nv-ingest-api = { path = "../../api/", editable = true }
+#nv-ingest-client = { path = "../../client/", editable = true }
 nemotron-page-elements-v3 = { index = "test-pypi" }
 nemotron-graphic-elements-v1 = { index = "test-pypi" }
 nemotron-table-structure-v1 = { index = "test-pypi" }
diff --git a/tools/harness/test_configs.yaml b/tools/harness/test_configs.yaml
index a9278ef62..30b083a8d 100644
--- a/tools/harness/test_configs.yaml
+++ b/tools/harness/test_configs.yaml
@@ -28,7 +28,7 @@ active:
     kubectl_bin: microk8s kubectl  # kubectl binary command (e.g., "kubectl", "microk8s kubectl")
     kubectl_sudo: null  # Prepend sudo to kubectl commands (null = same as helm_sudo)
     chart: nemo-microservices/nv-ingest  # Remote chart reference (set to null to use local chart from ./helm)
-    chart_version: 26.3.0-RC4  # Chart version (required for remote charts)
+    chart_version: 26.3.0  # Chart version (required for remote charts)
     release: nv-ingest
     namespace: nv-ingest
     values_file: .helm-env  # Optional: path to values file

From 7b543855bd0ff710b3784d4c6baf1073b7759c36 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Tue, 17 Mar 2026 11:27:15 -0700
Subject: [PATCH 31/55] 26.03 RNs (#1641)

---
 .../docs/extraction/releasenotes-nv-ingest.md | 55 ++++++++++---------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 445a58e04..c86c35918 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -4,32 +4,38 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
-
-    
-
-## 26.03 Release Notes (in progress)
-
-The NeMo Retriever Library 26.03 release adds new hardware and software support, and other improvements.
-
-To upgrade the Helm Charts for this version, refer to [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
-
-Updates and enhancements in the 26.03 release include the following:
-
-- NV-Ingest github repo renamed to NeMo-Retriever 
-- NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library
-- NeMo Retriever Library now support two deployment options:
-- Load Hugging Face models locally on your GPU.
-- Use locally deployed NeMo Retriever NIM endpoints for embedding and OCR.
-- Note on Air-gapped support 
-- Added support for RTX4500 Pro Blackwell SKU 
-- Added support for llama-nemotron-embed-vl-v2 ?
-
-
-NeMo Retriever Library currently does not support image captioning via VLM. It will be added in the next release.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.   
+
+## 26.03 Release Notes (26.1.3)
+
+NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements.
+
+To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
+
+Highlights for the 26.03 release include:
+
+- NV-Ingest GitHub repo renamed to NeMo-Retriever  
+- NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library  
+- NeMo Retriever Library now supports two deployment options:  
+  - A new no-container, pip-installable in-process library for development (available on PyPI)  
+  - Existing production-ready Helm chart with NIMs  
+- Added documentation notes on air-gapped deployment support  
+- Added documentation notes on OpenShift support  
+- Added support for RTX4500 Pro Blackwell SKU  
+- Added support for llama-nemotron-embed-vl-v2 in text and text+image modes  
+- New extract methods `pdfium_hybrid` and `ocr` target scanned PDFs to improve text and layout extraction from image-based pages  
+- VLM-based image caption enhancements:  
+  - Infographics can be captioned  
+  - Reasoning mode is configurable  
+- Enabled hybrid search with LanceDB  
+- Added a `retrieval_bench` subfolder with a generalizable agentic retrieval pipeline  
+- The project now uses uv as the primary environment and package manager instead of Conda, resulting in faster installs and simpler dependency handling  
+- Default Redis TTL increased from 1–2 hours to 48 hours so long-running jobs (e.g., VLM captioning) don’t expire before completion  
+- NeMo Retriever Library currently does not support image captioning via VLM; this feature will be added in the next release
 
 ## Release Notes for Previous Versions
 
+| [26.1.2](https://docs.nvidia.com/nemo/retriever/26.1.2/extraction/releasenotes-nv-ingest/)
 | [26.1.1](https://docs.nvidia.com/nemo/retriever/26.1.1/extraction/releasenotes-nv-ingest/)
 | [25.9.0](https://docs.nvidia.com/nemo/retriever/25.9.0/extraction/releasenotes-nv-ingest/) 
 | [25.6.3](https://docs.nvidia.com/nemo/retriever/25.6.3/extraction/releasenotes-nv-ingest/) 
@@ -38,9 +44,6 @@ NeMo Retriever Library currently does not support image captioning via VLM. It w
 | [25.3.0](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
 | [24.12.1](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
 | [24.12.0](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
-|
-
-
 
 ## Related Topics
 

From b7be9ba77096361d8abd2762c4163f393c2eafdc Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Tue, 17 Mar 2026 19:12:53 -0700
Subject: [PATCH 32/55] update quickstart library mode (#1642)

---
 .../extraction/quickstart-library-mode.md     | 481 +-----------------
 1 file changed, 2 insertions(+), 479 deletions(-)

diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index 9dbbd3e3b..0878eb5c1 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -1,484 +1,7 @@
-# Deploy Without Containers (Library Mode) for NeMo Retriever Library
-
-[NeMo Retriever Library](overview.md) is typically deployed as a cluster of containers for robust, scalable production use. 
+# NeMo Retriever Library
 
 !!! note
 
     NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
-In addition, you can use library mode, which is intended for the following cases:
-
-- Local development
-- Experimentation and testing
-- Small-scale workloads, such as workloads of fewer than 100 documents
-
-
-By default, library mode depends on NIMs that are hosted on build.nvidia.com. 
-In library mode you launch the main pipeline service directly within a Python process, 
-while all other services (such as embedding and storage) are hosted remotely in the cloud.
-
-To get started using library mode, you need the following:
-
-- Linux operating systems (Ubuntu 22.04 or later recommended) or MacOS
-- Python 3.12
-- We strongly advise using an isolated Python virtual env with [uv](https://docs.astral.sh/uv/getting-started/installation/).
-
-
-
-## Step 1: Prepare Your Environment
-
-Use the following procedure to prepare your environment.
-
-1. Run the following code to create your Python environment.
-
-    ```
-       uv venv --python 3.12 nvingest && \
-         source nvingest/bin/activate && \
-         uv pip install nemo-retriever==26.3.0 milvus-lite==2.4.12
-    ```
-
-    !!! tip
-
-        To confirm that you have activated your virtual environment, run `which python` and confirm that you see your virtual environment path in the result. You can do this before any python command that you run.
-
-2. Set or create a .env file that contains your NVIDIA Build API key and other environment variables.
-
-    !!! note
-
-        If you have an NGC API key, you can use it here. For more information, refer to [Generate Your NGC Keys](ngc-api-key.md) and [Environment Configuration Variables](environment-config.md).
-
-    - To set your variables, use the following code.
-
-        ```
-        export NVIDIA_API_KEY=nvapi-<your key>
-        ```
-    - To add your variables to a .env file, include the following.
-
-        ```
-        NVIDIA_API_KEY=nvapi-<your key>
-        ```
-
-
-## Step 2: Ingest Documents
-
-You can submit jobs programmatically by using Python.
-
-!!! tip
-
-    For more Python examples, refer to [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
-
-
-If you have a very high number of CPUs, and see the process hang without progress, 
-we recommend that you use `taskset` to limit the number of CPUs visible to the process. 
-Use the following code.
-
-```
-taskset -c 0-3 python your_ingestion_script.py
-```
-
-On a 4 CPU core low end laptop, the following code should take about 10 seconds.
-
-```python
-import time
-
-from nemo_retriever.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nemo_retriever.client import Ingestor, NemoRetrieverClient
-from nemo_retriever.util.message_brokers.simple_message_broker import SimpleClient
-from nemo_retriever.util.process_json_files import ingest_json_results_to_blob
-
-def main():
-    # Start the pipeline subprocess for library mode
-    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
-
-    client = NvIngestClient(
-        message_client_allocator=SimpleClient,
-        message_client_port=7671,
-        message_client_hostname="localhost",
-    )
-
-    # gpu_cagra accelerated indexing is not available in milvus-lite
-    # Provide a filename for milvus_uri to use milvus-lite
-    milvus_uri = "milvus.db"
-    collection_name = "test"
-    sparse = False
-
-    # do content extraction from files
-    ingestor = (
-        Ingestor(client=client)
-        .files("data/multimodal_test.pdf")
-        .extract(
-            extract_text=True,
-            extract_tables=True,
-            extract_charts=True,
-            extract_images=True,
-            table_output_format="markdown",
-            extract_infographics=True,
-            # extract_method="nemotron_parse", #Slower, but maximally accurate, especially for PDFs with pages that are scanned images
-            text_depth="page",
-        )
-        .embed()
-        .vdb_upload(
-            collection_name=collection_name,
-            milvus_uri=milvus_uri,
-            sparse=sparse,
-            # for llama-3.2 embedder, use 1024 for e5-v5
-            dense_dim=2048,
-        )
-    )
-
-    print("Starting ingestion..")
-    t0 = time.time()
-
-    # Return both successes and failures
-    # Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
-    results, failures = ingestor.ingest(show_progress=True, return_failures=True)
-
-    # Return only successes
-    # results = ingestor.ingest(show_progress=True)
-
-    t1 = time.time()
-    print(f"Total time: {t1 - t0} seconds")
-
-    # results blob is directly inspectable
-    if results:
-        print(ingest_json_results_to_blob(results[0]))
-
-    # (optional) Review any failures that were returned
-    if failures:
-        print(f"There were {len(failures)} failures. Sample: {failures[0]}")
-
-if __name__ == "__main__":
-    main()
-```
-
-!!! note
-
-    For advanced visual parsing with library mode, uncomment `extract_method="nemotron_parse"` in the previous code. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
-
-
-You can see the extracted text that represents the content of the ingested test document.
-
-```shell
-Starting ingestion..
-Total time: 9.243880033493042 seconds
-
-TestingDocument
-A sample document with headings and placeholder text
-Introduction
-This is a placeholder document that can be used for any purpose. It contains some 
-headings and some placeholder text to fill the space. The text is not important and contains 
-no real value, but it is useful for testing. Below, we will have some simple tables and charts 
-that we can use to confirm Ingest is working as expected.
-Table 1
-This table describes some animals, and some activities they might be doing in specific 
-locations.
-Animal Activity Place
-Gira@e Driving a car At the beach
-Lion Putting on sunscreen At the park
-Cat Jumping onto a laptop In a home o@ice
-Dog Chasing a squirrel In the front yard
-Chart 1
-This chart shows some gadgets, and some very fictitious costs.
-
-... document extract continues ...
-```
-
-## Step 3: Query Ingested Content
-
-To query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code.
-
-```python
-import os
-from openai import OpenAI
-from nemo_retriever.util.milvus import query
-
-milvus_uri = "milvus.db"
-collection_name = "test"
-sparse=False
-
-queries = ["Which animal is responsible for the typos?"]
-
-retrieved_docs = query(
-    queries,
-    collection_name,
-    milvus_uri=milvus_uri,
-    hybrid=sparse,
-    top_k=1,
-)
-
-# simple generation example
-extract = retrieved_docs[0][0]["entity"]["text"]
-client = OpenAI(
-  base_url = "https://integrate.api.nvidia.com/v1",
-  api_key = os.environ["NVIDIA_API_KEY"]
-)
-
-prompt = f"Using the following content: {extract}\n\n Answer the user query: {queries[0]}"
-print(f"Prompt: {prompt}")
-completion = client.chat.completions.create(
-  model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
-  messages=[{"role":"user","content": prompt}],
-)
-response = completion.choices[0].message.content
-
-print(f"Answer: {response}")
-```
-
-```shell
-Prompt: Using the following content: Table 1
-| This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. |
-| Animal | Activity | Place |
-| Giraffe | Driving a car | At the beach |
-| Lion | Putting on sunscreen | At the park |
-| Cat | Jumping onto a laptop | In a home office |
-| Dog | Chasing a squirrel | In the front yard |
-
- Answer the user query: Which animal is responsible for the typos?
-Answer: A clever query!
-
-Based on the provided Table 1, I'd make an educated inference to answer your question. Since the activities listed are quite unconventional for the respective animals (e.g., a giraffe driving a car, a lion putting on sunscreen), it's likely that the table is using humor or hypothetical scenarios.
-
-Given this context, the question "Which animal is responsible for the typos?" is probably a tongue-in-cheek inquiry, as there's no direct information in the table about typos or typing activities.
-
-However, if we were to make a playful connection, we could look for an animal that's:
-
-1. Typically found in a setting where typing might occur (e.g., an office).
-2. Engaging in an activity that could potentially lead to typos (e.g., interacting with a typing device).
-
-Based on these loose criteria, I'd jokingly point to:
-
-**Cat** as the potential culprit, since it's:
-        * Located "In a home office"
-        * Engaged in "Jumping onto a laptop", which could theoretically lead to accidental keystrokes or typos if the cat were to start "walking" on the keyboard!
-
-Please keep in mind that this response is purely humorous and interpretative, as the table doesn't explicitly mention typos or provide a straightforward answer to the question.
-```
-
-
-
-## Logging Configuration
-
-The NeMo Retriever Library uses [Ray](https://docs.ray.io/en/latest/index.html) for logging. 
-For details, refer to [Configure Ray Logging](ray-logging.md).
-
-By default, library mode runs in quiet mode to minimize startup noise. 
-Quiet mode automatically configures the following environment variables.
-
-| Variable                             | Quiet Mode Value | Description |
-|--------------------------------------|------------------|-------------|
-| `INGEST_RAY_LOG_LEVEL`               | `PRODUCTION`     | Sets Ray logging to ERROR level to reduce noise. |
-| `RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO` | `0`              | Silences Ray accelerator warnings |
-| `OTEL_SDK_DISABLED`                  | `true`           | Disables OpenTelemetry trace export errors |
-
-
-If you want to see detailed startup logs for debugging, use one of the following options:
-
-- Set `quiet=False` when you run the pipeline as shown following.
-
-    ```python
-    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True, quiet=False)
-    ```
-
-- Set the environment variables manually before you run the pipeline as shown following.
-
-    ```bash
-    export INGEST_RAY_LOG_LEVEL=DEVELOPMENT  # or DEBUG for maximum verbosity
-    ```
-
-
-
-## Library Mode Communication and Advanced Examples
-
-Communication in library mode is handled through a simplified, 3-way handshake message broker called `SimpleBroker`.
-
-Attempting to run a library-mode process co-located with a Docker Compose deployment does not work by default. 
-The Docker Compose deployment typically creates a firewall rule or port mapping that captures traffic to port `7671`,
-which prevents the `SimpleBroker` from receiving messages. 
-Always ensure that you use library mode in isolation, without an active containerized deployment listening on the same port.
-
-
-### Example `launch_libmode_service.py`
-
-This example launches the pipeline service in a subprocess, 
-and keeps it running until it is interrupted (for example, by pressing `Ctrl+C`). 
-It listens for ingestion requests on port `7671` from an external client.
-
-```python
-import logging
-import os
-
-from nemo_retriever.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nemo_retriever.util.logging.configuration import configure_logging as configure_local_logging
-
-# Configure the logger
-logger = logging.getLogger(__name__)
-
-local_log_level = os.getenv("INGEST_LOG_LEVEL", "DEFAULT")
-if local_log_level in ("DEFAULT",):
-    local_log_level = "INFO"
-
-configure_local_logging(local_log_level)
-
-
-def main():
-    """
-    Launch the libmode pipeline service using the embedded default configuration.
-    """
-    try:
-        # Start pipeline and block until interrupted
-        # Note: stdout/stderr cannot be passed when run_in_subprocess=True (not picklable)
-        # Use quiet=False to see verbose startup logs
-        _ = run_pipeline(
-            block=True,
-            disable_dynamic_scaling=True,
-            run_in_subprocess=True,
-        )
-    except KeyboardInterrupt:
-        logger.info("Keyboard interrupt received. Shutting down...")
-    except Exception as e:
-        logger.error(f"An unexpected error occurred: {e}", exc_info=True)
-
-
-if __name__ == "__main__":
-    main()
-```
-
-### Example `launch_libmode_and_run_ingestor.py`
-
-This example starts the pipeline service in-process, 
-and immediately runs an ingestion client against it in the same parent process.
-
-```python
-import logging
-import os
-import time
-
-from nemo_retriever.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nemo_retriever.util.logging.configuration import configure_logging as configure_local_logging
-from nemo_retriever.util.message_brokers.simple_message_broker import SimpleClient
-from nemo_retriever.client import Ingestor
-from nemo_retriever.client import NemoRetrieverClient
-
-# Configure the logger
-logger = logging.getLogger(__name__)
-
-local_log_level = os.getenv("INGEST_LOG_LEVEL", "INFO")
-if local_log_level in ("DEFAULT",):
-    local_log_level = "INFO"
-
-configure_local_logging(local_log_level)
-
-
-def run_ingestor():
-    """
-    Set up and run the ingestion process to send traffic against the pipeline.
-    """
-    logger.info("Setting up Ingestor client...")
-    client = NvIngestClient(
-        message_client_allocator=SimpleClient, message_client_port=7671, message_client_hostname="localhost"
-    )
-
-    ingestor = (
-        Ingestor(client=client)
-        .files("./data/multimodal_test.pdf")
-        .extract(
-            extract_text=True,
-            extract_tables=True,
-            extract_charts=True,
-            extract_images=True,
-            table_output_format="markdown",
-            extract_infographics=False,
-            text_depth="page",
-        )
-        .split(chunk_size=1024, chunk_overlap=150)
-        .embed()
-    )
-
-    try:
-        results, _ = ingestor.ingest(show_progress=False, return_failures=True)
-        logger.info("Ingestion completed successfully.")
-    except Exception as e:
-        logger.error(f"Ingestion failed: {e}")
-        raise
-
-    print("\nIngest done.")
-    print(f"Got {len(results)} results.")
-
-
-def main():
-    """
-    Launch the libmode pipeline service and run the ingestor against it.
-    Uses the embedded default libmode pipeline configuration.
-    """
-    pipeline = None
-    try:
-        # Start pipeline in subprocess
-        # Note: stdout/stderr cannot be passed when run_in_subprocess=True (not picklable)
-        # Use quiet=False to see verbose startup logs
-        pipeline = run_pipeline(
-            block=False,
-            disable_dynamic_scaling=True,
-            run_in_subprocess=True,
-        )
-        time.sleep(10)
-        run_ingestor()
-        # Run other code...
-    except KeyboardInterrupt:
-        logger.info("Keyboard interrupt received. Shutting down...")
-    except Exception as e:
-        logger.error(f"Error running pipeline: {e}")
-    finally:
-        if pipeline:
-            pipeline.stop()
-            logger.info("Shutting down pipeline...")
-
-
-if __name__ == "__main__":
-    main()
-```
-
-
-
-## The `run_pipeline` Function Reference
-
-The `run_pipeline` function is the main entry point to start the NeMo Retriever Library pipeline. 
-It can run in-process or as a subprocess.
-
-The `run_pipeline` function accepts the following parameters.
-
-| Parameter                | Type                   | Default | Required? | Description                                     |
-|--------------------------|------------------------|---------|-----------|-------------------------------------------------|
-| pipeline_config            | PipelineConfigSchema | —       | Yes       | A configuration object that specifies how the pipeline should be constructed. |
-| run_in_subprocess        | bool                   | False   | Yes       | `True` to launch the pipeline in a separate Python subprocess. `False` to run in the current process. |
-| block                    | bool                   | True    | Yes       | `True` to run the pipeline synchronously. The function returns after it finishes. `False` to return an interface for external pipeline control. |
-| disable_dynamic_scaling  | bool                   | None    | No        | `True` to disable autoscaling regardless of global settings. `None` to use the global default behavior. |
-| dynamic_memory_threshold | float                  | None    | No        | A value between `0.0` and `1.0`. If dynamic scaling is enabled, triggers autoscaling when memory usage crosses this threshold. |
-| stdout                   | TextIO                 | None    | No        | Redirect the subprocess `stdout` to a file or stream. If `None`, defaults to `/dev/null`. |
-| stderr                   | TextIO                 | None    | No        | Redirect subprocess `stderr` to a file or stream. If `None`, defaults to `/dev/null`. |
-| libmode                  | bool                   | True    | No        | `True` to load the default library mode pipeline configuration when `ingest_config` is `None`. |
-| quiet                    | bool                   | None    | No        | `True` to suppress verbose startup logs (PRODUCTION preset). `None` defaults to `True` when `libmode=True`. Set to `False` for verbose output. |
-
-
-The `run_pipeline` function returns the following values, depending on the parameters that you set:
-
-- **run_in_subprocess=False and block=True**  — The function returns a `float` that represents the elapsed time in seconds.
-- **run_in_subprocess=False and block=False** — The function returns a `RayPipelineInterface` object.
-- **run_in_subprocess=True  and block=True**  — The function returns `0.0`.
-- **run_in_subprocess=True  and block=False** — The function returns a `RayPipelineInterface` object.
-
-
-The `run_pipeline` throws the following errors:
-
-- **RuntimeError** — A subprocess failed to start, or exited with error.
-- **Exception** — Any other failure during pipeline setup or execution.
-
-
-
-## Related Topics
-
-- [Prerequisites](prerequisites.md)
-- [Support Matrix](support-matrix.md)
-- [Deploy With Docker Compose (Self-Hosted)](quickstart-guide.md)
-- [Deploy With Helm](helm.md)
-- [Notebooks](notebooks.md)
-- [Enterprise RAG Blueprint](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)
+Use the [Quick Start for NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/26.03/nemo_retriever/README.md) to set up and run the NeMo Retriever Library locally. With it, you can build a GPU‑accelerated, multimodal RAG ingestion pipeline that parses PDFs, HTML, text, audio, and video into LanceDB vector embeddings, integrates with Nemotron RAG models (locally or via NIM endpoints), and includes Ray‑based scaling plus built‑in recall evaluation.
\ No newline at end of file

From 1c6ec7993bcf3fc853034a168a21f413d7b1349b Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Tue, 17 Mar 2026 19:13:08 -0700
Subject: [PATCH 33/55] update release version from 26.1.3 to 26.3.0 on Release
 Notes (#1643)

---
 docs/docs/extraction/releasenotes-nv-ingest.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index c86c35918..2e966efe0 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -6,7 +6,7 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
     NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.   
 
-## 26.03 Release Notes (26.1.3)
+## 26.03 Release Notes (26.3.0)
 
 NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements.
 

From cfd0b729ffee4d4da40c42b214733deee5c92e07 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Tue, 17 Mar 2026 19:13:35 -0700
Subject: [PATCH 34/55] Kheiss/bullets (#1644)

Co-authored-by: sosahi <syousefisahi@nvidia.com>
---
 docs/docs/extraction/releasenotes-nv-ingest.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 2e966efe0..2c444ba09 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -17,16 +17,16 @@ Highlights for the 26.03 release include:
 - NV-Ingest GitHub repo renamed to NeMo-Retriever  
 - NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library  
 - NeMo Retriever Library now supports two deployment options:  
-  - A new no-container, pip-installable in-process library for development (available on PyPI)  
-  - Existing production-ready Helm chart with NIMs  
+   - A new no-container, pip-installable in-process library for development (available on PyPI)  
+   - Existing production-ready Helm chart with NIMs  
 - Added documentation notes on Air-gapped deployment support  
 - Added documentation notes on OpenShift support  
 - Added support for RTX4500 Pro Blackwell SKU  
 - Added support for llama-nemotron-embed-vl-v2 in text and text+image modes  
 - New extract methods `pdfium_hybrid` and `ocr` target scanned PDFs to improve text and layout extraction from image-based pages  
 - VLM-based image caption enhancements:  
-  - Infographics can be captioned  
-  - Reasoning mode is configurable  
+   - Infographics can be captioned  
+   - Reasoning mode is configurable  
 - Enabled hybrid search with Lancedb  
 - Added retrieval_bench subfolder with generalizable agentic retrieval pipeline  
 - The project now uses UV as the primary environment and package manager instead of Conda, resulting in faster installs and simpler dependency handling  

From 818de0af914cb4c418f4b5a1048839050ade5388 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Wed, 18 Mar 2026 09:18:40 -0700
Subject: [PATCH 35/55] Update README.md

removed GitHub artifacts
---
 nemo_retriever/README.md | 15 +--------------
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/nemo_retriever/README.md b/nemo_retriever/README.md
index 6c6bb1c24..49feffc0c 100644
--- a/nemo_retriever/README.md
+++ b/nemo_retriever/README.md
@@ -10,11 +10,7 @@ You’ll set up a CUDA 13–compatible environment, install the library and its
 
 ## Prerequisites
 
-<<<<<<< HEAD
-Before you start, make sure your system meets the following requirements:
-=======
 Before starting, make sure your system meets the following requirements:
->>>>>>> 15b2bc05 (Updating the nemo_retriever quickstart README (#1632))
 
 - The host is running CUDA 13.x so that `libcudart.so.13` is available.
 - Your GPUs are visible to the system and compatible with CUDA 13.x.
@@ -99,7 +95,6 @@ You can inspect how recall accuracy optimized text chunks for various content ty
 'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose...'
 
 # markdown formatted table from the first page
->>> chunks[1]["text"]
 '| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
 
 # a chart from the first page
@@ -213,19 +208,16 @@ ingestor = (
 ```
 
 *Note:* the `split()` task uses a tokenizer to split texts by a max_token length
-
-<<<<<<< HEAD
 ### Render results as markdown
 
 If you want a readable markdown view of extracted results, pass the full in-process result list
 to `nemo_retriever.io.to_markdown`. The helper now returns a `dict[str, str]` keyed by input
 filename, where each value is the document collapsed into one markdown string without per-page
 headers, so both single-document and multi-document runs follow the same contract.
-=======
+
 PDF text is split at the page level.
 
 HTML and .txt files have no natural page delimiters, so they almost always need to be paired with the `.split()` task.
->>>>>>> 15b2bc05 (Updating the nemo_retriever quickstart README (#1632))
 
 ```python
 # html and text files - include a split task to prevent texts from exceeding the embedder's max sequence length
@@ -235,7 +227,6 @@ ingestor = (
   .extract()
   .split(max_tokens=5) #1024 by default, set low here to demonstrate chunking
 )
-<<<<<<< HEAD
 results = ingestor.ingest()
 markdown_docs = to_markdown(results)
 print(markdown_docs["multimodal_test.pdf"])
@@ -243,11 +234,7 @@ print(markdown_docs["multimodal_test.pdf"])
 
 Use `to_markdown_by_page(results)` when you want a nested
 `dict[str, dict[int, str]]` instead, where each filename maps to its per-page markdown strings.
-=======
-```
-
 For audio and video files, ensure ffmpeg is installed by your system's package manager.
->>>>>>> 15b2bc05 (Updating the nemo_retriever quickstart README (#1632))
 
 For example, with apt-get on Ubuntu:
 ```bash

From 671d78aa9b0a945d9cb70a98ed26bae5222079d0 Mon Sep 17 00:00:00 2001
From: Julio Perez <37191411+jperez999@users.noreply.github.com>
Date: Wed, 18 Mar 2026 13:43:28 -0400
Subject: [PATCH 36/55] Updating & simplifying main README (#1647) (#1650)

---
 README.md                | 450 ++++++++-------------------------------
 contributing.md          |  50 +++++
 nemo_retriever/README.md |  16 +-
 3 files changed, 148 insertions(+), 368 deletions(-)
 create mode 100644 contributing.md

diff --git a/README.md b/README.md
index f769225f8..d2358aa76 100644
--- a/README.md
+++ b/README.md
@@ -6,11 +6,9 @@ SPDX-License-Identifier: Apache-2.0
 
 **Important: The default branch is main, which tracks active development and may be ahead of the latest supported release.**
 
-For the latest stable release:
+For the latest stable release, use the [release/26.03 branch](https://github.com/NVIDIA/NeMo-Retriever/tree/26.03).
 
-Use the latest release/* branch (for example, release/26.1.2) from the branch dropdown.
-
-See the corresponding NeMo Retriever Library documentation at https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/
+See the corresponding [NeMo Retriever Library documentation](https://docs.nvidia.com/nemo/retriever/latest/extraction/overview/).
 
 # NeMo Retriever Library
 
@@ -20,223 +18,81 @@ to find, contextualize, and extract text, tables, charts and infographics that y
 > [!Note]
 > NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
 
-NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well defined JSON schema. From there, NeMo Retriever Library can optionally manage computation of embeddings for the extracted content, and optionally manage storing into a vector database [Milvus](https://milvus.io/).
-
-> [!Note]
-> Cached and Deplot are deprecated. Instead, NeMo Retriever extraction now uses the yolox-graphic-elements NIM. With this change, you should now be able to run NeMo Retriever Extraction on a single 24GB A10G or better GPU. If you want to use the old pipeline, with Cached and Deplot, use the [NeMo Retriever Extraction 24.12.1 release](https://github.com/NVIDIA/nv-ingest/tree/24.12.1).
-
+NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. From there, NeMo Retriever Library manages computation of embeddings for the extracted content and stores them in the [Milvus](https://milvus.io/) vector database.
 
 The following diagram shows the NeMo Retriever Library pipeline.
 
 ![Pipeline Overview](https://docs.nvidia.com/nemo/retriever/extraction/images/overview-extraction.png)
 
-## Table of Contents
-1. [NeMo Retriever Library](#nemo-retriever-library)
-2. [Prerequisites](#prerequisites)
-3. [Quickstart](#library-mode-quickstart)
-4. [Benchmarking](#benchmarking)
-5. [GitHub Repository Structure](#github-repository-structure)
-6. [Notices](#notices)
-
-
-## What is NeMo Retriever Library?
-
-The NeMo Retriever Library is a library and microservice framework designed to perform the following functions::
-
-- Accept a job specification that contains a document payload and a set of ingestion tasks to perform on that payload.
-- Store the result of each job to retrieve later. The result is a dictionary that contains a list of metadata that describes the objects extracted from the base document, and processing annotations and timing/trace data.
-- Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, extraction is performed by using pdfium, [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse), Unstructured.io, and Adobe Content Extraction Services.
-- Support various types of before and after processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
-
-
-NeMo Retriever Extraction supports the following file types:
-
-- `avi` (early access)
-- `bmp`
-- `docx`
-- `html` (converted to markdown format)
-- `jpeg`
-- `json` (treated as text)
-- `md` (treated as text)
-- `mkv` (early access)
-- `mov` (early access)
-- `mp3`
-- `mp4` (early access)
-- `pdf`
-- `png`
-- `pptx`
-- `sh` (treated as text)
-- `tiff`
-- `txt`
-- `wav`
-
-
-### What NeMo Retriever Library Isn't
-
-NeMo Retriever Library does not do the following:
-
-- Run a static pipeline or fixed set of operations on every submitted document.
-- Act as a wrapper for any specific document parsing library.
-
-
-For more information, refer to the [NeMo Retriever Library documentation](https://docs.nvidia.com/nemo/retriever/extraction/overview/).
-
-## Documentation Resources
-
-- **[Official Documentation](https://docs.nvidia.com/nemo/retriever/extraction/)** - Complete user guides, API references, and deployment instructions
-- **[Getting Started Guide](https://docs.nvidia.com/nemo/retriever/extraction/overview/)** - Overview and prerequisites for production deployments
-- **[Benchmarking Guide](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/)** - Performance testing and recall evaluation framework
-- **[MIG Deployment](https://docs.nvidia.com/nemo/retriever/extraction/mig-benchmarking/)** - Multi-Instance GPU configurations for Kubernetes
-- **[API Documentation](https://docs.nvidia.com/nemo/retriever/extraction/api/)** - Python client and API reference
-
-
-## Prerequisites
-
-For production-level performance and scalability, we recommend that you deploy the pipeline and supporting NIMs by using Docker Compose or Kubernetes ([helm charts](helm)). For more information, refer to [prerequisites](https://docs.nvidia.com/nv-ingest/user-guide/getting-started/prerequisites).
-
+For production-level performance and scalability, we recommend that you deploy the pipeline and supporting NIMs by using Kubernetes ([helm charts](helm)). For more information, refer to [prerequisites](https://docs.nvidia.com/nv-ingest/user-guide/getting-started/prerequisites).
 
-## Library Mode Quickstart
+*Note*:
+Along with the recent repo name change, we're phasing out the nv-ingest APIs and simplifying the dependencies. You can follow this work and see the forward-looking API in the [nemo_retriever](nemo_retriever) library subfolder.
 
-For small-scale workloads, such as workloads of fewer than 100 PDFs, you can use library mode setup. Library mode set up depends on NIMs that are already self-hosted, or, by default, NIMs that are hosted on build.nvidia.com.
 
-Library mode deployment of nv-ingest requires:
+## Typical Use
 
-- Linux operating systems (Ubuntu 22.04 or later recommended) or MacOS
-- Python 3.12
-- We strongly advise using an isolated Python virtual env with [uv](https://docs.astral.sh/uv/getting-started/installation/).
+For small-scale workloads, such as workloads of fewer than 100 PDFs, you can use our in-development library setup, which works with Hugging Face models on local GPUs or with NIMs hosted on build.nvidia.com.
 
-### Step 1: Prepare Your Environment
-
-Create a fresh Python environment to install nv-ingest and dependencies.
-
-```shell
-uv venv --python 3.12 nvingest && \
-  source nvingest/bin/activate && \
-  uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2 milvus-lite==2.4.12
-```
-
-Set your NVIDIA_API_KEY. If you don't have a key, you can get one on [build.nvidia.com](https://org.ngc.nvidia.com/setup/api-keys). For instructions, refer to [Generate Your NGC Keys](docs/docs/extraction/ngc-api-key.md).
+After [following the quickstart installation steps](nemo_retriever), you can start ingesting content with a snippet like the following:
+```python
+from nemo_retriever import create_ingestor
+from nemo_retriever.io import to_markdown, to_markdown_by_page
+from pathlib import Path
+
+documents = [str(Path("../data/multimodal_test.pdf"))]
+ingestor = create_ingestor(run_mode="batch")
+
+# ingestion tasks are chainable and defined lazily
+ingestor = (
+  ingestor.files(documents)
+  .extract(
+    # below are the default values, but content types can be controlled
+    extract_text=True,
+    extract_charts=True,
+    extract_tables=True,
+    extract_infographics=True
+  )
+  .embed()
+  .vdb_upload()
+)
 
+# ingestor.ingest() actually executes the pipeline
+# results are returned as a Ray dataset and are inspectable as chunks
+ray_dataset = ingestor.ingest()
+chunks = ray_dataset.get_dataset().take_all()
 ```
-export NVIDIA_API_KEY=nvapi-...
-```
-
-### Step 2: Ingest Documents
 
-You can submit jobs programmatically in Python.
-
-To confirm that you have activated your Python environment, run `which python` and confirm that you see `nvingest` in the result. You can do this before any python command that you run.
+You can see the extracted text that represents the content of the ingested test document.
 
-```
-which python
-/home/dev/projects/nv-ingest/nvingest/bin/python
-```
+```python
+# page 1 raw text:
+>>> chunks[0]["text"]
+'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose...'
 
-If you have a very high number of CPUs, and see the process hang without progress, we recommend that you use `taskset` to limit the number of CPUs visible to the process. Use the following code.
+# markdown formatted table from the first page
+>>> chunks[1]["text"]
+'| Table | 1 |\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |'
 
-```
-taskset -c 0-3 python your_ingestion_script.py
-```
+# a chart from the first page
+>>> chunks[2]["text"]
+'Chart 1\nThis chart shows some gadgets, and some very fictitious costs.\nGadgets and their cost\n$160.00\n$140.00\n$120.00\n$100.00\nDollars\n$80.00\n$60.00\n$40.00\n$20.00\n$-\nPowerdrill\nBluetooth speaker\nMinifridge\nPremium desk fan\nHammer\nCost'
 
-On a 4 CPU core low end laptop, the following code should take about 10 seconds.
+# markdown formatting for full pages or documents:
+# document results are keyed by source filename
+>>> to_markdown_by_page(chunks).keys()
+dict_keys(['multimodal_test.pdf'])
 
-```python
-import time
-
-from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nv_ingest_client.client import Ingestor, NvIngestClient
-from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
-from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob
-
-def main():
-    # Start the pipeline subprocess for library mode
-    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
-
-    client = NvIngestClient(
-        message_client_allocator=SimpleClient,
-        message_client_port=7671,
-        message_client_hostname="localhost",
-    )
-
-    # gpu_cagra accelerated indexing is not available in milvus-lite
-    # Provide a filename for milvus_uri to use milvus-lite
-    milvus_uri = "milvus.db"
-    collection_name = "test"
-    sparse = False
-
-    # do content extraction from files
-    ingestor = (
-        Ingestor(client=client)
-        .files("data/multimodal_test.pdf")
-        .extract(
-            extract_text=True,
-            extract_tables=True,
-            extract_charts=True,
-            extract_images=True,
-            table_output_format="markdown",
-            extract_infographics=True,
-            # extract_method="nemotron_parse", #Slower, but maximally accurate, especially for PDFs with pages that are scanned images
-            text_depth="page",
-        )
-        .embed()
-        .vdb_upload(
-            collection_name=collection_name,
-            milvus_uri=milvus_uri,
-            sparse=sparse,
-            # for llama-3.2 embedder, use 1024 for e5-v5
-            dense_dim=2048,
-        )
-    )
-
-    print("Starting ingestion..")
-    t0 = time.time()
-
-    # Return both successes and failures
-    # Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
-    results, failures = ingestor.ingest(show_progress=True, return_failures=True)
-
-    # Return only successes
-    # results = ingestor.ingest(show_progress=True)
-
-    t1 = time.time()
-    print(f"Total time: {t1 - t0} seconds")
-
-    # results blob is directly inspectable
-    if results:
-        print(ingest_json_results_to_blob(results[0]))
-
-    # (optional) Review any failures that were returned
-    if failures:
-        print(f"There were {len(failures)} failures. Sample: {failures[0]}")
-
-if __name__ == "__main__":
-    main()
-```
+# results per document are keyed by page number
+>>> to_markdown_by_page(chunks)["multimodal_test.pdf"].keys()
+dict_keys([1, 2, 3])
 
-You can see the extracted text that represents the content of the ingested test document.
+>>> to_markdown_by_page(chunks)["multimodal_test.pdf"][1]
+'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\n\n### Table 1\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\n### Chart 1\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. 
Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\n\n### Table 2\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\n### Chart 2\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\n\n### Table 3\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\n### Chart 3\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost'
 
-```shell
-Starting ingestion..
-Total time: 9.243880033493042 seconds
-
-TestingDocument
-A sample document with headings and placeholder text
-Introduction
-This is a placeholder document that can be used for any purpose. It contains some 
-headings and some placeholder text to fill the space. The text is not important and contains 
-no real value, but it is useful for testing. Below, we will have some simple tables and charts 
-that we can use to confirm Ingest is working as expected.
-Table 1
-This table describes some animals, and some activities they might be doing in specific 
-locations.
-Animal Activity Place
-Gira@e Driving a car At the beach
-Lion Putting on sunscreen At the park
-Cat Jumping onto a laptop In a home o@ice
-Dog Chasing a squirrel In the front yard
-Chart 1
-This chart shows some gadgets, and some very fictitious costs.
-... document extract continues ...
+# full document markdown also keyed by source filename
+>>> to_markdown(chunks).keys()
+dict_keys(['multimodal_test.pdf'])
 ```
 
 ### Step 3: Query Ingested Content
@@ -244,70 +100,43 @@ This chart shows some gadgets, and some very fictitious costs.
 To query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code.
 
 ```python
-import os
+from nemo_retriever.retriever import Retriever
 from openai import OpenAI
-from nv_ingest_client.util.milvus import nvingest_retrieval
+import os
 
-milvus_uri = "milvus.db"
-collection_name = "test"
-sparse = False
+retriever = Retriever()
 
-queries = ["Which animal is responsible for the typos?"]
+query = "Given their activities, which animal is responsible for the typos in my documents?"
 
-retrieved_docs = nvingest_retrieval(
-    queries,
-    collection_name,
-    milvus_uri=milvus_uri,
-    hybrid=sparse,
-    top_k=1,
-)
+# you can also submit a list with retriever.queries[...]
+hits = retriever.query(query)
 
-# simple generation example
-extract = retrieved_docs[0][0]["entity"]["text"]
 client = OpenAI(
-    base_url="https://integrate.api.nvidia.com/v1",
-    api_key=os.environ["NVIDIA_API_KEY"],
+  base_url="https://integrate.api.nvidia.com/v1",
+  api_key=os.environ.get("NVIDIA_API_KEY")
 )
 
-prompt = f"Using the following content: {extract}\n\n Answer the user query: {queries[0]}"
-print(f"Prompt: {prompt}")
+hit_texts = [hit["text"] for hit in hits]
+prompt = f"""
+Given the following retrieved documents, answer the question: {query}
+
+Documents:
+{hit_texts}
+"""
+
 completion = client.chat.completions.create(
-    model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
-    messages=[{"role": "user", "content": prompt}],
+  model="nvidia/nemotron-3-super-120b-a12b",
+  messages=[{"role":"user","content":prompt}],
+  stream=False
 )
-response = completion.choices[0].message.content
 
-print(f"Answer: {response}")
+answer = completion.choices[0].message.content
+print(answer)
 ```
 
+Answer:
 ```shell
-Prompt: Using the following content: Table 1
-| This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. |
-| Animal | Activity | Place |
-| Giraffe | Driving a car | At the beach |
-| Lion | Putting on sunscreen | At the park |
-| Cat | Jumping onto a laptop | In a home office |
-| Dog | Chasing a squirrel | In the front yard |
-
- Answer the user query: Which animal is responsible for the typos?
-Answer: A clever query!
-
-Based on the provided Table 1, I'd make an educated inference to answer your question. Since the activities listed are quite unconventional for the respective animals (e.g., a giraffe driving a car, a lion putting on sunscreen), it's likely that the table is using humor or hypothetical scenarios.
-
-Given this context, the question "Which animal is responsible for the typos?" is probably a tongue-in-cheek inquiry, as there's no direct information in the table about typos or typing activities.
-
-However, if we were to make a playful connection, we could look for an animal that's:
-
-1. Typically found in a setting where typing might occur (e.g., an office).
-2. Engaging in an activity that could potentially lead to typos (e.g., interacting with a typing device).
-
-Based on these loose criteria, I'd jokingly point to:
-
-**Cat** as the potential culprit, since it's:
-        * Located "In a home office"
-        * Engaged in "Jumping onto a laptop", which could theoretically lead to accidental keystrokes or typos if the cat were to start "walking" on the keyboard!
-
-Please keep in mind that this response is purely humorous and interpretative, as the table doesn't explicitly mention typos or provide a straightforward answer to the question.
+Cat is the animal whose activity (jumping onto a laptop) matches the location of the typos, so the cat is responsible for the typos in the documents.
 ```
 
 > [!TIP]
@@ -315,69 +144,13 @@ Please keep in mind that this response is purely humorous and interpretative, as
 >
 > Please also checkout our [demo using a retrieval pipeline on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted w/ NVIDIA Ingest.
 
+## Documentation Resources
 
-## Benchmarking
-
-nv-ingest includes a comprehensive testing framework for benchmarking performance and evaluating retrieval accuracy.
-
-### Quick Start
-
-```bash
-cd tools/harness
-
-uv sync
-
-# Run end-to-end benchmark
-uv run nv-ingest-harness-run --case=e2e --dataset=bo767
-
-# Evaluate retrieval accuracy
-uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767
-```
-
-### Available Benchmarks
-
-- **End-to-End Performance** - Measure ingestion throughput, latency, and resource utilization
-- **Retrieval Accuracy** - Evaluate recall@k metrics against ground truth datasets
-- **MIG Benchmarking** - Test performance with NVIDIA Multi-Instance GPU (MIG) configurations
-
-### Documentation
-
-- **[Testing Framework Guide](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/)** - Complete guide to benchmarking and testing nv-ingest (same as `tools/harness/README.md`)
-- **[MIG Benchmarking](https://docs.nvidia.com/nemo/retriever/extraction/mig-benchmarking/)** - GPU partitioning for multi-tenant deployments on Kubernetes/Helm
-
-### Benchmark Datasets
-
-- **bo767** - 767 PDF documents with ground truth for recall evaluation
-- **bo20** - 20 PDF documents for quick validation
-- **single** - singular multimodal pdf for quick validation
-- **earnings** - earnings reports ppt and pdf dataset
--- **financebench** - financial data
-- **Custom datasets** - Use your own datasets with the testing framework
-
-For more information, see the [benchmarking documentation](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/).
-
-
-## GitHub Repository Structure
-
-The following is a description of the folders in the GitHub repository.
-
-- [.devcontainer](https://github.com/NVIDIA/nv-ingest/tree/main/.devcontainer) — VSCode containers for local development
-- [.github](https://github.com/NVIDIA/nv-ingest/tree/main/.github) — GitHub repo configuration files
-- [api](https://github.com/NVIDIA/nv-ingest/tree/main/api) — Core API logic shared across python modules
-- [ci](https://github.com/NVIDIA/nv-ingest/tree/main/ci) — Scripts used to build the nv-ingest container and other packages
-- [client](https://github.com/NVIDIA/nv-ingest/tree/main/client) — Readme, examples, and source code for the nv-ingest-cli utility
-- [config](https://github.com/NVIDIA/nv-ingest/tree/main/config) — Various .yaml files defining configuration for OTEL, Prometheus
-- [data](https://github.com/NVIDIA/nv-ingest/tree/main/data) — Sample PDFs for testing
-- [deploy](https://github.com/NVIDIA/nv-ingest/tree/main/deploy) — Brev.dev-hosted launchable
-- [docker](https://github.com/NVIDIA/nv-ingest/tree/main/docker) — Scripts used by the nv-ingest docker container
-- [docs](https://github.com/NVIDIA/nv-ingest/tree/main/docs/docs) — Documentation for NV Ingest
-- [evaluation](https://github.com/NVIDIA/nv-ingest/tree/main/evaluation) — Notebooks that demonstrate how to test recall accuracy
-- [examples](https://github.com/NVIDIA/nv-ingest/tree/main/examples) — Notebooks, scripts, and tutorial content
-- [helm](https://github.com/NVIDIA/nv-ingest/tree/main/helm) — Documentation for deploying nv-ingest to a Kubernetes cluster via Helm chart
-- [skaffold](https://github.com/NVIDIA/nv-ingest/tree/main/skaffold) — Skaffold configuration
-- [src](https://github.com/NVIDIA/nv-ingest/tree/main/src) — Source code for the nv-ingest pipelines and service
-- [tests](https://github.com/NVIDIA/nv-ingest/tree/main/tests) — Unit tests for nv-ingest
-
+- **[Official Documentation](https://docs.nvidia.com/nemo/retriever/extraction/)** - Complete user guides, API references, and deployment instructions
+- **[Getting Started Guide](https://docs.nvidia.com/nemo/retriever/extraction/overview/)** - Overview and prerequisites for production deployments
+- **[Benchmarking Guide](https://docs.nvidia.com/nemo/retriever/extraction/benchmarking/)** - Performance testing and recall evaluation framework
+- **[MIG Deployment](https://docs.nvidia.com/nemo/retriever/extraction/mig-benchmarking/)** - Multi-Instance GPU configurations for Kubernetes
+- **[API Documentation](https://docs.nvidia.com/nemo/retriever/extraction/api/)** - Python client and API reference
 
 ## Notices
 
@@ -402,56 +175,7 @@ https://pypi.org/project/pdfservices-sdk/
     [request access](https://huggingface.co/meta-llama/Llama-3.2-1B) and set `HF_ACCESS_TOKEN` to your HuggingFace 
     access token in order to use it.
 
-
-### Contributing
-
-We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original
-work, or you have rights to submit it under the same license, or a compatible license.
-
-Any contribution which contains commits that are not signed off are not accepted.
-
-To sign off on a commit, use the --signoff (or -s) option when you commit your changes as shown following.
-
-```
-$ git commit --signoff --message "Add cool feature."
-```
-
-This appends the following text to your commit message.
-
-```
-Signed-off-by: Your Name <your@email.com>
-```
-
-#### Developer Certificate of Origin (DCO)
-
-The following is the full text of the Developer Certificate of Origin (DCO)
-
-```
-  Developer Certificate of Origin
-  Version 1.1
-
-  Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
-  1 Letterman Drive
-  Suite D4700
-  San Francisco, CA, 94129
-
-  Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
-```
-
-```
-  Developer's Certificate of Origin 1.1
-
-  By making a contribution to this project, I certify that:
-
-  (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
-
-  (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
-
-  (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
-
-  (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
-```
-
+Before contributing to this project, please review our [Contributor Guide](contributing.md).
 
 ## Security Considerations
 
diff --git a/contributing.md b/contributing.md
new file mode 100644
index 000000000..f8dc3815d
--- /dev/null
+++ b/contributing.md
@@ -0,0 +1,50 @@
+### Contributing
+
+We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original
+work, or you have rights to submit it under the same license, or a compatible license.
+
+Any contribution that contains commits that are not signed off is not accepted.
+
+To sign off on a commit, use the `--signoff` (or `-s`) option when you commit your changes, as shown in the following example.
+
+```
+$ git commit --signoff --message "Add cool feature."
+```
+
+This appends the following text to your commit message.
+
+```
+Signed-off-by: Your Name <your@email.com>
+```
+
+#### Developer Certificate of Origin (DCO)
+
+The following is the full text of the Developer Certificate of Origin (DCO).
+
+```
+  Developer Certificate of Origin
+  Version 1.1
+
+  Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+  1 Letterman Drive
+  Suite D4700
+  San Francisco, CA, 94129
+
+  Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
+```
+
+```
+  Developer's Certificate of Origin 1.1
+
+  By making a contribution to this project, I certify that:
+
+  (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
+
+  (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
+
+  (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
+
+  (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
+```
+
+
diff --git a/nemo_retriever/README.md b/nemo_retriever/README.md
index 49feffc0c..6816ea1f8 100644
--- a/nemo_retriever/README.md
+++ b/nemo_retriever/README.md
@@ -36,7 +36,7 @@ In your terminal, run the following commands from any location.
 ```bash
 uv venv retriever --python 3.12
 source retriever/bin/activate
-uv pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple nemo-retriever==26.3.0rc2 nv-ingest-client==26.3.0rc2 nv-ingest==26.3.0rc2 nv-ingest-api==26.3.0rc2
+uv pip install nemo-retriever==26.3.0 nv-ingest-client==26.3.0 nv-ingest==26.3.0 nv-ingest-api==26.3.0
 ```
 This creates a dedicated Python environment and installs the `nemo-retriever` PyPI package, the canonical distribution for the NeMo Retriever Library.
 
@@ -102,14 +102,20 @@ You can inspect how recall accuracy optimized text chunks for various content ty
 'Chart 1\nThis chart shows some gadgets, and some very fictitious costs.\nGadgets and their cost\n$160.00\n$140.00\n$120.00\n$100.00\nDollars\n$80.00\n$60.00\n$40.00\n$20.00\n$-\nPowerdrill\nBluetooth speaker\nMinifridge\nPremium desk fan\nHammer\nCost'
 
 # markdown formatting for full pages or documents:
+# document results are keyed by source filename
 >>> to_markdown_by_page(chunks).keys()
+dict_keys(['multimodal_test.pdf'])
+
+# results per document are keyed by page number
+>>> to_markdown_by_page(chunks)["multimodal_test.pdf"].keys()
 dict_keys([1, 2, 3])
 
->>> to_markdown_by_page(chunks)[1]
-'## Page 1\n\nTestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\nChart 1 This chart shows some gadgets, and some very fictitious costs...'
+>>> to_markdown_by_page(chunks)["multimodal_test.pdf"][1]
+'TestingDocument\r\nA sample document with headings and placeholder text\r\nIntroduction\r\nThis is a placeholder document that can be used for any purpose. It contains some \r\nheadings and some placeholder text to fill the space. The text is not important and contains \r\nno real value, but it is useful for testing. Below, we will have some simple tables and charts \r\nthat we can use to confirm Ingest is working as expected.\r\nTable 1\r\nThis table describes some animals, and some activities they might be doing in specific \r\nlocations.\r\nAnimal Activity Place\r\nGira@e Driving a car At the beach\r\nLion Putting on sunscreen At the park\r\nCat Jumping onto a laptop In a home o@ice\r\nDog Chasing a squirrel In the front yard\r\nChart 1\r\nThis chart shows some gadgets, and some very fictitious costs.\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\n\n### Table 1\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\n### Chart 1\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\n\n### Table 2\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\n### Chart 2\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost\n\n### Table 3\n\n| This | table | describes | some | animals, | and | some | activities | they | might | be | doing | in | specific |\n| locations. |\n| Animal | Activity | Place |\n| Giraffe | Driving | a | car | At | the | beach |\n| Lion | Putting | on | sunscreen | At | the | park |\n| Cat | Jumping | onto | a | laptop | In | a | home | office |\n| Dog | Chasing | a | squirrel | In | the | front | yard |\n| Chart | 1 |\n\n### Chart 3\n\nChart 1 This chart shows some gadgets, and some very fictitious costs. Gadgets and their cost $160.00 $140.00 $120.00 $100.00 Dollars $80.00 $60.00 $40.00 $20.00 $- Powerdrill Bluetooth speaker Minifridge Premium desk fan Hammer Cost'
 
-# full document markdown
->>> to_markdown(chunks)
+# full document markdown also keyed by source filename
+>>> to_markdown(chunks).keys()
+dict_keys(['multimodal_test.pdf'])
 ```
 
 Since the ingestion job automatically populated a lancedb table with all these chunks, you can use queries to retrieve semantically relevant chunks for feeding directly into an LLM:

From 85168e2282745e955dc1fd464530b604c59f0631 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Wed, 18 Mar 2026 10:51:46 -0700
Subject: [PATCH 37/55] updates to release notes to fix bullets and doc link
 (#1651)

---
 docs/docs/extraction/releasenotes-nv-ingest.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 2c444ba09..726a794fe 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -10,23 +10,23 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements.
 
-To upgrade the Helm charts for this release, refer to the (NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
+To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
 
 Highlights for the 26.03 release include:
 
 - NV-Ingest GitHub repo renamed to NeMo-Retriever  
 - NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library  
 - NeMo Retriever Library now supports two deployment options:  
-   - A new no-container, pip-installable in-process library for development (available on PyPI)  
-   - Existing production-ready Helm chart with NIMs  
+    - A new no-container, pip-installable in-process library for development (available on PyPI)  
+    - Existing production-ready Helm chart with NIMs  
 - Added documentation notes on Air-gapped deployment support  
 - Added documentation notes on OpenShift support  
 - Added support for RTX4500 Pro Blackwell SKU  
 - Added support for llama-nemotron-embed-vl-v2 in text and text+image modes  
 - New extract methods `pdfium_hybrid` and `ocr` target scanned PDFs to improve text and layout extraction from image-based pages  
 - VLM-based image caption enhancements:  
-   - Infographics can be captioned  
-   - Reasoning mode is configurable  
+    - Infographics can be captioned  
+    - Reasoning mode is configurable  
 - Enabled hybrid search with Lancedb  
 - Added retrieval_bench subfolder with generalizable agentic retrieval pipeline  
 - The project now uses UV as the primary environment and package manager instead of Conda, resulting in faster installs and simpler dependency handling  

From 4075ae942fbe243da60be8439a50928e4f99b6d2 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Wed, 18 Mar 2026 11:53:45 -0700
Subject: [PATCH 38/55] Kheiss/5970976 (#1652)

---
 docs/docs/extraction/quickstart-guide.md       | 11 +++++++++++
 docs/docs/extraction/releasenotes-nv-ingest.md |  2 +-
 helm/README.md                                 |  4 ++++
 3 files changed, 16 insertions(+), 1 deletion(-)
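
Reviewer note (placed in the patch notes area, so it is not part of the commit): the four air-gapped steps this patch adds to `quickstart-guide.md` could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not something the repository ships. The `archive_name` helper, the `PROFILE` and `ARCHIVE_DIR` variables, and the `save`/`load` subcommands are hypothetical names chosen for illustration, and the sketch assumes a Docker Compose v2 release that supports `docker compose config --images`.

```shell
#!/usr/bin/env bash
# Sketch: pre-stage images for an air-gapped Docker Compose deployment.
set -euo pipefail

# Assumption: the same profile is used on both machines.
PROFILE="${PROFILE:-retrieval}"
# Hypothetical directory that holds the saved image archives.
ARCHIVE_DIR="${ARCHIVE_DIR:-./image-archives}"

# Hypothetical helper: turn an image reference into a filesystem-safe archive name.
archive_name() {
  printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:@' '___')"
}

# Connected machine: pull and save every image referenced by the profile.
# Run `docker login nvcr.io` first.
save_images() {
  mkdir -p "$ARCHIVE_DIR"
  docker compose --profile "$PROFILE" pull
  docker compose --profile "$PROFILE" config --images | sort -u | while read -r image; do
    docker save -o "$ARCHIVE_DIR/$(archive_name "$image")" "$image"
  done
}

# Air-gapped machine: load every archive, then start the stack.
load_images() {
  for archive in "$ARCHIVE_DIR"/*.tar; do
    docker load -i "$archive"
  done
  docker compose --profile "$PROFILE" up
}

case "${1:-}" in
  save) save_images ;;
  load) load_images ;;
  *) echo "usage: $0 {save|load}" >&2 ;;
esac
```

With the script saved under a name of your choice, run it with `save` on the connected machine, transfer `ARCHIVE_DIR` together with `docker-compose.yaml` (and `.env` if used), then run it with `load` on the air-gapped machine. Writing one archive per image makes it easier to retry a single failed transfer.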

diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 428d1c890..0d02aedcc 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -110,6 +110,17 @@ From this prompt, you can run the `nemo-retriever` CLI and Python examples.
 
 Because many service URIs default to localhost, running inside the `nemo-retriever` container also requires that you specify URIs manually so that services can communicate across containers on the internal Docker network. See the example following for how to set the `milvus_uri`.
 
+## Air-Gapped Deployment (Docker Compose)
+
+When deploying in an air-gapped environment (no internet or NGC registry access), you must pre-stage container images on a machine with network access, then transfer and load them in the isolated environment.
+
+1. **On a machine with network access:** Clone the repo, authenticate with NGC (`docker login nvcr.io`), and pull all images used by your chosen profile (for example, `docker compose --profile retrieval pull`).
+2. **Save images:** Export the images to archives (for example, using `docker save` for each image or a script that saves all images referenced by your [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml)).
+3. **Transfer** the image archives and your `docker-compose.yaml` (and `.env` if used) to the air-gapped system.
+4. **On the air-gapped machine:** Load the images (`docker load -i <archive>`) and start the stack with the same profile (for example, `docker compose --profile retrieval up`).
+
+Ensure the same image tags and `docker-compose.yaml` version are used in both environments so that service configuration stays consistent.
+
 ## Step 3: Ingest Documents
 
 You can submit jobs programmatically in Python or using the [NeMo Retriever Library CLI](cli-reference.md).
diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 726a794fe..4781bb476 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -19,7 +19,7 @@ Highlights for the 26.03 release include:
 - NeMo Retriever Library now supports two deployment options:  
     - A new no-container, pip-installable in-process library for development (available on PyPI)  
     - Existing production-ready Helm chart with NIMs  
-- Added documentation notes on Air-gapped deployment support  
+- Added documentation notes on Air-gapped deployment support for both Helm (Kubernetes) and Docker Compose  
 - Added documentation notes on OpenShift support  
 - Added support for RTX4500 Pro Blackwell SKU  
 - Added support for llama-nemotron-embed-vl-v2 in text and text+image modes  
diff --git a/helm/README.md b/helm/README.md
index d1a752be3..080e0a866 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -12,6 +12,10 @@ Before you install the Helm charts, be sure you meet the hardware and software p
 The [Nvidia nim-operator](https://docs.nvidia.com/nim-operator/latest/install.html) must also be installed and configured in your cluster to ensure that
 the Nvidia NIMs are properly deployed.
 
+## Air-Gapped Deployment (Kubernetes)
+
+For deploying in an air-gapped environment (no internet or NGC registry access), refer to the [NVIDIA NIM Operator documentation on Air-Gapped Environments](https://docs.nvidia.com/nim-operator/latest/air-gap.html), which explains how to deploy NIMs in such environments.
+
 ## Initial Environment Setup
 
 1. Create your namespace by running the following code.

From ebb12539ac05b6034b53ac3de3f23ccb12f23614 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Wed, 18 Mar 2026 12:03:13 -0700
Subject: [PATCH 39/55] Kheiss/5966534 (#1653)

Co-authored-by: sosahi <syousefisahi@nvidia.com>
---
 docs/docs/extraction/support-matrix.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/docs/extraction/support-matrix.md b/docs/docs/extraction/support-matrix.md
index eec709b8c..845e671f6 100644
--- a/docs/docs/extraction/support-matrix.md
+++ b/docs/docs/extraction/support-matrix.md
@@ -10,9 +10,10 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure that you ha
 ## Core and Advanced Pipeline Features
 
 The NeMo Retriever Library core pipeline features run on a single A10G or better GPU. 
+
 The core pipeline features include the following:
 
-- llama3.2-nv-embedqa-1b-v2 — Embedding model for converting text chunks into vectors.
+- llama-nemotron-embed-1b-v2 — Embedding model for converting text chunks into vectors.
 - nemotron-page-elements-v3 — Detects and classifies images on a page as a table, chart or infographic.
 - nemotron-table-structure-v1 — Detects rows, columns, and cells within a table to preserve table structure and convert to Markdown format. 
 - nemotron-graphic-elements-v1 — Detects graphic elements within chart images such as titles, legends, axes, and numerical values. 

From 924a18e93b7ab389028fda29f68d6807366ef0d9 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Wed, 18 Mar 2026 13:27:38 -0700
Subject: [PATCH 40/55] Kheiss/5970976 - change location of air gap
 documentation (#1656)

---
 docs/docs/extraction/quickstart-guide.md | 26 ++++++++++++------------
 helm/README.md.gotmpl                    |  3 +++
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 0d02aedcc..3dae43e66 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -110,18 +110,7 @@ From this prompt, you can run the `nemo-retriever` CLI and Python examples.
 
 Because many service URIs default to localhost, running inside the `nemo-retriever` container also requires that you specify URIs manually so that services can communicate across containers on the internal Docker network. See the example following for how to set the `milvus_uri`.
 
-## Air-Gapped Deployment (Docker Compose)
-
-When deploying in an air-gapped environment (no internet or NGC registry access), you must pre-stage container images on a machine with network access, then transfer and load them in the isolated environment.
-
-1. **On a machine with network access:** Clone the repo, authenticate with NGC (`docker login nvcr.io`), and pull all images used by your chosen profile (for example, `docker compose --profile retrieval pull`).
-2. **Save images:** Export the images to archives (for example, using `docker save` for each image or a script that saves all images referenced by your [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml)).
-3. **Transfer** the image archives and your `docker-compose.yaml` (and `.env` if used) to the air-gapped system.
-4. **On the air-gapped machine:** Load the images (`docker load -i <archive>`) and start the stack with the same profile (for example, `docker compose --profile retrieval up`).
-
-Ensure the same image tags and `docker-compose.yaml` version are used in both environments so that service configuration stays consistent.
-
-## Step 3: Ingest Documents
+## Step 2: Ingest Documents
 
 You can submit jobs programmatically in Python or using the [NeMo Retriever Library CLI](cli-reference.md).
 
@@ -342,7 +331,7 @@ INFO:nemo_retriever.cli.util.processing:Throughput (Pages/sec): 1.28
 INFO:nemo_retriever.cli.util.processing:Throughput (Files/sec): 0.43
 ```
 
-## Step 4: Inspecting and Consuming Results
+## Step 3: Inspecting and Consuming Results
 
 After the ingestion steps above have been completed, you should be able to find the `text` and `image` subfolders inside your processed docs folder. Each will contain JSON-formatted extracted content and metadata.
 
@@ -413,6 +402,16 @@ You can specify multiple `--profile` options.
 | `nemotron-parse`      | Advanced | Use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md). | 
 | `vlm`                 | Advanced | Use [llama 3.1 Nemotron 8B Vision](https://build.nvidia.com/nvidia/llama-3.1-nemotron-nano-vl-8b-v1/modelcard) for experimental image captioning of unstructured images. You can also configure other VLMs for your specific use cases. For more information, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images). | 
 
+## Air-Gapped Deployment (Docker Compose)
+
+When deploying in an air-gapped environment (no internet or NGC registry access), you must pre-stage container images on a machine with network access, then transfer and load them in the isolated environment.
+
+1. **On a machine with network access:** Clone the repo, authenticate with NGC (`docker login nvcr.io`), and pull all images used by your chosen profile (for example, `docker compose --profile retrieval pull`).
+2. **Save images:** Export the images to archives (for example, using `docker save` for each image or a script that saves all images referenced by your [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml)).
+3. **Transfer** the image archives and your `docker-compose.yaml` (and `.env` if used) to the air-gapped system.
+4. **On the air-gapped machine:** Load the images (`docker load -i <archive>`) and start the stack with the same profile (for example, `docker compose --profile retrieval up`).
+
+Ensure the same image tags and `docker-compose.yaml` version are used in both environments so that service configuration stays consistent.
 
 ## Docker Compose override files
 
@@ -499,6 +498,7 @@ This syntax and structure can be repeated for each NIM model used by CAS, ensuri
 
     Advanced features require additional GPU support and disk space. For more information, refer to [Support Matrix](support-matrix.md).
 
+
 ## Related Topics
 
 - [Troubleshoot](troubleshoot.md)
diff --git a/helm/README.md.gotmpl b/helm/README.md.gotmpl
index 450c62b60..5328c8020 100644
--- a/helm/README.md.gotmpl
+++ b/helm/README.md.gotmpl
@@ -12,6 +12,9 @@ Before you install the Helm charts, be sure you meet the hardware and software p
 The [Nvidia nim-operator](https://docs.nvidia.com/nim-operator/latest/install.html) must also be installed and configured in your cluster to ensure that
 the Nvidia NIMs are properly deployed.
 
+## Air-Gapped Deployment (Kubernetes)
+
+For deploying in an air-gapped environment (no internet or NGC registry access), refer to the [NVIDIA NIM Operator documentation on Air-Gapped Environments](https://docs.nvidia.com/nim-operator/latest/air-gap.html), which explains how to deploy NIMs in such environments.
 
 ## Initial Environment Setup
 

From 4129d5b89aa404396a0aeaf08826a038ab7247d4 Mon Sep 17 00:00:00 2001
From: Jeremy Dyer <jdye64@gmail.com>
Date: Thu, 19 Mar 2026 15:06:16 -0400
Subject: [PATCH 41/55] Revert doc naming changes

---
 docs/docs/extraction/audio.md                 |  24 +-
 docs/docs/extraction/benchmarking.md          | 104 +--
 docs/docs/extraction/chunking.md              |   8 +-
 docs/docs/extraction/content-metadata.md      |   8 +-
 docs/docs/extraction/contributing.md          |   6 +-
 docs/docs/extraction/custom-metadata.md       |  16 +-
 docs/docs/extraction/data-store.md            | 147 ++++-
 docs/docs/extraction/environment-config.md    |  18 +-
 docs/docs/extraction/faq.md                   |  36 +-
 docs/docs/extraction/helm.md                  |   6 +-
 docs/docs/extraction/nemoretriever-parse.md   |  49 +-
 docs/docs/extraction/nimclient.md             |  20 +-
 docs/docs/extraction/notebooks.md             |  22 +-
 docs/docs/extraction/nv-ingest-python-api.md  | 600 ++++++++++++++++++
 docs/docs/extraction/nv-ingest_cli.md         | 175 +++++
 docs/docs/extraction/overview.md              |   8 +-
 docs/docs/extraction/prerequisites.md         |   2 +-
 docs/docs/extraction/quickstart-guide.md      | 190 +++---
 .../extraction/quickstart-library-mode.md     | 486 +++++++++++++-
 .../docs/extraction/releasenotes-nv-ingest.md |  91 ++-
 docs/docs/extraction/scaling-modes.md         |  12 +-
 docs/docs/extraction/support-matrix.md        |  25 +-
 docs/docs/extraction/telemetry.md             |   2 +-
 docs/docs/extraction/troubleshoot.md          |   4 +-
 .../docs/extraction/user-defined-functions.md |  46 +-
 docs/docs/extraction/user-defined-stages.md   |  16 +-
 docs/docs/extraction/v2-api-guide.md          |  88 +--
 docs/docs/extraction/vlm-embed.md             | 123 +---
 docs/docs/index.md                            |  28 +-
 docs/mkdocs.yml                               |   2 +-
 30 files changed, 1813 insertions(+), 549 deletions(-)
 create mode 100644 docs/docs/extraction/nv-ingest-python-api.md
 create mode 100644 docs/docs/extraction/nv-ingest_cli.md

diff --git a/docs/docs/extraction/audio.md b/docs/docs/extraction/audio.md
index 4be7ee8ac..021ee74b6 100644
--- a/docs/docs/extraction/audio.md
+++ b/docs/docs/extraction/audio.md
@@ -9,7 +9,7 @@ to extract speech from audio files.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 Currently, you can extract speech from the following file types:
 
@@ -22,8 +22,8 @@ Currently, you can extract speech from the following file types:
 
 [NeMo Retriever Library](overview.md) supports extracting speech from audio files for Retrieval Augmented Generation (RAG) applications. 
 Similar to how the multimodal document extraction pipeline leverages object detection and image OCR microservices, 
-NeMo Retriever Library leverages the [RIVA ASR NIM microservice](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html) 
-to transcribe speech to text, which is then embedded by using the Nemotron embedding NIM. 
+NeMo Retriever leverages the [RIVA ASR NIM microservice](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html) 
+to transcribe speech to text, which is then embedded by using the NeMo Retriever embedding NIM. 
 
 !!! important
 
@@ -62,17 +62,17 @@ Use the following procedure to run the NIM locally.
     NGC_API_KEY=<your-ngc-key>
     ```
 
-3. Start the NeMo Retriever Library services with the `audio` profile. This profile includes the necessary components for audio processing. Use the following command. The `--profile audio` flag ensures that speech-specific services are launched. For more information, refer to [Profile Information](quickstart-guide.md#profile-information).
+3. Start the retriever services with the `audio` profile. This profile includes the necessary components for audio processing. Use the following command. The `--profile audio` flag ensures that speech-specific services are launched. For more information, refer to [Profile Information](quickstart-guide.md#profile-information).
 
     ```shell
     docker compose --profile retrieval --profile audio up
     ```
 
-4. After the services are running, you can interact with NeMo Retriever Library by using Python.
+4. After the services are running, you can interact with the pipeline by using Python.
 
     - The `Ingestor` object initializes the ingestion process.
     - The `files` method specifies the input files to process.
-    - The `extract` method tells NeMo Retriever Library to extract information from WAV audio files.
+    - The `extract` method tells the pipeline to extract information from WAV audio files.
     - The `document_type` parameter is optional, because `Ingestor` should detect the file type automatically.
 
     ```python
@@ -89,12 +89,12 @@ Use the following procedure to run the NIM locally.
 
     !!! tip
 
-        For more Python examples, refer to [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+        For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
 
 
 ## Use NVCF Endpoints for Cloud-Based Inference
 
-Instead of running NeMo Retriever Library locally, you can use NVCF to perform inference by using remote endpoints.
+Instead of running the pipeline locally, you can use NVCF to perform inference by using remote endpoints.
 
 1. NVCF requires an authentication token and a function ID for access. Ensure you have these credentials ready before making API calls.
 
@@ -102,7 +102,7 @@ Instead of running NeMo Retriever Library locally, you can use NVCF to perform i
 
     - The `Ingestor` object initializes the ingestion process.
     - The `files` method specifies the input files to process.
-    - The `extract` method tells NeMo Retriever Library to extract information from WAV audio files.
+    - The `extract` method tells the pipeline to extract information from WAV audio files.
     - The `document_type` parameter is optional, because `Ingestor` should detect the file type automatically.
 
     ```python
@@ -124,12 +124,12 @@ Instead of running NeMo Retriever Library locally, you can use NVCF to perform i
 
     !!! tip
 
-        For more Python examples, refer to [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+        For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
 
 
 
 ## Related Topics
 
 - [Support Matrix](support-matrix.md)
-- [Troubleshoot NeMo Retriever Library](troubleshoot.md)
-- [Use the NeMo Retriever Library Python API](python-api-reference.md)
+- [Troubleshoot NeMo Retriever Extraction](troubleshoot.md)
+- [Use the Python API](nv-ingest-python-api.md)
diff --git a/docs/docs/extraction/benchmarking.md b/docs/docs/extraction/benchmarking.md
index 62abbf302..0f8cd81d9 100644
--- a/docs/docs/extraction/benchmarking.md
+++ b/docs/docs/extraction/benchmarking.md
@@ -1,11 +1,11 @@
-# NeMo Retriever Library Integration Testing Framework
+# nv-ingest Integration Testing Framework
 
-A configurable, dataset-agnostic testing framework for end-to-end validation of NeMo Retriever Library pipelines. This framework uses structured YAML configuration for type safety, validation, and parameter management.
+A configurable, dataset-agnostic testing framework for end-to-end validation of nv-ingest pipelines. This framework uses structured YAML configuration for type safety, validation, and parameter management.
 
 ## Dataset Prerequisites
 
     
-Before you run any benchmarking or evaluation tests, you must first download the benchmark datasets. The three primary datasets used in NeMo Retriever Library benchmarking and evaluations are the following:
+Before you run any benchmarking or evaluation tests, you must first download the benchmark datasets. The three primary datasets used in nv-ingest benchmarking and evaluations are the following:
     
 - **Bo20** - 20 PDFs for quick testing
 - **Bo767** - 767 PDFs for comprehensive benchmarking
@@ -13,7 +13,7 @@ Before you run any benchmarking or evaluation tests, you must first download the
     
 ### How to Download the Datasets
     
-Use the [Digital Corpora Download Notebook](https://github.com/NVIDIA/NeMo-Retriever/blob/main/evaluation/digital_corpora_download.ipynb) to download these datasets from the public Digital Corpora source. This notebook provides automated download functions that do the following:
+Use the [Digital Corpora Download Notebook](https://github.com/NVIDIA/nv-ingest/blob/main/evaluation/digital_corpora_download.ipynb) to download these datasets from the public Digital Corpora source. This notebook provides automated download functions that do the following:
     
 - Download PDFs directly from Digital Corpora's public repository.
 - Support all three dataset sizes (Bo20, Bo767, Bo10k).
@@ -29,26 +29,26 @@ Use the [Digital Corpora Download Notebook](https://github.com/NVIDIA/NeMo-Retri
 Before you use this documentation, you need the following:
 
 - Docker and Docker Compose are running
-- A Python environment with nemo-retriever installed
+- A Python environment with nv-ingest-client installed
 - The [benchmark datasets are downloaded](#dataset-prerequisites)
 
 ### Run Your First Test
 
 ```bash
-# 1. Navigate to the nemo-retriever-bench directory
+# 1. Navigate to the nv-ingest-harness directory
 cd tools/harness
 
 # 2. Install dependencies
 uv sync
 
 # 3. Run with a pre-configured dataset (assumes services are running)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 
 # Or use a custom path that uses the "active" configuration
-uv run nemo-retriever-bench --case=e2e --dataset=/path/to/your/data
+uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/data
 
 # With managed infrastructure (starts/stops services)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
 ```
 
 ## Configuration System
@@ -109,7 +109,7 @@ Each dataset includes its path, extraction settings, and recall evaluator in one
 ```yaml
 datasets:
   bo767:
-    path: /datasets/nv-ingest/bo767
+    path: /raid/jioffe/bo767
     extract_text: true
     extract_tables: true
     extract_charts: true
@@ -118,7 +118,7 @@ datasets:
     recall_dataset: bo767  # Evaluator for recall testing
   
   bo20:
-    path: /datasets/nv-ingest/bo20
+    path: /raid/jioffe/bo20
     extract_text: true
     extract_tables: true
     extract_charts: true
@@ -127,7 +127,7 @@ datasets:
     recall_dataset: null  # bo20 does not have recall
   
   earnings:
-    path: /datasets/nv-ingest/earnings_consulting
+    path: /raid/jioffe/earnings_conusulting
     extract_text: true
     extract_tables: true
     extract_charts: true
@@ -144,13 +144,13 @@ datasets:
 **Usage:**
 ```bash
 # Single dataset - configs applied automatically
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 
 # Multiple datasets (sweeping) - each gets its own config
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20
 
 # Custom path still works (uses active section config)
-uv run nemo-retriever-bench --case=e2e --dataset=/custom/path
+uv run nv-ingest-harness-run --case=e2e --dataset=/custom/path
 ```
 
 **Dataset Extraction Settings:**
@@ -176,7 +176,7 @@ Example:
 # YAML active section has api_version: v1
 # Dataset bo767 has extract_images: false
 # Override via environment variable (highest priority)
-EXTRACT_IMAGES=true API_VERSION=v2 uv run nemo-retriever-bench --case=e2e --dataset=bo767
+EXTRACT_IMAGES=true API_VERSION=v2 uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 # Result: Uses bo767 path, but extract_images=true (env override) and api_version=v2 (env override)
 ```
 
@@ -240,13 +240,13 @@ Configuration is validated on load with helpful error messages.
 
 ```bash
 # Run with default YAML configuration (assumes services are running)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 
 # With document-level analysis
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --doc-analysis
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --doc-analysis
 
 # With managed infrastructure (starts/stops services)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
 ```
 
 ### Dataset Sweeping
@@ -255,7 +255,7 @@ Run multiple datasets in a single command - each dataset automatically gets its
 
 ```bash
 # Sweep multiple datasets
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20
 
 # Each dataset runs sequentially with its own:
 # - Extraction settings (from dataset config)
@@ -263,13 +263,13 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
 # - Results summary at the end
 
 # With managed infrastructure (services start once, shared across all datasets)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20 --managed
 
 # E2E+Recall sweep (each dataset ingests then evaluates recall)
-uv run nemo-retriever-bench --case=e2e_recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767,earnings
 
 # Recall-only sweep (evaluates existing collections)
-uv run nemo-retriever-bench --case=recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=recall --dataset=bo767,earnings
 ```
 
 **Sweep Behavior:**
@@ -283,10 +283,10 @@ uv run nemo-retriever-bench --case=recall --dataset=bo767,earnings
 
 ```bash
 # Override via environment (useful for CI/CD)
-API_VERSION=v2 EXTRACT_TABLES=false uv run nemo-retriever-bench --case=e2e
+API_VERSION=v2 EXTRACT_TABLES=false uv run nv-ingest-harness-run --case=e2e
 
 # Temporary changes without editing YAML
-DATASET_DIR=/custom/path uv run nemo-retriever-bench --case=e2e
+DATASET_DIR=/custom/path uv run nv-ingest-harness-run --case=e2e
 ```
 
 ## Test Scenarios
@@ -472,23 +472,23 @@ recall:
 ```bash
 # Evaluate existing bo767 collections (no reranker)
 # recall_dataset automatically set from dataset config
-uv run nemo-retriever-bench --case=recall --dataset=bo767
+uv run nv-ingest-harness-run --case=recall --dataset=bo767
 
 # With reranker only (set reranker_mode in YAML recall section)
-uv run nemo-retriever-bench --case=recall --dataset=bo767
+uv run nv-ingest-harness-run --case=recall --dataset=bo767
 
 # Sweep multiple datasets for recall evaluation
-uv run nemo-retriever-bench --case=recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=recall --dataset=bo767,earnings
 ```
 
 **E2E + Recall (fresh ingestion):**
 ```bash
 # Fresh ingestion with recall evaluation
 # recall_dataset automatically set from dataset config
-uv run nemo-retriever-bench --case=e2e_recall --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767
 
 # Sweep multiple datasets (each ingests then evaluates)
-uv run nemo-retriever-bench --case=e2e_recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767,earnings
 ```
 
 **Dataset configuration:**
@@ -536,7 +536,7 @@ The easiest way to test multiple datasets is using dataset sweeping:
 
 ```bash
 # Test multiple datasets - each gets its native config automatically
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20
 
 # Each dataset runs with its pre-configured extraction settings
 # Results are organized in separate artifact directories
@@ -547,7 +547,7 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
 To sweep through different parameter values:
 
 1. **Edit** `test_configs.yaml` - Update values in the `active` section
-2. **Run** the test: `uv run nemo-retriever-bench --case=e2e --dataset=<name>`
+2. **Run** the test: `uv run nv-ingest-harness-run --case=e2e --dataset=<name>`
 3. **Analyze** results in `artifacts/<test_name>_<timestamp>/`
 4. **Repeat** steps 1-3 for next parameter combination
 
@@ -555,18 +555,18 @@ Example parameter sweep workflow:
 ```bash
 # Test 1: Baseline V1
 vim test_configs.yaml  # Set: api_version=v1, extract_tables=true
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 
 # Test 2: V2 with 32-page splitting
 vim test_configs.yaml  # Set: api_version=v2, pdf_split_page_count=32
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 
 # Test 3: V2 with 8-page splitting
 vim test_configs.yaml  # Set: pdf_split_page_count=8
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 
 # Test 4: Tables disabled (override via env var)
-EXTRACT_TABLES=false uv run nemo-retriever-bench --case=e2e --dataset=bo767
+EXTRACT_TABLES=false uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 ```
 
 **Note**: Each test run creates a new timestamped artifact directory, so you can compare results across sweeps.
@@ -576,7 +576,7 @@ EXTRACT_TABLES=false uv run nemo-retriever-bench --case=e2e --dataset=bo767
 ### Attach Mode (Default)
 
 ```bash
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 ```
 
 - **Default behavior**: Assumes services are already running
@@ -588,7 +588,7 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767
 ### Managed Mode
 
 ```bash
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
 ```
 
 - Starts Docker services automatically
@@ -600,10 +600,10 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
 **Managed mode options:**
 ```bash
 # Skip Docker image rebuild (faster startup)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed --no-build
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed --no-build
 
 # Keep services running after test (useful for multi-test scenarios)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed --keep-up
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed --keep-up
 ```
 
 ## Artifacts and Logging
@@ -631,7 +631,7 @@ tools/harness/artifacts/<test_name>_<timestamp>_UTC/
 Enable per-document element breakdown:
 
 ```bash
-uv run nemo-retriever-bench --case=e2e --doc-analysis
+uv run nv-ingest-harness-run --case=e2e --doc-analysis
 ```
 
 **Sample Output:**
@@ -655,21 +655,21 @@ This provides:
 - `test_configs.yaml` - Structured configuration file
   - Active test configuration (edit directly)
   - Dataset shortcuts for quick access
-- `src/nemo_retriever_harness/config.py` - Configuration management
+- `src/nv_ingest_harness/config.py` - Configuration management
   - YAML loading and parsing
   - Type-safe config dataclass
   - Validation logic with helpful errors
   - Environment variable override support
 
 **2. Test Runner**
-- `src/nemo_retriever_harness/cli/run.py` - Main orchestration
+- `src/nv_ingest_harness/cli/run.py` - Main orchestration
   - Configuration loading with precedence chain
   - Docker service management (managed mode)
   - Test case execution with config injection
   - Artifact collection and consolidation
 
 **3. Test Cases**
-- `src/nemo_retriever_harness/cases/e2e.py` - Primary E2E test (✅ YAML-based)
+- `src/nv_ingest_harness/cases/e2e.py` - Primary E2E test (✅ YAML-based)
   - Accepts config object directly
   - Type-safe parameter access
   - Full pipeline validation (extract → embed → VDB → retrieval)
@@ -677,19 +677,19 @@ This provides:
 - `cases/e2e_with_llm_summary.py` - E2E with LLM (✅ YAML-based)
   - Adds UDF-based LLM summarization
   - Same config-based architecture as e2e.py
-- `src/nemo_retriever_harness/cases/recall.py` - Recall evaluation (✅ YAML-based)
+- `src/nv_ingest_harness/cases/recall.py` - Recall evaluation (✅ YAML-based)
   - Evaluates retrieval accuracy against existing collections
   - Requires `recall_dataset` in config (from dataset config or env var)
   - Supports reranker comparison modes (none, with, both)
   - Multimodal-only evaluation against `{test_name}_multimodal` collection
-- `src/nemo_retriever_harness/cases/e2e_recall.py` - E2E + Recall (✅ YAML-based)
+- `src/nv_ingest_harness/cases/e2e_recall.py` - E2E + Recall (✅ YAML-based)
   - Combines ingestion (via e2e.py) with recall evaluation (via recall.py)
   - Automatically creates collection during ingestion
   - Requires `recall_dataset` in config (from dataset config or env var)
   - Merges ingestion and recall metrics in results
 
 **4. Shared Utilities**
-- `src/nemo_retriever_harness/utils/interact.py` - Common testing utilities
+- `src/nv_ingest_harness/utils/interact.py` - Common testing utilities
   - `embed_info()` - Embedding model detection
   - `milvus_chunks()` - Vector database statistics
   - `segment_results()` - Result categorization by type
@@ -812,7 +812,7 @@ The framework is dataset-agnostic and supports multiple approaches:
 **Option 1: Use pre-configured dataset (Recommended)**
 ```bash
 # Dataset configs automatically applied
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 ```
 
 **Option 2: Add new dataset to YAML**
@@ -827,17 +827,17 @@ datasets:
     extract_infographics: false
     recall_dataset: null  # or set to evaluator name if applicable
 ```
-Then use: `uv run nemo-retriever-bench --case=e2e --dataset=my_dataset`
+Then use: `uv run nv-ingest-harness-run --case=e2e --dataset=my_dataset`
 
 **Option 3: Use custom path (uses active section config)**
 ```bash
-uv run nemo-retriever-bench --case=e2e --dataset=/path/to/your/dataset
+uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/dataset
 ```
 
 **Option 4: Environment variable override**
 ```bash
 # Override specific settings via env vars
-EXTRACT_IMAGES=true uv run nemo-retriever-bench --case=e2e --dataset=bo767
+EXTRACT_IMAGES=true uv run nv-ingest-harness-run --case=e2e --dataset=bo767
 ```
 
 **Best Practice**: For repeated testing, add your dataset to the `datasets` section with its native extraction settings. This ensures consistent configuration and enables dataset sweeping.
@@ -849,4 +849,4 @@ EXTRACT_IMAGES=true uv run nemo-retriever-bench --case=e2e --dataset=bo767
 - **Docker setup**: See project root README for service management commands
 - **API documentation**: See `docs/` for API version differences
 
-The framework prioritizes clarity, type safety, and validation to support reliable testing of NeMo Retriever pipelines.
+The framework prioritizes clarity, type safety, and validation to support reliable testing of nv-ingest pipelines.
diff --git a/docs/docs/extraction/chunking.md b/docs/docs/extraction/chunking.md
index ec7b88632..540b147ca 100644
--- a/docs/docs/extraction/chunking.md
+++ b/docs/docs/extraction/chunking.md
@@ -4,7 +4,7 @@ Splitting, also known as chunking, breaks large documents or text into smaller,
 After chunking, only the most relevant pieces of information are retrieved for a given query. 
 Chunking also prevents text from exceeding the context window of the embedding model.
 
-There are two ways that the retriever pipeline chunks text:
+There are two ways that NV-Ingest chunks text:
 
 - By using the `text_depth` parameter in the `extraction` task.
 - Token-based splitting by using the `split` task.
@@ -93,7 +93,7 @@ The following table contains the `split` parameters.
 
 ### Pre-download the Tokenizer
 
-By default, the NeMo Retriever Library container comes with the `meta-llama/Llama-3.2-1B` tokenizer pre-downloaded 
+By default, the NV-Ingest container comes with the `meta-llama/Llama-3.2-1B` tokenizer pre-downloaded 
 so that it doesn't have to download a tokenizer at runtime.
 If you are building the container yourself and want to pre-download this model, do the following:
 
@@ -106,6 +106,6 @@ If you are building the container yourself and want to pre-download this model,
 
 ## Related Topics
 
-- [Use the Python API](python-api-reference.md)
+- [Use the Python API](nv-ingest-python-api.md)
 - [NeMo Retriever Library V2 API Guide](v2-api-guide.md)
-- [Environment Variables](environment-config.md)
+- [Environment Variables](environment-variables.md)
diff --git a/docs/docs/extraction/content-metadata.md b/docs/docs/extraction/content-metadata.md
index c02aa8c46..e7d8bb050 100644
--- a/docs/docs/extraction/content-metadata.md
+++ b/docs/docs/extraction/content-metadata.md
@@ -10,7 +10,7 @@ Metadata can be extracted from a source or content, or generated by using models
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 
@@ -43,7 +43,7 @@ These fields apply to all content types including text, images, and tables.
 | Subtype | The type of the content for structured data types, such as table or chart. | — |
 | Content | Content extracted from the source.  | Extracted |
 | Description | A text description of the content object. | Generated |
-| Page \# | The page \# of the content in the source. Prior to 26.3.0-RC1, this field was 0-indexed. Beginning with 26.3.0-RC1, this field is 1-indexed. | Extracted |
+| Page \# | The page \# of the content in the source. Prior to 26.1.2, this field was 0-indexed. Beginning with 26.1.2, this field is 1-indexed. | Extracted |
 | Hierarchy | The location or order of the content within the source.  | Extracted |
 
 
@@ -282,7 +282,7 @@ The following enums are used by this schema:
 
 The following is an example JSON representation of metadata. 
 This is an example only, and does not contain the full metadata.
-For the full file, refer to the [data folder](https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/multimodal_test.json).
+For the full file, refer to the [data folder](https://github.com/NVIDIA/nv-ingest/blob/main/data/multimodal_test.json).
 
 ```json
 {
@@ -374,4 +374,4 @@ For the full file, refer to the [data folder](https://github.com/NVIDIA/NeMo-Ret
 
 ## Related Topics
 
-- [Environment Variables](environment-config.md)
+- [Environment Variables](environment-variables.md)
diff --git a/docs/docs/extraction/contributing.md b/docs/docs/extraction/contributing.md
index b39dcb636..6a136c218 100644
--- a/docs/docs/extraction/contributing.md
+++ b/docs/docs/extraction/contributing.md
@@ -1,4 +1,4 @@
-# Contributing to NeMo Retriever Library
+# Contributing to NV-Ingest
 
-External contributions to NeMo Retriever Library will be welcome soon, and they are greatly appreciated! 
-For more information, refer to [Contributing to NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/main/CONTRIBUTING.md).
+External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated! 
+For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
diff --git a/docs/docs/extraction/custom-metadata.md b/docs/docs/extraction/custom-metadata.md
index 1ac644243..805aef628 100644
--- a/docs/docs/extraction/custom-metadata.md
+++ b/docs/docs/extraction/custom-metadata.md
@@ -56,14 +56,14 @@ meta_df.to_csv(file_path)
 ### Example: Add Custom Metadata During Ingestion
 
 The following example adds custom metadata during ingestion. 
-For more information about the `Ingestor` class, see [Use the NeMo Retriever Library Python API](python-api-reference.md).
+For more information about the `Ingestor` class, see [Use the Python API](nv-ingest-python-api.md).
 For more information about the `vdb_upload` method, see [Upload Data](data-store.md).
 
 ```python
-from nemo_retriever.client import Ingestor
+from nv_ingest_client.client import Ingestor
 
 hostname="localhost"
-collection_name = "nemo_retriever_collection"
+collection_name = "nv_ingest_collection"
 sparse = True
 
 ingestor = ( 
@@ -142,13 +142,13 @@ you can use the `content_metadata` field to filter search results.
 The following example uses a filter expression to narrow results by department.
 
 ```python
-from nemo_retriever.util.milvus import query
+from nv_ingest_client.util.milvus import nvingest_retrieval
 
 hostname="localhost"
-collection_name = "nemo_retriever_collection"
+collection_name = "nv_ingest_collection"
 sparse = True
 top_k = 5
-model_name="nvidia/llama-nemotron-embed-1b-v2"
+model_name="nvidia/llama-3.2-nv-embedqa-1b-v2"
 
 filter_expr = 'content_metadata["department"] == "Engineering"'
 
@@ -156,7 +156,7 @@ queries = ["this is expensive"]
 q_results = []
 for que in queries:
     q_results.append(
-        query(
+        nvingest_retrieval(
             [que], 
             collection_name, 
             milvus_uri=f"http://{hostname}:19530", 
@@ -177,4 +177,4 @@ print(f"{q_results}")
 ## Related Content
 
 - For a notebook that uses the CLI to add custom metadata and filter query results, see [metadata_and_filtered_search.ipynb
-](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb).
+](https://github.com/NVIDIA/nv-ingest/blob/main/examples/metadata_and_filtered_search.ipynb).
diff --git a/docs/docs/extraction/data-store.md b/docs/docs/extraction/data-store.md
index 990ee4f3d..580a1dfc7 100644
--- a/docs/docs/extraction/data-store.md
+++ b/docs/docs/extraction/data-store.md
@@ -4,48 +4,143 @@ Use this documentation to learn how [NeMo Retriever Library](overview.md) handle
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## Overview
 
-NeMo Retriever Library supports extracting text representations of various forms of content, 
-and ingesting to the [Milvus vector database](https://milvus.io/). 
-The data upload task (`vdb_upload`) pulls extraction results to the Python client, 
-and then pushes them to Milvus by using its underlying MinIO object store service.
+NeMo Retriever Library supports extracting text representations of various forms of content,
+and ingesting to a vector database. **[LanceDB](https://lancedb.com/) is the default vector database backend** for storing and retrieving extracted embeddings. [Milvus](https://milvus.io/) remains fully supported as an alternative.
 
-The vector database stores only the extracted text representations of ingested data. 
+The data upload task (`vdb_upload`) pulls extraction results to the Python client,
+and then pushes them to the configured vector database (LanceDB or Milvus). When using Milvus, data is pushed by using its underlying MinIO object store service.
+
+The vector database stores only the extracted text representations of ingested data.
 It does not store the embeddings for images.
 
 !!! tip "Storing Extracted Images"
 
-    To persist extracted images, tables, and chart renderings to disk or object storage, use the `store` task in addition to `vdb_upload`. The `store` task supports any fsspec-compatible backend (local filesystem, S3, GCS, etc.). For details, refer to [Store Extracted Images](python-api-reference.md#store-extracted-images).
+    To persist extracted images, tables, and chart renderings to disk or object storage, use the `store` task in addition to `vdb_upload`. The `store` task supports any fsspec-compatible backend (local filesystem, S3, GCS, etc.). For details, refer to [Store Extracted Images](nv-ingest-python-api.md#store-extracted-images).
+
+NeMo Retriever Library supports uploading data by using the [Ingestor.vdb_upload API](nv-ingest-python-api.md).
+Currently, data upload is not supported through the [CLI](nv-ingest_cli.md).
+
+
+
+## Why LanceDB?
+
+LanceDB delivers measurably lower retrieval latency through three architectural advantages over the previous Milvus default:
+
+- **Lance columnar format** — Data is stored in Lance files, an Arrow/Parquet-style analytics layout optimized for fast local scans and indexed retrieval. This eliminates the serialization overhead of client-server protocols.
+- **IVF_HNSW_SQ index** — Vectors are scalar-quantized (SQ) within an IVF-HNSW index, compressing them for faster search with lower memory bandwidth cost.
+- **Embedded runtime** — LanceDB runs in-process, removing the multi-service dependency chain required by Milvus (Milvus server + etcd + MinIO). No external containers to start, configure, or maintain.
+
+This combination of file format, index strategy, and simpler runtime path produces the latency improvements observed in benchmarks.
+
+
+
+## Upload to LanceDB (default)
+
+LanceDB uploads use the `LanceDB` operator class from the client library. You can configure it through the Python API or through the test harness.
+
+### Programmatic API (Python)
+
+```python
+from nv_ingest_client.util.vdb.lancedb import LanceDB
+
+vdb = LanceDB(
+    uri="lancedb",           # Path to LanceDB database directory
+    table_name="nv-ingest",  # Table name
+    index_type="IVF_HNSW_SQ",  # Index type (default)
+    hybrid=False,            # Enable hybrid search (BM25 FTS + vector)
+)
+
+# Ingest: `results` holds the extraction results returned by the ingestion job
+vdb.run(results)
+
+# Retrieve: `queries` is a list of query strings
+docs = vdb.retrieval(queries, top_k=10)
+```
+
+When using the `Ingestor` with `vdb_upload`, the backend defaults to LanceDB unless you configure Milvus (see [Upload to Milvus](#upload-to-milvus)).
 
-NeMo Retriever Library supports uploading data by using the [Ingestor.vdb_upload API](python-api-reference.md). 
-Currently, data upload is not supported through the [NeMo Retriever CLI](cli-reference.md).
+### Test harness configuration
+
+In `tools/harness/test_configs.yaml`:
+
+```yaml
+active:
+  vdb_backend: lancedb   # Options: "lancedb" (default) or "milvus"
+  hybrid: false          # LanceDB only: enable hybrid retrieval (FTS + vector)
+  sparse: false          # Milvus only: enable BM42 sparse embeddings
+```
+
+Or via environment variables:
+
+```bash
+# Switch to Milvus
+VDB_BACKEND=milvus uv run nv-ingest-harness-run --case=e2e --dataset=bo767
+
+# Enable LanceDB hybrid search
+HYBRID=true uv run nv-ingest-harness-run --case=e2e --dataset=bo767
+```
+
+
+
+## Hybrid search (LanceDB)
+
+LanceDB supports **hybrid retrieval**, combining dense vector similarity with BM25 full-text search. Results are fused using Reciprocal Rank Fusion (RRF) reranking.
+
+Hybrid search improves Recall@5 by approximately 0.5 to 3.5 percentage points over vector-only retrieval with negligible latency impact:
+
+| Dataset            | Vector-Only Recall@5 | Hybrid Recall@5 | Delta  |
+|--------------------|----------------------|-----------------|--------|
+| bo767 (76K rows)   | 84.5%                | 85.0%           | +0.5%  |
+| bo767 (reranked)   | 90.7%                | 91.8%           | +1.1%  |
+| earnings (19K rows)| 61.5%                | 65.0%           | +3.5%  |
+| earnings (reranked)| 74.5%                | 76.4%           | +1.9%  |
+
+Hybrid search latency is typically 28–57 ms/query (vs. 31–37 ms/query for vector-only). The one-time FTS index build adds approximately 6.5 seconds for a 76K-row dataset.
+
+Enable hybrid search by setting `hybrid=True` when you create the LanceDB operator, or through the harness configuration (for example, `HYBRID=true`).
+
+
+
+## Infrastructure: LanceDB vs Milvus
+
+| Aspect              | LanceDB (default)       | Milvus                    |
+|---------------------|-------------------------|---------------------------|
+| Runtime model       | Embedded (in-process)   | Client-server             |
+| External services   | None                    | Milvus + etcd + MinIO     |
+| Docker Compose profile | Not needed           | `--profile retrieval`     |
+| Index type          | IVF_HNSW_SQ             | HNSW, GPU_CAGRA, etc.     |
+| Hybrid search       | BM25 FTS + vector (RRF) | BM42 sparse embeddings    |
+| Persistence         | Lance files on disk     | Milvus server + MinIO     |
 
 
 
 ## Upload to Milvus
 
-The `vdb_upload` method uses GPU Cagra accelerated bulk indexing support to load chunks into Milvus. 
-To enable hybrid retrieval, NeMo Retriever supports both dense (llama-embedder embeddings) and sparse (bm25) embeddings. 
+You can continue using Milvus with no code changes — set `vdb_backend: milvus` in the harness config or use the existing Milvus API calls (`vdb_upload(milvus_uri=...)`, `nvingest_retrieval(...)`).
+
+The `vdb_upload` method uses GPU CAGRA-accelerated bulk indexing to load chunks into Milvus.
+To enable hybrid retrieval with Milvus, the library supports both dense (llama-embedder embeddings) and sparse (BM42) embeddings.
 
-Bulk indexing is high throughput, but has a built-in overhead of around one minute. 
-If the number of ingested documents is 10 or fewer, NeMo Retriever uses faster streaming inserts instead. 
-You can control this by setting `stream=True`. 
+Bulk indexing is high throughput, but has a built-in overhead of around one minute.
+If the number of ingested documents is 10 or fewer, the library uses faster streaming inserts instead.
+You can force streaming inserts regardless of document count by setting `stream=True`.
 
-If you set `recreate=True`, NeMo Retriever drops and recreates the collection given as *collection_name*. 
-The Milvus service persists data to disk by using a Docker volume defined in docker-compose.yaml. 
-You can delete all collections by deleting that volume, and then restarting the NeMo Retriever service.
+If you set `recreate=True`, the pipeline drops and recreates the collection given as *collection_name*.
+The Milvus service persists data to disk by using a Docker volume defined in `docker-compose.yaml`.
+You can delete all collections by deleting that volume, and then restarting the Milvus service.
 
 !!! warning
 
-    When you use the `vdb_upload` task with Milvus, you must expose the ports for the Milvus and MinIO containers to the NeMo Retriever client. This ensures that the NeMo Retriever client can connect to both services and perform the `vdb_upload` action.
+    When you use the `vdb_upload` task with Milvus, you must expose the ports for the Milvus and MinIO containers to the client. This ensures that the client can connect to both services and perform the `vdb_upload` action.
 
 !!! tip
 
-    When you use the `vdb_upload` method, the behavior of the upload depends on the `return_failures` parameter of the `ingest` method. For details, refer to [Capture Job Failures](python-api-reference.md#capture-job-failures).
+    When you use the `vdb_upload` method, the behavior of the upload depends on the `return_failures` parameter of the `ingest` method. For details, refer to [Capture Job Failures](nv-ingest-python-api.md#capture-job-failures).
 
 To upload to Milvus, use code similar to the following to define your `Ingestor`.
 
@@ -70,21 +165,21 @@ Ingestor(client=client)
 
 ## Upload to a Custom Data Store
 
-You can ingest to other data stores by using the `Ingestor.vdb_upload` method; 
-however, you must configure other data stores and connections yourself. 
-NeMo Retriever Library does not provide connections to other data sources. 
+You can ingest to other data stores by using the `Ingestor.vdb_upload` method;
+however, you must configure other data stores and connections yourself.
+NeMo Retriever Library does not provide connections to other data sources.
 
 !!! important
 
     NVIDIA makes no claim about accuracy, performance, or functionality of any vector database except Milvus. If you use a different vector database, it's your responsibility to test and maintain it.
 
-For more information, refer to [Build a Custom Vector Database Operator](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/building_vdb_operator.ipynb).
+For more information, refer to [Build a Custom Vector Database Operator](https://github.com/NVIDIA/nv-ingest/blob/main/examples/building_vdb_operator.ipynb).
 
 
 
 ## Related Topics
 
-- [Python API Reference](python-api-reference.md)
-- [Store Extracted Images](python-api-reference.md#store-extracted-images)
+- [Use the NeMo Retriever Library Python API](nv-ingest-python-api.md)
+- [Store Extracted Images](nv-ingest-python-api.md#store-extracted-images)
 - [Environment Variables](environment-config.md)
-- [Troubleshoot NeMo Retriever Library](troubleshoot.md)
+- [Troubleshoot NeMo Retriever Extraction](troubleshoot.md)
diff --git a/docs/docs/extraction/environment-config.md b/docs/docs/extraction/environment-config.md
index 422db0add..b411ea5b6 100644
--- a/docs/docs/extraction/environment-config.md
+++ b/docs/docs/extraction/environment-config.md
@@ -5,7 +5,7 @@ You can specify these in your .env file or directly in your environment.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## General Environment Variables
@@ -17,19 +17,29 @@ You can specify these in your .env file or directly in your environment.
 | `INGEST_LOG_LEVEL`               | - `DEBUG` <br/> - `INFO` <br/> - `WARNING` <br/> - `ERROR` <br/> - `CRITICAL` <br/> | The log level for the ingest service, which controls the verbosity of the logging output. |
 | `MESSAGE_CLIENT_HOST`            | - `redis` <br/> - `localhost` <br/> - `192.168.1.10` <br/> | Specifies the hostname or IP address of the message broker used for communication between services. |
 | `MESSAGE_CLIENT_PORT`            | - `7670` <br/> - `6379` <br/>                              | Specifies the port number on which the message broker is listening. |
-| `MINIO_BUCKET`                   | `nemo-retriever` <br/>                                        | Name of MinIO bucket, used to store image, table, and chart extractions. |
+| `MINIO_BUCKET`                   | `nv-ingest` <br/>                                        | Name of MinIO bucket, used to store image, table, and chart extractions. |
 | `NGC_API_KEY`                    | `nvapi-*************` <br/>                              | An authorized NGC API key, used to interact with hosted NIMs. To create an NGC key, go to [https://org.ngc.nvidia.com/setup/api-keys](https://org.ngc.nvidia.com/setup/api-keys). |
 | `NIM_NGC_API_KEY`                | —                                                          | The key that NIM microservices inside docker containers use to access NGC resources. This is necessary only in some cases when it is different from `NGC_API_KEY`. If this is not specified, `NGC_API_KEY` is used to access NGC resources. |
 | `OTEL_EXPORTER_OTLP_ENDPOINT`    | `http://otel-collector:4317` <br/>                       | The endpoint for the OpenTelemetry exporter, used for sending telemetry data. |
 | `REDIS_INGEST_TASK_QUEUE`        | `ingest_task_queue` <br/>                              | The name of the task queue in Redis where tasks are stored and processed. |
 | `REDIS_POOL_SIZE`                | - `50` (default) <br/> - `100` <br/> - `200` <br/>     | Maximum Redis connection pool size. Increase for high-concurrency workloads processing many documents in parallel. Default of 50 works well for most deployments. |
-| `IMAGE_STORAGE_URI`              | `s3://nemo-retriever/artifacts/store/images` <br/>          | Default fsspec-compatible URI for the `store` task. Supports `s3://`, `file://`, `gs://`, etc. See [Store Extracted Images](python-api-reference.md#store-extracted-images). |
+| `IMAGE_STORAGE_URI`              | `s3://nv-ingest/artifacts/store/images` <br/>          | Default fsspec-compatible URI for the `store` task. Supports `s3://`, `file://`, `gs://`, etc. See [Store Extracted Images](nv-ingest-python-api.md#store-extracted-images). |
 | `IMAGE_STORAGE_PUBLIC_BASE_URL`  | `https://assets.example.com/images` <br/>              | Optional HTTP(S) base URL for serving stored images. |
 
 
+## Vector Database (Retrieval) Environment Variables
+
+These variables apply when using the test harness or when configuring the vector database backend.
+
+| Name                             | Example                        | Description                                                           |
+|----------------------------------|--------------------------------|-----------------------------------------------------------------------|
+| `VDB_BACKEND`                    | - `lancedb` (default) <br/> - `milvus` <br/> | Vector database backend. Use `lancedb` for embedded, in-process storage (default), or `milvus` for a client-server deployment. |
+| `HYBRID`                         | - `true` <br/> - `false` (default) <br/> | LanceDB only: enable hybrid retrieval (BM25 full-text search + vector, fused with RRF). |
+
+
 ## Library Mode Environment Variables
 
-These environment variables apply specifically when running NeMo Retriever in library mode.
+These environment variables apply only when you run NeMo Retriever Library in library mode.
 
 | Name                              | Example                                                 | Description |
 |-----------------------------------|---------------------------------------------------------|-------------|
diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index d7eabd490..7bbb33d5f 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -4,35 +4,35 @@ This documentation contains the Frequently Asked Questions (FAQ) for [NeMo Retri
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 
 ## What if I already have a retrieval pipeline? Can I just use NeMo Retriever Library? 
 
-You can use the nemo-retriever CLI or Python APIs to perform extraction only, and then consume the results.
+You can use the CLI or Python APIs to perform extraction only, and then consume the results.
 Using the Python API, `results` is a list object with one entry.
-For code examples, see the Jupyter notebooks [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb) 
-and [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb).
+For code examples, see the Jupyter notebooks [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/nv-ingest/blob/main/examples/llama_index_multimodal_rag.ipynb) 
+and [Multimodal RAG with LangChain](https://github.com/NVIDIA/nv-ingest/blob/main/examples/langchain_multimodal_rag.ipynb).
 
 
 
 ## Where does NeMo Retriever Library ingest to?
 
-NeMo Retriever Library supports extracting text representations of various forms of content, 
-and ingesting to the [Milvus vector database](https://milvus.io/). 
-NeMo Retriever Library does not store data on disk except through Milvus and its underlying Minio object store. 
-You can ingest to other data stores; however, you must configure other data stores yourself. 
+NeMo Retriever Library supports extracting text representations of various forms of content,
+and ingesting to a vector database. **[LanceDB](https://lancedb.com/) is the default**; [Milvus](https://milvus.io/) is also fully supported.
+NeMo Retriever Library does not store data on disk except through the vector database (LanceDB uses local Lance files; Milvus uses its server and MinIO).
+You can ingest to other data stores; however, you must configure other data stores yourself.
 For more information, refer to [Data Upload](data-store.md).
 
 
 
 ## How would I process unstructured images?
 
-For images that `nemotron-page-elements-v3` does not classify as tables, charts, or infographics,
+For images that `nemoretriever-page-elements-v3` does not classify as tables, charts, or infographics,
 you can use our VLM caption task to create a dense caption of the detected image. 
 That caption is then be embedded along with the rest of your content. 
-For more information, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images).
+For more information, refer to [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images).
 
 
 
@@ -57,21 +57,27 @@ You can set those directly in `docker-compose.yaml`, or in an [environment varia
 
 ### Library Mode
 
-For production environments, you should use the provided Helm charts. For [library mode](quickstart-library-mode.md), you should set the environment variable `NVIDIA_API_KEY`. This happens because the NeMo Retriever Library containers—and the services running inside them—don’t have access to the environment variables of the host machine where the `docker compose` command is executed. Setting the variables in the `.env` file ensures that they are passed into the containers and available to the services that need them.
+For production environments, you should use the provided Helm charts. For [library mode](quickstart-library-mode.md), you should set the environment variable `NVIDIA_API_KEY`. This is because the NeMo Retriever containers and the NeMo Retriever services running inside them do not have access to the environment variables on the host machine where you run the `docker compose` command. Setting the variables in the `.env` file ensures that they are passed into the containers and available to the services that need them.
 
 For advanced scenarios, you might want to use library mode with self-hosted NIM instances. 
 You can set custom endpoints for each NIM. 
-For examples of `*_ENDPOINT` variables, refer to [NeMo-Retriever/docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml).
+For examples of `*_ENDPOINT` variables, refer to [nv-ingest/docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml).
+
+
+
+
+
+
 
 ## What parameters or settings can I adjust to optimize extraction from my documents or data? 
 
 See the [Profile Information](quickstart-guide.md#profile-information) section 
 for information about the optional NIM components of the pipeline.
 
-You can configure the `extract`, `caption`, and other tasks by using the [Ingestor API](python-api-reference.md).
+You can configure the `extract`, `caption`, and other tasks by using the [Ingestor API](nv-ingest-python-api.md).
 
 To choose what types of content to extract, use code similar to the following. 
-For more information, refer to [Extract Specific Elements from PDFs](python-api-reference.md#extract-specific-elements-from-pdfs).
+For more information, refer to [Extract Specific Elements from PDFs](nv-ingest-python-api.md#extract-specific-elements-from-pdfs).
 
 ```python
 Ingestor(client=client)
@@ -88,7 +94,7 @@ Ingestor(client=client)
 ```
 
 To generate captions for images, use code similar to the following.
-For more information, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images).
+For more information, refer to [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images).
 
 ```python
 Ingestor(client=client)
diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index bcc63b4c0..1a6e885a3 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -3,8 +3,4 @@
 <!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
 
 To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NeMo Retriever Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
-
-!!! note "Air-gapped environments"
-   
-    For deploying in an air-gapped environment, refer to the [NVIDIA NIM Operator documentation on Air-Gapped Environments](https://docs.nvidia.com/nim-operator/latest/air-gap.html), which explains how to deploy NIMs when your cluster has no internet or NGC registry access.
+refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/helm/README.md).
diff --git a/docs/docs/extraction/nemoretriever-parse.md b/docs/docs/extraction/nemoretriever-parse.md
index 1c78c3cdb..3da84cdce 100644
--- a/docs/docs/extraction/nemoretriever-parse.md
+++ b/docs/docs/extraction/nemoretriever-parse.md
@@ -4,25 +4,25 @@ For scanned documents, or documents with complex layouts,
 we recommend that you use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse). 
 Nemotron parse provides higher-accuracy text extraction. 
 
-This documentation describes the following three methods
+This documentation describes the following two methods 
 to run [NeMo Retriever Library](overview.md) with nemotron-parse.
 
 - Run the NIM locally by using Docker Compose
 - Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference
-- Run the Ray batch pipeline with nemotron-parse (library mode)
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## Limitations
 
-Currently, the limitations to using `nemotron-parse` with the NeMo Retriever Library are the following:
+Currently, the limitations to using `nemotron-parse` with NeMo Retriever Library are the following:
 
-- Extraction with `nemotron-parse` only supports PDFs, not image files. For more information, refer to [Troubleshoot NeMo Retriever Library](troubleshoot.md).
+- Extraction with `nemotron-parse` only supports PDFs, not image files. For more information, refer to [Troubleshoot NeMo Retriever Extraction](troubleshoot.md).
 - `nemotron-parse` is not supported on RTX Pro 6000, B200, or H200 NVL. For more information, refer to the [Nemotron Parse Support Matrix](https://docs.nvidia.com/nim/vision-language-models/latest/support-matrix.html#nemotron-parse).
 
+
 ## Run the NIM Locally by Using Docker Compose
 
 Use the following procedure to run the NIM locally.
@@ -32,7 +32,7 @@ Use the following procedure to run the NIM locally.
     Due to limitations in available VRAM controls in the current release of nemotron-parse, it must run on a [dedicated additional GPU](support-matrix.md). Edit docker-compose.yaml to set nemotron-parse's device_id to a dedicated GPU: device_ids: ["1"] or higher.
 
 
-1. Start the NeMo Retriever Library services with the `nemotron-parse` profile. This profile includes the necessary components for extracting text and metadata from images. Use the following command.
+1. Start the retriever services with the `nemotron-parse` profile. This profile includes the necessary components for extracting text and metadata from images. Use the following command.
 
     - The --profile nemotron-parse flag ensures that vision-language retrieval services are launched.  For more information, refer to [Profile Information](quickstart-guide.md#profile-information).
 
@@ -40,11 +40,11 @@ Use the following procedure to run the NIM locally.
     docker compose --profile nemotron-parse up
     ```
 
-2. After the services are running, you can interact with NeMo Retriever Library by using Python.
+2. After the services are running, you can interact with the pipeline by using Python.
 
     - The `Ingestor` object initializes the ingestion process.
     - The `files` method specifies the input files to process.
-    - The `extract` method tells NeMo Retriever Library to use `nemotron-parse` for extracting text and metadata from images.
+    - The `extract` method tells the pipeline to use `nemotron-parse` for extracting text and metadata from images.
     - The `document_type` parameter is optional, because `Ingestor` should detect the file type automatically.
 
     ```python
@@ -60,12 +60,12 @@ Use the following procedure to run the NIM locally.
 
     !!! tip
 
-        For more Python examples, refer to [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+        For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
 
 
 ## Using NVCF Endpoints for Cloud-Based Inference
 
-Instead of running NeMo Retriever Library locally, you can use NVCF to perform inference by using remote endpoints.
+Instead of running the pipeline locally, you can use NVCF to perform inference by using remote endpoints.
 
 1. Set the authentication token in the `.env` file.
 
@@ -85,7 +85,7 @@ Instead of running NeMo Retriever Library locally, you can use NVCF to perform i
 
     - The `Ingestor` object initializes the ingestion process.
     - The `files` method specifies the input files to process.
-    - The `extract` method tells NeMo Retriever Library to use `nemotron-parse` for extracting text and metadata from images.
+    - The `extract` method tells the pipeline to use `nemotron-parse` for extracting text and metadata from images.
     - The `document_type` parameter is optional, because `Ingestor` should detect the file type automatically.
 
     ```python
@@ -101,33 +101,12 @@ Instead of running NeMo Retriever Library locally, you can use NVCF to perform i
 
     !!! tip
 
-        For more Python examples, refer to [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
-
-
-## Run the Ray batch pipeline with `nemotron-parse`
-
-When the `nemotron-parse` model is used in the retriever batch pipeline, the `page-elements` and `nemotron-ocr` stages are skipped in the Ray pipeline, because their functionality is handled by the `nemotron-parse` actor. This behavior applies when you run the pipeline from the command line (for instance, by using `batch_pipeline.py` in library mode).
-
-To enable `nemotron-parse` in the batch pipeline, set each of the following options to a value greater than zero:
-​
-- `--nemotron-parse-workers` — The number of Ray workers to run `nemotron-parse`.
-- `--gpu-nemotron-parse` — The GPU fraction to allocate for each worker (for example, 0.25 to run four workers on a single GPU).
-- `--nemotron-parse-batch-size` — The batch size for each `nemotron-parse` request.
-
-
-For example, to run the batch pipeline on a directory of PDFs with `nemotron-parse` turned on, use the following code:
+        For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
 
-```shell
-python nemo_retriever/src/nemo_retriever/examples/batch_pipeline.py /path/to/pdfs \
-  --nemotron-parse-workers 16 \
-  --gpu-nemotron-parse .25 \
-  --nemotron-parse-batch-size 32
-```
 
-Replace `/path/to/pdfs` with the path to your input directory (for example, `/home/local/jdyer/datasets/jp20`).
 
 ## Related Topics
 
 - [Support Matrix](support-matrix.md)
-- [Troubleshoot NeMo Retriever Library](troubleshoot.md)
-- [Use the NeMo Retriever Library Python API](python-api-reference.md)
+- [Troubleshoot NeMo Retriever Extraction](troubleshoot.md)
+- [Use the Python API](nv-ingest-python-api.md)
diff --git a/docs/docs/extraction/nimclient.md b/docs/docs/extraction/nimclient.md
index cc1c402f2..fae98c068 100644
--- a/docs/docs/extraction/nimclient.md
+++ b/docs/docs/extraction/nimclient.md
@@ -5,14 +5,14 @@ This documentation demonstrates how to create custom NIM integrations for use in
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 The NimClient architecture consists of two main components:
 
 1. **NimClient**: The client class that handles communication with NIM endpoints via gRPC or HTTP protocols
 2. **ModelInterface**: An abstract base class that defines how to format input data, parse output responses, and process inference results for specific models
 
-For advanced usage patterns, see the existing model interfaces in `api/src/nemo_retriever/internal/primitives/nim/model_interface/`.
+For advanced usage patterns, see the existing model interfaces in `api/src/nv_ingest_api/internal/primitives/nim/model_interface/`.
 
 
 ## Quick Start
@@ -20,8 +20,8 @@ For advanced usage patterns, see the existing model interfaces in `api/src/nemo_
 ### Basic NimClient Creation
 
 ```python
-from nemo_retriever.util.nim import create_inference_client
-from nemo_retriever.internal.primitives.nim import ModelInterface
+from nv_ingest_api.util.nim import create_inference_client
+from nv_ingest_api.internal.primitives.nim import ModelInterface
 
 # Create a custom model interface (see examples below)
 model_interface = MyCustomModelInterface()
@@ -48,7 +48,7 @@ results = client.infer(data, model_name="your-model-name")
 
 ```python
 import os
-from nemo_retriever.util.nim import create_inference_client
+from nv_ingest_api.util.nim import create_inference_client
 
 # Use environment variables for configuration
 auth_token = os.getenv("NGC_API_KEY")
@@ -71,7 +71,7 @@ To integrate a new NIM, you need to create a custom `ModelInterface` subclass th
 ```python
 from typing import Dict, Any, List, Tuple, Optional
 import numpy as np
-from nemo_retriever.internal.primitives.nim import ModelInterface
+from nv_ingest_api.internal.primitives.nim import ModelInterface
 
 class MyCustomModelInterface(ModelInterface):
     """
@@ -305,7 +305,7 @@ class TextGenerationModelInterface(ModelInterface):
 
 ```python
 import base64
-from nemo_retriever.util.image_processing.transforms import numpy_to_base64
+from nv_ingest_api.util.image_processing.transforms import numpy_to_base64
 
 class ImageAnalysisModelInterface(ModelInterface):
     """Interface for image analysis NIMs (e.g., vision models)."""
@@ -382,8 +382,8 @@ class ImageAnalysisModelInterface(ModelInterface):
 ### Basic UDF with NimClient
 
 ```python
-from nemo_retriever.internal.primitives.control_message import IngestControlMessage
-from nemo_retriever.util.nim import create_inference_client
+from nv_ingest_api.internal.primitives.control_message import IngestControlMessage
+from nv_ingest_api.util.nim import create_inference_client
 import os
 
 def analyze_document_with_nim(control_message: IngestControlMessage) -> IngestControlMessage:
@@ -570,7 +570,7 @@ If memory issues persist, you can reduce the `NIM_TRITON_RATE_LIMIT` value — e
 import logging
 
 # Enable debug logging
-logging.getLogger("nemo_retriever.internal.primitives.nim").setLevel(logging.DEBUG)
+logging.getLogger("nv_ingest_api.internal.primitives.nim").setLevel(logging.DEBUG)
 
 # Test your model interface separately
 model_interface = MyCustomModelInterface()
diff --git a/docs/docs/extraction/notebooks.md b/docs/docs/extraction/notebooks.md
index c9a0ca480..4bf6fe7b4 100644
--- a/docs/docs/extraction/notebooks.md
+++ b/docs/docs/extraction/notebooks.md
@@ -4,30 +4,30 @@ To get started using [NeMo Retriever Library](overview.md), you can try one of t
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## Dataset Downloads for Benchmarking
 
-If you plan to run benchmarking or evaluation tests, you must download the [Benchmark Datasets (Bo20, Bo767, Bo10k)](https://github.com/NVIDIA/NeMo-Retriever/blob/main/evaluation/digital_corpora_download.ipynb) from Digital Corpora. This is a prerequisite for all benchmarking operations.
+If you plan to run benchmarking or evaluation tests, you must download the [Benchmark Datasets (Bo20, Bo767, Bo10k)](https://github.com/NVIDIA/nv-ingest/blob/main/evaluation/digital_corpora_download.ipynb) from Digital Corpora. This is a prerequisite for all benchmarking operations.
 
 ## Getting Started
 
 To get started with the basics, try one of the following notebooks:
 
-- [NeMo Retriever Library: CLI Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/cli_client_usage.ipynb)
-- [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb)
-- [How to add metadata to your documents and filter searches](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/metadata_and_filtered_search.ipynb)
-- [How to reindex a collection](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/reindex_example.ipynb)
+- [NV-Ingest: CLI Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/cli_client_usage.ipynb)
+- [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb)
+- [How to add metadata to your documents and filter searches](https://github.com/NVIDIA/nv-ingest/blob/main/examples/metadata_and_filtered_search.ipynb)
+- [How to reindex a collection](https://github.com/NVIDIA/nv-ingest/blob/main/examples/reindex_example.ipynb)
 
 
 For more advanced scenarios, try one of the following notebooks:
 
-- [Build a Custom Vector Database Operator](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/building_vdb_operator.ipynb)
-- [Try Enterprise RAG Blueprint](https://github.com/NVIDIA/NeMo-Retriever/blob/main/deploy/pdf-blueprint.ipynb)
-- [Evaluate bo767 retrieval recall accuracy with NeMo Retriever Library and Milvus](https://github.com/NVIDIA/NeMo-Retriever/blob/main/evaluation/bo767_recall.ipynb)
-- [Multimodal RAG with LangChain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb)
-- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb)
+- [Build a Custom Vector Database Operator](https://github.com/NVIDIA/nv-ingest/blob/main/examples/building_vdb_operator.ipynb)
+- [Try Enterprise RAG Blueprint](https://github.com/NVIDIA/nv-ingest/blob/main/deploy/pdf-blueprint.ipynb)
+- [Evaluate bo767 retrieval recall accuracy with NV-Ingest and Milvus](https://github.com/NVIDIA/nv-ingest/blob/main/evaluation/bo767_recall.ipynb)
+- [Multimodal RAG with LangChain](https://github.com/NVIDIA/nv-ingest/blob/main/examples/langchain_multimodal_rag.ipynb)
+- [Multimodal RAG with LlamaIndex](https://github.com/NVIDIA/nv-ingest/blob/main/examples/llama_index_multimodal_rag.ipynb)
 
 
 
diff --git a/docs/docs/extraction/nv-ingest-python-api.md b/docs/docs/extraction/nv-ingest-python-api.md
new file mode 100644
index 000000000..d4c29f2b5
--- /dev/null
+++ b/docs/docs/extraction/nv-ingest-python-api.md
@@ -0,0 +1,600 @@
+# Use the NeMo Retriever Library Python API
+
+The [NeMo Retriever Library](overview.md) Python API provides a simple and flexible interface for processing and extracting information from various document types, including PDFs.
+
+!!! note
+
+    NeMo Retriever Library is also known as NVIDIA Ingest.
+
+!!! tip
+
+    There is a Jupyter notebook available to help you get started with the Python API. For more information, refer to [Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+
+
+## Summary of Key Methods
+
+The main class in the Python API is `Ingestor`. 
+The `Ingestor` class provides an interface for building, managing, and running data ingestion jobs, enabling chainable task additions and job state tracking.
+
+### Ingestor Methods
+
+The following table describes methods of the `Ingestor` class.
+
+| Method         | Description                       |
+|----------------|-----------------------------------|
+| `caption`      | Extract captions from images within the document. |
+| `embed`        | Generate embeddings from extracted content. |
+| `extract`      | Add an extraction task (text, tables, charts, infographics). |
+| `files`        | Add document paths for processing. |
+| `ingest`       | Submit jobs and retrieve results synchronously. |
+| `load`         | Ensure files are locally accessible (downloads if needed). |
+| `save_to_disk` | Save ingestion results to disk instead of memory. |
+| `store`        | Persist extracted images/structured renderings to an fsspec-compatible backend. |
+| `split`        | Split documents into smaller sections for processing. For more information, refer to [Split Documents](chunking.md). |
+| `vdb_upload`   | Push extraction results to the vector database (LanceDB by default, or Milvus). For more information, refer to [Data Upload](data-store.md). |
+
+
+### Extract Method Options
+
+The following table describes the `extract_method` options.
+
+| Value                | Status       | Description                                      |
+|----------------------|--------------|--------------------------------------------------|
+| `audio`              | Current      | Extract information from audio files.            |
+| `nemotron_parse`     | Current      | NVIDIA Nemotron Parse extraction.                |
+| `ocr`                | Current      | Bypasses native text extraction and processes every page using the full OCR pipeline. Use this for fully scanned documents or when native text is corrupt. |
+| `pdfium`             | Current      | Uses PDFium to extract native text. This is the default. This is the fastest method but does not capture text from scanned images/pages. |
+| `pdfium_hybrid`      | Current      | A hybrid approach that uses PDFium for pages with native text and automatically switches to OCR for scanned pages. This offers a robust balance of speed and coverage for mixed documents. |
+| `adobe`              | Deprecated   | Adobe PDF Services API extraction.               |
+| `haystack`           | Deprecated   | Haystack-based extraction.                       |
+| `llama_parse`        | Deprecated   | LlamaParse extraction.                           |
+| `tika`               | Deprecated   | Apache Tika extraction.                          |
+| `unstructured_io`    | Deprecated   | Unstructured.io API extraction.                  |
+| `unstructured_local` | Deprecated   | Local Unstructured extraction.                   |
+
+
+### Caption images and control reasoning
+
+The caption task can call a vision-language model (VLM) with the following optional controls:
+
+- `prompt` (string): User prompt for captioning. Defaults to `"Caption the content of this image:"`.
+- `reasoning` (boolean): Enable reasoning mode. `True` enables reasoning, `False` disables it. Defaults to `None` (service default, typically disabled).
+
+!!! note
+    The `reasoning` parameter maps to the VLM's system prompt: `reasoning=True` sets the system prompt to `"/think"`, and `reasoning=False` sets it to `"/no_think"` per the [Nemotron Nano 12B v2 VL model card](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard).
+
+Example:
+```python
+from nv_ingest_client.client.interface import Ingestor
+
+ingestor = (
+    Ingestor()
+    .files("path/to/doc-with-images.pdf")
+    .extract(extract_images=True)
+    .caption(
+        prompt="Caption the content of this image:",
+        reasoning=True,  # Enable reasoning
+    )
+    .ingest()
+)
+```
+
+
+
+## Track Job Progress
+
+For large document batches, you can enable a progress bar by setting `show_progress` to `True`.
+Use the following code.
+
+```python
+# Return only successes
+results = ingestor.ingest(show_progress=True)
+
+print(len(results), "successful documents")
+```
+
+
+
+## Capture Job Failures
+
+You can capture job failures by setting `return_failures` to `True`.
+Use the following code.
+
+```python
+# Return both successes and failures
+results, failures = ingestor.ingest(show_progress=True, return_failures=True)
+
+print(f"{len(results)} successful docs; {len(failures)} failures")
+
+if failures:
+    print("Failures:", failures[:1])
+```
+
+When you use the `vdb_upload` method, uploads are performed after ingestion completes. 
+The behavior of the upload depends on the following values of `return_failures`:
+
+- **False** – If any job fails, the `ingest` method raises a runtime error and does not upload any data (all-or-nothing data upload). This is the default setting.
+- **True** – If any jobs succeed, the results from those jobs are uploaded, and no errors are raised (partial data upload). The `ingest` method returns a failures object that contains the details for any jobs that failed. You can inspect the failures object and selectively retry or remediate the failed jobs.
+
+
+The following example uploads data to the vector database (LanceDB by default; use `milvus_uri` for Milvus) and returns any failures.
+
+```python
+ingestor = (
+    Ingestor(client=client)
+    .files(["/path/doc1.pdf", "/path/doc2.pdf"])
+    .extract()
+    .embed()
+    .vdb_upload(collection_name="my_collection", milvus_uri="milvus.db")
+)
+
+# Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
+results, failures = ingestor.ingest(return_failures=True)
+
+print(f"Uploaded {len(results)} successful docs; {len(failures)} failures")
+
+if failures:
+    print("Failures:", failures[:1])
+```
+
+
+
+## Quick Start: Extracting PDFs
+
+The following example demonstrates how to initialize `Ingestor`, load a PDF file, and extract its contents.
+The `extract` method enables different types of data to be extracted.
+
+### Extract a Single PDF
+
+Use the following code to extract a single PDF file.
+
+```python
+from nv_ingest_client.client.interface import Ingestor
+
+# Initialize Ingestor with a local PDF file
+ingestor = Ingestor().files("path/to/document.pdf")
+
+# Extract text, tables, and images
+result = ingestor.extract().ingest()
+
+print(result)
+```
+
+### Extract Multiple PDFs
+
+Use the following code to process multiple PDFs at one time.
+
+```python
+ingestor = Ingestor().files(["path/to/doc1.pdf", "path/to/doc2.pdf"])
+
+# Extract content from all PDFs
+result = ingestor.extract().ingest()
+
+for doc in result:
+    print(doc)
+```
+
+### Extract Specific Elements from PDFs
+
+By default, the `extract` method extracts all supported content types. 
+You can customize the extraction behavior by using the following code.
+
+```python
+ingestor = ingestor.extract(
+    extract_text=True,  # Extract text
+    text_depth="page",
+    extract_tables=False,  # Skip table extraction
+    extract_charts=True,  # Extract charts
+    extract_infographics=True,  # Extract infographic images
+    extract_images=False  # Skip image extraction
+)
+```
+
+### Extract Non-standard Document Types
+
+Use the following code to extract text from `.md`, `.sh`, and `.html` files.
+
+```python
+ingestor = Ingestor().files(["path/to/doc1.md", "path/to/doc2.html"])
+
+ingestor = ingestor.extract(
+    extract_text=True,  # Only extract text
+    extract_tables=False,
+    extract_charts=False,
+    extract_infographics=False,
+    extract_images=False
+)
+
+result = ingestor.ingest()
+```
+
+
+### Extract with Custom Document Type
+
+Use the following code to specify a custom document type for extraction.
+
+```python
+ingestor = ingestor.extract(document_type="pdf")
+```
+
+
+
+### Extract Office Documents (DOCX and PPTX)
+
+NeMo Retriever Library offers the following two extraction methods for Microsoft Office documents (.docx and .pptx), to balance performance and layout fidelity:
+
+- Native extraction
+- Render as PDF
+
+#### Native Extraction (Default)
+
+The default methods (`python_docx` and `python_pptx`) extract content directly from the file structure.
+This is generally faster, but you might lose some visual layout information.
+
+```python
+# Uses default native extraction
+ingestor = Ingestor().files(["report.docx", "presentation.pptx"]).extract()
+```
+
+#### Render as PDF
+
+The `render_as_pdf` method uses [LibreOffice](https://www.libreoffice.org/) to convert the document to a PDF before extraction.
+We recommend this approach when preserving the visual layout is critical, or when you need to extract visual elements, such as tables and charts, that are better detected by using computer vision on a rendered page.
+
+```python
+ingestor = Ingestor().files(["report.docx", "presentation.pptx"])
+
+ingestor = ingestor.extract(
+    extract_text=True,
+    extract_tables=True,
+    extract_charts=True,
+    extract_infographics=True,
+    extract_method="render_as_pdf"  # Convert to PDF first for improved visual extraction
+)
+```
+
+
+
+### PDF Extraction Strategies
+
+NeMo Retriever Library offers specialized strategies for PDF processing to handle various document qualities.
+You can select the strategy by using the following `extract_method` parameter values. 
+For the full list of `extract_method` options, refer to [Extract Method Options](#extract-method-options).
+
+- **ocr** – Bypasses native text extraction and processes every page using the full OCR pipeline. Use this for fully scanned documents or when native text is corrupt.
+- **pdfium** – Uses PDFium to extract native text. This is the default. This is the fastest method but does not capture text from scanned images/pages.
+- **pdfium_hybrid** – A hybrid approach that uses PDFium for pages with native text and automatically switches to OCR for scanned pages. This offers a robust balance of speed and coverage for mixed documents.
+
+```python
+ingestor = Ingestor().files("mixed_content.pdf")
+
+# Use hybrid mode for mixed digital/scanned PDFs
+ingestor = ingestor.extract(
+    document_type="pdf",
+    extract_method="pdfium_hybrid",
+)
+results = ingestor.ingest()
+```
+
+
+
+## Work with Large Datasets: Save to Disk
+
+By default, NeMo Retriever Library stores the results from every document in system memory (RAM). 
+When you process a very large dataset with thousands of documents, you might encounter an Out-of-Memory (OOM) error. 
+The `save_to_disk` method configures the extraction pipeline to write the output for each document to a separate JSONL file on disk.
+
+
+### Basic Usage: Save to a Directory
+
+To save results to disk, simply chain the `save_to_disk` method to your ingestion task.
+When you use `save_to_disk`, the `ingest` method returns a list of `LazyLoadedList` objects,
+which are memory-efficient proxies that read from the result files on disk.
+
+In the following example, the results are saved to a directory named `my_ingest_results`. 
+You are responsible for managing the created files.
+
+```python
+ingestor = Ingestor().files("large_dataset/*.pdf")
+
+# Use save_to_disk to configure the ingestor to save results to a specific directory.
+# Set cleanup=False to ensure that the directory is not deleted by any automatic process.
+ingestor.save_to_disk(output_directory="./my_ingest_results", cleanup=False)  # Offload results to disk to prevent OOM errors
+
+# 'results' is a list of LazyLoadedList objects that point to the new jsonl files.
+results = ingestor.extract().ingest()
+
+print("Ingestion results saved in ./my_ingest_results")
+# You can now iterate over the results or inspect the files directly.
+```
+
+### Managing Disk Space with Automatic Cleanup
+
+When you use `save_to_disk`, NeMo Retriever Library creates intermediate files. 
+For workflows where these files are temporary, NeMo Retriever Library provides two automatic cleanup mechanisms.
+
+- **Directory Cleanup with Context Manager** – Although it is not required for general use, you can use `Ingestor` as a context manager (`with` statement). This enables automatic cleanup of the entire output directory when `save_to_disk(cleanup=True)` is set (the default).
+
+- **File Purge After VDB Upload** – The `vdb_upload` method includes a `purge_results_after_upload` parameter (a `bool` that defaults to `True`). After a successful VDB upload, this feature deletes the individual `.jsonl` files that were just uploaded.
+
+You can also configure the output directory by using the `NV_INGEST_CLIENT_SAVE_TO_DISK_OUTPUT_DIRECTORY` environment variable.
+
+
+#### Example (Fully Automatic Cleanup)
+
+Fully automatic cleanup is the recommended pattern for ingest-and-upload workflows where the intermediate files are no longer needed.
+The entire process is temporary, and no files are left on disk.
+The following example includes automatic file purge. 
+
+```python
+# After the 'with' block finishes, 
+# the temporary directory and all its contents are automatically deleted.
+
+with (
+    Ingestor()
+    .files("/path/to/large_dataset/*.pdf")
+    .extract()
+    .embed()
+    .save_to_disk()  # cleanup=True is the default, enables directory deletion on exit
+    .vdb_upload()  # purge_results_after_upload=True is the default, deletes files after upload
+) as ingestor:
+    results = ingestor.ingest()
+
+```
+
+
+#### Example (Preserve Results on Disk)
+
+In scenarios where you need to inspect or use the intermediate `jsonl` files, you can disable the cleanup features. 
+The following example disables automatic file purge. 
+
+```python
+# After the 'with' block finishes, 
+# the './permanent_results' directory and all jsonl files are preserved for inspection or other uses.
+
+with (
+    Ingestor()
+    .files("/path/to/large_dataset/*.pdf")
+    .extract()
+    .embed()
+    .save_to_disk(output_directory="./permanent_results", cleanup=False)  # Specify a directory and disable directory-level cleanup
+    .vdb_upload(purge_results_after_upload=False)  # Disable automatic file purge after the VDB upload
+) as ingestor:
+    results = ingestor.ingest()
+```
+
+
+
+## Extract Captions from Images
+
+The `caption` method generates image captions by using a VLM. 
+You can use this to generate descriptions of unstructured images, infographics, and other visual content extracted from documents.
+
+!!! note
+
+    To use the `caption` option, enable the `vlm` profile when you start the NeMo Retriever Library services. The default model used by `caption` is `nvidia/llama-3.1-nemotron-nano-vl-8b-v1`. For more information, refer to [Profile Information in the Quickstart Guide](quickstart-guide.md#profile-information).
+
+### Basic Usage
+
+!!! tip
+
+    You can configure and use other vision-language models for image captioning by specifying a different `model_name` and `endpoint_url` in the `caption` method. Choose a VLM that best fits your use case.
+
+```python
+ingestor = ingestor.caption()
+```
+
+To specify a different API endpoint, pass additional parameters to `caption`.
+
+```python
+ingestor = ingestor.caption(
+    endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
+    model_name="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
+    api_key="nvapi-"
+)
+```
+
+### Captioning Infographics
+
+Infographics are complex visual elements that combine text, charts, diagrams, and images to convey information.
+VLMs are particularly effective at generating descriptive captions for infographics because they can understand and summarize the visual content.
+
+The following example extracts and captions infographics from a document:
+
+```python
+ingestor = (
+    Ingestor()
+    .files("document_with_infographics.pdf")
+    .extract(
+        extract_text=True,
+        extract_tables=True,
+        extract_charts=True,
+        extract_infographics=True,  # Extract infographics for captioning
+        extract_images=False,
+    )
+    .caption(
+        prompt="Describe the content and key information in this infographic:",
+        reasoning=True,  # Enable reasoning for more detailed captions
+    )
+)
+results = ingestor.ingest()
+```
+
+!!! tip
+
+    For more information about working with infographics and multimodal content, refer to [Use Multimodal Embedding](vlm-embed.md).
+
+### Caption Images and Control Reasoning
+
+The caption task can call a VLM with optional prompt and system prompt overrides:
+
+- `caption_prompt` (user prompt): defaults to `"Caption the content of this image:"`.
+- `caption_system_prompt` (system prompt): defaults to `"/no_think"` (reasoning off). Set to `"/think"` to enable reasoning per the Nemotron Nano 12B v2 VL model card.
+
+Example:
+```python
+from nv_ingest_client.client.interface import Ingestor
+
+ingestor = (
+    Ingestor()
+    .files("path/to/doc-with-images.pdf")
+    .extract(extract_images=True)
+    .caption(
+        prompt="Caption the content of this image:",
+        system_prompt="/think",  # or "/no_think"
+    )
+    .ingest()
+)
+```
+
+
+
+## Extract Embeddings
+
+The `embed` method in the library generates text embeddings for document content.
+
+```python
+ingestor = ingestor.embed()
+```
+
+!!! note
+
+    By default, `embed` uses the [llama-3.2-nv-embedqa-1b-v2](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2) model.
+
+To use a different embedding model, such as [nv-embedqa-e5-v5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5), specify a different `model_name` and `endpoint_url`.
+
+```python
+ingestor = ingestor.embed(
+    endpoint_url="https://integrate.api.nvidia.com/v1",
+    model_name="nvidia/nv-embedqa-e5-v5",
+    api_key="nvapi-"
+)
+```
+
+## Store Extracted Images
+
+The `store` method exports decoded images (unstructured images as well as structured renderings such as tables and charts) to any fsspec-compatible URI so you can inspect or serve the generated visuals.
+
+```python
+ingestor = ingestor.store(
+    structured=True,   # persist table/chart renderings
+    images=True,       # persist unstructured images
+    storage_uri="file:///workspace/data/artifacts/store/images",  # Supports file://, s3://, etc.
+    public_base_url="https://assets.example.com/images"  # Optional CDN/base URL for download links
+)
+```
+
+### Store Method Parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `structured` | bool | Persist table and chart renderings. Default: `False` |
+| `images` | bool | Persist unstructured images extracted from documents. Default: `False` |
+| `storage_uri` | str | fsspec-compatible URI (`file://`, `s3://`, `gs://`, etc.). Defaults to server-side `IMAGE_STORAGE_URI` environment variable. |
+| `public_base_url` | str | Optional HTTP(S) base URL for serving stored images. When set, metadata includes public download links. |
+
+### Supported Storage Backends
+
+The `store` task uses [fsspec](https://filesystem-spec.readthedocs.io/) for storage, supporting multiple backends:
+
+| Backend | URI Format | Example |
+|---------|------------|---------|
+| Local filesystem | `file://` | `file:///workspace/data/images` |
+| Amazon S3 | `s3://` | `s3://my-bucket/extracted-images` |
+| Google Cloud Storage | `gs://` | `gs://my-bucket/images` |
+| Azure Blob Storage | `abfs://` | `abfs://container@account.dfs.core.windows.net/images` |
+| MinIO (S3-compatible) | `s3://` | `s3://nv-ingest/artifacts/store/images` (default) |
+
+!!! tip
+
+    `storage_uri` defaults to the server-side `IMAGE_STORAGE_URI` environment variable (commonly `s3://nv-ingest/...`). If you change that variable—for example to a host-mounted `file://` path—restart the runtime so the container picks up the new value.
+
+When `public_base_url` is provided, the metadata returned from `ingest()` surfaces that HTTP(S) link while still recording the underlying storage URI. Leave it unset when the storage endpoint itself is already publicly reachable.
+
+### Docker Volume Mounts for Local Storage
+
+When running the pipeline via Docker and using `file://` storage URIs, the path must be within a mounted volume for files to persist on the host machine.
+
+By default, the `docker-compose.yaml` mounts a single volume:
+
+```yaml
+volumes:
+  - ${DATASET_ROOT:-./data}:/workspace/data
+```
+
+This means:
+
+| Container Path | Host Path | Works with `file://`? |
+|----------------|-----------|----------------------|
+| `/workspace/data/...` | `${DATASET_ROOT}/...` (default: `./data/...`) | ✅ Yes |
+| `/tmp/...` | (container only) | ❌ No - files lost on restart |
+| `/raid/custom/path` | (container only) | ❌ No - path not mounted |
+
+**Example: Save to host filesystem**
+
+```python
+# Files save to ./data/artifacts/images on the host
+ingestor = ingestor.store(
+    structured=True,
+    images=True,
+    storage_uri="file:///workspace/data/artifacts/images"
+)
+```
+
+**Example: Use a custom host directory**
+
+```bash
+# Set DATASET_ROOT before starting services
+export DATASET_ROOT=/raid/my-project/nv-ingest-data
+docker compose up -d
+```
+
+```python
+# Now /workspace/data maps to /raid/my-project/nv-ingest-data
+ingestor = ingestor.store(
+    structured=True,
+    images=True,
+    storage_uri="file:///workspace/data/extracted-images"
+)
+# Files save to /raid/my-project/nv-ingest-data/extracted-images on host
+```
+
+For more information on environment variables, refer to [Environment Variables](environment-config.md).
+
+
+
+## Extract Audio
+
+Use the following code to extract mp3 audio content.
+
+```python
+from nv_ingest_client.client import Ingestor
+
+ingestor = Ingestor().files("audio_file.mp3")
+
+ingestor = ingestor.extract(
+    document_type="mp3",
+    extract_text=True,
+    extract_tables=False,
+    extract_charts=False,
+    extract_images=False,
+    extract_infographics=False,
+).split(
+    tokenizer="meta-llama/Llama-3.2-1B",
+    chunk_size=150,
+    chunk_overlap=0,
+    params={"split_source_types": ["mp3"], "hf_access_token": "hf_***"},
+)
+
+results = ingestor.ingest()
+```
+
+
+
+## Related Topics
+
+- [Split Documents](chunking.md)
+- [Troubleshoot NeMo Retriever Extraction](troubleshoot.md)
+- [Advanced Visual Parsing](nemoretriever-parse.md)
+- [Use NeMo Retriever Library with Riva for Audio Processing](audio.md)
+- [Use Multimodal Embedding](vlm-embed.md)
diff --git a/docs/docs/extraction/nv-ingest_cli.md b/docs/docs/extraction/nv-ingest_cli.md
new file mode 100644
index 000000000..718e656c2
--- /dev/null
+++ b/docs/docs/extraction/nv-ingest_cli.md
@@ -0,0 +1,175 @@
+# Use the NV-Ingest Command Line Interface
+
+After you install the Python dependencies, you can use the [NV-Ingest](overview.md) command line interface (CLI). 
+To use the CLI, use the `nv-ingest-cli` command.
+
+To check the version of the CLI that you have installed, run the following command.
+
+```bash
+nv-ingest-cli --version
+```
+
+To get a list of the current CLI commands and their options, run the following command.
+
+```bash
+nv-ingest-cli --help
+```
+
+!!! tip
+
+    There is a Jupyter notebook available to help you get started with the CLI. For more information, refer to [CLI Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/cli_client_usage.ipynb).
+
+
+## Examples
+
+Use the following code examples to submit a document to the `nv-ingest-ms-runtime` service.
+
+Each of the following commands can be run from the host machine, or from within the `nv-ingest-ms-runtime` container.
+
+- Host: `nv-ingest-cli ...`
+- Container: `nv-ingest-cli ...`
+
+
+### Example: Text File With No Splitting
+
+To submit a text file with no splitting, run the following code.
+
+!!! note
+
+    You receive a response that contains a single document, which is the entire text file. The data that is returned is wrapped in the appropriate [metadata structure](content-metadata.md).
+
+```bash
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --client_host=localhost \
+  --client_port=7670
+```
+
+
+### Example: PDF File With Splitting Only
+
+To submit a .pdf file with only a splitting task, run the following code.
+
+```bash
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --output_directory ./processed_docs \
+  --task='split' \
+  --client_host=localhost \
+  --client_port=7670
+```
+
+
+### Example: PDF File With Splitting and Extraction
+
+To submit a .pdf file with both a splitting task and an extraction task, run the following code.
+
+!!! note
+    Currently, `split` works only with the `pdfium`, `nemotron_parse`, and `unstructured_io` extraction methods.
+
+```bash
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --output_directory ./processed_docs \
+  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
+  --task='extract:{"document_type": "docx", "extract_method": "python_docx"}' \
+  --task='split' \
+  --client_host=localhost \
+  --client_port=7670
+
+```
+
+
+### Example: PDF File With Custom Split Page Count
+
+To submit a PDF file with a custom split page count, use the `--pdf_split_page_count` option. 
+This allows you to control how many pages are included in each PDF chunk during processing.
+
+!!! note
+    The `--pdf_split_page_count` option requires using the V2 API (set via `--api_version v2` or environment variable `NV_INGEST_API_VERSION=v2`).
+    It accepts values between 1 and 128 pages per chunk. If you omit it, the server default applies (typically 32).
+    Smaller chunks provide more parallelism but increase overhead, while larger chunks reduce overhead but limit concurrency.
+
+```bash
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --output_directory ./processed_docs \
+  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
+  --pdf_split_page_count 64 \
+  --api_version v2 \
+  --client_host=localhost \
+  --client_port=7670
+```
+
+Alternatively, you can use an environment variable to set the API version:
+
+```bash
+export NV_INGEST_API_VERSION=v2
+
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --output_directory ./processed_docs \
+  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": "true"}' \
+  --pdf_split_page_count 64 \
+  --client_host=localhost \
+  --client_port=7670
+```
+
+### Example: Caption Images With Reasoning Control
+
+To invoke image captioning and control reasoning, run the following code.
+
+```bash
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_images": "true"}' \
+  --task='caption:{"prompt": "Caption the content of this image:", "reasoning": true}' \
+  --client_host=localhost \
+  --client_port=7670
+```
+
+- `reasoning` (boolean): Set to `true` to enable reasoning, or `false` to disable it. If omitted, the service default applies (typically disabled).
+- Ensure that the VLM caption profile or service is running, or that the task points to the public build endpoint; otherwise, the caption task is skipped.
+
+!!! tip
+
+    The caption service uses a default VLM, which you can override by selecting another vision-language model to better match your image captioning needs. For more information, refer to [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images).
+
+
+### Example: Process a Dataset
+
+To submit a dataset for processing, run the following code. 
+To create a dataset, refer to [Command Line Dataset Creation with Enumeration and Sampling](#command-line-dataset-creation-with-enumeration-and-sampling).
+
+```shell
+nv-ingest-cli \
+  --dataset dataset.json \
+  --output_directory ./processed_docs \
+  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
+  --client_host=localhost \
+  --client_port=7670
+
+```
+
+Submit a PDF file with extraction tasks and upload extracted images to MinIO.
+
+```bash
+nv-ingest-cli \
+  --doc ./data/test.pdf \
+  --output_directory ./processed_docs \
+  --task='extract:{"document_type": "pdf", "extract_method": "pdfium"}' \
+  --client_host=localhost \
+  --client_port=7670
+
+```
+
+
+## Command Line Dataset Creation with Enumeration and Sampling
+
+The `gen_dataset.py` script samples files from a specified source directory according to defined proportions and a total size target. 
+It offers options for caching the file list, outputting a sampled file list, and validating the output.
+
+```shell
+python ./src/util/gen_dataset.py --source_directory=./data --size=1GB --sample pdf=60 --sample txt=40 --output_file \
+  dataset.json --validate-output
+```
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index 263204ddc..8891d57e6 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -6,21 +6,21 @@ to find, contextualize, and extract text, tables, charts and infographics that y
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well defined JSON schema. 
 From there, NeMo Retriever Library can optionally manage computation of embeddings for the extracted content, 
-and optionally manage storing into a vector database [Milvus](https://milvus.io/).
+and optionally manage storing into a vector database ([LanceDB](https://lancedb.com/) by default, or [Milvus](https://milvus.io/)).
 
 !!! note
 
-    Cached and Deplot are deprecated. Instead, NeMo Retriever Library now uses the yolox-graphic-elements NIM. With this change, you should now be able to run NeMo Retriever Library on a single 24GB A10G or better GPU. If you want to use the old pipeline, with Cached and Deplot, use the [NeMo Retriever Library 24.12.1 release](https://github.com/NVIDIA/NeMo-Retriever/tree/24.12.1).
+    Cached and Deplot are deprecated. Instead, NeMo Retriever Library now uses the yolox-graphic-elements NIM. With this change, you should now be able to run NeMo Retriever Library on a single 24GB A10G or better GPU. If you want to use the old pipeline, with Cached and Deplot, use the [NeMo Retriever Library 24.12.1 release](https://github.com/NVIDIA/nv-ingest/tree/24.12.1).
 
 
 
 ## What NeMo Retriever Library Is ✔️
 
-The following diagram shows the NeMo Retriever Library pipeline.
+The following diagram shows the retriever pipeline.
 
 ![Overview diagram](images/overview-extraction.png)
 
diff --git a/docs/docs/extraction/prerequisites.md b/docs/docs/extraction/prerequisites.md
index 902c499c8..a5e0512e1 100644
--- a/docs/docs/extraction/prerequisites.md
+++ b/docs/docs/extraction/prerequisites.md
@@ -4,7 +4,7 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure the followi
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 
diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 3dae43e66..6f7ab7194 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -5,22 +5,22 @@ This guide helps you get started using [NeMo Retriever Library](overview.md) in
 
 ## Step 1: Start Containers
 
-Use the provided [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml) to start all needed services with a few commands.
+Use the provided [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) to start all needed services with a few commands.
 
 !!! warning
 
     NIM containers on their first startup can take 10-15 minutes to pull and fully load models.
 
 
-If you prefer, you can run on Kubernetes by using [our Helm chart](https://github.com/NVIDIA/NeMo-Retriever/blob/main/helm/README.md). Also, there are [additional environment variables](environment-config.md) you can configure.
+If you prefer, you can run on Kubernetes by using [our Helm chart](https://github.com/NVIDIA/nv-ingest/blob/main/helm/README.md). Also, there are [additional environment variables](environment-config.md) you can configure.
 
 a. Git clone the repo:
 
-    `git clone https://github.com/nvidia/NeMo-Retriever`
+    `git clone https://github.com/nvidia/nv-ingest`
 
 b. Change the directory to the cloned repo by running the following code.
    
-    `cd NeMo-Retriever`.
+    `cd nv-ingest`
 
 c. [Generate API keys](ngc-api-key.md) and authenticate with NGC with the `docker login` command.
 
@@ -48,13 +48,17 @@ e. Make sure that NVIDIA is set as your default container runtime before you run
 
     `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`
 
-f. Start core services. This example uses the retrieval profile.  For more information about other profiles, see [Profile Information](#profile-information).
+f. Start core services. By default, the pipeline uses **LanceDB** as the vector database (embedded, in-process), so no extra Docker profile is required. To use **Milvus** instead, start with the retrieval profile, as this example does. For more information about other profiles, see [Profile Information](#profile-information).
 
     `docker compose --profile retrieval up`
 
+    !!! tip "LanceDB (default)"
+
+        To use the default LanceDB backend, you can run `docker compose up` without `--profile retrieval`. LanceDB runs in-process and does not require Milvus, etcd, or MinIO. For details, see [Data Upload](data-store.md).
+
     !!! tip
 
-        By default, we have [configured log levels to be verbose](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml). It's possible to observe service startup proceeding. You will notice a lot of log messages. Disable verbose logging by configuring `NIM_TRITON_LOG_VERBOSE=0` for each NIM in [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml).
+        By default, we have [configured log levels to be verbose](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml). It's possible to observe service startup proceeding. You will notice a lot of log messages. Disable verbose logging by configuring `NIM_TRITON_LOG_VERBOSE=0` for each NIM in [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml).
 
     !!! tip
 
@@ -82,37 +86,60 @@ h. Run the command `docker ps`. You should see output similar to the following.
 
     ```
     CONTAINER ID  IMAGE                                            COMMAND                 CREATED         STATUS                  PORTS            NAMES
+    1b885f37c991  nvcr.io/nvidia/nemo-microservices/nv-ingest:...  "/usr/bin/tini -- /w…"  7 minutes ago   Up 7 minutes (healthy)  0.0.0.0:7670...  nv-ingest-nv-ingest-ms-runtime-1
+    14ef31ed7f49  milvusdb/milvus:v2.5.3-gpu                       "/tini -- bash -c 's…"  7 minutes ago   Up 7 minutes (healthy)  0.0.0.0:9091...  milvus-standalone
+    dceaf36cc5df  otel/opentelemetry-collector-contrib:...         "/otelcol-contrib --…"  7 minutes ago   Up 7 minutes            0.0.0.0:4317...  nv-ingest-otel-collector-1
+    5bd0b48eb71b  nvcr.io/nim/nvidia/nemoretriever-graphic-ele...  "/opt/nvidia/nvidia_…"  7 minutes ago   Up 7 minutes            0.0.0.0:8003...  nv-ingest-graphic-elements-1
+    daf878669036  nvcr.io/nim/nvidia/nemoretriever-ocr-v1:1.2.1    "/opt/nvidia/nvidia_…"  7 minutes ago   Up 7 minutes            0.0.0.0:8009...  nv-ingest-ocr-1
+    216bdf11c566  nvcr.io/nim/nvidia/nemoretriever-page-elements-v3:1.7.0  "/opt/nvidia/nvidia_…"  7 minutes ago   Up 7 minutes            0.0.0.0:8000...  nv-ingest-page-elements-1
+    aee9580b0b9a  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.0  "/opt/nvidia/nvidia_…"  7 minutes ago   Up 7 minutes            0.0.0.0:8012...  nv-ingest-embedding-1
+    178a92bf6f7f  nvcr.io/nim/nvidia/nemoretriever-table-struc...  "/opt/nvidia/nvidia_…"  7 minutes ago   Up 7 minutes            0.0.0.0:8006...  nv-ingest-table-structure-1
+    7ddbf7690036  openzipkin/zipkin                                "start-zipkin"          7 minutes ago   Up 7 minutes (healthy)  9410/tcp...      nv-ingest-zipkin-1
+    b73bbe0c202d  minio/minio:RELEASE.2023-03-20T20-16-18Z         "/usr/bin/docker-ent…"  7 minutes ago   Up 7 minutes (healthy)  0.0.0.0:9000...  minio
+    97fa798dbe4f  prom/prometheus:latest                           "/bin/prometheus --w…"  7 minutes ago   Up 7 minutes            0.0.0.0:9090...  nv-ingest-prometheus-1
+    f17cb556b086  grafana/grafana                                  "/run.sh"               7 minutes ago   Up 7 minutes            0.0.0.0:3000...  grafana-service
+    3403c5a0e7be  redis/redis-stack                                "/entrypoint.sh"        7 minutes ago   Up 7 minutes            0.0.0.0:6379...  nv-ingest-redis-1
+    ```
+
+
+## Step 2: Install Python Dependencies
+
+You can interact with the service from the host, or by using `docker exec` to run commands in the runtime container.
+
+To interact from the host, you'll need a Python environment that has the client dependencies installed.
+
+```
 uv venv --python 3.12 nv-ingest-dev
 source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.3.0 nv-ingest-api==26.3.0 nv-ingest-client==26.3.0
+uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2
 ```
 
 !!! tip
 
-    To confirm that you have activated your Conda environment, run `which pip` and `which python`, and confirm that you see `nemo_retriever` in the result. You can do this before any pip or python command that you run.
+    To confirm that you have activated your virtual environment, run `which pip` and `which python`, and confirm that you see `nv-ingest-dev` in the result. You can do this before any pip or python command that you run.
 
 
 !!! note
 
-Interaction from the host requires the appropriate port to be exposed from the `nemo-retriever` container, as defined in the `docker-compose.yaml` file. If you prefer, you can disable this port and interact directly with the NeMo Retriever Library from within its container.
+    Interaction from the host requires the appropriate port to be exposed from the runtime container, as defined in the `docker-compose.yaml` file. If you prefer, you can disable this port and interact directly with the service from within its container.
 
 To work inside the container, run the following code.
 
 ```bash
-docker exec -it nemo-retriever-ms-runtime-1 bash
+docker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash
 ```
-This command opens a shell in the `/workspace` directory, where the `DATASET_ROOT` from your `.env` file is mounted at `./data`. The pre-activated `nemo_retriever_runtime` conda environment includes all necessary Python client libraries. You should see a prompt similar to the following.
+This command opens a shell in the `/workspace` directory, where the `DATASET_ROOT` from your `.env` file is mounted at `./data`. The pre-created `nv_ingest_runtime` virtual environment includes all necessary Python client libraries. You should see a prompt similar to the following.
 
 ```bash
-(nemo_retriever_runtime) root@your-computer-name:/workspace#
+(nv_ingest_runtime) root@your-computer-name:/workspace#
 ```
-From this prompt, you can run the `nemo-retriever` CLI and Python examples.
+From this prompt, you can run the CLI and Python examples.
 
-Because many service URIs default to localhost, running inside the `nemo-retriever` container also requires that you specify URIs manually so that services can communicate across containers on the internal Docker network. See the example following for how to set the `milvus_uri`.
+Because many service URIs default to localhost, running inside the runtime container also requires that you specify URIs manually so that services can communicate across containers on the internal Docker network. When using Milvus, see the following example for how to set the `milvus_uri`. With the default LanceDB backend, no extra URI configuration is needed.
 
-## Step 2: Ingest Documents
+## Step 3: Ingest Documents
 
-You can submit jobs programmatically in Python or using the [NeMo Retriever Library CLI](cli-reference.md).
+You can submit jobs programmatically in Python or using the [CLI](nv-ingest_cli.md).
 
 The following examples demonstrate how to extract text, charts, tables, and images:
 
@@ -126,14 +153,14 @@ The following examples demonstrate how to extract text, charts, tables, and imag
 
 !!! tip
 
-    For more Python examples, refer to [NeMo Retriever Library: Python Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+    For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
 
 <a id="ingest_python_example"></a>
 ```python
 import logging, os, time
-from nemo_retriever.client import Ingestor, NemoRetrieverClient
-from nemo_retriever.util.process_json_files import ingest_json_results_to_blob
-client = NemoRetrieverClient(                                                                         
+from nv_ingest_client.client import Ingestor, NvIngestClient
+from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob
+client = NvIngestClient(                                                                         
     message_client_port=7670,                                                               
     message_client_hostname="localhost"        
 )                                                                 
@@ -263,15 +290,15 @@ image_caption:[]
 
 ```
 
-### Using the `nemo-retriever` CLI
+### Using the `nv-ingest-cli`
 
 !!! tip
 
-    There is a Jupyter notebook available to help you get started with the CLI. For more information, refer to [CLI Client Quick Start Guide](https://github.com/NVIDIA/NeMo-Retriever/blob/main/client/client_examples/examples/cli_client_usage.ipynb).
+    There is a Jupyter notebook available to help you get started with the CLI. For more information, refer to [CLI Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/cli_client_usage.ipynb).
 
 <a id="ingest_cli_example"></a>
 ```shell
-nemo-retriever \
+nv-ingest-cli \
   --doc ./data/multimodal_test.pdf \
   --output_directory ./processed_docs \
   --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_tables": "true", "extract_images": "true", "extract_charts": "true"}' \
@@ -284,54 +311,54 @@ You should see output that indicates the document processing status followed by
 ```
 None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
 [nltk_data] Downloading package punkt_tab to
-[nltk_data]     /raid/jdyer/miniforge3/envs/nemo-retriever-
+[nltk_data]     /raid/jdyer/miniforge3/envs/nv-ingest-
 [nltk_data]     dev/lib/python3.10/site-
 [nltk_data]     packages/llama_index/core/_static/nltk_cache...
 [nltk_data]   Package punkt_tab is already up-to-date!
-INFO:nemo_retriever.cli:Processing 1 documents.
-INFO:nemo_retriever.cli:Output will be written to: ./processed_docs
+INFO:nv_ingest_client.nv_ingest_cli:Processing 1 documents.
+INFO:nv_ingest_client.nv_ingest_cli:Output will be written to: ./processed_docs
 Processing files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.34s/file, pages_per_sec=1.28]
-INFO:nemo_retriever.cli.util.processing:message_broker_task_source: Avg: 2.39 ms, Median: 2.39 ms, Total Time: 2.39 ms, Total % of Trace Computation: 0.06%
-INFO:nemo_retriever.cli.util.processing:broker_source_network_in: Avg: 9.51 ms, Median: 9.51 ms, Total Time: 9.51 ms, Total % of Trace Computation: 0.25%
-INFO:nemo_retriever.cli.util.processing:job_counter: Avg: 1.47 ms, Median: 1.47 ms, Total Time: 1.47 ms, Total % of Trace Computation: 0.04%
-INFO:nemo_retriever.cli.util.processing:job_counter_channel_in: Avg: 0.46 ms, Median: 0.46 ms, Total Time: 0.46 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:metadata_injection: Avg: 3.52 ms, Median: 3.52 ms, Total Time: 3.52 ms, Total % of Trace Computation: 0.09%
-INFO:nemo_retriever.cli.util.processing:metadata_injection_channel_in: Avg: 0.16 ms, Median: 0.16 ms, Total Time: 0.16 ms, Total % of Trace Computation: 0.00%
-INFO:nemo_retriever.cli.util.processing:pdf_content_extractor: Avg: 475.64 ms, Median: 163.77 ms, Total Time: 2378.21 ms, Total % of Trace Computation: 62.73%
-INFO:nemo_retriever.cli.util.processing:pdf_content_extractor_channel_in: Avg: 0.31 ms, Median: 0.31 ms, Total Time: 0.31 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:image_content_extractor: Avg: 0.67 ms, Median: 0.67 ms, Total Time: 0.67 ms, Total % of Trace Computation: 0.02%
-INFO:nemo_retriever.cli.util.processing:image_content_extractor_channel_in: Avg: 0.21 ms, Median: 0.21 ms, Total Time: 0.21 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:docx_content_extractor: Avg: 0.46 ms, Median: 0.46 ms, Total Time: 0.46 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:docx_content_extractor_channel_in: Avg: 0.20 ms, Median: 0.20 ms, Total Time: 0.20 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:pptx_content_extractor: Avg: 0.68 ms, Median: 0.68 ms, Total Time: 0.68 ms, Total % of Trace Computation: 0.02%
-INFO:nemo_retriever.cli.util.processing:pptx_content_extractor_channel_in: Avg: 0.46 ms, Median: 0.46 ms, Total Time: 0.46 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:audio_data_extraction: Avg: 1.08 ms, Median: 1.08 ms, Total Time: 1.08 ms, Total % of Trace Computation: 0.03%
-INFO:nemo_retriever.cli.util.processing:audio_data_extraction_channel_in: Avg: 0.20 ms, Median: 0.20 ms, Total Time: 0.20 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:dedup_images: Avg: 0.42 ms, Median: 0.42 ms, Total Time: 0.42 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:dedup_images_channel_in: Avg: 0.42 ms, Median: 0.42 ms, Total Time: 0.42 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:filter_images: Avg: 0.59 ms, Median: 0.59 ms, Total Time: 0.59 ms, Total % of Trace Computation: 0.02%
-INFO:nemo_retriever.cli.util.processing:filter_images_channel_in: Avg: 0.57 ms, Median: 0.57 ms, Total Time: 0.57 ms, Total % of Trace Computation: 0.02%
-INFO:nemo_retriever.cli.util.processing:table_data_extraction: Avg: 240.75 ms, Median: 240.75 ms, Total Time: 481.49 ms, Total % of Trace Computation: 12.70%
-INFO:nemo_retriever.cli.util.processing:table_data_extraction_channel_in: Avg: 0.38 ms, Median: 0.38 ms, Total Time: 0.38 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:chart_data_extraction: Avg: 300.54 ms, Median: 299.94 ms, Total Time: 901.62 ms, Total % of Trace Computation: 23.78%
-INFO:nemo_retriever.cli.util.processing:chart_data_extraction_channel_in: Avg: 0.23 ms, Median: 0.23 ms, Total Time: 0.23 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:infographic_data_extraction: Avg: 0.77 ms, Median: 0.77 ms, Total Time: 0.77 ms, Total % of Trace Computation: 0.02%
-INFO:nemo_retriever.cli.util.processing:infographic_data_extraction_channel_in: Avg: 0.25 ms, Median: 0.25 ms, Total Time: 0.25 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:caption_ext: Avg: 0.55 ms, Median: 0.55 ms, Total Time: 0.55 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:caption_ext_channel_in: Avg: 0.51 ms, Median: 0.51 ms, Total Time: 0.51 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:embed_text: Avg: 1.21 ms, Median: 1.21 ms, Total Time: 1.21 ms, Total % of Trace Computation: 0.03%
-INFO:nemo_retriever.cli.util.processing:embed_text_channel_in: Avg: 0.21 ms, Median: 0.21 ms, Total Time: 0.21 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:store_embedding_minio: Avg: 0.32 ms, Median: 0.32 ms, Total Time: 0.32 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:store_embedding_minio_channel_in: Avg: 1.18 ms, Median: 1.18 ms, Total Time: 1.18 ms, Total % of Trace Computation: 0.03%
-INFO:nemo_retriever.cli.util.processing:message_broker_task_sink_channel_in: Avg: 0.42 ms, Median: 0.42 ms, Total Time: 0.42 ms, Total % of Trace Computation: 0.01%
-INFO:nemo_retriever.cli.util.processing:No unresolved time detected. Trace times account for the entire elapsed duration.
-INFO:nemo_retriever.cli.util.processing:Processed 1 files in 2.34 seconds.
-INFO:nemo_retriever.cli.util.processing:Total pages processed: 3
-INFO:nemo_retriever.cli.util.processing:Throughput (Pages/sec): 1.28
-INFO:nemo_retriever.cli.util.processing:Throughput (Files/sec): 0.43
+INFO:nv_ingest_client.cli.util.processing:message_broker_task_source: Avg: 2.39 ms, Median: 2.39 ms, Total Time: 2.39 ms, Total % of Trace Computation: 0.06%
+INFO:nv_ingest_client.cli.util.processing:broker_source_network_in: Avg: 9.51 ms, Median: 9.51 ms, Total Time: 9.51 ms, Total % of Trace Computation: 0.25%
+INFO:nv_ingest_client.cli.util.processing:job_counter: Avg: 1.47 ms, Median: 1.47 ms, Total Time: 1.47 ms, Total % of Trace Computation: 0.04%
+INFO:nv_ingest_client.cli.util.processing:job_counter_channel_in: Avg: 0.46 ms, Median: 0.46 ms, Total Time: 0.46 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:metadata_injection: Avg: 3.52 ms, Median: 3.52 ms, Total Time: 3.52 ms, Total % of Trace Computation: 0.09%
+INFO:nv_ingest_client.cli.util.processing:metadata_injection_channel_in: Avg: 0.16 ms, Median: 0.16 ms, Total Time: 0.16 ms, Total % of Trace Computation: 0.00%
+INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor: Avg: 475.64 ms, Median: 163.77 ms, Total Time: 2378.21 ms, Total % of Trace Computation: 62.73%
+INFO:nv_ingest_client.cli.util.processing:pdf_content_extractor_channel_in: Avg: 0.31 ms, Median: 0.31 ms, Total Time: 0.31 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:image_content_extractor: Avg: 0.67 ms, Median: 0.67 ms, Total Time: 0.67 ms, Total % of Trace Computation: 0.02%
+INFO:nv_ingest_client.cli.util.processing:image_content_extractor_channel_in: Avg: 0.21 ms, Median: 0.21 ms, Total Time: 0.21 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:docx_content_extractor: Avg: 0.46 ms, Median: 0.46 ms, Total Time: 0.46 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:docx_content_extractor_channel_in: Avg: 0.20 ms, Median: 0.20 ms, Total Time: 0.20 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor: Avg: 0.68 ms, Median: 0.68 ms, Total Time: 0.68 ms, Total % of Trace Computation: 0.02%
+INFO:nv_ingest_client.cli.util.processing:pptx_content_extractor_channel_in: Avg: 0.46 ms, Median: 0.46 ms, Total Time: 0.46 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:audio_data_extraction: Avg: 1.08 ms, Median: 1.08 ms, Total Time: 1.08 ms, Total % of Trace Computation: 0.03%
+INFO:nv_ingest_client.cli.util.processing:audio_data_extraction_channel_in: Avg: 0.20 ms, Median: 0.20 ms, Total Time: 0.20 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:dedup_images: Avg: 0.42 ms, Median: 0.42 ms, Total Time: 0.42 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:dedup_images_channel_in: Avg: 0.42 ms, Median: 0.42 ms, Total Time: 0.42 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:filter_images: Avg: 0.59 ms, Median: 0.59 ms, Total Time: 0.59 ms, Total % of Trace Computation: 0.02%
+INFO:nv_ingest_client.cli.util.processing:filter_images_channel_in: Avg: 0.57 ms, Median: 0.57 ms, Total Time: 0.57 ms, Total % of Trace Computation: 0.02%
+INFO:nv_ingest_client.cli.util.processing:table_data_extraction: Avg: 240.75 ms, Median: 240.75 ms, Total Time: 481.49 ms, Total % of Trace Computation: 12.70%
+INFO:nv_ingest_client.cli.util.processing:table_data_extraction_channel_in: Avg: 0.38 ms, Median: 0.38 ms, Total Time: 0.38 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:chart_data_extraction: Avg: 300.54 ms, Median: 299.94 ms, Total Time: 901.62 ms, Total % of Trace Computation: 23.78%
+INFO:nv_ingest_client.cli.util.processing:chart_data_extraction_channel_in: Avg: 0.23 ms, Median: 0.23 ms, Total Time: 0.23 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:infographic_data_extraction: Avg: 0.77 ms, Median: 0.77 ms, Total Time: 0.77 ms, Total % of Trace Computation: 0.02%
+INFO:nv_ingest_client.cli.util.processing:infographic_data_extraction_channel_in: Avg: 0.25 ms, Median: 0.25 ms, Total Time: 0.25 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:caption_ext: Avg: 0.55 ms, Median: 0.55 ms, Total Time: 0.55 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:caption_ext_channel_in: Avg: 0.51 ms, Median: 0.51 ms, Total Time: 0.51 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:embed_text: Avg: 1.21 ms, Median: 1.21 ms, Total Time: 1.21 ms, Total % of Trace Computation: 0.03%
+INFO:nv_ingest_client.cli.util.processing:embed_text_channel_in: Avg: 0.21 ms, Median: 0.21 ms, Total Time: 0.21 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:store_embedding_minio: Avg: 0.32 ms, Median: 0.32 ms, Total Time: 0.32 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:store_embedding_minio_channel_in: Avg: 1.18 ms, Median: 1.18 ms, Total Time: 1.18 ms, Total % of Trace Computation: 0.03%
+INFO:nv_ingest_client.cli.util.processing:message_broker_task_sink_channel_in: Avg: 0.42 ms, Median: 0.42 ms, Total Time: 0.42 ms, Total % of Trace Computation: 0.01%
+INFO:nv_ingest_client.cli.util.processing:No unresolved time detected. Trace times account for the entire elapsed duration.
+INFO:nv_ingest_client.cli.util.processing:Processed 1 files in 2.34 seconds.
+INFO:nv_ingest_client.cli.util.processing:Total pages processed: 3
+INFO:nv_ingest_client.cli.util.processing:Throughput (Pages/sec): 1.28
+INFO:nv_ingest_client.cli.util.processing:Throughput (Files/sec): 0.43
 ```
 
-## Step 3: Inspecting and Consuming Results
+## Step 4: Inspecting and Consuming Results
 
 After the ingestion steps above have been completed, you should be able to find the `text` and `image` subfolders inside your processed docs folder. Each will contain JSON-formatted extracted content and metadata.
 
@@ -355,7 +382,7 @@ multimodal_test.pdf.metadata.json
 
 For the full metadata definitions, refer to [Content Metadata](content-metadata.md). 
 
-We also provide a script for inspecting [extracted images](https://github.com/NVIDIA/NeMo-Retriever/blob/main/src/util/image_viewer.py).
+We also provide a script for inspecting [extracted images](https://github.com/NVIDIA/nv-ingest/blob/main/src/util/image_viewer.py).
 
 First, install `tkinter` by running the following code. Choose the code for your OS.
 
@@ -386,7 +413,7 @@ python src/util/image_viewer.py --file_path ./processed_docs/image/multimodal_te
 
 !!! tip
 
-    Beyond inspecting the results, you can read them into things like [llama-index](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/llama_index_multimodal_rag.ipynb) or [langchain](https://github.com/NVIDIA/NeMo-Retriever/blob/main/examples/langchain_multimodal_rag.ipynb) retrieval pipelines. Also, checkout our [Enterprise RAG Blueprint on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with NeMo Retriever Library.
+    Beyond inspecting the results, you can load them into [llama-index](https://github.com/NVIDIA/nv-ingest/blob/main/examples/llama_index_multimodal_rag.ipynb) or [langchain](https://github.com/NVIDIA/nv-ingest/blob/main/examples/langchain_multimodal_rag.ipynb) retrieval pipelines. Also, check out our [Enterprise RAG Blueprint on build.nvidia.com](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag) to query over document content pre-extracted with the retriever pipeline.
 
 
 
@@ -397,25 +424,15 @@ You can specify multiple `--profile` options.
 
 | Profile               | Type     | Description                                                       | 
 |-----------------------|----------|-------------------------------------------------------------------| 
-| `retrieval`           | Core     | Enables the embedding NIM and (GPU accelerated) Milvus.           | 
+| `retrieval`           | Core     | Enables the embedding NIM and (optional) GPU-accelerated Milvus. Omit this profile to use the default LanceDB backend.           | 
 | `audio`               | Advanced | Use [Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html) for processing audio files. For more information, refer to [Audio Processing](audio.md). | 
 | `nemotron-parse`      | Advanced | Use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md). | 
-| `vlm`                 | Advanced | Use [llama 3.1 Nemotron 8B Vision](https://build.nvidia.com/nvidia/llama-3.1-nemotron-nano-vl-8b-v1/modelcard) for experimental image captioning of unstructured images. You can also configure other VLMs for your specific use cases. For more information, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images). | 
-
-## Air-Gapped Deployment (Docker Compose)
+| `vlm`                 | Advanced | Use [llama 3.1 Nemotron 8B Vision](https://build.nvidia.com/nvidia/llama-3.1-nemotron-nano-vl-8b-v1/modelcard) for image captioning of unstructured images and infographics. This profile enables the `caption` method in the Python API to generate text descriptions of visual content. For more information, refer to [Use Multimodal Embedding](vlm-embed.md) and [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images). | 
 
-When deploying in an air-gapped environment (no internet or NGC registry access), you must pre-stage container images on a machine with network access, then transfer and load them in the isolated environment.
-
-1. **On a machine with network access:** Clone the repo, authenticate with NGC (`docker login nvcr.io`), and pull all images used by your chosen profile (for example, `docker compose --profile retrieval pull`).
-2. **Save images:** Export the images to archives (for example, using `docker save` for each image or a script that saves all images referenced by your [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml)).
-3. **Transfer** the image archives and your `docker-compose.yaml` (and `.env` if used) to the air-gapped system.
-4. **On the air-gapped machine:** Load the images (`docker load -i <archive>`) and start the stack with the same profile (for example, `docker compose --profile retrieval up`).
-
-Ensure the same image tags and `docker-compose.yaml` version are used in both environments so that service configuration stays consistent.
 
 ## Docker Compose override files
 
-The default [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml) might exceed VRAM on a single GPU for some hardware. Override files reduce per-service memory, batch sizes, or concurrency so the full pipeline can run on the available GPU. To use an override, pass a second `-f` file after the base compose file; Docker Compose merges them and the override takes precedence.
+The default [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) might exceed VRAM on a single GPU for some hardware. Override files reduce per-service memory, batch sizes, or concurrency so the full pipeline can run on the available GPU. To use an override, pass a second `-f` file after the base compose file; Docker Compose merges them and the override takes precedence.
 
 | Override file | GPU target |
 |---------------|------------|
@@ -429,7 +446,7 @@ For RTX Pro 6000 Server Edition and other GPUs with limited VRAM, use the overri
 
 Infographics often combine text, charts, and diagrams into complex visuals. Vision-language model (VLM) captioning generates natural language descriptions that capture this complexity, making the content searchable and more accessible for downstream applications.
 
-To use VLM captioning for infographics, start the NeMo Retriever Library with both the `retrieval` and `vlm` profiles by running the following code.
+To use VLM captioning for infographics, start NeMo Retriever Library with both the `retrieval` and `vlm` profiles by running the following code.
 ```shell
 docker compose \
   -f docker-compose.yaml \
@@ -469,14 +486,14 @@ docker compose \
 
 ## Specify MIG slices for NIM models
 
-When you deploy NeMo Retriever Library with NIM models on MIG‑enabled GPUs, MIG device slices are requested and scheduled through the `values.yaml` file for the corresponding NIM microservice. For IBM Content-Aware Storage (CAS) deployments, this allows NeMo Retriever Library NIM pods to land only on nodes that expose the desired MIG profiles [raw.githubusercontent](https://raw.githubusercontent.com/NVIDIA/NeMo-Retriever/main/helm/README.md%E2%80%8B).​
+When you deploy the pipeline with NIM models on MIG‑enabled GPUs, MIG device slices are requested and scheduled through the `values.yaml` file for the corresponding NIM microservice. For IBM Content-Aware Storage (CAS) deployments, this allows NIM pods to land only on nodes that expose the desired MIG profiles. For details, refer to the [Helm chart README](https://raw.githubusercontent.com/NVIDIA/nv-ingest/main/helm/README.md).
 
 To target a specific MIG profile—for example, a 3g.20gb slice on an A100, which is a hardware-partitioned virtual GPU instance that gives your workload a fixed mid-sized share of the A100’s compute plus 20 GB of dedicated GPU memory and behaves like a smaller independent GPU—for a given NIM, configure the `resources` and `nodeSelector` under that NIM’s values path in `values.yaml`.
 
-The following example shows the pattern. Paths vary by NIM, such as `nemo_retriever.nvidiaNim.nemoretrieverPageElements` instead of the generic `nemo_retriever.nim` placeholder. For details refer to [catalog.ngc.nvidia.com](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo-microservices/helm-charts/nemo-retriever)​.
+The following example shows the pattern. Paths vary by NIM; for example, use `nvingest.nvidiaNim.nemoretrieverPageElements` instead of the generic `nvingest.nim` placeholder. For details, refer to [catalog.ngc.nvidia](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo-microservices/helm-charts/nv-ingest).
 Set `resources.requests` and `resources.limits` to the name of the MIG resource that you want (for example, `nvidia.com/mig-3g.20gb`).
 ```shell
-nemo_retriever:
+nvingest:
   nvidiaNim:
     nemoretrieverPageElements:
       modelName: "meta/llama3-8b-instruct"        # Example NIM model
@@ -489,16 +506,15 @@ nemo_retriever:
         nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb
 ```
 Key points:
-* Use the appropriate NIM‑specific values path (for example, `nemo_retriever.nvidiaNim.nemoretrieverPageElements.resources`) rather than the generic `nemo_retriever.nim` placeholder.
+* Use the appropriate NIM‑specific values path (for example, `nvingest.nvidiaNim.nemoretrieverPageElements.resources`) rather than the generic `nvingest.nim` placeholder.
 * Set `resources.requests` and `resources.limits` to the desired MIG resource name (for example, `nvidia.com/mig-3g.20gb`).
 * Use `nodeSelector` (or tolerations/affinity, if you prefer) to target nodes labeled with the corresponding MIG‑enabled GPU product (for example, `nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb`).
-This syntax and structure can be repeated for each NIM model used by CAS, ensuring that each NeMo Retriever Library NIM pod is mapped to the correct MIG slice type and scheduled onto compatible nodes.
+This syntax and structure can be repeated for each NIM model used by CAS, ensuring that each NV-Ingest NIM pod is mapped to the correct MIG slice type and scheduled onto compatible nodes.
 
 !!! important
 
     Advanced features require additional GPU support and disk space. For more information, refer to [Support Matrix](support-matrix.md).
 
-
 ## Related Topics
 
 - [Troubleshoot](troubleshoot.md)
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index 0878eb5c1..c027e6a0d 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -1,7 +1,487 @@
-# NeMo Retriever Library
+# Deploy Without Containers (Library Mode) for NeMo Retriever Library
+
+[NeMo Retriever Library](overview.md) is typically deployed as a cluster of containers for robust, scalable production use. 
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
+
+In addition, you can use library mode, which is intended for the following cases:
+
+- Local development
+- Experimentation and testing
+- Small-scale workloads, such as workloads of fewer than 100 documents
+
+
+By default, library mode depends on NIMs that are hosted on build.nvidia.com. 
+In library mode you launch the main pipeline service directly within a Python process, 
+while all other services (such as embedding and storage) are hosted remotely in the cloud.
+
+To get started using library mode, you need the following:
+
+- A Linux operating system (Ubuntu 22.04 or later recommended) or macOS
+- Python 3.12
+- An isolated Python virtual environment. We strongly advise creating it with [uv](https://docs.astral.sh/uv/getting-started/installation/).
+
+
+
+## Step 1: Prepare Your Environment
+
+Use the following procedure to prepare your environment.
+
+1. Run the following code to create your NV Ingest Python environment.
+
+    ```
+       uv venv --python 3.12 nvingest && \
+         source nvingest/bin/activate && \
+         uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2
+    ```
+
+    By default, the pipeline uses **LanceDB** as the vector database (no extra package required). To use **Milvus** (e.g. milvus-lite) instead, also install `milvus-lite==2.4.12` and pass `milvus_uri="milvus.db"` in `vdb_upload`. For details, see [Data Upload](data-store.md).
+
+    !!! tip
+
+        To confirm that you have activated your virtual environment, run `which python` and confirm that you see `nvingest` in the result. You can do this before any Python command that you run.
+
+2. Set environment variables, or create a `.env` file, that contain your NVIDIA Build API key and other settings.
+
+    !!! note
+
+        If you have an NGC API key, you can use it here. For more information, refer to [Generate Your NGC Keys](ngc-api-key.md) and [Environment Configuration Variables](environment-config.md).
+
+    - To set your variables, use the following code.
+
+        ```
+        export NVIDIA_API_KEY=nvapi-<your key>
+        ```
+    - To add your variables to a .env file, include the following.
+
+        ```
+        NVIDIA_API_KEY=nvapi-<your key>
+        ```
+
+
+## Step 2: Ingest Documents
+
+You can submit jobs programmatically by using Python.
+
+!!! tip
+
+    For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
+
+
+If you have a very high number of CPUs, and see the process hang without progress, 
+we recommend that you use `taskset` to limit the number of CPUs visible to the process. 
+Use the following code.
+
+```
+taskset -c 0-3 python your_ingestion_script.py
+```
+
+On a low-end laptop with 4 CPU cores, the following code should take about 10 seconds.
+
+```python
+import time
+
+from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
+from nv_ingest_client.client import Ingestor, NvIngestClient
+from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
+from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob
+
+def main():
+    # Start the pipeline subprocess for library mode
+    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
+
+    client = NvIngestClient(
+        message_client_allocator=SimpleClient,
+        message_client_port=7671,
+        message_client_hostname="localhost",
+    )
+
+    # Optional: use Milvus (e.g. milvus-lite) by providing milvus_uri and installing milvus-lite.
+    # By default, LanceDB is used and no milvus_uri is needed.
+    # milvus_uri = "milvus.db"
+    collection_name = "test"
+    sparse = False
+
+    # do content extraction from files
+    ingestor = (
+        Ingestor(client=client)
+        .files("data/multimodal_test.pdf")
+        .extract(
+            extract_text=True,
+            extract_tables=True,
+            extract_charts=True,
+            extract_images=True,
+            table_output_format="markdown",
+            extract_infographics=True,
+            # extract_method="nemotron_parse", #Slower, but maximally accurate, especially for PDFs with pages that are scanned images
+            text_depth="page",
+        )
+        .embed()
+        .vdb_upload(
+            collection_name=collection_name,
+            # milvus_uri=milvus_uri,  # Uncomment to use Milvus instead of LanceDB
+            sparse=sparse,
+            # 2048 for the llama-3.2 embedder; use 1024 for e5-v5
+            dense_dim=2048,
+        )
+    )
+
+    print("Starting ingestion..")
+    t0 = time.time()
+
+    # Return both successes and failures
+    # Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
+    results, failures = ingestor.ingest(show_progress=True, return_failures=True)
+
+    # Return only successes
+    # results = ingestor.ingest(show_progress=True)
+
+    t1 = time.time()
+    print(f"Total time: {t1 - t0} seconds")
+
+    # results blob is directly inspectable
+    if results:
+        print(ingest_json_results_to_blob(results[0]))
+
+    # (optional) Review any failures that were returned
+    if failures:
+        print(f"There were {len(failures)} failures. Sample: {failures[0]}")
+
+if __name__ == "__main__":
+    main()
+```
+
+!!! note
+
+    For advanced visual parsing with library mode, uncomment `extract_method="nemotron_parse"` in the previous code. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
+
+
+You can see the extracted text that represents the content of the ingested test document.
+
+```shell
+Starting ingestion..
+Total time: 9.243880033493042 seconds
+
+TestingDocument
+A sample document with headings and placeholder text
+Introduction
+This is a placeholder document that can be used for any purpose. It contains some 
+headings and some placeholder text to fill the space. The text is not important and contains 
+no real value, but it is useful for testing. Below, we will have some simple tables and charts 
+that we can use to confirm Ingest is working as expected.
+Table 1
+This table describes some animals, and some activities they might be doing in specific 
+locations.
+Animal Activity Place
+Gira@e Driving a car At the beach
+Lion Putting on sunscreen At the park
+Cat Jumping onto a laptop In a home o@ice
+Dog Chasing a squirrel In the front yard
+Chart 1
+This chart shows some gadgets, and some very fictitious costs.
+
+... document extract continues ...
+```
+
+## Step 3: Query Ingested Content
+
+To query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code. With the default LanceDB backend, use the LanceDB retrieval API (see [Data Upload](data-store.md)). The example below shows retrieval when using Milvus (e.g. milvus-lite).
+
+```python
+import os
+from openai import OpenAI
+from nv_ingest_client.util.milvus import nvingest_retrieval
+
+# Only needed when using Milvus (e.g. milvus-lite) instead of LanceDB
+milvus_uri = "milvus.db"
+collection_name = "test"
+sparse = False
+
+queries = ["Which animal is responsible for the typos?"]
+
+retrieved_docs = nvingest_retrieval(
+    queries,
+    collection_name,
+    milvus_uri=milvus_uri,
+    hybrid=sparse,
+    top_k=1,
+)
+
+# simple generation example
+extract = retrieved_docs[0][0]["entity"]["text"]
+client = OpenAI(
+  base_url = "https://integrate.api.nvidia.com/v1",
+  api_key = os.environ["NVIDIA_API_KEY"]
+)
+
+prompt = f"Using the following content: {extract}\n\n Answer the user query: {queries[0]}"
+print(f"Prompt: {prompt}")
+completion = client.chat.completions.create(
+  model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
+  messages=[{"role":"user","content": prompt}],
+)
+response = completion.choices[0].message.content
+
+print(f"Answer: {response}")
+```
+
+```shell
+Prompt: Using the following content: Table 1
+| This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. |
+| Animal | Activity | Place |
+| Giraffe | Driving a car | At the beach |
+| Lion | Putting on sunscreen | At the park |
+| Cat | Jumping onto a laptop | In a home office |
+| Dog | Chasing a squirrel | In the front yard |
+
+ Answer the user query: Which animal is responsible for the typos?
+Answer: A clever query!
+
+Based on the provided Table 1, I'd make an educated inference to answer your question. Since the activities listed are quite unconventional for the respective animals (e.g., a giraffe driving a car, a lion putting on sunscreen), it's likely that the table is using humor or hypothetical scenarios.
+
+Given this context, the question "Which animal is responsible for the typos?" is probably a tongue-in-cheek inquiry, as there's no direct information in the table about typos or typing activities.
+
+However, if we were to make a playful connection, we could look for an animal that's:
+
+1. Typically found in a setting where typing might occur (e.g., an office).
+2. Engaging in an activity that could potentially lead to typos (e.g., interacting with a typing device).
+
+Based on these loose criteria, I'd jokingly point to:
+
+**Cat** as the potential culprit, since it's:
+        * Located "In a home office"
+        * Engaged in "Jumping onto a laptop", which could theoretically lead to accidental keystrokes or typos if the cat were to start "walking" on the keyboard!
+
+Please keep in mind that this response is purely humorous and interpretative, as the table doesn't explicitly mention typos or provide a straightforward answer to the question.
+```
+
+
+
+## Logging Configuration
+
+NeMo Retriever extraction uses [Ray](https://docs.ray.io/en/latest/index.html) for logging.
+For details, refer to [Configure Ray Logging](ray-logging.md).
+
+By default, library mode runs in quiet mode to minimize startup noise. 
+Quiet mode automatically configures the following environment variables.
+
+| Variable                             | Quiet Mode Value | Description |
+|--------------------------------------|------------------|-------------|
+| `INGEST_RAY_LOG_LEVEL`               | `PRODUCTION`     | Sets Ray logging to ERROR level to reduce noise. |
+| `RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO` | `0`              | Silences Ray accelerator warnings. |
+| `OTEL_SDK_DISABLED`                  | `true`           | Disables OpenTelemetry trace export errors. |
+
+
+If you want to see detailed startup logs for debugging, use one of the following options:
+
+- Set `quiet=False` when you run the pipeline, as shown in the following example.
+
+    ```python
+    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True, quiet=False)
+    ```
+
+- Set the environment variables manually before you run the pipeline, as shown in the following example.
+
+    ```bash
+    export INGEST_RAY_LOG_LEVEL=DEVELOPMENT  # or DEBUG for maximum verbosity
+    ```
+
+
+
+## Library Mode Communication and Advanced Examples
+
+Communication in library mode is handled through a simplified, 3-way handshake message broker called `SimpleBroker`.
+
+Attempting to run a library-mode process co-located with a Docker Compose deployment does not work by default. 
+The Docker Compose deployment typically creates a firewall rule or port mapping that captures traffic to port `7671`,
+which prevents the `SimpleBroker` from receiving messages. 
+Always ensure that you use library mode in isolation, without an active containerized deployment listening on the same port.
+
+
+### Example `launch_libmode_service.py`
+
+This example launches the pipeline service in a subprocess, 
+and keeps it running until it is interrupted (for example, by pressing `Ctrl+C`). 
+It listens for ingestion requests on port `7671` from an external client.
+
+```python
+import logging
+import os
+
+from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
+from nv_ingest_api.util.logging.configuration import configure_logging as configure_local_logging
+
+# Configure the logger
+logger = logging.getLogger(__name__)
+
+local_log_level = os.getenv("INGEST_LOG_LEVEL", "DEFAULT")
+if local_log_level in ("DEFAULT",):
+    local_log_level = "INFO"
+
+configure_local_logging(local_log_level)
+
+
+def main():
+    """
+    Launch the libmode pipeline service using the embedded default configuration.
+    """
+    try:
+        # Start pipeline and block until interrupted
+        # Note: stdout/stderr cannot be passed when run_in_subprocess=True (not picklable)
+        # Use quiet=False to see verbose startup logs
+        _ = run_pipeline(
+            block=True,
+            disable_dynamic_scaling=True,
+            run_in_subprocess=True,
+        )
+    except KeyboardInterrupt:
+        logger.info("Keyboard interrupt received. Shutting down...")
+    except Exception as e:
+        logger.error(f"An unexpected error occurred: {e}", exc_info=True)
+
+
+if __name__ == "__main__":
+    main()
+```
+
+### Example `launch_libmode_and_run_ingestor.py`
+
+This example starts the pipeline service in a subprocess, 
+and immediately runs an ingestion client against it from the parent process.
+
+```python
+import logging
+import os
+import time
+
+from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
+from nv_ingest_api.util.logging.configuration import configure_logging as configure_local_logging
+from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
+from nv_ingest_client.client import Ingestor
+from nv_ingest_client.client import NvIngestClient
+
+# Configure the logger
+logger = logging.getLogger(__name__)
+
+local_log_level = os.getenv("INGEST_LOG_LEVEL", "INFO")
+if local_log_level in ("DEFAULT",):
+    local_log_level = "INFO"
+
+configure_local_logging(local_log_level)
+
+
+def run_ingestor():
+    """
+    Set up and run the ingestion process to send traffic against the pipeline.
+    """
+    logger.info("Setting up Ingestor client...")
+    client = NvIngestClient(
+        message_client_allocator=SimpleClient, message_client_port=7671, message_client_hostname="localhost"
+    )
+
+    ingestor = (
+        Ingestor(client=client)
+        .files("./data/multimodal_test.pdf")
+        .extract(
+            extract_text=True,
+            extract_tables=True,
+            extract_charts=True,
+            extract_images=True,
+            table_output_format="markdown",
+            extract_infographics=False,
+            text_depth="page",
+        )
+        .split(chunk_size=1024, chunk_overlap=150)
+        .embed()
+    )
+
+    try:
+        results, _ = ingestor.ingest(show_progress=False, return_failures=True)
+        logger.info("Ingestion completed successfully.")
+    except Exception as e:
+        logger.error(f"Ingestion failed: {e}")
+        raise
+
+    print("\nIngest done.")
+    print(f"Got {len(results)} results.")
+
+
+def main():
+    """
+    Launch the libmode pipeline service and run the ingestor against it.
+    Uses the embedded default libmode pipeline configuration.
+    """
+    pipeline = None
+    try:
+        # Start pipeline in subprocess
+        # Note: stdout/stderr cannot be passed when run_in_subprocess=True (not picklable)
+        # Use quiet=False to see verbose startup logs
+        pipeline = run_pipeline(
+            block=False,
+            disable_dynamic_scaling=True,
+            run_in_subprocess=True,
+        )
+        time.sleep(10)
+        run_ingestor()
+        # Run other code...
+    except KeyboardInterrupt:
+        logger.info("Keyboard interrupt received. Shutting down...")
+    except Exception as e:
+        logger.error(f"Error running pipeline: {e}")
+    finally:
+        if pipeline:
+            logger.info("Shutting down pipeline...")
+            pipeline.stop()
+
+
+if __name__ == "__main__":
+    main()
+```
+
+
+
+## The `run_pipeline` Function Reference
+
+The `run_pipeline` function is the main entry point to start the NeMo Retriever extraction pipeline.
+It can run in-process or as a subprocess.
+
+The `run_pipeline` function accepts the following parameters.
+
+| Parameter                | Type                   | Default | Required? | Description                                     |
+|--------------------------|------------------------|---------|-----------|-------------------------------------------------|
+| pipeline_config          | PipelineConfigSchema   | None    | No        | A configuration object that specifies how the pipeline should be constructed. If omitted, as in the examples on this page, the default configuration is used. |
+| run_in_subprocess        | bool                   | False   | No        | `True` to launch the pipeline in a separate Python subprocess. `False` to run in the current process. |
+| block                    | bool                   | True    | No        | `True` to run the pipeline synchronously. The function returns after it finishes. `False` to return an interface for external pipeline control. |
+| disable_dynamic_scaling  | bool                   | None    | No        | `True` to disable autoscaling regardless of global settings. `None` to use the global default behavior. |
+| dynamic_memory_threshold | float                  | None    | No        | A value between `0.0` and `1.0`. If dynamic scaling is enabled, triggers autoscaling when memory usage crosses this threshold. |
+| stdout                   | TextIO                 | None    | No        | Redirect the subprocess `stdout` to a file or stream. If `None`, defaults to `/dev/null`. |
+| stderr                   | TextIO                 | None    | No        | Redirect subprocess `stderr` to a file or stream. If `None`, defaults to `/dev/null`. |
+| libmode                  | bool                   | True    | No        | `True` to load the default library mode pipeline configuration when `pipeline_config` is `None`. |
+| quiet                    | bool                   | None    | No        | `True` to suppress verbose startup logs (PRODUCTION preset). `None` defaults to `True` when `libmode=True`. Set to `False` for verbose output. |
+
+
+The `run_pipeline` function returns the following values, depending on the parameters that you set:
+
+- **run_in_subprocess=False and block=True**  — The function returns a `float` that represents the elapsed time in seconds.
+- **run_in_subprocess=False and block=False** — The function returns a `RayPipelineInterface` object.
+- **run_in_subprocess=True  and block=True**  — The function returns `0.0`.
+- **run_in_subprocess=True  and block=False** — The function returns a `RayPipelineInterface` object.
+
+
+The `run_pipeline` function raises the following errors:
+
+- **RuntimeError** — A subprocess failed to start, or exited with an error.
+- **Exception** — Any other failure during pipeline setup or execution.
+
+
+
+## Related Topics
 
-Use the [Quick Start for NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/26.03/nemo_retriever/README.md) to set up and run the NeMo Retriever Library locally, so you can build a GPU‑accelerated, multimodal RAG ingestion pipeline that parses PDFs, HTML, text, audio, and video into LanceDB vector embeddings, integrates with Nemotron RAG models (locally or via NIM endpoints), which includes Ray‑based scaling plus built‑in recall evaluation.
\ No newline at end of file
+- [Prerequisites](prerequisites.md)
+- [Support Matrix](support-matrix.md)
+- [Deploy With Docker Compose (Self-Hosted)](quickstart-guide.md)
+- [Deploy With Helm](helm.md)
+- [Notebooks](notebooks.md)
+- [Enterprise RAG Blueprint](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)
diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 4781bb476..fb4b847f8 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -4,38 +4,68 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.   
-
-## 26.03 Release Notes (26.3.0)
-
-NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements.
-
-To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
-
-Highlights for the 26.03 release include:
-
-- NV-Ingest GitHub repo renamed to NeMo-Retriever  
-- NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library  
-- NeMo Retriever Library now supports two deployment options:  
-    - A new no-container, pip-installable in-process library for development (available on PyPI)  
-    - Existing production-ready Helm chart with NIMs  
-- Added documentation notes on Air-gapped deployment support for both Helm (Kubernetes) and Docker Compose  
-- Added documentation notes on OpenShift support  
-- Added support for RTX4500 Pro Blackwell SKU  
-- Added support for llama-nemotron-embed-vl-v2 in text and text+image modes  
-- New extract methods `pdfium_hybrid` and `ocr` target scanned PDFs to improve text and layout extraction from image-based pages  
-- VLM-based image caption enhancements:  
-    - Infographics can be captioned  
-    - Reasoning mode is configurable  
-- Enabled hybrid search with Lancedb  
-- Added retrieval_bench subfolder with generalizable agentic retrieval pipeline  
-- The project now uses UV as the primary environment and package manager instead of Conda, resulting in faster installs and simpler dependency handling  
-- Default Redis TTL increased from 1–2 hours to 48 hours so long-running jobs (e.g., VLM captioning) don’t expire before completion  
-- NeMo Retriever Library currently does not support image captioning via VLM; this feature will be added in the next release
+    NeMo Retriever Library is also known as NVIDIA Ingest.
+
+
+
+## Release 26.01 (26.1.2)
+
+The NeMo Retriever Library 26.01 release adds new hardware and software support, and other improvements.
+
+To upgrade the Helm Charts for this version, refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/helm/README.md).
+
+
+### Highlights 
+
+This release contains the following key changes:
+
+- Added functional support for [H200 NVL](https://www.nvidia.com/en-us/data-center/h200/). For details, refer to [Support Matrix](support-matrix.md).
+- All Helm deployments for Kubernetes now use [NVIDIA NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html). For details, refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/helm/README.md). 
+- Updated RIVA NIM to version 1.4.0. For details, refer to [Extract Speech](audio.md).
+- Updated VLM NIM to [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard). For details, refer to [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images).
+- Added VLM caption prompt customization parameters, including reasoning control. For details, refer to [Caption Images and Control Reasoning](nv-ingest-python-api.md#caption-images-and-control-reasoning).
+- Added support for the [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse/modelcard) model which replaces the [nemoretriever-parse](https://build.nvidia.com/nvidia/nemoretriever-parse/modelcard) model. For details, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
+- Support is now deprecated for [paddleocr](https://build.nvidia.com/baidu/paddleocr/modelcard).
+- The `meta-llama/Llama-3.2-1B` tokenizer is now pre-downloaded so that you can run token-based splitting without making a network request. For details, refer to [Split Documents](chunking.md).
+- For scanned PDFs, added specialized extraction strategies. For details, refer to [PDF Extraction Strategies](nv-ingest-python-api.md#pdf-extraction-strategies).
+- [LanceDB](https://lancedb.com/) is now the default vector database backend; Milvus remains fully supported. For details, refer to [Data Upload](data-store.md).
+- The V2 API is now available and is the default processing pipeline. The response format remains backwards-compatible. You can enable the V2 API by using `message_client_kwargs={"api_version": "v2"}`. For details, refer to [API Reference](api-docs).
+- Large PDFs are now automatically split into chunks and processed in parallel, delivering faster ingestion for long documents. For details, refer to [PDF Pre-Splitting](v2-api-guide.md).
+- Issues maintaining extraction quality while processing very large files are now resolved with the V2 API. For details, refer to [V2 API Guide](v2-api-guide.md).
+- Updated the embedding task to support embedding on custom content fields like the results of summarization functions. For details, refer to [Use the Python API](nv-ingest-python-api.md).
+- User-defined function summarization now uses `nemotron-mini-4b-instruct`, which provides significant speed improvements. For details, refer to [User-defined Functions](user-defined-functions.md) and [NV-Ingest UDF Examples](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/examples/udfs/README.md).
+- In the `Ingestor.extract` method, the defaults for `extract_text` and `extract_images` are now set to `true` for consistency with `extract_tables` and `extract_charts`. For details, refer to [Use the Python API](nv-ingest-python-api.md).
+- The `table-structure` profile is no longer available. The table-structure profile is now part of the default profile. For details, refer to [Profile Information](quickstart-guide.md#profile-information).
+- New documentation [Why Throughput Is Dataset-Dependent](throughput-is-dataset-dependent.md).
+- New documentation [Add User-defined Stages](user-defined-stages.md).
+- New documentation [Add User-defined Functions](user-defined-functions.md).
+- New documentation [Resource Scaling Modes](scaling-modes.md).
+- New documentation [NimClient Usage](nimclient.md).
+- New documentation [Use the API (V2)](v2-api-guide.md).
+
+
+
+### Fixed Known Issues
+
+The following are the known issues that are fixed in this version:
+
+- A10G support is restored. To use A10G hardware, use release 26.1.2 or later. For details, refer to [Support Matrix](support-matrix.md).
+- L40S support is restored. To use L40S hardware, use release 26.1.2 or later. For details, refer to [Support Matrix](support-matrix.md).
+- The page number field in the content metadata now starts at 1 instead of 0 so each page number is no longer off by one from what you would expect. For details, refer to [Content Metadata](content-metadata.md).
+- Support for batches that include individual files greater than approximately 400 MB is restored. This includes audio files and PDFs.
+
+
+
+## All Known Issues
+
+The following are the known issues for NeMo Retriever Library:
+
+- Advanced visual parsing is not supported on RTX Pro 6000, B200, or H200 NVL. For details, refer to [Advanced Visual Parsing](advanced-visual-parsing.md) and [Support Matrix](support-matrix.md).
+- The Page Elements NIM (`nemoretriever-page-elements-v3:1.7.0`) may intermittently fail during inference under high-concurrency workloads. This happens when Triton’s dynamic batching combines requests that exceed the model’s maximum batch size, a situation more commonly seen in multi-GPU setups or large ingestion runs. In these cases, extraction fails for the impacted documents. A correction is planned for `nemoretriever-page-elements-v3:1.7.1`.
+
 
 ## Release Notes for Previous Versions
 
-| [26.1.2](https://docs.nvidia.com/nemo/retriever/26.1.2/extraction/releasenotes-nv-ingest/)
 | [26.1.1](https://docs.nvidia.com/nemo/retriever/26.1.1/extraction/releasenotes-nv-ingest/)
 | [25.9.0](https://docs.nvidia.com/nemo/retriever/25.9.0/extraction/releasenotes-nv-ingest/) 
 | [25.6.3](https://docs.nvidia.com/nemo/retriever/25.6.3/extraction/releasenotes-nv-ingest/) 
@@ -44,6 +74,9 @@ Highlights for the 26.03 release include:
 | [25.3.0](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
 | [24.12.1](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
 | [24.12.0](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
+|
+
+
 
 ## Related Topics
 
diff --git a/docs/docs/extraction/scaling-modes.md b/docs/docs/extraction/scaling-modes.md
index 3e8cba20a..5c57b33ac 100644
--- a/docs/docs/extraction/scaling-modes.md
+++ b/docs/docs/extraction/scaling-modes.md
@@ -7,7 +7,7 @@ This guide covers how resource scaling modes work across stages in [NeMo Retriev
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 
@@ -18,7 +18,7 @@ This guide covers how resource scaling modes work across stages in [NeMo Retriev
 
 ## Configure (docker-compose)
 
-Edit `services > nemo-retriever-ms-runtime > environment` in `docker-compose.yaml`.
+Edit `services > nv-ingest-ms-runtime > environment` in `docker-compose.yaml`.
 
 ### Select mode
 
@@ -35,7 +35,7 @@ Example (Static):
 
 ```yaml
 services:
-  nemo-retriever-ms-runtime:
+  nv-ingest-ms-runtime:
     environment:
       - INGEST_DISABLE_DYNAMIC_SCALING=true
       - INGEST_STATIC_MEMORY_THRESHOLD=0.85
@@ -45,7 +45,7 @@ Example (Dynamic):
 
 ```yaml
 services:
-  nemo-retriever-ms-runtime:
+  nv-ingest-ms-runtime:
     environment:
       - INGEST_DISABLE_DYNAMIC_SCALING=false
       - INGEST_DYNAMIC_MEMORY_THRESHOLD=0.80
@@ -91,7 +91,7 @@ services:
 
 Open `docker-compose.yaml` and locate:
 
-- `services > nemo-retriever-ms-runtime > environment`:
+- `services > nv-ingest-ms-runtime > environment`:
   - `INGEST_DISABLE_DYNAMIC_SCALING`
   - `INGEST_DYNAMIC_MEMORY_THRESHOLD`
   - `INGEST_STATIC_MEMORY_THRESHOLD`
@@ -102,4 +102,4 @@ Open `docker-compose.yaml` and locate:
 
 - [Prerequisites](prerequisites.md)
 - [Support Matrix](support-matrix.md)
-- [Troubleshooting](troubleshoot.md)
+- [Troubleshooting](troubleshooting.md)
diff --git a/docs/docs/extraction/support-matrix.md b/docs/docs/extraction/support-matrix.md
index 845e671f6..5dbc49508 100644
--- a/docs/docs/extraction/support-matrix.md
+++ b/docs/docs/extraction/support-matrix.md
@@ -4,34 +4,33 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure that you ha
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## Core and Advanced Pipeline Features
 
-The NeMo Retriever Library core pipeline features run on a single A10G or better GPU. 
-
+The NeMo Retriever extraction core pipeline features run on a single A10G or better GPU. 
 The core pipeline features include the following:
 
-- llama-nemotron-embed-1b-v2 — Embedding model for converting text chunks into vectors.
-- nemotron-page-elements-v3 — Detects and classifies images on a page as a table, chart or infographic.
-- nemotron-table-structure-v1 — Detects rows, columns, and cells within a table to preserve table structure and convert to Markdown format. 
-- nemotron-graphic-elements-v1 — Detects graphic elements within chart images such as titles, legends, axes, and numerical values. 
-- nemotron-ocr-v1 — Image OCR model to detect and extract text from images.
-- retrieval — Enables embedding and indexing into Milvus.
+- llama3.2-nv-embedqa-1b-v2 — Embedding model for converting text chunks into vectors.
+- nemoretriever-page-elements-v3 — Detects and classifies images on a page as a table, chart or infographic.
+- nemoretriever-table-structure-v1 — Detects rows, columns, and cells within a table to preserve table structure and convert to Markdown format. 
+- nemoretriever-graphic-elements-v1 — Detects graphic elements within chart images such as titles, legends, axes, and numerical values. 
+- nemoretriever-ocr-v1 — Image OCR model to detect and extract text from images.
+- retrieval — Enables embedding and indexing into LanceDB (default) or Milvus.
 
 Advanced features require additional GPU support and disk space. 
 This includes the following:
 
 - Audio extraction — Use [Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html) for processing audio files. For more information, refer to [Audio Processing](audio.md).
-- Advanced visual parsing — Use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse/modelcard), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing ](nemoretriever-parse.md).
-- VLM — Use [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard) for experimental image captioning of unstructured images. 
+- Advanced visual parsing — Use [nemotron-parse](https://docs.nvidia.com/nim/vision-language-models/latest/examples/nemotron-parse/overview.html), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing ](nemoretriever-parse.md).
+- VLM — Use [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard) for experimental image captioning of unstructured images. 
     
     !!! note
     
-        While nemotron-nano-12b-v2-vl is the default VLM, you can configure and use other vision language models for image captioning based on your specific use case requirements. For more information, refer to [Extract Captions from Images](python-api-reference.md#extract-captions-from-images).
+        While nemotron-nano-12b-v2-vl is the default VLM, you can configure and use other vision language models for image captioning based on your specific use case requirements. For more information, refer to [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images).
 
-- Reranker — Use [llama-nemotron-rerank-1b-v2](https://build.nvidia.com/nvidia/llama-nemotron-rerank-1b-v2) for improved retrieval accuracy.
+- Reranker — Use [llama-3.2-nv-rerankqa-1b-v2](https://build.nvidia.com/nvidia/llama-3.2-nv-rerankqa-1b-v2) for improved retrieval accuracy.
 
 
 
diff --git a/docs/docs/extraction/telemetry.md b/docs/docs/extraction/telemetry.md
index 5c050452f..9d34aaa92 100644
--- a/docs/docs/extraction/telemetry.md
+++ b/docs/docs/extraction/telemetry.md
@@ -4,7 +4,7 @@ You can view telemetry data for [NeMo Retriever Library](overview.md).
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## OpenTelemetry
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index 1e97448e8..1b130952b 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -4,7 +4,7 @@ Use this documentation to troubleshoot issues that arise when you use [NeMo Retr
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## Can't process long, non-language text strings
@@ -52,7 +52,7 @@ This happens because, by default, NeMo Retriever Library stores the results from
 If the total size of the results exceeds the available memory, the process fails.
 
 To resolve this issue, use the `save_to_disk` method. 
-For details, refer to [Working with Large Datasets: Saving to Disk](python-api-reference.md#work-with-large-datasets-save-to-disk).
+For details, refer to [Working with Large Datasets: Saving to Disk](nv-ingest-python-api.md#work-with-large-datasets-save-to-disk).
 
 
 
diff --git a/docs/docs/extraction/user-defined-functions.md b/docs/docs/extraction/user-defined-functions.md
index eae1f5a4a..74a710698 100644
--- a/docs/docs/extraction/user-defined-functions.md
+++ b/docs/docs/extraction/user-defined-functions.md
@@ -5,7 +5,7 @@ This guide covers how to write, validate, and submit UDFs using both the CLI and
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 
@@ -16,7 +16,7 @@ This guide covers how to write, validate, and submit UDFs using both the CLI and
 Create a Python function that accepts an `IngestControlMessage` and returns a modified `IngestControlMessage`:
 
 ```python
-from nemo_retriever.internal.primitives.ingest_control_message import IngestControlMessage
+from nv_ingest_api.internal.primitives.ingest_control_message import IngestControlMessage
 
 def my_custom_processor(control_message: IngestControlMessage) -> IngestControlMessage:
     """Add custom metadata to all documents."""
@@ -41,7 +41,7 @@ The CLI supports all UDF function specification formats. Here are examples of ea
 #### Inline Function String
 ```bash
 # Submit inline UDF function
-nemo-retriever \
+nv-ingest-cli \
     --doc /path/to/document.pdf \
     --output-directory ./output \
     --task 'udf:{"udf_function": "def my_processor(control_message): print(\"Processing...\"); return control_message", "udf_function_name": "my_processor", "target_stage": "text_embedder", "run_before": true}'
@@ -50,7 +50,7 @@ nemo-retriever \
 #### Module Path with Colon (Recommended)
 ```bash
 # Submit UDF from importable module (preserves all imports and context)
-nemo-retriever \
+nv-ingest-cli \
     --doc /path/to/document.pdf \
     --output-directory ./output \
     --task 'udf:{"udf_function": "my_package.processors:enhance_metadata", "target_stage": "text_embedder", "run_after": true}'
@@ -59,7 +59,7 @@ nemo-retriever \
 #### File Path
 ```bash
 # Submit UDF from file path
-nemo-retriever \
+nv-ingest-cli \
     --doc /path/to/document.pdf \
     --output-directory ./output \
     --task 'udf:{"udf_function": "my_file.py:my_custom_processor", "target_stage": "text_embedder", "run_before": true}'
@@ -68,7 +68,7 @@ nemo-retriever \
 #### Legacy Import Path (Limited)
 ```bash
 # Submit UDF using legacy dot notation (function only, no imports)
-nemo-retriever \
+nv-ingest-cli \
     --doc /path/to/document.pdf \
     --output-directory ./output \
     --task 'udf:{"udf_function": "my_package.processors.basic_processor", "target_stage": "text_embedder", "run_after": true}'
@@ -77,7 +77,7 @@ nemo-retriever \
 ### 3. Submit via Python Client
 
 ```python
-from nemo_retriever.client.interface import Ingestor
+from nv_ingest_client.client.interface import Ingestor
 
 # Create an Ingestor instance with default client
 ingestor = Ingestor()
@@ -121,7 +121,7 @@ results = ingestor.files("/path/to/document.pdf") \
 
 ### Understanding IngestControlMessage (ICM)
 
-The `IngestControlMessage` is the primary data structure that flows through the NeMo Retriever Library pipeline. Your UDF receives an ICM and must return a (potentially modified) ICM.
+The `IngestControlMessage` is the primary data structure that flows through the pipeline. Your UDF receives an ICM and must return a (potentially modified) ICM.
 
 #### Key ICM Methods
 
@@ -265,7 +265,7 @@ def enhance_metadata(control_message: IngestControlMessage) -> IngestControlMess
     return control_message
 ```
 
-> **📖 For detailed metadata schema documentation, see:** [Content Metadata](content-metadata.md)
+> **📖 For detailed metadata schema documentation, see:** [metadata_documentation.md](metadata_documentation.md)
 
 ### UDF Targeting
 
@@ -311,9 +311,9 @@ UDFs can be executed at different stages of the pipeline by specifying the `targ
 
 ```bash
 # CLI examples for different target stages
-nemo-retriever --doc file.pdf --task 'udf:{"udf_function": "processor.py:validate_input", "target_stage": "pdf_extractor", "run_before": true}'
-nemo-retriever --doc file.pdf --task 'udf:{"udf_function": "processor.py:extract_custom", "target_stage": "text_embedder", "run_after": true}'
-nemo-retriever --doc file.pdf --task 'udf:{"udf_function": "processor.py:enhance_output", "target_stage": "embedding_storage", "run_before": true}'
+nv-ingest-cli --doc file.pdf --task 'udf:{"udf_function": "processor.py:validate_input", "target_stage": "pdf_extractor", "run_before": true}'
+nv-ingest-cli --doc file.pdf --task 'udf:{"udf_function": "processor.py:extract_custom", "target_stage": "text_embedder", "run_after": true}'
+nv-ingest-cli --doc file.pdf --task 'udf:{"udf_function": "processor.py:enhance_output", "target_stage": "embedding_storage", "run_before": true}'
 ```
 
 ```python
@@ -379,7 +379,7 @@ def my_udf(control_message: IngestControlMessage) -> IngestControlMessage:
 
 ### UDF Function Specification Formats
 
-NeMo Retriever Library supports four different formats for specifying UDF functions:
+The library supports four different formats for specifying UDF functions:
 
 ### 1. Inline Function String
 Define your function directly as a string:
@@ -456,14 +456,14 @@ ingestor.udf(udf_function="my_package.processors.text_utils:enhance_metadata")
 
 ## Integrating with NVIDIA NIMs
 
-NVIDIA Inference Microservices (NIMs) provide powerful AI capabilities that can be seamlessly integrated into your UDFs. The `NimClient` class offers a unified interface for connecting to and using NIMs within the NeMo Retriever Library pipeline.
+NVIDIA Inference Microservices (NIMs) provide powerful AI capabilities that can be seamlessly integrated into your UDFs. The `NimClient` class offers a unified interface for connecting to and using NIMs within the pipeline.
 
 ### Quick NIM Integration
 
 ```python
-from nemo_retriever.internal.primitives.control_message import IngestControlMessage
-from nemo_retriever.util.nim import create_inference_client
-from nemo_retriever.internal.primitives.nim.model_interface.vlm import VLMModelInterface
+from nv_ingest_api.internal.primitives.ingest_control_message import IngestControlMessage
+from nv_ingest_api.util.nim import create_inference_client
+from nv_ingest_api.internal.primitives.nim.model_interface.vlm import VLMModelInterface
 import os
 
 def document_analysis_with_nim(control_message: IngestControlMessage) -> IngestControlMessage:
@@ -521,7 +521,7 @@ export NGC_API_KEY="your-ngc-api-key"
 
 ### Available NIM Interfaces
 
-NeMo Retriever Library provides several pre-built model interfaces:
+The library provides several pre-built model interfaces:
 
 - **VLMModelInterface**: Vision-Language Models for image analysis and captioning
 - **EmbeddingModelInterface**: Text embedding generation
@@ -538,11 +538,11 @@ For detailed guidance on creating custom NIM integrations, including:
 - Error handling and debugging
 - Performance best practices
 
-See the comprehensive [**NimClient Usage Guide**](nimclient.md).
+See the comprehensive [**NimClient Usage Guide**](nimclient_usage.md).
 
 ### Error Handling
 
-The NeMo Retriever Library system automatically catches all exceptions that occur within UDF execution. If your UDF fails for any reason, the system will:
+The system automatically catches all exceptions that occur within UDF execution. If your UDF fails for any reason, the system will:
 
 1. Annotate the job with appropriate error information
 2. Mark the job as failed
@@ -553,7 +553,7 @@ You do not need to implement extensive error handling within your UDF - focus on
 
 ### Performance Considerations
 
-UDFs execute within the NeMo Retriever Library pipeline and can significantly impact overall system performance and stability. Understanding these considerations is crucial for maintaining optimal pipeline throughput and reliability.
+UDFs execute within the pipeline and can significantly impact overall system performance and stability. Understanding these considerations is crucial for maintaining optimal pipeline throughput and reliability.
 
 #### Pipeline Impact
 
@@ -873,7 +873,7 @@ Test your UDF functions in isolation before deploying them to the pipeline:
 
 ```python
 import pandas as pd
-from nemo_retriever.internal.primitives.ingest_control_message import IngestControlMessage
+from nv_ingest_api.internal.primitives.ingest_control_message import IngestControlMessage
 
 def test_my_udf():
     # Create test data
@@ -941,6 +941,6 @@ def debug_udf(control_message: IngestControlMessage) -> IngestControlMessage:
 
 ## Related Topics
 
-- [NeMo Retriever Library UDF Examples](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/examples/udfs/README.md)
+- [NV-Ingest UDF Examples](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/examples/udfs/README.md)
 - [User-Defined Stages for NeMo Retriever Library](user-defined-stages.md)
 - [NimClient Usage](nimclient.md)
diff --git a/docs/docs/extraction/user-defined-stages.md b/docs/docs/extraction/user-defined-stages.md
index a20e17673..57f68179f 100644
--- a/docs/docs/extraction/user-defined-stages.md
+++ b/docs/docs/extraction/user-defined-stages.md
@@ -8,7 +8,7 @@ and operate on a well-defined DataFrame payload and metadata structure.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 To add user-defined stages to your pipeline, you need the following:
@@ -21,7 +21,7 @@ To add user-defined stages to your pipeline, you need the following:
 
 - **A DataFrame payload** — The `control_message.payload` field must be a [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). For more information, refer to [Create a DataFrame Payload](#create-a-dataframe-payload).
 
-- **Valid metadata** — The `metadata` field must conform to the [NeMo Retriever Library metadata schema](content-metadata.md). For more information, refer to [Update and Validate Metadata](#update-and-validate-metadata).
+- **Valid metadata** — The `metadata` field must conform to the [content metadata schema](content-metadata.md). For more information, refer to [Update and Validate Metadata](#update-and-validate-metadata).
 
 
 
@@ -44,8 +44,8 @@ The following example demonstrates how to create a valid Lambda function and con
 ```python
 import pandas as pd
 from pydantic import BaseModel
-from nemo_retriever.internal.primitives.ingest_control_message import IngestControlMessage
-from nemo_retriever.internal.schemas.meta.metadata_schema import validate_metadata
+from nv_ingest_api.internal.primitives.ingest_control_message import IngestControlMessage
+from nv_ingest_api.internal.schemas.meta.metadata_schema import validate_metadata
 
 # Config schema for your stage
 class MyToyConfig(BaseModel):
@@ -160,13 +160,13 @@ When the pipeline runs it does the following:
 ## Update and Validate Metadata
 
 The `metadata` column in each row is a dictionary (JSON object), 
-and must conform to the [NeMo Retriever Library metadata schema](content-metadata.md). 
+and must conform to the [content metadata schema](content-metadata.md). 
 
 After you change any metadata, you can validate it by using the `validate_metadata` function 
 as demonstrated in the following code example.
 
 ```python
-from nemo_retriever.internal.schemas.meta.metadata_schema import validate_metadata
+from nv_ingest_api.internal.schemas.meta.metadata_schema import validate_metadata
 
 def edit_metadata(control_message: IngestControlMessage, stage_config: MyToyConfig) -> IngestControlMessage:
   df = control_message.payload()
@@ -235,8 +235,8 @@ The  following example adds user-defined stages to your NeMo Retriever Library p
     ```python
     # my_pipeline/stages.py
     from pydantic import BaseModel
-    from nemo_retriever.internal.primitives.ingest_control_message import IngestControlMessage
-    from nemo_retriever.internal.schemas.meta.metadata_schema import validate_metadata
+    from nv_ingest_api.internal.primitives.ingest_control_message import IngestControlMessage
+    from nv_ingest_api.internal.schemas.meta.metadata_schema import validate_metadata
 
     class DoubleConfig(BaseModel):
     multiply_by: int = 2
diff --git a/docs/docs/extraction/v2-api-guide.md b/docs/docs/extraction/v2-api-guide.md
index 1ac15d216..5aa87f1f6 100644
--- a/docs/docs/extraction/v2-api-guide.md
+++ b/docs/docs/extraction/v2-api-guide.md
@@ -9,10 +9,9 @@
 ## Table of Contents
 
 1. [Quick Start](#quick-start) - Get running in 5 minutes
-2. [HTTP API Reference](#http-api-reference) - Endpoint paths, methods, and status codes
-3. [Configuration Guide](#configuration-guide) - All configuration options
-4. [How It Works](#how-it-works) - Architecture overview
-5. [Migration from V1](#migration-from-v1) - Upgrade existing code
+2. [Configuration Guide](#configuration-guide) - All configuration options
+3. [How It Works](#how-it-works) - Architecture overview
+4. [Migration from V1](#migration-from-v1) - Upgrade existing code
 
 
 ---
@@ -30,7 +29,7 @@ The V2 API automatically splits large PDFs into smaller chunks before processing
 ### Minimal Example
 
 ```python
-from nemo_retriever.client import Ingestor
+from nv_ingest_client.client import Ingestor
 
 # Two-step configuration
 ingestor = Ingestor(
@@ -42,7 +41,7 @@ ingestor = Ingestor(
 # Run with optional chunk size override
 results = ingestor.files(["large_document.pdf"]) \
     .extract(extract_text=True, extract_tables=True) \
-    .pdf_split_config(pages_per_chunk=64) \
+    .pdf_split_config(pages_per_chunk=64) \
     .ingest()
 
 print(f"Processed {results['metadata']['total_pages']} pages")
@@ -51,7 +50,7 @@ print(f"Processed {results['metadata']['total_pages']} pages")
 ### CLI Usage
 
 ```bash
-nemo-retriever \
+nv-ingest-cli \
   --api_version v2 \
   --pdf_split_page_count 64 \
   --doc large_document.pdf \
@@ -63,63 +62,6 @@ nemo-retriever \
 
 ---
 
-## HTTP API Reference
-
-The following endpoint reference is provided for custom HTTP clients (curl, Postman, etc.) and debugging. Base URL is the service root (e.g. `http://localhost:7670`); use the paths below as the path component of the full URL.
-
-### Endpoint Summary
-
-| Version | Method | Endpoint | Purpose | Status codes |
-|---------|--------|----------|---------|--------------|
-| V1 | POST | `/v1/submit` | Multipart/form-data upload (curl-friendly) | 200 |
-| V1 | POST | `/v1/submit_job` | JSON job submission | 200 |
-| V1 | GET | `/v1/fetch_job/{job_id}` | Fetch job result | 200, 202, 404, 410, 503 |
-| V1 | POST | `/v1/convert` | PDF conversion | 200 |
-| V1 | GET | `/v1/status/{job_id}` | Job status check | 200 |
-| V2 | POST | `/v2/submit_job` | V2 job submission (with optional PDF splitting) | 200, 500, 503 |
-| V2 | GET | `/v2/fetch_job/{job_id}` | V2 fetch with parent job aggregation | 200, 202, 404, 410, 500, 503 |
-
-### Request and Response Overview
-
-**V1 `/v1/submit` (POST)**  
-- **Content-Type:** `multipart/form-data`  
-- **Body:** `file` (uploaded PDF)  
-- **Response:** `200` — job ID (text)
-
-**V1 `/v1/submit_job` (POST)**  
-- **Content-Type:** `application/json`  
-- **Body:** `MessageWrapper` with job spec payload (JSON)  
-- **Response:** `200` — job ID (text). Header `x-trace-id` set.
-
-**V1 `/v1/fetch_job/{job_id}` (GET)**  
-- **Path:** `job_id` — UUID returned from submit  
-- **Response:** `200` result body, `202` still processing, `404` not found, `410` result consumed, `503` processing failed
-
-**V1 `/v1/convert` (POST)**  
-- **Content-Type:** `application/json` (or as defined by endpoint)  
-- **Response:** Conversion result (format depends on request)
-
-**V1 `/v1/status/{job_id}` (GET)**  
-- **Path:** `job_id`  
-- **Response:** Job state (e.g. SUBMITTED, PROCESSING, RETRIEVED_*)
-
-**V2 `/v2/submit_job` (POST)**  
-- **Content-Type:** `application/json`  
-- **Body:** Same as V1 `submit_job`; may include PDF split config in job spec  
-- **Response:** `200` — parent job ID (text). Header `x-trace-id` set.
-
-**V2 `/v2/fetch_job/{job_id}` (GET)**  
-- **Path:** `job_id` — parent job ID from V2 submit  
-- **Response:** Same status codes as V1 fetch; when the job was split, the service aggregates all chunk results and returns a single combined response. See [HTTP status codes](#http-status-codes) below.
-
-*Endpoint reference added for custom HTTP clients and debugging (Bug 596672).*
-
-### HTTP Status Codes
-
-See the [status code table](#http-status-codes) in this guide for `200`, `202`, `404`, `410`, `500`, and `503` meanings.
-
----
-
 ## Configuration Guide
 
 ### Two Required Settings
@@ -428,11 +370,11 @@ ingestor = Ingestor(
 
 ### Test Script Pattern
 
-For test scripts like `tools/harness/src/nemo_retriever_harness/cases/e2e.py`:
+For test scripts like `tools/harness/src/nv_ingest_harness/cases/e2e.py`:
 
 ```python
 import os
-from nemo_retriever.client import Ingestor
+from nv_ingest_client.client import Ingestor
 
 # Read from environment
 api_version = os.getenv("API_VERSION", "v1")
@@ -462,7 +404,7 @@ ingestor = ingestor.extract(...).ingest()
 ### Backward Compatibility
 
 **V1 clients continue to work:**
-- Still route to `/v1/submit_job` and `/v1/fetch_job` (refer to the [HTTP API Reference](#http-api-reference) for all V1/V2 paths)
+- Still route to `/v1/submit_job` and `/v1/fetch_job`
 - No changes required
 - No splitting occurs
 
@@ -473,7 +415,7 @@ ingestor = ingestor.extract(...).ingest()
 
 ---
 
-<a id="http-status-codes"></a>**HTTP status codes:**
+**HTTP status codes:**
 
 | Code | Meaning | Action |
 |------|---------|--------|
@@ -526,17 +468,17 @@ WARNING: Client requested split_page_count=1000; clamped to 128
 
 ### Key Files
 
-**Server Implementation (this repo: `nv_ingest`, refer to the [NeMo-Retriever](https://github.com/NVIDIA/NeMo-Retriever.git) for client):**
+**Server Implementation:**
 - `src/nv_ingest/api/v2/ingest.py` - V2 endpoints
 - `src/nv_ingest/framework/util/service/impl/ingest/redis_ingest_service.py` - Redis state management
 
 **Client Implementation:**
-- `client/src/nemo_retriever/client/interface.py` - Ingestor class
-- `client/src/nemo_retriever/util/util.py` - Configuration utilities
-- `client/src/nemo_retriever/client/ingest_job_handler.py` - Job handling
+- `client/src/nv_ingest_client/client/interface.py` - Ingestor class
+- `client/src/nv_ingest_client/util/util.py` - Configuration utilities
+- `client/src/nv_ingest_client/client/ingest_job_handler.py` - Job handling
 
 **Schemas:**
-- `api/src/nemo_retriever/internal/schemas/meta/ingest_job_schema.py` - PdfConfigSchema
+- `api/src/nv_ingest_api/internal/schemas/meta/ingest_job_schema.py` - PdfConfigSchema
 
 ---
 
diff --git a/docs/docs/extraction/vlm-embed.md b/docs/docs/extraction/vlm-embed.md
index 331379ab3..3ffe2b7c0 100644
--- a/docs/docs/extraction/vlm-embed.md
+++ b/docs/docs/extraction/vlm-embed.md
@@ -1,27 +1,28 @@
 # Use Multimodal Embedding with NeMo Retriever Library
 
-This guide explains how to use the [NeMo Retriever Library](https://www.perplexity.ai/search/overview.md) with the multimodal embedding model [Llama Nemotron Embed VL 1B v2](https://build.nvidia.com/nvidia/llama-nemotron-embed-vl-1b-v2).
+This documentation describes how to use [NeMo Retriever Library](overview.md) 
+with the multimodal embedding model [Llama 3.2 NeMo Retriever Multimodal Embedding 1B](https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-1b-vlm-embed-v1).
 
-The `Llama Nemotron Embed VL 1B v2` model is optimized for multimodal question-answering and retrieval tasks.
-It can embed documents as text, images, or paired text-image combinations.
-These embeddings enable retrieving relevant documents based on a text query.
-The model supports three embedding modalities: `text`, `image`, and `text_image`.
+The `Llama 3.2 NeMo Retriever Multimodal Embedding 1B` model is optimized for multimodal question-answering retrieval. 
+The model can embed documents in the form of an image, text, or a combination of image and text. 
+Documents can then be retrieved given a user query in text form. 
+The model supports images that contain text, tables, charts, and infographics.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NeMo Retriever Library is also known as NVIDIA Ingest.
 
 
 ## Configure and Run the Multimodal NIM
 
 Use the following procedure to configure and run the multimodal embedding NIM locally.
 
-1. Configure the embedding model in your `.env` file. This instructs the NeMo Retriever Library to use the Llama Nemotron Embed VL model instead of the default text-only model.
+1. Set the embedding model in your `.env` file. This tells NeMo Retriever Library to use the Llama 3.2 Multimodal model instead of the default text-only embedding model.
 
     ```
-    EMBEDDING_IMAGE=nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2
-    EMBEDDING_TAG=1.12.0
-    EMBEDDING_NIM_MODEL_NAME=nvidia/llama-nemotron-embed-vl-1b-v2
+    EMBEDDING_IMAGE=nvcr.io/nvidia/nemo-microservices/llama-3.2-nemoretriever-1b-vlm-embed-v1
+    EMBEDDING_TAG=1.7.0
+    EMBEDDING_NIM_MODEL_NAME=nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
     ```
 
 2. Start the NeMo Retriever Library services. The multimodal embedding service is included by default.
@@ -32,32 +33,13 @@ Use the following procedure to configure and run the multimodal embedding NIM lo
 
 
 After the services are running, you can interact with the extraction pipeline by using Python.
-The key to using the multimodal model effectively is configuring the `extract` and `embed` methods to handle different content types with the correct modality.
+The key to leveraging the multimodal model is 
+to configure the `extract` and `embed` methods to process different content types as either text or images.
 
 
-## Supported Modalities
+## Example with Default Text-Based Embedding
 
-The multimodal embedding model supports three modalities:
-
-- **`text`** – Embeds content as plain text. This is the default modality and provides a strong baseline for retrieval.
-- **`image`** – Embeds content as an image, capturing visual and spatial layout details that are helpful for tables, charts, and infographics.
-- **`text_image`** – Embeds paired text and image together, combining the semantic depth of text with the visual context of an image for higher retrieval quality.
-
-
-## Per-Element Modality Control
-
-You can apply different modalities to various content types by passing per-element modality parameters to the embed method:
-
-- **`text_elements_modality`** – Specifies the modality for text elements (default: "text").
-- **`structured_elements_modality`** – Specifies the modality for tables and charts (default: "text").
-- **`image_elements_modality`** – Specifies the modality for images, including page images (default: "text").
-
-This configuration lets you, for example, embed plain text as text while embedding tables as images or as combined text and image.
-
-
-## Example 1: Default Text-Based Embedding
-
-By default, when you use the multimodal model, all extracted content—such as text, tables, and charts—is processed as plain text.
+When you use the multimodal model, by default, all extracted content (text, tables, charts) is treated as plain text. 
 The following example provides a strong baseline for retrieval.
 
 - The `extract` method is configured to pull out text, tables, and charts.
@@ -79,9 +61,9 @@ results = ingestor.ingest()
 ```
 
 
-## Example 2: Structured Elements as Images
+## Example with Embedding Structured Elements as Images
 
-It is common to process PDFs by embedding regular text as text and embedding visual elements, such as tables and charts, as images.
+It is common to process PDFs by embedding standard text as text and embedding visual elements, such as tables and charts, as images. 
 The following example enables the multimodal model to capture the spatial and structural information of the visual content.
 
 - The `extract` method is configured to pull out text, tables, and charts.
@@ -105,39 +87,18 @@ results = ingestor.ingest()
 ```
 
 
-## Example 3: Structured Elements as Text+Image Pairs
-
-For the highest-quality retrieval of tables and charts, embed them as paired text and image.
-This approach combines the extracted table text with the rendered table image, giving the model both semantic and visual context.
+## Example with Embedding Entire PDF Pages as Images
 
-- The `extract` method is configured to capture text, tables, and charts.
-- The embed method is configured with `structured_elements_modality="text_image"` so that tables and charts are embedded as paired text and image.
+For documents where the entire page layout is important (such as infographics, complex diagrams, or forms), 
+you can configure NeMo Retriever Library to treat every page as a single image.
+The following example extracts and embeds each page as an image.
 
-```python
-ingestor = (
-    Ingestor()
-    .files("./data/*.pdf")
-    .extract(
-        extract_text=True,
-        extract_tables=True,
-        extract_charts=True,
-        extract_images=False,
-    )
-    .embed(
-        structured_elements_modality="text_image",
-    )
-)
-results = ingestor.ingest()
-```
-
-
-## Example 4: Full Page as Image
+!!! note
 
-For documents where the full page layout matters (such as infographics, complex diagrams, or forms), you can configure NeMo Retriever Library to treat each page as a single image.
-In the following example, every page is extracted and embedded as an image.
+    The `extract_page_as_image` feature is experimental. Its behavior may change in future releases.
 
-- The `extract` method uses `extract_page_as_image=True`, with all other extraction options set to `False`.
-- The `embed` method then processes these page images with `image_elements_modality="image"`.
+- The `extract` method uses the `extract_page_as_image=True` parameter. All other extraction types are set to `False`.
+- The `embed` method processes the page images.
 
 ```python
 ingestor = (
@@ -157,37 +118,9 @@ ingestor = (
 results = ingestor.ingest()
 ```
 
-
-## Example 5: Full Page as Text+Image
-
-For the best retrieval quality on full-page content, you can embed each page as a paired text and image.
-When `image_elements_modality="text_image"` is set, the pipeline automatically aggregates the text content from each page and pairs it with the page image for joint embedding.
-
-- The `extract` method extracts both page images and text content, aggregating the text and pairing it with the corresponding page image.
-- The `embed` method processes the page images with `image_elements_modality="text_image"`.
-
-```python
-ingestor = (
-    Ingestor()
-    .files("./data/*.pdf")
-    .extract(
-        extract_text=True,
-        extract_tables=True,
-        extract_charts=True,
-        extract_infographics=True,
-        extract_images=False,
-        extract_page_as_image=True,
-    )
-    .embed(
-        image_elements_modality="text_image",
-    )
-)
-results = ingestor.ingest()
-```
-
 ## Related Topics
 
 - [Support Matrix](support-matrix.md)
-- [Troubleshoot NeMo Retriever Library](troubleshoot.md)
-- [Use the NeMo Retriever Library Python API](python-api-reference.md)
-- [Extract Captions from Images](python-api-reference.md#extract-captions-from-images)
+- [Troubleshoot NeMo Retriever Extraction](troubleshoot.md)
+- [Use the Python API](nv-ingest-python-api.md)
+- [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images)
diff --git a/docs/docs/index.md b/docs/docs/index.md
index e9eaac7f8..32d7b1acd 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -1,13 +1,13 @@
-# What is NVIDIA NeMo Retriever Library?
+# What is NVIDIA NeMo Retriever?
 
-NVIDIA NeMo Retriever Library is a collection of microservices 
+NVIDIA NeMo Retriever is a collection of microservices 
 for building and scaling multimodal data extraction, embedding, and reranking pipelines 
 with high accuracy and maximum data privacy – built with NVIDIA NIM. 
-NeMo Retriever Library, part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle, 
+NeMo Retriever, part of the [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/products/nemo/) software suite for managing the AI agent lifecycle, 
 ensures data privacy and seamlessly connects to proprietary data wherever it resides, 
 empowering secure, enterprise-grade retrieval.
 
-NeMo Retriever Library provides the following:
+NeMo Retriever provides the following:
 
 - **Multimodal Data Extraction** — Quickly extract documents at scale that include text, tables, charts, and infographics.
 - **Embedding + Indexing** — Embed all extracted text from text chunks and images, and then insert into LanceDB (default) or Milvus — accelerated with NVIDIA cuVS.
@@ -19,31 +19,31 @@ NeMo Retriever Library provides the following:
 
 ## Enterprise-Ready Features
 
-NVIDIA NeMo Retriever Library comes with enterprise-ready features, including the following:
+NVIDIA NeMo Retriever comes with enterprise-ready features, including the following:
 
-- **High Accuracy** — NeMo Retriever Library exhibits a high level of accuracy when retrieving across various modalities through enterprise documents. 
-- **High Throughput** — NeMo Retriever Library is capable of extracting, embedding, indexing and retrieving across hundreds of thousands of documents at scale with high throughput. 
-- **Decomposable/Customizable** — NeMo Retriever Library consists of modules that can be separately used and deployed in your own environment. 
-- **Enterprise-Grade Security** — NeMo Retriever Library NIMs come with security features such as the use of [safetensors](https://huggingface.co/docs/safetensors/index), continuous patching of CVEs, and more. 
+- **High Accuracy** — NeMo Retriever exhibits a high level of accuracy when retrieving across various modalities through enterprise documents. 
+- **High Throughput** — NeMo Retriever is capable of extracting, embedding, indexing and retrieving across hundreds of thousands of documents at scale with high throughput. 
+- **Decomposable/Customizable** — NeMo Retriever consists of modules that can be separately used and deployed in your own environment. 
+- **Enterprise-Grade Security** — NeMo Retriever NIMs come with security features such as the use of [safetensors](https://huggingface.co/docs/safetensors/index), continuous patching of CVEs, and more. 
 
 
 
 ## Applications
 
-The following are some applications that use NVIDIA NeMo Retriever Library:
+The following are some applications that use NVIDIA NeMo Retriever:
 
 - [AI Virtual Assistant for Customer Service](https://github.com/NVIDIA-AI-Blueprints/ai-virtual-assistant) (NVIDIA AI Blueprint)
 - [Build an Enterprise RAG pipeline](https://build.nvidia.com/nvidia/build-an-enterprise-rag-pipeline/blueprintcard) (NVIDIA AI Blueprint)
 - [Building Code Documentation Agents with CrewAI](https://github.com/crewAIInc/nvidia-demo) (CrewAI Demo)
 - [Digital Human for Customer Service](https://github.com/NVIDIA-AI-Blueprints/digital-human) (NVIDIA AI Blueprint)
-- [Document Research Assistant for Blog Creation](https://developers.llamaindex.ai/python/examples/agent/nvidia_document_research_assistant_for_blog_creation/) (LlamaIndex)
+- [Document Research Assistant for Blog Creation](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/agent/nvidia_document_research_assistant_for_blog_creation.ipynb) (LlamaIndex Jupyter Notebook)
 - [Video Search and Summarization](https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization) (NVIDIA AI Blueprint)
 
 
 
 ## Related Topics
 
-- [NeMo Retriever Library Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html)
-- [NeMo Retriever Library Text Reranking NIM](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html)
+- [NeMo Retriever Text Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html)
+- [NeMo Retriever Text Reranking NIM](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html)
 - [NVIDIA NIM for Object Detection](https://docs.nvidia.com/nim/ingestion/object-detection/latest/overview.html)
-- [NVIDIA NIM for Image OCR](https://docs.nvidia.com/nim/ingestion/image-ocr/latest/overview.html)
+- [NVIDIA NIM for Image OCR](https://docs.nvidia.com/nim/ingestion/image-ocr/latest/overview.html)
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 36cf26dd2..678ae4cfa 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -58,7 +58,7 @@ nav:
   - NeMo Retriever: 
     - Overview: 
       - Overview: index.md
-    - NeMo Retriever Library:
+    - NeMo Retriever Extraction:
       - Overview: extraction/overview.md
       - Release Notes: extraction/releasenotes-nv-ingest.md
       - Get Started:

From 22d58bfea16d063697c96b8a2ca2b990d3d30f99 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Thu, 19 Mar 2026 13:22:39 -0700
Subject: [PATCH 42/55] =?UTF-8?q?Confirmed=20product=20naming=20of=20NeMo?=
 =?UTF-8?q?=20Retriever=20Library=20in=20files=20and=20code=20=E2=80=A6=20?=
 =?UTF-8?q?(#1664)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/docs/extraction/benchmarking.md | 2 +-
 docs/docs/extraction/contributing.md | 4 ----
 2 files changed, 1 insertion(+), 5 deletions(-)
 delete mode 100644 docs/docs/extraction/contributing.md

diff --git a/docs/docs/extraction/benchmarking.md b/docs/docs/extraction/benchmarking.md
index 0f8cd81d9..30488fa60 100644
--- a/docs/docs/extraction/benchmarking.md
+++ b/docs/docs/extraction/benchmarking.md
@@ -1,4 +1,4 @@
-# nv-ingest Integration Testing Framework
+# NV-Ingest Integration Testing Framework
 
 A configurable, dataset-agnostic testing framework for end-to-end validation of nv-ingest pipelines. This framework uses structured YAML configuration for type safety, validation, and parameter management.
 
diff --git a/docs/docs/extraction/contributing.md b/docs/docs/extraction/contributing.md
deleted file mode 100644
index 6a136c218..000000000
--- a/docs/docs/extraction/contributing.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# Contributing to NV-Ingest
-
-External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated! 
-For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).

From 17e0148447c88b043c165999ec32a4cdbcbc393d Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Fri, 20 Mar 2026 11:02:40 -0700
Subject: [PATCH 43/55] update helm file (#1679)

fixed link per Sohail's comment
---
 docs/docs/extraction/helm.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/docs/docs/extraction/helm.md b/docs/docs/extraction/helm.md
index 1a6e885a3..0983a382e 100644
--- a/docs/docs/extraction/helm.md
+++ b/docs/docs/extraction/helm.md
@@ -1,6 +1,9 @@
+<!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
+
 # Deploy With Helm for NeMo Retriever Library
 
-<!-- Use this documentation to deploy [NeMo Retriever Library](overview.md) by using Helm. -->
+To deploy [NeMo Retriever Library](overview.md) by using Helm, refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/26.03/helm/README.md).
 
-To deploy [NeMo Retriever Library](overview.md) by using Helm, 
-refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/helm/README.md).
+!!! note "Air-gapped environments"
+   
+    For deploying in an air-gapped environment, refer to the [NVIDIA NIM Operator documentation on Air-Gapped Environments](https://docs.nvidia.com/nim-operator/latest/air-gap.html), which explains how to deploy NIMs when your cluster has no internet or NGC registry access.

From 3d4fdaeee1f25acaea5dd5f6d9afa3221853eeab Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 08:35:08 -0700
Subject: [PATCH 44/55] updated quickstart to current version following
 reversion (#1683)

---
 docs/docs/extraction/quickstart-guide.md | 51 ++++++------------------
 1 file changed, 13 insertions(+), 38 deletions(-)

diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 6f7ab7194..217fbbed1 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -101,43 +101,7 @@ h. Run the command `docker ps`. You should see output similar to the following.
     3403c5a0e7be  redis/redis-stack                                "/entrypoint.sh"        7 minutes ago   Up 7 minutes            0.0.0.0:6379...  nv-ingest-redis-1
     ```
 
-
-## Step 2: Install Python Dependencies
-
-You can interact with the service from the host, or by using `docker exec` to run commands in the runtime container.
-
-To interact from the host, you'll need a Python environment that has the client dependencies installed.
-
-```
-uv venv --python 3.12 nv-ingest-dev
-source nv-ingest-dev/bin/activate
-uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2
-```
-
-!!! tip
-
-    To confirm that you have activated your virtual environment, run `which pip` and `which python`, and confirm that you see `nvingest` in the result. You can do this before any pip or python command that you run.
-
-
-!!! note
-
-Interaction from the host requires the appropriate port to be exposed from the runtime container, as defined in the `docker-compose.yaml` file. If you prefer, you can disable this port and interact directly with the service from within its container.
-
-To work inside the container, run the following code.
-
-```bash
-docker exec -it nv-ingest-nv-ingest-ms-runtime-1 bash
-```
-This command opens a shell in the `/workspace` directory, where the `DATASET_ROOT` from your `.env` file is mounted at `./data`. The pre-created `nv_ingest_runtime` virtual environment includes all necessary Python client libraries. You should see a prompt similar to the following.
-
-```bash
-(nv_ingest_runtime) root@your-computer-name:/workspace#
-```
-From this prompt, you can run the CLI and Python examples.
-
-Because many service URIs default to localhost, running inside the runtime container also requires that you specify URIs manually so that services can communicate across containers on the internal Docker network. When using Milvus, see the example following for how to set the `milvus_uri`. With the default LanceDB backend, no extra URI configuration is needed.
-
-## Step 3: Ingest Documents
+## Step 2: Ingest Documents
 
 You can submit jobs programmatically in Python or using the [CLI](nv-ingest_cli.md).
 
@@ -358,7 +322,7 @@ INFO:nv_ingest_client.cli.util.processing:Throughput (Pages/sec): 1.28
 INFO:nv_ingest_client.cli.util.processing:Throughput (Files/sec): 0.43
 ```
 
-## Step 4: Inspecting and Consuming Results
+## Step 3: Inspecting and Consuming Results
 
 After the ingestion steps above have been completed, you should be able to find the `text` and `image` subfolders inside your processed docs folder. Each will contain JSON-formatted extracted content and metadata.
 
@@ -430,6 +394,17 @@ You can specify multiple `--profile` options.
 | `vlm`                 | Advanced | Use [llama 3.1 Nemotron 8B Vision](https://build.nvidia.com/nvidia/llama-3.1-nemotron-nano-vl-8b-v1/modelcard) for image captioning of unstructured images and infographics. This profile enables the `caption` method in the Python API to generate text descriptions of visual content. For more information, refer to [Use Multimodal Embedding](vlm-embed.md) and [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images). | 
 
 
+## Air-Gapped Deployment (Docker Compose)
+
+When deploying in an air-gapped environment (no internet or NGC registry access), you must pre-stage container images on a machine with network access, then transfer and load them in the isolated environment.
+
+1. On a machine with network access: Clone the repo, authenticate with NGC (`docker login nvcr.io`), and pull all images used by your chosen profile (for example, `docker compose --profile retrieval pull`).
+2. Save images: Export the images to archives (for example, using `docker save` for each image or a script that saves all images referenced by your [docker-compose.yaml](https://github.com/NVIDIA/NeMo-Retriever/blob/main/docker-compose.yaml)).
+3. Transfer the image archives and your `docker-compose.yaml` (and `.env` if used) to the air-gapped system.
+4. On the air-gapped machine: Load the images (`docker load -i <archive>`) and start the stack with the same profile (for example, `docker compose --profile retrieval up`).
+
+Ensure the same image tags and `docker-compose.yaml` version are used in both environments so that service configuration stays consistent.
+
 ## Docker Compose override files
 
 The default [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) might exceed VRAM on a single GPU for some hardware. Override files reduce per-service memory, batch sizes, or concurrency so the full pipeline can run on the available GPU. To use an override, pass a second `-f` file after the base compose file; Docker Compose merges them and the override takes precedence.

From b1f56bb432212275a0de83b7ec064b025dc615c5 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 08:36:39 -0700
Subject: [PATCH 45/55] Kheiss/quickstart lib mode update (#1682)

---
 .../extraction/quickstart-library-mode.md     | 482 +-----------------
 1 file changed, 1 insertion(+), 481 deletions(-)

diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index c027e6a0d..a3ecbf867 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -1,487 +1,7 @@
 # Deploy Without Containers (Library Mode) for NeMo Retriever Library
 
-[NeMo Retriever Library](overview.md) is typically deployed as a cluster of containers for robust, scalable production use. 
-
 !!! note
 
     NeMo Retriever Library is also known as NVIDIA Ingest.
 
-In addition, you can use library mode, which is intended for the following cases:
-
-- Local development
-- Experimentation and testing
-- Small-scale workloads, such as workloads of fewer than 100 documents
-
-
-By default, library mode depends on NIMs that are hosted on build.nvidia.com. 
-In library mode you launch the main pipeline service directly within a Python process, 
-while all other services (such as embedding and storage) are hosted remotely in the cloud.
-
-To get started using library mode, you need the following:
-
-- Linux operating systems (Ubuntu 22.04 or later recommended) or MacOS
-- Python 3.12
-- We strongly advise using an isolated Python virtual env with [uv](https://docs.astral.sh/uv/getting-started/installation/).
-
-
-
-## Step 1: Prepare Your Environment
-
-Use the following procedure to prepare your environment.
-
-1. Run the following code to create your NV Ingest Python environment.
-
-    ```
-       uv venv --python 3.12 nvingest && \
-         source nvingest/bin/activate && \
-         uv pip install nv-ingest==26.1.2 nv-ingest-api==26.1.2 nv-ingest-client==26.1.2
-    ```
-
-    By default, the pipeline uses **LanceDB** as the vector database (no extra package required). To use **Milvus** (e.g. milvus-lite) instead, also install `milvus-lite==2.4.12` and pass `milvus_uri="milvus.db"` in `vdb_upload`. For details, see [Data Upload](data-store.md).
-
-    !!! tip
-
-        To confirm that you have activated your virtual environment, run `which python` and confirm that you see `nvingest` in the result. You can do this before any python command that you run.
-
-2. Set or create a .env file that contains your NVIDIA Build API key and other environment variables.
-
-    !!! note
-
-        If you have an NGC API key, you can use it here. For more information, refer to [Generate Your NGC Keys](ngc-api-key.md) and [Environment Configuration Variables](environment-config.md).
-
-    - To set your variables, use the following code.
-
-        ```
-        export NVIDIA_API_KEY=nvapi-<your key>
-        ```
-    - To add your variables to a .env file, include the following.
-
-        ```
-        NVIDIA_API_KEY=nvapi-<your key>
-        ```
-
-
-## Step 2: Ingest Documents
-
-You can submit jobs programmatically by using Python.
-
-!!! tip
-
-    For more Python examples, refer to [NV-Ingest: Python Client Quick Start Guide](https://github.com/NVIDIA/nv-ingest/blob/main/client/client_examples/examples/python_client_usage.ipynb).
-
-
-If you have a very high number of CPUs, and see the process hang without progress, 
-we recommend that you use `taskset` to limit the number of CPUs visible to the process. 
-Use the following code.
-
-```
-taskset -c 0-3 python your_ingestion_script.py
-```
-
-On a 4 CPU core low end laptop, the following code should take about 10 seconds.
-
-```python
-import time
-
-from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nv_ingest_client.client import Ingestor, NvIngestClient
-from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
-from nv_ingest_client.util.process_json_files import ingest_json_results_to_blob
-
-def main():
-    # Start the pipeline subprocess for library mode
-    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
-
-    client = NvIngestClient(
-        message_client_allocator=SimpleClient,
-        message_client_port=7671,
-        message_client_hostname="localhost",
-    )
-
-    # Optional: use Milvus (e.g. milvus-lite) by providing milvus_uri and installing milvus-lite.
-    # By default, LanceDB is used and no milvus_uri is needed.
-    # milvus_uri = "milvus.db"
-    collection_name = "test"
-    sparse = False
-
-    # do content extraction from files
-    ingestor = (
-        Ingestor(client=client)
-        .files("data/multimodal_test.pdf")
-        .extract(
-            extract_text=True,
-            extract_tables=True,
-            extract_charts=True,
-            extract_images=True,
-            table_output_format="markdown",
-            extract_infographics=True,
-            # extract_method="nemotron_parse", #Slower, but maximally accurate, especially for PDFs with pages that are scanned images
-            text_depth="page",
-        )
-        .embed()
-        .vdb_upload(
-            collection_name=collection_name,
-            # milvus_uri=milvus_uri,  # Uncomment to use Milvus instead of LanceDB
-            sparse=sparse,
-            # for llama-3.2 embedder, use 1024 for e5-v5
-            dense_dim=2048,
-        )
-    )
-
-    print("Starting ingestion..")
-    t0 = time.time()
-
-    # Return both successes and failures
-    # Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
-    results, failures = ingestor.ingest(show_progress=True, return_failures=True)
-
-    # Return only successes
-    # results = ingestor.ingest(show_progress=True)
-
-    t1 = time.time()
-    print(f"Total time: {t1 - t0} seconds")
-
-    # results blob is directly inspectable
-    if results:
-        print(ingest_json_results_to_blob(results[0]))
-
-    # (optional) Review any failures that were returned
-    if failures:
-        print(f"There were {len(failures)} failures. Sample: {failures[0]}")
-
-if __name__ == "__main__":
-    main()
-```
-
-!!! note
-
-    For advanced visual parsing with library mode, uncomment `extract_method="nemotron_parse"` in the previous code. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
-
-
-You can see the extracted text that represents the content of the ingested test document.
-
-```shell
-Starting ingestion..
-Total time: 9.243880033493042 seconds
-
-TestingDocument
-A sample document with headings and placeholder text
-Introduction
-This is a placeholder document that can be used for any purpose. It contains some 
-headings and some placeholder text to fill the space. The text is not important and contains 
-no real value, but it is useful for testing. Below, we will have some simple tables and charts 
-that we can use to confirm Ingest is working as expected.
-Table 1
-This table describes some animals, and some activities they might be doing in specific 
-locations.
-Animal Activity Place
-Gira@e Driving a car At the beach
-Lion Putting on sunscreen At the park
-Cat Jumping onto a laptop In a home o@ice
-Dog Chasing a squirrel In the front yard
-Chart 1
-This chart shows some gadgets, and some very fictitious costs.
-
-... document extract continues ...
-```
-
-## Step 3: Query Ingested Content
-
-To query for relevant snippets of the ingested content, and use them with an LLM to generate answers, use the following code. With the default LanceDB backend, use the LanceDB retrieval API (see [Data Upload](data-store.md)). The example below shows retrieval when using Milvus (e.g. milvus-lite).
-
-```python
-import os
-from openai import OpenAI
-from nv_ingest_client.util.milvus import nvingest_retrieval
-
-# Only needed when using Milvus (e.g. milvus-lite) instead of LanceDB
-milvus_uri = "milvus.db"
-collection_name = "test"
-sparse = False
-
-queries = ["Which animal is responsible for the typos?"]
-
-retrieved_docs = nvingest_retrieval(
-    queries,
-    collection_name,
-    milvus_uri=milvus_uri,
-    hybrid=sparse,
-    top_k=1,
-)
-
-# simple generation example
-extract = retrieved_docs[0][0]["entity"]["text"]
-client = OpenAI(
-  base_url = "https://integrate.api.nvidia.com/v1",
-  api_key = os.environ["NVIDIA_API_KEY"]
-)
-
-prompt = f"Using the following content: {extract}\n\n Answer the user query: {queries[0]}"
-print(f"Prompt: {prompt}")
-completion = client.chat.completions.create(
-  model="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
-  messages=[{"role":"user","content": prompt}],
-)
-response = completion.choices[0].message.content
-
-print(f"Answer: {response}")
-```
-
-```shell
-Prompt: Using the following content: Table 1
-| This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. | This table describes some animals, and some activities they might be doing in specific locations. |
-| Animal | Activity | Place |
-| Giraffe | Driving a car | At the beach |
-| Lion | Putting on sunscreen | At the park |
-| Cat | Jumping onto a laptop | In a home office |
-| Dog | Chasing a squirrel | In the front yard |
-
- Answer the user query: Which animal is responsible for the typos?
-Answer: A clever query!
-
-Based on the provided Table 1, I'd make an educated inference to answer your question. Since the activities listed are quite unconventional for the respective animals (e.g., a giraffe driving a car, a lion putting on sunscreen), it's likely that the table is using humor or hypothetical scenarios.
-
-Given this context, the question "Which animal is responsible for the typos?" is probably a tongue-in-cheek inquiry, as there's no direct information in the table about typos or typing activities.
-
-However, if we were to make a playful connection, we could look for an animal that's:
-
-1. Typically found in a setting where typing might occur (e.g., an office).
-2. Engaging in an activity that could potentially lead to typos (e.g., interacting with a typing device).
-
-Based on these loose criteria, I'd jokingly point to:
-
-**Cat** as the potential culprit, since it's:
-        * Located "In a home office"
-        * Engaged in "Jumping onto a laptop", which could theoretically lead to accidental keystrokes or typos if the cat were to start "walking" on the keyboard!
-
-Please keep in mind that this response is purely humorous and interpretative, as the table doesn't explicitly mention typos or provide a straightforward answer to the question.
-```
-
-
-
-## Logging Configuration
-
-Nemo Retriever extraction uses [Ray](https://docs.ray.io/en/latest/index.html) for logging. 
-For details, refer to [Configure Ray Logging](ray-logging.md).
-
-By default, library mode runs in quiet mode to minimize startup noise. 
-Quiet mode automatically configures the following environment variables.
-
-| Variable                             | Quiet Mode Value | Description |
-|--------------------------------------|------------------|-------------|
-| `INGEST_RAY_LOG_LEVEL`               | `PRODUCTION`     | Sets Ray logging to ERROR level to reduce noise. |
-| `RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO` | `0`              | Silences Ray accelerator warnings |
-| `OTEL_SDK_DISABLED`                  | `true`           | Disables OpenTelemetry trace export errors |
-
-
-If you want to see detailed startup logs for debugging, use one of the following options:
-
-- Set `quiet=False` when you run the pipeline as shown following.
-
-    ```python
-    run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True, quiet=False)
-    ```
-
-- Set the environment variables manually before you run the pipeline as shown following.
-
-    ```bash
-    export INGEST_RAY_LOG_LEVEL=DEVELOPMENT  # or DEBUG for maximum verbosity
-    ```
-
-
-
-## Library Mode Communication and Advanced Examples
-
-Communication in library mode is handled through a simplified, 3-way handshake message broker called `SimpleBroker`.
-
-Attempting to run a library-mode process co-located with a Docker Compose deployment does not work by default. 
-The Docker Compose deployment typically creates a firewall rule or port mapping that captures traffic to port `7671`,
-which prevents the `SimpleBroker` from receiving messages. 
-Always ensure that you use library mode in isolation, without an active containerized deployment listening on the same port.
-
-
-### Example `launch_libmode_service.py`
-
-This example launches the pipeline service in a subprocess, 
-and keeps it running until it is interrupted (for example, by pressing `Ctrl+C`). 
-It listens for ingestion requests on port `7671` from an external client.
-
-```python
-import logging
-import os
-
-from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nv_ingest_api.util.logging.configuration import configure_logging as configure_local_logging
-
-# Configure the logger
-logger = logging.getLogger(__name__)
-
-local_log_level = os.getenv("INGEST_LOG_LEVEL", "DEFAULT")
-if local_log_level in ("DEFAULT",):
-    local_log_level = "INFO"
-
-configure_local_logging(local_log_level)
-
-
-def main():
-    """
-    Launch the libmode pipeline service using the embedded default configuration.
-    """
-    try:
-        # Start pipeline and block until interrupted
-        # Note: stdout/stderr cannot be passed when run_in_subprocess=True (not picklable)
-        # Use quiet=False to see verbose startup logs
-        _ = run_pipeline(
-            block=True,
-            disable_dynamic_scaling=True,
-            run_in_subprocess=True,
-        )
-    except KeyboardInterrupt:
-        logger.info("Keyboard interrupt received. Shutting down...")
-    except Exception as e:
-        logger.error(f"An unexpected error occurred: {e}", exc_info=True)
-
-
-if __name__ == "__main__":
-    main()
-```
-
-### Example `launch_libmode_and_run_ingestor.py`
-
-This example starts the pipeline service in-process, 
-and immediately runs an ingestion client against it in the same parent process.
-
-```python
-import logging
-import os
-import time
-
-from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
-from nv_ingest_api.util.logging.configuration import configure_logging as configure_local_logging
-from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
-from nv_ingest_client.client import Ingestor
-from nv_ingest_client.client import NvIngestClient
-
-# Configure the logger
-logger = logging.getLogger(__name__)
-
-local_log_level = os.getenv("INGEST_LOG_LEVEL", "INFO")
-if local_log_level in ("DEFAULT",):
-    local_log_level = "INFO"
-
-configure_local_logging(local_log_level)
-
-
-def run_ingestor():
-    """
-    Set up and run the ingestion process to send traffic against the pipeline.
-    """
-    logger.info("Setting up Ingestor client...")
-    client = NvIngestClient(
-        message_client_allocator=SimpleClient, message_client_port=7671, message_client_hostname="localhost"
-    )
-
-    ingestor = (
-        Ingestor(client=client)
-        .files("./data/multimodal_test.pdf")
-        .extract(
-            extract_text=True,
-            extract_tables=True,
-            extract_charts=True,
-            extract_images=True,
-            table_output_format="markdown",
-            extract_infographics=False,
-            text_depth="page",
-        )
-        .split(chunk_size=1024, chunk_overlap=150)
-        .embed()
-    )
-
-    try:
-        results, _ = ingestor.ingest(show_progress=False, return_failures=True)
-        logger.info("Ingestion completed successfully.")
-    except Exception as e:
-        logger.error(f"Ingestion failed: {e}")
-        raise
-
-    print("\nIngest done.")
-    print(f"Got {len(results)} results.")
-
-
-def main():
-    """
-    Launch the libmode pipeline service and run the ingestor against it.
-    Uses the embedded default libmode pipeline configuration.
-    """
-    pipeline = None
-    try:
-        # Start pipeline in subprocess
-        # Note: stdout/stderr cannot be passed when run_in_subprocess=True (not picklable)
-        # Use quiet=False to see verbose startup logs
-        pipeline = run_pipeline(
-            block=False,
-            disable_dynamic_scaling=True,
-            run_in_subprocess=True,
-        )
-        time.sleep(10)
-        run_ingestor()
-        # Run other code...
-    except KeyboardInterrupt:
-        logger.info("Keyboard interrupt received. Shutting down...")
-    except Exception as e:
-        logger.error(f"Error running pipeline: {e}")
-    finally:
-        if pipeline:
-            pipeline.stop()
-            logger.info("Shutting down pipeline...")
-
-
-if __name__ == "__main__":
-    main()
-```
-
-
-
-## The `run_pipeline` Function Reference
-
-The `run_pipeline` function is the main entry point to start the Nemo Retriever Extraction pipeline. 
-It can run in-process or as a subprocess.
-
-The `run_pipeline` function accepts the following parameters.
-
-| Parameter                | Type                   | Default | Required? | Description                                     |
-|--------------------------|------------------------|---------|-----------|-------------------------------------------------|
-| pipeline_config            | PipelineConfigSchema | —       | Yes       | A configuration object that specifies how the pipeline should be constructed. |
-| run_in_subprocess        | bool                   | False   | Yes       | `True` to launch the pipeline in a separate Python subprocess. `False` to run in the current process. |
-| block                    | bool                   | True    | Yes       | `True` to run the pipeline synchronously. The function returns after it finishes. `False` to return an interface for external pipeline control. |
-| disable_dynamic_scaling  | bool                   | None    | No        | `True` to disable autoscaling regardless of global settings. `None` to use the global default behavior. |
-| dynamic_memory_threshold | float                  | None    | No        | A value between `0.0` and `1.0`. If dynamic scaling is enabled, triggers autoscaling when memory usage crosses this threshold. |
-| stdout                   | TextIO                 | None    | No        | Redirect the subprocess `stdout` to a file or stream. If `None`, defaults to `/dev/null`. |
-| stderr                   | TextIO                 | None    | No        | Redirect subprocess `stderr` to a file or stream. If `None`, defaults to `/dev/null`. |
-| libmode                  | bool                   | True    | No        | `True` to load the default library mode pipeline configuration when `ingest_config` is `None`. |
-| quiet                    | bool                   | None    | No        | `True` to suppress verbose startup logs (PRODUCTION preset). `None` defaults to `True` when `libmode=True`. Set to `False` for verbose output. |
-
-
-The `run_pipeline` function returns the following values, depending on the parameters that you set:
-
-- **run_in_subprocess=False and block=True**  — The function returns a `float` that represents the elapsed time in seconds.
-- **run_in_subprocess=False and block=False** — The function returns a `RayPipelineInterface` object.
-- **run_in_subprocess=True  and block=True**  — The function returns `0.0`.
-- **run_in_subprocess=True  and block=False** — The function returns a `RayPipelineInterface` object.
-
-
-The `run_pipeline` throws the following errors:
-
-- **RuntimeError** — A subprocess failed to start, or exited with error.
-- **Exception** — Any other failure during pipeline setup or execution.
-
-
-
-## Related Topics
-
-- [Prerequisites](prerequisites.md)
-- [Support Matrix](support-matrix.md)
-- [Deploy With Docker Compose (Self-Hosted)](quickstart-guide.md)
-- [Deploy With Helm](helm.md)
-- [Notebooks](notebooks.md)
-- [Enterprise RAG Blueprint](https://build.nvidia.com/nvidia/multimodal-pdf-data-extraction-for-enterprise-rag)
+Use the [Quick Start for NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/26.03/nemo_retriever/README.md) to set up and run the NeMo Retriever Library locally. With it, you can build a GPU-accelerated, multimodal RAG ingestion pipeline that parses PDFs, HTML, text, audio, and video into LanceDB vector embeddings, integrates with Nemotron RAG models (locally or through NIM endpoints), and includes Ray-based scaling with built-in recall evaluation.
\ No newline at end of file

From 19e77e11cbf99ca3c9abe568cc4c7b8856a2540b Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 08:44:07 -0700
Subject: [PATCH 46/55] Update RNs to current version (#1687)

---
 .../docs/extraction/releasenotes-nv-ingest.md | 93 ++++++-------------
 1 file changed, 30 insertions(+), 63 deletions(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index fb4b847f8..d3b71b4a5 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -4,68 +4,38 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
-
-
-
-## Release 26.01 (26.1.2)
-
-The NeMo Retriever Library 26.01 release adds new hardware and software support, and other improvements.
-
-To upgrade the Helm Charts for this version, refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/helm/README.md).
-
-
-### Highlights 
-
-This release contains the following key changes:
-
-- Added functional support for [H200 NVL](https://www.nvidia.com/en-us/data-center/h200/). For details, refer to [Support Matrix](support-matrix.md).
-- All Helm deployments for Kubernetes now use [NVIDIA NIM Operator](https://docs.nvidia.com/nim-operator/latest/index.html). For details, refer to [NV-Ingest Helm Charts](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/helm/README.md). 
-- Updated RIVA NIM to version 1.4.0. For details, refer to [Extract Speech](audio.md).
-- Updated VLM NIM to [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard). For details, refer to [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images).
-- Added VLM caption prompt customization parameters, including reasoning control. For details, refer to [Caption Images and Control Reasoning](nv-ingest-python-api.md#caption-images-and-control-reasoning).
-- Added support for the [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse/modelcard) model which replaces the [nemoretriever-parse](https://build.nvidia.com/nvidia/nemoretriever-parse/modelcard) model. For details, refer to [Advanced Visual Parsing](nemoretriever-parse.md).
-- Support is now deprecated for [paddleocr](https://build.nvidia.com/baidu/paddleocr/modelcard).
-- The `meta-llama/Llama-3.2-1B` tokenizer is now pre-downloaded so that you can run token-based splitting without making a network request. For details, refer to [Split Documents](chunking.md).
-- For scanned PDFs, added specialized extraction strategies. For details, refer to [PDF Extraction Strategies](nv-ingest-python-api.md#pdf-extraction-strategies).
-- [LanceDB](https://lancedb.com/) is now the default vector database backend; Milvus remains fully supported. For details, refer to [Data Upload](data-store.md).
-- The V2 API is now available and is the default processing pipeline. The response format remains backwards-compatible. You can enable the v2 API by using `message_client_kwargs={"api_version": "v2"}`.For details, refer to [API Reference](api-docs).
-- Large PDFs are now automatically split into chunks and processed in parallel, delivering faster ingestion for long documents. For details, refer to [PDF Pre-Splitting](v2-api-guide.md).
-- Issues maintaining extraction quality while processing very large files are now resolved with the V2 API. For details, refer to [V2 API Guide](v2-api-guide.md).
-- Updated the embedding task to support embedding on custom content fields like the results of summarization functions. For details, refer to [Use the Python API](nv-ingest-python-api.md).
-- User-defined function summarization is now using `nemotron-mini-4b-instruct` which provides significant speed improvements. For details, refer to [User-defined Functions](user-defined-functions.md) and [NV-Ingest UDF Examples](https://github.com/NVIDIA/nv-ingest/blob/release/26.1.2/examples/udfs/README.md).
-- In the `Ingestor.extract` method, the defaults for `extract_text` and `extract_images` are now set to `true` for consistency with `extract_tables` and `extract_charts`. For details, refer to [Use the Python API](nv-ingest-python-api.md).
-- The `table-structure` profile is no longer available. The table-structure profile is now part of the default profile. For details, refer to [Profile Information](quickstart-guide.md#profile-information).
-- New documentation [Why Throughput Is Dataset-Dependent](throughput-is-dataset-dependent.md).
-- New documentation [Add User-defined Stages](user-defined-stages.md).
-- New documentation [Add User-defined Functions](user-defined-functions.md).
-- New documentation [Resource Scaling Modes](scaling-modes.md).
-- New documentation [NimClient Usage](nimclient.md).
-- New documentation [Use the API (V2)](v2-api-guide.md).
-
-
-
-### Fixed Known Issues
-
-The following are the known issues that are fixed in this version:
-
-- A10G support is restored. To use A10G hardware, use release 26.1.2 or later. For details, refer to [Support Matrix](support-matrix.md).
-- L40S support is restored. To use L40S hardware, use release 26.1.2 or later. For details, refer to [Support Matrix](support-matrix.md).
-- The page number field in the content metadata now starts at 1 instead of 0 so each page number is no longer off by one from what you would expect. For details, refer to [Content Metadata](content-metadata.md).
-- Support for batches that include individual files greater than approximately 400MB is restored. This includes audio files and pdfs.
-
-
-
-## All Known Issues
-
-The following are the known issues for NeMo Retriever Library:
-
-- Advanced visual parsing is not supported on RTX Pro 6000, B200, or H200 NVL. For details, refer to [Advanced Visual Parsing](advanced-visual-parsing.md) and [Support Matrix](support-matrix.md).
-- The Page Elements NIM (`nemoretriever-page-elements-v3:1.7.0`) may intermittently fail during inference under high-concurrency workloads. This happens when Triton’s dynamic batching combines requests that exceed the model’s maximum batch size, a situation more commonly seen in multi-GPU setups or large ingestion runs. In these cases, extraction fails for the impacted documents. A correction is planned for `nemoretriever-page-elements-v3:1.7.1`.
-
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.   
+
+## 26.03 Release Notes (26.3.0)
+
+NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements.
+
+To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
+
+Highlights for the 26.03 release include:
+
+- NV-Ingest GitHub repo renamed to NeMo-Retriever  
+- NeMo Retriever Extraction pipeline renamed to NeMo Retriever Library  
+- NeMo Retriever Library now supports two deployment options:  
+  - A new no-container, pip-installable in-process library for development (available on PyPI)  
+  - Existing production-ready Helm chart with NIMs  
+- Added documentation notes on Air-gapped deployment support  
+- Added documentation notes on OpenShift support  
+- Added support for RTX4500 Pro Blackwell SKU  
+- Added support for llama-nemotron-embed-vl-v2 in text and text+image modes  
+- New extract methods `pdfium_hybrid` and `ocr` target scanned PDFs to improve text and layout extraction from image-based pages  
+- VLM-based image caption enhancements:  
+  - Infographics can be captioned  
+  - Reasoning mode is configurable  
+- Enabled hybrid search with LanceDB  
+- Added retrieval_bench subfolder with generalizable agentic retrieval pipeline  
+- The project now uses UV as the primary environment and package manager instead of Conda, resulting in faster installs and simpler dependency handling  
+- Default Redis TTL increased from 1–2 hours to 48 hours so long-running jobs (e.g., VLM captioning) don’t expire before completion  
+- NeMo Retriever Library currently does not support image captioning via VLM; this feature will be added in the next release
 
 ## Release Notes for Previous Versions
 
+| [26.1.2](https://docs.nvidia.com/nemo/retriever/26.1.2/extraction/releasenotes-nv-ingest/)
 | [26.1.1](https://docs.nvidia.com/nemo/retriever/26.1.1/extraction/releasenotes-nv-ingest/)
 | [25.9.0](https://docs.nvidia.com/nemo/retriever/25.9.0/extraction/releasenotes-nv-ingest/) 
 | [25.6.3](https://docs.nvidia.com/nemo/retriever/25.6.3/extraction/releasenotes-nv-ingest/) 
@@ -74,13 +44,10 @@ The following are the known issues for NeMo Retriever Library:
 | [25.3.0](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
 | [24.12.1](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
 | [24.12.0](https://docs.nvidia.com/nemo/retriever/25.3.0/extraction/releasenotes-nv-ingest/) 
-|
-
-
 
 ## Related Topics
 
 - [Prerequisites](prerequisites.md)
 - [Deploy Without Containers (Library Mode)](quickstart-library-mode.md)
 - [Deploy With Docker Compose (Self-Hosted)](quickstart-guide.md)
-- [Deploy With Helm](helm.md)
+- [Deploy With Helm](helm.md)
\ No newline at end of file

From 0e0bebc489b109f4c002309f9358ca56e8c0d68d Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 09:03:15 -0700
Subject: [PATCH 47/55] Kheiss/update quickstart (#1688)

Moved example per Sohail comment: kheiss-uwzoo:kheiss/update-quickstart
---
 docs/docs/extraction/quickstart-guide.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/docs/extraction/quickstart-guide.md b/docs/docs/extraction/quickstart-guide.md
index 217fbbed1..460a17355 100644
--- a/docs/docs/extraction/quickstart-guide.md
+++ b/docs/docs/extraction/quickstart-guide.md
@@ -393,6 +393,17 @@ You can specify multiple `--profile` options.
 | `nemotron-parse`      | Advanced | Use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md). | 
 | `vlm`                 | Advanced | Use [llama 3.1 Nemotron 8B Vision](https://build.nvidia.com/nvidia/llama-3.1-nemotron-nano-vl-8b-v1/modelcard) for image captioning of unstructured images and infographics. This profile enables the `caption` method in the Python API to generate text descriptions of visual content. For more information, refer to [Use Multimodal Embedding](vlm-embed.md) and [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images). | 
 
+### Example: Using the VLM Profile for Infographic Captioning
+
+Infographics often combine text, charts, and diagrams into complex visuals. Vision-language model (VLM) captioning generates natural language descriptions that capture this complexity, making the content searchable and more accessible for downstream applications.
+
+To use VLM captioning for infographics, start NeMo Retriever Library with both the `retrieval` and `vlm` profiles by running the following command.
+```shell
+docker compose \
+  -f docker-compose.yaml \
+  --profile retrieval \
+  --profile vlm up
+```
 
 ## Air-Gapped Deployment (Docker Compose)
 

From 77cb39ac509ffd0efc42c710dff9863209e574cd Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 09:14:08 -0700
Subject: [PATCH 48/55] update reference diagram for overview (#1689)

update overview diagram per Sohail : https://nvidia.slack.com/archives/D0AB0118N94/p1774281584685679
---
 docs/docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/index.md b/docs/docs/index.md
index 32d7b1acd..0349ea38f 100644
--- a/docs/docs/index.md
+++ b/docs/docs/index.md
@@ -14,7 +14,7 @@ NeMo Retriever provides the following:
 - **Retrieval** — Leverage semantic + hybrid search for high accuracy retrieval with the embedding + reranking NIM microservice.
 
 
-![Overview diagram](extraction/images/overview-retriever.png)
+![Overview diagram](extraction/images/overview-extraction.png)
 
 
 ## Enterprise-Ready Features

From 56c2c5149662800c0095f3464519d705f944bb7e Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 10:07:26 -0700
Subject: [PATCH 49/55] =?UTF-8?q?fixed=20reference=20information=20about?=
 =?UTF-8?q?=20name=20change=20from=20nv-ingest=20to=20NeMo=20=E2=80=A6=20(?=
 =?UTF-8?q?#1690)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 contributing.md                               | 496 +++++++++++++++++-
 docs/docs/extraction/audio.md                 |   2 +-
 docs/docs/extraction/content-metadata.md      |   2 +-
 docs/docs/extraction/data-store.md            |   2 +-
 docs/docs/extraction/environment-config.md    |   2 +-
 docs/docs/extraction/faq.md                   |   2 +-
 docs/docs/extraction/nemoretriever-parse.md   |   2 +-
 docs/docs/extraction/nimclient.md             |   2 +-
 docs/docs/extraction/notebooks.md             |   2 +-
 docs/docs/extraction/nv-ingest-python-api.md  |   2 +-
 docs/docs/extraction/overview.md              |   2 +-
 docs/docs/extraction/prerequisites.md         |   2 +-
 .../extraction/quickstart-library-mode.md     |   2 +-
 docs/docs/extraction/scaling-modes.md         |   2 +-
 docs/docs/extraction/support-matrix.md        |  16 +-
 docs/docs/extraction/telemetry.md             |   2 +-
 docs/docs/extraction/troubleshoot.md          |   2 +-
 .../docs/extraction/user-defined-functions.md |   3 +-
 docs/docs/extraction/user-defined-stages.md   |   2 +-
 docs/docs/extraction/vlm-embed.md             |   2 +-
 20 files changed, 495 insertions(+), 54 deletions(-)

diff --git a/contributing.md b/contributing.md
index f8dc3815d..1f549469e 100644
--- a/contributing.md
+++ b/contributing.md
@@ -1,50 +1,492 @@
-### Contributing
+# Contributing to NV-Ingest
 
-We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original
-work, or you have rights to submit it under the same license, or a compatible license.
+External contributions will be welcome soon, and they are greatly appreciated! Every little bit helps, and credit will always be given.
 
-Any contribution which contains commits that are not signed off are not accepted.
+## Table of Contents
 
-To sign off on a commit, use the --signoff (or -s) option when you commit your changes as shown following.
+1. [Filing Issues](#filing-issues)
+2. [Cloning the Repository](#cloning-the-repository)
+3. [Code Contributions](#code-contributions)
+   - [Your First Issue](#your-first-issue)
+   - [Seasoned Developers](#seasoned-developers)
+   - [Workflow](#workflow)
+   - [Common Processing Patterns](#common-processing-patterns)
+     - [traceable](#traceable---srcnv_ingestutiltracingtaggingpy)
+     - [nv_ingest_node_failure_context_manager](#nv_ingest_node_failure_context_manager---srcnv_ingestutilexception_handlersdecoratorspy)
+     - [filter_by_task](#filter_by_task---srcnv_ingestutilflow_controlfilter_by_taskpy)
+   - [Adding a New Stage or Module](#adding-a-new-stage-or-module)
+   - [Common Practices for Writing Unit Tests](#common-practices-for-writing-unit-tests)
+     - [General Guidelines](#general-guidelines)
+     - [Mocking External Services](#mocking-external-services)
+   - [Submodules, Third Party Libraries, and Models](#submodules-third-party-libraries-and-models)
+     - [Submodules](#submodules)
+     - [Models](#models)
+4. [Architectural Guidelines](#architectural-guidelines)
+   - [Single Responsibility Principle (SRP)](#1-single-responsibility-principle-srp)
+   - [Interface Segregation Principle (ISP)](#2-interface-segregation-principle-isp)
+   - [Dependency Inversion Principle (DIP)](#3-dependency-inversion-principle-dip)
+   - [Physical Design Structure Mirroring Logical Design Structure](#4-physical-design-structure-mirroring-logical-design-structure)
+   - [Levelization](#5-levelization)
+   - [Acyclic Dependencies Principle (ADP)](#6-acyclic-dependencies-principle-adp)
+   - [Package Cohesion Principles](#7-package-cohesion-principles)
+     - [Common Closure Principle (CCP)](#common-closure-principle-ccp)
+     - [Common Reuse Principle (CRP)](#common-reuse-principle-crp)
+   - [Encapsulate What Varies](#8-encapsulate-what-varies)
+   - [Favor Composition Over Inheritance](#9-favor-composition-over-inheritance)
+   - [Clean Separation of Concerns (SoC)](#10-clean-separation-of-concerns-soc)
+   - [Principle of Least Knowledge (Law of Demeter)](#11-principle-of-least-knowledge-law-of-demeter)
+   - [Document Assumptions and Decisions](#12-document-assumptions-and-decisions)
+   - [Continuous Integration and Testing](#13-continuous-integration-and-testing)
+5. [Writing Good and Thorough Documentation](#writing-good-and-thorough-documentation)
+6. [Licensing](#licensing)
+7. [Attribution](#attribution)
 
-```
-$ git commit --signoff --message "Add cool feature."
-```
+## Filing Issues
 
-This appends the following text to your commit message.
+1. **Bug Reports, Feature Requests, and Documentation Issues:** Please file
+   an [issue](https://github.com/NVIDIA/nv-ingest/issues) with a detailed
+   description of
+   the problem, feature request, or documentation issue. The NV-Ingest team will review and triage these issues,
+   and if appropriate, schedule them for a future release.
 
+## Cloning the Repository
+
+```bash
+DATASET_ROOT=[path to your dataset root]
+MODULE_NAME=[]
+NV_INGEST_ROOT=[path to your NV-Ingest root]
+git clone https://github.com/NVIDIA/nv-ingest.git $NV_INGEST_ROOT
+cd $NV_INGEST_ROOT
 ```
-Signed-off-by: Your Name <your@email.com>
+
+Ensure all submodules are checked out:
+
+```bash
+git submodule update --init --recursive
 ```
 
-#### Developer Certificate of Origin (DCO)
+## Code Contributions
 
-The following is the full text of the Developer Certificate of Origin (DCO)
+### Your First Issue
 
-```
-  Developer Certificate of Origin
-  Version 1.1
+1. **Finding an Issue:** Start with issues
+   labeled [good first issue](https://github.com/NVIDIA/nv-ingest/labels/bug).
+2. **Claim an Issue:** Comment on the issue you wish to work on.
+3. **Implement Your Solution:** Dive into the code! Update or add unit tests as necessary.
+4. **Submit Your Pull Request:**
+   [Create a pull request](https://github.com/NVIDIA/nv-ingest/pulls) once your
+   code is ready.
+5. **Code Review:** Wait for the review by other developers and make necessary updates.
+6. **Merge:** After approval, an NVIDIA developer will merge your pull request.
 
-  Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
-  1 Letterman Drive
-  Suite D4700
-  San Francisco, CA, 94129
+### Seasoned Developers
 
-  Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
-```
+For those familiar with the codebase, please check
+the [project boards](https://github.com/orgs/NVIDIA/projects/48/views/1) for
+issues. Look for unassigned issues and follow the steps starting from **Claim an Issue**.
+
+### Workflow
+
+1. **NV-Ingest Foundation**: Built on top
+   of [Ray](https://docs.ray.io/en/latest/serve/architecture.html).
+
+2. **Pipeline Structure**: Designed around a pipeline that processes individual jobs within an asynchronous execution
+   graph. Each job is processed by a series of stages or task handlers.
+
+3. **Job Composition**: Jobs consist of a data payload, metadata, and task specifications that determine the processing
+   steps applied to the data.
+
+4. **Job Submission**:
+
+   - A job is submitted as a JSON specification and converted into
+     a [ControlMessage](https://github.com/nv-morpheus/Morpheus/blob/branch-24.06/docs/source/developer_guide/guides/9_control_messages.md),
+     with the payload consisting of a cuDF dataframe.
+   - For example:
+     ```text
+            document_type   source_id   uuid   metadata
+         0  pdf             somefile    1234   { ... }
+     ```
+   - The `metadata` column contents correspond to
+     the [schema-enforced metadata format of returned data](docs/docs/extraction/content-metadata.md).
+
+5. **Pipeline Processing**:
+
+   - The `ControlMessage` is passed through the pipeline, where each stage processes the data and metadata as needed.
+   - Subsequent stages may add, transform, or filter data as needed, with all resulting artifacts stored in
+     the `ControlMessage`'s payload.
+   - For example, after processing, the payload may look like:
+     ```text
+            document_type   source_id   uuid        metadata
+         0  text            somefile    abcd-1234   {'content': "The quick brown fox jumped...", ...}
+         1  image           somefile    efgh-5678   {'content': "base64 encoded image", ...}
+         2  image           somefile    xyza-5618   {'content': "base64 encoded image", ...}
+         3  image           somefile    zxya-5628   {'content': "base64 encoded image", ...}
+         4  status          somefile    kvq9-5600   {'content': "", 'status': "filtered", ...}
+     ```
+   - A single job can result in multiple artifacts, each with its own metadata element definition.
+
+6. **Job Completion**:
+   - Upon reaching the end of the pipeline, the `ControlMessage` is converted into a `JobResult` object and pushed to
+     the ephemeral output queue for client retrieval.
+   - `JobResult` objects consist of a dictionary containing:
+     1. **data**: A list of metadata artifacts produced by the job.
+     2. **status**: The job status as success or failure.
+     3. **description**: A human-readable description of the job status.
+     4. **trace**: A list of timing traces generated during the job's processing.
+     5. **annotations**: A list of task annotations generated during the job's processing.
+
+### Updating Dependencies
+
+- Dependencies are managed with `uv` and project-local `pyproject.toml` files.
+- Dependencies are stored in package definitions:
+    1. **Service Dependencies** `src/pyproject.toml`.
+    2. **Client Dependencies** `client/pyproject.toml`.
+
+- To update dependencies:
+  - Create a clean environment using `uv venv`.
+  - Update dependencies in the relevant `pyproject.toml` and validate the changes.
+  - Recreate the environment and install via `uv pip`.
+    - For example:
+      ```bash
+      uv venv .venv
+      source .venv/bin/activate
+      uv pip install -e ./src -e ./client -e ./api
+      ```
+
+### Common Processing Patterns
+
+NV-Ingest uses decorators to add shared processing logic to functions. These
+decorators help ensure consistency, traceability, and robust error handling across the pipeline. The following sections
+describe the most common decorators, explain their usage, and provide examples.
+
+#### `traceable` -> `src/nv_ingest/util/tracing/tagging.py`
+
+The `traceable` decorator adds entry and exit trace timestamps to a `ControlMessage`'s metadata. This helps in
+monitoring and debugging by recording the time taken for function execution.
+
+**Usage:**
+
+- To track function execution time with default trace names:
+  ```python
+  @traceable()
+  def process_message(message):
+      pass
+  ```
+- To use a custom trace name:
+  ```python
+  @traceable(trace_name="CustomTraceName")
+  def process_message(message):
+      pass
+  ```
+
+#### `nv_ingest_node_failure_context_manager` -> `src/nv_ingest/util/exception_handlers/decorators.py`
+
+This decorator wraps a function with failure handling logic to manage potential failures involving `ControlMessages`. It
+ensures that failures are managed consistently, optionally raising exceptions or annotating the `ControlMessage`.
+
+**Usage:**
+
+- To handle failures with default settings:
+  ```python
+  @nv_ingest_node_failure_context_manager(annotation_id="example_task")
+  def process_message(message):
+      pass
+  ```
+- To handle failures and allow empty payloads:
+  ```python
+  @nv_ingest_node_failure_context_manager(annotation_id="example_task", payload_can_be_empty=True)
+  def process_message(message):
+      pass
+  ```
+
+#### `filter_by_task` -> `src/nv_ingest/util/flow_control/filter_by_task.py`
+
+The `filter_by_task` decorator checks if the `ControlMessage` contains any of the specified tasks. Each task can be a
+string of the task name or a tuple of the task name and task properties. If the message does not contain any listed task
+and/or task properties, the message is returned directly without calling the wrapped function, unless a forwarding
+function is provided.
+
+**Usage:**
+
+- To filter messages based on tasks:
+  ```python
+  @filter_by_task(["task1", "task2"])
+  def process_message(message):
+      pass
+  ```
+- To filter messages based on tasks with specific properties:
+  ```python
+  @filter_by_task([("task", {"prop": "value"})])
+  def process_message(message):
+      pass
+  ```
+- To forward messages to another function. This is necessary when the decorated function does not return the message
+  directly, but instead forwards it to another function. In this case, the forwarding function should be provided as an
+  argument to the decorator.
+  ```python
+  @filter_by_task(["task1", "task2"], forward_func=other_function)
+  def process_message(message):
+      pass
+  ```
+
+#### `cm_skip_processing_if_failed` -> `morpheus/utils/control_message_utils.py`
+
+The `cm_skip_processing_if_failed` decorator skips the processing of a `ControlMessage` if it has already failed. This
+ensures that no further processing is attempted on a failed message, maintaining the integrity of the pipeline.
+
+**Usage:**
+
+- To skip processing if the message has failed:
+  ```python
+  @cm_skip_processing_if_failed
+  def process_message(message):
+      pass
+  ```
+
+### Adding a New Stage or Module
+
+#### TODO(Devin): Add details about adding a new stage or module once we have router node functionality in place.
+
+### Common Practices for Writing Unit Tests
+
+Writing unit tests is essential for maintaining code quality and ensuring that changes do not introduce new bugs. In
+this project, we use `pytest` for running tests and adopt blackbox testing principles. Below are some common practices
+for writing unit tests, which are located in the `[repo_root]/tests` directory.
+
+#### General Guidelines
+
+1. **Test Structure**: Each test module should test a specific module or functionality within the codebase. Name the
+   test module `test_<module_name>.py`, and place it at a path that mirrors the path of the module under test so that
+   it is easy to find and `pytest` can discover it.
+
+   1. Example: `nv_ingest/some_path/another_path/my_module.py` should have a corresponding test file:
+      `tests/some_path/another_path/test_my_module.py`.
+
+2. **Test Functions**: Each test function should focus on a single aspect of the functionality. Use descriptive names
+   that clearly indicate what is being tested. For example, `test_function_returns_correct_value`
+   or `test_function_handles_invalid_input`.
+
+3. **Setup and Teardown**: Use `pytest` fixtures to manage setup and teardown operations for your tests. Fixtures help
+   in creating a consistent and reusable setup environment.
+
+4. **Assertions**: Use assertions to validate the behavior of the code. Ensure that the tests cover both expected
+   outcomes and edge cases.
+
+#### Mocking External Services
+
+When writing tests that depend on external services (e.g., databases, APIs), it is important to mock these dependencies
+to ensure that tests are reliable, fast, and do not depend on external factors.
+
+1. **Mocking Libraries**: Use libraries like `unittest.mock` to create mocks for external services. The `pytest-mock`
+   plugin can also be used to integrate mocking capabilities directly with `pytest`.
+
+2. **Mock Objects**: Create mock objects to simulate the behavior of external services. Use these mocks to test how your
+   code interacts with these services without making actual network calls or database transactions.
+
+3. **Patching**: Use `patch` to replace real objects in your code with mocks. This can be done at the function, method,
+   or object level. Ensure that patches are applied in the correct scope to avoid side effects.
+
+#### Example Test Structure
+
+Here is an example of how to structure a test module in the `[repo_root]/tests` directory:
+
+```python
+import pytest
+from unittest.mock import patch, Mock
+
+# Assuming the module to test is located at [repo_root]/module.py
+from module import function_to_test
+
+
+@pytest.fixture
+def mock_external_service():
+    with patch('module.ExternalService') as mock_service:
+        yield mock_service
+
+
+def test_function_returns_correct_value(mock_external_service):
+    # Arrange
+    mock_external_service.return_value.some_method.return_value = 'expected_value'
 
+    # Act
+    result = function_to_test()
+
+    # Assert
+    assert result == 'expected_value'
+
+
+def test_function_handles_invalid_input(mock_external_service):
+    # Arrange
+    mock_external_service.return_value.some_method.side_effect = ValueError("Invalid input")
+
+    # Act and Assert
+    with pytest.raises(ValueError, match="Invalid input"):
+        function_to_test("invalid input")
 ```
-  Developer's Certificate of Origin 1.1
 
-  By making a contribution to this project, I certify that:
+## Submodules, Third Party Libraries, and Models
+
+### Submodules
+
+1. Submodules are used to manage third-party libraries and dependencies.
+2. Submodules should be created in the `third_party` directory.
+3. Ensure that the submodule is updated to the latest commit before making changes.
+
+### Models
+
+1. **Model Integration**: NV-Ingest is designed to be scalable and flexible, so running models directly in the pipeline
+   is discouraged.
+2. **Model Export**: Models should be exported to a format compatible with Triton Inference Server or TensorRT.
+   - Model acquisition and conversion should be documented in `triton_models/README.md`, including the model name,
+     version, pbtxt file, Triton model files, etc., along with an example of how to query the model in Triton.
+   - Models should be externally hosted and downloaded during pipeline execution, or added via Git LFS.
+   - Any additional code, configuration files, or scripts required to run the model should be included in
+     the `triton_models/[MODEL_NAME]` directory.
+3. **Self-Contained Dependencies**: No assumptions should be made regarding other models or libraries being available in
+   the pipeline. All dependencies should be self-contained.
+4. **Base Triton Container**: Directions for the creation of the base Triton container are listed in
+   the `triton_models/README.md` file. If a new model requires additional base dependencies, please update
+   the `Dockerfile` in the `triton_models` directory.
+
+## Architectural Guidelines
+
+To ensure the quality and maintainability of the NV-Ingest codebase, the following architectural guidelines should be
+followed:
+
+### 1. Single Responsibility Principle (SRP)
+
+- Ensure that each module, class, or function has only one reason to change.
+
+### 2. Interface Segregation Principle (ISP)
+
+- Avoid forcing clients to depend on interfaces they do not use.
+
+### 3. Dependency Inversion Principle (DIP)
+
+- High-level modules should not depend on low-level modules; both should depend on abstractions.
+
+### 4. Physical Design Structure Mirroring Logical Design Structure
+
+- The physical layout of the codebase should reflect its logical structure.
+
+### 5. Levelization
+
+- Organize code into levels where higher-level components depend on lower-level components but not vice versa.
+
+### 6. Acyclic Dependencies Principle (ADP)
+
+- Ensure the dependency graph of packages/modules has no cycles.
+
+### 7. Package Cohesion Principles
+
+#### Common Closure Principle (CCP)
+
+- Package classes that change together.
+
+#### Common Reuse Principle (CRP)
+
+- Package classes that are used together.
+
+### 8. Encapsulate What Varies
+
+- Identify aspects of the application that vary and separate them from what stays the same.
+
+### 9. Favor Composition Over Inheritance
 
-  (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
+- Utilize object composition over class inheritance for behavior reuse where possible.
 
-  (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
+### 10. Clean Separation of Concerns (SoC)
 
-  (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
+- Divide the application into distinct features with minimal overlap in functionality.
 
-  (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
+### 11. Principle of Least Knowledge (Law of Demeter)
+
+- Objects should assume as little as possible about the structure or properties of anything else, including their
+  subcomponents.
+
+### 12. Document Assumptions and Decisions
+
+- Assumptions made and reasons behind architectural and design decisions should be clearly documented.
+
+### 13. Continuous Integration and Testing
+
+- Integrate code frequently into a shared repository and ensure comprehensive testing is an integral part of the
+  development cycle.
+
+Contributors are encouraged to follow these guidelines to ensure contributions are in line with the project's
+architectural consistency and maintainability.
+
+
+## Writing Good and Thorough Documentation
+
+As a contributor to our codebase, writing high-quality documentation is an essential part of ensuring that others can
+understand and work with your code effectively. Good documentation helps to reduce confusion, facilitate collaboration,
+and streamline the development process. In this guide, we will outline the principles and best practices for writing
+thorough and readable documentation that adheres to the Chicago Manual of Style.
+
+### Chicago Manual of Style
+
+Our documentation follows the Chicago Manual of Style, a widely accepted standard for writing and formatting. This style
+guide provides a consistent approach to writing, grammar, and punctuation, making it easier for readers to understand
+and navigate our documentation.
+
+### Key Principles
+
+When writing documentation, keep the following principles in mind:
+
+1. **Clarity**: Use clear and concise language to convey your message. Avoid ambiguity and jargon that may confuse readers.
+2. **Accuracy**: Ensure that your documentation is accurate and up-to-date. Verify facts, details, and code snippets
+    before publishing.
+3. **Completeness**: Provide all necessary information to understand the code, including context, syntax, and examples.
+4. **Consistency**: Use a consistent tone, voice, and style throughout the documentation.
+5. **Accessibility**: Make your documentation easy to read and understand by using headings, bullet points, and short paragraphs.
+
+### Documentation Structure
+
+A well-structured documentation page should include the following elements:
+
+1. **Header**: A brief title that summarizes the content of the page.
+2. **Introduction**: A short overview of the topic, including its purpose and relevance.
+3. **Syntax and Parameters**: A detailed explanation of the code syntax, including parameters, data types, and return values.
+4. **Examples**: Concrete examples that illustrate how to use the code, including input and output.
+5. **Tips and Variations**: Additional information, such as best practices, common pitfalls, and alternative approaches.
+6. **Related Resources**: Links to relevant documentation, tutorials, and external resources.
+
+### Best Practices
+
+To ensure high-quality documentation, follow these best practices:
+
+1. **Use headings and subheadings**: Organize your content with clear headings and subheadings to facilitate scanning and navigation.
+2. **Use bullet points and lists**: Break up complex information into easy-to-read lists and bullet points.
+3. **Provide context**: Give readers a clear understanding of the code's purpose, history, and relationships to other components.
+4. **Review and edit**: Carefully review and edit your documentation to ensure accuracy, completeness, and consistency.
+
+### Resources
+
+For more information on the Chicago Manual of Style, refer to their
+[online published version](https://www.chicagomanualofstyle.org/home.html).
+
+By following these guidelines and principles, you will be able to create high-quality documentation that helps others
+understand and work with your code effectively. Remember to always prioritize clarity, accuracy, and completeness, and
+to use the Chicago Manual of Style as your reference for writing and formatting.
+
+
+## Licensing
+
+NV-Ingest is licensed under the Apache License 2.0; ensure that any contributions are compatible.
+
+The following should be included in the header of any new files:
+
+```text
+SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
+All rights reserved.
+SPDX-License-Identifier: Apache-2.0
 ```
 
+## Attribution
+
+Portions adopted from
 
+- [https://github.com/nv-morpheus/Morpheus/blob/branch-24.06/CONTRIBUTING.md](https://github.com/nv-morpheus/Morpheus/blob/branch-24.06/CONTRIBUTING.md)
+- [https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md)
+- [https://github.com/dask/dask/blob/master/docs/source/develop.rst](https://github.com/dask/dask/blob/master/docs/source/develop.rst)
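The mocking and patch-scope guidance in the CONTRIBUTING.md hunk above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the repository: `ExternalService` and `lookup` are hypothetical names invented for the example, and only the standard-library `unittest.mock` is used so the snippet runs without plugins.

```python
from unittest.mock import Mock, patch


class ExternalService:
    """Hypothetical stand-in for a client that talks to a real service."""

    def fetch(self, key):
        raise RuntimeError("network access is not allowed in unit tests")


def lookup(key, service=None):
    """Hypothetical function under test: fetch a value and normalize it."""
    if service is None:
        service = ExternalService()
    return service.fetch(key).strip().lower()


def test_lookup_with_injected_mock():
    # Arrange: spec= limits the mock to the real interface, so a misspelled
    # method name fails loudly instead of silently returning another mock.
    service = Mock(spec=ExternalService)
    service.fetch.return_value = "  VALUE  "

    # Act
    result = lookup("some-key", service=service)

    # Assert: check the result and how the dependency was called.
    assert result == "value"
    service.fetch.assert_called_once_with("some-key")


def test_lookup_with_scoped_patch():
    # The patch is active only inside the with-block, so it cannot leak
    # into other tests.
    with patch.object(ExternalService, "fetch", return_value="Patched") as fake_fetch:
        assert lookup("k") == "patched"
        fake_fetch.assert_called_once_with("k")

    # Outside the block the original method is restored.
    try:
        lookup("k")
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected the real method to be restored")
```

Passing the dependency in as an argument keeps the first test free of patching altogether; the second test shows the narrowest patch scope for code that constructs its own dependency.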
diff --git a/docs/docs/extraction/audio.md b/docs/docs/extraction/audio.md
index 021ee74b6..4e2c749cf 100644
--- a/docs/docs/extraction/audio.md
+++ b/docs/docs/extraction/audio.md
@@ -9,7 +9,7 @@ to extract speech from audio files.
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 Currently, you can extract speech from the following file types:
 
diff --git a/docs/docs/extraction/content-metadata.md b/docs/docs/extraction/content-metadata.md
index e7d8bb050..5a55b9e6a 100644
--- a/docs/docs/extraction/content-metadata.md
+++ b/docs/docs/extraction/content-metadata.md
@@ -10,7 +10,7 @@ Metadata can be extracted from a source or content, or generated by using models
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/data-store.md b/docs/docs/extraction/data-store.md
index 580a1dfc7..b012a9580 100644
--- a/docs/docs/extraction/data-store.md
+++ b/docs/docs/extraction/data-store.md
@@ -4,7 +4,7 @@ Use this documentation to learn how [NeMo Retriever Library](overview.md) handle
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## Overview
diff --git a/docs/docs/extraction/environment-config.md b/docs/docs/extraction/environment-config.md
index b411ea5b6..2c7be750b 100644
--- a/docs/docs/extraction/environment-config.md
+++ b/docs/docs/extraction/environment-config.md
@@ -5,7 +5,7 @@ You can specify these in your .env file or directly in your environment.
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## General Environment Variables
diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index 7bbb33d5f..ff383901c 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -4,7 +4,7 @@ This documentation contains the Frequently Asked Questions (FAQ) for [NeMo Retri
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/nemoretriever-parse.md b/docs/docs/extraction/nemoretriever-parse.md
index 3da84cdce..a53ed094f 100644
--- a/docs/docs/extraction/nemoretriever-parse.md
+++ b/docs/docs/extraction/nemoretriever-parse.md
@@ -12,7 +12,7 @@ to run [NeMo Retriever Library](overview.md) with nemotron-parse.
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## Limitations
diff --git a/docs/docs/extraction/nimclient.md b/docs/docs/extraction/nimclient.md
index fae98c068..9d4a5fe42 100644
--- a/docs/docs/extraction/nimclient.md
+++ b/docs/docs/extraction/nimclient.md
@@ -5,7 +5,7 @@ This documentation demonstrates how to create custom NIM integrations for use in
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 The NimClient architecture consists of two main components:
 
diff --git a/docs/docs/extraction/notebooks.md b/docs/docs/extraction/notebooks.md
index 4bf6fe7b4..0b4f31a7e 100644
--- a/docs/docs/extraction/notebooks.md
+++ b/docs/docs/extraction/notebooks.md
@@ -4,7 +4,7 @@ To get started using [NeMo Retriever Library](overview.md), you can try one of t
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## Dataset Downloads for Benchmarking
diff --git a/docs/docs/extraction/nv-ingest-python-api.md b/docs/docs/extraction/nv-ingest-python-api.md
index d4c29f2b5..4aafa9776 100644
--- a/docs/docs/extraction/nv-ingest-python-api.md
+++ b/docs/docs/extraction/nv-ingest-python-api.md
@@ -4,7 +4,7 @@ The [NeMo Retriever Library](overview.md) Python API provides a simple and flexi
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 !!! tip
 
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index 8891d57e6..0e67a87c4 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -6,7 +6,7 @@ to find, contextualize, and extract text, tables, charts and infographics that y
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well defined JSON schema. 
 From there, NeMo Retriever Library can optionally manage computation of embeddings for the extracted content, 
diff --git a/docs/docs/extraction/prerequisites.md b/docs/docs/extraction/prerequisites.md
index a5e0512e1..902c499c8 100644
--- a/docs/docs/extraction/prerequisites.md
+++ b/docs/docs/extraction/prerequisites.md
@@ -4,7 +4,7 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure the followi
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index a3ecbf867..24dabfb8f 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -2,6 +2,6 @@
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 Use the [Quick Start for NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/26.03/nemo_retriever/README.md) to set up and run the NeMo Retriever Library locally, so you can build a GPU‑accelerated, multimodal RAG ingestion pipeline that parses PDFs, HTML, text, audio, and video into LanceDB vector embeddings, integrates with Nemotron RAG models (locally or via NIM endpoints), which includes Ray‑based scaling with built‑in recall evaluation.
\ No newline at end of file
diff --git a/docs/docs/extraction/scaling-modes.md b/docs/docs/extraction/scaling-modes.md
index 5c57b33ac..f39c4ad70 100644
--- a/docs/docs/extraction/scaling-modes.md
+++ b/docs/docs/extraction/scaling-modes.md
@@ -7,7 +7,7 @@ This guide covers how resource scaling modes work across stages in [NeMo Retriev
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/support-matrix.md b/docs/docs/extraction/support-matrix.md
index 5dbc49508..d62ed7a8b 100644
--- a/docs/docs/extraction/support-matrix.md
+++ b/docs/docs/extraction/support-matrix.md
@@ -4,20 +4,20 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure that you ha
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## Core and Advanced Pipeline Features
 
-The Nemo Retriever extraction core pipeline features run on a single A10G or better GPU. 
+The NeMo Retriever Library extraction core pipeline features run on a single A10G or better GPU.
 The core pipeline features include the following:
 
-- llama3.2-nv-embedqa-1b-v2 — Embedding model for converting text chunks into vectors.
-- nemoretriever-page-elements-v3 — Detects and classifies images on a page as a table, chart or infographic.
-- nemoretriever-table-structure-v1 — Detects rows, columns, and cells within a table to preserve table structure and convert to Markdown format. 
-- nemoretriever-graphic-elements-v1 — Detects graphic elements within chart images such as titles, legends, axes, and numerical values. 
-- nemoretriever-ocr-v1 — Image OCR model to detect and extract text from images.
-- retrieval — Enables embedding and indexing into LanceDB (default) or Milvus.
+- llama-nemotron-embed-1b-v2 — Embedding model for converting text chunks into vectors.
+- nemotron-page-elements-v3 — Detects and classifies images on a page as a table, chart or infographic.
+- nemotron-table-structure-v1 — Detects rows, columns, and cells within a table to preserve table structure and convert to Markdown format. 
+- nemotron-graphic-elements-v1 — Detects graphic elements within chart images such as titles, legends, axes, and numerical values. 
+- nemotron-ocr-v1 — Image OCR model to detect and extract text from images.
+- retrieval — Enables embedding and indexing into Milvus.
 
 Advanced features require additional GPU support and disk space. 
 This includes the following:
diff --git a/docs/docs/extraction/telemetry.md b/docs/docs/extraction/telemetry.md
index 9d34aaa92..5c050452f 100644
--- a/docs/docs/extraction/telemetry.md
+++ b/docs/docs/extraction/telemetry.md
@@ -4,7 +4,7 @@ You can view telemetry data for [NeMo Retriever Library](overview.md).
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## OpenTelemetry
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index 1b130952b..6299ca09f 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -4,7 +4,7 @@ Use this documentation to troubleshoot issues that arise when you use [NeMo Retr
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## Can't process long, non-language text strings
diff --git a/docs/docs/extraction/user-defined-functions.md b/docs/docs/extraction/user-defined-functions.md
index 74a710698..9eb5576ae 100644
--- a/docs/docs/extraction/user-defined-functions.md
+++ b/docs/docs/extraction/user-defined-functions.md
@@ -5,8 +5,7 @@ This guide covers how to write, validate, and submit UDFs using both the CLI and
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
-
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library. 
 
 
 ## Quickstart
diff --git a/docs/docs/extraction/user-defined-stages.md b/docs/docs/extraction/user-defined-stages.md
index 57f68179f..ac50dd568 100644
--- a/docs/docs/extraction/user-defined-stages.md
+++ b/docs/docs/extraction/user-defined-stages.md
@@ -8,7 +8,7 @@ and operate on a well-defined DataFrame payload and metadata structure.
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 To add user-defined stages to your pipeline, you need the following:
diff --git a/docs/docs/extraction/vlm-embed.md b/docs/docs/extraction/vlm-embed.md
index 3ffe2b7c0..8d095f9e7 100644
--- a/docs/docs/extraction/vlm-embed.md
+++ b/docs/docs/extraction/vlm-embed.md
@@ -10,7 +10,7 @@ The model supports images that contain text, tables, charts, and infographics.
 
 !!! note
 
-    NeMo Retriever Library is also known as NVIDIA Ingest.
+    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
 
 
 ## Configure and Run the Multimodal NIM

From 6758c17e900ea179959a3f9b4acabfd8354b7d4d Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 11:06:49 -0700
Subject: [PATCH 50/55] =?UTF-8?q?changed=20opening=20note=20to=20=20NVIDIA?=
 =?UTF-8?q?=20Ingest=20(nv-ingest)=20has=20been=20renamed=20N=E2=80=A6=20(?=
 =?UTF-8?q?#1691)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/docs/extraction/data-store.md              | 2 +-
 docs/docs/extraction/faq.md                     | 2 +-
 docs/docs/extraction/nemoretriever-parse.md     | 2 +-
 docs/docs/extraction/nimclient.md               | 2 +-
 docs/docs/extraction/notebooks.md               | 2 +-
 docs/docs/extraction/nv-ingest-python-api.md    | 2 +-
 docs/docs/extraction/overview.md                | 2 +-
 docs/docs/extraction/prerequisites.md           | 2 +-
 docs/docs/extraction/python-api-reference.md    | 2 +-
 docs/docs/extraction/quickstart-library-mode.md | 2 +-
 docs/docs/extraction/releasenotes-nv-ingest.md  | 2 +-
 docs/docs/extraction/scaling-modes.md           | 2 +-
 docs/docs/extraction/support-matrix.md          | 2 +-
 docs/docs/extraction/telemetry.md               | 2 +-
 docs/docs/extraction/troubleshoot.md            | 2 +-
 docs/docs/extraction/user-defined-functions.md  | 2 +-
 docs/docs/extraction/user-defined-stages.md     | 2 +-
 docs/docs/extraction/vlm-embed.md               | 2 +-
 18 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/docs/docs/extraction/data-store.md b/docs/docs/extraction/data-store.md
index b012a9580..6b0d039ce 100644
--- a/docs/docs/extraction/data-store.md
+++ b/docs/docs/extraction/data-store.md
@@ -4,7 +4,7 @@ Use this documentation to learn how [NeMo Retriever Library](overview.md) handle
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Overview
diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index ff383901c..08b4b96e2 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -4,7 +4,7 @@ This documentation contains the Frequently Asked Questions (FAQ) for [NeMo Retri
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/nemoretriever-parse.md b/docs/docs/extraction/nemoretriever-parse.md
index a53ed094f..56ebdc6a2 100644
--- a/docs/docs/extraction/nemoretriever-parse.md
+++ b/docs/docs/extraction/nemoretriever-parse.md
@@ -12,7 +12,7 @@ to run [NeMo Retriever Library](overview.md) with nemotron-parse.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Limitations
diff --git a/docs/docs/extraction/nimclient.md b/docs/docs/extraction/nimclient.md
index 9d4a5fe42..4f1d29c17 100644
--- a/docs/docs/extraction/nimclient.md
+++ b/docs/docs/extraction/nimclient.md
@@ -5,7 +5,7 @@ This documentation demonstrates how to create custom NIM integrations for use in
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 The NimClient architecture consists of two main components:
 
diff --git a/docs/docs/extraction/notebooks.md b/docs/docs/extraction/notebooks.md
index 0b4f31a7e..b7748a01a 100644
--- a/docs/docs/extraction/notebooks.md
+++ b/docs/docs/extraction/notebooks.md
@@ -4,7 +4,7 @@ To get started using [NeMo Retriever Library](overview.md), you can try one of t
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Dataset Downloads for Benchmarking
diff --git a/docs/docs/extraction/nv-ingest-python-api.md b/docs/docs/extraction/nv-ingest-python-api.md
index 4aafa9776..35f86c176 100644
--- a/docs/docs/extraction/nv-ingest-python-api.md
+++ b/docs/docs/extraction/nv-ingest-python-api.md
@@ -4,7 +4,7 @@ The [NeMo Retriever Library](overview.md) Python API provides a simple and flexi
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 !!! tip
 
diff --git a/docs/docs/extraction/overview.md b/docs/docs/extraction/overview.md
index 0e67a87c4..2404b1f10 100644
--- a/docs/docs/extraction/overview.md
+++ b/docs/docs/extraction/overview.md
@@ -6,7 +6,7 @@ to find, contextualize, and extract text, tables, charts and infographics that y
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well defined JSON schema. 
 From there, NeMo Retriever Library can optionally manage computation of embeddings for the extracted content, 
diff --git a/docs/docs/extraction/prerequisites.md b/docs/docs/extraction/prerequisites.md
index 902c499c8..3cb7b6d74 100644
--- a/docs/docs/extraction/prerequisites.md
+++ b/docs/docs/extraction/prerequisites.md
@@ -4,7 +4,7 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure the followi
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/python-api-reference.md b/docs/docs/extraction/python-api-reference.md
index b9d914649..e8c503fbf 100644
--- a/docs/docs/extraction/python-api-reference.md
+++ b/docs/docs/extraction/python-api-reference.md
@@ -4,7 +4,7 @@ The [NeMo Retriever Library](overview.md) Python API provides a simple and flexi
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 !!! tip
 
diff --git a/docs/docs/extraction/quickstart-library-mode.md b/docs/docs/extraction/quickstart-library-mode.md
index 24dabfb8f..7afe7c8c5 100644
--- a/docs/docs/extraction/quickstart-library-mode.md
+++ b/docs/docs/extraction/quickstart-library-mode.md
@@ -2,6 +2,6 @@
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 Use the [Quick Start for NeMo Retriever Library](https://github.com/NVIDIA/NeMo-Retriever/blob/26.03/nemo_retriever/README.md) to set up and run the NeMo Retriever Library locally, so you can build a GPU‑accelerated, multimodal RAG ingestion pipeline that parses PDFs, HTML, text, audio, and video into LanceDB vector embeddings, integrates with Nemotron RAG models (locally or via NIM endpoints), which includes Ray‑based scaling with built‑in recall evaluation.
\ No newline at end of file
diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index d3b71b4a5..39c4973d8 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -4,7 +4,7 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.   
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.   
 
 ## 26.03 Release Notes (26.3.0)
 
diff --git a/docs/docs/extraction/scaling-modes.md b/docs/docs/extraction/scaling-modes.md
index f39c4ad70..8fc4684a1 100644
--- a/docs/docs/extraction/scaling-modes.md
+++ b/docs/docs/extraction/scaling-modes.md
@@ -7,7 +7,7 @@ This guide covers how resource scaling modes work across stages in [NeMo Retriev
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 
diff --git a/docs/docs/extraction/support-matrix.md b/docs/docs/extraction/support-matrix.md
index d62ed7a8b..4098ebc33 100644
--- a/docs/docs/extraction/support-matrix.md
+++ b/docs/docs/extraction/support-matrix.md
@@ -4,7 +4,7 @@ Before you begin using [NeMo Retriever Library](overview.md), ensure that you ha
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Core and Advanced Pipeline Features
diff --git a/docs/docs/extraction/telemetry.md b/docs/docs/extraction/telemetry.md
index 5c050452f..fcdf6d876 100644
--- a/docs/docs/extraction/telemetry.md
+++ b/docs/docs/extraction/telemetry.md
@@ -4,7 +4,7 @@ You can view telemetry data for [NeMo Retriever Library](overview.md).
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## OpenTelemetry
diff --git a/docs/docs/extraction/troubleshoot.md b/docs/docs/extraction/troubleshoot.md
index 6299ca09f..ca3cc6da4 100644
--- a/docs/docs/extraction/troubleshoot.md
+++ b/docs/docs/extraction/troubleshoot.md
@@ -4,7 +4,7 @@ Use this documentation to troubleshoot issues that arise when you use [NeMo Retr
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Can't process long, non-language text strings
diff --git a/docs/docs/extraction/user-defined-functions.md b/docs/docs/extraction/user-defined-functions.md
index 9eb5576ae..ef08ded14 100644
--- a/docs/docs/extraction/user-defined-functions.md
+++ b/docs/docs/extraction/user-defined-functions.md
@@ -5,7 +5,7 @@ This guide covers how to write, validate, and submit UDFs using both the CLI and
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library. 
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Quickstart
diff --git a/docs/docs/extraction/user-defined-stages.md b/docs/docs/extraction/user-defined-stages.md
index ac50dd568..247a27eb0 100644
--- a/docs/docs/extraction/user-defined-stages.md
+++ b/docs/docs/extraction/user-defined-stages.md
@@ -8,7 +8,7 @@ and operate on a well-defined DataFrame payload and metadata structure.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 To add user-defined stages to your pipeline, you need the following:
diff --git a/docs/docs/extraction/vlm-embed.md b/docs/docs/extraction/vlm-embed.md
index 8d095f9e7..03d089e8c 100644
--- a/docs/docs/extraction/vlm-embed.md
+++ b/docs/docs/extraction/vlm-embed.md
@@ -10,7 +10,7 @@ The model supports images that contain text, tables, charts, and infographics.
 
 !!! note
 
-    NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
+    NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
 
 
 ## Configure and Run the Multimodal NIM

From 3db9a495a2cdd7522a1acf4c495bb6c7d38d882e Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 12:04:58 -0700
Subject: [PATCH 51/55] remove duplicate caption() section with wrong
 parameters (NVBug 6000620) (#1693)

---
 docs/docs/extraction/nv-ingest-python-api.md | 25 ------------------
 docs/docs/extraction/python-api-reference.md | 27 --------------------
 2 files changed, 52 deletions(-)

diff --git a/docs/docs/extraction/nv-ingest-python-api.md b/docs/docs/extraction/nv-ingest-python-api.md
index 35f86c176..106bdf76d 100644
--- a/docs/docs/extraction/nv-ingest-python-api.md
+++ b/docs/docs/extraction/nv-ingest-python-api.md
@@ -424,31 +424,6 @@ results = ingestor.ingest()
 
     For more information about working with infographics and multimodal content, refer to [Use Multimodal Embedding](vlm-embed.md).
 
-### Caption Images and Control Reasoning
-
-The caption task can call a VLM with optional prompt and system prompt overrides:
-
-- `caption_prompt` (user prompt): defaults to `"Caption the content of this image:"`.
-- `caption_system_prompt` (system prompt): defaults to `"/no_think"` (reasoning off). Set to `"/think"` to enable reasoning per the Nemotron Nano 12B v2 VL model card.
-
-Example:
-```python
-from nv_ingest_client.client.interface import Ingestor
-
-ingestor = (
-    Ingestor()
-    .files("path/to/doc-with-images.pdf")
-    .extract(extract_images=True)
-    .caption(
-        prompt="Caption the content of this image:",
-        system_prompt="/think",  # or "/no_think"
-    )
-    .ingest()
-)
-```
-
-
-
 ## Extract Embeddings
 
 The `embed` method in the library generates text embeddings for document content.
diff --git a/docs/docs/extraction/python-api-reference.md b/docs/docs/extraction/python-api-reference.md
index e8c503fbf..7feb020f1 100644
--- a/docs/docs/extraction/python-api-reference.md
+++ b/docs/docs/extraction/python-api-reference.md
@@ -516,33 +516,6 @@ results = ingestor.ingest()
 
     For more information about working with infographics and multimodal content, refer to [Use Multimodal Embedding](vlm-embed.md).
 
-### Caption Images and Control Reasoning
-
-The caption task can call a VLM with optional prompt and system prompt overrides:
-
-- `caption_prompt` (user prompt): defaults to `"Caption the content of this image:"`.
-- `caption_system_prompt` (system prompt): defaults to `"/no_think"` (reasoning off). Set to `"/think"` to enable reasoning per the Nemotron Nano 12B v2 VL model card.
-- `context_text_max_chars` (int, optional): Maximum characters of page text to include as context for the VLM.
-- `temperature` (float, optional): Sampling temperature for the VLM.
-
-Example:
-```python
-from nemo_retriever.client.interface import Ingestor
-
-ingestor = (
-    Ingestor()
-    .files("path/to/doc-with-images.pdf")
-    .extract(extract_images=True)
-    .caption(
-        prompt="Caption the content of this image:",
-        system_prompt="/think",  # or "/no_think"
-    )
-    .ingest()
-)
-```
-
-
-
 ## Extract Embeddings
 
 The `embed` method in the NeMo Retriever Library generates text embeddings for document content.

From f0f9e97b6aaea533fa39b4fa3826e64bc1e608b4 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 13:07:03 -0700
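Subject: [PATCH 52/55] Kheiss/6000618 (#1694)

Update the UDF examples in user-defined-functions.md to call
control_message.payload() instead of control_message.get_payload().

Co-authored-by: sosahi <syousefisahi@nvidia.com>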
---
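Note for reviewers: the sketch below illustrates the accessor rename that this patch applies to the UDF examples. It is a minimal, self-contained illustration, not the library's implementation. `StandInControlMessage` is a hypothetical stand-in for `IngestControlMessage`, the payload is modeled as a plain list of dicts rather than the DataFrame the docs use, and the setter form `payload(new_value)` is an assumption made for the example.

```python
from typing import Any, Dict, List, Optional

Payload = List[Dict[str, Any]]


class StandInControlMessage:
    """Hypothetical stand-in for IngestControlMessage (illustration only)."""

    def __init__(self, payload: Payload) -> None:
        self._payload = payload

    def payload(self, payload: Optional[Payload] = None) -> Payload:
        # Assumed accessor shape: return the stored payload, replacing it
        # first when a new one is passed in.
        if payload is not None:
            self._payload = payload
        return self._payload


def my_custom_processor(control_message: StandInControlMessage) -> StandInControlMessage:
    # Read the payload with payload(); the examples previously used get_payload().
    rows = control_message.payload()
    # Add a custom metadata field to every row without mutating the input rows.
    updated = [
        {**row, "metadata": {**row["metadata"], "custom_field": "processed"}}
        for row in rows
    ]
    control_message.payload(updated)
    return control_message


message = StandInControlMessage([{"metadata": {"source": "a.pdf"}}])
result_rows = my_custom_processor(message).payload()
```

In the documented examples the same read is written as `df = control_message.payload()`, with `df` being a DataFrame.
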
 docs/docs/extraction/user-defined-functions.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/docs/extraction/user-defined-functions.md b/docs/docs/extraction/user-defined-functions.md
index ef08ded14..aef2c09c3 100644
--- a/docs/docs/extraction/user-defined-functions.md
+++ b/docs/docs/extraction/user-defined-functions.md
@@ -702,7 +702,7 @@ def risky_udf(control_message: IngestControlMessage) -> IngestControlMessage:
     logger = logging.getLogger(__name__)
     
     try:
-        df = control_message.get_payload()
+        df = control_message.payload()
         logger.info(f"Processing {len(df)} documents")
         
         # Load model repeatedly (memory intensive)
@@ -731,7 +731,7 @@ def stable_udf(control_message: IngestControlMessage) -> IngestControlMessage:
     logger = logging.getLogger(__name__)
     
     try:
-        df = control_message.get_payload()
+        df = control_message.payload()
         logger.info(f"Processing {len(df)} documents")
         
         # Load model once and reuse (consider caching)
@@ -889,7 +889,7 @@ def test_my_udf():
     result = my_custom_processor(control_message)
     
     # Verify results
-    result_df = result.get_payload()
+    result_df = result.payload()
     print(result_df)
     assert 'custom_field' in result_df.iloc[0]['metadata']
 
@@ -916,7 +916,7 @@ def debug_udf(control_message: IngestControlMessage) -> IngestControlMessage:
     logger = logging.getLogger(__name__)
     
     try:
-        df = control_message.get_payload()
+        df = control_message.payload()
         logger.info(f"Processing {len(df)} documents")
         
         # Log input data structure

From cf22e8c6d51d6dff938cb67a02258864e93270cb Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 13:38:49 -0700
Subject: [PATCH 53/55] fix syntax (#1696)

Remove the trailing closing parenthesis from the code example in faq.md.
---
 docs/docs/extraction/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/extraction/faq.md b/docs/docs/extraction/faq.md
index 08b4b96e2..cd99cad9a 100644
--- a/docs/docs/extraction/faq.md
+++ b/docs/docs/extraction/faq.md
@@ -102,5 +102,5 @@ Ingestor(client=client)
     .extract()
     .embed()
     .caption()
-)
+
 ```

From cc33bea5e00465f1b5b59b3f68c4153c909191c4 Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 13:53:49 -0700
Subject: [PATCH 54/55] Kheiss/6000353 - update links to Helm chart (#1697)

---
 docs/docs/extraction/releasenotes-nv-ingest.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/extraction/releasenotes-nv-ingest.md b/docs/docs/extraction/releasenotes-nv-ingest.md
index 39c4973d8..7872dcd55 100644
--- a/docs/docs/extraction/releasenotes-nv-ingest.md
+++ b/docs/docs/extraction/releasenotes-nv-ingest.md
@@ -10,7 +10,7 @@ This documentation contains the release notes for [NeMo Retriever Library](overv
 
 NVIDIA® NeMo Retriever Library version 26.03 adds broader hardware and software support along with many pipeline, evaluation, and deployment enhancements.
 
-To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/release/26.3.0/helm/README.md).
+To upgrade the Helm charts for this release, refer to the [NeMo Retriever Library Helm Charts](https://github.com/NVIDIA/NeMo-Retriever/blob/26.3.0/helm/README.md).
 
 Highlights for the 26.03 release include:
 

From fa30ff8327539cf469bc401d4f5b08ebe320e49e Mon Sep 17 00:00:00 2001
From: Kurt Heiss <kheiss@nvidia.com>
Date: Mon, 23 Mar 2026 15:02:54 -0700
Subject: [PATCH 55/55] Document RTX PRO 4500 Blackwell (GB203) in hardware
 support matrix 5961722 (#1698)

---
 docs/docs/extraction/support-matrix.md | 29 +++++++++++++-------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/docs/docs/extraction/support-matrix.md b/docs/docs/extraction/support-matrix.md
index 4098ebc33..c38e93760 100644
--- a/docs/docs/extraction/support-matrix.md
+++ b/docs/docs/extraction/support-matrix.md
@@ -45,30 +45,29 @@ NeMo Retriever Library supports the following GPU hardware.
 - [A100 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/a100/)
 - [A10G Tensor Core GPU](https://aws.amazon.com/ec2/instance-types/g5/)
 - [L40S](https://www.nvidia.com/en-us/data-center/l40s/)
+- [RTX PRO 4500 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4500/)
 
 
 The following are the hardware requirements to run NeMo Retriever Library.
 
-|Feature         | GPU Option                | RTX Pro 6000  | B200          | H200 NVL      | H100        | A100 80GB   | A100 40GB     | A10G          | L40S   |
-|----------------|---------------------------|---------------|---------------|---------------|-------------|-------------|---------------|---------------|--------|
-| GPU            | Memory                    | 96GB          | 180GB         | 141GB         | 80GB        | 80GB        | 40GB          | 24GB          | 48GB   |
-| Core Features  | Total GPUs                | 1             | 1             | 1             | 1           | 1           | 1             | 1             | 1      |
-| Core Features  | Total Disk Space          | ~150GB        | ~150GB        | ~150GB        | ~150GB      | ~150GB      | ~150GB        | ~150GB        | ~150GB |
-| Audio          | Additional Dedicated GPUs | 1             | 1             | 1             | 1           | 1           | 1             | 1             | 1      |
-| Audio          | Additional Disk Space     | ~37GB         | ~37GB         | ~37GB         | ~37GB       | ~37GB       | ~37GB         | ~37GB         | ~37GB  |
-| nemotron-parse | Additional Dedicated GPUs | Not supported | Not supported | Not supported | 1           | 1           | 1             | 1             | 1      |
-| nemotron-parse | Additional Disk Space     | Not supported | Not supported | Not supported | ~16GB       | ~16GB       | ~16GB         | ~16GB         | ~16GB  |
-| VLM            | Additional Dedicated GPUs | 1             | 1             | 1             | 1           | 1           | Not supported | Not supported | 1      |
-| VLM            | Additional Disk Space     | ~16GB         | ~16GB         | ~16GB         | ~16GB       | ~16GB       | Not supported | Not supported | ~16GB  |
-| Reranker       | With Core Pipeline        | Yes           | Yes           | Yes           | Yes         | Yes         | No*           | No*           | No*    |
-| Reranker       | Standalone (recall only)  | Yes           | Yes           | Yes           | Yes         | Yes         | Yes           | Yes           | Yes    |
+| Feature        | GPU Option                | RTX Pro 6000  | B200          | H200 NVL      | H100        | A100 80GB   | A100 40GB     | A10G          | L40S   | RTX PRO 4500 Blackwell |
+|----------------|---------------------------|---------------|---------------|---------------|-------------|-------------|---------------|---------------|--------|------------------------|
+| GPU            | Memory                    | 96GB          | 180GB         | 141GB         | 80GB        | 80GB        | 40GB          | 24GB          | 48GB   | 32GB GDDR7 (GB203)     |
+| Core Features  | Total GPUs                | 1             | 1             | 1             | 1           | 1           | 1             | 1             | 1      | 1                      |
+| Core Features  | Total Disk Space          | ~150GB        | ~150GB        | ~150GB        | ~150GB      | ~150GB      | ~150GB        | ~150GB        | ~150GB | ~150GB                 |
+| Audio          | Additional Dedicated GPUs | 1             | 1             | 1             | 1           | 1           | 1             | 1             | 1      | 1¹                     |
+| Audio          | Additional Disk Space     | ~37GB         | ~37GB         | ~37GB         | ~37GB       | ~37GB       | ~37GB         | ~37GB         | ~37GB  | ~37GB¹                 |
+| nemotron-parse | Additional Dedicated GPUs | Not supported | Not supported | Not supported | 1           | 1           | 1             | 1             | 1      | Not supported²         |
+| nemotron-parse | Additional Disk Space     | Not supported | Not supported | Not supported | ~16GB       | ~16GB       | ~16GB         | ~16GB         | ~16GB  | Not supported²         |
+| VLM            | Additional Dedicated GPUs | 1             | 1             | 1             | 1           | 1           | Not supported | Not supported | 1      | Not supported³         |
+| VLM            | Additional Disk Space     | ~16GB         | ~16GB         | ~16GB         | ~16GB       | ~16GB       | Not supported | Not supported | ~16GB  | Not supported³         |
+| Reranker       | With Core Pipeline        | Yes           | Yes           | Yes           | Yes         | Yes         | No*           | No*           | No*    | No*                    |
+| Reranker       | Standalone (recall only)  | Yes           | Yes           | Yes           | Yes         | Yes         | Yes           | Yes           | Yes    | Yes                    |
 
 \* GPUs with less than 80GB VRAM cannot run the reranker concurrently with the core pipeline. 
 To perform recall testing with the reranker on these GPUs, shut down the core pipeline NIM microservices 
 and run only the embedder, reranker, and your vector database.
 
-
-
 ## Related Topics
 
 - [Prerequisites](prerequisites.md)