From a85f337716b6718330b8e9b7aaa3f9f6ff40762d Mon Sep 17 00:00:00 2001
From: Ignacio Sastre <56086604+nsuruguay05@users.noreply.github.com>
Date: Thu, 12 Mar 2026 11:52:43 -0300
Subject: [PATCH] docs(ingest): align file path examples with topic structure

Clarify where to place documents for ingestion and update examples to use data/ paths, including nested child topics.\n\nFixes #18\nFixes #19
---
 CHANGELOG.md            |  1 +
 README.md               | 24 +++++++++++++++++++-----
 docs/commands.md        | 18 +++++++++++++-----
 docs/getting-started.md |  9 ++++++---
 4 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 69881c8..b5d9cbc 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -27,6 +27,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Documentation
 - Environment variable and authentication docs updated to use `COGSOL_API_KEY` and optional Azure AD B2C credentials.
 - Removed outdated "no external dependencies" statements from README.
+- Added nested-topic ingestion examples and corrected ingest file paths to use topic-aligned `data//` locations in docs.
 
 ---
 
diff --git a/README.md b/README.md
index 4af5082..f980e5a 100644
--- a/README.md
+++ b/README.md
@@ -161,7 +161,10 @@ python manage.py makemigrations data
 python manage.py migrate data
 
 # Ingest documents into a topic
-python manage.py ingest documentation ./docs/*.pdf
+python manage.py ingest documentation ./data/documentation/*.pdf
+
+# Ingest documents into a nested topic
+python manage.py ingest documentation/tutorials ./data/documentation/tutorials/*.pdf
 ```
 
 ---
@@ -294,6 +297,14 @@ python manage.py ingest [options]
 - `topic`: Topic path (e.g., `documentation` or `parent/child/topic`)
 - `files`: Files, directories, or glob patterns to ingest
 
+Use slash-separated paths for nested topics. For example, if you created `tutorials` under
+`documentation` with `starttopic tutorials --path documentation`, ingest into it with
+`documentation/tutorials`.
+
+For a topic-aligned workflow, place files under `data//` and ingest from that
+folder (for example, `./data/documentation/*.pdf` or
+`./data/documentation/tutorials/*.pdf`).
+
 **Options:**
 - `--doc-type`: Document type (defaults to `Text Document`)
 - `--ingestion-config`: Name of an ingestion config from `data/ingestion.py`
@@ -310,13 +321,16 @@ python manage.py ingest [options]
 **Examples:**
 ```bash
 # Ingest PDF files
-python manage.py ingest documentation ./docs/*.pdf
+python manage.py ingest documentation ./data/documentation/*.pdf
+
+# Ingest into a child topic
+python manage.py ingest documentation/tutorials ./data/documentation/tutorials/*.pdf
 
 # Ingest with custom config
-python manage.py ingest documentation ./docs/ --ingestion-config HighQuality
+python manage.py ingest documentation ./data/documentation/ --ingestion-config HighQuality
 
 # Dry run to preview
-python manage.py ingest documentation ./data/ --dry-run
+python manage.py ingest documentation ./data/documentation/ --dry-run
 ```
 
 ### `topics`
@@ -610,7 +624,7 @@ from cogsol.content import BaseIngestionConfig, PDFParsingMode, ChunkingMode
 Use with the `ingest` command:
 
 ```bash
-python manage.py ingest documentation ./docs/ --ingestion-config high_quality
+python manage.py ingest documentation ./data/documentation/ --ingestion-config high_quality
 ```
 
 #### Reference Formatters
diff --git a/docs/commands.md b/docs/commands.md
index 486554c..c17f7e7 100644
--- a/docs/commands.md
+++ b/docs/commands.md
@@ -587,6 +587,11 @@ python manage.py ingest [options]
 | `topic` | Yes | - | Topic path (e.g., `docs` or `parent/child`) |
 | `files` | Yes | - | Files, directories, or glob patterns |
 
+Use slash-separated paths for nested topics during ingestion (for example:
+`documentation/tutorials`). For a topic-aligned workflow, place files under
+`data//` and ingest from that matching path (for example:
+`./data/documentation/*.pdf` and `./data/documentation/tutorials/*.pdf`).
+
 #### Options
 
 | Option | Default | Description |
@@ -628,27 +633,30 @@ class HighQualityConfig(BaseIngestionConfig):
 Then use with:
 
 ```bash
-python manage.py ingest documentation ./docs/ --ingestion-config high_quality
+python manage.py ingest documentation ./data/documentation/ --ingestion-config high_quality
 ```
 
 #### Example Usage
 
 ```bash
 # Ingest all PDFs in a directory
-python manage.py ingest documentation ./docs/*.pdf
+python manage.py ingest documentation ./data/documentation/*.pdf
+
+# Ingest into a child topic using parent/child path
+python manage.py ingest documentation/tutorials ./data/documentation/tutorials/*.pdf
 
 # Ingest an entire directory recursively
-python manage.py ingest documentation ./docs/
+python manage.py ingest documentation ./data/documentation/
 
 # Use custom settings
-python manage.py ingest documentation ./reports/ \
+python manage.py ingest documentation ./data/documentation/reports/ \
     --doc-type "Text Document" \
     --pdf-mode ocr \
     --chunking ingestor \
     --max-size-block 2000
 
 # Preview what would be ingested
-python manage.py ingest documentation ./docs/ --dry-run
+python manage.py ingest documentation ./data/documentation/ --dry-run
 ```
 
 #### Output Messages
diff --git a/docs/getting-started.md b/docs/getting-started.md
index 06bf45b..c9ae77b 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -611,14 +611,17 @@ python manage.py migrate data
 
 ### Step 8: Ingest Documents
 
-Upload documents to your topic:
+Upload documents to your topic. In this guide, examples place files under `data//` so the file location mirrors the topic path:
 
 ```bash
 # Ingest a directory of documents
-python manage.py ingest product_docs ./docs/
+python manage.py ingest product_docs ./data/product_docs/
+
+# Ingest into a nested child topic (parent/child path)
+python manage.py ingest product_docs/tutorials ./data/product_docs/tutorials/*.pdf
 
 # Preview first (dry run)
-python manage.py ingest product_docs ./docs/ --dry-run
+python manage.py ingest product_docs ./data/product_docs/ --dry-run
 ```
 
 ### Step 9: List Topics