A sophisticated suite of scripts for AI-powered content generation, processing, and merging with multi-stage editorial pipelines.
This suite provides a complete workflow for content creation and processing using LangChain and large language models. From initial content generation through intelligent merging and editorial refinement, it supports both simple workflows and complex multi-stage pipelines.
- Architecture
- Quick Start
- Makefile Usage
- Scripts Overview
- Configuration
- Usage Examples
- Pipeline Types
- API Reference
- Tool Agent Schema
- Troubleshooting
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   lc_ask.py     │    │   lc_batch.py   │    │ lc_merge_runner │
│                 │    │                 │    │                 │
│ • Core LLM      │    │ • Batch Jobs    │    │ • Multi-stage   │
│ • Single Query  │    │ • Parallel Proc │    │ • Critique      │
│ • JSON Output   │    │ • Result Storage│    │ • Merge         │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │ lc_build_index  │
                       │                 │
                       │ • Vector Index  │
                       │ • RAG Support   │
                       │ • Embeddings    │
                       └─────────────────┘
#### Install Python dependencies globally
pip install -r requirements.txt -r requirements-faiss-cpu.txt
# or swap requirements-faiss-cpu.txt for requirements-faiss-gpu.txt on CUDA hosts
pip install -r requirements-test.txt

#### Install Python dependencies in a virtual environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt -r requirements-faiss-cpu.txt
# or swap requirements-faiss-cpu.txt for requirements-faiss-gpu.txt on CUDA hosts
pip install -r requirements-test.txt
curl -LO https://github.com/getsops/sops/releases/download/v3.10.2/sops-v3.10.2.linux.amd64
sudo mv sops-v3.10.2.linux.amd64 /usr/local/bin/sops
sudo chmod +x /usr/local/bin/sops

cp env.json.template env.json
# Edit env.json with your API keys and settings (append _pt, for plaintext, to the names of non-secret variables)
nano env.json
# Encrypt the values in env.json in place (redirecting the output back onto env.json would truncate it first)
sops -e -i env.json
# Load both secret and non-secret variables into the current shell.
# The make target prints the shell export commands (it also strips the _pt suffix),
# so its output has to be evaluated:
eval "$(make sops-env-export)"
# Complete setup and workflow
make init # Set up environment
# (optional) make init with GPU FAISS wheel
FAISS_BACKEND=gpu make init
make lc-index KEY=default # Build FAISS index
make cli-ask "What is machine learning?" # Ask questions via Typer CLI
make cli-shell # Interactive shell
# Complete book generation workflow
make book-from-outline OUTLINE="examples/sample_outline_text.txt" TITLE="My Book"

# See all available example files
make examples
# Quick RAG query with custom parameters
make quick-ask "What is machine learning?" KEY="science" CONTENT_TYPE="technical_manual_writer"
# Batch processing with parallel execution
make batch-workflow FILE="examples/sample_jobs_1A1.jsonl" PARALLEL=4

# 1. Build knowledge index (optional, for RAG)
python src/langchain/lc_build_index.py
# 2. Generate content variations
python src/langchain/lc_batch.py
# 3. Merge and refine content
python src/langchain/lc_merge_runner.py

The Makefile is the command center for the entire LangChain RAG Writer pipeline. It exposes every command-line option of the scripts as a variable and adds workflow automation and quality tools.
make help # Show comprehensive help with all targets
make examples # List all available example files

make init # Initialize environment and install dependencies
FAISS_BACKEND=gpu make init # Same as above but installs faiss-gpu
make lc-index KEY=foo SHARD_SIZE=2000 RESUME=1 # Build sharded FAISS index
make cli-ask "question" # RAG query via Typer CLI
make cli-shell # Interactive shell

make lc-ask INSTR="instruction" [TASK="task"] # RAG query with custom parameters
make lc-batch FILE="jobs.jsonl" [PARALLEL=4] # Batch processing
make lc-merge-runner [SUB=1A1] # Content merging
make lc-outline-converter OUTLINE="file.txt" # Convert outlines to book structure
make lc-book-runner BOOK="book.json" # Complete book generation

make test # Run test suite
make test-coverage # Run tests with coverage reporting
make format # Format code with black
make lint # Lint code with flake8
make quality # Run full quality check
make show-config # Display current configuration
make check-setup # Validate project setup

# Generate book from outline (complete pipeline)
make book-from-outline OUTLINE="examples/sample_outline_text.txt" TITLE="My Book"
# Quick RAG query with all options
make quick-ask "What is machine learning?" KEY="science" CONTENT_TYPE="technical_manual_writer" K=20
# Batch processing workflow
make batch-workflow FILE="jobs.jsonl" KEY="biology" PARALLEL=4

- Complete Option Coverage: All command-line options available as variables
- Smart Defaults: Sensible defaults for all parameters
- Workflow Automation: Multi-step processes in single commands
- Error Prevention: Parameter validation and help messages
- Quality Tools: Integrated testing, formatting, and linting
- Example Discovery: Easy access to sample files and usage patterns
The project provides multiple CLI interfaces for different use cases:
Note on FAISS index paths:
- The multi-model builder writes FAISS directories like `storage/faiss_<key>__<embed_model>`.
- The Typer CLI (`python -m src.cli.commands`) looks for `storage/faiss_<key>` by default.
- If you use the multi-model builder and the Typer CLI, copy or symlink your chosen embedding index to the generic path. A relative symlink target is resolved from the directory that holds the link, so the target is given relative to storage/, e.g.:

ln -s faiss_science__BAAI-bge-small-en-v1.5 storage/faiss_science

- `lc_ask.py` will automatically use a `...__<embed_model>_repacked` directory if present.
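The lookup order described in the note can be sketched in a few lines. This is a minimal illustration, not the project's code: `resolve_faiss_dir` and `embed_model_slug` are hypothetical helpers, and the rule that `/` in the model name becomes `-` is inferred from the example path `faiss_science__BAAI-bge-small-en-v1.5`.

```python
from pathlib import Path
from typing import Optional


def embed_model_slug(embed_model: str) -> str:
    # Assumption: "/" is replaced by "-", as in faiss_science__BAAI-bge-small-en-v1.5.
    return embed_model.replace("/", "-")


def resolve_faiss_dir(index_root: Path, key: str, embed_model: str) -> Optional[Path]:
    """Return the first existing index directory, most specific first."""
    slug = embed_model_slug(embed_model)
    candidates = [
        index_root / f"faiss_{key}__{slug}_repacked",  # repacked index, if present
        index_root / f"faiss_{key}__{slug}",           # multi-model builder output
        index_root / f"faiss_{key}",                   # generic path used by the Typer CLI
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None  # nothing built yet for this key
```

For example, `resolve_faiss_dir(Path("storage"), "science", "BAAI/bge-small-en-v1.5")` returns `None` until at least one of the three directories exists.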
If you upgraded LangChain and your old FAISS index fails to load, repack it without re-embedding:
# Derive paths from KEY and EMBED_MODEL
make repack-faiss KEY=science EMBED_MODEL=BAAI/bge-small-en-v1.5
# Or specify explicit directories
make repack-faiss FAISS_DIR=storage/faiss_science__BAAI-bge-small-en-v1.5 OUT=storage/faiss_science__BAAI-bge-small-en-v1.5_repacked

Purpose: Direct interface to language models for single queries.
Key Features:
- Flexible prompt engineering
- Multiple content types
- JSON output support
- Retrieval-augmented generation (RAG)
Options:
- `--key`: collection key used when building the index (requires matching `--chunks-dir`)
- `--k`: number of results to return from the vector database
- `--embed-model`: the model index to query (default: `BAAI/bge-small-en-v1.5`)
- `--ce-model`: cross-encoder model (default: `cross-encoder/ms-marco-MiniLM-L-6-v2`)
- `--chunks-dir`: directory containing the chunk JSONL written by `lc_build_index`
- `--chunks-file`: explicit path to a chunk JSONL file (skips the `--chunks-dir` lookup)
- `--index-dir`: root directory containing FAISS index folders (the same parent directory passed to `lc_build_index`; defaults to `<repo>/storage`)
Usage:
# Basic query
python src/langchain/lc_ask.py ask "What is machine learning?"
# Query using a specific key with a custom index directory root
python src/langchain/lc_ask.py --key science --index-dir /mnt/vector-storage --question "Summarise the Higgs boson"
# Advanced query with options
python src/langchain/lc_ask.py ask "Explain neural networks" --content-type technical_manual_writer --key science --k 20
# Query from JSON file
python src/langchain/lc_ask.py ask --file query.json --key biology

Makefile Usage:
make lc-ask - Complete RAG query options:
- `INSTR`: Instruction for retrieval (what to search for)
- `TASK`: Task prefix for LLM (how to answer)
- `FILE`: JSON file containing query parameters
- `KEY`: Collection key (default: default)
- `CONTENT_TYPE`: Writing style (default: pure_research)
- `K`: Number of documents to retrieve (default: 30)
# Simple RAG query
make lc-ask "Explain neural networks"
# Advanced RAG query with all options
make lc-ask INSTR="Explain neural networks" TASK="Write for beginners" KEY="science" CONTENT_TYPE="technical_manual_writer" K=20

Purpose: Process multiple content generation jobs in parallel.
Key Features:
- JSONL job file processing
- Parallel execution
- Result aggregation
- Progress tracking
Options:
- `--key`: string specifying the FAISS index to query
- `--k`: number of results to return from the vector database
- `--parallel`: number of parallel workers (default: derived from `RAG_PARALLEL_WORKERS` or clamped `os.cpu_count()`)
- `--output-dir`: specify output directory
- `--jobs`: JSON or JSONL file containing job definitions
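The default for `--parallel` can be pictured as a small resolution function. The sketch below is an assumption about how the pieces fit together rather than a copy of `lc_batch.py`: `resolve_parallel_workers` is a hypothetical name, the 1-32 clamp comes from the environment variable table in this README, and applying the clamp to explicit and environment values as well is a guess.

```python
import os
from typing import Mapping, Optional


def resolve_parallel_workers(
    cli_value: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
    max_workers: int = 32,
) -> int:
    """Pick a worker count: CLI flag, then RAG_PARALLEL_WORKERS, then CPU count."""
    env_value = env.get("RAG_PARALLEL_WORKERS", "").strip()
    if cli_value is not None:
        requested = cli_value
    elif env_value.isdigit():
        requested = int(env_value)
    else:
        # os.cpu_count() may return None on some platforms
        requested = os.cpu_count() or 1
    # Assumption: every source is clamped to the 1..max_workers range
    return max(1, min(max_workers, requested))
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.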
Usage:
# Process jobs from JSONL file
python src/langchain/lc_batch.py --jobs data_jobs/example.jsonl --key science
# Parallel processing
python src/langchain/lc_batch.py --jobs jobs.jsonl --parallel 4 --k 30
# Custom output directory
python src/langchain/lc_batch.py --jobs jobs.jsonl --output-dir ./custom_output

Makefile Usage:
make lc-batch - Batch processing with full options:
- `FILE`: JSON or JSONL file containing job definitions
- `KEY`: Collection key (default: default)
- `CONTENT_TYPE`: Writing style (default: pure_research)
- `K`: Retriever top-k (default: 30)
- `PARALLEL`: Number of parallel workers (default: 1)
- `OUTPUT_DIR`: Custom output directory
# Batch processing with parallel execution
make lc-batch FILE="examples/sample_jobs_1A1.jsonl" KEY="biology" PARALLEL=4

Purpose: Create vector indexes for retrieval-augmented generation.
Key Features:
- Document ingestion
- Vector embeddings
- Index optimization
- Multiple data sources
Options:
- `key`: Positional storage key prefix (defaults to `$RAG_KEY` or `default`)
- `--shard-size`: Number of chunks per shard (default: `1000`)
- `--resume [VALUE]`: Skip shards already built (accepts optional value for compatibility)
- `--keep-shards`: Do not delete shard directories after merge
- `--no-gpu`: Force embeddings to run on CPU even if accelerators are available
- `--serve-gpu`: After saving, copy the in-memory FAISS index to GPU for serving
- `--faiss-threads`: Override FAISS build thread count (default: host CPU count)
- `--input-dir`: Directory containing source PDFs to ingest (default: `data_raw/`)
- `--chunks-dir`: Directory for normalized chunk JSONL output (default: `data_processed/`)
- `--index-dir`: Directory that will contain FAISS index folders (default: `storage/`)
Usage:
# Build the index for the science key, using GPU embeddings when available
python src/langchain/lc_build_index.py science
# Resume a sharded build, keep intermediate shards, pin to CPU, and serve from GPU memory
python src/langchain/lc_build_index.py science --shard-size 200 --resume --keep-shards --no-gpu --serve-gpu --faiss-threads 8
Purpose: Intelligent content merging with multi-stage editorial pipelines.
Key Features:
- Multi-stage processing (critique → merge → style → images)
- AI-powered content scoring
- Jaccard similarity de-duplication
- YAML-driven configuration
- Command-line and interactive modes
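To make the de-duplication feature concrete, here is a minimal sketch of Jaccard similarity over word sets. The tokenizer, the function names, and the greedy keep-first strategy are illustrative assumptions; the 0.85 default mirrors the `similarity_threshold` value in the example merge configuration, and how the merge runner actually applies it is not confirmed here.

```python
import re
from typing import List, Set


def token_set(text: str) -> Set[str]:
    # Lowercased alphanumeric tokens; punctuation and case are ignored
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two token sets."""
    set_a, set_b = token_set(a), token_set(b)
    if not set_a and not set_b:
        return 1.0  # two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def dedupe(variations: List[str], threshold: float = 0.85) -> List[str]:
    """Keep a variation only if it is below the threshold against all kept ones."""
    kept: List[str] = []
    for text in variations:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept
```

Because the comparison is greedy, the first variation in a near-duplicate group is the one that survives, so ordering the input by quality score beforehand matters.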
Options:
- `--sub`: Subsection ID (e.g., 1A1) for job file processing
- `--jobs`: Path to JSONL jobs file
- `--key`: Collection key for lc_ask
- `--k`: Retriever top-k for lc_ask (default: `10`)
- `--batch-only`: Force use of batch results only (skip job file prompts) (default: `False`)
- `--chapter`: Chapter title for context
- `--section`: Section title for context
- `--subsection`: Subsection title for context
Usage:
# Interactive mode
python src/langchain/lc_merge_runner.py
# Process specific subsection
python src/langchain/lc_merge_runner.py --sub 1A1 --key science
# Custom job file
python src/langchain/lc_merge_runner.py --jobs data_jobs/1A1.jsonl --chapter "Chapter 1"

Makefile Usage:
make lc-merge-runner - Intelligent content merging:
- `SUB`: Subsection ID (e.g., 1A1)
- `JOBS`: Path to JSONL jobs file
- `KEY`: Collection key for lc_ask
- `K`: Retriever top-k for lc_ask
- `BATCH_ONLY`: Force use of batch results only
- `CHAPTER`/`SECTION`/`SUBSECTION`: Hierarchical context titles
Purpose: Generate intelligent book outlines using LangChain's indexed knowledge.
Key Features:
- Interactive book detail collection
- Outline depth selection (3-5 levels)
- AI-powered outline generation using indexed content
- Automatic conversion to book runner format
- Comprehensive outline validation and summary
Options:
- `--output`: Output JSON file path
- `--non-interactive`: not yet implemented (the user will be able to supply a JSON file containing answers to all prompts)
Usage:
# Interactive outline generation
python src/langchain/lc_outline_generator.py
# Save to specific location
python src/langchain/lc_outline_generator.py --output my_book_outline.json

Purpose: Convert existing outlines into book structure and job files.
Key Features:
- Multiple input format support (JSON, Markdown, Text)
- Automatic hierarchical context generation
- Job file generation for each subsection
- Dependency relationship detection
- Format validation and conversion
Options:
- `--outline`: Input outline file (JSON, Markdown, or Text)
- `--output`: Output book structure JSON file
- `--title`: Override book title
- `--topic`: Override book topic
- `--audience`: Override target audience
- `--wordcount`: Override word count target
- `--num-prompts`: Number of prompts to generate per section
- `--content-type`: Content type for job generation (default: technical_manual_writer)
Supported Input Formats:
- JSON: Structured outline format (from lc_outline_generator.py)
- Markdown: Header-based outline (# ## ### ####)
- Text: Numbered/lettered outline (1. 2. A. B. etc.)
# Convert text outline
python src/langchain/lc_outline_converter.py --outline examples/sample_outline_text.txt
# Convert JSON outline
python src/langchain/lc_outline_converter.py --outline my_outline.json
# Convert with custom metadata
python src/langchain/lc_outline_converter.py --outline outline.md --title "My Book" --topic "AI" --audience "developers"
# Convert markdown outline
python src/langchain/lc_outline_converter.py --outline examples/sample_outline_markdown.md --output book.json

Makefile Usage:
make lc-outline-converter - Outline conversion:
- `OUTLINE`: Input outline file (JSON, Markdown, or Text)
- `OUTPUT`: Output book structure JSON file
- `TITLE`/`TOPIC`/`AUDIENCE`: Override metadata
- `WORDCOUNT`: Override word count target
- `NUM_PROMPTS`: Number of prompts to generate per section
- `CONTENT_TYPE`: Content type for job generation
# Interactive outline generation
make lc-outline-generator
# Convert outline with custom output
make lc-outline-converter examples/sample_outline_markdown.md OUTPUT=my_book.json
# Convert with metadata overrides
make lc-outline-converter examples/sample_outline_text.txt \
TITLE="My Book" \
TOPIC="Machine Learning" \
AUDIENCE="Data Scientists" \
WORDCOUNT=75000
# Run complete book generation
make lc-book-runner examples/book_structure_example.json
# Force regeneration of all content
make lc-book-runner examples/book_structure_example.json FORCE=1
# Skip merge step (batch only)
make lc-book-runner examples/book_structure_example.json SKIP_MERGE=1
# Convert outline with custom metadata
make lc-outline-converter OUTLINE="examples/sample_outline_text.txt" TITLE="My Book" TOPIC="AI" AUDIENCE="developers"
Purpose: High-level orchestration for entire books and chapters.
Key Features:
- Hierarchical book structure processing (4 levels deep)
- Automatic job file generation
- Batch and merge pipeline orchestration
- Final document aggregation
- Progress tracking and error recovery
- Section dependency management
Options:
- `--book`: JSON file defining book structure
- `--output`: Output markdown file path
- `--force`: Force regeneration of all content (default: `False`)
- `--skip-merge`: Skip merge processing, only run batch (default: `False`)
- `--use-rag`: Use RAG for additional context when generating job prompts (default: `False`)
- `--rag-key`: Collection key for RAG retrieval (required if `--use-rag` is specified)
- `--num-prompts`: Number of prompts to generate per section (default: 4)
Usage:
# Process complete book structure
python src/langchain/lc_book_runner.py --book examples/book_structure_example.json
# Custom output location
python src/langchain/lc_book_runner.py --book book.json --output /path/to/final_book.md
# Force regeneration
python src/langchain/lc_book_runner.py --book book.json --force
# Skip merge step (batch only)
python src/langchain/lc_book_runner.py --book book.json --skip-merge

Makefile Usage:
make lc-book-runner - Complete book orchestration:
- `BOOK`: JSON file defining book structure
- `OUTPUT`: Output markdown file path
- `FORCE`: Force regeneration of all content
- `SKIP_MERGE`: Skip merge processing, only run batch
- `USE_RAG`: Use RAG for additional context
- `RAG_KEY`: Collection key for RAG retrieval
- `NUM_PROMPTS`: Number of prompts to generate per section
# Complete book generation
make lc-book-runner BOOK="examples/book_structure_example.json" OUTPUT="my_book.md"
# Force regeneration of all content
make lc-book-runner BOOK="book.json" FORCE=1
# Skip merge step (batch only)
make lc-book-runner BOOK="book.json" SKIP_MERGE=1

Purpose: Add metadata (DOI/ISBN/author/title/date) to PDFs prior to indexing.
Key Features:
- Scans each PDF's filename, content, and metadata for a DOI or ISBN
- Fetches metadata from Crossref/OpenLibrary
- Writes a manifest and updates PDF metadata (Info + XMP/DC/Prism)
- Renames files using a consistent format ([slugified title]-[year].pdf)
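The rename format `[slugified title]-[year].pdf` can be illustrated with a short sketch. The exact slug rules used by `metadata_scan.py` are not documented here, so `slugify` and `renamed_pdf` below are hypothetical helpers that assume ASCII folding, lowercasing, and hyphen-separated words.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    # Fold accented characters to ASCII, then collapse every run of
    # non-alphanumeric characters into a single hyphen
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def renamed_pdf(title: str, year: int) -> str:
    """Build a file name in the [slugified title]-[year].pdf format."""
    return f"{slugify(title)}-{year}.pdf"


print(renamed_pdf("Deep Learning: An Introduction", 2020))
# deep-learning-an-introduction-2020.pdf
```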
Options:
- `--dir`: Root to scan (default `data_raw`)
- `--glob`: File pattern (default `**/*.pdf`)
- `--write`: Write manifest and PDF metadata (default off)
- `--quickscan`: Skip files requiring input from user (default off)
- `--manifest`: Manifest path (default `research/out/manifest.json`)
- `--rename`: `yes|no` to rename files by slugified title/year (default `yes`)
- `--skip-existing`: Skip already processed files in manifest (default off)
Usage:
# Preview (no writes)
python research/metadata_scan.py --dir data_raw
# Process PDF files in data_raw, adding metadata and renaming them where the information can be found in the files
python research/metadata_scan.py --dir data_raw --write --quickscan
# Process PDF files in data_raw, prompting for the ISBN or DOI where it is missing or ambiguous
python research/metadata_scan.py --dir data_raw --write

Makefile Usage:
make scan-metadata DIR=data_raw WRITE=1 RENAME=yes SKIP_EXISTING=1

Purpose: Aid in gathering journal articles and books in PDF format, and in populating their metadata, as the basis for the RAG index.
Key Features:
- TUI user interface
- Import html source via textarea
- Recalls state between runs via manifest.json
- Export list of PDF URLs for download using any download manager
- Download PDF directly from UI
- Load PDF files and URLs, and scholar details URLs with the click of a button
- Fuzzy search for DOI/ISBN on click
Options:
- `--file`: HTML file with Google Scholar results
- `--xml`: XML-like markup file with entries
- `--skip-existing`: Skip entries with processed=true in manifest
- `--allow-delete`: Enable delete actions
Usage:
# Run without file input, or with an already populated manifest
python research/collector_ui.py
# load entries from html source
python research/collector_ui.py --file ../research/out/research.html
# load entries from simple xml format
python research/collector_ui.py --xml ../research/out/research.xml

The CLI commands module provides a streamlined interface using Typer:
# Basic RAG query
python -m src.cli.commands ask "What is machine learning?"
# Advanced query with options
python -m src.cli.commands ask "Explain neural networks" --key science --k 20 --task "Write for beginners"
# Query from JSON file
python -m src.cli.commands ask --file query.json --key biology

CLI Command Options:
- `--question, -q`: Your question or instruction (required)
- `--key, -k`: Collection key (default: from config)
- `--k`: Top-k results for retrieval (default: 15)
- `--task, -t`: Optional task prefix (excluded from retrieval)
- `--file, -f`: JSON file containing prompt parameters
The interactive shell provides an advanced REPL interface with presets and multi-step workflows:
# Start interactive shell
python -m src.cli.shell
# Start with specific collection
RAG_KEY=science python -m src.cli.shell

Shell Commands:
- `ask <question>`: General RAG answer with citations
- `compare <topic>`: Contrast positions/methods/results across sources
- `summarize <topic>`: High-level summary with quotes
- `outline <topic>`: Book/essay outline with evidence bullets
- `presets`: List dynamic presets from playbooks.yaml
- `preset <name> [topic]`: Run guided multi-step preset
- `sources`: Show sources from last answer
- `help`: Show available commands
- `quit`: Exit shell
Example Shell Session:
$ python -m src.cli.shell
RAG Tool Shell
ROOT: /path/to/project
KEY: default
Index: /path/to/storage/faiss_default
rag> ask What is machine learning?
[AI generates response with citations]
rag> sources
- Machine Learning Basics (p.15) :: ml_basics.pdf
- Neural Networks Explained (p.42) :: nn_guide.pdf
rag> quit

| Interface | Best For | Features |
|---|---|---|
| Makefile | Complete workflows, automation | All options as variables, error handling, examples |
| Direct Scripts | Direct control, scripting | Full command-line options, programmatic use |
| CLI Commands | Simple queries, automation | Streamlined interface, JSON file support |
| Interactive Shell | Exploration, complex queries | Presets, multi-step workflows, source inspection |
All CLI interfaces use the same configuration system:
- Environment Variables: `OPENAI_API_KEY`, `RAG_KEY`
- Configuration Files: `env.json`, YAML config files
- Command-line Options: Override defaults per command
Example Configuration:
{
  "openai_api_key": "your-key-here",
  "default_model": "gpt-4",
  "rag_key": "science",
  "embedding_model": "text-embedding-ada-002"
}

- OpenAI via LangChain (preferred): requires `OPENAI_API_KEY` and `langchain-openai`.
- OpenAI (raw client) fallback: requires the `openai` package.
- Ollama (local): requires `langchain-ollama` or `langchain-community` ChatOllama and a running Ollama daemon; set `OLLAMA_MODEL`.
The factory auto-selects an available backend in this order: OpenAI (LangChain) → Ollama → OpenAI (raw). See src/core/llm.py.
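The selection order can be sketched as follows. This is not the code in src/core/llm.py: `select_backend`, `module_available`, and the returned labels are hypothetical, the check only looks at installed packages and `OPENAI_API_KEY` (it does not probe for a running Ollama daemon), and requiring the API key for the raw client is an assumption.

```python
import importlib.util
import os
from typing import Callable, Mapping, Optional


def module_available(name: str) -> bool:
    # True when the top-level package can be imported in this environment
    return importlib.util.find_spec(name) is not None


def select_backend(
    env: Mapping[str, str] = os.environ,
    has_module: Callable[[str], bool] = module_available,
) -> Optional[str]:
    """Return the first usable backend: OpenAI (LangChain), Ollama, OpenAI (raw)."""
    has_key = bool(env.get("OPENAI_API_KEY"))
    if has_key and has_module("langchain_openai"):
        return "openai-langchain"
    if has_module("langchain_ollama") or has_module("langchain_community"):
        return "ollama"
    if has_key and has_module("openai"):
        return "openai-raw"
    return None  # no backend available
```

Injecting `has_module` makes the ordering testable without installing or removing any package.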
This project targets LangChain 0.2.x with the split provider packages:
- `langchain>=0.2.13,<0.3`
- `langchain-community>=0.2.12,<0.3`
- `langchain-text-splitters>=0.2.2,<0.3`
- `langchain-openai>=0.1.7,<0.2`
- Optional: `langchain-huggingface>=0.0.3`, `langchain-ollama>=0.1.0`
These versions ensure stable imports for retrievers (e.g., EnsembleRetriever) and LLM integrations. If you upgrade beyond these ranges, prefer the Typer CLI (python -m src.cli.commands) which already includes robust fallbacks.
Defines the JSON contract for tool-enabled agents. Each assistant message must be either a tool invocation:
{"tool": "<tool_name>", "args": { /* ... */ }}

or a final response:

{"final": "<answer text>"}

See docs/tool_agent_schema.md for the full specification and transcript example.
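A host loop has to decide which of the two message shapes it received. The sketch below shows one way to validate that contract; `parse_agent_message` is a hypothetical helper, and treating a missing `args` as an empty object is an assumption that docs/tool_agent_schema.md may or may not share.

```python
import json
from typing import Any, Tuple


def parse_agent_message(raw: str) -> Tuple[str, Any]:
    """Classify an assistant message as ("tool", {...}) or ("final", text)."""
    message = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(message, dict):
        raise ValueError("assistant message must be a JSON object")
    has_tool, has_final = "tool" in message, "final" in message
    if has_tool == has_final:
        raise ValueError("message must contain exactly one of 'tool' or 'final'")
    if has_final:
        if not isinstance(message["final"], str):
            raise ValueError("'final' must be a string")
        return "final", message["final"]
    if not isinstance(message["tool"], str):
        raise ValueError("'tool' must be a string")
    args = message.get("args", {})  # assumption: args may be omitted
    if not isinstance(args, dict):
        raise ValueError("'args' must be a JSON object")
    return "tool", {"tool": message["tool"], "args": args}
```

Rejecting messages that carry both keys, or neither, keeps the agent loop from guessing at the model's intent.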
Canonical tool input and output JSON schemas live under schemas/tools/. Each tool,
such as web_search_query or vector_query_search, has corresponding
*.input.schema.json and *.output.schema.json files that define the expected
Model Context Protocol contracts.
Additional tools can be declared by dropping *.tool.yaml files under the
tools/ directory. Each YAML file defines a toolpack with an id, kind
(python, cli, node, php, or http), an entry point (or php script
path), and JSON schemas for input and output. On startup these files are loaded
and exposed over /mcp/tool/<id>. Jinja2 templating is supported for argv, URL,
and headers, and a templating.cacheKey value can override caching behavior.
See docs/MCP.md for full field definitions, env passthrough, and
JSON contracts.
Run a tool-enabled agent that combines local RAG retrieval with tools served over MCP:
python -m src.cli.multi_agent "Find papers on transformers and call the time tool" \
    --key default --mcp ./tools/mcp_server.py --index-dir ./storage

The command registers the rag_retrieve tool for vector search and loads any
tools exposed by the MCP server so the agent can invoke them during the
conversation.
Start the tool server to expose registered tools via the Model Context Protocol:
python -m src.tool.mcp_server # or: make tool-shell
uvicorn src.tool.mcp_app:app --host 127.0.0.1 --port 3333

Handles writing PDF files with metadata in Dublin Core and PRISM formats.
Write standard PDF metadata fields with pypdf.
Write XMP (Dublin Core + PRISM) metadata in place using pikepdf. If pikepdf is not installed, this function returns silently.
Provides debug log output to a file for cases where the standard error and standard output streams cannot be accessed directly.
Log a string to the file.
Pydantic models describing YAML-defined toolpacks.
- `id: str` - canonical tool identifier
- `kind: Literal['python','cli','node','http','php']`
- `entry: str | List[str] | None` - module path, CLI/Node argv, or HTTP URL
- `php: str | List[str] | None` - PHP script path
- `phpBinary: Optional[str]` - PHP interpreter override
- `schema: ToolSchema` - input and output JSON schemas
- `timeoutMs: Optional[int]`
- `limits: ToolLimits` - input/output byte caps
- `env: List[str]` - environment variables to pass through
- `headers: Dict[str, str]` - HTTP headers (templated)
- `templating: Optional[Templating]` - Jinja2 settings
- `deterministic: bool` - enable idempotent caching

- `cacheKey: Optional[str]` - Jinja2 template for cache key
Executes a loaded ToolPack.
Run the toolpack either in-process or via subprocess/HTTP and return its JSON output.
Documentation: The argparse module makes it easy to write user-friendly command-line interfaces. The program defines what arguments it requires, and argparse will figure out how to parse those out of sys.argv. The argparse module also automatically generates help and usage messages. The module will also issue errors when users give the program invalid arguments.
Example Usage:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--json", dest="json_path", help="JSON job file containing 'question'")
parser.add_argument("--k", type=int, default=10)
args = parser.parse_args()

Documentation: python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers against a predefined list of file types. The same functionality is exposed on the command line by the Unix command file.
Example Usage:
import magic

# returns a value such as image/png, application/pdf, text/html
mime_type = magic.from_file(filepath, mime=True)

Documentation: pytest-asyncio provides asyncio awareness for pytest so asynchronous tests can run alongside synchronous suites.
Example Usage:
import pytest

from src.tool import ToolRegistry

@pytest.mark.asyncio
async def test_tool_invocation():
    registry = ToolRegistry()
    await registry.register_mcp_server("dummy://server")

Documentation: FastAPI is a modern, high-performance web framework for building APIs with Python.
Example Usage:
from fastapi import FastAPI

app = FastAPI()

Documentation: Uvicorn is a fast ASGI server implementation, well suited to serving FastAPI apps.
Example Usage:
uvicorn src.tool.mcp_app:app --host 127.0.0.1 --port 3333

Documentation: Pydantic provides data validation and settings management using Python type annotations.
Example Usage:
from pydantic import BaseModel

class Meta(BaseModel):
    traceId: str

Documentation: Jinja2 is a modern templating engine for Python used to render strings with dynamic values.
Example Usage:
from jinja2 import Environment, BaseLoader

env = Environment(loader=BaseLoader())
env.from_string("Hello {{ name }}").render(name="World")

Documentation: httpx is a fully featured HTTP client for Python.
Example Usage:
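from fastapi import FastAPI
from httpx import ASGITransport, AsyncClient

app = FastAPI()  # the ASGI app under test

async def fetch_root():
    # Route requests straight into the app, without opening a network socket
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        return await client.get("/")

Documentation: AnyIO provides a unified asynchronous I/O API across event loops.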
Example Usage:
import anyio

async def start_stdio_server():
    # Spawn the MCP stdio server as a child process
    return await anyio.open_process(["python", "-m", "src.tool.mcp_stdio"])

| name | description | secret | default |
|---|---|---|---|
| `OPENAI_API_KEY` | API key for OpenAI backends | | N/A |
| `RAG_KEY` | Default collection key | | None |
| `OPENAI_MODEL` | Override OpenAI chat model | | gpt-4o-mini |
| `OLLAMA_MODEL` | Override local Ollama model | | llama3.1:8b |
| `DEBUG` | Set to 1/true to enable debug mode in config | | 0/False |
| `RAG_PARALLEL_WORKERS` | Default parallel workers for lc_batch (see src/config/settings.py, src/langchain/lc_batch.py) | | Auto (os.cpu_count() clamped to 1-32) |
Use sops-edit env.json to add or edit environment variables, appending _pt (for plaintext) to the names of non-secret values. Load the sops values into your environment by evaluating the output of `make sops-env-export`, for example with eval "$(make sops-env-export)", because that target strips the suffix and prints the export commands needed to load the variables properly.
The system uses several YAML configuration files located in src/config/content/prompts/ to define content types, merge pipelines, and interactive presets.
tools/**/*.tool.yaml files describe executable tools. Core keys:
id: <unique tool id>
kind: python | cli | node | php | http
entry: package.module:func # argv for CLI/Node or URL for HTTP tools
headers:
  X-Token: "{{input.token}}" # optional HTTP headers
templating:
  cacheKey: "{{input.path}}" # override caching
schema:
  input: $ref to input schema
  output: $ref to output schema

Dropping a new file in this tree automatically registers the tool at
/mcp/tool/<id>.
File: src/config/content/prompts/content_types.yaml (main index)
Individual Files: src/config/content/prompts/content_types/*.yaml
Content types define different writing styles and system prompts for content generation. Each content type is stored in its own YAML file.
Available Content Types:
- `pure_research` - Academic research with citations
- `technical_manual_writer` - Technical documentation
- `science_journalism_article_writer` - Science journalism
- `folklore_adaptation_and_anthology_editor` - Creative writing adaptations
Example Content Type Structure (pure_research.yaml):
description: "Pure research assistant focused on citations and evidence"
system_prompt:
  - "You are a careful research assistant.\n"
  - "Use ONLY the provided context.\n"
  - "Every claim MUST include inline citations like ([filename], p.X) or ([filename], pp.X–Y)."
  - "If the context is insufficient or conflicting, state what is missing and stop."
  - "Current date: {{current_date}} (for temporal context if needed)\n"
job_generation_prompt: |
  You are a research-focused content strategist. Generate {{num_prompts}} research-oriented writing prompts...
job_generation_rag_context: |
  **Additional Research Context from RAG:**
  Use the following relevant research information from the knowledge base...

Template Variables Available:
- `{{book_title}}` - Full book title
- `{{chapter_title}}` - Chapter title
- `{{section_title_hierarchy}}` - Hierarchical section path
- `{{subsection_title}}` - Subsection title
- `{{subsection_id}}` - Hierarchical ID (e.g., "1A1")
- `{{target_audience}}` - Target audience
- `{{topic}}` - Book topic
- `{{num_prompts}}` - Number of prompts to generate
- `{{rag_context}}` - Additional RAG context
- `{{current_date}}` - Current date
RAG Context Query Templates:
Each content type now includes a rag_context_query template for retrieving relevant context:
rag_context_query: |
  Find relevant information about: {{section_title}}
  Context: This is for creating educational content for a book titled "{{book_title}}"
  for {{target_audience}}.
  Please provide any relevant background information, examples, or context that would be
  helpful for writing educational content about this topic.

Template Fallback System: The template engine supports a fallback system:
- First: Try to load the template from the specific content type file (e.g., `pure_research.yaml`)
- Fallback: If not found, try to load it from `default.yaml`
- Error: If still not found, raise an informative error message
This ensures that new content types can inherit templates from the default configuration while still allowing for customization when needed.
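The fallback order can be expressed in a few lines. The sketch below works on already-loaded dictionaries instead of reading YAML files, so that it stays self-contained; `load_template` and the `templates` mapping are illustrative assumptions, not the project's template engine.

```python
from typing import Mapping

# Maps a content type name (or "default") to its templates, as if each
# YAML file had already been parsed into a dictionary.
Templates = Mapping[str, Mapping[str, str]]


def load_template(templates: Templates, content_type: str, name: str) -> str:
    """Look up a template in the content type first, then in the default set."""
    for source in (content_type, "default"):
        value = templates.get(source, {}).get(name)
        if value is not None:
            return value
    raise KeyError(
        f"Template '{name}' not found in '{content_type}.yaml' or 'default.yaml'"
    )


templates = {
    "default": {"rag_context_query": "Find relevant information about: {{section_title}}"},
    "pure_research": {"job_generation_prompt": "You are a research-focused content strategist."},
}
# pure_research defines no rag_context_query, so the default one is used
print(load_template(templates, "pure_research", "rag_context_query"))
```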
File: src/config/content/prompts/merge_types.yaml
Defines different merge pipeline configurations for content consolidation and editing.
File: prompts/REGISTRY.yaml
Prompt packs are stored on disk and registered via prompts/REGISTRY.yaml. Each prompt family contains one or more versions. Bodies live in Markdown files named prompts/packs/<domain>/<name>.v<major>.md and share a spec defined in prompts/packs/<domain>/<name>.spec.yaml.
Example Registry Structure:
writing:
  sectioned_draft:
    - 3

Example Spec Structure (sectioned_draft.spec.yaml):
inputs:
  type: object
  properties:
    topic:
      type: string
  required:
    - topic
  additionalProperties: false
constraints:
  length:
    max_tokens: 1000

Available Merge Types:
- `generic_editor` - Basic single-stage merging
- `advanced_pipeline` - Multi-stage critique → merge → style → images
- `educator_handbook` - Specialized for educational content
Example Merge Type Structure:
generic_editor:
  system_prompt:
    - "You are a senior editor for a publisher..."
    - "It is your job to merge these together so that the final resulting text..."
advanced_pipeline:
  description: "Multi-stage pipeline with critique, merge, and style harmonization"
  parameters:
    top_n_variations: 3
    similarity_threshold: 0.85
  stages:
    critique:
      system_prompt:
        - "You are a senior editor evaluating content quality..."
      output_format: "json"
      scoring_instruction: "Return only JSON: {\"score\": <0-10>, ...}"
    merge:
      system_prompt:
        - "You are a consolidating editor..."
      output_format: "markdown"
    style:
      system_prompt:
        - "You are a line editor harmonizing tone..."
      output_format: "markdown"
Stage Configuration Options:
- system_prompt - Array of prompt strings
- output_format - "json" or "markdown"
- scoring_instruction - For critique stages
- parameters - Pipeline-level settings (top_n_variations, similarity_threshold)
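One plausible way the two pipeline parameters could interact is sketched below: rank variations by the critique score, then keep up to top_n_variations while skipping near-duplicates above similarity_threshold. The similarity measure here (`difflib.SequenceMatcher`) and the `select_variations` helper are assumptions made for illustration; the real pipeline may score similarity differently, for example with embeddings.

```python
import json
from difflib import SequenceMatcher


def select_variations(variations, critiques, top_n=3, threshold=0.85):
    """Keep up to top_n best-scored variations, skipping near-duplicates."""
    # Each critique is the raw JSON string returned by the critique stage.
    scores = [json.loads(raw)["score"] for raw in critiques]
    ranked = sorted(zip(scores, variations), key=lambda pair: pair[0], reverse=True)

    kept = []
    for _score, text in ranked:
        if len(kept) >= top_n:
            break
        # Character-level similarity is an assumed stand-in metric.
        is_duplicate = any(
            SequenceMatcher(None, text, other).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


variations = [
    "Plants convert light into energy.",
    "Plants convert light into energy!",
    "Roots absorb water from soil.",
]
critiques = ['{"score": 8}', '{"score": 7}', '{"score": 5}']
print(select_variations(variations, critiques))
```

Lowering the threshold treats more texts as duplicates and so keeps fewer, more distinct variations; raising it keeps texts that differ only slightly.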
File: src/config/content/prompts/playbooks.yaml
Defines interactive presets for complex multi-step workflows used in the CLI shell.
Example Playbook Structure:
literature_review:
  label: Literature Review
  description: Structured synthesis with methods appraisal and evidence-backed themes
  inputs: [] # preset-level interactive inputs
  system_prompt: |
    You are a meticulous literature-review assistant...
  stitch_final: true # Combine step outputs
  final_prompt: |
    Combine the step outputs into a cohesive literature review...
  steps:
    - name: scope
      prompt: |
        Clarify the research question(s) and implicit inclusion/exclusion criteria...
    - name: themes
      prompt: |
        Extract major themes/claims with evidence...
    - name: methods
      prompt: |
        Critically appraise methods for each study...
    - name: synthesis
      prompt: |
        Summarize agreements and disagreements...
Playbook Features:
- Interactive Inputs: User prompts for dynamic content
- Multi-step Workflows: Sequential processing stages
- Template Variables: Jinja2 templating support
- Flexible Output: JSON arrays or final synthesis
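The step and stitching behaviour can be pictured with the following sketch. It assumes a playbook has already been loaded into a dictionary with the keys shown in the example structure, and it takes the model call as a plain function argument. `run_playbook` and `fake_llm` are hypothetical names, and the way step outputs are joined before the final prompt is a guess.

```python
def run_playbook(playbook, call_llm):
    """Run each step in order; optionally stitch the outputs with final_prompt."""
    system_prompt = playbook.get("system_prompt", "")
    outputs = []
    for step in playbook["steps"]:
        text = call_llm(system_prompt, step["prompt"])
        outputs.append({"name": step["name"], "output": text})

    if playbook.get("stitch_final"):
        # Assumed format: label each step output, then append to final_prompt.
        combined = "\n\n".join(f"## {o['name']}\n{o['output']}" for o in outputs)
        return call_llm(system_prompt, f"{playbook['final_prompt']}\n\n{combined}")
    return outputs  # one entry per step, suitable for a JSON array


def fake_llm(system_prompt, user_prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return f"[answer to {len(user_prompt)} chars of prompt]"


playbook = {
    "system_prompt": "You are a meticulous literature-review assistant...",
    "stitch_final": False,
    "final_prompt": "Combine the step outputs into a cohesive literature review...",
    "steps": [
        {"name": "scope", "prompt": "Clarify the research question(s)..."},
        {"name": "themes", "prompt": "Extract major themes/claims with evidence..."},
    ],
}
print(run_playbook(playbook, fake_llm))
```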
Input Types:
inputs:
  - name: audience
    prompt: Primary audience?
    default: general
    choices: [general, policy, practitioners]
  - name: styles
    prompt: List 3 styles (comma-separated)
    default: modern retelling, mythic high-fantasy
    multi: true # Allow multiple values
  - name: target_length
    prompt: Target length (words)
    default: 450
    type: int # Type validation
File: src/config/content/prompts/templates.md
Provides structural templates for different output formats:
Literature Review Template:
- **Research Question**: ...
- **Scope/Inclusion**: ...
- **Themes**
  - Theme A → evidence (pages)
  - Theme B → evidence (pages)
- **Methods Appraisal**
  - Source → strengths/limitations
- **Synthesis**
  - Agreements / Disagreements
  - Gaps & Future Work
Science Journalism Template:
- **Headline**: ...
- **Dek**: ...
- **What's New**
- **Why It Matters**
- **Evidence (plain-language)**
- **Caveats**
- **Quote(s)** (with page cites)
- **How Solid Is The Evidence?** (1–5)
- Create new YAML file in src/config/content/prompts/content_types/
- Define system prompt and job generation templates
- Use template variables for dynamic content
- Test with make lc-batch CONTENT_TYPE=your_type
- Add new entry to src/config/content/prompts/merge_types.yaml
- Define stages with system prompts and output formats
- Configure parameters for advanced pipelines
- Test with python src/langchain/lc_merge_runner.py
- Add new entry to src/config/content/prompts/playbooks.yaml
- Define interactive inputs and step workflows
- Use Jinja2 templating for dynamic content
- Test with python -m src.cli.shell --preset your_preset
{
"title": "Book Title",
"metadata": {
"author": "Author Name",
"version": "1.0",
"target_audience": "Target audience",
"word_count_target": 100000,
"created_date": "2025-08-28",
"description": "Book description"
},
"sections": [
{
"subsection_id": "1A1",
"title": "Section Title",
"job_file": "data_jobs/1A1.jsonl",
"batch_params": {
"key": "collection_name",
"k": 5
},
"merge_params": {
"key": "collection_name",
"k": 3
},
"dependencies": ["parent_section_id"]
}
]
}
Use hierarchical naming for a 4-level deep structure:
- Level 1: Chapter (1, 2, 3...)
- Level 2: Major section (A, B, C...)
- Level 3: Subsection (1, 2, 3...)
- Level 4: Sub-subsection (a, b, c...)
Examples: 1A1, 2B3, 3C2a
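A small parser makes the four levels explicit. This sketch assumes the first three levels are always present and the fourth is optional, which matches the examples above; `parse_subsection_id` is a hypothetical helper, not a function from the codebase.

```python
import re

# chapter digits, section letter, subsection digits, optional sub-subsection letter
SUBSECTION_ID = re.compile(r"^(\d+)([A-Z])(\d+)([a-z])?$")


def parse_subsection_id(subsection_id: str) -> dict:
    """Split an ID such as '3C2a' into its hierarchy levels."""
    match = SUBSECTION_ID.match(subsection_id)
    if match is None:
        raise ValueError(f"not a valid subsection ID: {subsection_id!r}")
    chapter, section, subsection, sub_subsection = match.groups()
    return {
        "chapter": int(chapter),
        "section": section,
        "subsection": int(subsection),
        "sub_subsection": sub_subsection,  # None for 3-level IDs like '1A1'
    }


for example in ("1A1", "2B3", "3C2a"):
    print(example, parse_subsection_id(example))
```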
Each line is a JSON object with hierarchical context:
{
"task": "system prompt with book context",
"instruction": "specific instruction with hierarchical positioning",
"context": {
"book_title": "Book Title",
"chapter": "Chapter X",
"section": "Section Y",
"subsection": "Subsection Z",
"subsection_id": "XYZ",
"target_audience": "target audience"
}
}
Context Fields:
- book_title: Full book title for context
- chapter: Chapter identifier (e.g., "Chapter 1")
- section: Section identifier (e.g., "Section A")
- subsection: Subsection identifier (e.g., "Subsection 1")
- subsection_id: Hierarchical ID (e.g., "1A1")
- target_audience: Who the content is for
See examples/sample_jobs_1A1.jsonl, examples/sample_jobs_1B1.jsonl, and examples/sample_jobs_2A1.jsonl for complete examples showing different hierarchical contexts.
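Before running a batch, it can help to check that every line of a job file parses and carries the expected keys. The sketch below is one way to do that under the assumption that task and instruction are required while context is optional; `load_jobs` is a hypothetical helper, not part of lc_batch.py.

```python
import json

REQUIRED_KEYS = ("task", "instruction")


def load_jobs(lines):
    """Parse JSONL job lines, checking that each job has the required keys."""
    jobs = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between jobs
        job = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in job]
        if missing:
            raise ValueError(f"line {number}: missing {', '.join(missing)}")
        job.setdefault("context", {})  # assumed optional
        jobs.append(job)
    return jobs


sample = [
    '{"task": "system prompt", "instruction": "Generate introduction"}',
    '{"task": "system prompt", "instruction": "Generate examples", '
    '"context": {"subsection_id": "1A1"}}',
]
for job in load_jobs(sample):
    print(job["instruction"], job["context"])
```

Because `load_jobs` accepts any iterable of lines, an open file handle can be passed directly, for example `with open("data_jobs/1A1.jsonl") as fh: jobs = load_jobs(fh)`.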
Purpose: View and analyze generated content.
Key Features:
- Content browsing
- Quality assessment
- Export functionality
Purpose: Clean and preprocess source documents.
Key Features:
- Document normalization
- Metadata extraction
- Quality filtering
# Generate a single piece of content
python src/langchain/lc_ask.py ask \
--content-type pure_research \
--task "Explain quantum computing to a beginner"
Create jobs.jsonl:
{"task": "system prompt", "instruction": "Generate introduction"}
{"task": "system prompt", "instruction": "Generate examples"}
{"task": "system prompt", "instruction": "Generate conclusion"}
python src/langchain/lc_batch.py --jobs jobs.jsonl
# Use advanced pipeline for educational content
python src/langchain/lc_merge_runner.py --sub 1A1
Add to merge_types.yaml:
technical_docs:
  description: "Optimized for technical documentation"
  stages:
    critique:
      system_prompt: "You are a technical editor..."
    merge:
      system_prompt: "You are a technical writer consolidating docs..."
Step 1: Generate Intelligent Outline
python src/langchain/lc_outline_generator.py
Interactively collects book details and generates AI-powered outline
Step 2: Generate Complete Book
python src/langchain/lc_book_runner.py --book outlines/book_structures/my_book_outline.json
Orchestrates the complete content generation pipeline
Step 1: Convert Outline to Book Structure
# Convert markdown outline
python src/langchain/lc_outline_converter.py --outline examples/sample_outline_markdown.md
# Convert text outline
python src/langchain/lc_outline_converter.py --outline examples/sample_outline_text.txt
# Convert with custom metadata
python src/langchain/lc_outline_converter.py --outline my_outline.md \
--title "My Custom Book Title" \
--topic "Data Science" \
--audience "Data Scientists" \
--wordcount 75000
Step 2: Generate Complete Book
python src/langchain/lc_book_runner.py --book outlines/converted_structures/converted_book_structure.json
Create a book structure file:
{
"title": "Professional Development Handbook",
"metadata": {
"author": "AI Content Generator",
"target_audience": "Primary school teachers",
"word_count_target": 100000
},
"sections": [
{
"subsection_id": "1A1",
"title": "Understanding Modern Learning Theories",
"job_file": "data_jobs/1A1.jsonl",
"batch_params": {"key": "education", "k": 5},
"merge_params": {"key": "education", "k": 3},
"dependencies": []
}
]
}
Generate the complete book:
python src/langchain/lc_book_runner.py --book book_structure.json
The book runner automatically embeds hierarchical context into every job file, eliminating the need for manual context entry in merge scripts:
Generated Job Structure:
{
"task": "Contextual system prompt with book title and audience",
"instruction": "Instruction with hierarchical positioning",
"context": {
"book_title": "Professional Development Handbook",
"chapter": "Chapter 1",
"section": "Section A",
"subsection": "Subsection 1",
"subsection_id": "1A1",
"target_audience": "primary school teachers"
}
}
- No Manual Input: Merge scripts get context automatically
- Consistent Positioning: All content knows its place in the hierarchy
- Audience Awareness: Content tailored to target readers
- Scalable: Works for books of any size and complexity
- Automated: Context generation happens during job file creation
- Book Runner parses subsection ID (e.g., "1A1")
- Automatically generates hierarchical context
- Embeds context in job file during generation
- Batch processing uses contextualized jobs
- Merge processing receives properly positioned content
- Final aggregation maintains hierarchical structure
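The first three steps of that flow can be approximated as below: derive the context block from a subsection ID and embed it in a job line. This is an illustrative sketch; `build_context` is a hypothetical name, and the label wording ("Chapter 1", "Section A") simply follows the generated job structure shown earlier.

```python
import json
import re


def build_context(subsection_id, book_title, target_audience):
    """Derive the hierarchical context block for one subsection ID."""
    match = re.fullmatch(r"(\d+)([A-Z])(\d+)[a-z]?", subsection_id)
    if match is None:
        raise ValueError(f"not a valid subsection ID: {subsection_id!r}")
    chapter, section, subsection = match.groups()
    return {
        "book_title": book_title,
        "chapter": f"Chapter {chapter}",
        "section": f"Section {section}",
        "subsection": f"Subsection {subsection}",
        "subsection_id": subsection_id,
        "target_audience": target_audience,
    }


job = {
    "task": "Contextual system prompt with book title and audience",
    "instruction": "Instruction with hierarchical positioning",
    "context": build_context(
        "1A1", "Professional Development Handbook", "primary school teachers"
    ),
}
# One JSON object per line, as expected in a .jsonl job file.
print(json.dumps(job))
```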
- Simple single-stage merging
- Basic content consolidation
- Fast processing
- Multi-stage processing
- AI-powered critique and scoring
- Intelligent de-duplication
- Style harmonization
- Optional image suggestions
- Specialized for educational content
- Teacher-focused language
- Classroom utility emphasis
- PD handbook optimization
- YAML-driven configuration
- Domain-specific prompts
- Custom processing stages
- Flexible parameters
Run the full system in a Docker container without needing a local Python setup.
docker build -t rag-writer:latest .
Or with Docker Compose:
docker compose build
Examples below assume you're in the project root. Mounting ./ into /app keeps your data and outputs on the host.
# Show CLI help (default CMD)
docker run --rm -it \
-v "$PWD":/app \
-e OPENAI_API_KEY=sk-... \
rag-writer:latest --help
# Build FAISS index from PDFs in ./data_raw (mount this dir with your PDFs)
docker run --rm -it \
-v "$PWD":/app \
-e RAG_KEY=science \
rag-writer:latest python src/langchain/lc_build_index.py
# Ask a question using the Typer CLI
docker run --rm -it \
-v "$PWD":/app \
-e OPENAI_API_KEY=sk-... \
-e RAG_KEY=science \
rag-writer:latest ask "What is machine learning?"
# Interactive shell (optional)
docker run --rm -it \
-v "$PWD":/app \
-e OPENAI_API_KEY=sk-... \
rag-writer:latest shell
Tip: To persist model downloads across runs, mount a cache:
docker run --rm -it \
-v "$PWD":/app \
-v hf-cache:/root/.cache/huggingface \
-e OPENAI_API_KEY=sk-... \
rag-writer:latest ask "Explain neural networks"
compose.yaml ships with sensible defaults:
# With OPENAI_API_KEY exported in your shell
export OPENAI_API_KEY=sk-...
# Show help
docker compose run --rm rag-writer --help
# Build index
docker compose run --rm rag-writer python src/langchain/lc_build_index.py
# Ask
docker compose run --rm rag-writer ask "What is machine learning?"
- Data and outputs are in project subfolders: data_raw, data_processed, storage, output, exports.
- Set RAG_KEY to switch collections (defaults to "default").
- If you prefer the Makefile workflows, run them inside the container shell and call the Python scripts directly (the Makefile's venv targets are designed for host use).
- Some first-time runs will download models (HuggingFace). Use the provided cache volume to avoid repeated downloads.
- Entrypoint shortcuts: ask runs the Typer CLI; shell starts the interactive REPL; any other command is executed verbatim (e.g., python -m src.cli.shell).
- data_raw/: Place source PDFs and documents to ingest.
- data_processed/: Extracted chunks and intermediate artifacts.
- storage/: Vector stores (e.g., FAISS), per collection key.
- output/, exports/: Generated content and final artifacts.
- outlines/, data_jobs/: Outline and job files for the book pipeline.
- The image includes sops and jq. If /app/env.json exists and is decryptable (via AWS KMS, GCP KMS, or PGP), the entrypoint auto-loads its values into the environment before running your command.
- Keep your existing PGP recipient for local use; add cloud KMS recipients to .sops.yaml for production. See docs/sops_kms_examples.md.
# Rewrap env.json with current recipients from .sops.yaml
make sops-updatekeys [FILE=env.json]
# Decrypt to stdout (or redirect)
make sops-decrypt [FILE=env.json] > /tmp/env.json
# Print export lines for env.json (use with eval in your shell)
eval "$(make sops-env-export [FILE=env.json])"
See also: docs/ci_sops_rewrap_example.yml for a GitHub Actions example to automate rewrapping.
# Build image
make docker-build [DOCKER_IMAGE=rag-writer:latest]
# Ask via Docker
make docker-ask "What is machine learning?" KEY=science
# Index via Docker
make docker-index KEY=science
# Compose variants
make compose-build
make compose-ask "What is machine learning?" KEY=science
make compose-index KEY=science
make compose-shell
# Run full book pipeline
make docker-book-runner BOOK=outlines/converted_structures/my_book.json OUTPUT=exports/books/my_book.md
make compose-book-runner BOOK=outlines/converted_structures/my_book.json OUTPUT=exports/books/my_book.md
# Index maintenance
make clean-faiss KEY=your_key # remove FAISS dirs for key
make clean-shards KEY=your_key EMB=BAAI/bge-small-en-v1.5 # remove FAISS shards for model
make reindex KEY=your_key # clean + rebuild FAISS for key
make repack-faiss KEY=your_key EMBED_MODEL=BAAI/bge-small-en-v1.5 # salvage old index
# Metadata scanning (pre-alpha)
make scan-metadata DIR=data_raw WRITE=1 RENAME=yes SKIP_EXISTING=1
# Collector UI (import HTML/XML, export links)
make collector-ui
The Docker image uses a multi-stage build to speed up development rebuilds:
- base-sys: OS deps (build tools, curl, jq, ca-certificates, libgomp1) + sops binary
- py-deps: Python dependencies from requirements.txt
- runner: App source + entrypoint
Only the runner layer changes when you edit source files, so rebuilds are much faster.
Common workflows:
# Seed base layers (do this after changing system or Python deps)
make docker-build-base
# Regular dev cycle (code changes only)
make docker-build # or: docker compose build
# Compose variant for base layers
make compose-build-base
To use prebuilt base layers across machines/CI, you can tag and push the base stages to a registry and update Dockerfile FROM references if you want to pin them.
- Use Docker BuildKit + Buildx to push/pull cache to a registry for fast CI builds.
- Example GitHub Actions workflow is provided at docs/ci_build_cache_example.yml.
- Typical flow:
  - Cache base-sys and py-deps stages to registry
  - Build the runner stage with --cache-from pointing at those cache refs
  - Push final image to your registry (e.g., GHCR, ECR, GCR)
If you use Podman in CI, layer caching is local by default; pushing prebuilt base stage images to your registry can still improve cold-start builds.
# Run batch processing first
python src/langchain/lc_batch.py --jobs your_jobs.jsonl
# Check file path and permissions
ls -la data_jobs/
python src/langchain/lc_merge_runner.py --jobs data_jobs/your_file.jsonl
# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('src/config/content/prompts/merge_types.yaml'))"
# Check environment configuration
cat env.json
# Ensure API keys are set and valid
If you see "attempted relative import with no known parent package", run modules in package mode:
python -m src.cli.commands ask "What is machine learning?"
python -m src.cli.shell
Or use the Docker/Compose entrypoint shortcuts:
docker compose run --rm rag-writer ask "What is machine learning?"
docker compose run --rm rag-writer shell
Enable verbose logging:
export PYTHONPATH=src
python -c "import logging; logging.basicConfig(level=logging.DEBUG)"
python src/langchain/lc_merge_runner.py --sub 1A1
- Use --parallel in batch processing for faster execution (defaults driven by RAG_PARALLEL_WORKERS or CPU count)
- Adjust similarity_threshold in pipeline config for different de-duplication levels
- Configure appropriate top_n_variations based on content complexity
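The worker-count fallback mentioned in the first tip might look like the sketch below: prefer RAG_PARALLEL_WORKERS when it holds a positive integer, otherwise use the CPU count. `resolve_workers` and `run_jobs` are hypothetical helpers, and the use of a thread pool is an assumption; the actual batch script may parallelize differently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(env=None):
    """Pick a worker count from RAG_PARALLEL_WORKERS, else the CPU count."""
    env = os.environ if env is None else env
    raw = env.get("RAG_PARALLEL_WORKERS", "").strip()
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return os.cpu_count() or 1  # cpu_count() can return None


def run_jobs(jobs, handler, workers=None):
    """Apply handler to every job in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers or resolve_workers()) as pool:
        return list(pool.map(handler, jobs))


print(resolve_workers({"RAG_PARALLEL_WORKERS": "4"}))
print(run_jobs(["intro", "examples", "conclusion"], str.upper, workers=2))
```

Threads suit this kind of workload because each job mostly waits on a network call to the model provider rather than on local CPU work.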
- Add configuration to src/config/content/prompts/merge_types.yaml
- Test with sample content
- Update documentation
- Add configuration to src/config/content/prompts/content_types.yaml
- Test with lc_ask.py
- Document usage examples
- Follow PEP 8 guidelines
- Add docstrings to new functions
- Include error handling
- Update tests for new functionality
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with LangChain for LLM integration
- Uses Rich for beautiful terminal interfaces
- Inspired by advanced content processing workflows
- Designed for educational and professional content creation