This repository contains a Python extraction pipeline for source documents used by the essay writer system. It currently supports:
- text-native PDFs
- OCR extraction for PDFs
- modern Word `.docx` files
pypdf is distributed under a permissive BSD-style license, which is commonly
compatible with both open-source and closed-source projects.
```shell
pip install -e .
```

For development and tests:

```shell
pip install -e ".[dev]"
```

Install optional OCR extras as needed:

```shell
pip install -e ".[ocr-small]"                 # Tesseract tier
pip install -e ".[ocr-medium]"                # EasyOCR tier
pip install -e ".[ocr-high]"                  # PaddleOCR tier
pip install -e ".[ocr-small,ocr-scheduler]"   # Tesseract + parallel scheduler
```

Install the web API dependencies when running the local app:

```shell
pip install -e ".[web]"
```

Install the Vite frontend dependencies:
```shell
cd frontend
npm install
```

Run the API from the repository root:

```shell
uvicorn backend.app:app --host 127.0.0.1 --port 8629 --reload
```

Run the frontend in another terminal:

```shell
cd frontend
npm run dev
```

The frontend runs at http://127.0.0.1:3527 by default and proxies /api
requests to http://127.0.0.1:8629. Vite preview uses
http://127.0.0.1:4627.
Agent Tool Mode exposes the essay workflow as local MCP tools for harnesses such as Claude Code and Codex. In this mode, the app does not make hidden LLM API calls for reasoning stages. The harness reads prepared work packets, produces JSON with its own model, and commits validated artifacts back to the app.
Install optional dependencies:
```shell
pip install -e ".[agent-tools]"
```

Run the MCP server:

```shell
ESSAY_DATA_DIR=./data python -m essay_writer.agent_tools.server
```

See docs/agent-tool-mode-mcp.md and .mcp.example.json.
The app supports source uploads for .pdf, .docx, .txt, .md,
.markdown, and .notes files. Assignment text can be pasted or extracted
from the same document types.
Source access budgets default high enough for broad research passes while still preventing one model request from reading an entire book:
```shell
ESSAY_MAX_RESEARCH_ROUNDS=3
ESSAY_MAX_SOURCE_PACKETS=40
ESSAY_MAX_TOTAL_SOURCE_CHARS=200000
ESSAY_MAX_PDF_PAGES_PER_REQUEST=80
ESSAY_MAX_PDF_PAGES_TOTAL=240
ESSAY_MAX_CHARS_PER_PACKET=50000
ESSAY_OVERSIZED_SOURCE_REQUEST_POLICY=reject
ESSAY_LAZY_PDF_OCR_ENABLED=true
ESSAY_LAZY_OCR_TIER=small
ESSAY_LAZY_OCR_DPI=300
ESSAY_LAZY_OCR_LANGUAGES=en
```

Per-stage model and token budgets can be configured through settings or env
vars. The backend resolves models in this order: per-stage Settings override,
per-stage env var, Settings default model, `LLM_MODEL`, then the adapter
default.
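The documented precedence can be sketched as a simple fallback chain. This is an illustrative sketch only: the function name and argument shapes are hypothetical, not the backend's actual API; only the ordering follows the README.

```python
# Hypothetical sketch of the documented model-resolution order:
# per-stage Settings override -> per-stage env var -> Settings default
# model -> LLM_MODEL -> adapter default.
def resolve_model(stage, stage_overrides, env, settings_default, adapter_default):
    if stage_overrides.get(stage):
        return stage_overrides[stage]
    env_key = f"ESSAY_MODEL_{stage.upper()}"   # e.g. ESSAY_MODEL_DRAFTING
    if env.get(env_key):
        return env[env_key]
    if settings_default:
        return settings_default
    if env.get("LLM_MODEL"):
        return env["LLM_MODEL"]
    return adapter_default
```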
```shell
ESSAY_MODEL_TASK_SPEC=
ESSAY_MODEL_SOURCE_CARD=
ESSAY_MODEL_TOPIC_IDEATION=
ESSAY_MODEL_RESEARCH=
ESSAY_MODEL_OUTLINING=
ESSAY_MODEL_DRAFTING=
ESSAY_MODEL_DRAFTING_REVISION=
ESSAY_MODEL_DRAFTING_STYLE=
ESSAY_MODEL_VALIDATION=
ESSAY_MAX_TOKENS_TASK_SPEC=4096
ESSAY_MAX_TOKENS_SOURCE_CARD=2500
ESSAY_MAX_TOKENS_TOPIC_IDEATION=5000
ESSAY_MAX_TOKENS_RESEARCH=8000
ESSAY_MAX_TOKENS_OUTLINING=6000
ESSAY_MAX_TOKENS_DRAFTING=8000
ESSAY_MAX_TOKENS_DRAFTING_REVISION=8000
ESSAY_MAX_TOKENS_DRAFTING_STYLE=8000
ESSAY_MAX_TOKENS_VALIDATION=4000
```

PDF retrieval uses physical 1-based PDF page numbers for source access. Printed page labels are stored separately when PDF metadata exposes them, and are used for traceability rather than as the primary retrieval coordinate.
The web app is an essay-writing workflow built on top of the document extraction pipeline. The orchestrator owns workflow state. Each stage persists its inputs, outputs, prompt version, and validation results. Human approval gates and manual revision loops are first-class steps, not escape hatches.
assignment text + uploaded sources
|
v
+---------------------------------------+
| 1. Source Ingestion | parses PDF/DOCX/TXT/MD/Notes,
| - per-source artifacts | page/section units, lazy OCR,
| - source cards (LLM) | source cards.
+---------------------------------------+
|
v
+---------------------------------------+
| 2. Assignment Parsing (LLM) | -> blocking_questions or
| -> TaskSpecification | adversarial flags can
+---------------------------------------+ mark job blocked here
|
v
+-----------------+
| 3. Job Created |
+-----------------+
|
v
+---------------------------------------+
| 4. Topic Ideation (LLM) | <----+
| candidate topics with leads | |
+---------------------------------------+ |
| |
v |
*** HUMAN GATE: Topic Selection *** |
select | reject + reasons | revise -----+ (next round
| uses rejected
v topics + user
+---------------------------------------+ instruction)
| 5. Research Planning (deterministic) |
| validates source_requests, |
| bounds, page ranges, sections |
+---------------------------------------+
|
v
+---------------------------------------+
| 6. Source Resolution | request -> SourceTextPacket
| pdf_pages | section | search | | (lazy per-page OCR if
| chunk locators | pages have low/empty text)
+---------------------------------------+
|
v
+---------------------------------------+
| 7. Final Topic Research (LLM) | evidence map: notes,
| grounded notes, quotes verified | groups, gaps, conflicts
| against source text |
+---------------------------------------+
|
v
+---------------------------------------+
| 8. Outline (LLM) | thesis, sections,
| section -> note_id mapping | word budgets, claims
+---------------------------------------+
|
v
+---------------------------------------+
| 9. Drafting (LLM) | draft v1 with
| anti-AI skill in system prompt | section_source_map +
| evidence + outline + packets | bibliography candidates
+---------------------------------------+
|
v
+---------------------------------------+
| 10. Final Style Pass (LLM, optional) | prose-only rewrite,
| facts/citations frozen | emits next draft version
+---------------------------------------+
|
v
+---------------------------------------+
| 11. Validation | deterministic style checks
| deterministic checks + | + LLM judgment
| LLM judgment + tone alignment | -> structured diagnostics
+---------------------------------------+
|
passes? | requires_revision?
+-----------------+-------------------+
| |
| yes | no
v v
+------------------------+ +----------------------------------+
| 12a. System Revision | | *** HUMAN GATE: Review Draft *** |
| (LLM) | | review_only | revise | edit text |
| prior draft + report | +----------------------------------+
| + evidence + packets | |
| -> next draft version | +----------------+----------------+
+------------------------+ | | |
| v v v
| (no change) (LLM revision) (user edit
| | saved as
| v new draft
+-----------------------------+ | version)
| | |
v v v
+------------------------------------+
| 13. Final Export (Markdown) |
| content + bibliography + |
| section source map + summary |
+------------------------------------+
Linear shorthand of the happy path:
```
assignment + uploaded sources
-> source ingestion + source cards
-> task specification (LLM)
-> topic ideation (LLM) <-> [HUMAN] topic selection / rejection rounds
-> research planning (deterministic validation)
-> source access resolution (with lazy OCR)
-> final topic research (LLM)
-> outline (LLM)
-> draft v1 (LLM)
-> optional final style pass (LLM) -> draft v2
-> validation (deterministic + LLM + tone alignment)
-> [HUMAN] review draft (review_only | LLM revise | direct text edit)
-> optional system revision loop (LLM)
-> final Markdown export
```
The workflow exposes four explicit human-in-the-loop touchpoints. Each one produces or consumes durable artifacts so the orchestrator can resume after a restart.
After topic ideation, the user must select one candidate before the workflow
proceeds. The user can also reject directions with a written reason. Rejected
topics are persisted (TopicRoundStore.save_rejected_topic) and replayed into
later ideation rounds along with an optional user instruction so the model
avoids repeating discarded directions. Round numbers and topic IDs are stable
across rounds.
ideation round N -> [HUMAN] select | reject(reason) | request another round
|
v
round N+1 prompt includes previous candidates
+ rejected topics + user instruction
The workflow can mark a job blocked (with a user-facing EssayJobErrorState)
when:
- the task spec contains `blocking_questions` or unresolved prompt-option ambiguity
- evidence sufficiency falls below the configured threshold during research
- adversarial AI-directed instructions are detected in the assignment
The job remains blocked until the user resolves the issue (e.g. supplies a
clarification, picks a prompt option, or uploads more sources). Resolution
creates a new artifact version and the workflow resumes from the affected stage.
Any draft version can be edited directly by the user. The endpoint
POST /jobs/{job_id}/drafts/save-user-edit writes the edited text as a new
EssayDraft with:
- `origin = "user_edit"`
- `created_by = "user"`
- `parent_draft_id` linking to the prior version
- `parent_export_id` set when the edit was applied to an exported draft
User edits become the new latest draft and are eligible inputs for the next manual or system revision pass. Each edit is its own immutable version, so edit history is fully traceable.
Manual revision is a structured re-pass over a chosen draft. Triggered via
POST /jobs/{job_id}/manual-revision-runs, it accepts:
- `source_draft_id`: which version to revise from
- `mode`: `review_only` (no rewrite, just reports) or `revise` (LLM rewrite)
- `instruction`: optional free-text user direction
- `selected_lenses`: any subset of `evidence`, `citations`, `assignment_fit`, `length`, `tone`, `anti_ai`
Each run produces a ManualRevisionRun with:
- `pre_revision_validation` / `post_revision_validation`
- `pre_revision_tone_alignment` / `post_revision_tone_alignment`
- `pre_revision_anti_ai` / `post_revision_anti_ai`
- `change_summary` and `warnings`
- `result_draft_id` and `status` (`completed` | `failed`)
The pre/post reports let the UI show a side-by-side delta so the user can see
exactly what the lens-based pass changed. In revise mode the result is saved
as a new draft version with origin = "manual_llm_revision" and
manual_request_id/user_instruction/selected_lenses recorded on the draft
itself.
Drafts form a directed acyclic version graph through parent_draft_id. The
origin field records how each version came into existence:
| Origin | Created by | Triggered by |
|---|---|---|
| `generated` | system | drafting stage (draft v1) |
| `style_revision` | system | optional final style pass |
| `system_revision` | system | post-validation revision loop |
| `manual_llm_revision` | system | manual revision run in revise mode |
| `user_edit` | user | direct text edit via `save_user_edit` |
GET /jobs/{job_id}/drafts returns the full version list with previews and
lineage so the UI can render a revision timeline. The export step always reads
the latest validated draft.
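Because every draft records its parent, the revision timeline can be rebuilt by walking the links. The sketch below is illustrative only: it assumes drafts are plain dicts with `id` and `parent_draft_id` keys, which may not match the real `EssayDraft` model or the `/drafts` payload.

```python
# Illustrative sketch: rebuild a revision timeline from parent_draft_id links.
def lineage(drafts, draft_id):
    """Return draft IDs from the root version (draft v1) up to draft_id."""
    by_id = {d["id"]: d for d in drafts}
    chain, current = [], by_id.get(draft_id)
    while current is not None:
        chain.append(current["id"])
        parent = current.get("parent_draft_id")
        current = by_id.get(parent) if parent else None
    return list(reversed(chain))  # root first
```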
Most workflow steps are deterministic application code. LLM calls happen only
where a configured LLMClient is passed into the service.
| Step | Uses LLM? | Prompt/version | What the model receives |
|---|---|---|---|
| Source ingestion | Yes for source cards | `SOURCE_CARD_SYSTEM_PROMPT` | Source metadata plus selected uploaded-source excerpts. Missing LLM configuration raises an error. PDF OCR itself is not an LLM call. |
| Source maps and source access | No | None | The app builds maps, resolves locators, runs SQLite search, and may run OCR locally. |
| Assignment parsing | Yes | `task-spec-v1` / `TASK_SPEC_SYSTEM_PROMPT` | Raw assignment text. Missing LLM configuration raises an error. |
| Job creation | No | None | The app links task spec, source IDs, and workflow state. |
| Topic ideation | Yes | `topic-ideation-v1` / `TOPIC_IDEATION_SYSTEM_PROMPT` | Task spec, source cards, source maps, index manifests, previous candidates, rejected topics, and optional user instruction. |
| Topic selection | No | None | User action stored by the app. |
| Research planning | No today | `research-planning-v1` metadata only | The app deterministically validates selected-topic source_requests, source IDs, page ranges, sections, chunks, and budgets. No model call is made. |
| Source resolution | No | None | The app resolves validated requests into SourceTextPacket objects and may run lazy per-page OCR locally. |
| Final topic research | Yes | `final-topic-research-v1` / `FINAL_TOPIC_RESEARCH_SYSTEM_PROMPT` | Task spec, selected topic, resolved source packets, and legacy chunks when present. |
| Outlining | Yes | `thesis-outline-v1` / `OUTLINE_SYSTEM_PROMPT` | Task spec, selected topic, research plan, evidence map, and full source-packet text/metadata. Missing LLM configuration raises an error. |
| Drafting | Yes | `drafting-v1` / `DRAFTING_SYSTEM_PROMPT` | Task spec, selected topic, evidence map, outline, and resolved source packets/excerpts. |
| Validation | Yes | `validation-v1` / `VALIDATION_SYSTEM_PROMPT` | Draft, task spec, evidence notes, bibliography candidates, source-card metadata, metadata warnings, and deterministic style findings. Returns structured diagnostics rather than prose rewrite advice. |
| Revision | Yes | `drafting-revision-v1` / `DRAFTING_SYSTEM_PROMPT` | Prior draft, structured validation diagnostics, task spec, selected topic, outline, evidence map, and resolved source packets/excerpts. |
| Final style pass | Yes | `drafting-style-revision-v1` / `STYLE_REVISION_SYSTEM_PROMPT` | Latest draft, task spec, outline, evidence map, deterministic style findings, anti-AI skill document, and source packets. Preserves facts/citations while revising prose shape. |
| Export | No | None | The app writes final Markdown from stored draft and validation data. |
The app uses structured JSON prompts. Each LLM call sends:
- a system prompt constant
- a JSON user payload built by the service
- a JSON schema that constrains the response shape
- an optional per-stage model override from settings/env
- File: `essay_writer/task_spec/prompts.py`
- System prompt: `TASK_SPEC_SYSTEM_PROMPT`
- User payload: `build_task_spec_user_message(raw_text)`
- Output schema: `TASK_SPEC_SCHEMA`
- Stored version: `task-spec-v1`
Purpose:
- Extract assignment requirements from untrusted assignment documents.
- Preserve details, classify real student-facing requirements, detect adversarial instructions, and avoid treating AI-directed instructions as checklist requirements.
- File: `essay_writer/sources/summary.py`
- System prompt: `SOURCE_CARD_SYSTEM_PROMPT`
- User payload: `_build_source_card_user_message(source, excerpts, summary_char_limit)`
- Output schema: `SOURCE_CARD_SCHEMA`
- Stored version: none currently on `SourceCard`
Purpose:
- Create a compact card from uploaded-source excerpts only.
- Summarize key topics, topic-ideation usefulness, notable sections, limitations, citation metadata, and warnings.
- Files: `essay_writer/topic_ideation/prompts.py`, `essay_writer/topic_ideation/service.py`
- System prompt: `TOPIC_IDEATION_SYSTEM_PROMPT`
- User payload: `_build_user_message(context, max_candidates)`
- Output schema: `TOPIC_IDEATION_SCHEMA`
- Stored version: `topic-ideation-v1`
Purpose:
- Generate source-grounded candidate essay topics.
- Use source cards, source maps, and index manifests.
- Prefer `source_requests` using physical PDF pages or section IDs, with chunk/search leads as backward-compatible fallbacks.
- Files: `essay_writer/research/prompts.py`, `essay_writer/research/service.py`
- System prompt: `FINAL_TOPIC_RESEARCH_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `research/service.py`
- Output schema: `FINAL_TOPIC_RESEARCH_SCHEMA`
- Stored version: `final-topic-research-v1`
Purpose:
- Extract source-grounded research notes from resolved source packets/chunks.
- Prevent invented sources, page numbers, facts, and quotes.
- Build notes, evidence groups, gaps, conflicts, and warnings.
- File: `essay_writer/outlining/service.py`
- System prompt: `OUTLINE_SYSTEM_PROMPT`
- User payload: `_build_outline_user_message(...)`
- Output schema: `OUTLINE_SCHEMA`
- Stored version: `thesis-outline-v1`
Purpose:
- Create a detailed, source-grounded essay outline.
- Carry the core argument through thesis, section purposes, claims, evidence placement, counterarguments, and word-budget priorities.
- Preserve traceability through note IDs and source packet IDs.
- Receive source-packet locator metadata, PDF page ranges, printed page labels, heading paths, extraction methods, text quality, warnings, and text so the outline can plan from concrete source information.
- Apply structural humanization guidance: avoid uniform section weights, three parallel body sections, and perfectly balanced source treatment unless the assignment requires them.
- Files: `essay_writer/drafting/prompts.py`, `essay_writer/drafting/service.py`, `anti-ai-detection-SKILL.md`
- System prompt: `DRAFTING_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `drafting/service.py`
- Output schema: `DRAFTING_SCHEMA`
- Stored version: `drafting-v1`
Purpose:
- Write an academic essay draft from task spec, selected topic, evidence map, outline, and resolved source packets.
- Use only evidence-map notes and supplied source packets, record section-to-note/source mappings, and report weak spots instead of fabricating support.
- Use source packet text for accurate source detail, quotes, and citations.
- Include the full local `anti-ai-detection-SKILL.md` document directly in the system prompt and apply it during drafting, not as a cleanup pass.
- Files: `essay_writer/drafting/prompts.py`, `essay_writer/drafting/revision.py`, `anti-ai-detection-SKILL.md`
- System prompt: `DRAFTING_SYSTEM_PROMPT`
- User payload: `_build_revision_message(...)`
- Output schema: `DRAFTING_SCHEMA`
- Stored version: `drafting-revision-v1`
Purpose:
- Revise the prior draft using validation feedback while keeping every claim grounded in the supplied evidence.
- Reuses the same drafting system prompt and schema, but the user payload adds previous draft content, structured validation diagnostics, legacy revision suggestions when present, and weak spots.
- Receives the resolved source packets again, so revision can correct grounding issues against the actual excerpts instead of only the distilled evidence map.
- Because it reuses `DRAFTING_SYSTEM_PROMPT`, it also includes the full `anti-ai-detection-SKILL.md` document directly.
- Files: `essay_writer/validation/prompts.py`, `essay_writer/validation/service.py`
- System prompt: `VALIDATION_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `validation/service.py`
- Output schema: `VALIDATION_SCHEMA`
- Stored version: `validation-v1`
Purpose:
- Judge grounding, citations, assignment fit, length, rubric alignment, and higher-level style.
- Deterministic style checks run before this call; the prompt tells the model not to re-check those findings and instead use them as supplied data.
- Return structured diagnostics with location, issue type, evidence, severity, and action category. The validator diagnoses; drafting/revision services perform the rewrite.
- Files: `essay_writer/drafting/style_revision.py`, `anti-ai-detection-SKILL.md`
- System prompt: `STYLE_REVISION_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `drafting/style_revision.py`
- Output schema: `STYLE_REVISION_SCHEMA`
- Stored version: `drafting-style-revision-v1`
Purpose:
- Run a constrained prose-only style pass before validation.
- Preserve facts, citations, thesis meaning, source map, bibliography candidates, and required source-backed claims.
- Use deterministic style findings and the full anti-AI skill document to reduce generic prose patterns without adding unsupported content.
Users upload source documents through the web UI. Supported source types are:
`.pdf`, `.docx`, `.txt`, `.md`, `.markdown`, and `.notes`.
Ingestion uses DocumentReader for document text extraction:
- PDFs use `PyPdfExtractor` for text-native extraction.
- DOCX files use `WordDocExtractor`.
- TXT/Markdown/Notes files are read as UTF-8 plain text.
- Low-quality PDF text can fall back to OCR during ingestion.
Each ingested source produces persisted artifacts under the configured data directory:
```
source.json
original.<ext>
pages.jsonl
chunks.jsonl
full_text.txt
source_card.json
source_map.json
source_units.jsonl
index.sqlite
index_manifest.json
```
Not every source will have every artifact. For example, index.sqlite and
index_manifest.json are only present when indexing succeeds. Uploaded
originals are copied into the source artifact directory so later source access
can re-read specific PDF pages when OCR is needed.
Every successfully ingested source gets a source card. A source card is a compact summary used by topic ideation and later validation metadata checks.
The source-card builder sends selected source excerpts to the model. If no LLM client is configured, ingestion raises a configuration error instead of creating a lower-quality deterministic card.
The source card includes:
- title
- brief summary
- key topics
- topic-ideation hints
- notable sections
- limitations
- citation metadata
- warnings
The source access layer is the preferred interface between LLM stages and source text.
For PDFs, the source map is page-based:
```
source_id
unit_id
unit_type = pdf_page
pdf_page_start / pdf_page_end
printed_page_start / printed_page_end
text preview
text quality
```
Important: source access uses physical 1-based PDF page numbers. Printed page labels are stored separately for citation and traceability.
For DOCX, Markdown, TXT, and Notes, the source map is section-based:
```
source_id
unit_id
unit_type = section
heading_path
text preview
text quality
```
Markdown sections are built from headings when available. DOCX/TXT/Notes use heading-like lines and paragraph grouping as fallback structure.
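The heading-based section split can be sketched roughly as below. This is a minimal illustration under assumed field names; the real builder also handles `heading_path` nesting, fallback paragraph grouping, and text-quality scoring.

```python
# Illustrative sketch: derive section units from markdown headings.
def markdown_sections(text):
    sections, current = [], None
    for line in text.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))  # number of '#' chars
            current = {"heading": line.lstrip("# ").strip(),
                       "level": level, "lines": []}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return sections
```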
The source resolver accepts SourceLocator requests:
```
pdf_pages: source_id + physical pdf_page_start/pdf_page_end
section:   source_id + section_id
search:    source_id + query
chunk:     source_id + chunk_id
```
It returns SourceTextPacket objects with exact text and provenance.
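For concreteness, request payloads for the four locator types might look like the following. The field names mirror the list above, but the exact `SourceLocator` wire format is an assumption.

```python
# Hypothetical locator request payloads; field names follow the README,
# the actual serialized format may differ.
pdf_pages_request = {"type": "pdf_pages", "source_id": "src-1",
                     "pdf_page_start": 12, "pdf_page_end": 15}
section_request = {"type": "section", "source_id": "src-2",
                   "section_id": "sec-background"}
search_request = {"type": "search", "source_id": "src-1",
                  "query": "irrigation policy"}
chunk_request = {"type": "chunk", "source_id": "src-3",
                 "chunk_id": "chunk-0007"}
```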
The user can paste assignment text or extract it from a supported document type.
TaskSpecParser turns that assignment into a TaskSpecification.
The task specification includes:
- assignment title and raw text
- essay type and academic level when available
- target length and citation style
- prompt options and selected prompt
- required sources/materials/structure
- rubric and grading criteria
- extracted checklist items
- adversarial text flags
- blocking questions and warnings
If multiple prompt options are detected and no selected prompt is provided, the job can be marked blocked until the user resolves the ambiguity.
An EssayJob links:
- task spec ID
- uploaded source IDs
- topic rounds
- selected topic
- research plan
- evidence map
- outline
- draft
- validation report
- final export
The job state machine records progress through stages such as
topic_selection, research_planning, drafting, validation, revision,
and complete.
Topic ideation is an LLM stage. It receives:
- task specification
- source cards
- source maps
- source index manifests
- previous topic candidates
- rejected topic directions
- optional user instruction for another round
It returns candidate topics with:
- title
- research question
- tentative thesis direction
- rationale
- fit/evidence/originality scores
- risk flags and missing evidence
- legacy source leads using `chunk_ids` and source-index search queries
- preferred `source_requests` using PDF pages, section IDs, searches, or chunks
The user selects one topic or rejects directions with reasons. Rejected topics are stored and passed into later topic rounds so the model can avoid repeating them.
Research planning is deterministic in the current implementation.
It receives the selected topic, source maps, index manifests, task spec, and
source access config. It validates the selected topic's source_requests:
- source IDs must belong to the job
- source maps must exist
- physical PDF page ranges must be valid
- section IDs must exist
- search requests must include a query
- chunk requests must include a chunk ID
- PDF requests must fit configured per-request bounds
The output ResearchPlan contains:
- research question
- uploaded source priorities
- validated source requests
- source requirements from the assignment
- expected evidence categories
- optional external search queries if external search is allowed
- warnings
Before final topic research, the workflow resolves validated source requests into source text packets.
Resolution order is:
- Preferred `source_requests` from the selected topic and research plan.
- Legacy explicit `chunk_ids`.
- SQLite full-text search using suggested source search queries.
Resolved packets are bounded by:
- max research rounds
- max source packets
- max total source characters
- max PDF pages per request
- max PDF pages total
- max characters per packet
- oversized request policy
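The packet and character budgets above could be applied roughly as in this sketch. The constants mirror the documented `ESSAY_MAX_*` defaults; the logic is illustrative, not the app's actual resolver.

```python
# Minimal sketch of packet budget enforcement (illustrative only).
MAX_SOURCE_PACKETS = 40
MAX_TOTAL_SOURCE_CHARS = 200_000
MAX_CHARS_PER_PACKET = 50_000

def apply_budgets(packet_texts):
    """Truncate per-packet text and stop once global budgets are exhausted."""
    kept, total = [], 0
    for text in packet_texts:
        if len(kept) >= MAX_SOURCE_PACKETS or total >= MAX_TOTAL_SOURCE_CHARS:
            break
        text = text[:MAX_CHARS_PER_PACKET]          # per-packet cap
        text = text[: MAX_TOTAL_SOURCE_CHARS - total]  # remaining global budget
        kept.append(text)
        total += len(text)
    return kept
```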
For PDFs, the resolver first uses stored page text from ingestion. If requested
physical pages are missing readable text, or only have low/partial text, it can
run lazy per-page OCR against the stored original PDF and refresh
pages.jsonl, full_text.txt, source_map.json, and source_units.jsonl
before returning the packet. Lazy OCR uses physical 1-based PDF page numbers,
not printed page labels.
Relevant source access environment variables:
```shell
ESSAY_MAX_RESEARCH_ROUNDS
ESSAY_MAX_SOURCE_PACKETS
ESSAY_MAX_TOTAL_SOURCE_CHARS
ESSAY_MAX_PDF_PAGES_PER_REQUEST
ESSAY_MAX_PDF_PAGES_TOTAL
ESSAY_MAX_CHARS_PER_PACKET
ESSAY_OVERSIZED_SOURCE_REQUEST_POLICY
ESSAY_LAZY_PDF_OCR_ENABLED
ESSAY_LAZY_OCR_TIER
ESSAY_LAZY_OCR_DPI
ESSAY_LAZY_OCR_LANGUAGES
```
Final topic research is an LLM stage. It receives:
- task specification
- selected topic
- resolved source text packets
- legacy retrieved chunks when present
The model turns source text into an evidence map:
- research notes
- grounded claims
- quotes when directly found in source text
- paraphrases
- relevance explanations
- evidence groups
- gaps
- conflicts
- warnings
The service validates model output against supplied source text. For example, quotes that are not found in the packet/chunk are dropped with warnings.
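The quote check can be pictured as a verbatim-containment test: a quote survives only if it appears exactly in the resolved packet text, otherwise it is dropped and a warning recorded. This sketch uses assumed field names, not the service's real data model.

```python
# Illustrative sketch of the documented quote-verification behavior.
def verify_quotes(notes, packet_text):
    verified, warnings = [], []
    for note in notes:
        quote = note.get("quote")
        if quote and quote not in packet_text:
            warnings.append(f"quote not found in source text: {quote[:40]!r}")
            note = {**note, "quote": None}  # drop the unverified quote
        verified.append(note)
    return verified, warnings
```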
Outlining is a major LLM-backed content-planning step. The outline service receives:
- task specification
- selected topic
- research plan
- evidence map
- resolved source packets, including packet IDs, source IDs, locator type, PDF page ranges, printed page labels when known, heading paths, extraction method, text quality, warnings, and text
It returns:
- working thesis
- section headings
- section purposes
- key points
- note IDs to use in each section
- target word counts when applicable
If no LLM client is configured, outlining raises a configuration error.
Drafting is an LLM stage. It receives:
- task specification
- selected topic
- evidence notes
- evidence groups
- gaps and conflicts
- outline
- resolved source text packets
The draft response includes:
- essay content
- section-to-source map
- bibliography candidates
- known weak spots
The draft model receives the resolved source packets selected during research planning/source resolution. These packets include source IDs, packet IDs, page ranges, printed page labels when known, headings, extraction metadata, text quality, warnings, and the excerpt text.
Validation combines deterministic checks and an LLM judgment.
Deterministic checks look for style and structure issues such as:
- em dash count
- en dash count
- decorative hyphen pause count
- colon explanation pattern count
- overused high-level vocabulary
- conclusion opener problems
- participial phrase rate
- repetitive signposting
- sentence similarity runs
- triplet plus contrastive-negation combos
- clustered triplets
- paragraph length variance
- mechanical burstiness, including clipped fragment chains
- concrete source engagement
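As a toy version of three of the checks listed above (the real checks are more involved and return structured findings rather than bare counts):

```python
import re

# Illustrative counters only; thresholds and reporting are handled elsewhere.
def style_findings(text):
    return {
        "em_dash_count": text.count("\u2014"),
        "en_dash_count": text.count("\u2013"),
        # colon followed by an explanatory continuation, e.g. "repeats: the ..."
        "colon_explanation_count": len(re.findall(r":\s+\w", text)),
    }
```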
The LLM validation stage receives:
- draft text
- task specification
- evidence map
- bibliography candidates
- known source metadata from source cards
- deterministic findings
It returns:
- unsupported claims
- citation issues
- rubric scores
- assignment-fit judgment
- length check
- style issues
- structured diagnostics
- legacy revision suggestions when present
- overall quality score
If validation fails, the workflow can run a revision pass.
The revision service receives:
- prior draft
- validation report
- task spec
- selected topic
- evidence map
- outline
- resolved source text packets
It creates the next draft version, then validation runs again. If the revised draft passes, the workflow can export.
When configured, the workflow runs a final constrained style pass before validation. This pass receives the latest draft, task spec, outline, evidence map, deterministic style findings, anti-AI skill document, and source packets.
The style pass may revise prose rhythm, paragraph movement, transitions, and generic phrasing. It must not add facts, citations, source names, quotes, page numbers, or statistics. The styled draft is stored as the next draft version, and validation runs against that styled draft.
When validation passes, FinalExportService creates a Markdown export with:
- final essay content
- bibliography candidates
- section source map
- validation summary
The web UI can display the final essay and download it as Markdown.
- Lazy PDF OCR depends on the stored original PDF and installed OCR dependencies. Existing source artifacts created before original-file persistence may need to be re-ingested before lazy OCR can run.
- Embedding search is not yet implemented.
- Follow-up research rounds are configurable but not yet wired into a multi-round source-request loop.
- DOCX page numbers are not stable without rendering, so DOCX access is section-based rather than page-based.
- The extraction CLI and web app share lower-level document extraction code, but the CLI does not run the full essay workflow.
```shell
pdf-extract extract path/to/file.pdf --mode text_only
pdf-extract extract path/to/file.pdf --mode ocr_only --ocr-tier small
pdf-extract extract path/to/file.pdf --mode ocr_only --ocr-tier medium --ocr-lang en --ocr-lang fr
pdf-extract extract path/to/file.pdf --mode ocr_only --ocr-tier high --ocr-gpu
```

For Tesseract-backed small OCR, the pipeline maps `--ocr-lang en` to
Tesseract's `eng` language code automatically.
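Conceptually, the mapping is a small lookup table. This is a hedged sketch with only a few entries; the pipeline's actual table and fallback behavior for unknown codes may differ.

```python
# Illustrative ISO-639-1 -> Tesseract language-code mapping (partial).
ISO_TO_TESSERACT = {"en": "eng", "fr": "fra", "de": "deu"}

def to_tesseract_lang(code):
    # Assumption: unknown codes pass through unchanged.
    return ISO_TO_TESSERACT.get(code, code)
```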
For page-level parallel OCR with the Tesseract-backed small tier:
```shell
pdf-extract ocr-parallel path/to/file.pdf --ocr-tier small --workers auto --max-pages 10
pdf-extract -v ocr-parallel path/to/file.pdf --ocr-tier small --workers 4 --store ./ocr_store
pdf-extract -v ocr-parallel path/to/file.pdf --ocr-tier small --workers auto --calibrate --max-pages 20
pdf-extract -v ocr-parallel path/to/file.pdf --ocr-tier small --document-id my-book --resume
```

The parallel command writes page artifacts and a merged result under `ocr_store`
by default. Use --calibrate with --workers auto to benchmark a few sample
pages and select a measured worker count. Use --resume with a stable
--document-id to reuse already-completed page artifacts after an interrupted
run. Medium and high OCR tiers remain sequential for now; they are kept
compatible but are not yet parallelized because EasyOCR/PaddleOCR need
backend-specific worker handling, especially for GPU mode.
The CLI prints JSON with:
- source path
- page count
- page-wise text payloads
For generic document reading:
```python
from pdf_pipeline import DocumentReader

reader = DocumentReader()
result = reader.extract("path/to/assignment-or-source.docx")
print(result.pages[0].text)
```

For PDF-specific extraction modes:
```python
from pdf_pipeline.modes import ExtractionMode
from pdf_pipeline.ocr import OcrConfig, OcrTier
from pdf_pipeline.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline(
    mode=ExtractionMode.OCR_ONLY,
    ocr_tier=OcrTier.MEDIUM,
    ocr_config=OcrConfig(languages=("en",), dpi=300, use_gpu=False),
)
result = pipeline.extract("path/to/file.pdf")
for page in result.pages:
    print(page.page_number, page.char_count, page.text[:80])
```

- `ExtractionMode.AUTO` is intentionally not implemented yet.
- `.docx` files are returned as one logical page because Word documents do not store stable page boundaries without rendering.
- Legacy `.doc` files are not supported. Convert them to `.docx` first.
- OCR tiers:
  - small: Tesseract
  - medium: EasyOCR
  - high: PaddleOCR (PP-OCRv4)
- Encrypted PDFs raise `EncryptedPdfError`.
- Corrupt/unreadable PDFs raise `InvalidPdfError`.
- Missing optional OCR packages raise `MissingDependencyError`.
- `ocr-small` requires the Tesseract binary installed on your system and available in PATH.
- `ocr-medium` and `ocr-high` may download model weights on first run.
- GPU behavior depends on backend/runtime installation (`torch`/`paddle`).
See docs/THIRD_PARTY_LICENSES.md.