This repository contains a Python extraction pipeline for source documents used by the essay writer system. It currently supports:
- text-native PDFs
- OCR extraction for PDFs
- modern Word `.docx` files
pypdf is distributed under a permissive BSD-style license, which is commonly
compatible with both open-source and closed-source projects.
```shell
pip install -e .
```

For development and tests:

```shell
pip install -e ".[dev]"
```

Install optional OCR extras as needed:

```shell
pip install -e ".[ocr-small]"                 # Tesseract tier
pip install -e ".[ocr-medium]"                # EasyOCR tier
pip install -e ".[ocr-high]"                  # PaddleOCR tier
pip install -e ".[ocr-small,ocr-scheduler]"   # Tesseract + parallel scheduler
```

Install the web API dependencies when running the local app:

```shell
pip install -e ".[web]"
```

Install the Vite frontend dependencies:
```shell
cd frontend
npm install
```

Run the API from the repository root:

```shell
uvicorn backend.app:app --host 127.0.0.1 --port 8629 --reload
```

Run the frontend in another terminal:

```shell
cd frontend
npm run dev
```

The frontend runs at http://127.0.0.1:3527 by default and proxies /api
requests to http://127.0.0.1:8629. Vite preview uses
http://127.0.0.1:4627.
Agent Tool Mode exposes the essay workflow as local MCP tools for harnesses such as Claude Code and Codex. In this mode, the app does not make hidden LLM API calls for reasoning stages. The harness reads prepared work packets, produces JSON with its own model, and commits validated artifacts back to the app.
Install optional dependencies:
```shell
pip install -e ".[agent-tools]"
```

Run the MCP server:

```shell
ESSAY_DATA_DIR=./data python -m essay_writer.agent_tools.server
```

See docs/agent-tool-mode-mcp.md and .mcp.example.json.
The app supports source uploads for .pdf, .docx, .txt, .md,
.markdown, and .notes files. Assignment text can be pasted or extracted
from the same document types.
Source access budgets default high enough for broad research passes while still preventing one model request from reading an entire book:
```shell
ESSAY_MAX_RESEARCH_ROUNDS=3
ESSAY_MAX_SOURCE_PACKETS=40
ESSAY_MAX_TOTAL_SOURCE_CHARS=200000
ESSAY_MAX_PDF_PAGES_PER_REQUEST=80
ESSAY_MAX_PDF_PAGES_TOTAL=240
ESSAY_MAX_CHARS_PER_PACKET=50000
ESSAY_OVERSIZED_SOURCE_REQUEST_POLICY=reject
ESSAY_LAZY_PDF_OCR_ENABLED=true
ESSAY_LAZY_OCR_TIER=small
ESSAY_LAZY_OCR_DPI=300
ESSAY_LAZY_OCR_LANGUAGES=en
```

Per-stage model and token budgets can be configured through settings or env
vars. The backend resolves models in this order: per-stage Settings override,
per-stage env var, Settings default model, `LLM_MODEL`, then the adapter
default.
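The documented precedence can be sketched as a simple fallback chain. This is an illustrative sketch only: the function name and argument shapes are hypothetical, not the backend's actual API; only the ordering follows the README.

```python
# Hypothetical sketch of the documented model-resolution order:
# per-stage Settings override -> per-stage env var -> Settings default
# model -> LLM_MODEL -> adapter default.
def resolve_model(stage, stage_overrides, env, settings_default, adapter_default):
    if stage_overrides.get(stage):
        return stage_overrides[stage]
    env_key = f"ESSAY_MODEL_{stage.upper()}"   # e.g. ESSAY_MODEL_DRAFTING
    if env.get(env_key):
        return env[env_key]
    if settings_default:
        return settings_default
    if env.get("LLM_MODEL"):
        return env["LLM_MODEL"]
    return adapter_default
```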
```shell
ESSAY_MODEL_TASK_SPEC=
ESSAY_MODEL_SOURCE_CARD=
ESSAY_MODEL_TOPIC_IDEATION=
ESSAY_MODEL_RESEARCH=
ESSAY_MODEL_OUTLINING=
ESSAY_MODEL_DRAFTING=
ESSAY_MODEL_DRAFTING_REVISION=
ESSAY_MODEL_DRAFTING_STYLE=
ESSAY_MODEL_VALIDATION=
ESSAY_MAX_TOKENS_TASK_SPEC=4096
ESSAY_MAX_TOKENS_SOURCE_CARD=2500
ESSAY_MAX_TOKENS_TOPIC_IDEATION=5000
ESSAY_MAX_TOKENS_RESEARCH=8000
ESSAY_MAX_TOKENS_OUTLINING=6000
ESSAY_MAX_TOKENS_DRAFTING=8000
ESSAY_MAX_TOKENS_DRAFTING_REVISION=8000
ESSAY_MAX_TOKENS_DRAFTING_STYLE=8000
ESSAY_MAX_TOKENS_VALIDATION=4000
```

PDF retrieval uses physical 1-based PDF page numbers for source access. Printed page labels are stored separately when PDF metadata exposes them, and are used for traceability rather than as the primary retrieval coordinate.
The web app is an essay-writing workflow built on top of the document extraction pipeline. The orchestrator owns workflow state. Each stage persists its inputs, outputs, prompt version, and validation results. Human approval gates and manual revision loops are first-class steps, not escape hatches.
assignment text + uploaded sources
|
v
+---------------------------------------+
| 1. Source Ingestion | parses PDF/DOCX/TXT/MD/Notes,
| - per-source artifacts | page/section units, lazy OCR,
| - source cards (LLM) | source cards.
+---------------------------------------+
|
v
+---------------------------------------+
| 2. Assignment Parsing (LLM) | -> blocking_questions or
| -> TaskSpecification | adversarial flags can
+---------------------------------------+ mark job blocked here
|
v
+-----------------+
| 3. Job Created |
+-----------------+
|
v
+---------------------------------------+
| 4. Topic Ideation (LLM) | <----+
| candidate topics with leads | |
+---------------------------------------+ |
| |
v |
*** HUMAN GATE: Topic Selection *** |
select | reject + reasons | revise -----+ (next round
| uses rejected
v topics + user
+---------------------------------------+ instruction)
| 5. Research Planning (deterministic) |
| validates source_requests, |
| bounds, page ranges, sections |
+---------------------------------------+
|
v
+---------------------------------------+
| 6. Source Resolution | request -> SourceTextPacket
| pdf_pages | section | search | | (lazy per-page OCR if
| chunk locators | pages have low/empty text)
+---------------------------------------+
|
v
+---------------------------------------+
| 7. Final Topic Research (LLM) | evidence map: notes,
| grounded notes, quotes verified | groups, gaps, conflicts
| against source text |
+---------------------------------------+
|
v
+---------------------------------------+
| 8. Outline (LLM) | thesis, sections,
| section -> note_id mapping | word budgets, claims
+---------------------------------------+
|
v
+---------------------------------------+
| 9. Drafting (LLM) | draft v1 with
| anti-AI skill in system prompt | section_source_map +
| evidence + outline + packets | bibliography candidates
+---------------------------------------+
|
v
+---------------------------------------+
| 10. Final Style Pass (LLM, optional) | prose-only rewrite,
| facts/citations frozen | emits next draft version
+---------------------------------------+
|
v
+---------------------------------------+
| 11. Validation | deterministic style checks
| deterministic checks + | + LLM judgment
| LLM judgment + tone alignment | -> structured diagnostics
+---------------------------------------+
|
passes? | requires_revision?
+-----------------+-------------------+
| |
| yes | no
v v
+------------------------+ +----------------------------------+
| 12a. System Revision | | *** HUMAN GATE: Review Draft *** |
| (LLM) | | review_only | revise | edit text |
| prior draft + report | +----------------------------------+
| + evidence + packets | |
| -> next draft version | +----------------+----------------+
+------------------------+ | | |
| v v v
| (no change) (LLM revision) (user edit
| | saved as
| v new draft
+-----------------------------+ | version)
| | |
v v v
+------------------------------------+
| 13. Final Export (Markdown) |
| content + bibliography + |
| section source map + summary |
+------------------------------------+
Linear shorthand of the happy path:
```
assignment + uploaded sources
-> source ingestion + source cards
-> task specification (LLM)
-> topic ideation (LLM) <-> [HUMAN] topic selection / rejection rounds
-> research planning (deterministic validation)
-> source access resolution (with lazy OCR)
-> final topic research (LLM)
-> outline (LLM)
-> draft v1 (LLM)
-> optional final style pass (LLM) -> draft v2
-> validation (deterministic + LLM + tone alignment)
-> [HUMAN] review draft (review_only | LLM revise | direct text edit)
-> optional system revision loop (LLM)
-> final Markdown export
```
The workflow exposes four explicit human-in-the-loop touchpoints. Each one produces or consumes durable artifacts so the orchestrator can resume after a restart.
After topic ideation, the user must select one candidate before the workflow
proceeds. The user can also reject directions with a written reason. Rejected
topics are persisted (TopicRoundStore.save_rejected_topic) and replayed into
later ideation rounds along with an optional user instruction so the model
avoids repeating discarded directions. Round numbers and topic IDs are stable
across rounds.
ideation round N -> [HUMAN] select | reject(reason) | request another round
|
v
round N+1 prompt includes previous candidates
+ rejected topics + user instruction
The workflow can mark a job blocked (with a user-facing EssayJobErrorState)
when:
- the task spec contains `blocking_questions` or unresolved prompt-option ambiguity
- evidence sufficiency falls below the configured threshold during research
- adversarial AI-directed instructions are detected in the assignment
The job remains blocked until the user resolves the issue (e.g. supplies a
clarification, picks a prompt option, or uploads more sources). Resolution
creates a new artifact version and the workflow resumes from the affected stage.
Any draft version can be edited directly by the user. The endpoint
POST /jobs/{job_id}/drafts/save-user-edit writes the edited text as a new
EssayDraft with:
- `origin = "user_edit"`
- `created_by = "user"`
- `parent_draft_id` linking to the prior version
- `parent_export_id` set when the edit was applied to an exported draft
User edits become the new latest draft and are eligible inputs for the next manual or system revision pass. Each edit is its own immutable version, so edit history is fully traceable.
Manual revision is a structured re-pass over a chosen draft. Triggered via
POST /jobs/{job_id}/manual-revision-runs, it accepts:
- `source_draft_id`: which version to revise from
- `mode`: `review_only` (no rewrite, just reports) or `revise` (LLM rewrite)
- `instruction`: optional free-text user direction
- `selected_lenses`: any subset of `evidence`, `citations`, `assignment_fit`, `length`, `tone`, `anti_ai`
Each run produces a ManualRevisionRun with:
- `pre_revision_validation` / `post_revision_validation`
- `pre_revision_tone_alignment` / `post_revision_tone_alignment`
- `pre_revision_anti_ai` / `post_revision_anti_ai`
- `change_summary` and `warnings`
- `result_draft_id` and `status` (`completed` | `failed`)
The pre/post reports let the UI show a side-by-side delta so the user can see
exactly what the lens-based pass changed. In revise mode the result is saved
as a new draft version with origin = "manual_llm_revision" and
manual_request_id/user_instruction/selected_lenses recorded on the draft
itself.
Drafts form a directed acyclic version graph through parent_draft_id. The
origin field records how each version came into existence:
| Origin | Created by | Triggered by |
|---|---|---|
| `generated` | system | drafting stage (draft v1) |
| `style_revision` | system | optional final style pass |
| `system_revision` | system | post-validation revision loop |
| `manual_llm_revision` | system | manual revision run in revise mode |
| `user_edit` | user | direct text edit via `save_user_edit` |
GET /jobs/{job_id}/drafts returns the full version list with previews and
lineage so the UI can render a revision timeline. The export step always reads
the latest validated draft.
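Because every draft records its parent, the revision timeline can be rebuilt by walking the links. The sketch below is illustrative only: it assumes drafts are plain dicts with `id` and `parent_draft_id` keys, which may not match the real `EssayDraft` model or the `/drafts` payload.

```python
# Illustrative sketch: rebuild a revision timeline from parent_draft_id links.
def lineage(drafts, draft_id):
    """Return draft IDs from the root version (draft v1) up to draft_id."""
    by_id = {d["id"]: d for d in drafts}
    chain, current = [], by_id.get(draft_id)
    while current is not None:
        chain.append(current["id"])
        parent = current.get("parent_draft_id")
        current = by_id.get(parent) if parent else None
    return list(reversed(chain))  # root first
```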
Most workflow steps are deterministic application code. LLM calls happen only
where a configured LLMClient is passed into the service.
| Step | Uses LLM? | Prompt/version | What the model receives |
|---|---|---|---|
| Source ingestion | Yes for source cards | `SOURCE_CARD_SYSTEM_PROMPT` | Source metadata plus selected uploaded-source excerpts. Missing LLM configuration raises an error. PDF OCR itself is not an LLM call. |
| Source maps and source access | No | None | The app builds maps, resolves locators, runs SQLite search, and may run OCR locally. |
| Assignment parsing | Yes | `task-spec-v1` / `TASK_SPEC_SYSTEM_PROMPT` | Raw assignment text. Missing LLM configuration raises an error. |
| Job creation | No | None | The app links task spec, source IDs, and workflow state. |
| Topic ideation | Yes | `topic-ideation-v1` / `TOPIC_IDEATION_SYSTEM_PROMPT` | Task spec, source cards, source maps, index manifests, previous candidates, rejected topics, and optional user instruction. |
| Topic selection | No | None | User action stored by the app. |
| Research planning | No today | `research-planning-v1` metadata only | The app deterministically validates selected-topic source_requests, source IDs, page ranges, sections, chunks, and budgets. No model call is made. |
| Source resolution | No | None | The app resolves validated requests into SourceTextPacket objects and may run lazy per-page OCR locally. |
| Final topic research | Yes | `final-topic-research-v1` / `FINAL_TOPIC_RESEARCH_SYSTEM_PROMPT` | Task spec, selected topic, resolved source packets, and legacy chunks when present. |
| Outlining | Yes | `thesis-outline-v1` / `OUTLINE_SYSTEM_PROMPT` | Task spec, selected topic, research plan, evidence map, and full source-packet text/metadata. Missing LLM configuration raises an error. |
| Drafting | Yes | `drafting-v1` / `DRAFTING_SYSTEM_PROMPT` | Task spec, selected topic, evidence map, outline, and resolved source packets/excerpts. |
| Validation | Yes | `validation-v1` / `VALIDATION_SYSTEM_PROMPT` | Draft, task spec, evidence notes, bibliography candidates, source-card metadata, metadata warnings, and deterministic style findings. Returns structured diagnostics rather than prose rewrite advice. |
| Revision | Yes | `drafting-revision-v1` / `DRAFTING_SYSTEM_PROMPT` | Prior draft, structured validation diagnostics, task spec, selected topic, outline, evidence map, and resolved source packets/excerpts. |
| Final style pass | Yes | `drafting-style-revision-v1` / `STYLE_REVISION_SYSTEM_PROMPT` | Latest draft, task spec, outline, evidence map, deterministic style findings, anti-AI skill document, and source packets. Preserves facts/citations while revising prose shape. |
| Export | No | None | The app writes final Markdown from stored draft and validation data. |
The app uses structured JSON prompts. Each LLM call sends:
- a system prompt constant
- a JSON user payload built by the service
- a JSON schema that constrains the response shape
- an optional per-stage model override from settings/env
- File: `essay_writer/task_spec/prompts.py`
- System prompt: `TASK_SPEC_SYSTEM_PROMPT`
- User payload: `build_task_spec_user_message(raw_text)`
- Output schema: `TASK_SPEC_SCHEMA`
- Stored version: `task-spec-v1`
Purpose:
- Extract assignment requirements from untrusted assignment documents.
- Preserve details, classify real student-facing requirements, detect adversarial instructions, and avoid treating AI-directed instructions as checklist requirements.
- File: `essay_writer/sources/summary.py`
- System prompt: `SOURCE_CARD_SYSTEM_PROMPT`
- User payload: `_build_source_card_user_message(source, excerpts, summary_char_limit)`
- Output schema: `SOURCE_CARD_SCHEMA`
- Stored version: none currently on `SourceCard`
Purpose:
- Create a compact card from uploaded-source excerpts only.
- Summarize key topics, topic-ideation usefulness, notable sections, limitations, citation metadata, and warnings.
- Files: `essay_writer/topic_ideation/prompts.py`, `essay_writer/topic_ideation/service.py`
- System prompt: `TOPIC_IDEATION_SYSTEM_PROMPT`
- User payload: `_build_user_message(context, max_candidates)`
- Output schema: `TOPIC_IDEATION_SCHEMA`
- Stored version: `topic-ideation-v1`
Purpose:
- Generate source-grounded candidate essay topics.
- Use source cards, source maps, and index manifests.
- Prefer `source_requests` using physical PDF pages or section IDs, with chunk/search leads as backward-compatible fallbacks.
- Files: `essay_writer/research/prompts.py`, `essay_writer/research/service.py`
- System prompt: `FINAL_TOPIC_RESEARCH_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `research/service.py`
- Output schema: `FINAL_TOPIC_RESEARCH_SCHEMA`
- Stored version: `final-topic-research-v1`
Purpose:
- Extract source-grounded research notes from resolved source packets/chunks.
- Prevent invented sources, page numbers, facts, and quotes.
- Build notes, evidence groups, gaps, conflicts, and warnings.
- File: `essay_writer/outlining/service.py`
- System prompt: `OUTLINE_SYSTEM_PROMPT`
- User payload: `_build_outline_user_message(...)`
- Output schema: `OUTLINE_SCHEMA`
- Stored version: `thesis-outline-v1`
Purpose:
- Create a detailed, source-grounded essay outline.
- Carry the core argument through thesis, section purposes, claims, evidence placement, counterarguments, and word-budget priorities.
- Preserve traceability through note IDs and source packet IDs.
- Receive source-packet locator metadata, PDF page ranges, printed page labels, heading paths, extraction methods, text quality, warnings, and text so the outline can plan from concrete source information.
- Apply structural humanization guidance: avoid uniform section weights, three parallel body sections, and perfectly balanced source treatment unless the assignment requires them.
- Files: `essay_writer/drafting/prompts.py`, `essay_writer/drafting/service.py`, `anti-ai-detection-SKILL.md`
- System prompt: `DRAFTING_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `drafting/service.py`
- Output schema: `DRAFTING_SCHEMA`
- Stored version: `drafting-v1`
Purpose:
- Write an academic essay draft from task spec, selected topic, evidence map, outline, and resolved source packets.
- Use only evidence-map notes and supplied source packets, record section-to-note/source mappings, and report weak spots instead of fabricating support.
- Use source packet text for accurate source detail, quotes, and citations.
- Include the full local `anti-ai-detection-SKILL.md` document directly in the system prompt and apply it during drafting, not as a cleanup pass.
- Files: `essay_writer/drafting/prompts.py`, `essay_writer/drafting/revision.py`, `anti-ai-detection-SKILL.md`
- System prompt: `DRAFTING_SYSTEM_PROMPT`
- User payload: `_build_revision_message(...)`
- Output schema: `DRAFTING_SCHEMA`
- Stored version: `drafting-revision-v1`
Purpose:
- Revise the prior draft using validation feedback while keeping every claim grounded in the supplied evidence.
- Reuses the same drafting system prompt and schema, but the user payload adds previous draft content, structured validation diagnostics, legacy revision suggestions when present, and weak spots.
- Receives the resolved source packets again, so revision can correct grounding issues against the actual excerpts instead of only the distilled evidence map.
- Because it reuses `DRAFTING_SYSTEM_PROMPT`, it also includes the full `anti-ai-detection-SKILL.md` document directly.
- Files: `essay_writer/validation/prompts.py`, `essay_writer/validation/service.py`
- System prompt: `VALIDATION_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `validation/service.py`
- Output schema: `VALIDATION_SCHEMA`
- Stored version: `validation-v1`
Purpose:
- Judge grounding, citations, assignment fit, length, rubric alignment, and higher-level style.
- Deterministic style checks run before this call; the prompt tells the model not to re-check those findings and instead use them as supplied data.
- Return structured diagnostics with location, issue type, evidence, severity, and action category. The validator diagnoses; drafting/revision services perform the rewrite.
- Files: `essay_writer/drafting/style_revision.py`, `anti-ai-detection-SKILL.md`
- System prompt: `STYLE_REVISION_SYSTEM_PROMPT`
- User payload: `_build_user_message(...)` in `drafting/style_revision.py`
- Output schema: `STYLE_REVISION_SCHEMA`
- Stored version: `drafting-style-revision-v1`
Purpose:
- Run a constrained prose-only style pass before validation.
- Preserve facts, citations, thesis meaning, source map, bibliography candidates, and required source-backed claims.
- Use deterministic style findings and the full anti-AI skill document to reduce generic prose patterns without adding unsupported content.
Users upload source documents through the web UI. Supported source types are:
`.pdf`, `.docx`, `.txt`, `.md`, `.markdown`, and `.notes`.
Ingestion uses DocumentReader for document text extraction:
- PDFs use `PyPdfExtractor` for text-native extraction.
- DOCX files use `WordDocExtractor`.
- TXT/Markdown/Notes files are read as UTF-8 plain text.
- Low-quality PDF text can fall back to OCR during ingestion.
Each ingested source produces persisted artifacts under the configured data directory:
```
source.json
original.<ext>
pages.jsonl
chunks.jsonl
full_text.txt
source_card.json
source_map.json
source_units.jsonl
index.sqlite
index_manifest.json
```
Not every source will have every artifact. For example, index.sqlite and
index_manifest.json are only present when indexing succeeds. Uploaded
originals are copied into the source artifact directory so later source access
can re-read specific PDF pages when OCR is needed.
Every successfully ingested source gets a source card. A source card is a compact summary used by topic ideation and later validation metadata checks.
The source-card builder sends selected source excerpts to the model. If no LLM client is configured, ingestion raises a configuration error instead of creating a lower-quality deterministic card.
The source card includes:
- title
- brief summary
- key topics
- topic-ideation hints
- notable sections
- limitations
- citation metadata
- warnings
The source access layer is the preferred interface between LLM stages and source text.
For PDFs, the source map is page-based:
```
source_id
unit_id
unit_type = pdf_page
pdf_page_start / pdf_page_end
printed_page_start / printed_page_end
text preview
text quality
```
Important: source access uses physical 1-based PDF page numbers. Printed page labels are stored separately for citation and traceability.
For DOCX, Markdown, TXT, and Notes, the source map is section-based:
```
source_id
unit_id
unit_type = section
heading_path
text preview
text quality
```
Markdown sections are built from headings when available. DOCX/TXT/Notes use heading-like lines and paragraph grouping as fallback structure.
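The heading-based section split can be sketched roughly as below. This is a minimal illustration under assumed field names; the real builder also handles `heading_path` nesting, fallback paragraph grouping, and text-quality scoring.

```python
# Illustrative sketch: derive section units from markdown headings.
def markdown_sections(text):
    sections, current = [], None
    for line in text.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))  # number of '#' chars
            current = {"heading": line.lstrip("# ").strip(),
                       "level": level, "lines": []}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return sections
```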
The source resolver accepts SourceLocator requests:
```
pdf_pages: source_id + physical pdf_page_start/pdf_page_end
section:   source_id + section_id
search:    source_id + query
chunk:     source_id + chunk_id
```
It returns SourceTextPacket objects with exact text and provenance.
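For concreteness, request payloads for the four locator types might look like the following. The field names mirror the list above, but the exact `SourceLocator` wire format is an assumption.

```python
# Hypothetical locator request payloads; field names follow the README,
# the actual serialized format may differ.
pdf_pages_request = {"type": "pdf_pages", "source_id": "src-1",
                     "pdf_page_start": 12, "pdf_page_end": 15}
section_request = {"type": "section", "source_id": "src-2",
                   "section_id": "sec-background"}
search_request = {"type": "search", "source_id": "src-1",
                  "query": "irrigation policy"}
chunk_request = {"type": "chunk", "source_id": "src-3",
                 "chunk_id": "chunk-0007"}
```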
The user can paste assignment text or extract it from a supported document type.
TaskSpecParser turns that assignment into a TaskSpecification.
The task specification includes:
- assignment title and raw text
- essay type and academic level when available
- target length and citation style
- prompt options and selected prompt
- required sources/materials/structure
- rubric and grading criteria
- extracted checklist items
- adversarial text flags
- blocking questions and warnings
If multiple prompt options are detected and no selected prompt is provided, the job can be marked blocked until the user resolves the ambiguity.
An EssayJob links:
- task spec ID
- uploaded source IDs
- topic rounds
- selected topic
- research plan
- evidence map
- outline
- draft
- validation report
- final export
The job state machine records progress through stages such as
topic_selection, research_planning, drafting, validation, revision,
and complete.
Topic ideation is an LLM stage. It receives:
- task specification
- source cards
- source maps
- source index manifests
- previous topic candidates
- rejected topic directions
- optional user instruction for another round
It returns candidate topics with:
- title
- research question
- tentative thesis direction
- rationale
- fit/evidence/originality scores
- risk flags and missing evidence
- legacy source leads using `chunk_ids` and source-index search queries
- preferred `source_requests` using PDF pages, section IDs, searches, or chunks
The user selects one topic or rejects directions with reasons. Rejected topics are stored and passed into later topic rounds so the model can avoid repeating them.
Research planning is deterministic in the current implementation.
It receives the selected topic, source maps, index manifests, task spec, and
source access config. It validates the selected topic's source_requests:
- source IDs must belong to the job
- source maps must exist
- physical PDF page ranges must be valid
- section IDs must exist
- search requests must include a query
- chunk requests must include a chunk ID
- PDF requests must fit configured per-request bounds
The output ResearchPlan contains:
- research question
- uploaded source priorities
- validated source requests
- source requirements from the assignment
- expected evidence categories
- optional external search queries if external search is allowed
- warnings
Before final topic research, the workflow resolves validated source requests into source text packets.
Resolution order is:
- Preferred `source_requests` from the selected topic and research plan.
- Legacy explicit `chunk_ids`.
- SQLite full-text search using suggested source search queries.
Resolved packets are bounded by:
- max research rounds
- max source packets
- max total source characters
- max PDF pages per request
- max PDF pages total
- max characters per packet
- oversized request policy
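The packet and character budgets above could be applied roughly as in this sketch. The constants mirror the documented `ESSAY_MAX_*` defaults; the logic is illustrative, not the app's actual resolver.

```python
# Minimal sketch of packet budget enforcement (illustrative only).
MAX_SOURCE_PACKETS = 40
MAX_TOTAL_SOURCE_CHARS = 200_000
MAX_CHARS_PER_PACKET = 50_000

def apply_budgets(packet_texts):
    """Truncate per-packet text and stop once global budgets are exhausted."""
    kept, total = [], 0
    for text in packet_texts:
        if len(kept) >= MAX_SOURCE_PACKETS or total >= MAX_TOTAL_SOURCE_CHARS:
            break
        text = text[:MAX_CHARS_PER_PACKET]          # per-packet cap
        text = text[: MAX_TOTAL_SOURCE_CHARS - total]  # remaining global budget
        kept.append(text)
        total += len(text)
    return kept
```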
For PDFs, the resolver first uses stored page text from ingestion. If requested
physical pages are missing readable text, or only have low/partial text, it can
run lazy per-page OCR against the stored original PDF and refresh
pages.jsonl, full_text.txt, source_map.json, and source_units.jsonl
before returning the packet. Lazy OCR uses physical 1-based PDF page numbers,
not printed page labels.
Relevant source access environment variables:
```shell
ESSAY_MAX_RESEARCH_ROUNDS
ESSAY_MAX_SOURCE_PACKETS
ESSAY_MAX_TOTAL_SOURCE_CHARS
ESSAY_MAX_PDF_PAGES_PER_REQUEST
ESSAY_MAX_PDF_PAGES_TOTAL
ESSAY_MAX_CHARS_PER_PACKET
ESSAY_OVERSIZED_SOURCE_REQUEST_POLICY
ESSAY_LAZY_PDF_OCR_ENABLED
ESSAY_LAZY_OCR_TIER
ESSAY_LAZY_OCR_DPI
ESSAY_LAZY_OCR_LANGUAGES
```
Final topic research is an LLM stage. It receives:
- task specification
- selected topic
- resolved source text packets
- legacy retrieved chunks when present
The model turns source text into an evidence map:
- research notes
- grounded claims
- quotes when directly found in source text
- paraphrases
- relevance explanations
- evidence groups
- gaps
- conflicts
- warnings
The service validates model output against supplied source text. For example, quotes that are not found in the packet/chunk are dropped with warnings.
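The quote check can be pictured as a verbatim-containment test: a quote survives only if it appears exactly in the resolved packet text, otherwise it is dropped and a warning recorded. This sketch uses assumed field names, not the service's real data model.

```python
# Illustrative sketch of the documented quote-verification behavior.
def verify_quotes(notes, packet_text):
    verified, warnings = [], []
    for note in notes:
        quote = note.get("quote")
        if quote and quote not in packet_text:
            warnings.append(f"quote not found in source text: {quote[:40]!r}")
            note = {**note, "quote": None}  # drop the unverified quote
        verified.append(note)
    return verified, warnings
```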
Outlining is a major LLM-backed content-planning step. The outline service receives:
- task specification
- selected topic
- research plan
- evidence map
- resolved source packets, including packet IDs, source IDs, locator type, PDF page ranges, printed page labels when known, heading paths, extraction method, text quality, warnings, and text
It returns:
- working thesis
- section headings
- section purposes
- key points
- note IDs to use in each section
- target word counts when applicable
If no LLM client is configured, outlining raises a configuration error.
Drafting is an LLM stage. It receives:
- task specification
- selected topic
- evidence notes
- evidence groups
- gaps and conflicts
- outline
- resolved source text packets
The draft response includes:
- essay content
- section-to-source map
- bibliography candidates
- known weak spots
The draft model receives the resolved source packets selected during research planning/source resolution. These packets include source IDs, packet IDs, page ranges, printed page labels when known, headings, extraction metadata, text quality, warnings, and the excerpt text.
Validation combines deterministic checks and an LLM judgment.
Deterministic checks look for style and structure issues such as:
- em dash count
- en dash count
- decorative hyphen pause count
- colon explanation pattern count
- overused high-level vocabulary
- conclusion opener problems
- participial phrase rate
- repetitive signposting
- sentence similarity runs
- triplet plus contrastive-negation combos
- clustered triplets
- paragraph length variance
- mechanical burstiness, including clipped fragment chains
- concrete source engagement
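As a toy version of three of the checks listed above (the real checks are more involved and return structured findings rather than bare counts):

```python
import re

# Illustrative counters only; thresholds and reporting are handled elsewhere.
def style_findings(text):
    return {
        "em_dash_count": text.count("\u2014"),
        "en_dash_count": text.count("\u2013"),
        # colon followed by an explanatory continuation, e.g. "repeats: the ..."
        "colon_explanation_count": len(re.findall(r":\s+\w", text)),
    }
```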
The LLM validation stage receives:
- draft text
- task specification
- evidence map
- bibliography candidates
- known source metadata from source cards
- deterministic findings
It returns:
- unsupported claims
- citation issues
- rubric scores
- assignment-fit judgment
- length check
- style issues
- structured diagnostics
- legacy revision suggestions when present
- overall quality score
If validation fails, the workflow can run a revision pass.
The revision service receives:
- prior draft
- validation report
- task spec
- selected topic
- evidence map
- outline
- resolved source text packets
It creates the next draft version, then validation runs again. If the revised draft passes, the workflow can export.
When configured, the workflow runs a final constrained style pass before validation. This pass receives the latest draft, task spec, outline, evidence map, deterministic style findings, anti-AI skill document, and source packets.
The style pass may revise prose rhythm, paragraph movement, transitions, and generic phrasing. It must not add facts, citations, source names, quotes, page numbers, or statistics. The styled draft is stored as the next draft version, and validation runs against that styled draft.
When validation passes, FinalExportService creates a Markdown export with:
- final essay content
- bibliography candidates
- section source map
- validation summary
The web UI can display the final essay and download it as Markdown.
- Lazy PDF OCR depends on the stored original PDF and installed OCR dependencies. Existing source artifacts created before original-file persistence may need to be re-ingested before lazy OCR can run.
- Embedding search is not yet implemented.
- Follow-up research rounds are configurable but not yet wired into a multi-round source-request loop.
- DOCX page numbers are not stable without rendering, so DOCX access is section-based rather than page-based.
- The extraction CLI and web app share lower-level document extraction code, but the CLI does not run the full essay workflow.
```shell
pdf-extract extract path/to/file.pdf --mode text_only
pdf-extract extract path/to/file.pdf --mode ocr_only --ocr-tier small
pdf-extract extract path/to/file.pdf --mode ocr_only --ocr-tier medium --ocr-lang en --ocr-lang fr
pdf-extract extract path/to/file.pdf --mode ocr_only --ocr-tier high --ocr-gpu
```

For Tesseract-backed small OCR, the pipeline maps `--ocr-lang en` to
Tesseract's `eng` language code automatically.
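Conceptually, the mapping is a small lookup table. This is a hedged sketch with only a few entries; the pipeline's actual table and fallback behavior for unknown codes may differ.

```python
# Illustrative ISO-639-1 -> Tesseract language-code mapping (partial).
ISO_TO_TESSERACT = {"en": "eng", "fr": "fra", "de": "deu"}

def to_tesseract_lang(code):
    # Assumption: unknown codes pass through unchanged.
    return ISO_TO_TESSERACT.get(code, code)
```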
For page-level parallel OCR with the Tesseract-backed small tier:
```shell
pdf-extract ocr-parallel path/to/file.pdf --ocr-tier small --workers auto --max-pages 10
pdf-extract -v ocr-parallel path/to/file.pdf --ocr-tier small --workers 4 --store ./ocr_store
pdf-extract -v ocr-parallel path/to/file.pdf --ocr-tier small --workers auto --calibrate --max-pages 20
pdf-extract -v ocr-parallel path/to/file.pdf --ocr-tier small --document-id my-book --resume
```

The parallel command writes page artifacts and a merged result under `ocr_store`
by default. Use --calibrate with --workers auto to benchmark a few sample
pages and select a measured worker count. Use --resume with a stable
--document-id to reuse already-completed page artifacts after an interrupted
run. Medium and high OCR tiers remain sequential for now; they are kept
compatible but are not yet parallelized because EasyOCR/PaddleOCR need
backend-specific worker handling, especially for GPU mode.
The CLI prints JSON with:
- source path
- page count
- page-wise text payloads
For generic document reading:
```python
from pdf_pipeline import DocumentReader

reader = DocumentReader()
result = reader.extract("path/to/assignment-or-source.docx")
print(result.pages[0].text)
```

For PDF-specific extraction modes:
```python
from pdf_pipeline.modes import ExtractionMode
from pdf_pipeline.ocr import OcrConfig, OcrTier
from pdf_pipeline.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline(
    mode=ExtractionMode.OCR_ONLY,
    ocr_tier=OcrTier.MEDIUM,
    ocr_config=OcrConfig(languages=("en",), dpi=300, use_gpu=False),
)
result = pipeline.extract("path/to/file.pdf")
for page in result.pages:
    print(page.page_number, page.char_count, page.text[:80])
```

- `ExtractionMode.AUTO` is intentionally not implemented yet.
- `.docx` files are returned as one logical page because Word documents do not store stable page boundaries without rendering.
- Legacy `.doc` files are not supported. Convert them to `.docx` first.
- OCR tiers:
  - small: Tesseract
  - medium: EasyOCR
  - high: PaddleOCR (PP-OCRv4)
- Encrypted PDFs raise `EncryptedPdfError`.
- Corrupt/unreadable PDFs raise `InvalidPdfError`.
- Missing optional OCR packages raise `MissingDependencyError`.
- `ocr-small` requires the Tesseract binary installed on your system and available in PATH.
- `ocr-medium` and `ocr-high` may download model weights on first run.
- GPU behavior depends on backend/runtime installation (`torch`/`paddle`).
See docs/THIRD_PARTY_LICENSES.md.