docutext

Zero-dependency PDF text extraction built for RAG and AI pipelines. Parses PDFs from scratch -- no PDF.js, no WASM, no native addons. Works in Node.js and the browser.

Performance

Library	Small PDF (31 KB)	Large PDF (1.3 MB)	Bundle (gzip)	Dependencies
docutext	3 ms	40 ms	~24 KB	0
pdfjs-dist	5 ms	244 ms	~1.3 MB	0 (but large)
pdf-parse	5 ms	279 ms	~780 KB	1 (pdfjs)
unpdf	6 ms	232 ms	~320 KB	2 (pdfjs)

Median of 3 runs, Node.js, Apple Silicon.

docutext is purpose-built for text extraction in RAG/AI workflows -- not a general PDF toolkit. It does one thing and does it fast.

Install

# Node.js (zero dependencies)
pnpm add docutext

# Browser / bundler (add fflate for decompression, ~3 KB gzip)
pnpm add docutext fflate

Requires Node.js 18+. Browser builds target ES2020+.

Quick Start

import { DocuText } from 'docutext';

const doc = await DocuText.load('document.pdf');
console.log(doc.text);

Structured Markdown

import { DocuText } from 'docutext';
import { docToMarkdown, pageToMarkdown } from 'docutext/markdown';

const doc = await DocuText.load('document.pdf');
console.log(docToMarkdown(doc));       // headings, bold, links
console.log(pageToMarkdown(doc.pages[0]));

Browser

import { DocuText } from 'docutext';

const response = await fetch('/document.pdf');
const bytes = new Uint8Array(await response.arrayBuffer());
const doc = DocuText.fromBuffer(bytes);
console.log(doc.text);

Page-by-page

for (const page of doc) {
  console.log(`Page ${page.number}: ${page.text}`);
}

Layout-Fidelity Opt-In

import { DocuText } from 'docutext';

const doc = await DocuText.load('document.pdf', { textMode: 'layout' });
console.log(doc.text);

textMode: 'layout' keeps more literal spacing and text-object behavior. The default textMode: 'clean' favors semantic text reconstruction and fixes fragmented form PDFs.

Key Features

Zero dependencies in Node.js. Single optional peer dep (fflate) for browser.
~24 KB gzipped browser bundle -- 50x smaller than pdfjs-dist.
6x faster than alternatives on real-world documents.
Clean semantic text by default -- fragmented form PDFs are reconstructed without spurious intra-word splits.
Plain text + structured markdown output (headings inferred from font size, bold/italic, links).
Opt-in layout fidelity mode via textMode: 'layout' when you want more literal spacing/object ordering.
Column-aware text flow -- side-by-side columns (e.g. signature blocks) are read column-first to keep related data together.
Lazy extraction -- accessing page.text only processes that page.
Full PDF parsing -- xref tables, stream filters, font encodings, ToUnicode CMaps, form XObjects.
Encrypted PDF support -- handles permission-encrypted PDFs (empty password, RC4/AES-128).
Runs everywhere -- Node.js and browser via conditional exports.
ESM only -- modern module system, tree-shakeable.
TypeScript first -- full type definitions included.

Documentation

For the full API reference, architecture details, and a live playground, see the documentation site.

Security

Please see SECURITY.md for vulnerability reporting guidance.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github		.github
examples		examples
site		site
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
tsconfig.json		tsconfig.json
tsup.config.ts		tsup.config.ts
vite.config.site.ts		vite.config.site.ts
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

docutext

Performance

Install

Quick Start

Structured Markdown

Browser

Page-by-page

Layout-Fidelity Opt-In

Key Features

Documentation

Security

License

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

docutext

Performance

Install

Quick Start

Structured Markdown

Browser

Page-by-page

Layout-Fidelity Opt-In

Key Features

Documentation

Security

License

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages