A robust, strictly-typed Node.js and Browser library for parsing office files into a rich Abstract Syntax Tree (AST) and generating high-fidelity output in multiple formats.
Parses: docx · pptx · xlsx · odt · odp · ods · pdf · rtf · csv · md · html
Generates: Markdown · HTML · CSV · RTF · PDF · Plain Text · RAG Chunks
Upload any office file in your browser — inspect the AST, tweak config, and preview generated output in real-time.
- AST Visualizer: Inspect the hierarchical node tree, metadata, and raw content
- Config Configurator: Tweak options (`ignoreNotes`, `ocr`, `newlineDelimiter`) and see results instantly
- Debugging: Identify exactly how nodes are interpreted
- Format Specs: Read detailed specs for the AST structure and all config options
- Install
- Command Line Usage
- Quick Decision Guide
- Library Usage: Parsing
- OfficeGenerator
- OfficeConverter — One-Step API
- Native RAG Chunking
- The AST Structure
- Deep Dive: Document Components
- Performance Highlights
- Advanced AST Usage
- Configuration Reference
- OCR Scheduler & Resource Management
- Browser Usage
- Troubleshooting & Common Issues
- Known Limitations
- Contributing
```bash
npm i officeparser
```

```bash
# Full AST as JSON (default)
npx officeparser /path/to/file.docx

# Plain text output
npx officeparser /path/to/file.docx --format=text

# Convert DOCX to Markdown and save
npx officeparser report.docx --format=md --output=report.md

# Convert PPTX to HTML
npx officeparser presentation.pptx --format=html --output=preview.html

# Convert XLSX to CSV
npx officeparser data.xlsx --format=csv

# Generate RAG chunks
npx officeparser document.pdf --format=chunks
```

| Flag | Values | Default | Description |
|---|---|---|---|
| `--format` | `json` \| `text` \| `md` \| `html` \| `csv` \| `rtf` \| `pdf` \| `chunks` | `json` | Output format |
| `--output` | path | — | Write output to a file |
| `--toText` | `true` \| `false` | `false` | Deprecated. Use `--format=text` |
| `--ignoreNotes` | `true` \| `false` | `false` | Ignore speaker notes (PPTX/ODP) |
| `--putNotesAtLast` | `true` \| `false` | `false` | Collect notes at end of output |
| `--newlineDelimiter` | string | `\n` | Delimiter between lines |
| `--extractAttachments` | `true` \| `false` | `false` | Extract images/charts as Base64 |
| `--ocr` | `true` \| `false` | `false` | Enable OCR for images |
| `--includeRawContent` | `true` \| `false` | `false` | Include raw XML/RTF in nodes |
| `--includeBreakNodes` | `true` \| `false` | `false` | Include break nodes (DOCX only) |
| `--outputErrorToConsole` | `true` \| `false` | `false` | Deprecated. Use `onWarning` callback |
| `--verbose` | `true` \| `false` | `false` | Show full error stack traces |
| Goal | API to use |
|---|---|
| Extract text / AST from a file | OfficeParser.parseOffice(file) |
| Convert directly to another format | OfficeConverter.convert(file, 'md') |
| Parse first, then generate | parseOffice() → OfficeGenerator.generate(ast, 'html') |
| Convert on the AST itself (shorthand) | ast.to('md') |
| RAG pipeline chunking | OfficeConverter.convert(file, 'chunks', {...}) |
```js
const officeParser = require('officeparser');

const ast = await officeParser.parseOffice('/path/to/file.docx');
console.log(ast.type);        // 'docx'
console.log(ast.metadata);    // { author, title, created, ... }
console.log(ast.content);     // Array of hierarchical nodes
console.log(ast.attachments); // Images/charts (if extractAttachments: true)
console.log(ast.warnings);    // Non-fatal issues from the parsing phase
```

TypeScript (named import):
```ts
import { OfficeParser } from 'officeparser';

const ast = await OfficeParser.parseOffice('report.docx', {
  extractAttachments: true,
  ocr: true,
});
```

Callback style is also supported:

```js
officeParser.parseOffice('/path/to/file.docx', function(ast, err) {
  if (err) { console.error(err); return; }
  console.log(ast.toText());
});
```

Pass a Buffer, ArrayBuffer, or Uint8Array instead of a file path:
```js
const fs = require('fs');

const buffer = fs.readFileSync('/path/to/file.pdf');
const ast = await officeParser.parseOffice(buffer);
```

> [!IMPORTANT]
> Text-based formats from buffers need a fileType hint.
> Formats like md, html, and csv have no magic bytes, so the parser cannot
> auto-detect them from a buffer. You must provide fileType in that case:
>
> ```js
> const ast = await officeParser.parseOffice(markdownBuffer, { fileType: 'md' });
> ```

The preferred way to convert a parsed AST to another format. Returns a ConversionResult.
```js
// ConversionResult shape:
// { value: string | Uint8Array | OfficeChunk[], messages: OfficeIssue[] }

const { value: markdown, messages } = await ast.to('md');
const { value: html } = await ast.to('html', { includeFormatting: false });
const { value: chunks } = await ast.to('chunks', { strategy: 'fixed-size', chunkSize: 800 });
const { value: pdfBytes } = await ast.to('pdf'); // Uint8Array
```

> [!NOTE]
> toText() is synchronous and deprecated in favour of the async ast.to('text').
> It remains available for backward compatibility.

```js
const text = ast.toText(); // synchronous, returns a plain string
```

Use OfficeGenerator.generate(ast, format, config?) when you need to produce output from an already-parsed AST:
```ts
import { OfficeParser, OfficeGenerator } from 'officeparser';

const ast = await OfficeParser.parseOffice('report.docx');

// Convert to Markdown
const { value: md } = await OfficeGenerator.generate(ast, 'md');

// Convert to HTML with style mapping
const { value: html } = await OfficeGenerator.generate(ast, 'html', {
  includeFormatting: true,
  styleMap: [
    {
      selector: { nodeType: 'paragraph', attributes: { style: 'Heading 1' } },
      output: { tag: 'h1', classes: ['main-title'] }
    }
  ]
});

// Convert to CSV (spreadsheets)
const { value: csv } = await OfficeGenerator.generate(ast, 'csv');
```

Supported destinations: 'text' · 'md' · 'html' · 'csv' · 'rtf' · 'pdf' · 'chunks'

> [!NOTE]
> PDF generation requires the optional puppeteer peer dependency:
>
> ```bash
> npm install puppeteer
> ```

OfficeConverter.convert() combines parsing and generation in a single call. It automatically syncs parser options from generator config (e.g., enables extractAttachments when images are requested).
```ts
import { OfficeConverter } from 'officeparser';

// Minimal usage
const { value: markdown } = await OfficeConverter.convert('report.docx', 'md');

// With config
const { value: html, messages } = await OfficeConverter.convert('data.xlsx', 'html', {
  parseConfig: {
    ignoreNotes: true,
    newlineDelimiter: '\n\n',
  },
  generatorConfig: {
    includeFormatting: true,
    styleMap: [
      {
        selector: { attributes: { style: { value: 'Header', operator: '~=' } } },
        output: { tag: 'h2', classes: ['data-header'] }
      }
    ]
  },
  onWarning: (issue) => console.warn(`[${issue.code}] ${issue.message}`)
});
```

> [!IMPORTANT]
> The OfficeConverterConfig shape uses nested parseConfig and generatorConfig sub-objects.
> Do not put parser or generator options at the top level — only onWarning lives there.
officeParser provides native document chunking for Retrieval-Augmented Generation (RAG) pipelines with three strategies:
'document-structure' (the default): splits at natural AST boundaries (paragraphs, headings, pages, slides, sheets) and preserves logical flow.
```ts
const { value: chunks } = await OfficeConverter.convert('report.docx', 'chunks', {
  generatorConfig: {
    chunksConfig: {
      strategy: 'document-structure',
      splitBy: 'heading',          // 'paragraph' | 'heading' | 'page' | 'slide' | 'sheet'
      maxChunkSize: 1500,
      tableSplitStrategy: 'row',   // repeats header row in every chunk — ideal for RAG
    }
  }
});
```

'fixed-size': splits by character count with overlap, equivalent to LangChain's RecursiveCharacterTextSplitter.
```ts
const { value: chunks } = await OfficeConverter.convert('report.docx', 'chunks', {
  generatorConfig: {
    chunksConfig: {
      strategy: 'fixed-size',
      chunkSize: 1000,
      chunkOverlap: 200,
    }
  }
});

console.log(`Generated ${chunks.length} chunks`);
```

'semantic': uses cosine similarity between sentence embeddings to find topic boundaries. Requires you to provide an embeddingFunction.
```ts
import OpenAI from 'openai';

const openai = new OpenAI();

const { value: chunks } = await OfficeConverter.convert('report.docx', 'chunks', {
  generatorConfig: {
    chunksConfig: {
      strategy: 'semantic',
      embeddingFunction: async (text) => {
        const res = await openai.embeddings.create({
          input: text, model: 'text-embedding-3-small'
        });
        return res.data[0].embedding;
      },
      similarityThreshold: 0.8,
      maxChunkSize: 2000,
    }
  }
});
```

Every chunk contains text and rich metadata for citations and filtered retrieval:
```ts
interface OfficeChunk {
  text: string;
  /** Rich metadata for filtered retrieval */
  metadata: {
    sourceType: string;       // e.g., 'docx', 'pdf'
    pageNumber?: number;      // (PDF only)
    slideNumber?: number;     // (PPTX only)
    sheetName?: string;       // (XLSX only)
    closestHeading?: string;  // Nearest heading above this chunk
    isTableChunk?: boolean;   // True if part of a split table
  };
  startIndex?: number;        // Character offset (if addStartIndex: true)
  endIndex?: number;          // End character offset (if addStartIndex: true)
}
```

OfficeParserAST is a format-agnostic document representation:
```
OfficeParserAST
├── type: 'docx' | 'pdf' | 'xlsx' | 'csv' | 'md' | ... (11 formats)
├── metadata: { author, title, created, modified, customProperties, styleMap, ... }
├── content: [ OfficeContentNode ]
│     ├── type: 'paragraph' | 'heading' | 'table' | 'list' | 'image' | 'chart' | ...
│     ├── text: string (concatenated text of node + all descendants)
│     ├── children: [ OfficeContentNode ] (recursive)
│     ├── formatting: { bold, italic, underline, color, size, font, alignment, ... }
│     └── metadata: { level, listId, row, col, rowSpan, colSpan, style, ... }
├── attachments: [ OfficeAttachment ] (populated when extractAttachments: true)
│     ├── type: 'image' | 'chart'
│     ├── name: string
│     ├── mimeType: string
│     ├── data: string (Base64)
│     ├── ocrText?: string (if ocr: true)
│     └── chartData?: { title, dataSets, labels }
├── warnings: OfficeIssue[] (non-fatal issues from the parsing phase)
├── to(format, config?) (format: 'html'|'md'|'text'|'csv'|'rtf'|'pdf'|'chunks', returns { value, messages })
└── toText() (Deprecated: use .to('text') instead)
```
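Because every node carries a recursive children array, the entire tree can be traversed with a few lines. Below is a sketch of a generic depth-first visitor; the node shape is a minimal stand-in for OfficeContentNode and the sample tree is hand-built for illustration, not real parser output:

```typescript
// Minimal structural stand-in for OfficeContentNode
interface AstNode {
  type: string;
  text?: string;
  children?: AstNode[];
}

// Depth-first walk: visit each node, then recurse into its children
function walk(nodes: AstNode[], visit: (n: AstNode, depth: number) => void, depth = 0): void {
  for (const n of nodes) {
    visit(n, depth);
    if (n.children) walk(n.children, visit, depth + 1);
  }
}

// Hand-built sample tree (illustrative only)
const sampleTree: AstNode[] = [
  { type: 'heading', text: 'Intro' },
  { type: 'table', children: [
    { type: 'row', children: [{ type: 'cell', text: 'A1' }] },
  ]},
];

const seen: string[] = [];
walk(sampleTree, (n) => seen.push(n.type));
console.log(seen); // → ['heading', 'table', 'row', 'cell']
```

The same pattern underlies the filtering recipes in the Advanced AST Usage section: pass `ast.content` as the root node array.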
All warnings and errors (from both parsing and generation) use this shape:
```ts
interface OfficeIssue {
  type: 'warning' | 'info' | 'error';
  code: OfficeWarningType | OfficeErrorType; // typed enum, e.g. 'OCR_FAILED'
  message: string;
  node?: OfficeContentNode; // the node that triggered the issue, if any
  details?: any;            // original error or extra context
}
```

List nodes look like this:

```
List Node
├── type: 'list'
├── metadata: {
│     listId: '1',          // items with the same listId belong to one logical list
│     listType: 'ordered' | 'unordered',
│     indentation: 0,       // nesting level (0-based)
│     itemIndex: 0,         // sequential position within the list level
│     paragraphIndentation: { left, hanging, right, firstLine }
│   }
└── children: [ Text content ]
```
> [!TIP]
> Even if a list is interrupted by a regular paragraph, itemIndex keeps incrementing for the same listId, so numbering stays correct.
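Since items sharing a listId form one logical list even across interruptions, lists can be reassembled by grouping on that id. A sketch over hand-built nodes (a minimal stand-in for the real node shape):

```typescript
interface ListItemNode {
  type: string;
  text?: string;
  metadata?: { listId?: string; itemIndex?: number };
}

// Collect list-item text grouped by listId, in document order
function groupLists(nodes: ListItemNode[]): Map<string, string[]> {
  const lists = new Map<string, string[]>();
  for (const n of nodes) {
    if (n.type !== 'list' || !n.metadata?.listId) continue;
    const items = lists.get(n.metadata.listId) ?? [];
    items.push(n.text ?? '');
    lists.set(n.metadata.listId, items);
  }
  return lists;
}

// Hand-built sample: list '1' is interrupted by a paragraph but stays one list
const flatNodes: ListItemNode[] = [
  { type: 'list', text: 'First',  metadata: { listId: '1', itemIndex: 0 } },
  { type: 'paragraph', text: 'An aside.' },
  { type: 'list', text: 'Second', metadata: { listId: '1', itemIndex: 1 } },
];

console.log(groupLists(flatNodes).get('1')); // → ['First', 'Second']
```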
Tables follow a strict table → row → cell hierarchy:
```
Table Node (type: 'table')
└── children: Row Nodes (type: 'row')
    └── children: Cell Nodes (type: 'cell')
        ├── metadata: { row, col, rowSpan?, colSpan? }
        └── children: [ Paragraph | List | Table | ... ]
```

- row/col: zero-based grid position
- rowSpan/colSpan: merged cells (primarily ODF formats)
- Cells can contain nested tables
```
Image Node (type: 'image')
├── metadata: { attachmentName: 'img1.png', altText: '...' }
└── → Attachment: { data: 'base64...', ocrText: '...' }
```

- Set extractAttachments: true to populate attachment.data
- Set ocr: true (requires extractAttachments: true) to populate ocrText
```
Chart Node (type: 'chart')
├── metadata: { attachmentName: 'chart1.xml' }
└── → Attachment: { chartData: { title, dataSets, labels } }
```
```ts
formatting: {
  bold?: boolean
  italic?: boolean
  underline?: boolean
  strikethrough?: boolean
  color?: string           // '#RRGGBB'
  backgroundColor?: string
  size?: string            // e.g. '12pt'
  font?: string
  subscript?: boolean
  superscript?: boolean
  alignment?: 'left' | 'center' | 'right' | 'justify'
}
```

When includeBreakNodes: true, break elements appear as nodes:
```
Break Node (type: 'break')
└── metadata: {
      breakType: 'textWrapping' | 'page' | 'column' | 'lastRenderedPage' | 'carriageReturn',
      clear?: 'all' | 'left' | 'none' | 'right'
    }
```

> [!NOTE]
> Break nodes have no text property, but ast.toText() and ast.to('text') automatically convert them to the configured newline delimiter.
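One use for typed break nodes is estimating pagination, e.g. counting explicit page breaks in a DOCX parsed with includeBreakNodes: true. A sketch over hand-built nodes (a minimal stand-in for the real node shape):

```typescript
interface BreakAwareNode {
  type: string;
  children?: BreakAwareNode[];
  metadata?: { breakType?: string };
}

// Count explicit page breaks anywhere in the tree
function countPageBreaks(nodes: BreakAwareNode[]): number {
  return nodes.reduce((sum, n) =>
    sum
    + (n.type === 'break' && n.metadata?.breakType === 'page' ? 1 : 0)
    + (n.children ? countPageBreaks(n.children) : 0), 0);
}

// Hand-built sample: one page break, one soft line break
const sampleBreaks: BreakAwareNode[] = [
  { type: 'paragraph', children: [{ type: 'break', metadata: { breakType: 'page' } }] },
  { type: 'break', metadata: { breakType: 'textWrapping' } },
];

console.log(countPageBreaks(sampleBreaks)); // → 1
```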
```ts
ast.metadata = {
  author?: string
  title?: string
  created?: Date
  modified?: Date
  description?: string
  customProperties?: Record<string, any>    // user-defined metadata from the document
  styleMap?: Record<string, TextFormatting> // named styles → formatting definitions
  formatting?: TextFormatting               // document-wide defaults
}
```

Accessing custom properties:

```js
const ast = await officeParser.parseOffice('contract.docx');
console.log(ast.metadata.customProperties);
// { "ProjectID": "ABC-123", "InternalReview": true }
```

Key internal optimizations shipped in recent versions:
- OpenOffice (ODP): Up to 23× faster parsing via optimized XML pre-parsing and style caching
- Excel Memory: Resolved O(n) memory overhead on large sparse spreadsheets using iterative stream-based parsing
- RTF Parser: Rewrote string accumulation loop to eliminate O(n²) bottleneck in large files
- Table Fidelity (DOCX): Native support for vertical cell merging (vMerge) and horizontal spanning (gridSpan)
Filter level-1 headings:

```js
const headings = ast.content.filter(n => n.type === 'heading' && n.metadata?.level === 1);
console.log(headings.map(h => h.text));
```

Extract OCR text from images:

```js
const ast = await officeParser.parseOffice('report.docx', { extractAttachments: true, ocr: true });
ast.attachments.filter(a => a.mimeType?.startsWith('image/')).forEach(img => {
  console.log(`${img.name}: ${img.ocrText ?? 'no OCR'}`);
});
```

Convert tables to CSV manually:

```js
ast.content.filter(n => n.type === 'table').forEach((table, i) => {
  const csv = table.children
    .filter(r => r.type === 'row')
    .map(r => r.children.filter(c => c.type === 'cell')
      .map(c => `"${c.text.replace(/"/g, '""')}"`)
      .join(','))
    .join('\n');
  console.log(`Table ${i + 1}:\n${csv}`);
});
```

Collect bold text recursively:

```js
function findBold(nodes) {
  return nodes.flatMap(n => [
    ...(n.type === 'text' && n.formatting?.bold ? [n.text] : []),
    ...(n.children ? findBold(n.children) : [])
  ]);
}
console.log(findBold(ast.content));
```

Extract footnotes and endnotes:

```js
function extractNotes(nodes) {
  return nodes.flatMap(n => [
    ...(n.type === 'note' ? [{ id: n.metadata.noteId, text: n.text, type: n.metadata.noteType }] : []),
    ...(n.children ? extractNotes(n.children) : [])
  ]);
}
console.log(extractNotes(ast.content));
```

Check whether a document contains a term:

```ts
import { OfficeParser } from 'officeparser';

async function contains(filePath: string, term: string): Promise<boolean> {
  const ast = await OfficeParser.parseOffice(filePath);
  return (await ast.to('text')).value.includes(term);
}
```

Pass as the second argument to parseOffice(file, config).
| Option | Type | Default | Description |
|---|---|---|---|
| `newlineDelimiter` | string | `'\n'` | Delimiter inserted between lines in text output |
| `ignoreNotes` | boolean | `false` | Ignore speaker notes (PPTX/ODP) |
| `putNotesAtLast` | boolean | `false` | Collect all notes at the end instead of inline |
| `extractAttachments` | boolean | `false` | Populate ast.attachments with Base64 images/charts |
| `ocr` | boolean | `false` | Run Tesseract OCR on images (requires extractAttachments: true) |
| `ocrConfig` | OcrConfig | `{}` | OCR worker pool settings — see OCR section |
| `includeRawContent` | boolean | `false` | Attach raw XML/RTF source to each node |
| `serializeRawContent` | boolean | `true` | Re-serialize XML to clean strings (only if includeRawContent: true) |
| `preserveXmlWhitespace` | boolean | `false` | Preserve original XML whitespace during serialization |
| `includeBreakNodes` | boolean | `false` | Include w:br / w:cr as typed break nodes (DOCX only) |
| `ignoreInternalLinks` | boolean | `false` | Strip bookmarks and internal cross-references from the AST |
| `fileType` | SupportedFileType \| null | `null` | Required for text-based buffer data ('md', 'html', 'csv'), as these lack magic bytes |
| `csvDelimiter` | string | `','` | Input delimiter when parsing CSV files |
| `pdfWorkerSrc` | string | CDN (jsDelivr) | Path/URL to pdf.worker.min.mjs (required in browser) |
| `onWarning` | (issue: OfficeIssue) => void | — | Callback for non-fatal parsing issues |
| `outputErrorToConsole` | boolean | `false` | Deprecated. Use onWarning instead |
Options shared by all generator formats. Pass to OfficeGenerator.generate(ast, format, config) or ast.to(format, config).
| Option | Type | Default | Description |
|---|---|---|---|
| `includeFormatting` | boolean | `true` | Include bold/italic/colors/sizes in output |
| `generateIds` | boolean | `true` | Add slug-based id attributes to headings |
| `renderMetadata` | boolean | `false` | Render title/author as a visible header block |
| `includeImages` | boolean | `true` | Include image nodes in output |
| `includeCharts` | boolean | `true` | Include interactive charts (HTML only) |
| `ignoreInternalLinks` | boolean | `false` | Strip bookmarks and internal anchors from output |
| `ignoreDefaultStyleMap` | boolean | `false` | Disable built-in style mappings (e.g., "Heading 1" → h1) |
| `styleMap` | string[] \| StructuredStyleMapping[] | `[]` | Custom semantic style mappings |
| `onNode` | (node) => string \| false \| void | — | Per-node callback for filtering, overriding, or mutating |
| `onWarning` | (issue: OfficeIssue) => void | — | Callback for non-fatal generation issues |
Called for every node in the AST during generation. Can be async.
| Return value | Effect |
|---|---|
| `false` | Skip this node and all its children |
| `string` | Use this string as the output for this node, skipping default logic |
| `void` | Proceed with default rendering (mutations to the node are applied) |
```js
const { value: md } = await ast.to('md', {
  onNode: async (node) => {
    // Skip all images
    if (node.type === 'image') return false;

    // Redact secrets (mutate then proceed)
    if (node.text?.includes('SECRET_KEY')) {
      node.text = node.text.replace(/SECRET_KEY: \w+/, 'SECRET_KEY: [REDACTED]');
    }

    // Custom rendering for a specific style
    if (node.metadata?.style === 'Callout') {
      return `> [!INFO]\n> ${node.text}`;
    }
  }
});
```

Maps document style names to semantic output elements. Two formats are supported:
```js
styleMap: [
  {
    selector: { nodeType: 'paragraph', attributes: { style: 'Heading 1' } },
    output: { tag: 'h1', classes: ['main-title'], attributes: { id: 'top' } }
  },
  {
    // '~=' operator matches if the word 'Quote' appears anywhere in the style name
    selector: { attributes: { style: { value: 'Quote', operator: '~=' } } },
    output: { tag: 'blockquote', fresh: true }
  }
]
```

fresh: true prevents the generator from merging adjacent nodes of the same tag into one block.
Compatible with mammoth.js style maps:
```js
styleMap: [
  "p[style-name='Heading 1'] => h1",
  "p[style~='Title'] => h2",
  "p[style-name='Quote'][lang='en'] => blockquote"
]
```

Pass as htmlConfig inside GeneratorConfig.
| Option | Type | Default | Description |
|---|---|---|---|
| `standalone` | boolean | `true` | Wrap output in a full `<html>` document with CSS |
| `chartJsSrc` | string | jsDelivr CDN | URL for the Chart.js library |
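For embedding generated markup into an existing page, a fragment-only setup might look like the sketch below (the self-hosted Chart.js path is illustrative, not part of the library):

```ts
const { value: htmlFragment } = await ast.to('html', {
  htmlConfig: {
    standalone: false,              // emit only the body markup, no <html> wrapper or CSS
    chartJsSrc: '/vendor/chart.js', // hypothetical self-hosted path instead of the CDN default
  }
});
```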
Pass as mdConfig inside GeneratorConfig.
| Option | Type | Default | Description |
|---|---|---|---|
| `fallbackToHtml` | boolean | `true` | Use HTML tags for features Markdown cannot represent (underlines, merged table cells, etc.) |
Pass as pdfConfig inside GeneratorConfig. Requires the optional puppeteer peer dependency.
| Option | Type | Default | Description |
|---|---|---|---|
| `format` | string | `'A4'` | Paper format ('A4', 'Letter', 'Legal', etc.) |
| `landscape` | boolean | `false` | Landscape page orientation |
| `printBackground` | boolean | `true` | Print background graphics |
| `margin` | object | `{0,0,0,0}` | Page margins (top, right, bottom, left) |
| `displayHeaderFooter` | boolean | `false` | Show print header/footer |
| `headerTemplate` | string | `''` | HTML template for the print header |
| `footerTemplate` | string | `''` | HTML template for the print footer |
| `scale` | number | `1` | Rendering scale factor |
| `launchOptions` | object | headless defaults | Puppeteer launch options (e.g., executablePath) |
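A sketch of a common combination — Letter landscape with page numbers (the margin string format and the `pageNumber` footer class follow Puppeteer's page.pdf() conventions, since this config is forwarded to Puppeteer):

```ts
const { value: pdfBytes } = await ast.to('pdf', {
  pdfConfig: {
    format: 'Letter',
    landscape: true,
    margin: { top: '1cm', right: '1cm', bottom: '1.5cm', left: '1cm' },
    displayHeaderFooter: true,
    footerTemplate: '<span class="pageNumber"></span>', // Puppeteer's footer placeholder
  }
});
```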
Pass as csvConfig inside GeneratorConfig.
| Option | Type | Default | Description |
|---|---|---|---|
| `sheets` | string | `''` | Sheet range to export: '1', '1-3', '1,3' (1-based). Empty = all sheets |
| `mergeSheets` | boolean | `true` | Merge all sheets into one CSV. If false, returns a ZIP archive |
| `columnDelimiter` | string | `','` | Output column delimiter |
Pass as textConfig inside GeneratorConfig.
| Option | Type | Default | Description |
|---|---|---|---|
| `newlineDelimiter` | string | `'\n'` | String inserted between structural blocks |
| `preserveLayout` | boolean | `false` | Render tables with aligned columns using whitespace |
Configuration for OfficeConverter.convert(file, format, config).
| Option | Type | Description |
|---|---|---|
| `parseConfig` | OfficeParserConfig | Settings for the parsing phase |
| `generatorConfig` | GeneratorConfig | Settings for the generation phase |
| `onWarning` | (issue: OfficeIssue) => void | Global warning callback (overrides phase-specific ones) |
ChunkingConfig is a discriminated union — the available options depend on the strategy field.
| Option | Type | Default | Description |
|---|---|---|---|
| `strategy` | string | `'document-structure'` | Chunking strategy |
| `stripWhitespace` | boolean | `true` | Trim leading/trailing whitespace from each chunk |
| `includeMetadata` | boolean | `true` | Include page/slide/heading metadata in each chunk |
| `addStartIndex` | boolean | `false` | Add startIndex character offset to chunk metadata |
| `lengthFunction` | (text) => number | text.length | Custom size measurer (e.g., token counter) |
| `sentenceBoundaryRegex` | string \| RegExp | `/[.!?。!?]/` | Custom regex for sentence boundary detection |
| `abbreviations` | string[] | common list | Abbreviations to skip when splitting on '.' |
| Option | Type | Default | Description |
|---|---|---|---|
| `chunkSize` | number | `1000` | Maximum characters per chunk |
| `chunkOverlap` | number | `200` | Character overlap between consecutive chunks |
| `separators` | string[] | `['\n\n', '\n', ' ', '']` | Ordered list of separators to try |
| Option | Type | Default | Description |
|---|---|---|---|
| `splitBy` | string | `'paragraph'` | 'paragraph' · 'heading' · 'page' · 'slide' · 'sheet' |
| `maxChunkSize` | number | `1000` | Max characters per chunk (oversized units are split recursively) |
| `tableSplitStrategy` | string | `'row'` | 'row' (repeats header in each chunk) or 'flatten' |
| Option | Type | Default | Description |
|---|---|---|---|
| `embeddingFunction` | (text) => Promise&lt;number[]&gt; | required | Async embedding function |
| `similarityThreshold` | number | `0.8` | Cosine similarity threshold; lower = fewer boundaries |
| `maxChunkSize` | number | `2000` | Max characters even if similarity stays high |
| `bufferSize` | number | `1` | Surrounding sentences used when computing similarity |
| `embeddingBatchSize` | number | `50` | Sentences per embedding API batch |
When ocr: true is set, officeParser maintains an intelligent Smart Worker Pool backed by Tesseract.js:
- Dynamic Affinity: Workers persist with their last-used language, avoiding re-initialization overhead.
- LRU Re-allocation: When a new language is requested and the pool is full, the Least Recently Used idle worker is re-initialized.
- Auto-Termination: Workers shut down after 10 seconds of inactivity (configurable via ocrConfig.autoTerminateTimeout).
| Option | Type | Default | Description |
|---|---|---|---|
| `language` | string | `'eng'` | Tesseract language code(s), e.g. 'eng+fra' |
| `workerPath` | string | `''` | Custom path to the Tesseract worker script |
| `corePath` | string | `''` | Custom path to the Tesseract core script |
| `langPath` | string | `''` | Custom path for language data files |
| `autoTerminateTimeout` | number | `10000` | Inactivity timeout in ms before auto-teardown (0 = disabled) |
See all language codes at tesseract-ocr.github.io.
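Putting the OCR options together, a sketch of a multi-language batch setup (the filename and timeout are illustrative):

```ts
const ast = await officeParser.parseOffice('scan.pdf', {
  extractAttachments: true,      // required before OCR can run
  ocr: true,
  ocrConfig: {
    language: 'eng+deu',         // recognize English and German in one pass
    autoTerminateTimeout: 30000, // keep workers warm for 30 s between files in a batch job
  }
});
```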
In short-lived scripts (CLI tools, one-off automation), call terminateOcr() after processing to bypass the idle timer and exit immediately:
```js
const officeParser = require('officeparser');

const ast = await officeParser.parseOffice('file.pdf', { ocr: true });
// ... process results ...
await officeParser.terminateOcr(); // immediate exit
```

> [!TIP]
> The built-in CLI (npx officeparser ...) handles this automatically.
> Only call it manually in your own scripts.
Two bundles are available in the dist/ directory:
| Bundle | Usage |
|---|---|
| `officeparser.browser.mjs` | ESM — use with import statements or modern bundlers (Vite, Webpack, Next.js) |
| `officeparser.browser.iife.js` | IIFE — use with a `<script>` tag; exposes the global officeParser object |
```js
import { OfficeParser } from 'officeparser';

const handleFile = async (event) => {
  const file = event.target.files[0];
  const buffer = await file.arrayBuffer();
  const ast = await OfficeParser.parseOffice(new Uint8Array(buffer));
  console.log(ast.toText());
};
```

```html
<script src="dist/officeparser.browser.iife.js"></script>
<script>
  async function handleFile(event) {
    const file = event.target.files[0];
    const buffer = await file.arrayBuffer();
    const ast = await officeParser.parseOffice(new Uint8Array(buffer));
    console.log(ast.toText());
  }
</script>
```

> [!NOTE]
> File paths don't work in the browser. Always pass a Buffer, ArrayBuffer, or Uint8Array.
> Passing a path string will throw a descriptive FEATURE_NOT_SUPPORTED_IN_BROWSER error.
When parsing PDFs in the browser, a Web Worker is required. If pdfWorkerSrc is omitted, a jsDelivr CDN link is used automatically:
```js
// Uses default CDN worker:
const ast = await officeParser.parseOffice(pdfArrayBuffer);

// Or specify your own:
const ast = await officeParser.parseOffice(pdfArrayBuffer, {
  pdfWorkerSrc: 'https://cdn.jsdelivr.net/npm/pdfjs-dist@5.6.205/build/pdf.worker.min.mjs'
});
```

> [!NOTE]
> The pdfjs-dist worker version must match the version bundled with officeparser (currently 5.6.205).
| Symptom | Fix |
|---|---|
| Node.js process stays alive after finishing | Call await officeParser.terminateOcr() at end of script when OCR was used |
"Worker not found" in browser for PDF |
Verify pdfWorkerSrc points to pdf.worker.min.mjs matching version 5.6.205 |
| Low OCR accuracy | Verify ocrConfig.language matches the document language; quality depends on image resolution |
| Out of memory on large Excel files | Call ast.toText() early and discard the AST object to allow garbage collection |
md/html/csv buffer not detected |
Add fileType: 'md' (or 'html', 'csv') to config — these formats have no magic bytes |
IMPROPER_BUFFERS error |
Usually means no file extension and no fileType hint was provided for a buffer input |
| PDF generation fails | Install the optional peer dependency: npm install puppeteer |
For a full debugging guide, visit the Live Documentation.
- ODT/ODS Charts: May show inaccurate data when the chart references external cell ranges or uses complex layout-based data.
- PDF Images (Browser): Extracted as BMP files for cross-platform compatibility. Conversion is automatic.
- RTF Notes: putNotesAtLast has no effect for RTF files; footnotes and endnotes are always appended at the end.
npm: https://npmjs.com/package/officeparser
github: https://github.com/harshankur/officeParser
If officeParser has helped you save time, consider supporting its continued development. Your sponsorship helps maintain the project, add new features, and keep it robust for everyone.
Contributions are welcome! Please see CONTRIBUTING.md for details.
This project is licensed under the MIT License — see the LICENSE file for details.