DotNet version of Hocr
C# Library for converting PDF files to Searchable PDF Files
- Need to batch convert hundreds of scanned PDFs to searchable PDFs?
- Don't want to pay thousands of dollars for a component?
I have personally tested this library with over 110 thousand PDFs. Beyond a few fringe cases, the code has performed as designed. I was able to process 110k PDFs (some of them hundreds of pages long) over a 3-day period using 5 servers.
Internally, Hocr uses Tesseract, GhostScript, iTextSharp, and the HtmlAgilityPack. Please check the licensing for each NuGet package to make sure you are in compliance.
This library IS THREADSAFE, so you can process multiple PDFs at the same time in different threads; you do not need to process them one at a time.
- PdfCompressor.CreateSearchablePdfAsync: Added async overload of CreateSearchablePdf with CancellationToken support. Uses native async file I/O (WriteAsync, FlushAsync, ReadAllBytesAsync) and offloads CPU-bound OCR and GhostScript compression to the thread pool via Task.Run. Cancellation is checked at each major stage and OperationCanceledException propagates unwrapped.
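The shape of such an async overload can be sketched roughly as follows. This is an illustration of the pattern only, not the library's actual code: AsyncPipelineSketch, CpuBoundStage, and ProcessAsync are hypothetical names, and the byte reversal merely stands in for OCR and compression.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncPipelineSketch
{
    // Hypothetical stand-in for CPU-bound work such as OCR or compression.
    private static byte[] CpuBoundStage(byte[] input, CancellationToken token)
    {
        token.ThrowIfCancellationRequested();
        byte[] output = (byte[])input.Clone();
        Array.Reverse(output);
        return output;
    }

    public static async Task<byte[]> ProcessAsync(string inputPath, CancellationToken token = default)
    {
        // Stage 1: native async file I/O, so no thread is blocked while reading.
        byte[] data = await File.ReadAllBytesAsync(inputPath, token);

        // Stage 2: check for cancellation, then move the CPU-bound stage to the thread pool.
        token.ThrowIfCancellationRequested();
        return await Task.Run(() => CpuBoundStage(data, token), token);
    }
}
```

Because the method is awaited rather than blocked on with Wait() or Result, a cancellation surfaces to the caller as an OperationCanceledException (or a derived type) instead of being wrapped in an AggregateException.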
- TempData._caches: Replaced Dictionary<string, string> with ConcurrentDictionary<string, string> to prevent data corruption under concurrent access from Parallel.ForEach and background timer threads.
- TempData.DestroySession: Replaced three separate check-then-act operations (TOCTOU race) with a single atomic TryRemove call.
- TempData._cleanUpTimerRunning: Replaced non-volatile bool with Interlocked.CompareExchange to guarantee visibility across threads.
- TempData.Dispose: Added double-dispose guard (Interlocked.CompareExchange), spin-waits for any in-flight timer callback to complete before running cleanup, and unsubscribes the timer event.
- TempData public methods: Added ObjectDisposedException guards to CreateNewSession, CreateTempFile, and CreateDirectory.
- PdfCompressor.Dispose: Now stops and disposes its own CleanUpTimer (previously leaked).
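For readers unfamiliar with these patterns, here is a minimal sketch of an atomic TryRemove and an Interlocked.CompareExchange dispose guard. SessionStoreSketch and its members are hypothetical and much simpler than the real TempData class; the timer handling is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class SessionStoreSketch : IDisposable
{
    private readonly ConcurrentDictionary<string, string> _sessions = new();
    private int _disposed; // 0 = live, 1 = disposed

    public string CreateSession(string directory)
    {
        // Guard public entry points once the store has been disposed.
        if (Volatile.Read(ref _disposed) == 1)
            throw new ObjectDisposedException(nameof(SessionStoreSketch));

        string id = Guid.NewGuid().ToString("N");
        _sessions[id] = directory;
        return id;
    }

    // One atomic call replaces "ContainsKey, read, Remove";
    // if two threads race on the same id, only one of them gets true.
    public bool DestroySession(string id, out string directory) =>
        _sessions.TryRemove(id, out directory);

    public void Dispose()
    {
        // Only the first caller sees the old value 0 and runs cleanup.
        if (Interlocked.CompareExchange(ref _disposed, 1, 0) != 0)
            return;
        _sessions.Clear();
    }
}
```

The point of the single TryRemove is that there is no window between checking for a key and removing it in which another thread can change the dictionary.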
- PdfCreator.SetupDocumentWriter: Removed premature _writer.Close() / _writer.Dispose() calls that occurred before the document was opened. This caused SetFullCompression() to have no effect, meaning PDF object streams (images, fonts, metadata) were not Flate-compressed.
- PdfCreator.SetupDocumentWriter: Fixed CompressionLevel = 100 (invalid for zlib, which accepts 0–9) to PdfStream.BEST_COMPRESSION (9).
- GhostScript.CompressPdf: Added -dDetectDuplicateImages=true (deduplicates identical images across pages) and -dCompressFonts=true (compresses embedded font data) to the GhostScript command line. Both are lossless optimizations that reduce output size without affecting image quality.
- OcrController.CreateHocr: Fixed hardcoded "eng" language; now correctly passes the language parameter to the Tesseract engine.
- PdfCompressor.CreateSearchablePdf: Moved null/empty validation of fileData before session creation and GetPages call (previously validated after use).
- PdfCompressor.PdfSigned: Added using declaration for iTextSharp.text.pdf.PdfReader (file handle was leaked on every call).
- PdfCreator.WritePageDrawBlocks: Two Graphics objects and four Pen objects were never disposed. Consolidated to a single Graphics and wrapped all GDI+ objects in using statements.
- GhostScript.RunCommand: Redirected stdout/stderr were never drained, which can deadlock the process when GhostScript output fills the OS pipe buffer. Added BeginOutputReadLine() / BeginErrorReadLine() to drain streams asynchronously.
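The stdout/stderr deadlock and its fix apply to any child process started with redirected streams. A minimal sketch using only System.Diagnostics is shown below; ProcessRunnerSketch and Run are hypothetical names, not the library's GhostScript.RunCommand.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ProcessRunnerSketch
{
    public static (int ExitCode, string StdOut, string StdErr) Run(string fileName, string arguments)
    {
        var stdout = new StringBuilder();
        var stderr = new StringBuilder();

        using var proc = new Process
        {
            StartInfo = new ProcessStartInfo(fileName, arguments)
            {
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                UseShellExecute = false,
                CreateNoWindow = true
            }
        };

        // e.Data is null when a stream closes, so skip that final event.
        proc.OutputDataReceived += (_, e) => { if (e.Data != null) stdout.AppendLine(e.Data); };
        proc.ErrorDataReceived += (_, e) => { if (e.Data != null) stderr.AppendLine(e.Data); };

        proc.Start();

        // Start draining both pipes so the child never blocks on a full buffer.
        proc.BeginOutputReadLine();
        proc.BeginErrorReadLine();

        // The parameterless overload also waits for the async readers to reach end of stream.
        proc.WaitForExit();
        return (proc.ExitCode, stdout.ToString(), stderr.ToString());
    }
}
```

Without the two Begin...ReadLine calls, a child that writes more than the pipe buffer holds will block on its write while the parent blocks in WaitForExit, and neither makes progress.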
- PdfCreator.WriteDirectContent: BaseFont.CreateFont() was called inside the per-line loop (identical result each time). Hoisted to a single call before the loop.
- PdfCreator.WriteUnderlayContent: BaseFont.CreateFont() was called inside the per-word loop. Hoisted to a single call before the loop.
- TempData.CleanUpFiles: Previously processed only one directory per 5-second timer tick and halted the entire batch on the first locked directory. Now processes all queued items per tick, skipping locked directories and re-enqueuing them individually.
- TempData.CreateNewSession: Eliminated per-call Regex allocation and redundant Path.Combine computations. Uses Guid.ToString("N") for filesystem-safe names directly.
- TempData.CreateTempFile: Replaced DateTime.Now.Second + Millisecond (collision-prone) with Guid.NewGuid() for unique filenames.
- TempData.Dispose: Retry loop now sleeps only after failures instead of before every attempt.
- TempData singleton: Replaced Activator.CreateInstance reflection with direct constructor call.
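The "process everything, skip what is locked, retry it next tick" behaviour described for CleanUpFiles could look something like the sketch below. CleanupQueueSketch and DrainOnce are hypothetical names, and treating IOException and UnauthorizedAccessException as "still locked" is an assumption made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;

public static class CleanupQueueSketch
{
    // Tries every directory queued at the start of the tick; returns how many were cleaned up.
    public static int DrainOnce(ConcurrentQueue<string> pending)
    {
        var stillLocked = new List<string>();
        int cleaned = 0;

        while (pending.TryDequeue(out string dir))
        {
            try
            {
                // A directory that is already gone counts as cleaned.
                if (Directory.Exists(dir))
                    Directory.Delete(dir, recursive: true);
                cleaned++;
            }
            catch (Exception ex) when (ex is IOException || ex is UnauthorizedAccessException)
            {
                // Probably still in use: remember it, but keep going with the rest.
                stillLocked.Add(dir);
            }
        }

        // Re-enqueue after the loop so a locked directory is tried once per tick, not in a spin.
        foreach (string locked in stillLocked)
            pending.Enqueue(locked);

        return cleaned;
    }
}
```

A timer callback would call DrainOnce once per tick; anything left in the queue afterwards is simply picked up on the next tick.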
- Redundant Dispose calls: Removed chk.Dispose(), reader.Dispose(), and writer.Dispose() calls inside using blocks in PdfCompressor.CompressAndOcr.
- Redundant process cleanup: Removed proc.Close() / proc.Dispose() inside using block and simplified while (!HasExited) { WaitForExit(10000) } loop to a single WaitForExit() call in GhostScript.RunCommand.
- Simplified LINQ: Replaced verbose query syntax with FirstOrDefault in ImageProcessor.GetCodecInfoForName.
- XML documentation: Added XML doc comments to all public APIs and key internal methods across the solution.
Example Usage:
// See https://aka.ms/new-console-template for more information
using Utility.Hocr.Enums;
using Utility.Hocr.Pdf;
const string ghostScriptPathToExecutable = @"C:\gs10.03.1\bin\gswin64c.exe";
static void Comp_OnCompressorEvent(string msg)
{
    Console.WriteLine(msg);
}
Console.WriteLine("Hello, World!");
PdfCompressor comp;
List<string> DistillerOptions = new()
{
    "-dSubsetFonts=true",
    "-dCompressFonts=true",
    "-sProcessColorModel=DeviceRGB",
    "-sColorConversionStrategy=sRGB",
    "-sColorConversionStrategyForImages=sRGB",
    "-dConvertCMYKImagesToRGB=true",
    "-dDetectDuplicateImages=true",
    "-dDownsampleColorImages=false",
    "-dDownsampleGrayImages=false",
    "-dDownsampleMonoImages=false",
    "-dColorImageResolution=265",
    "-dGrayImageResolution=265",
    "-dMonoImageResolution=265",
    "-dDoThumbnails=false",
    "-dCreateJobTicket=false",
    "-dPreserveEPSInfo=false",
    "-dPreserveOPIComments=false",
    "-dPreserveOverprintSettings=false",
    "-dUCRandBGInfo=/Remove"
};
using (comp = new PdfCompressor(ghostScriptPathToExecutable, new PdfCompressorSettings
{
    PdfCompatibilityLevel = PdfCompatibilityLevel.Acrobat_7_1_6,
    WriteTextMode = WriteTextMode.Word,
    Dpi = 400,
    ImageType = PdfImageType.Jpg,
    ImageQuality = 100,
    CompressFinalPdf = true,
    DistillerMode = dPdfSettings.prepress,
    DistillerOptions = string.Join(" ", DistillerOptions.ToArray())
}))
{
    comp.OnCompressorEvent += Comp_OnCompressorEvent;
    // Each file is processed on its own thread; the compressor instance is shared.
    Parallel.ForEach(Directory.GetFiles("C:\\pdfin"), file =>
    {
        byte[] data = File.ReadAllBytes(file);
        Tuple<byte[], string> result = comp.CreateSearchablePdf(data, new PdfMeta());
        File.WriteAllBytes("c:\\PDFOUT\\" + Path.GetFileName(file), result.Item1);
    });
}
Use CreateSearchablePdfAsync for non-blocking processing with cancellation support:
using Utility.Hocr.Enums;
using Utility.Hocr.Pdf;
const string ghostScriptPathToExecutable = @"C:\gs10.03.1\bin\gswin64c.exe";
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30));
using var comp = new PdfCompressor(ghostScriptPathToExecutable, new PdfCompressorSettings
{
    PdfCompatibilityLevel = PdfCompatibilityLevel.Acrobat_7_1_6,
    WriteTextMode = WriteTextMode.Word,
    Dpi = 400,
    ImageType = PdfImageType.Jpg,
    ImageQuality = 100,
    CompressFinalPdf = true,
    DistillerMode = dPdfSettings.prepress
});
comp.OnCompressorEvent += msg => Console.WriteLine(msg);
string[] files = Directory.GetFiles("C:\\pdfin");
IEnumerable<Task> tasks = files.Select(async file =>
{
    byte[] data = await File.ReadAllBytesAsync(file, cts.Token);
    Tuple<byte[], string> result = await comp.CreateSearchablePdfAsync(data, new PdfMeta(), cancellationToken: cts.Token);
    await File.WriteAllBytesAsync("c:\\PDFOUT\\" + Path.GetFileName(file), result.Item1, cts.Token);
});
await Task.WhenAll(tasks);
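Note that the Select/Task.WhenAll form above starts every file at once. For large input folders you may want to cap how many PDFs are in flight. One possible approach, sketched here with a hypothetical helper (BoundedAsyncSketch.ForEachBoundedAsync is not part of this library), is to gate the work with a SemaphoreSlim:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedAsyncSketch
{
    public static async Task ForEachBoundedAsync<T>(
        IEnumerable<T> items,
        int maxConcurrency,
        Func<T, CancellationToken, Task> body,
        CancellationToken token = default)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        // ToList starts each task exactly once; the semaphore decides how many run their body.
        List<Task> tasks = items.Select(async item =>
        {
            await gate.WaitAsync(token);
            try
            {
                await body(item, token);
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}
```

With this helper, the body of the Select lambda above would be passed as the body delegate, for example ForEachBoundedAsync(files, 4, async (file, ct) => { ... }, cts.Token), where the limit of 4 is an arbitrary illustration. On .NET 6 and later, the built-in Parallel.ForEachAsync with ParallelOptions.MaxDegreeOfParallelism is an alternative.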