Skip to content

FairfieldTekLLC/Hocr.Net

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hocr.Net

DotNet version of Hocr

C# Library for converting PDF files to Searchable PDF Files

  • Need to batch convert 100 of scan PDF's to Searchable PFS's?
  • Don't want to pay thousands of dollars for a component?

I have personally tested this library with over 110 thousand PDFs. Beyond a few fringe cases the code has performed as it was designed.. I was able to process 110k pdfs (Some hundreds of pages) over a 3 day period using 5 servers.

Internally, Hocr uses Tesseract, GhostScript, iTextSharp and the HtmlAgilityPack. Please check the licensing for each nuget to make sure you are in compliance.

This library IS THREADSAFE so you can process multiple PDF's at the same time in different threads, you do not need to process them one at a time.


Recent Changes

New Features

  • PdfCompressor.CreateSearchablePdfAsync — Added async overload of CreateSearchablePdf with CancellationToken support. Uses native async file I/O (WriteAsync, FlushAsync, ReadAllBytesAsync) and offloads CPU-bound OCR and GhostScript compression to the thread pool via Task.Run. Cancellation is checked at each major stage and OperationCanceledException propagates unwrapped.

Thread Safety & Concurrency Fixes

  • TempData._caches — Replaced Dictionary<string, string> with ConcurrentDictionary<string, string> to prevent data corruption under concurrent access from Parallel.ForEach and background timer threads.
  • TempData.DestroySession — Replaced three separate check-then-act operations (TOCTOU race) with a single atomic TryRemove call.
  • TempData._cleanUpTimerRunning — Replaced non-volatile bool with Interlocked.CompareExchange to guarantee visibility across threads.
  • TempData.Dispose — Added double-dispose guard (Interlocked.CompareExchange), spin-waits for any in-flight timer callback to complete before running cleanup, and unsubscribes the timer event.
  • TempData public methods — Added ObjectDisposedException guards to CreateNewSession, CreateTempFile, and CreateDirectory.
  • PdfCompressor.Dispose — Now stops and disposes its own CleanUpTimer (previously leaked).

Compression Improvements (Lossless)

  • PdfCreator.SetupDocumentWriter — Removed premature _writer.Close() / _writer.Dispose() calls that occurred before the document was opened. This caused SetFullCompression() to have no effect, meaning PDF object streams (images, fonts, metadata) were not Flate-compressed.
  • PdfCreator.SetupDocumentWriter — Fixed CompressionLevel = 100 (invalid for zlib, which accepts 0–9) to PdfStream.BEST_COMPRESSION (9).
  • GhostScript.CompressPdf — Added -dDetectDuplicateImages=true (deduplicates identical images across pages) and -dCompressFonts=true (compresses embedded font data) to the GhostScript command line. Both are lossless optimizations that reduce output size without affecting image quality.

Bug Fixes

  • OcrController.CreateHocr — Fixed hardcoded "eng" language; now correctly passes the language parameter to the Tesseract engine.
  • PdfCompressor.CreateSearchablePdf — Moved null/empty validation of fileData before session creation and GetPages call (previously validated after use).
  • PdfCompressor.PdfSigned — Added using declaration for iTextSharp.text.pdf.PdfReader (file handle was leaked on every call).

Resource Leak Fixes

  • PdfCreator.WritePageDrawBlocks — Two Graphics objects and four Pen objects were never disposed. Consolidated to a single Graphics and wrapped all GDI+ objects in using statements.
  • GhostScript.RunCommand — Redirected stdout/stderr were never drained, which can deadlock the process when GhostScript output fills the OS pipe buffer. Added BeginOutputReadLine()/BeginErrorReadLine() to drain streams asynchronously.

Performance Improvements

  • PdfCreator.WriteDirectContentBaseFont.CreateFont() was called inside the per-line loop (identical result each time). Hoisted to a single call before the loop.
  • PdfCreator.WriteUnderlayContentBaseFont.CreateFont() was called inside the per-word loop. Hoisted to a single call before the loop.
  • TempData.CleanUpFiles — Previously processed only one directory per 5-second timer tick and halted the entire batch on the first locked directory. Now processes all queued items per tick, skipping locked directories and re-enqueuing them individually.
  • TempData.CreateNewSession — Eliminated per-call Regex allocation and redundant Path.Combine computations. Uses Guid.ToString("N") for filesystem-safe names directly.
  • TempData.CreateTempFile — Replaced DateTime.Now.Second + Millisecond (collision-prone) with Guid.NewGuid() for unique filenames.
  • TempData.Dispose — Retry loop now sleeps only after failures instead of before every attempt.
  • TempData singleton — Replaced Activator.CreateInstance reflection with direct constructor call.

Code Quality

  • Redundant Dispose calls — Removed chk.Dispose(), reader.Dispose(), and writer.Dispose() calls inside using blocks in PdfCompressor.CompressAndOcr.
  • Redundant process cleanup — Removed proc.Close()/proc.Dispose() inside using block and simplified while (!HasExited) { WaitForExit(10000) } loop to a single WaitForExit() call in GhostScript.RunCommand.
  • Simplified LINQ — Replaced verbose query syntax with FirstOrDefault in ImageProcessor.GetCodecInfoForName.
  • XML documentation — Added XML doc comments to all public APIs and key internal methods across the solution.

Use Hocr!

Example Usage:

// See https://aka.ms/new-console-template for more information

using Utility.Hocr.Enums;
using Utility.Hocr.Pdf;

const string ghostScriptPathToExecutable = @"C:\gs10.03.1\bin\gswin64c.exe";

static void Comp_OnCompressorEvent(string msg)
{
    Console.WriteLine(msg);
}

Console.WriteLine("Hello, World!");
PdfCompressor comp;
List<string> DistillerOptions = new()
{
    "-dSubsetFonts=true",
    "-dCompressFonts=true",
    "-sProcessColorModel=DeviceRGB",
    "-sColorConversionStrategy=sRGB",
    "-sColorConversionStrategyForImages=sRGB",
    "-dConvertCMYKImagesToRGB=true",
    "-dDetectDuplicateImages=true",
    "-dDownsampleColorImages=false",
    "-dDownsampleGrayImages=false",
    "-dDownsampleMonoImages=false",
    "-dColorImageResolution=265",
    "-dGrayImageResolution=265",
    "-dMonoImageResolution=265",
    "-dDoThumbnails=false",
    "-dCreateJobTicket=false",
    "-dPreserveEPSInfo=false",
    "-dPreserveOPIComments=false",
    "-dPreserveOverprintSettings=false",
    "-dUCRandBGInfo=/Remove"
};

using (comp = new PdfCompressor(ghostScriptPathToExecutable, new PdfCompressorSettings
    {
        PdfCompatibilityLevel = PdfCompatibilityLevel.Acrobat_7_1_6,
        WriteTextMode = WriteTextMode.Word,
        Dpi = 400,
        ImageType = PdfImageType.Jpg,
        ImageQuality = 100,
        CompressFinalPdf = true,
        DistillerMode = dPdfSettings.prepress,
        DistillerOptions = string.Join(" ", DistillerOptions.ToArray())
    }))
{
    comp.OnCompressorEvent += Comp_OnCompressorEvent;
    Parallel.ForEach(Directory.GetFiles("C:\\pdfin"), file =>
        {
            byte[] data = File.ReadAllBytes(file);
            Tuple<byte[], string> result = comp.CreateSearchablePdf(data, new PdfMeta());
            File.WriteAllBytes("c:\\PDFOUT\\" + Path.GetFileName(file), result.Item1);
        }
    );
}

Async Usage

Use CreateSearchablePdfAsync for non-blocking processing with cancellation support:

using Utility.Hocr.Enums;
using Utility.Hocr.Pdf;

const string ghostScriptPathToExecutable = @"C:\gs10.03.1\bin\gswin64c.exe";

using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30));

using var comp = new PdfCompressor(ghostScriptPathToExecutable, new PdfCompressorSettings
{
    PdfCompatibilityLevel = PdfCompatibilityLevel.Acrobat_7_1_6,
    WriteTextMode = WriteTextMode.Word,
    Dpi = 400,
    ImageType = PdfImageType.Jpg,
    ImageQuality = 100,
    CompressFinalPdf = true,
    DistillerMode = dPdfSettings.prepress
});

comp.OnCompressorEvent += msg => Console.WriteLine(msg);

string[] files = Directory.GetFiles("C:\\pdfin");
IEnumerable<Task> tasks = files.Select(async file =>
{
    byte[] data = await File.ReadAllBytesAsync(file, cts.Token);
    Tuple<byte[], string> result = await comp.CreateSearchablePdfAsync(data, new PdfMeta(), cancellationToken: cts.Token);
    await File.WriteAllBytesAsync("c:\\PDFOUT\\" + Path.GetFileName(file), result.Item1, cts.Token);
});

await Task.WhenAll(tasks);

About

DotNet version of Hocr

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages