guangcode/norwegian-invoice-ocr

norwegian-invoice-ocr

A zero-dependency Java library for extracting fields from Norwegian invoices.

Provides production-tested keyword rules and parsers for matching OCR output to invoice fields. Works with any OCR backend — AWS Textract, Google Vision, Azure Document Intelligence, or plain text. Bring your own OCR; this library handles the Norwegian-specific matching logic.

What it extracts

| Field | Examples |
|---|---|
| Invoice number | Fakturanummer, Invoice No, FACTURA N° |
| Invoice date | Fakturadato, Date of issue, Fecha factura |
| Due date | Forfallsdato, Betalingsfrist, Due Date |
| Amount | Sum inkl.mva, Total, Beløp å betale (NOK) |
| Bank account | Bankkonto, Kontonummer, IBAN |
| KID reference | KID, OCR, Betalingsreferanse/KID |
| Org number | Norwegian 9-digit org numbers from any context |

Quick start

InvoiceExtractor extractor = new InvoiceExtractor();

// Convert your OCR output to label-value pairs
List<OcrLabelValue> pairs = List.of(
    new OcrLabelValue("Fakturanummer", "INV-2024-001"),
    new OcrLabelValue("Fakturadato",   "15.03.2024"),
    new OcrLabelValue("Forfallsdato",  "15.04.2024"),
    new OcrLabelValue("Sum inkl.mva",  "12 500,00"),
    new OcrLabelValue("Bankkonto",     "1234.56.78901"),
    new OcrLabelValue("KID",           "12345678")
);

InvoiceResult result = extractor.extract(pairs, List.of());

result.invoiceNumber().ifPresent(n -> System.out.println("Invoice: " + n));
result.amount().ifPresent(a -> System.out.println("Amount: " + a + " NOK"));
result.dueDate().ifPresent(d -> System.out.println("Due: " + d));
System.out.printf("Found %d/7 fields%n", result.foundFieldCount());

AWS Textract integration

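// Convert Textract SummaryFields to OcrLabelValue pairs.
// pageIndex is not defined in this snippet; the caller must supply it
// (for example, the zero-based loop index over the response's expense documents).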
List<OcrLabelValue> pairs = expenseDocument.summaryFields().stream()
    .filter(f -> f.labelDetection() != null && f.valueDetection() != null)
    .map(f -> new OcrLabelValue(
        f.labelDetection().text(),
        f.valueDetection().text(),
        f.labelDetection().confidence(),
        pageIndex + 1))
    .toList();

// LINE blocks for invoice type detection (Faktura vs Kreditnota)
List<String> lines = expenseDocument.blocks().stream()
    .filter(b -> b.blockType() == BlockType.LINE)
    .map(Block::text)
    .toList();

InvoiceResult result = extractor.extract(pairs, lines);

What's inside

Keyword weight table (NorwegianInvoiceKeywords)

150 keywords covering invoice labels from Norwegian, Swedish, Danish, English, Spanish, French, Italian and German suppliers. Each keyword has a priority weight calibrated on 10k+ real Norwegian invoices.

[
  {"t": 1, "c": "Fakturanummer",  "w": 102},
  {"t": 1, "c": "Invoice Number", "w": 100},
  {"t": 2, "c": "Fakturadato",    "w": 101},
  {"t": 3, "c": "Forfallsdato",   "w": 101},
  {"t": 4, "c": "Sum inkl.mva",   "w": 101},
  ...
]
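To make the role of the weights concrete, here is a minimal sketch of one plausible reading: when several labels map to the same field, the value whose label matched the highest-weight rule wins. In the JSON above, `t` appears to be the field type, `c` the label text and `w` the weight (the same order as the `KeywordRule` constructor). The `Rule` and `Pair` records and the `match` method are hypothetical stand-ins, not the library's `KeywordRule`, `OcrLabelValue` or actual matching logic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeywordMatchSketch {
    record Rule(int t, String c, int w) {}      // mirrors the JSON keys t, c, w
    record Pair(String label, String value) {}  // stand-in for an OCR label-value pair

    /** For each field type t, keeps the value whose label matched the heaviest rule. */
    public static Map<Integer, String> match(List<Rule> rules, List<Pair> pairs) {
        Map<Integer, String> best = new HashMap<>();
        Map<Integer, Integer> bestWeight = new HashMap<>();
        for (Pair p : pairs) {
            for (Rule r : rules) {
                // Assumption: labels are compared exactly, ignoring case and outer whitespace.
                if (!p.label().trim().equalsIgnoreCase(r.c())) continue;
                if (r.w() > bestWeight.getOrDefault(r.t(), Integer.MIN_VALUE)) {
                    bestWeight.put(r.t(), r.w());
                    best.put(r.t(), p.value());
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule(1, "Fakturanummer", 102),
            new Rule(1, "Invoice Number", 100));
        List<Pair> pairs = List.of(
            new Pair("Invoice Number", "A-1"),
            new Pair("Fakturanummer", "B-2"));
        // Both labels map to field type 1; the rule with weight 102 wins.
        System.out.println(match(rules, pairs).get(1));
    }
}
```

A real matcher would likely also need fuzzy or substring matching to cope with OCR noise; the exact comparison here only keeps the sketch short.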

Load your own rules from JSON:

List<KeywordRule> rules = NorwegianInvoiceKeywords.fromJson(myJsonString);
InvoiceExtractor extractor = new InvoiceExtractor(rules);

Or add rules to the defaults:

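// Copy the defaults first, in case defaultRules() returns an unmodifiable list
List<KeywordRule> rules = new ArrayList<>(NorwegianInvoiceKeywords.defaultRules());
rules.add(new KeywordRule(1, "Vår referanse", 98));
InvoiceExtractor extractor = new InvoiceExtractor(rules);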

Amount parser (InvoiceAmountParser)

Handles the number formats commonly seen on real invoices:

| Input | Output | Format |
|---|---|---|
| 8.431,50 | 8431.50 | Norwegian/German |
| 8,431.50 | 8431.50 | English/US |
| 8 431,50 | 8431.50 | Norwegian with space |
| 8.888.431,50 | 8888431.50 | Multi-separator |
| 22750,- | 22750 | Norwegian shorthand |
| (NOK) 178 750,00 | 178750.00 | Currency prefix |
| -269100kr | 269100 | Currency suffix |
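As an illustration of how this kind of normalization can be done, the sketch below uses a simple heuristic: the last `.` or `,` is treated as the decimal separator only when it is followed by one or two digits, and every other separator is treated as grouping. This is not the library's implementation; `AmountSketch` and `parse` are hypothetical names, and the sign is dropped to match the last table row.

```java
import java.math.BigDecimal;

final class AmountSketch {
    /** Parses an amount string; throws NumberFormatException if no digits remain. */
    public static BigDecimal parse(String raw) {
        // Keep only digits and the two candidate separators (drops currency, spaces, sign).
        String s = raw.replaceAll("[^0-9.,]", "");
        // Remove separators left dangling at the end, e.g. "22750,-" becomes "22750".
        s = s.replaceAll("[.,]+$", "");
        int lastSep = Math.max(s.lastIndexOf('.'), s.lastIndexOf(','));
        int decimals = lastSep < 0 ? 0 : s.length() - lastSep - 1;
        if (decimals == 1 || decimals == 2) {
            // Last separator is the decimal point; earlier ones are grouping.
            String whole = s.substring(0, lastSep).replaceAll("[.,]", "");
            return new BigDecimal(whole + "." + s.substring(lastSep + 1));
        }
        // No decimal part: every separator is grouping.
        return new BigDecimal(s.replaceAll("[.,]", ""));
    }

    public static void main(String[] args) {
        System.out.println(parse("8.431,50"));         // 8431.50
        System.out.println(parse("(NOK) 178 750,00")); // 178750.00
        System.out.println(parse("22750,-"));          // 22750
    }
}
```

The heuristic is ambiguous for inputs such as `1.234`, which it reads as 1234; a production parser would need extra context, such as the currency or the supplier's locale, to settle those cases.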

Date parser (InvoiceDateParser)

Supports 14 date formats, plus Norwegian, English, Spanish and German month names:

16.08.2024  →  2024-08-16
16/08/24    →  2024-08-16
4. januar 2024  →  2024-01-04
20th January 2021  →  2021-01-20
29. november 2023  →  2023-11-29
17.02.2024\nNet 15  →  2024-02-17  (trailing line stripped)
30.01.2024(Netto 30 dager)  →  2024-01-30  (parenthetical stripped)
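The examples above can be reproduced with a small amount of cleanup followed by two patterns. The sketch below is an assumption-laden illustration, not the library's parser: `DateSketch` is a hypothetical name, it covers only day-first numeric dates and Norwegian or English month names, and it assumes two-digit years belong to the 2000s.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateSketch {
    private static final Pattern NUMERIC =
        Pattern.compile("(\\d{1,2})[./](\\d{1,2})[./](\\d{4}|\\d{2})");
    private static final Pattern TEXTUAL =
        Pattern.compile("(\\d{1,2})(?:st|nd|rd|th|\\.)?\\s+(\\p{L}+)\\s+(\\d{4})");
    // Three-letter prefixes per month; Norwegian and English only in this sketch.
    private static final String[][] MONTHS = {
        {"jan"}, {"feb"}, {"mar"}, {"apr"}, {"mai", "may"}, {"jun"},
        {"jul"}, {"aug"}, {"sep"}, {"okt", "oct"}, {"nov"}, {"des", "dec"}
    };

    public static Optional<LocalDate> parse(String raw) {
        // Keep the first line only, then drop parentheticals such as "(Netto 30 dager)".
        String s = raw.strip().split("\\R", 2)[0].replaceAll("\\(.*?\\)", "").strip();
        try {
            Matcher m = NUMERIC.matcher(s);
            if (m.matches()) {
                int year = Integer.parseInt(m.group(3));
                if (year < 100) year += 2000; // assumption: two-digit years are 20xx
                return Optional.of(LocalDate.of(
                    year, Integer.parseInt(m.group(2)), Integer.parseInt(m.group(1))));
            }
            m = TEXTUAL.matcher(s);
            if (m.matches()) {
                int month = monthOf(m.group(2));
                if (month > 0) {
                    return Optional.of(LocalDate.of(
                        Integer.parseInt(m.group(3)), month, Integer.parseInt(m.group(1))));
                }
            }
        } catch (DateTimeException e) {
            // Impossible dates such as 31.02.2024 are reported as "not parsed".
        }
        return Optional.empty();
    }

    private static int monthOf(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (int i = 0; i < MONTHS.length; i++) {
            for (String prefix : MONTHS[i]) {
                if (lower.startsWith(prefix)) return i + 1;
            }
        }
        return 0; // unknown month name
    }

    public static void main(String[] args) {
        System.out.println(parse("4. januar 2024").orElseThrow());             // 2024-01-04
        System.out.println(parse("30.01.2024(Netto 30 dager)").orElseThrow()); // 2024-01-30
    }
}
```

Prefix matching keeps the month table short but also accepts unrelated words that happen to start with a month prefix, so a stricter implementation would match full names.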

KID validator (KidValidator)

Norwegian payment reference number validation with Mod10 (Luhn) and Mod11 check digits. Handles space-separated OCR output by splitting and picking the longest valid segment.
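For readers unfamiliar with the two check-digit schemes, the sketch below shows one way to implement them, together with one possible reading of "longest valid segment". It is not the library's `KidValidator`: `KidSketch` and its methods are hypothetical, the Mod11 variant assumes weights 2 to 7 cycling from the right, and the 2 to 25 digit length limit is an assumption about KID numbers.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

final class KidSketch {
    /** Luhn (Mod10) check over a digit string whose last digit is the check digit. */
    public static boolean mod10(String digits) {
        int sum = 0;
        boolean doubled = false; // the check digit itself is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubled) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubled = !doubled;
        }
        return sum % 10 == 0;
    }

    /** Mod11 check with weights 2..7 cycling from the right; last digit is the check digit. */
    public static boolean mod11(String digits) {
        int sum = 0;
        int weight = 2;
        for (int i = digits.length() - 2; i >= 0; i--) {
            sum += (digits.charAt(i) - '0') * weight;
            weight = (weight == 7) ? 2 : weight + 1;
        }
        int check = (11 - sum % 11) % 11; // 0..10; 10 cannot be written as one digit
        return check != 10 && check == digits.charAt(digits.length() - 1) - '0';
    }

    public static boolean isValid(String s) {
        return s.matches("\\d{2,25}") && (mod10(s) || mod11(s));
    }

    /** Splits OCR text on whitespace and returns the longest segment that validates. */
    public static Optional<String> longestValidSegment(String ocrText) {
        return Arrays.stream(ocrText.trim().split("\\s+"))
            .filter(KidSketch::isValid)
            .max(Comparator.comparingInt(String::length));
    }

    public static void main(String[] args) {
        System.out.println(mod10("79927398713")); // true
        System.out.println(mod11("124"));         // true
        System.out.println(longestValidSegment("KID 124 12345674").orElse("none")); // 12345674
    }
}
```

Accepting either check digit scheme makes false positives more likely on short numbers, which is one reason to prefer the longest valid segment.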

Org number validator (NorwegianOrgNumberValidator)

Norwegian organisation number (9-digit, Mod11 check digit) validation and extraction from arbitrary text.
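The Mod11 check for organisation numbers weights the first eight digits with 3, 2, 7, 6, 5, 4, 3, 2 and compares the result with the ninth digit. The sketch below shows that check plus a simple regex-based extraction. `OrgNumberSketch` and its methods are hypothetical names, and the candidate pattern (nine digits, optionally grouped in threes with single spaces) is an assumption rather than the library's actual rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrgNumberSketch {
    private static final int[] WEIGHTS = {3, 2, 7, 6, 5, 4, 3, 2};
    // Nine digits not embedded in a longer digit run, optionally written as "974 760 673".
    private static final Pattern CANDIDATE =
        Pattern.compile("(?<!\\d)(\\d{3}) ?(\\d{3}) ?(\\d{3})(?!\\d)");

    public static boolean isValid(String digits) {
        if (!digits.matches("\\d{9}")) return false;
        int sum = 0;
        for (int i = 0; i < WEIGHTS.length; i++) {
            sum += (digits.charAt(i) - '0') * WEIGHTS[i];
        }
        int check = (11 - sum % 11) % 11; // a result of 10 means no valid check digit exists
        return check != 10 && check == digits.charAt(8) - '0';
    }

    /** Returns every valid organisation number found in the text, without spaces. */
    public static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String digits = m.group(1) + m.group(2) + m.group(3);
            if (isValid(digits)) found.add(digits);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isValid("974760673")); // true
        System.out.println(extract("Org.nr: NO 974 760 673 MVA, tlf 22 33 44 55")); // [974760673]
    }
}
```

Running the check digit test on every candidate is what lets extraction work on arbitrary text: other nine-digit runs, such as phone or account fragments, usually fail it.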

Invoice type classifier (InvoiceTypeClassifier)

Detects Kreditnota (credit notes) from document text lines — matches "kreditnota" and "Tilgode" as standalone labels.

Dependencies

None are required at runtime. Jackson is optional and only needed if you call NorwegianInvoiceKeywords.fromJson().

<dependency>
    <groupId>no.skatt</groupId>
    <artifactId>norwegian-invoice-ocr</artifactId>
    <version>1.0.0</version>
</dependency>

Running tests

mvn test

License

Apache 2.0
