OCR for Financial and Legal Documents

Introduction

Optical recognition of the content of (usually paper) documents has been broadly available since the early 1990s, in line with the availability of a range of document scanners. While both scanner hardware and recognition algorithms have progressed immensely since then, the basic assumptions of content recognition remain the same: the paper is scanned (typically at 300 DPI or more), and the image is converted to black and white using a set of filters that attempt to remove image noise, dirt, etc. and to emphasize those features of the scanned characters that allow the recognition algorithms to identify the characters as accurately as possible.
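The black-and-white conversion step can be illustrated with a small sketch. It assumes Otsu's method for choosing the threshold, which is one common choice and not necessarily what any particular engine uses; the function names and the tiny list-of-lists "image" are hypothetical, and noise filtering is left out.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Map a grayscale image (rows of 0-255 ints) to pure black (0) and white (255)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Hypothetical 2x2 scan: two dark "ink" pixels and two light "paper" pixels.
scan = [[20, 200], [220, 30]]
bw = binarize(scan)
```

In practice this step would be preceded by despeckling or other noise filters, as the paragraph above describes.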

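Text recognition algorithms typically perform three steps: tokenization, identification of the individual characters, and dictionary-based correction of the recognized text.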

Depending on the use case, the available resources (required recognition speed, hardware) and the principles of the given text recognition algorithm, the balance between the accuracy of individual character identification and the strength of the dictionary-based correction may shift. Generally, however, in all publicly available text recognition engines today, dictionary-based correction plays the prominent role in overall text recognition accuracy.

There are, however, numerous cases where dictionary-based correction is impossible due to the nature of the documents to be digitized.
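A minimal sketch of why dictionary-based correction helps with ordinary words but not with identifiers. It assumes a simple closest-match lookup using Python's standard `difflib` module; the vocabulary, the `correct_token` helper and the 0.8 cutoff are all hypothetical choices made for illustration.

```python
import difflib

# Hypothetical vocabulary of words expected in the recognized text.
VOCABULARY = ["invoice", "receipt", "total"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace a token with its closest dictionary word, if one is similar enough."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# A misread word is pulled back to the dictionary entry...
word = correct_token("lnvoice")

# ...but a misread identifier (letter O instead of digit 0) has no dictionary
# entry to match against, so the error passes through unchanged.
identifier = correct_token("4471-O9")
```

Document numbers, amounts and account numbers are exactly the kind of content for which no dictionary exists, so their accuracy depends on the character identification itself.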

Our Solution

Our OCR technology targets the digitization of paper forms of various kinds (e.g. financial, manufacturing and trade documents), which share a set of common features including:

In such cases, the quality of tokenization and character identification plays a critical role in the usefulness of the digitized documents.

Our OCR is designed to recognize the content of financial and legal documents, especially those that have form- or table-based layouts and structured content.

It is optimized to maximize the recognition rate of individual characters, the tokenization of sentences and words, and the recognition of table layouts. It also uses intelligent dictionaries and adaptive validation algorithms that understand the data likely to appear on invoices, reports, receipts and other types of financial and legal documents, and that are able to detect and correct typical recognition errors (using, for example, checksums in identifiers and account numbers, conventions for writing addresses, tax rates, first and last names, and numbers that should add up to totals).
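Two of the validation ideas mentioned above, checksums in identifiers and numbers that should add up to totals, can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: it uses the Luhn checksum as a stand-in (real identifiers and account numbers use scheme-specific checksums), and the confusion table and all function names are hypothetical.

```python
from decimal import Decimal

# Hypothetical table of letters an OCR engine may emit in place of digits.
CONFUSABLE = {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}


def luhn_valid(digits):
    """Return True if a non-empty string of ASCII digits passes the Luhn check."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def repair_identifier(raw):
    """Drop whitespace, map confusable letters to digits, and accept the result
    only if the checksum confirms it; otherwise return None."""
    cleaned = "".join(CONFUSABLE.get(ch, ch) for ch in raw if not ch.isspace())
    return cleaned if luhn_valid(cleaned) else None


def totals_consistent(line_items, total):
    """Check that recognized line amounts (as strings) add up to the recognized total."""
    return sum((Decimal(x) for x in line_items), Decimal("0")) == Decimal(total)


# The letter "I" was recognized where the digit "1" was printed.
fixed = repair_identifier("7992 7398 7I3")
# A misread final digit cannot be confirmed by the checksum.
rejected = repair_identifier("79927398710")
```

`Decimal` is used for the totals check so that amounts such as 19.99 compare exactly, which binary floating point does not guarantee. A value that fails either check can be flagged for review instead of being passed on silently.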

The system is also taught to recognize fonts that can be encountered on:

The Active Text OCR is most often used as part of an automated document processing workflow.