An Overview of Optical Character Recognition Processes at Canon Research America
Abstract
Optical character recognition (OCR) is the process of converting a bitmap image of a
document into machine-readable text. Character recognition has many useful applications,
such as eliminating the need for storing paper copies of records and allowing for fast text
searches on a large archive of scanned and stored documents. At Canon Research America,
the Personal Imaging Computer System (PICS) project is developing a document
storage system consisting of an integrated laser printer, greyscale scanner, and Pentium
computer system running Windows NT. The system allows a document to be scanned
and stored in a database either as a compressed image or as an OCR-processed text
document. The document can then be retrieved from the PICS unit across a network using
a special client program, or stored in the database until it is needed, at which time it can be
printed out again. The goal of the PICS project is to develop a system that reduces the
total number of paper documents that need to be stored -- a "paperless office
solution" -- and also to offer a method of storing large amounts of text in a format that is
easy to search and index.
One of the central systems that make PICS possible is character recognition.
Developing a high-accuracy OCR system involves a broad range of issues, as the
process can draw on image processing and manipulation, machine learning, and
natural language processing. This paper presents a general description of the optical
character recognition process, followed by an overview of specific algorithms that have
been researched at Canon Research America.