An Overview of Optical Character Recognition Processes at Canon Research America
Optical character recognition (OCR) is the process of converting a bitmap image of text into a text document. Character recognition has many useful applications, such as eliminating the need to store paper copies of records and enabling fast text searches over a large archive of scanned documents. At Canon Research America, the Personal Imaging Computer System (PICS) project is developing a document storage system consisting of an integrated laser printer, greyscale scanner, and Pentium computer system running Windows NT. The system allows a document to be scanned and stored in a database either as a compressed image or as an OCR-processed text document. The document can then be retrieved from the PICS unit across a network using a special client program, or kept in the database until it is needed, at which time it can be printed out again. The goal of the PICS project is to develop a system that reduces the number of paper documents that must be stored (a "paperless office solution") and that offers a method of storing large amounts of text in a format that is easy to search and index.
One of the central systems that makes PICS possible is character recognition. Developing a high-accuracy OCR system raises a broad range of issues, since the process can draw on image processing and manipulation, machine learning, and natural language processing. This paper presents a general description of the optical character recognition process, followed by an overview of specific algorithms that have been researched at Canon Research America.