An Overview of Optical Character Recognition Processes at Canon Research America
Optical character recognition (OCR) is the process of converting a bitmap image of text into a text document. Character recognition has many useful applications, such as eliminating the need to store paper copies of records and allowing fast text searches over a large archive of scanned and stored documents. At Canon Research America, the Personal Imaging Computer System (PICS) project is developing a document storage system consisting of an integrated laser printer, greyscale scanner, and Pentium computer system running Windows NT. The system allows a document to be scanned and stored in a database either as a compressed image or as an OCR-processed text document. The document can then be retrieved from the PICS unit across a network using a special client program, or kept in the database until it is needed, at which time it can be printed out again. The goal of the PICS project is to develop a system that allows an overall reduction in the number of paper documents that need to be stored (a "paperless office solution") and that also offers a method of storing large amounts of text in a format that is easy to search and index. Character recognition is one of the central systems that makes PICS possible. Developing a high-accuracy OCR system raises a broad range of issues, since the process can draw on image processing and manipulation, machine learning, and natural language processing. This paper presents a general description of the optical character recognition process, followed by an overview of specific algorithms that have been researched at Canon Research America.
iii, 25 p.
U.S. copyright laws protect this material. Commercial use or distribution of this material is not permitted without prior written