An Overview of Optical Character Recognition Processes at Canon Research America
Authors
Evans, Colin
Issue Date
1996
Type
Thesis
Language
en_US
Abstract
Optical character recognition (OCR) is the process of taking a bitmap picture and
converting it into a text document. Character recognition has many useful applications,
such as eliminating the need for storing paper copies of records and allowing for fast text
searches on a large archive of scanned and stored documents. At Canon Research America,
the Personal Imaging Computer System (PICS) project is trying to develop a document
storage system consisting of an integrated laser printer, greyscale scanner, and Pentium
computer system running Windows NT. The system allows a document to be scanned
and stored in a database either as a compressed image or as an OCR-processed text
document. The document can then be retrieved from the PICS unit across a network using
a special client program, or stored in the database until it is needed, at which time it can be
printed out again. The goal of the PICS project is to develop a system that reduces the
total number of paper documents that need to be stored -- a "paperless office
solution" -- and also to offer a method of storing large amounts of text in a format that is
easy to search and index.
One of the central systems that make PICS possible is character recognition. The
issues involved in developing a high-accuracy OCR system are broad, as the
process draws on image processing and manipulation, machine learning, and
natural language processing. This paper presents a general description of the optical
character recognition process, followed by an overview of specific algorithms that have
been researched at Canon Research America.
Description
iii, 25 p.
License
U.S. copyright laws protect this material. Commercial use or distribution of this material is not permitted without prior written