FIG. 3 illustrates separation of algorithmic infrastructure from storage infrastructure. FIG. 3 shows that images 302 of documents may be stored in a data store 304. In some aspects, the data store 304 may utilize a “Mongo” database. A document processing component 306 reads the images 302 from the data store 304 to detect handwriting on the documents. The document processing component 306 may utilize a variety of technologies to perform this task. For example, the processing may be implemented in a variety of programming languages, including Java, C++, Python, Perl, or other languages known in the art. In some aspects, the document processing component 306 may rely on MatLab® functions.
By separating the processing infrastructure from the document storage infrastructure as illustrated by FIG. 3, the infrastructure is able to scale to available resources required to handle millions of documents in an automatic workflow. This separation also allows users to direct and annotate processing results. Users can point the system to collections of scanned images and route the processed result to the appropriate language specialist. Users can annotate the machine learning results as being incorrect or missing. The annotations may be used for further analysis such as algorithm refinement.
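The separation described above may be illustrated, in one non-limiting sketch, by having the processing code depend only on a small storage interface. The names `ImageStore`, `InMemoryStore`, `detect_handwriting`, and `process_all` are hypothetical and are not taken from the disclosure, and the detection function is a trivial placeholder rather than the disclosed handwriting detector. An in-memory store stands in for the data store 304 so that the example runs without a database.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Protocol


class ImageStore(Protocol):
    """Hypothetical storage-side interface; any database could sit behind it."""

    def image_ids(self) -> Iterable[str]: ...
    def load(self, image_id: str) -> bytes: ...
    def save_result(self, image_id: str, result: dict) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in for the data store 304 (assumption: a real store would be a database)."""

    images: Dict[str, bytes]
    results: Dict[str, dict] = field(default_factory=dict)

    def image_ids(self) -> Iterable[str]:
        return list(self.images)

    def load(self, image_id: str) -> bytes:
        return self.images[image_id]

    def save_result(self, image_id: str, result: dict) -> None:
        self.results[image_id] = result


def detect_handwriting(image: bytes) -> bool:
    """Placeholder detector: reports True if any grayscale byte is dark.

    This is NOT the disclosed algorithm; it only marks where the
    processing component 306 would plug in.
    """
    return any(value < 128 for value in image)


def process_all(store: ImageStore) -> int:
    """Processing side: reads images and writes results through the interface only."""
    count = 0
    for image_id in store.image_ids():
        found = detect_handwriting(store.load(image_id))
        # An empty annotation list leaves room for later user corrections.
        store.save_result(image_id, {"handwriting": found, "annotations": []})
        count += 1
    return count
```

Because `process_all` never refers to a concrete database, the storage back end and the processing workers may, under this assumption, be replaced or scaled independently of each other.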
One goal of binarization is to convert the input document so that the foreground, which includes the handwriting, is logical true. This seemingly simple procedure proves to be a difficult task due to variations in illumination, condition of the paper, and other factors such as variations in the ink. The success of the later stages of handwriting recognition and language classification, however, depends on a good binarization.
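As a non-limiting illustration of the goal stated above, the following sketch binarizes a grayscale image with a single global threshold chosen by Otsu's method. The disclosure does not name a particular thresholding technique, so Otsu's method is an assumption made here for illustration, as are the function names `otsu_threshold` and `binarize` and the assumption that ink is darker than paper. A single global threshold is only a baseline and may perform poorly under the uneven illumination noted above.

```python
from typing import List


def otsu_threshold(pixels: List[List[int]]) -> int:
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    weight_dark = 0  # number of pixels at or below the candidate threshold
    sum_dark = 0     # sum of their gray levels
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Proportional to the between-class variance for this split.
        score = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels: List[List[int]]) -> List[List[bool]]:
    """Map dark pixels (assumed ink) to True and light pixels (assumed paper) to False."""
    t = otsu_threshold(pixels)
    return [[value <= t for value in row] for row in pixels]
```

For example, `binarize([[250, 240, 30], [245, 20, 235]])` marks only the two dark pixels as foreground.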