FIG. 2 illustrates an example process flow for language classification. The flow 200 of FIG. 2 includes binarization 202, recognition 204, and language classification 206. The simple process flow shown in FIG. 2 does not reflect the combined algorithmic complexity of integrating and evaluating the various segmentation, recognition, and language classification approaches. For example, considering the techniques listed in FIG. 1, the total number of system configurations is the product of the number of candidate algorithms at each stage. In this case there are four binarization techniques, two recognition techniques, and three language classification techniques, for 24 total configurations (=4×2×3). When coupled with all the parameter settings of each individual algorithm, this number can easily strain test and evaluation capacity.
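The configuration count can be checked with a short sketch. The technique names below are placeholders assumed for illustration only; the actual techniques are those listed in FIG. 1 and are not reproduced here.

```python
from itertools import product

# Placeholder names (assumptions); the actual techniques are listed in FIG. 1.
BINARIZATION = ["bin_A", "bin_B", "bin_C", "bin_D"]        # four techniques
RECOGNITION = ["rec_A", "rec_B"]                           # two techniques
LANGUAGE_CLASSIFICATION = ["lang_A", "lang_B", "lang_C"]   # three techniques


def enumerate_configurations():
    """Return every (binarization, recognition, classification) combination."""
    return list(product(BINARIZATION, RECOGNITION, LANGUAGE_CLASSIFICATION))


configurations = enumerate_configurations()
print(len(configurations))  # 24
```

Each tuple is one system configuration before any per-algorithm parameter settings are considered; sweeping parameters multiplies the count further.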
Dividing the processing of a document into three stages helps confine this complexity by allowing the algorithms at each stage to be optimized as an independent problem. Complexity is further managed by separating the document handling and user interfaces from the algorithm development within the evaluation architecture, as shown in FIG. 3.
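One way to picture the staged decomposition is as three interchangeable functions composed in sequence. The following is a minimal sketch under assumed interfaces, not the evaluation architecture of FIG. 3; the function names, the list-of-lists image format, and the toy stage implementations are all hypothetical.

```python
def run_flow(image, binarize, recognize, classify):
    """Compose the three stages of flow 200; each stage can be swapped independently."""
    binary = binarize(image)   # binarization 202
    text = recognize(binary)   # recognition 204
    return classify(text)      # language classification 206


def threshold_binarize(image, threshold=128):
    """Toy binarizer (assumption): global threshold over grayscale pixel values."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def dummy_recognize(binary):
    """Stand-in recognizer (assumption): returns fixed text, performs no real recognition."""
    return "hello world"


def dummy_classify(text):
    """Stand-in classifier (assumption): labels ASCII-only text as 'en'."""
    return "en" if text.isascii() else "unknown"


label = run_flow([[0, 255], [200, 10]], threshold_binarize, dummy_recognize, dummy_classify)
print(label)
```

Because `run_flow` depends only on each stage's input and output, a different binarizer, recognizer, or classifier can be substituted and evaluated without changing the other two stages.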