Once each feature was detected and encoded into a number, the language classification process could start using the encoded numbers. In an embodiment, one approach was based on the successful Cavnar and Trenkle technique, which operates on characters rather than handwriting, and in which histograms of n-grams are formed to create a language profile. An n-gram is an occurrence of n consecutive features together; a bi-gram, for example, is an occurrence of two features together. The letters ‘th’ are the most common character bi-gram in English. The language profile vector of normalized n-gram counts is developed during training and stored for each language. During testing, an n-gram profile test vector of the test document is compared to each stored profile vector. The “closest match” is the reported language. There have been multiple proposals for measuring the distance between the profile vector and the test vector.
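By way of illustration only, the character-based profiling described above might be sketched as follows. The sketch uses the rank-based “out-of-place” measure, which is one commonly cited distance for character n-gram profiles and only one of the multiple proposals mentioned. The function names, the default of bi-grams, and the profile length `top_k=300` are assumptions made for this example and are not taken from any embodiment.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ranked_profile(text, n=2, top_k=300):
    """Return up to top_k n-grams, most frequent first.

    top_k=300 is an illustrative choice, not a value from the source.
    """
    counts = Counter(char_ngrams(text, n))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered[:top_k]]


def out_of_place(profile, test):
    """Sum, over the test n-grams, of how far each one's rank is from its
    rank in the stored profile. An n-gram absent from the profile is
    charged the maximum penalty (the profile length)."""
    rank = {gram: i for i, gram in enumerate(profile)}
    max_penalty = len(profile)
    return sum(
        abs(rank[g] - i) if g in rank else max_penalty
        for i, g in enumerate(test)
    )


def closest_language(profiles, text, n=2):
    """Report the language whose stored profile has the smallest distance."""
    test = ranked_profile(text, n)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], test))
```

In such a sketch, `profiles` would be a dictionary mapping each language name to the ranked profile built from its training text, and `closest_language` would return the “closest match”.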
In various embodiments, n-grams were formed using the feature numbers. During training, a profile n-gram histogram vector was created for each language. During testing, the n-gram test vector of a document was compared to each stored profile n-gram histogram vector, and the language was estimated by choosing the profile vector that was the best match to the test vector.
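A minimal sketch of the training and testing steps just described is given below, assuming that each document has already been reduced to a sequence of integer feature numbers. The names `ngram_histogram`, `train_profiles`, `l1_distance`, and `classify` are hypothetical, and the L1 (sum of absolute differences) distance is merely one possible choice of “best match”; it is not asserted to be the measure used in any embodiment.

```python
from collections import Counter


def ngram_histogram(features, n=2):
    """Normalized counts of n-grams, where each n-gram is a tuple of
    n consecutive feature numbers."""
    grams = [tuple(features[i:i + n]) for i in range(len(features) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def train_profiles(training_data, n=2):
    """Build one profile histogram per language.

    training_data maps a language label to a feature-number sequence.
    """
    return {lang: ngram_histogram(seq, n) for lang, seq in training_data.items()}


def l1_distance(a, b):
    """Sum of absolute differences over every n-gram seen in either histogram.
    This distance is an assumption of the sketch."""
    return sum(abs(a.get(g, 0.0) - b.get(g, 0.0)) for g in set(a) | set(b))


def classify(features, profiles, n=2):
    """Estimate the language as the profile closest to the test histogram."""
    test = ngram_histogram(features, n)
    return min(profiles, key=lambda lang: l1_distance(profiles[lang], test))


# Toy feature-number sequences, invented purely for illustration.
profiles = train_profiles({"A": [1, 2, 1, 2, 1, 2, 3], "B": [4, 5, 4, 5, 6, 4, 5]})
estimate = classify([1, 2, 1, 2], profiles)
```

Because the histograms are stored as sparse dictionaries, n-grams never seen during training simply contribute their full test weight to the distance.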
Various experiments showed this to be a viable technique that could learn a language profile and match the language profile against features extracted from never-before-seen data. This technique may, however, involve coding the individual feature detectors, which may be complex.