At 220, a vocabulary difficulty metric analyzes the text to determine the level of difficulty of its words. “Difficult words” may be, for example, foreign words or uncommon names that are inappropriate for a particular recitation task. Since difficult words tend to appear less frequently than easy words, the difficulty of a word can be estimated from the frequency with which the word appears in a reference corpus (i.e., a word that rarely appears in the reference corpus may be deemed difficult). Under this assumption, a vocabulary difficulty value can be estimated based on, for example, the proportion of low-frequency words in the text and the average word frequency of the text. At 225, the vocabulary difficulty value is compared to a pre-determined vocabulary difficulty range, which in one embodiment is determined based on a set of training texts that have been deemed suitable for the recitation task. Then at 250, the text filter module determines whether to filter out the text from the text set based on the result of the comparison and, optionally, the results of other metrics.
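By way of illustration only, the frequency-based estimate described above might be sketched in Python as follows. The function names, the regular-expression tokenizer, and the `low_count_threshold` cutoff are hypothetical choices made for this sketch and are not prescribed by the embodiment; any reference corpus and any definition of “low frequency” could be substituted.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_difficulty(text, reference_counts, low_count_threshold=2):
    """Return two illustrative vocabulary difficulty features for a text.

    reference_counts maps each word to its count in a reference corpus.
    low_count_threshold is an assumed cutoff: a word whose reference count
    is below it is treated as a low-frequency (difficult) word.
    """
    tokens = tokenize(text)
    total = sum(reference_counts.values())
    if not tokens or total == 0:
        return {"low_freq_proportion": 0.0, "avg_relative_frequency": 0.0}

    # Words absent from the reference corpus get a count of zero.
    counts = [reference_counts.get(token, 0) for token in tokens]
    low_freq_words = sum(1 for c in counts if c < low_count_threshold)

    return {
        # Share of the text's words that are rare in the reference corpus.
        "low_freq_proportion": low_freq_words / len(tokens),
        # Mean relative frequency of the text's words in the reference corpus.
        "avg_relative_frequency": sum(counts) / len(tokens) / total,
    }


# Toy reference corpus; a real system would use a much larger one.
reference = Counter(tokenize("the cat sat on the mat the dog sat"))
features = vocabulary_difficulty("the cat sat", reference)
```

In this sketch, a higher `low_freq_proportion` or a lower `avg_relative_frequency` indicates more difficult vocabulary. Either feature, or a combination of the two, could serve as the vocabulary difficulty value compared at 225.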
At 230, a syntactic complexity metric is employed to identify texts with overly complicated syntactic structure, which may not be appropriate for a recitation task. A syntactic complexity value may be calculated using any conventional means for estimating syntactic complexity (such as those used in automated essay scoring systems). At 235, the text filter module compares the syntactic complexity value to a pre-determined syntactic complexity range, which in one embodiment is determined based on a set of training texts that have been deemed suitable for the recitation task. Then at 250, the text filter module determines whether to filter out the text from the text set based on the result of the comparison and, optionally, the results of other metrics.
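By way of illustration only, the comparison at 235 and the decision at 250 might be sketched in Python as follows. Mean sentence length in words is used here purely as a stand-in for a conventional syntactic complexity estimator. Taking the range as the minimum and maximum values observed over the training texts is likewise an assumption of this sketch, and all function names are hypothetical.

```python
import re


def syntactic_complexity(text):
    """Crude stand-in for a syntactic complexity value: mean words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def range_from_training(training_texts, metric):
    """Derive an acceptable (low, high) range from texts deemed suitable.

    Assumption: the range spans the minimum and maximum metric values
    observed over the training texts.
    """
    values = [metric(t) for t in training_texts]
    return min(values), max(values)


def should_filter_out(text, metric, allowed_range):
    """Return True if the text's metric value falls outside the allowed range."""
    low, high = allowed_range
    return not (low <= metric(text) <= high)


# Toy training set of texts deemed suitable for the recitation task.
training = ["The cat sat. The dog ran fast.", "Birds sing in the morning."]
allowed = range_from_training(training, syntactic_complexity)

too_short = should_filter_out("Go.", syntactic_complexity, allowed)
acceptable = should_filter_out("The sun rose over hills.", syntactic_complexity, allowed)
```

The same `range_from_training` and `should_filter_out` helpers could be applied to any other metric, such as the vocabulary difficulty value, so that the determination at 250 combines the results of several comparisons.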