In some embodiments, one or more accuracy tests may be used to determine the predictive accuracy of a prediction model, or otherwise to compare the accuracy of prediction models against one another. For example, an F-score may be computed to determine the accuracy of different ensemble prediction models. The F-score may be determined based on the number of true positive results returned from the ensemble model and the number of false positives and false negatives returned from the ensemble model. A true positive result can be, for example, the correct classification of an image showing a “texting” driving behavior. A false positive can include, for example, the incorrect classification of an image as “texting” when the image in fact depicts “safe driving.” A false negative can include, for example, failing to identify an image as “texting” when the image in fact shows “texting.” The positive and negative results may be determined by comparing the model's predicted classifications for certain images against the actual classifications of those images. Thus, a model that returns more true positive results relative to its false positive and false negative results would be determined to be more accurate than a model that returns fewer true positive results relative to its false positive and false negative results.
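As an illustration only, the F-score computation described above might be sketched as follows. The function names `count_outcomes` and `f_score`, the `beta` parameter, and the sample labels are assumptions introduced for this example and do not appear in the description. The sketch uses the standard F-beta formula, which reduces to the harmonic mean of precision and recall when `beta` is 1.

```python
from typing import Sequence, Tuple


def count_outcomes(actual: Sequence[str], predicted: Sequence[str],
                   positive: str = "texting") -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1  # correctly classified as the positive class
        elif guess == positive and truth != positive:
            fp += 1  # classified as positive, but actually another class
        elif guess != positive and truth == positive:
            fn += 1  # positive image that the model failed to identify
    return tp, fp, fn


def f_score(actual: Sequence[str], predicted: Sequence[str],
            positive: str = "texting", beta: float = 1.0) -> float:
    """Compute the F-beta score from true positive, false positive, and false negative counts."""
    tp, fp, fn = count_outcomes(actual, predicted, positive)
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Hypothetical convention: report 0.0 when there are no positive
    # predictions and no positive ground-truth labels.
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical ground-truth labels and model predictions for five images.
actual = ["texting", "texting", "safe driving", "safe driving", "texting"]
predicted = ["texting", "safe driving", "texting", "safe driving", "texting"]

print(count_outcomes(actual, predicted))      # (2, 1, 1)
print(round(f_score(actual, predicted), 3))   # 0.667
```

Under these assumptions, two candidate ensemble models could be compared by computing `f_score` for each on the same labeled images and treating the model with the higher score as the more accurate one.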