In various embodiments, the ensemble model's ensemble function (at block 820) can analyze the predict actions in the predict data structure in “chunks” based on a common timeframe (e.g., 5-second video chunks). The timeframe may be specified by the computing device or by an operator of the computing device before execution of the ensemble model. In the chunk-based embodiment, the ensemble model can predict a 2D3D image pair classification, as described above, for each 2D3D image pair in the chunked timeframe. In certain embodiments, the ensemble model can generate a chunk classification based on all (or some) of the 2D3D image pair classifications in the chunk. For example, in one embodiment, a chunk of 5 seconds of 2D and 3D video images, with 20 frames (images) per second for each of the 2D and 3D images, would have 100 2D images and 100 3D images. The ensemble model can obtain, standardize, and determine 2D and 3D classifications for the chunk of images as described above (blocks 802-818), yielding 100 2D3D image pairs. Using the enhanced prediction method described above, if 50 of the 2D3D image pairs were classified as “texting,” 30 as “calling,” and 20 as “safe driving,” then the ensemble model could generate a prediction such that the chunk's overall classification is determined from the 2D3D image pair classification having the maximum count. In the above example, the chunk's classification would be “texting” since the “texting” class was predicted for more 2D3D image pairs (i.e., 50 of the 100) than any other class in the 5-second video chunk. Thus, a chunk of one or more 2D or 3D images, as a whole, may be predicted as associated with a particular classification, even where, for example, one or more of the 2D or 3D images are not, individually, predicted to relate to that classification.
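
By way of illustration only, the maximum-count chunk classification described above could be sketched in Python as follows. The function names `classify_chunk` and `split_into_chunks`, their parameters, and the tie-breaking behavior (the label encountered first wins when counts are equal) are assumptions made for this sketch and are not specified by the embodiments above.

```python
from collections import Counter


def split_into_chunks(pair_labels, fps=20, chunk_seconds=5):
    """Group per-pair classifications into fixed-length chunks.

    Assumes one 2D3D image pair classification per frame, in time order.
    The final chunk may be shorter than fps * chunk_seconds.
    """
    size = fps * chunk_seconds
    return [pair_labels[i:i + size] for i in range(0, len(pair_labels), size)]


def classify_chunk(pair_labels):
    """Return (label, count) for the classification with the maximum count.

    Ties are resolved in favor of the label seen first in the chunk,
    which is an assumption of this sketch.
    """
    if not pair_labels:
        raise ValueError("chunk contains no 2D3D image pair classifications")
    counts = Counter(pair_labels)
    label, count = counts.most_common(1)[0]
    return label, count


# Worked example from the text: 100 image pairs in one 5-second chunk.
chunk = ["texting"] * 50 + ["calling"] * 30 + ["safe driving"] * 20
label, count = classify_chunk(chunk)
print(label, count)  # texting 50
```

In this sketch the chunk is labeled “texting” because that class has the largest count (50 of 100), even though half of the individual image pairs were classified otherwise.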