The predict actions can include classifications and related probability values for those classifications for each of the 2D and 3D images. For example, the 2D prediction model may generate a 20% value for a “texting” class for a given 2D image and the 3D prediction model may generate a 50% value for the same “texting” class for a given 3D image, such as a 3D image paired with the 2D image in the 2D3D image pair. The ensemble model may then generate an enhanced prediction for the 2D3D image pair, where the enhanced prediction can determine an overall 2D3D image pair classification for the 2D3D image based upon the 2D and 3D predict actions. Thus, for example, the 2D3D image pair may indicate that the driver was “texting.” In some embodiments, the enhanced prediction determines the 2D3D image pair classification by summing one or more probability values associated with the 2D predict actions and the 3D predict actions to determine a maximum summed probability value, wherein the maximum summed probability value is determined from the sums of one or more classification probability values associated with each of the 2D predict actions and the 3D predict actions. Thus, for the example above, the 20% probability value and the probably 50% value from the 2D and 3D models, respectively, could be summed to compute an overall 70% value. If the 70% summed value was the maximum value, when compared to other classifications, e.g., “eating,” then the classification (e.g., “texting”) associated with the maximum summed probability can be identified as the 2D3D image pair classification for the 2D3D image pair.