For example, in certain embodiments as described herein, the 3D model could generate output probabilities of, e.g., 0.4, 0.2, and 0.4 for respective classes “safe driving,” “texting,” and “calling.” The 3D output probabilities could be associated with a certain 3D predict action of the predict data structure. Similarly, the 2D model could generate output probabilities of, e.g., 0.1, 0.5, and 0.4 for the same respective classes “safe driving,” “texting,” and “calling.” The 2D output probabilities could be associated with a certain 2D predict action of the predict data structure. The 2D and 3D output probabilities could be matched to one another based on, e.g., a same or similar timestamp shared by the 2D and 3D images and related predict actions, thereby creating a 2D3D image pair, as described above. In certain embodiments, the ensemble model may generate the enhanced prediction by summing the probabilities of each respective class of a 2D3D image pair and determining a 2D3D image pair classification from the class having the maximum summed probability. For example, the 3D output probabilities and 2D output probabilities of the 2D3D image pair described above may be summed to create a 2D3D image pair classification structure having summed classification values of 0.5, 0.7, and 0.8 for respective classes “safe driving,” “texting,” and “calling.” Because the “calling” class has the maximum summed value (0.8), the ensemble model generates an enhanced prediction of “calling,” thereby classifying the 2D3D image pair, and the driver's behavior at the time the 2D3D image pair was captured, as a “calling” gesture.
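The summing-and-maximum step described above may be illustrated by the following minimal Python sketch. It is offered only as one possible illustration under stated assumptions, not as the implementation of any embodiment. The function name `ensemble_predict`, the constant `CLASSES`, and the use of plain lists to hold the per-class probabilities are hypothetical and do not appear in the description. The sketch also assumes that the 2D and 3D probabilities have already been paired, e.g., by timestamp. Where two classes have equal summed values, it returns the first such class, a tie-breaking choice the description does not specify.

```python
# Hypothetical class ordering; both models are assumed to report
# probabilities in this same order.
CLASSES = ("safe driving", "texting", "calling")


def ensemble_predict(probs_3d, probs_2d, classes=CLASSES):
    """Sum per-class probabilities of a 2D3D image pair and pick the maximum.

    Returns a tuple of (predicted class label, list of summed values).
    """
    if not (len(probs_3d) == len(probs_2d) == len(classes)):
        raise ValueError("probability lists must have one entry per class")

    # Element-wise sum of the 3D and 2D output probabilities.
    summed = [p3 + p2 for p3, p2 in zip(probs_3d, probs_2d)]

    # Index of the class with the maximum summed value.
    # Assumption: on a tie, the first class with that value is chosen.
    best_index = max(range(len(summed)), key=summed.__getitem__)
    return classes[best_index], summed


# Values taken from the example in the description.
label, summed = ensemble_predict([0.4, 0.2, 0.4], [0.1, 0.5, 0.4])
print(label)  # calling
```

In this sketch the summed values are left unnormalized, as in the example above; dividing each by two would give an average with the same maximum class.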