In another embodiment, instead of summing the output probabilities of the classes for the 2D and 3D models of the respective 2D3D image pair, the classification having the largest probability across both the 2D and 3D output probabilities is determined as the classification for the 2D3D image pair. For example, the 3D output probabilities and 2D output probabilities of the 2D3D image pair described above may be analyzed to determine that that the 2D output probability of class “texting” has the maximum value (0.05). Because “texting” class has the maximum probability value (0.5) than any other class in either the 2D and 3D output probabilities, then the ensemble model generates an enhanced prediction of “texting,” thereby classifying the 2D3D image pair, and the driver's behavior at the time the 2D3D image was captured, as a “texting” gesture.
Although summing and determining the maximum probability values are disclosed, other methods for generating the enhanced ensemble prediction are contemplated herein, such as, for example, by using logarithmic, multiplicative, or other functions to combine the predict action of the 2D and 3D models. In other embodiments, the 2D and 3D model predict actions may be input into a further prediction model used by the ensemble model, such as a further neural network model that receives the 2D and 3D model predict actions as input and outputs an enhanced prediction and classifications based on the 2D and 3D model predict actions.