For the foregoing reasons, systems and methods are disclosed herein for generating an enhanced prediction from a 2D and 3D image-based ensemble model. As described herein, a computing device may be configured to obtain one or more sets of 2D and 3D images. Each of the 2D and 3D images may be standardized to allow for comparison and interoperability between the images. In one embodiment, the 3D images are standardized using Distification. In addition, corresponding 2D and 3D image pairs (i.e., “2D3D image pairs”) may be determined from the standardized 2D and 3D images where, for example, the 2D image and the 3D image of a pair correspond based on a common attribute, such as a similar timestamp or time value. The enhanced prediction may utilize separate underlying 2D and 3D prediction models, where, for example, the corresponding 2D and 3D images of a 2D3D image pair are each input to the respective 2D and 3D prediction models to generate respective 2D and 3D predict actions.
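By way of illustration only, the pairing and model-dispatch steps described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Frame` container, the `pair_by_timestamp` and `predict_pairs` helpers, and the 0.05 second tolerance are hypothetical names and values chosen for the example, and the 2D and 3D prediction models are represented as arbitrary callables. The sketch assumes the images have already been standardized, and it does not show how the two predict actions are combined into the enhanced prediction.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical container for one standardized image."""
    timestamp: float          # capture time in seconds
    pixels: Sequence[float]   # placeholder for the image data


def pair_by_timestamp(
    frames_2d: Sequence[Frame],
    frames_3d: Sequence[Frame],
    tolerance: float = 0.05,  # assumed value; "similar timestamp" is not quantified
) -> List[Tuple[Frame, Frame]]:
    """Form 2D3D image pairs from frames whose timestamps are close.

    Each 2D frame is matched to the 3D frame nearest in time; the pair is
    kept only if the time difference is within the tolerance.
    """
    pairs: List[Tuple[Frame, Frame]] = []
    if not frames_3d:
        return pairs
    for f2 in frames_2d:
        nearest = min(frames_3d, key=lambda f3: abs(f3.timestamp - f2.timestamp))
        if abs(nearest.timestamp - f2.timestamp) <= tolerance:
            pairs.append((f2, nearest))
    return pairs


def predict_pairs(
    pairs: Sequence[Tuple[Frame, Frame]],
    model_2d: Callable[[Frame], Any],
    model_3d: Callable[[Frame], Any],
) -> List[Tuple[Any, Any]]:
    """Input each image of a pair to its respective model.

    Returns one (2D predict action, 3D predict action) tuple per pair.
    """
    return [(model_2d(f2), model_3d(f3)) for f2, f3 in pairs]
```

In this sketch a 2D frame with no 3D frame inside the tolerance is simply left unpaired, and a single 3D frame may be matched to more than one 2D frame; either choice could be changed to suit a particular embodiment.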