A machine learning model may then be applied to pairs of video feature sets and audio feature sets to determine a confidence score between a frame and an audio bin. For example, the object vector and object attribute vector for frame 102b and the audio vector for audio bin 104b are provided as inputs to a machine learning model that outputs a confidence score indicating how likely it is that frame 102b and audio bin 104b are synchronized. Confidence scores may also be determined between frame 102b and each of audio bins 104a and 104c, as well as between frame 102a and each of audio bins 104a-c, and between frame 102c and each of audio bins 104a-c.
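As a minimal sketch of this pairwise scoring step, the following Python snippet uses a placeholder scoring function in place of a trained model; the function name `confidence_score`, the feature dimensions, and the randomly generated feature vectors are all illustrative assumptions rather than details from the description above.

```python
import numpy as np

# Hypothetical stand-in for the trained model: in practice this would be a
# learned classifier; here a weighted dot product over concatenated features
# passed through a sigmoid serves as a placeholder confidence score in [0, 1].
def confidence_score(object_vec, attribute_vec, audio_vec, weights):
    video_features = np.concatenate([object_vec, attribute_vec])
    logit = weights @ np.concatenate([video_features, audio_vec])
    return 1.0 / (1.0 + np.exp(-logit))

# Toy feature sets for frames 102a-c and audio bins 104a-c (illustrative only).
rng = np.random.default_rng(0)
frames = {f: (rng.normal(size=4), rng.normal(size=4)) for f in ["102a", "102b", "102c"]}
bins = {b: rng.normal(size=4) for b in ["104a", "104b", "104c"]}
weights = rng.normal(size=12)

# Score every frame/audio-bin pair.
scores = {
    (f, b): confidence_score(obj, attr, audio, weights)
    for f, (obj, attr) in frames.items()
    for b, audio in bins.items()
}
```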
The confidence scores are used to determine whether frames 102a-c are desynchronized with audio bins 104a-c. For example, the confidence score between frame 102b and audio bin 104a may be higher than the confidence score between frame 102b and audio bin 104b. Similarly, the confidence score between frame 102c and audio bin 104b may be higher than the confidence score between frame 102c and audio bin 104c. Because each frame scores higher against an audio bin other than the one it is nominally aligned with, and the mismatch follows a consistent offset, the audio component and the video component of media presentation 100 are determined to be desynchronized.
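Continuing the sketch above (and reusing its hypothetical `scores` dictionary), the comparison logic could be expressed as follows; the `aligned` mapping and the best-match rule are assumptions about one simple way to flag desynchronization, not the only possible decision rule.

```python
# Nominal alignment between each frame and its audio bin (assumed for illustration).
aligned = {"102a": "104a", "102b": "104b", "102c": "104c"}

# For each frame, find the audio bin with the highest confidence score.
best_match = {f: max(bins, key=lambda b: scores[(f, b)]) for f in frames}

# If the best-scoring bin differs from the nominally aligned bin,
# the audio and video components are flagged as desynchronized.
desynchronized = any(best_match[f] != aligned[f] for f in frames)
print(best_match, desynchronized)
```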