To determine whether the video frames and audio bins are synchronized, correlations are determined between pairs of video frames 102a-c and audio bins 104a-c. Correlations may be determined using feature sets generated for each audio bin and video frame. Video feature sets are generated for each of frames 102a-c. For example, the video feature sets might include an object vector that represents objects in the frame such as horse 107 or train 106a-b. The video feature sets might also include an object attribute vector that represents attributes of the objects in the frame, e.g., size, position, location, type, etc.
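As an illustration only, a video feature set of the kind described above could be assembled as in the following sketch. The vocabulary `VOCAB`, the `DetectedObject` type, the choice of attributes (position and size), and all function names are hypothetical and are not taken from the described embodiments; the object detections are assumed to be supplied by some upstream detector.

```python
from dataclasses import dataclass

# Hypothetical object vocabulary; the labels are illustrative assumptions.
VOCAB = ("horse", "train")


@dataclass
class DetectedObject:
    label: str   # object type, e.g. "horse"
    x: float     # horizontal center, normalized to [0, 1]
    y: float     # vertical center, normalized to [0, 1]
    size: float  # fraction of the frame area covered by the object


def object_vector(objects):
    """Count how many objects of each vocabulary type appear in the frame."""
    return [sum(1 for o in objects if o.label == name) for name in VOCAB]


def object_attribute_vector(objects):
    """For each vocabulary type, emit (x, y, size) of the largest matching
    object, or zeros when no object of that type is present."""
    vec = []
    for name in VOCAB:
        matches = [o for o in objects if o.label == name]
        if matches:
            largest = max(matches, key=lambda o: o.size)
            vec.extend([largest.x, largest.y, largest.size])
        else:
            vec.extend([0.0, 0.0, 0.0])
    return vec


def video_feature_set(objects):
    """Bundle the object vector and the object attribute vector for a frame."""
    return {
        "objects": object_vector(objects),
        "attributes": object_attribute_vector(objects),
    }


# Example frame containing one horse and two train cars.
frame = [
    DetectedObject("horse", 0.2, 0.6, 0.10),
    DetectedObject("train", 0.7, 0.5, 0.30),
    DetectedObject("train", 0.9, 0.5, 0.05),
]
features = video_feature_set(frame)
```

Fixed-length vectors are used here so that feature sets from different frames can be compared element by element; other encodings would serve equally well.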
Audio feature sets are also generated for each of audio bins 104a-c. For example, the audio feature sets might include an audio vector whose features represent a Fourier transform of the audio and indicate the amplitude or energy of discrete frequencies in the audio bin. The audio vector may indicate sounds associated with the audio bin, such as a horse galloping or a train moving along a railroad track.
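By way of a minimal sketch, an audio vector of this kind might be computed with a discrete Fourier transform over the samples in a bin. The function name `audio_vector`, the normalization by the number of samples, and the synthetic test tone are assumptions made for illustration; a practical implementation would more likely use an optimized FFT routine.

```python
import cmath
import math


def audio_vector(samples):
    """Return the amplitude of each discrete frequency in an audio bin,
    from the zero-frequency component up to the Nyquist frequency."""
    n = len(samples)
    if n == 0:
        return []
    magnitudes = []
    for k in range(n // 2 + 1):
        # Direct DFT sum for frequency index k (O(n^2) overall).
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * t / n)
            for t, s in enumerate(samples)
        )
        # Normalization by n is an illustrative choice, not a requirement.
        magnitudes.append(abs(coeff) / n)
    return magnitudes


# Synthetic bin: a pure tone completing 5 cycles over 64 samples.
n = 64
tone = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
spectrum = audio_vector(tone)

# Index of the strongest frequency component; for this tone it is 5.
peak = max(range(len(spectrum)), key=spectrum.__getitem__)
```

Squaring each amplitude would yield an energy-based vector instead, consistent with the "amplitude or energy" alternatives noted above.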