What is claimed is:

1. A method, comprising:
receiving a media presentation, wherein the media presentation comprises a video component that includes a sequence of video frames, and an audio component that includes a sequence of audio bins, each audio bin corresponding to one of the video frames relative to a media timeline;
generating a first audio vector based on a first audio bin of the sequence of audio bins, the first audio vector representing a plurality of features of the first audio bin;
generating a first object vector based on a first video frame that corresponds to the first audio bin, wherein the first object vector represents one or more objects in the first video frame;
generating a first object attribute vector that represents one or more features of the one or more objects represented by the first object vector;
generating a first confidence score using a first machine learning model, the first audio vector, the first object vector, and the first object attribute vector, the first confidence score representing a measure of correlation between the first audio bin and the first video frame;
generating a predicted object vector and a predicted object attribute vector using a second machine learning model and the first audio vector, the predicted object vector and the predicted object attribute vector representing a hypothetical video frame having a high degree of correlation with the first audio bin;
generating a second confidence score that represents a measure of correlation between the predicted object vector and the predicted object attribute vector, and the first video frame;
determining, based on the first confidence score and the second confidence score, that the audio component and the video component are desynchronized; and
modifying one or both of the audio component and the video component to improve synchronization of the audio and video components of the media presentation.

2. The method of claim 1, further comprising:
generating a second object vector and a second object attribute vector based on a second video frame, the second video frame being adjacent to the first video frame relative to the media timeline;
generating a third confidence score using the first machine learning model, the first audio vector, the second object vector, and the second object attribute vector, the third confidence score representing a measure of correlation between the first audio bin and the second video frame;
determining that the third confidence score is higher than the first confidence score; and
determining that the audio component and the video component are desynchronized based on the third confidence score being higher than the first confidence score.

3. The method of claim 1, further comprising:
determining a predicted audio vector using a third machine learning model, the first object vector, and the first object attribute vector, the predicted audio vector representing a hypothetical audio bin having a high degree of correlation with the first video frame;
generating a third confidence score that represents a measure of correlation between the predicted audio vector and the first audio bin; and
determining that the audio component and the video component are desynchronized additionally based on the third confidence score.
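For illustration only, the sketch below walks through the per-frame check recited in claims 1-3: encode an audio bin, encode the objects and object attributes of the corresponding frame, score their correlation with a first model, predict the video features a second model would expect from the audio, and score that prediction against the actual frame. Every function here (audio_vector, object_vectors, first_confidence, predict_video_features, desynchronized) is a hypothetical stand-in built from FFT magnitudes and random projections, not the claimed trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_vector(audio_bin):
    """Hypothetical audio encoder: summarize one audio bin as a unit vector."""
    spectrum = np.abs(np.fft.rfft(audio_bin, n=256))  # 129 magnitude bins
    return spectrum / (np.linalg.norm(spectrum) + 1e-8)

def object_vectors(frame):
    """Hypothetical detector: an object vector and an object-attribute vector."""
    flat = np.resize(frame.astype(float).ravel(), 129)
    obj = flat / (np.linalg.norm(flat) + 1e-8)
    attr = np.sqrt(np.abs(obj))  # stand-in attribute features
    return obj, attr / (np.linalg.norm(attr) + 1e-8)

def first_confidence(audio_vec, obj_vec, attr_vec):
    """Stand-in for the first model: correlation of audio bin and video frame."""
    joint = np.concatenate([obj_vec, attr_vec])[:len(audio_vec)]
    joint = joint / (np.linalg.norm(joint) + 1e-8)
    return float(np.dot(audio_vec, joint))

def predict_video_features(audio_vec):
    """Stand-in for the second model: predicted object/attribute vectors for a
    hypothetical frame that would correlate strongly with the audio bin."""
    obj = rng.standard_normal((129, audio_vec.size)) @ audio_vec
    attr = rng.standard_normal((129, audio_vec.size)) @ audio_vec
    return obj / np.linalg.norm(obj), attr / np.linalg.norm(attr)

def desynchronized(audio_bin, frame, threshold=0.5):
    audio_vec = audio_vector(audio_bin)
    obj, attr = object_vectors(frame)
    score_1 = first_confidence(audio_vec, obj, attr)   # first confidence score
    p_obj, p_attr = predict_video_features(audio_vec)
    # second confidence score: predicted vectors vs. the actual frame
    score_2 = 0.5 * (float(np.dot(p_obj, obj)) + float(np.dot(p_attr, attr)))
    return score_1 < threshold and score_2 < threshold

audio_bin = rng.standard_normal(1024)        # one audio bin
frame = rng.integers(0, 255, (16, 16, 3))    # one video frame
print(desynchronized(audio_bin, frame))
```

The 0.5 threshold and the decision rule (both scores low) are arbitrary choices for the sketch; claim 1 only requires that the determination be based on both confidence scores.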
4. A system, comprising one or more processors and memory configured to:
select a media presentation, the media presentation including a video component and an audio component that reference a media timeline;
generate a plurality of audio feature sets based on corresponding portions of the audio component;
generate a plurality of predicted video feature sets using a machine learning model and the plurality of audio feature sets, each predicted video feature set corresponding to a portion of the audio component;
generate a plurality of video feature sets based on corresponding portions of the video component, wherein each portion of the audio component corresponds to one of the portions of the video component relative to the media timeline;
determine, using the predicted video feature sets and the video feature sets, one or more correlations between one or more pairs of the portions of the audio component and the portions of the video component;
determine, using the one or more correlations, that the audio component and the video component are desynchronized; and
modify the media presentation to improve synchronization of the audio component and video component of the media presentation.

5. The system of claim 4, wherein each video feature set corresponds to a single video frame of the video component.

6. The system of claim 4, wherein the one or more processors and memory are further configured to generate a predicted audio feature set using a machine learning model and a first video feature set of the video feature sets, the predicted audio feature set representing hypothetical audio content having a high degree of correlation with a first portion of the video component corresponding to the first video feature set.

7. The system of claim 4, wherein the one or more processors and memory are further configured to determine that the audio component and the video component are desynchronized by providing the one or more correlations to a classifier trained on correlations between audio content and video content.

8. The system of claim 4, wherein the one or more correlations include a first set of correlations between a first portion of the video component and a subset of the portions of the audio component, and wherein the one or more processors and memory are further configured to determine that the audio component and the video component are desynchronized based on the first set of correlations.

9. The system of claim 4, wherein the one or more correlations include a second set of correlations between a first portion of the audio component and a subset of the portions of the video component, and wherein the one or more processors and memory are further configured to determine that the audio component and the video component are desynchronized based on the second set of correlations.

10. The system of claim 4, wherein the one or more processors and memory are further configured to modify the media presentation by advancing or delaying one or both of selected portions of the audio component and selected portions of the video component relative to the media timeline, inserting first audio content or first video content into the media presentation, removing second audio content or second video content from the media presentation, or any combination thereof, based on determining that the audio component and the video component are desynchronized.
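One way to read claims 8-10 together is as an offset search: correlate the audio-derived predictions against the actual video feature sets at several lags, treat a strong nonzero lag as evidence of desynchronization, and advance or delay portions to compensate. The sketch below assumes cosine similarity as the correlation measure and uses invented names (correlation_at_lag, estimate_offset, resynchronize); np.roll stands in for the insertion/removal edits that claim 10 actually recites.

```python
import numpy as np

def correlation_at_lag(predicted, actual, lag):
    """Mean cosine similarity between predicted and actual video feature sets
    when the video track is shifted by `lag` portions."""
    scores = []
    for i, p in enumerate(predicted):
        j = i + lag
        if 0 <= j < len(actual):
            v = actual[j]
            scores.append(float(np.dot(p, v)) /
                          (np.linalg.norm(p) * np.linalg.norm(v) + 1e-8))
    return float(np.mean(scores)) if scores else float("-inf")

def estimate_offset(predicted, actual, max_lag=5):
    """Search lags for the strongest set of correlations (claims 8-9 style)."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: correlation_at_lag(predicted, actual, lag))

def resynchronize(video_portions, offset):
    """Claim 10 style fix: advance or delay portions relative to the media
    timeline. np.roll wraps at the ends; a real system would insert or
    remove content at the boundaries instead."""
    return list(np.roll(np.array(video_portions), -offset, axis=0))

# Toy check: build "predicted" features that match the video three portions late.
rng = np.random.default_rng(1)
video = [rng.standard_normal(32) for _ in range(40)]
predicted = [video[i + 3] + 0.1 * rng.standard_normal(32) for i in range(37)]
offset = estimate_offset(predicted, video)
print(offset)   # expected: 3 -- a nonzero offset signals desynchronization
if offset != 0:
    video = resynchronize(video, offset)
```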
11. A method, comprising:
selecting a media presentation, the media presentation including a video component and an audio component that reference a media timeline;
generating a plurality of audio feature sets based on corresponding portions of the audio component;
generating a plurality of predicted video feature sets using a machine learning model and the plurality of audio feature sets, each predicted video feature set corresponding to a portion of the audio component;
generating a plurality of video feature sets based on corresponding portions of the video component, wherein each portion of the audio component corresponds to one of the portions of the video component relative to the media timeline;
determining, using the predicted video feature sets and the video feature sets, one or more correlations between one or more pairs of the portions of the audio component and the portions of the video component;
determining, using the one or more correlations, that the audio component and the video component are desynchronized; and
modifying the media presentation to improve synchronization of the audio component and video component of the media presentation.

12. The method of claim 11, wherein each video feature set corresponds to a single video frame of the video component.

13. The method of claim 11, further comprising generating a predicted audio feature set using a machine learning model and a first video feature set of the video feature sets, the predicted audio feature set representing hypothetical audio content having a high degree of correlation with a first portion of the video component corresponding to the first video feature set.

14. The method of claim 11, further comprising determining that the audio component and the video component are desynchronized by providing the one or more correlations to a classifier trained on correlations between audio content and video content.

15. The method of claim 11, wherein the one or more correlations include a first set of correlations between a first portion of the video component and a subset of the portions of the audio component, and wherein the method further comprises determining that the audio component and the video component are desynchronized based on the first set of correlations.

16. The method of claim 11, wherein the one or more correlations include a second set of correlations between a first portion of the audio component and a subset of the portions of the video component, and wherein the method further comprises determining that the audio component and the video component are desynchronized based on the second set of correlations.

17. The method of claim 11, further comprising modifying the media presentation by advancing or delaying one or both of selected portions of the audio component and selected portions of the video component relative to the media timeline, inserting first audio content or first video content into the media presentation, removing second audio content or second video content from the media presentation, or any combination thereof, based on determining that the audio component and the video component are desynchronized.
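Claims 13-14 add the reverse direction: predict audio features from video features and hand the resulting correlations to a trained classifier. In the sketch below, the projection matrix W and the linear decision rule in is_desynchronized are hypothetical placeholders for the trained prediction model and classifier; only the shape of the computation follows the claims.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((24, 32))   # hypothetical video-to-audio projection

def predicted_audio_features(video_feature_set):
    """Stand-in for the model of claim 13: hypothetical audio content expected
    to correlate strongly with this portion of the video component."""
    a = W @ video_feature_set
    return a / (np.linalg.norm(a) + 1e-8)

def correlation_profile(audio_sets, video_sets, max_lag=4):
    """Correlations between predicted and actual audio features at each lag."""
    profile = []
    for lag in range(-max_lag, max_lag + 1):
        scores = []
        for i, v in enumerate(video_sets):
            j = i + lag
            if 0 <= j < len(audio_sets):
                p = predicted_audio_features(v)
                a = audio_sets[j] / (np.linalg.norm(audio_sets[j]) + 1e-8)
                scores.append(float(np.dot(p, a)))
        profile.append(float(np.mean(scores)) if scores else 0.0)
    return np.array(profile)

def is_desynchronized(profile, weights=None, bias=0.0):
    """Claim 14 style classifier over the correlation profile; a fixed linear
    rule stands in for a trained model here."""
    if weights is None:
        # hypothetical weights: reward correlation mass at the zero-lag bin
        weights = -np.ones_like(profile)
        weights[len(profile) // 2] = float(len(profile) - 1)
    return float(weights @ profile) + bias < 0.0

# Aligned components: the profile peaks at lag 0, so the rule says False.
video_sets = [rng.standard_normal(32) for _ in range(30)]
audio_sets = [predicted_audio_features(v) + 0.05 * rng.standard_normal(24)
              for v in video_sets]
print(is_desynchronized(correlation_profile(audio_sets, video_sets)))  # False

# Shift the audio by two portions: the peak moves off zero lag -> True.
audio_shifted = audio_sets[2:] + [rng.standard_normal(24) for _ in range(2)]
print(is_desynchronized(correlation_profile(audio_shifted, video_sets)))  # True
```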