What is claimed is:

1. A method, comprising:
receiving a media presentation, wherein the media presentation comprises a video component that includes a sequence of video frames, and an audio component that includes a sequence of audio bins, each audio bin corresponding to one of the video frames relative to a media timeline;
generating a first audio vector based on a first audio bin of the sequence of audio bins, the first audio vector representing a plurality of features of the first audio bin;
generating a first object vector based on a first video frame that corresponds to the first audio bin, wherein the first object vector represents one or more objects in the first video frame;
generating a first object attribute vector that represents one or more features of the one or more objects represented by the first object vector;
generating a first confidence score using a first machine learning model, the first audio vector, the first object vector, and the first object attribute vector, the first confidence score representing a measure of correlation between the first audio bin and the first video frame;
generating a predicted object vector and a predicted object attribute vector using a second machine learning model and the first audio vector, the predicted object vector and the predicted object attribute vector representing a hypothetical video frame having a high degree of correlation with the first audio bin;
generating a second confidence score that represents a measure of correlation between the predicted object vector and the predicted object attribute vector, and the first video frame;
determining, based on the first confidence score and the second confidence score, that the audio component and the video component are desynchronized; and
modifying one or both of the audio component and the video component to improve synchronization of the audio and video components of the media presentation.

2. The method of claim 1, further comprising:
generating a second object vector and a second object attribute vector based on a second video frame, the second video frame being adjacent to the first video frame relative to the media timeline;
generating a third confidence score using the first machine learning model, the first audio vector, the second object vector, and the second object attribute vector, the third confidence score representing a measure of correlation between the first audio bin and the second video frame;
determining that the third confidence score is higher than the first confidence score; and
determining that the audio component and the video component are desynchronized based on the third confidence score being higher than the first confidence score.

3. The method of claim 1, further comprising:
determining a predicted audio vector using a third machine learning model, the first object vector, and the first object attribute vector, the predicted audio vector representing a hypothetical audio bin having a high degree of correlation with the first video frame;
generating a third confidence score that represents a measure of correlation between the predicted audio vector and the first audio bin; and
determining that the audio component and the video component are desynchronized additionally based on the third confidence score.
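For illustration only, the sketch below walks through the per-frame check recited in claims 1-3: encode an audio bin, encode the objects and object attributes of the corresponding frame, score their correlation with a first model, predict the video features a second model would expect from the audio, and score that prediction against the actual frame. Every function here (audio_vector, object_vectors, first_confidence, predict_video_features, desynchronized) is a hypothetical stand-in built from FFT magnitudes and random projections, not the claimed trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_vector(audio_bin):
    """Hypothetical audio encoder: summarize one audio bin as a unit vector."""
    spectrum = np.abs(np.fft.rfft(audio_bin, n=256))  # 129 magnitude bins
    return spectrum / (np.linalg.norm(spectrum) + 1e-8)

def object_vectors(frame):
    """Hypothetical detector: an object vector and an object-attribute vector."""
    flat = np.resize(frame.astype(float).ravel(), 129)
    obj = flat / (np.linalg.norm(flat) + 1e-8)
    attr = np.sqrt(np.abs(obj))  # stand-in attribute features
    return obj, attr / (np.linalg.norm(attr) + 1e-8)

def first_confidence(audio_vec, obj_vec, attr_vec):
    """Stand-in for the first model: correlation of audio bin and video frame."""
    joint = np.concatenate([obj_vec, attr_vec])[:len(audio_vec)]
    joint = joint / (np.linalg.norm(joint) + 1e-8)
    return float(np.dot(audio_vec, joint))

def predict_video_features(audio_vec):
    """Stand-in for the second model: predicted object/attribute vectors for a
    hypothetical frame that would correlate strongly with the audio bin."""
    obj = rng.standard_normal((129, audio_vec.size)) @ audio_vec
    attr = rng.standard_normal((129, audio_vec.size)) @ audio_vec
    return obj / np.linalg.norm(obj), attr / np.linalg.norm(attr)

def desynchronized(audio_bin, frame, threshold=0.5):
    audio_vec = audio_vector(audio_bin)
    obj, attr = object_vectors(frame)
    score_1 = first_confidence(audio_vec, obj, attr)   # first confidence score
    p_obj, p_attr = predict_video_features(audio_vec)
    # second confidence score: predicted vectors vs. the actual frame
    score_2 = 0.5 * (float(np.dot(p_obj, obj)) + float(np.dot(p_attr, attr)))
    return score_1 < threshold and score_2 < threshold

audio_bin = rng.standard_normal(1024)        # one audio bin
frame = rng.integers(0, 255, (16, 16, 3))    # one video frame
print(desynchronized(audio_bin, frame))
```

The 0.5 threshold and the decision rule (both scores low) are arbitrary choices for the sketch; claim 1 only requires that the determination be based on both confidence scores.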
4. A system, comprising one or more processors and memory configured to:
select a media presentation, the media presentation including a video component and an audio component that reference a media timeline;
generate a plurality of audio feature sets based on corresponding portions of the audio component;
generate a plurality of predicted video feature sets using a machine learning model and the plurality of audio feature sets, each predicted video feature set corresponding to a portion of the audio component;
generate a plurality of video feature sets based on corresponding portions of the video component, wherein each portion of the audio component corresponds to one of the portions of the video component relative to the media timeline;
determine, using the predicted video feature sets and the video feature sets, one or more correlations between one or more pairs of the portions of the audio component and the portions of the video component;
determine, using the one or more correlations, that the audio component and the video component are desynchronized; and
modify the media presentation to improve synchronization of the audio component and video component of the media presentation.

5. The system of claim 4, wherein each video feature set corresponds to a single video frame of the video component.

6. The system of claim 4, wherein the one or more processors and memory are further configured to generate a predicted audio feature set using a machine learning model and a first video feature set of the video feature sets, the predicted audio feature set representing hypothetical audio content having a high degree of correlation with a first portion of the video component corresponding to the first video feature set.

7. The system of claim 4, wherein the one or more processors and memory are further configured to determine that the audio component and the video component are desynchronized by providing the one or more correlations to a classifier trained on correlations between audio content and video content.

8. The system of claim 4, wherein the one or more correlations include a first set of correlations between a first portion of the video component and a subset of the portions of the audio component, and wherein the one or more processors and memory are further configured to determine that the audio component and the video component are desynchronized based on the first set of correlations.

9. The system of claim 4, wherein the one or more correlations include a second set of correlations between a first portion of the audio component and a subset of the portions of the video component, and wherein the one or more processors and memory are further configured to determine that the audio component and the video component are desynchronized based on the second set of correlations.

10. The system of claim 4, wherein the one or more processors and memory are further configured to modify the media presentation by advancing or delaying one or both of selected portions of the audio component and selected portions of the video component relative to the media timeline, inserting first audio content or first video content into the media presentation, removing second audio content or second video content from the media presentation, or any combination thereof, based on determining that the audio component and the video component are desynchronized.
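One way to read claims 8-10 together is as an offset search: correlate the audio-derived predictions against the actual video feature sets at several lags, treat a strong nonzero lag as evidence of desynchronization, and advance or delay portions to compensate. The sketch below assumes cosine similarity as the correlation measure and uses invented names (correlation_at_lag, estimate_offset, resynchronize); np.roll stands in for the insertion/removal edits that claim 10 actually recites.

```python
import numpy as np

def correlation_at_lag(predicted, actual, lag):
    """Mean cosine similarity between predicted and actual video feature sets
    when the video track is shifted by `lag` portions."""
    scores = []
    for i, p in enumerate(predicted):
        j = i + lag
        if 0 <= j < len(actual):
            v = actual[j]
            scores.append(float(np.dot(p, v)) /
                          (np.linalg.norm(p) * np.linalg.norm(v) + 1e-8))
    return float(np.mean(scores)) if scores else float("-inf")

def estimate_offset(predicted, actual, max_lag=5):
    """Search lags for the strongest set of correlations (claims 8-9 style)."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: correlation_at_lag(predicted, actual, lag))

def resynchronize(video_portions, offset):
    """Claim 10 style fix: advance or delay portions relative to the media
    timeline. np.roll wraps at the ends; a real system would insert or
    remove content at the boundaries instead."""
    return list(np.roll(np.array(video_portions), -offset, axis=0))

# Toy check: build "predicted" features that match the video three portions late.
rng = np.random.default_rng(1)
video = [rng.standard_normal(32) for _ in range(40)]
predicted = [video[i + 3] + 0.1 * rng.standard_normal(32) for i in range(37)]
offset = estimate_offset(predicted, video)
print(offset)   # expected: 3 -- a nonzero offset signals desynchronization
if offset != 0:
    video = resynchronize(video, offset)
```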
11. A method, comprising:
selecting a media presentation, the media presentation including a video component and an audio component that reference a media timeline;
generating a plurality of audio feature sets based on corresponding portions of the audio component;
generating a plurality of predicted video feature sets using a machine learning model and the plurality of audio feature sets, each predicted video feature set corresponding to a portion of the audio component;
generating a plurality of video feature sets based on corresponding portions of the video component, wherein each portion of the audio component corresponds to one of the portions of the video component relative to the media timeline;
determining, using the predicted video feature sets and the video feature sets, one or more correlations between one or more pairs of the portions of the audio component and the portions of the video component;
determining, using the one or more correlations, that the audio component and the video component are desynchronized; and
modifying the media presentation to improve synchronization of the audio component and video component of the media presentation.

12. The method of claim 11, wherein each video feature set corresponds to a single video frame of the video component.

13. The method of claim 11, further comprising generating a predicted audio feature set using a machine learning model and a first video feature set of the video feature sets, the predicted audio feature set representing hypothetical audio content having a high degree of correlation with a first portion of the video component corresponding to the first video feature set.

14. The method of claim 11, further comprising determining that the audio component and the video component are desynchronized by providing the one or more correlations to a classifier trained on correlations between audio content and video content.

15. The method of claim 11, wherein the one or more correlations include a first set of correlations between a first portion of the video component and a subset of the portions of the audio component, and wherein the method further comprises determining that the audio component and the video component are desynchronized based on the first set of correlations.

16. The method of claim 11, wherein the one or more correlations include a second set of correlations between a first portion of the audio component and a subset of the portions of the video component, and wherein the method further comprises determining that the audio component and the video component are desynchronized based on the second set of correlations.

17. The method of claim 11, further comprising modifying the media presentation by advancing or delaying one or both of selected portions of the audio component and selected portions of the video component relative to the media timeline, inserting first audio content or first video content into the media presentation, removing second audio content or second video content from the media presentation, or any combination thereof, based on determining that the audio component and the video component are desynchronized.
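Claims 13-14 add the reverse direction: predict audio features from video features and hand the resulting correlations to a trained classifier. In the sketch below, the projection matrix W and the linear decision rule in is_desynchronized are hypothetical placeholders for the trained prediction model and classifier; only the shape of the computation follows the claims.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((24, 32))   # hypothetical video-to-audio projection

def predicted_audio_features(video_feature_set):
    """Stand-in for the model of claim 13: hypothetical audio content expected
    to correlate strongly with this portion of the video component."""
    a = W @ video_feature_set
    return a / (np.linalg.norm(a) + 1e-8)

def correlation_profile(audio_sets, video_sets, max_lag=4):
    """Correlations between predicted and actual audio features at each lag."""
    profile = []
    for lag in range(-max_lag, max_lag + 1):
        scores = []
        for i, v in enumerate(video_sets):
            j = i + lag
            if 0 <= j < len(audio_sets):
                p = predicted_audio_features(v)
                a = audio_sets[j] / (np.linalg.norm(audio_sets[j]) + 1e-8)
                scores.append(float(np.dot(p, a)))
        profile.append(float(np.mean(scores)) if scores else 0.0)
    return np.array(profile)

def is_desynchronized(profile, weights=None, bias=0.0):
    """Claim 14 style classifier over the correlation profile; a fixed linear
    rule stands in for a trained model here."""
    if weights is None:
        # hypothetical weights: reward correlation mass at the zero-lag bin
        weights = -np.ones_like(profile)
        weights[len(profile) // 2] = float(len(profile) - 1)
    return float(weights @ profile) + bias < 0.0

# Aligned components: the profile peaks at lag 0, so the rule says False.
video_sets = [rng.standard_normal(32) for _ in range(30)]
audio_sets = [predicted_audio_features(v) + 0.05 * rng.standard_normal(24)
              for v in video_sets]
print(is_desynchronized(correlation_profile(audio_sets, video_sets)))  # False

# Shift the audio by two portions: the peak moves off zero lag -> True.
audio_shifted = audio_sets[2:] + [rng.standard_normal(24) for _ in range(2)]
print(is_desynchronized(correlation_profile(audio_shifted, video_sets)))  # True
```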