As another example of pre-processing, systems described herein may downsample audio content to a specified sampling frequency (e.g., 16000 Hz, 12000 Hz, 8000 Hz, etc.). Downsampling may reduce computational load while preserving human-salient differences. These systems may also extract features from the audio content that are useful for comparing the similarity of the content. For example, these systems may convert the content to spectrograms and, in some examples, to log-mel spectrograms. For example, these systems may extract 128 mel frequency bands, thereby producing 128-dimensional log-mel features.
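By way of illustration only, the audio pre-processing described above might be sketched as follows in Python using NumPy. The function names (`downsample`, `mel_filterbank`, `log_mel_spectrogram`), the 1024-sample window, the 256-sample hop, the windowed-sinc low-pass filter, and the restriction to integer decimation factors are all assumptions made for this sketch and are not taken from the description; only the 16000 Hz target rate and the 128 mel bands come from the text.

```python
import numpy as np

def downsample(signal, orig_sr, target_sr, taps=101):
    """Integer-factor downsampling: windowed-sinc low-pass, then decimation."""
    if orig_sr % target_sr != 0:
        raise ValueError("this sketch only handles integer decimation factors")
    factor = orig_sr // target_sr
    signal = np.asarray(signal, dtype=np.float64)
    if factor == 1:
        return signal
    # Low-pass FIR with its cutoff at the new Nyquist frequency (target_sr / 2).
    n = np.arange(taps) - (taps - 1) / 2
    kernel = np.sinc(n / factor) / factor * np.hamming(taps)
    kernel /= kernel.sum()  # unity gain at DC
    return np.convolve(signal, kernel, mode="same")[::factor]

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    lower = hz_points[:-2, None]
    center = hz_points[1:-1, None]
    upper = hz_points[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def log_mel_spectrogram(signal, sr, n_fft=1024, hop=256, n_mels=128, eps=1e-10):
    """Return an (n_frames, n_mels) array of log-mel features."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < n_fft:
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping frames and apply a Hann window.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel_energy + eps)  # eps avoids log(0) in silent bands

# Example: a one-second 440 Hz tone at 48000 Hz, downsampled to 16000 Hz.
t = np.arange(48000) / 48000
tone = np.sin(2 * np.pi * 440.0 * t)
audio_16k = downsample(tone, 48000, 16000)
features = log_mel_spectrogram(audio_16k, 16000)
print(features.shape)  # one 128-dimensional log-mel vector per time frame
```

In this sketch each row of `features` is a 128-dimensional log-mel vector for one analysis window. An implementation could instead rely on an audio library for resampling and mel filtering; the NumPy-only version is used here merely to keep the example self-contained.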
Similarly, systems described herein may downsample video content to a specified resolution (e.g., 320×180). Furthermore, these systems may crop video content to achieve a consistent size and/or aspect ratio. In some examples, these systems may also crop each frame to remove potentially irrelevant content. For example, these systems may crop approximately 2% of the frame horizontally and approximately 15% of the frame vertically to remove potentially irrelevant textual content. In addition, these systems may reformat the content as a vector (e.g., converting each downsampled 320×180 frame to a 57600×1 vector, i.e., one value per pixel, since 320 × 180 = 57600).
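By way of illustration only, the video pre-processing described above might be sketched as follows in Python using NumPy. Several details are assumptions made for this sketch and are not stated in the description: the frame is converted to a single grayscale channel (which is consistent with a 57600×1 vector but not specified), the cropped margins are split evenly between opposite edges, cropping is performed before resizing so that the output remains 320×180, and nearest-neighbor sampling stands in for whatever resizing filter an implementation would use. The function names are hypothetical.

```python
import numpy as np

def to_grayscale(frame):
    """Collapse an H x W x 3 RGB frame to H x W luminance (BT.601 weights)."""
    return frame[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def crop_margins(frame, horiz_frac=0.02, vert_frac=0.15):
    """Remove a fraction of the width and height, split evenly between edges."""
    h, w = frame.shape[:2]
    dx = int(round(w * horiz_frac / 2))
    dy = int(round(h * vert_frac / 2))
    return frame[dy:h - dy, dx:w - dx]

def resize_nearest(frame, out_w=320, out_h=180):
    """Nearest-neighbor resize to out_h x out_w (no anti-aliasing)."""
    h, w = frame.shape[:2]
    rows = np.minimum((np.arange(out_h) + 0.5) * h / out_h, h - 1).astype(int)
    cols = np.minimum((np.arange(out_w) + 0.5) * w / out_w, w - 1).astype(int)
    return frame[rows[:, None], cols[None, :]]

def frame_to_vector(frame):
    """Grayscale, crop, resize to 320x180, and flatten to a 57600 x 1 column."""
    small = resize_nearest(crop_margins(to_grayscale(frame)))
    return small.reshape(-1, 1)

# Example: a synthetic 1280x720 RGB frame.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
vector = frame_to_vector(frame)
print(vector.shape)  # (57600, 1)
```

Flattening each frame to a fixed-length column vector allows frames from different sources to be compared element by element, which is why a consistent size and aspect ratio are established first.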