Taking the case of audio content as an example, the systems described herein may compute the standardized Euclidean distance between each pair of audio segments (across the first and second media objects), as shown in Equation (1):
$$\sqrt{\sum_{i} \frac{(u_i - v_i)^2}{V[i]}} \qquad (1)$$
where u and v are the vectors representing a segment from the first and second media objects, respectively (e.g., 128-dimensional log-mel features, as described earlier), and V is a variance vector, V[i] being the variance computed over the i-th components of all of the log-mel vectors.
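By way of illustration only, the distance of Equation (1) may be sketched as follows. The function names are hypothetical, and the sketch assumes that the variance vector is a population variance taken over whatever collection of feature vectors is supplied (for example, the segments of both media objects pooled together); the description above does not fix either choice.

```python
import math


def component_variances(vectors):
    """Per-component population variance over equal-length feature vectors.

    Assumption: population (divide-by-n) variance; a sample variance
    would serve equally well for illustration.
    """
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(vec[i] for vec in vectors) / n for i in range(dims)]
    return [
        sum((vec[i] - means[i]) ** 2 for vec in vectors) / n
        for i in range(dims)
    ]


def standardized_euclidean(u, v, variance):
    """Equation (1): sqrt(sum_i (u_i - v_i)^2 / V[i]).

    Every entry of `variance` must be non-zero; a constant component
    would otherwise cause a division by zero.
    """
    return math.sqrt(
        sum((a - b) ** 2 / s for a, b, s in zip(u, v, variance))
    )
```

In this sketch, a component with large variance across the segments contributes less to the distance than a component with small variance, which is the effect of the division by V[i].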
Taking the case of video content as an example, the systems described herein may compute the structural similarity index measure for each pair of video frames (across the first and second media objects). Generally, the systems described herein may use any suitable similarity metric, including, e.g., the mean squared error.
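As a non-limiting sketch, the two frame metrics named above may be computed as follows for grayscale frames represented as nested lists of pixel values. The function names are hypothetical. The structural similarity function below is a simplified, single-window variant computed over the whole frame, using the commonly used stabilizing constants (0.01 and 0.03 times the dynamic range, squared); practical implementations typically average the measure over local sliding windows, which this sketch does not do.

```python
def _flatten(frame):
    """Flatten a frame given as rows of pixel values into one list."""
    return [pixel for row in frame for pixel in row]


def mean_squared_error(frame_a, frame_b):
    """Mean of squared per-pixel differences; 0.0 for identical frames."""
    a, b = _flatten(frame_a), _flatten(frame_b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def global_ssim(frame_a, frame_b, dynamic_range=255.0):
    """Single-window structural similarity over the entire frame.

    Assumption: one global window rather than local sliding windows.
    """
    a, b = _flatten(frame_a), _flatten(frame_b)
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator
```

Note that the two metrics point in opposite directions: a lower mean squared error indicates more similar frames, whereas a higher structural similarity value does.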
As mentioned earlier, the systems described herein may determine which segment pairs are substantially the same and which are not, and classify each pair accordingly. Thus, for example, the systems described herein may compare each computed pairwise distance against a predetermined distance threshold to determine whether the segments of a given pair are the same or different. After having performed a pairwise comparison of each of the segments from the first media data object with each of the segments from the second media data object and classified each pair as the same or different, in some examples the systems described herein may determine the longest common subsequence between the first and second sets of segments.
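For illustration, the classification and longest-common-subsequence steps may be sketched as below. The function names are hypothetical, and the sketch assumes that a pair is classified as the same when its distance is at or below the threshold; the description does not state whether the comparison is inclusive. The subsequence is found with the standard dynamic-programming recurrence, using the thresholded classification in place of an exact equality test.

```python
def classify_pairs(distances, threshold):
    """Return same[i][j], True when segment i of the first media object
    and segment j of the second are classified as the same.

    Assumption: "same" means distance <= threshold (inclusive).
    """
    return [[d <= threshold for d in row] for row in distances]


def longest_common_subsequence(same):
    """Return the matched (i, j) index pairs of one longest common
    subsequence, in increasing order of both i and j."""
    n = len(same)
    m = len(same[0]) if same else 0
    # best[i][j] = LCS length of first[i:] and second[j:]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if same[i][j]:
                best[i][j] = 1 + best[i + 1][j + 1]
            else:
                best[i][j] = max(best[i + 1][j], best[i][j + 1])
    # Walk forward through the table to recover the matched pairs.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if same[i][j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif best[i + 1][j] >= best[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

For example, if the first media object has four segments and the second has the same segments with the second one removed, the returned pairs would align segments 0, 2, and 3 of the first object with segments 0, 1, and 2 of the second, leaving segment 1 of the first object unmatched.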