Additionally or alternatively, the campaign identification platform 110 may train the one or more neural networks to learn and apply a vocabulary of sub-words and/or pieces of words that are adapted to threat identification and clustering. For example, the campaign identification platform 110 may use sub-word embedding to train the one or more neural networks (e.g., when applying the one or more neural networks to text data) rather than applying the one or more neural networks to each letter individually and/or to each word separately. As an example, in using sub-word embedding to train the one or more neural networks, the campaign identification platform 110 may feed strings of characters that often occur together in training data (e.g., “.com”) into the one or more neural networks as sub-words.
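As a non-limiting illustration, a sub-word vocabulary of the kind described above might be learned by repeatedly merging the pair of adjacent symbols that occurs most often in the training data. The sketch below assumes a greedy, byte-pair-encoding-style procedure, which is one possible technique and is not stated in the disclosure to be the one used. The function names (`learn_merges`, `apply_merge`, `tokenize`) and the sample strings are hypothetical.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one joined symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of training strings.

    Assumption: a greedy BPE-style procedure; the disclosure does not
    specify how the sub-word vocabulary is learned.
    """
    # Start with every training string split into single characters.
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair first; ties are broken lexicographically
        # so that the result is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # no pair repeats, so nothing "often falls together"
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Split `text` into sub-words by replaying the learned merges in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical training strings that share the suffix ".com".
merges = learn_merges(["a.com", "b.com", "c.com"], num_merges=10)
tokens = tokenize("x.com", merges)
```

In this sketch, the characters of “.com” recur in every training string, so they are merged into a single sub-word that can then be fed to a network as one unit rather than as four separate letters.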
By training the one or more neural networks to use both metric learning and sub-word embeddings to produce numerical representations, the campaign identification platform 110 may continuously adapt to, and thereby address, technical challenges presented by the constantly changing nature of the threat landscape. For example, features of different threats may be constantly changing (e.g., the filenames of files that threats may create, the URLs that threats may attempt to communicate with, and/or other objects may change). Similarly, the labels that may be applied to different threats may be constantly changing (e.g., due to the temporal nature of campaigns). As a result of these challenges, conventional supervised machine learning (which may, e.g., attempt to find an association between features and labels) might not work to address the technical problems addressed herein. One or more aspects of the disclosure, however, may provide various advantages over these conventional approaches, for instance, by providing the capability to continuously adapt to changing threats, as noted above.
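As a non-limiting illustration of why metric learning may tolerate changing labels, the sketch below computes a triplet margin loss over numerical representations. This objective is assumed here as one common form of metric learning; the disclosure does not identify a particular loss function. The names `euclidean` and `triplet_loss`, the margin value, and the sample vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numerical representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss (assumed objective, not taken from the disclosure).

    The loss is zero once the anchor is closer to the positive (e.g., a
    sample from the same campaign) than to the negative (e.g., a sample
    from a different campaign) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical two-dimensional representations.
anchor = [0.0, 0.0]
same_campaign = [0.0, 1.0]    # distance 1 from the anchor
other_campaign = [3.0, 4.0]   # distance 5 from the anchor

well_separated = triplet_loss(anchor, same_campaign, other_campaign)
poorly_separated = triplet_loss(anchor, other_campaign, same_campaign)
```

Because such an objective depends only on whether two samples belong together, and not on a fixed set of class labels, a network trained this way may place samples from a previously unseen campaign near one another without being retrained on new labels.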