Scanning module 106 may scan the data input to the application in a variety of ways. Notably, scanning module 106, the data loss prevention scanner, and/or the data loss prevention filters may all correspond to the same module or may instead refer to one or more separate modules that communicate with each other. Moreover, prior to scanning the data input to the application, scanning module 106 and/or monitoring module 104 may extract underlying content from the data (e.g., through a text extractor). For example, scanning module 106 may distinguish between metadata, formatting data, and data specifying how content is presented, on the one hand, and underlying content or text, on the other hand. In some examples, scanning module 106 may discard some or all of the metadata and formatting data while retaining the underlying content to be scanned for sensitive data. In other examples, scanning module 106 may scan the metadata as well.
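The extraction and scanning described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of scanning module 106: the `InputData` container, the `extract_text` and `scan_for_sensitive_data` functions, and the Social Security number pattern are all hypothetical names and choices introduced only for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical sensitive-data pattern (assumption): U.S. Social Security number format.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class InputData:
    """Hypothetical container for data input to an application."""
    metadata: dict = field(default_factory=dict)    # e.g., author, comments
    formatting: dict = field(default_factory=dict)  # e.g., fonts, layout
    content: str = ""                               # underlying text


def extract_text(data: InputData, include_metadata: bool = False) -> str:
    """Keep the underlying content; formatting data is always discarded.

    Metadata values are appended only when include_metadata is True.
    """
    parts = [data.content]
    if include_metadata:
        parts.extend(str(value) for value in data.metadata.values())
    return "\n".join(parts)


def scan_for_sensitive_data(data: InputData, include_metadata: bool = False) -> list:
    """Return every substring of the extracted text that matches the pattern."""
    return SSN_PATTERN.findall(extract_text(data, include_metadata))


doc = InputData(
    metadata={"author": "J. Doe", "comment": "SSN 987-65-4321"},
    formatting={"font": "Arial"},
    content="Applicant SSN: 123-45-6789",
)

content_only = scan_for_sensitive_data(doc)
with_metadata = scan_for_sensitive_data(doc, include_metadata=True)
```

In this sketch, the `include_metadata` flag stands in for the two alternatives described above: discarding the metadata before scanning, or scanning the metadata as well.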
Notably, as used herein, use of the text extractor to extract text from document files and other files does not indicate that those files “obfuscate” the presence of underlying textual content, as discussed further below for step 308. In other words, the text extractor simply extracts text that is already present within the file (e.g., the document file) as encoded characters, whereas text that is displayed within a multimedia file, such as an image file, is not present within the file in ASCII or another text encoding. Instead, the text displayed within the media file is encoded only in terms of pixels, sound waves, etc., as discussed further below.
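The distinction may be illustrated with a small sketch. The byte strings, the pixel glyph, and the `contains_text` helper below are hypothetical and are chosen only to show that a character-level search finds text stored as encoded characters but does not find text that is merely depicted by pixels.

```python
# Hypothetical document file: the text is stored directly as encoded characters
# alongside formatting data.
document_bytes = b"<doc><style font='Arial'/><body>PIN 4921</body></doc>"

# Hypothetical grayscale image that displays the digit "1" as a 3x5 pixel glyph.
# Each pixel is one byte: 0 (background) or 255 (foreground).
GLYPH_ONE = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]
image_bytes = bytes(pixel * 255 for row in GLYPH_ONE for pixel in row)


def contains_text(raw: bytes, needle: str) -> bool:
    """Report whether the ASCII encoding of needle appears in the raw bytes."""
    return needle.encode("ascii") in raw


found_in_document = contains_text(document_bytes, "4921")
found_in_image = contains_text(image_bytes, "1")
```

Under these assumptions, the search succeeds on the document bytes and fails on the image bytes, even though a viewer of the image would see the digit.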