The remote computing devices 106a associated with the VAS 160 may process the voice input by converting the voice input to text (for example, via a speech-to-text component, discussed above with reference to FIG. 6) and analyzing the text to determine the intent of the request. In some embodiments, the remote computing devices 106a may employ NLU systems that maintain and utilize a lexicon of language, parsers, grammar and semantic rules, and associated processing algorithms to derive information related to the requested media content. For example, the VAS 160 may (i) identify derived payload 783a and/or field types 870 within the voice input that correspond to the intent of the voice input, and (ii) associate the derived payload 783a with one or more of the fields. The derived payload 783a and/or field types 870 identified by the VAS 160 and contained within the packet 783 may be derived by the VAS 160 based on a search and/or metadata provided by the MPS 100 (described in greater detail below) and/or may be stated explicitly by the user. For example, the voice input “Play the ‘In the Zone’ album” explicitly names derived payload 783a (i.e., “In the Zone”) and a field type (i.e., “album”); as such, the resulting response 783 would include {album: “In the Zone”}. In some embodiments, the response 783 contains only the fields populated with derived payload 783a. In particular embodiments, the response 783 contains all of the predefined fields, whether null or populated. In certain cases, the response 783 from the VAS 160 does not include any metadata derived from the voice input.
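The field-extraction behavior described above can be illustrated with a minimal sketch. The field types, function name, and parsing logic below are illustrative assumptions only, not the actual NLU implementation of the VAS 160; it merely shows how an explicitly stated payload and field type (e.g., “In the Zone” and “album”) might be mapped into a response containing either only the populated fields or all predefined fields, whether null or populated.

```python
import re

# Hypothetical set of predefined field types (an assumption for
# illustration; the actual fields used by the VAS may differ).
FIELD_TYPES = ("album", "artist", "track", "playlist")

def derive_fields(transcript, include_null_fields=False):
    """Extract {field_type: derived_payload} pairs from a transcript.

    A quoted name followed by an explicit field-type word, e.g.
    "Play the 'In the Zone' album", yields {"album": "In the Zone"}.
    """
    fields = {}
    # Match a quoted payload followed by a field-type word; accepts
    # straight or curly quotes.
    match = re.search(r"[‘']([^’']+)[’']\s+(\w+)", transcript)
    if match and match.group(2).lower() in FIELD_TYPES:
        fields[match.group(2).lower()] = match.group(1)
    if include_null_fields:
        # Variant in which the response carries all predefined fields,
        # whether null or populated.
        return {ft: fields.get(ft) for ft in FIELD_TYPES}
    # Variant in which only the populated fields are returned.
    return fields

print(derive_fields("Play the ‘In the Zone’ album"))
# {'album': 'In the Zone'}
```

With `include_null_fields=True`, the same call would also carry `artist`, `track`, and `playlist` keys set to `None`, corresponding to the embodiment in which all predefined fields are returned regardless of population.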