Referring to block 784 of FIG. 7D, the MPS 100 may capture a user's voice input in response to a request by the MPS 100 for the user to select one of the available MCS(es). The MPS 100 may then send the voice input 795 to the VAS 160 for processing to determine the intent (block 796) of the voice input. The VAS 160 may send a response or packet 797 to the MPS 100 that contains information identifying the MCS selection made by the user. The MPS 100 may then process the response 797 (block 798) and generate a desired message for the user. The MPS 100 may send a request 799 to the VAS 160 to convert the message into voice data that can be played back by the MPS 100 as a voice output to the user. In some embodiments, the message may be a confirmation to the user that the MPS 100 will play or is already playing the user's requested media content on a certain one of the MCS(es). For example, the MPS 100 may play back a voice output such as “You are listening to ‘Jagged Little Pill’ on SPOTIFY.” At block 831, the VAS 160 converts the message into the requested voice data and transmits a packet 832 containing the voice data to the MPS 100. Before, concurrently with, and/or after playing back the voice output (at block 833) to the user, the MPS 100 may exchange data (block 834) with the selected MCS to play back the requested and found media content (for example, via one or more of the playback devices 102). In some instances, it may be beneficial to play the voice output confirming the media content and/or MCS selection before playing back the media content, because retrieving the media content from the MCS for playback may introduce latency, and the voice output can fill that latency for the user.
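By way of illustration only, the exchange described above might be sketched in Python as follows. The sketch is a simplified model under stated assumptions, not an implementation of any particular embodiment: the names `FakeVAS`, `FakeMCS`, and `handle_mcs_selection` are hypothetical, the VAS 160 and the MCS are replaced by in-process stand-ins, and the retrieval delay is simulated. It shows one possible ordering in which retrieval of the media content begins while the voice output is played, so that the voice output fills the retrieval latency.

```python
import asyncio

# Hypothetical stand-ins for illustration only; none of these names come
# from the disclosure or from any real VAS or MCS interface.


class FakeVAS:
    """Stand-in for the VAS 160: resolves intent and synthesizes speech."""

    def determine_intent(self, voice_input: bytes) -> dict:
        # Block 796: a real VAS would process the voice input 795 here.
        # This stub always returns the same selection (response 797).
        return {"mcs": "SPOTIFY", "content": "Jagged Little Pill"}

    def text_to_speech(self, message: str) -> bytes:
        # Block 831: placeholder "voice data" for the message (packet 832).
        return message.encode("utf-8")


class FakeMCS:
    """Stand-in for the selected MCS, with simulated retrieval latency."""

    async def fetch(self, content: str) -> str:
        await asyncio.sleep(0.05)  # assumed delay, for illustration
        return f"stream:{content}"


async def handle_mcs_selection(voice_input, vas, services, events):
    # Voice input 795 -> intent (block 796) -> response 797.
    response = vas.determine_intent(voice_input)
    mcs_name, content = response["mcs"], response["content"]

    # Block 798: build the confirmation message from the response.
    message = f"You are listening to '{content}' on {mcs_name}."

    # Request 799 -> block 831 -> packet 832.
    voice_data = vas.text_to_speech(message)

    # Block 834: begin retrieving the media content in the background.
    fetch_task = asyncio.create_task(services[mcs_name].fetch(content))

    # Block 833: play the voice output while retrieval is still pending.
    events.append(("voice_output", voice_data.decode("utf-8")))

    # Media playback starts once retrieval completes.
    stream = await fetch_task
    events.append(("media_playback", stream))
    return events


if __name__ == "__main__":
    log = asyncio.run(
        handle_mcs_selection(b"raw-audio", FakeVAS(), {"SPOTIFY": FakeMCS()}, [])
    )
    for kind, payload in log:
        print(kind, payload)
```

In this sketch the voice output is recorded before the media playback event because the retrieval task is started first and awaited only after the voice output has been issued. Other orderings described above (voice output concurrently with or after media playback) would change only the position of the `await`.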