As noted above, the sound detected by the first playback device 702a is passed via the second network interface (represented by arrow I(d)) to the second voice processor 760b, which processes the detected sound and transmits it to the second wake word engine 770b (represented by arrow I(e)). The second wake word engine 770b then processes the detected sound for detection of the second wake word, which may occur before, after, or while the first wake word engine 770a processes the detected sound for the first wake word. As such, the first and second playback devices 702a, 702b are configured to monitor sound detected by the microphones 722 of the first playback device 702a for different wake words associated with different VASes. This allows a user to realize the benefits of multiple VASes, each of which may excel in different aspects, rather than requiring the user to limit her interactions to a single VAS to the exclusion of any others. Moreover, distributing wake word detection across multiple playback devices of the system frees up computational resources (e.g., processing time and power) as compared to a single playback device running two wake word engines. Accordingly, the playback devices of the present technology may be configured to efficiently process detected sound, thereby enhancing the responsiveness and accuracy of the media playback system in responding to a user's command.
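
By way of illustration only, the distributed arrangement described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of the disclosed devices: the class and function names (`WakeWordEngine`, `PlaybackDevice`, `monitor`), the wake words "alpha" and "beta", the use of text strings in place of audio, and the in-memory queue standing in for the network path of arrow I(d) are all hypothetical. Each device runs its own engine on its own thread, so the second wake word may be checked before, after, or while the first is checked.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class WakeWordEngine:
    """Toy stand-in for a wake word engine such as 770a or 770b."""
    wake_word: str  # hypothetical wake word, one per VAS

    def detect(self, sound: str) -> bool:
        # A real engine analyzes audio; substring matching is a placeholder.
        return self.wake_word in sound.lower()


@dataclass
class PlaybackDevice:
    """Toy stand-in for a playback device such as 702a or 702b."""
    name: str
    engine: WakeWordEngine
    inbox: queue.Queue = field(default_factory=queue.Queue)
    detections: list = field(default_factory=list)

    def voice_processor(self, sound: str) -> str:
        # Placeholder for the voice processor stage (e.g., 760a or 760b).
        return sound.strip()

    def run(self) -> None:
        # Process sounds in arrival order until the None sentinel arrives.
        while True:
            sound = self.inbox.get()
            if sound is None:
                break
            processed = self.voice_processor(sound)
            if self.engine.detect(processed):
                self.detections.append(processed)


def monitor(sounds, first: PlaybackDevice, second: PlaybackDevice) -> dict:
    """Feed sound captured at the first device to both devices' engines."""
    threads = [threading.Thread(target=d.run) for d in (first, second)]
    for t in threads:
        t.start()
    for sound in sounds:
        first.inbox.put(sound)   # local path on the first device
        second.inbox.put(sound)  # assumed stand-in for the network path I(d)
    for device in (first, second):
        device.inbox.put(None)   # signal end of input
    for t in threads:
        t.join()
    return {d.name: d.detections for d in (first, second)}
```

In this sketch each device hosts a single engine, which mirrors the point made above that neither device bears the processing cost of running two wake word engines. A caller would construct two `PlaybackDevice` objects with different wake words and pass them to `monitor` along with the sounds captured at the first device.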