A network microphone device further includes components for detecting and facilitating capture of voice input. For example, the network microphone device 503 shown in FIG. 5A includes beamforming components 551, acoustic echo cancellation (AEC) components 552, voice activity detector components 553, and/or wake word detector components 554. In various embodiments, one or more of the components 551-556 may be a subcomponent of the processor 512. The beamforming and AEC components 551 and 552 are configured to detect an audio signal and determine aspects of voice input within the detected audio, such as the direction, amplitude, frequency spectrum, etc. For example, the beamforming and AEC components 551 and 552 may be used in a process to determine an approximate distance between a network microphone device and a user speaking to the network microphone device. In another example, a network microphone device may detect a relative proximity of a user to another network microphone device in a media playback system.
The voice activity detector components 553 are configured to work closely with the beamforming and AEC components 551 and 552 to capture sound from directions where voice activity is detected. Potential speech directions can be identified by monitoring metrics that distinguish speech from other sounds. Such metrics can include, for example, energy within the speech band relative to background noise and entropy within the speech band, which is a measure of spectral structure. Speech typically has a lower entropy than most common background noise.
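By way of illustration only, the two metrics described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the voice activity detector components 553: the function names, the 300-3400 Hz band limits, the frame length, the sample rate, and the `noise_floor` estimate are all hypothetical choices made for the example, and a naive discrete Fourier transform stands in for whatever spectral analysis an actual device would use.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def speech_band_metrics(frame, sample_rate, noise_floor, band=(300.0, 3400.0)):
    """Return (in-band energy relative to noise_floor in dB,
    spectral entropy of the band normalized to the range 0..1).

    noise_floor is an assumed, separately obtained estimate of the in-band
    background noise power, in the same units as the summed power spectrum.
    """
    n = len(frame)
    power = power_spectrum(frame)
    lo = math.ceil(band[0] * n / sample_rate)    # first bin inside the band
    hi = math.floor(band[1] * n / sample_rate)   # last bin inside the band
    in_band = power[lo:hi + 1]
    if len(in_band) < 2:
        raise ValueError("frame too short to resolve the speech band")
    total = sum(in_band)
    if total <= 0.0:
        # Silent frame: no energy; treated here as a flat (maximum entropy) band.
        return float("-inf"), 1.0
    energy_db = 10.0 * math.log10(total / noise_floor)
    # Treat the normalized in-band power as a probability distribution.
    probs = [p / total for p in in_band if p > 0.0]
    entropy = -sum(p * math.log(p) for p in probs) / math.log(len(in_band))
    return energy_db, entropy


# Hypothetical test signals: a harmonic tone complex (spectrally structured,
# loosely speech-like) and white noise (spectrally flat).
SAMPLE_RATE = 8000
FRAME_LEN = 256
rng = random.Random(0)
tone = [sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in (500, 1000, 1500))
        for i in range(FRAME_LEN)]
noise = [rng.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]

tone_db, tone_entropy = speech_band_metrics(tone, SAMPLE_RATE, noise_floor=1.0)
noise_db, noise_entropy = speech_band_metrics(noise, SAMPLE_RATE, noise_floor=1.0)
print(f"tone:  {tone_db:.1f} dB, entropy {tone_entropy:.3f}")
print(f"noise: {noise_db:.1f} dB, entropy {noise_entropy:.3f}")
```

In this sketch, the structured signal concentrates its in-band power in a few frequency bins and therefore yields a lower normalized entropy than the noise frame, consistent with the observation that speech typically has a lower entropy than most common background noise. In some embodiments, such metrics could be evaluated per beamformed direction so that directions with high relative energy and low entropy are treated as potential speech directions.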