In some implementations, the analyzed speech data 116 may be the (e.g., raw) audio data of the user's recorded voice. In such implementations, the model 128 of a user's speech may model the user's particular grammar, syntax, vocabulary, and/or other textual characteristics, and/or audio characteristics of the user's speech such as pitch, timbre, volume, pace and/or rhythm of speaking, pause patterns, and/or other such characteristics. The authentication engine 124 may provide the audio data as input to the model 128 for the particular user 102, and the model may compare the audio data to the modeled characteristics of the user's speech to determine a probability that the speaker corresponds to the modeled user. If the match probability output by the model exceeds a predetermined threshold, the authentication engine 124 may determine that the user 102 has been successfully authenticated.
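By way of illustration, the threshold comparison described above might be sketched as follows in Python. This is a minimal sketch under stated assumptions: the `match_probability()` interface on the per-user model and the threshold value of 0.90 are hypothetical and are not specified by this disclosure.

```python
# Illustrative sketch of the voice-based threshold check described above.
# The match_probability() method and AUTH_THRESHOLD are assumptions for
# illustration only; the actual model interface and threshold may differ.

AUTH_THRESHOLD = 0.90  # assumed predetermined threshold


def authenticate_by_voice(speech_model, audio_data, threshold=AUTH_THRESHOLD):
    """Return True if the speaker matches the modeled user.

    speech_model: a per-user model (e.g., model 128) assumed to expose a
        match_probability() method that compares incoming audio against the
        modeled textual characteristics (grammar, syntax, vocabulary) and
        audio characteristics (pitch, timbre, volume, pace, pause patterns).
    audio_data: raw audio of the speaker's recorded voice
        (e.g., speech data 116).
    """
    probability = speech_model.match_probability(audio_data)
    return probability > threshold
```

In this sketch, the authentication engine (e.g., authentication engine 124) would call `authenticate_by_voice()` with the captured audio and treat a `True` result as a successful authentication of the user.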
In some implementations, the authentication of the user 102 is based on video data 118 in addition to speech data 116. For example, the camera(s) 110 in the PA device 104 may capture video and/or still image(s) of the user's face and/or other body parts, and the video data 118 may be provided as input to the model(s) 128. The model(s) 128 may model feature(s) and/or movements of the user in addition to modeling speech characteristics, and the use of the video data 118 may provide a higher-confidence verification of the user's identity compared to verification using the audio data alone. For example, the model 128 may analyze image(s) of the user's face and/or body, and/or video of the user's facial expressions, gestures, gait, and/or other aspects of the user's behavior, to authenticate the user 102.
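One possible sketch of such multimodal verification is shown below, again in Python. The disclosure does not specify how the audio-based and video-based results are combined; the equal-weight score fusion used here, along with the `match_probability()` interfaces and threshold value, are illustrative assumptions only.

```python
# Illustrative sketch of combining audio and video match scores for
# higher-confidence verification. The fusion strategy (equal weighting),
# the model interfaces, and the threshold are assumptions for illustration.

def authenticate_multimodal(speech_model, vision_model, audio_data,
                            video_data, threshold=0.95):
    """Return True if fused audio and video scores match the modeled user.

    speech_model: per-user speech model, as in the audio-only sketch above.
    vision_model: assumed per-user model scoring facial features,
        expressions, gestures, gait, and/or other visual characteristics
        against the modeled user (e.g., model(s) 128).
    audio_data: raw audio of the speaker (e.g., speech data 116).
    video_data: video and/or still image(s) of the user
        (e.g., video data 118).
    """
    p_audio = speech_model.match_probability(audio_data)
    p_video = vision_model.match_probability(video_data)

    # Assumed fusion: average the two modality scores. Other strategies
    # (e.g., requiring each score to exceed its own threshold) are equally
    # plausible under this disclosure.
    combined = 0.5 * p_audio + 0.5 * p_video
    return combined > threshold
```

Because the combined score draws on two independent modalities, an impostor would need to defeat both the speech model and the vision model, which is one way to understand the higher-confidence verification described above.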