In some examples, speech recognition machine 130 may be configured to segment speech audio into words (e.g., using LSTM trained to recognize word boundaries, and/or separating words based on silences or amplitude differences between adjacent words). In some examples, speech recognition machine 130 may classify individual words to assess lexical data for each individual word (e.g., character sequences, word sequences, n-grams). In some examples, speech recognition machine 130 may employ dependency and/or constituency parsing to derive a parse tree for lexical data. In some examples, speech recognition machine 130 may operate AI and/or ML models (e.g., LSTM) to translate speech audio and/or vectors representing speech audio in the learned representation space, into lexical data, wherein translating a word in the sequence is based on the speech audio at a current time and further based on an internal state of the AI and/or ML models representing previous words from previous times in the sequence. Translating a word from speech audio to lexical data in this fashion may capture relationships between words that are potentially informative for speech recognition, e.g., recognizing a potentially ambiguous word based on a context of previous words, and/or recognizing a mispronounced word based on a context of previous words. Accordingly, speech recognition machine 130 may be able to robustly recognize speech, even when such speech may include ambiguities, mispronunciations, etc.