Further, in tracking mode 206, the tracking algorithm returns information for a bounding box containing a respective representation of a face, and features of the user's face are extracted from within the bounding box for each image. These features are used to determine an output for the location of the user's eyes, mouth, or one or more points relative to these features that is smoother and less jittery than simply providing the current location of these features. The change in position of these features between subsequent images can be used to determine which output to provide, or to adjust an output, for a current location of these features. For example, the change in position of the user's eyes, as measured by optical flow, can be calculated between a current image and a previous image. If this change in position is less than a first threshold, then the position of the user has changed only slightly relative to the position in the previous frame. Since this change is small, the user's current eye position can be reasonably estimated as the location in the previous frame, as if the user had not moved. In another example, if this change is between the first threshold and a second threshold, a single point tracking algorithm can be used to track the user's eyes between these two frames. If, however, this change in optical flow is greater than the second threshold, the current position of the user's eyes can be used. In this instance, the tracking output will appear quite jittery; however, since the change in eye position is so great (i.e., greater than the second threshold), the user has moved quickly or abruptly, and thus an abrupt change would be not only acceptable but likely expected. Once the current location of the eyes, in this example, is determined for each image captured by each camera, the stereo disparity between these current locations across the images is determined.
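By way of illustration only, the three-way threshold decision described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `select_eye_position`, its parameters, and the use of Euclidean pixel displacement as the measure of change are hypothetical, and the result of the single point tracking algorithm is assumed to be computed elsewhere and passed in as `tracked_pos`. The handling of a change exactly equal to either threshold is likewise an assumption, since the description does not specify it.

```python
from math import hypot


def select_eye_position(prev_pos, curr_pos, tracked_pos,
                        first_threshold, second_threshold):
    """Choose which eye position to output for the current frame.

    prev_pos, curr_pos, tracked_pos are (x, y) pixel coordinates.
    tracked_pos is the result of a single point tracking algorithm
    (hypothetical input, computed elsewhere).
    """
    # Assumption: the change in position is measured as the Euclidean
    # displacement, in pixels, between the previous and current frames.
    change = hypot(curr_pos[0] - prev_pos[0], curr_pos[1] - prev_pos[1])

    if change < first_threshold:
        # Small change: treat the user as not having moved.
        return prev_pos
    if change <= second_threshold:
        # Moderate change: use the single point tracking result.
        return tracked_pos
    # Large change: the user moved abruptly, so use the current position.
    return curr_pos
```

In such a sketch, holding the previous position for small changes suppresses frame-to-frame jitter, while passing the current position through for large changes lets the output follow abrupt movement without lag.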
The stereo disparity is then used to determine a z-depth for the eyes, by calculating a distance between the eyes and the computing device, in order to determine a three-dimensional position (x, y, z) of the eyes relative to the computing device.
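As one illustrative possibility, the conversion from stereo disparity to a three-dimensional position can be sketched using the standard pinhole relationship for a rectified camera pair, z = f · B / d. The description does not specify a particular formula, so the rectified-pair assumption, the function name `eye_position_3d`, and the parameters `focal_px` (focal length in pixels), `baseline_m` (camera separation in meters), and `principal_point` are hypothetical. The position is expressed relative to the first (left) camera, which here stands in for the computing device.

```python
def eye_position_3d(left_px, right_px, focal_px, baseline_m, principal_point):
    """Estimate an (x, y, z) eye position from a rectified stereo pair.

    left_px, right_px: (u, v) pixel location of the eye in each image.
    focal_px: focal length in pixels (assumed equal for both cameras).
    baseline_m: distance between the two cameras, in meters.
    principal_point: (cx, cy) optical center of the left image, in pixels.
    """
    # Disparity is the horizontal shift of the eye between the two images.
    disparity = left_px[0] - right_px[0]
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")

    # Depth from disparity under the pinhole model: z = f * B / d.
    z = focal_px * baseline_m / disparity

    # Back-project the left-image pixel to metric x and y at that depth.
    x = (left_px[0] - principal_point[0]) * z / focal_px
    y = (left_px[1] - principal_point[1]) * z / focal_px
    return (x, y, z)
```

Because depth is inversely proportional to disparity in this model, a small error in the eye location has a larger effect on z when the user is far from the device, which is one reason a smoothed eye location is useful as the input.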