A method for facilitating a remote conference includes receiving a digital video and a computer-readable audio signal. A face recognition machine is operated to recognize a face of a first conference participant in the digital video, and a speech recognition machine is operated to translate the computer-readable audio signal into a first text. An attribution machine attributes the first text to the first conference participant. A second computer-readable audio signal is processed similarly to obtain a second text attributed to a second conference participant. A transcription machine automatically creates a transcript including the first text attributed to the first conference participant and the second text attributed to the second conference participant. Transcription can be extended to a variety of scenarios to coordinate the conference, facilitate communication among conference participants, record events of interest during the conference, track whiteboard drawings and digital files shared during the conference, and more generally create a robust record of multi-modal interactions among conference participants. Participants can use the conference transcript to review the multi-modal interactions and other events of interest that occurred during the conference. The conference transcript can also be analyzed to give conference participants feedback on their own participation, on other participants, and on team or organizational trends.
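The attribution and transcription steps described above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `TranscriptEntry`, `build_transcript`, and `format_transcript` are hypothetical, the face and speech recognition machines are represented only as caller-supplied functions, and pairing each audio signal with one video segment is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class TranscriptEntry:
    """One attributed line of the transcript (hypothetical structure)."""
    speaker: str  # participant named by the face recognition step
    text: str     # text produced by the speech recognition step


def build_transcript(
    segments: Iterable[Tuple[object, object]],
    recognize_face: Callable[[object], str],
    recognize_speech: Callable[[object], str],
) -> List[TranscriptEntry]:
    """Attribute each recognized text to the participant recognized in the paired video.

    Assumption: each segment is a (video, audio) pair in which the visible
    participant is the speaker. A real attribution step would need more evidence.
    """
    transcript: List[TranscriptEntry] = []
    for video, audio in segments:
        speaker = recognize_face(video)   # stand-in for the face recognition machine
        text = recognize_speech(audio)    # stand-in for the speech recognition machine
        transcript.append(TranscriptEntry(speaker=speaker, text=text))
    return transcript


def format_transcript(entries: Iterable[TranscriptEntry]) -> str:
    """Render the entries as 'speaker: text' lines, in the order they were attributed."""
    return "\n".join(f"{entry.speaker}: {entry.text}" for entry in entries)


if __name__ == "__main__":
    # Lookup tables stand in for the recognizers; the data is invented for the demo.
    faces = {"video-1": "First participant", "video-2": "Second participant"}
    speech = {"audio-1": "Shall we begin?", "audio-2": "Yes, go ahead."}
    demo = build_transcript(
        [("video-1", "audio-1"), ("video-2", "audio-2")],
        faces.__getitem__,
        speech.__getitem__,
    )
    print(format_transcript(demo))
```

Passing the recognizers in as functions keeps the sketch independent of any particular face or speech recognition library, so the attribution and transcript-assembly logic can be read on its own.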