For example, FIG. 13 gives the overall workflow 1300 of a preferred embodiment of the training process. For training, the actual Video Compression & Transmission module 135 is replaced by a Video Noise Modeling module 235. This is because the actual video compression includes non-differentiable processes such as quantization. The Video Noise Modeling module 235 adds random noise to the down-sampled sequence X′=x′1, x′2, . . . to generate the decoded down-sampled sequence X̂′=x̂′1, x̂′2, . . . in the training process, mimicking the true data distribution of the decoded down-sampled sequence in the final test stage. Therefore, the noise model used by the Video Noise Modeling module 235 usually depends on the actual video compression method used in practice. Similarly, the EFA Feature Compression & Transmission module 127 is replaced by an EFA Feature Noise Modeling module 227, which adds noise to Fb,1, Fb,2, . . . to generate the decoded EFA features F̂b,1, F̂b,2, . . . in the training stage, mimicking the data distribution of the actual decoded EFA features in practice. Likewise, the Landmark Feature Compression & Transmission module 126 is replaced by a Landmark Feature Noise Modeling module 226, which adds noise to Fl,1, Fl,2, . . . to generate the decoded landmark features F̂l,1, F̂l,2, . . . in the training stage, mimicking the true distribution of the decoded landmark features in practice. Exemplary embodiments compute the following loss functions for training.
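The noise-modeling substitution described above can be illustrated with a minimal sketch. The function name `video_noise_model` and the choice of additive uniform noise are assumptions for illustration only; the patent does not specify a particular noise distribution, but uniform noise over one quantization step is a common differentiable proxy for the rounding performed by an actual codec's quantizer.

```python
import numpy as np

def video_noise_model(x, q_step=1.0, rng=None):
    """Training-time stand-in for the Video Compression & Transmission module.

    Instead of running a non-differentiable codec (which includes hard
    quantization/rounding), add random noise drawn uniformly from
    [-q_step/2, q_step/2], mimicking the quantization error that the
    decoded down-sampled sequence would exhibit at test time.
    The same idea applies to the EFA and landmark feature noise models.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-q_step / 2.0, q_step / 2.0, size=x.shape)
    return x + noise

# Example: simulate a "decoded" down-sampled frame during training.
x_prime = np.zeros((4, 4), dtype=np.float64)          # down-sampled frame x'_t
x_hat = video_noise_model(x_prime, q_step=0.5,
                          rng=np.random.default_rng(0))  # decoded proxy
```

In a real training pipeline the noise parameters (here `q_step`) would be chosen to match the statistics of the actual compression method used at test time, as the text notes.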