On the decoder side, as described for example with respect to the flowchart 1100 of FIG. 11 and the various modules of FIG. 12, encoded bitstreams received at S111 are first decompressed at S112 to obtain the decoded down-sampled sequence X=x′1, x′2, . . . , data 136, the decoded EFA features Fb,1, Fb,2, . . . , data 129, and the decoded facial landmark features Fl,1, Fl,2, . . . , data 128. Each decoded frame x′i corresponds to the down-sampled frame x′i on the encoder side. Each decoded EFA feature Fb,i corresponds to the encoder-side EFA feature Fb,i. Each decoded landmark feature Fl,i corresponds to the encoder-side landmark feature Fl,i. At S113, the decoded down-sampled sequence X=x′1, x′2, . . . is passed through an ST Up Sample module 137 to generate an up-sampled sequence X=x1, x2, . . . , data 138. Corresponding to the encoder side, this ST Up Sample module 137 performs spatial, temporal, or both spatial and temporal up-sampling as the inverse operation of the down-sampling process in the ST Down Sample module 123. When spatial down-sampling is used on the encoder side, spatial up-sampling is used here, where each x′i is up-sampled into xi at the same time stamp, e.g., by traditional interpolation or DNN-based super-resolution methods, so that the up-sampled xi has the same resolution as the corresponding encoder-side frame xi before down-sampling. When temporal down-sampling is used on the encoder side, temporal up-sampling is used here, where each xki is x′i and the additional (k−1) frames between xki and x(k+1)i are computed, e.g., by using traditional motion interpolation or DNN-based frame synthesis methods based on xki and x(k+1)i.
When both spatial and temporal down-sampling are used on the encoder side, both spatial and temporal up-sampling are used here, where each xki is computed from x′i by spatially up-sampling x′i using traditional interpolation or DNN-based super-resolution methods, and the additional frames between xki and x(k+1)i are further generated by using traditional motion interpolation or DNN-based frame synthesis methods based on xki and x(k+1)i.
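
As a non-limiting illustration, the three up-sampling modes described above may be sketched as follows. This sketch is not the disclosed implementation of the ST Up Sample module 137: all function names (`spatial_upsample`, `blend`, `temporal_upsample`, `st_upsample`) and the parameters `s` and `k` are hypothetical, nearest-neighbour pixel repetition stands in for traditional interpolation or DNN-based super-resolution, and a linear blend of the two neighbouring frames stands in for motion interpolation or DNN-based frame synthesis. Indices are 0-based, and the sketch assumes that decoded frame i is placed at output position k·i, with (k−1) synthesized frames before the next decoded frame.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def spatial_upsample(frame: Frame, s: int) -> Frame:
    """Up-sample by integer factor s using nearest-neighbour repetition
    (a simple stand-in for interpolation or super-resolution)."""
    return [[px for px in row for _ in range(s)]
            for row in frame for _ in range(s)]


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear blend of two same-size frames (a simple stand-in for
    motion interpolation or frame synthesis)."""
    return [[(1 - t) * pa + t * pb for pa, pb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def temporal_upsample(frames: List[Frame], k: int) -> List[Frame]:
    """Keep each decoded frame and insert (k - 1) blended frames
    between every pair of consecutive decoded frames."""
    out: List[Frame] = []
    for cur, nxt in zip(frames, frames[1:]):
        out.append(cur)
        for j in range(1, k):
            out.append(blend(cur, nxt, j / k))
    if frames:
        out.append(frames[-1])  # last decoded frame has no successor
    return out


def st_upsample(frames: List[Frame], s: int = 1, k: int = 1) -> List[Frame]:
    """Spatial (s > 1), temporal (k > 1), or combined up-sampling.
    In the combined case, frames are spatially up-sampled first and the
    in-between frames are then generated from the up-sampled frames."""
    spatially_up = [spatial_upsample(f, s) for f in frames]
    return temporal_upsample(spatially_up, k)
```

For example, calling `st_upsample(decoded, s=2, k=2)` on two decoded frames would yield three frames at twice the width and height, the middle one being synthesized from its two neighbours.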