Said syntax structures may reside in-band in the video bitstream and/or may be delivered out-of-band, either as such or converted to another representation format (e.g. a base-64 representation of the syntax structure or a list of ASCII-coded key-value pairs), for example using a signaling protocol such as the Session Description Protocol (SDP). Alternatively or in addition, said syntax structures or the like may be used in announcing the properties of a bitstream, for example using the Real-time Streaming Protocol (RTSP), the Media Presentation Description (MPD), or a manifest file for adaptive streaming, for example over HTTP. Alternatively or in addition, said syntax structures or the like may be used in session or mode negotiation, for example according to the SDP Offer/Answer model.
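As a minimal sketch of the out-of-band case, the following Python fragment converts a syntax structure, assumed to be available as raw bytes, into a base-64 representation carried in an SDP-style key-value pair. The attribute name "sprop-sps", the payload type 96, and the placeholder payload bytes are illustrative assumptions only; the exact parameter names depend on the RTP payload format in use.

    import base64

    def syntax_structure_to_sdp_attribute(name, payload):
        """Convert a raw syntax structure (bytes) into an SDP fmtp-style
        key-value pair carrying a base-64 representation of the structure."""
        return "%s=%s" % (name, base64.b64encode(payload).decode("ascii"))

    # Illustrative use: a sequence parameter set conveyed out-of-band in SDP.
    sps = bytes([0x42, 0x01, 0x01, 0x01])            # placeholder payload
    attr = syntax_structure_to_sdp_attribute("sprop-sps", sps)
    sdp_line = "a=fmtp:96 " + attr
    print(sdp_line)                                  # a=fmtp:96 sprop-sps=QgEBAQ==

The same representation could be reversed on the receiving side by splitting the key-value pair and applying base-64 decoding to recover the original syntax structure.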
For the case of multiple spatial/quality layers, sample prediction could be used between those layers, and consequently multiple motion compensation loops would be needed to reconstruct the samples for each layer, which is computationally complex. According to an embodiment, to limit the complexity, syntax prediction may be used between layers without restriction, while reconstructed samples of only a single layer may be used for predicting other layers. It may be specified, for example, that any operation point according to a particular coding profile must not require more than three motion compensation loops, while the number of syntax prediction references is not limited. In other words, the requirement may be formulated as a constraint that the number of output layers, summed with the number of reference layers used for sample prediction of those output layers, must be less than or equal to 3, where the reference layers in the summation exclude those that are also output layers and include, in a recursive manner, all the (sample prediction) reference layers of the reference layers.
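As a minimal sketch of this constraint check, assuming (hypothetically) that the inter-layer sample prediction dependencies are available as a mapping from each layer to its direct sample-prediction reference layers, the condition could be evaluated as follows; the function and variable names are illustrative and do not correspond to any particular specification.

    def satisfies_motion_compensation_constraint(output_layers, sample_pred_refs, max_loops=3):
        """Return True if the number of output layers plus the number of
        additional layers whose reconstructed samples are needed (i.e. the
        recursive sample-prediction reference layers of the output layers,
        excluding layers that are themselves output layers) is <= max_loops."""
        output_layers = set(output_layers)
        visited = set()
        stack = list(output_layers)
        while stack:
            layer = stack.pop()
            if layer in visited:
                continue
            visited.add(layer)
            # Recurse through the direct sample-prediction reference layers.
            stack.extend(sample_pred_refs.get(layer, ()))
        extra_reference_layers = visited - output_layers
        return len(output_layers) + len(extra_reference_layers) <= max_loops

    # Example: layer 2 is the only output layer; it uses layer 1 for sample
    # prediction, and layer 1 in turn uses layer 0, so three loops are needed.
    refs = {2: [1], 1: [0], 0: []}
    print(satisfies_motion_compensation_constraint({2}, refs))      # True  (1 output + 2 reference = 3)
    refs4 = {3: [2], 2: [1], 1: [0], 0: []}
    print(satisfies_motion_compensation_constraint({3}, refs4))     # False (1 output + 3 reference = 4)

In this sketch, only sample-prediction dependencies enter the count; syntax-prediction dependencies, which do not require an additional motion compensation loop, are intentionally left out of the summation.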