Can A Concise Architecture Be Efficient And Accurate? Tsinghua & Huawei Propose A New Residual Recurrent Super-Resolution Model: RRN!

This post shares a video super-resolution paper, Revisiting Temporal Modeling for Video Super-resolution (BMVC 2020). Its results currently rank first on several video super-resolution benchmarks, and the code has been open-sourced.


Affiliations: Tsinghua University, New York University, Huawei Noah’s Ark Lab

1. Highlights

This paper proposes a concise yet efficient video super-resolution architecture that reaches 27.69 dB PSNR at only 45 ms per frame on the test set, which gives it significant practical value. The highlights are as follows:

  • Many deep-learning-based video super-resolution (VSR) methods have been proposed, but direct comparisons are difficult because they use different loss functions or training sets. This paper unifies these settings and compares three temporal modeling methods: a 2D CNN with early fusion, a 3D CNN with slow fusion, and an RNN.

  • A new Residual Recurrent Network (RRN) is proposed, which uses residual connections to stabilize RNN training while improving super-resolution performance, achieving state-of-the-art results on three benchmark datasets.


2. Temporal Fusion Models

2D CNN: Uses several modified 2D residual blocks, each consisting of a 3×3 convolution layer and ReLU. The model takes 2T+1 consecutive frames as input, concatenates them along the channel dimension, and passes them through a stack of residual blocks, producing a residual feature map of shape H×W×Cr². This map is rearranged into the residual image R_t↑ by a ×4 depth-to-space operation, which is then added to the bicubically upsampled center frame to produce the HR image.
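The depth-to-space rearrangement mentioned above (also known as pixel shuffle) can be sketched in a few lines of numpy. The channel ordering used here is one common convention and may differ from the paper's actual implementation; shapes are illustrative:

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange an (H, W, C*r*r) feature map into an (H*r, W*r, C) image."""
    H, W, Crr = x.shape
    C = Crr // (r * r)
    # split channels into (C, r, r), then interleave the r x r blocks spatially
    x = x.reshape(H, W, C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)      # -> (H, r, W, r, C)
    return x.reshape(H * r, W * r, C)

lr_residual = np.random.rand(16, 16, 3 * 4 * 4)   # toy H x W x C*r^2 map, r = 4
hr_residual = depth_to_space(lr_residual, 4)
print(hr_residual.shape)  # (64, 64, 3)
```

No learned parameters are involved: the operation only moves values from the channel axis to the spatial axes, which is why it is a cheap way to go from LR feature maps to HR residual images.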


3D CNN: In contrast to the 2D CNN, the 3D CNN uses 3×3×3 convolution layers to extract spatiotemporal information. Additionally, to prevent the number of frames from shrinking, two all-zero frames are padded along the time axis, one at each end.
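The temporal zero-padding can be sketched as follows. The (T, H, W, C) tensor layout is an assumption for illustration; the point is that a kernel of size 3 in time consumes two frames, and one zero frame at each end restores them:

```python
import numpy as np

def pad_time(clip):
    """Pad one all-zero frame at each end of the time axis so that a
    convolution with kernel size 3 in time (and no other temporal
    padding) keeps the number of frames unchanged."""
    T, H, W, C = clip.shape
    zero = np.zeros((1, H, W, C), dtype=clip.dtype)
    return np.concatenate([zero, clip, zero], axis=0)

clip = np.random.rand(7, 64, 64, 3)   # 2T+1 = 7 input frames
padded = pad_time(clip)
print(padded.shape)                   # (9, 64, 64, 3): 9 - 3 + 1 = 7 valid outputs
```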


RNN: The input at time step t consists of three parts: (1) the previous output o_{t−1}, (2) the previous hidden state h_{t−1}, and (3) two adjacent input frames. The RNN can exploit complementary information from previous time steps to further refine the high-frequency texture details at time step t.
However, RNNs suffer from the vanishing gradient problem. To address this, the authors propose the Residual Recurrent Network (RRN), whose recurrent cell is built from residual blocks (each consisting of a convolution layer, a ReLU layer, and another convolution layer).
This design ensures smooth information flow and the ability to retain texture information over long periods, making it easier for the network to handle longer sequences while reducing the risk of vanishing gradients.
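The recurrent step described above can be sketched as a toy numpy program. This is a hedged sketch, not the paper's implementation: per-pixel linear maps (1×1-convolution stand-ins) replace the 3×3 convolutions, the weights are random, only one residual block is shown, and the channel sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

C = 8    # hidden channels (toy size, purely illustrative)
Cin = 3  # image channels

# 1x1-convolution stand-ins (random weights, just to show the data flow)
W_in = rng.normal(0, 0.1, (2 * Cin + C + Cin, C))  # fuse [x_{t-1}, x_t, h_{t-1}, o_{t-1}]
W1 = rng.normal(0, 0.1, (C, C))
W2 = rng.normal(0, 0.1, (C, C))
W_h = rng.normal(0, 0.1, (C, C))
W_o = rng.normal(0, 0.1, (C, Cin))

def rrn_step(x_prev, x_cur, h_prev, o_prev):
    """One recurrent step: fuse the inputs, run a residual block
    (conv -> ReLU -> conv plus identity skip), then emit the new
    hidden state and the residual image for time t."""
    inp = np.concatenate([x_prev, x_cur, h_prev, o_prev], axis=-1)
    f = relu(inp @ W_in)
    f = f + relu(f @ W1) @ W2   # residual block with identity skip
    h = relu(f @ W_h)           # next hidden state
    o = f @ W_o                 # residual image for the current frame
    return h, o

H, W = 4, 4
h = np.zeros((H, W, C))         # zero-initialized at the first time step
o = np.zeros((H, W, Cin))
frames = rng.random((5, H, W, Cin))
for t in range(1, 5):
    h, o = rrn_step(frames[t - 1], frames[t], h, o)
print(h.shape, o.shape)
```

The identity skip inside the cell is what lets gradients flow through many time steps without being repeatedly squashed, which is the stabilization effect the paper attributes to residual blocks.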
In the paper's formulation (given as equations there), σ(·) denotes the ReLU function, and the network learns a residual image that is added to the upsampled frame to produce the HR output.
3. Experiments
Implementation Details: At the first time step, the RRN's previous hidden state and previous output are initialized to zero. All three models are trained with the L1 loss. Vimeo-90k serves as the training set, with BD degradation and 64×64 cropping as preprocessing.
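BD degradation is conventionally implemented as a Gaussian blur followed by downsampling. The sketch below assumes the commonly used σ = 1.6 kernel and ×4 scale; these values are conventions in the VSR literature, not taken from this paper:

```python
import numpy as np

def gaussian_kernel(size=13, sigma=1.6):
    """1-D Gaussian kernel (sigma = 1.6 is a conventional choice for
    BD degradation; treat it as an assumption here)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def bd_degrade(hr, scale=4):
    """Blur-then-downsample (BD): separable Gaussian blur, then keep
    every `scale`-th pixel."""
    k = gaussian_kernel()
    pad = len(k) // 2
    img = np.pad(hr, ((pad, pad), (pad, pad)), mode="edge")
    # separable blur: filter rows, then columns
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, img)
    return img[::scale, ::scale]

hr_patch = np.random.rand(64, 64)   # a 64x64 training crop, as in the paper
lr_patch = bd_degrade(hr_patch)
print(lr_patch.shape)  # (16, 16)
```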
Quantitative Evaluation and Ablation Study: The authors consider two network depths for each temporal model, where S denotes 5 stacked blocks and L denotes 10. The figure below shows that RRN has significant advantages over the other temporal modeling methods in runtime, computational complexity, and PSNR.
Ablation studies on whether to use residual blocks and the number of residual blocks show that residual blocks effectively suppress the vanishing gradient.
Comparisons with other models show that RRN achieves SOTA.

Paper: https://arxiv.org/pdf/2008.05765.pdf

Code: https://github.com/junpan19/RRN
