Top 10 Must-Read Computer Vision Papers of 2021

Source: DeepHub IMBA



This article is about 2,000 words; recommended reading time is 10 minutes.
It includes clear video explanations and code, and a full reference for each paper is listed at the end.



The top 10 computer vision papers of 2021, with video demonstrations, articles, code, and paper references. The global economy fell into an unprecedented slump under the pandemic, but research has not slowed its frantic pace, especially in the field of artificial intelligence. Beyond headline results, this year's papers also emphasized ethics, significant biases, governance, transparency, and more. Our understanding of artificial intelligence and its connection to the human brain continues to evolve, with promising applications that could improve our quality of life in the near future. Still, we should be careful about which technologies we choose to apply.

“Science cannot tell us what we should do; it can only tell us what we can do.” — Jean-Paul Sartre, Being and Nothingness

Here are the 10 research papers in computer vision that I found most interesting this year: in short, a curated list of the latest breakthroughs in AI and CV. Clear video explanations and code (where available) are included, and a complete reference for each paper is listed at the end of this article. If you have any recommendations, please feel free to contact me.

DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]

OpenAI successfully trained a network capable of generating images from text prompts. It is very similar to GPT-3 and Image GPT and produces astonishing results.

Code: https://github.com/openai/DALL-E
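OpenAI's public repository only contains the discrete VAE used to tokenize images, not the full model, so here is a rough PyTorch sketch of the two-stage idea: an image is first compressed into a grid of discrete tokens, then a transformer models text and image tokens as one autoregressive sequence. All names and sizes below are illustrative assumptions, not OpenAI's API:

```python
import torch
import torch.nn as nn

class TinyTextToImage(nn.Module):
    """Two-stage DALL-E-style model (illustrative, NOT OpenAI's code).

    Stage 1 (assumed pretrained, not shown): a discrete VAE maps a
    256x256 image to a 32x32 grid of tokens from an 8192-entry codebook.
    Stage 2 (below): a decoder-only transformer models the concatenated
    sequence [text tokens; image tokens] autoregressively, so sampling
    image tokens conditioned on text yields a picture.
    """

    def __init__(self, text_vocab=16384, image_vocab=8192, dim=512, max_len=1280):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.image_emb = nn.Embedding(image_vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_logits = nn.Linear(dim, image_vocab)

    def forward(self, text_tokens, image_tokens):
        x = torch.cat([self.text_emb(text_tokens),
                       self.image_emb(image_tokens)], dim=1)
        x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
        # causal mask: each position attends only to earlier positions
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        return self.to_logits(self.encoder(x, mask=mask))

# toy usage: 256 text tokens followed by 1024 (= 32 x 32) image tokens
logits = TinyTextToImage()(torch.randint(0, 16384, (1, 256)),
                           torch.randint(0, 8192, (1, 1024)))
```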

Taming Transformers for High-Resolution Image Synthesis [2]

Combining the efficiency of GANs and convolutional methods with the expressive power of Transformers provides a powerful and time-saving approach for semantically guided, high-quality image synthesis.

Code: https://github.com/CompVis/taming-transformers
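The core of this approach (VQGAN) is a vector-quantization bottleneck: each encoder feature is snapped to its nearest entry in a learned codebook, so a transformer can later model the image as a short sequence of code indices. A minimal sketch of that quantization step, assuming flattened encoder features; illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQGAN-style quantization layer (illustrative sketch)."""

    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (n, dim) flattened encoder features
        dists = torch.cdist(z, self.codebook.weight)  # (n, num_codes)
        idx = dists.argmin(dim=-1)                    # nearest codebook entry
        z_q = self.codebook(idx)                      # quantized features
        # straight-through estimator: forward pass uses z_q, gradients flow
        # back to z; the real model adds codebook and commitment losses
        z_q = z + (z_q - z).detach()
        return z_q, idx
```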

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [3]

Will Transformers replace CNNs in computer vision? Learn how to apply the Transformer architecture to computer vision in under 5 minutes through a new paper titled Swin Transformer.

Code: https://github.com/microsoft/Swin-Transformer
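The trick that makes it scale: self-attention is computed only inside small local windows, and every other block cyclically shifts the feature map by half a window so information still crosses window borders. A toy sketch of the partitioning step, modeled on the paper's idea rather than the Microsoft implementation:

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 8, 8, 96)        # toy feature map
win = window_partition(x, ws=4)     # (4, 16, 96): attention runs per window

# every other block first rolls the map by ws // 2, so the next round of
# windows straddles the previous borders and information mixes across them
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
win_shifted = window_partition(shifted, ws=4)
```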

Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]

The next step in view synthesis: the goal is to take a single image and then explore the scenery within it!

DEMO: https://colab.research.google.com/github/google-research/google-research/blob/master/infinite_nature/infinite_nature_demo.ipynb#scrollTo=sCuRX1liUEVM

Total Relighting: Learning to Relight Portraits for Background Replacement [5]

Relighting a portrait to match the lighting of its new background. Have you ever wanted to change the background of a picture while keeping it looking realistic? If you have tried, you know it isn't simple. Take a photo of yourself at home, swap in a beach background, and anyone will instantly say, "That's been Photoshopped." Movies and professional videos need perfect lighting and artists to recreate high-quality images, which is very expensive, and you cannot do that with your own photos. But this paper pulls it off.
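For contrast, here is the naive baseline described above in code: plain alpha compositing, which keeps the subject's original lighting and is exactly why the result screams "Photoshopped". A minimal NumPy sketch, assuming a precomputed matte:

```python
import numpy as np

def naive_composite(fg, bg, alpha):
    """Plain alpha compositing: paste a cut-out subject onto a new background.

    fg, bg: float RGB images in [0, 1], shape (H, W, 3); alpha: (H, W, 1) matte.
    The subject keeps its original lighting, so the result looks pasted on.
    Total Relighting instead predicts the matte AND relights the foreground
    (via estimated normals/albedo and the new scene's illumination) first.
    """
    return alpha * fg + (1.0 - alpha) * bg
```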

Animating Pictures with Eulerian Motion Fields [6]

From just one photo, this model can understand which particles should be moving and animate them realistically in an endless loop while keeping the rest of the image completely still, allowing us to turn pictures into animations…

Code: https://eulerian.cs.washington.edu/
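"Eulerian" means the motion is attached to positions rather than to particles: the network predicts a single static displacement field from the photo, and the animation comes from repeatedly stepping pixels through that field. A rough sketch of the integration, assuming the field is already predicted (the actual method additionally splats deep features forward and backward in time so the loop closes seamlessly):

```python
import numpy as np

def integrate_positions(motion, steps):
    """Step every pixel through a static (Eulerian) displacement field.

    motion: (H, W, 2) array, motion[y, x] = (dx, dy) in pixels per frame,
    predicted once from the single input photo. Because the field depends
    only on position, one prediction drives the entire animation.
    """
    H, W, _ = motion.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    pos = np.stack([xs, ys], axis=-1)            # (H, W, 2) current positions
    frames = [pos.copy()]
    for _ in range(steps):
        xi = np.clip(pos[..., 0].round().astype(int), 0, W - 1)
        yi = np.clip(pos[..., 1].round().astype(int), 0, H - 1)
        pos = pos + motion[yi, xi]               # Euler step: x_t+1 = x_t + M(x_t)
        frames.append(pos.copy())
    return frames
```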

CVPR 2021 Best Paper Award: GIRAFFE — Controllable Image Generation [7]

Using a modified GAN architecture, they can move objects in an image without affecting the background or other objects!

Code: https://github.com/autonomousvision/giraffe

TimeLens: Event-based Video Frame Interpolation [8]

TimeLens can understand the motion of particles between video frames, reconstructing what really happened at speeds invisible to the naked eye. It achieves results that smartphones and other models cannot reach!

Code: https://github.com/uzh-rpg/rpg_timelens

CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]

Have you ever imagined taking the style of one picture, like a cool painting style, and applying it to a new image of your choice? This model can do that, even from text alone, and the authors provide a Google Colab so everyone can try the new method: simply pick the style you want to replicate, type the text you want to generate, and the algorithm produces a new image from it! The results are impressive, especially since they can be created from a single line of text!

DEMO: https://colab.research.google.com/github/kvfrans/clipdraw/blob/main/clipdraw.ipynb
StyleCLIPDraw DEMO: https://colab.research.google.com/github/pschaldenbrand/StyleCLIPDraw/blob/master/Style_ClipDraw.ipynb
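Under the hood, CLIPDraw is an optimization loop rather than a trained generator: drawing parameters are rendered differentiably and nudged until the CLIP embedding of the image matches the CLIP embedding of the prompt. The sketch below uses OpenAI's clip package, but swaps the real differentiable vector renderer (CLIPDraw rasterizes Bezier strokes with diffvg) for a trivial stand-in so the example runs self-contained:

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# hypothetical stand-in for the differentiable renderer: treats the
# parameters as a tiny 3x16x16 canvas and upsamples it to CLIP's input size
def render(params):
    canvas = torch.sigmoid(params).view(1, 3, 16, 16)
    return F.interpolate(canvas, size=224, mode="bilinear", align_corners=False)

params = torch.randn(3 * 16 * 16, device=device, requires_grad=True)
optimizer = torch.optim.Adam([params], lr=0.1)

with torch.no_grad():
    txt = model.encode_text(clip.tokenize(["a watercolor painting of a cat"]).to(device))
    txt = txt / txt.norm(dim=-1, keepdim=True)

for step in range(250):
    img = model.encode_image(render(params))
    img = img / img.norm(dim=-1, keepdim=True)
    loss = -(img * txt).sum()          # maximize CLIP text-image similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```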

CityNeRF: Building NeRF at City Scale [10]

This model, called CityNeRF, grew out of NeRF, one of the first models to use radiance fields and machine learning to reconstruct 3D scenes from images. But NeRF is not very efficient and works at only a single scale. CityNeRF is trained on imagery ranging from satellite views down to ground level, producing 3D renderings of a scene at any of these scales. In short, they bring NeRF to city scale.

Code: https://city-super.github.io/citynerf/
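As a refresher on the machinery CityNeRF builds on, here is the core NeRF step: densities and colors sampled along a camera ray are alpha-composited into a single pixel. A minimal sketch of standard volume rendering, not the CityNeRF code (which additionally grows the network progressively across scales):

```python
import torch

def render_ray(densities, colors, deltas):
    """Standard NeRF volume rendering for one ray (illustrative sketch).

    densities: (N,) volume density sigma at N samples along the ray
    colors:    (N, 3) RGB predicted by the MLP at those samples
    deltas:    (N,) spacing between consecutive samples
    """
    alpha = 1.0 - torch.exp(-densities * deltas)        # opacity of each segment
    # transmittance: fraction of light that survives to reach each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                             # contribution per sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)  # final pixel color
```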

References:

[1] Ramesh, A. et al., 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092.

[2] Esser, P., Rombach, R. and Ommer, B., 2020. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841.

[3] Liu, Z. et al., 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint, https://arxiv.org/abs/2103.14030.

[bonus] Yuille, A.L. and Liu, C., 2021. Deep Nets: What Have They Ever Done for Vision? International Journal of Computer Vision, 129(3), pp. 781–802, https://arxiv.org/abs/1805.04025.

[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. https://arxiv.org/pdf/2012.09855.pdf.

[5] Pandey, R. et al., 2021. Total Relighting: Learning to Relight Portraits for Background Replacement. doi:10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.

[6] Holynski, A. et al., 2021. Animating Pictures with Eulerian Motion Fields. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Niemeyer, M. and Geiger, A., 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. CVPR 2021.

[8] Tulyakov, S., Gehrig, D., Georgoulis, S., Erbach, J., Gehrig, M., Li, Y. and Scaramuzza, D., 2021. TimeLens: Event-based Video Frame Interpolation. CVPR, Nashville, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf.

[9] a) Frans, K. et al., 2021. CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders. b) Schaldenbrand, P., Liu, Z. and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.

[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., 2021. CityNeRF: Building NeRF at City Scale.

Author: Louis Bouchard
