Im2Mesh GAN: Recovering 3D Hand Mesh from a Single RGB Image


Abstract

This work addresses the recovery of hand meshes from a single RGB image. Compared to most existing methods that utilize parameterized hand models, we demonstrate that hand meshes can be learned directly from the input image. We propose a novel neural network, the Im2Mesh neural network, to learn the mesh via end-to-end adversarial training. By interpreting the mesh as a graph, our model can capture the topological relationships between mesh vertices. We also introduce a 3D surface descriptor into the GAN architecture to further capture relevant 3D features. We experimented with two approaches, one that benefits from the availability of ground truth data for the image and the corresponding mesh, and another that addresses the more challenging problem of mesh estimation without corresponding ground truth. Through extensive evaluation, we prove that the proposed method outperforms state-of-the-art methods.

Innovations of the Paper

Importantly, by interpreting the mesh as a graph, we can leverage recent advancements in Graph Neural Networks (GNNs) to support mesh processing in the generator and discriminator networks. GNNs have demonstrated the capability to handle non-Euclidean structured data such as graphs and manifolds. Unlike existing graph-based mesh estimation methods in the literature that only consider CNN-generated features, we introduce a 3D descriptor that encodes surface-level information into GNNs, allowing them to better utilize the topological relationships between mesh vertices in graph-structured hand data. This improves the accuracy of mesh recovery since the recovery algorithm considers not only the three-dimensional coordinates of the vertices but also the three-dimensional features associated with the vertices.
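The summary above does not spell out the layer equations, but a standard graph-convolution layer over mesh vertices, whose node features concatenate the 3D vertex coordinates with a per-vertex surface descriptor, can be sketched as follows. The toy mesh, descriptor width, and random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

def graph_conv(A, X, W):
    """One graph-convolution layer: aggregate neighbour features with a
    symmetrically normalised adjacency, then apply a linear map + ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ X @ W, 0.0)    # ReLU

# Toy mesh: 4 vertices connected in a cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

coords = np.random.randn(4, 3)      # vertex positions (x, y, z)
descriptor = np.random.randn(4, 5)  # per-vertex 3D surface descriptor (width assumed)
X = np.concatenate([coords, descriptor], axis=1)  # node features: 3 + 5 = 8 dims

W = np.random.randn(8, 16)          # learnable weights (random here)
H = graph_conv(A, X, W)
print(H.shape)  # (4, 16)
```

The key point this illustrates is that each vertex's new feature mixes information from its mesh neighbours, so both geometry (coordinates) and surface-level features propagate along the mesh topology.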

The main contributions of this paper are as follows:

  • We propose a new GAN architecture named Im2Mesh that can directly learn hand meshes from a single RGB input image in an end-to-end manner, without requiring any heatmap processing, 3D keypoint annotations, or external parameterized hand models.

  • We model the generator of the GAN as a graphical architecture, allowing it to model the topological relationships between mesh vertices while introducing a 3D descriptor that encodes surface-level information into GNNs, further capturing 3D features associated with the mesh vertices.

  • This method not only addresses the mesh reconstruction problem for coupled datasets where a one-to-one mapping exists between images and ground truth meshes but also tackles the mesh reconstruction problem for datasets that do not contain corresponding ground truth annotations.

  • We do not use depth images; thus, we increase the potential for using our model on datasets without corresponding depth images.

Network Architecture

Figure: Overview of the proposed conditional GAN architecture. The generator network produces the position values and 3D descriptor values, which are passed to the discriminator network to be classified as generated or ground truth.

Figure: Overview of the proposed cyclic GAN architecture. G_M is the generator that estimates the mesh M from the input image I, while G_I is the generator that estimates the image I from the input mesh M.
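For the non-coupled setting, a cycle-consistency objective in the usual CycleGAN style is the natural reading of this architecture: mapping an image to a mesh and back should reconstruct the image, and vice versa. A minimal sketch, where the L1 form, the toy shapes, and the identity "generators" are assumptions for illustration rather than the paper's exact loss:

```python
import numpy as np

def cycle_consistency_loss(I, M, G_M, G_I):
    """L1 cycle loss: I -> G_M(I) -> G_I(G_M(I)) should reconstruct I,
    and M -> G_I(M) -> G_M(G_I(M)) should reconstruct M."""
    I_rec = G_I(G_M(I))
    M_rec = G_M(G_I(M))
    return np.abs(I_rec - I).mean() + np.abs(M_rec - M).mean()

# Identity "generators" give zero loss, illustrating the consistency target.
identity = lambda x: x
I = np.random.randn(64, 64, 3)   # toy image
M = np.random.randn(778, 3)      # toy mesh vertices (778 = MANO vertex count)
loss = cycle_consistency_loss(I, M, identity, identity)
print(loss)  # 0.0
```

In training, this term would be combined with the adversarial losses of both discriminators so that each generator produces realistic outputs while remaining invertible by the other.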

Figure: The graph upsampling process used in this work. The figure describes upsampling a graph with N nodes and feature dimension d to a graph with R nodes. The network consists of two cascaded graph upsampling stages followed by a coordinate reconstructor that computes the position vectors of the upsampled graph. k and q are the feature dimensions of the features generated at cascades 1 and 2, respectively. Since the goal is to upsample the graph while preserving the feature dimension, we set k = q = d.
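The two-cascade upsampling can be sketched roughly as below. The intermediate node counts, the learned lifting-matrix form, and the final linear coordinate reconstructor are illustrative assumptions; 778 is the common MANO hand-mesh vertex count, not necessarily the paper's R:

```python
import numpy as np

def upsample_graph(X, U, W):
    """One upsampling cascade: U (n_out x n_in) lifts the node set,
    W (d_in x d_out) maps the features, followed by ReLU."""
    return np.maximum(U @ X @ W, 0.0)

N, R1, R, d = 49, 196, 778, 64   # assumed sizes: coarse graph -> 778 vertices
X = np.random.randn(N, d)

# k = q = d: the feature width is preserved across both cascades.
U1, W1 = np.random.randn(R1, N), np.random.randn(d, d)
U2, W2 = np.random.randn(R, R1), np.random.randn(d, d)

H1 = upsample_graph(X, U1, W1)   # cascade 1: N -> R1 nodes, d -> k (= d) features
H2 = upsample_graph(H1, U2, W2)  # cascade 2: R1 -> R nodes, k -> q (= d) features

# Coordinate reconstructor: map d features to (x, y, z) per vertex.
W_coord = np.random.randn(d, 3)
coords = H2 @ W_coord
print(H2.shape, coords.shape)  # (778, 64) (778, 3)
```

The design choice worth noting is that the graph is grown to the full mesh resolution first, and only then are 3D positions read out, so vertex coordinates are predicted jointly with their learned features.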

Experimental Results

Figure: Qualitative results obtained by varying parameters related to surface smoothness.

Conclusion

While 3D mesh reconstruction of the human hand from a single image has been studied, the problem remains challenging due to the hand's high degrees of freedom. In this paper, we propose a method for reconstructing 3D hand meshes from a single image that effectively leverages existing datasets. We design a loss function that yields more realistic hand meshes and demonstrate its effectiveness in two generative adversarial network settings: the first targets coupled datasets where ground truth meshes are available, while the second targets non-coupled datasets. Additionally, we use a 3D surface descriptor together with graph convolutional networks to preserve the surface details of the generated meshes. Our framework outperforms state-of-the-art techniques and, to our knowledge, represents the first effort to integrate explicit 3D features into single-image 3D mesh reconstruction. A notable property of the proposed method is that it does not require a parameterized hand model as a prior: the geometric shape of the hand is learned and encoded directly in the generator through end-to-end adversarial training. This makes the proposed algorithm readily adaptable to other mesh recovery problems, such as other body parts or 3D objects.

Paper link: https://arxiv.org/pdf/2101.11239.pdf
