Language-Guided Open Set Computer Vision

Source: ZHUAN ZHI

This article is approximately 1,000 words; recommended reading time: 5 minutes.
We explore three paths to introduce language into computer vision systems for open set recognition.


The visual world is vast and constantly evolving, and because of the long-tail nature of data collection, computer vision systems cannot observe every visual concept during training. Humans do not learn about the entire world in childhood either; we continuously adapt and learn new visual concepts throughout our lives. We develop a compositional representation of the world, in which complex entities are broken down into simpler primitives shared across different visual concepts. Humans can share these compositional visual models through language, enabling zero-shot generalization to new categories. For instance, someone who has never observed a zebra can understand it through the description ‘A zebra is an animal like a horse, with black and white stripes,’ and can leverage that description to generalize to zebras zero-shot, without explicit visual supervision. In this doctoral thesis, we exploit this compositional character of human language to develop computer vision systems that can generalize to new categories through language alone, without retraining on labeled data. In particular, we focus on zero-shot learning (ZSL) and open set computer vision.

Part One of the Thesis: We observe that online text documents, such as Wikipedia articles, contain rich visual descriptions of object categories, and propose that these texts can serve as powerful unsupervised auxiliary information for zero-shot learning. To this end, we introduce I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning them in a shared embedding space. Quantitatively, we demonstrate that I2DFormer significantly outperforms previous unsupervised semantic embedding methods on three public datasets under both zero-shot and generalized zero-shot learning settings. Qualitatively, we show that our method yields highly interpretable results, in which words from the documents correspond to relevant image regions.

Part Two of the Thesis: We address the memory complexity of I2DFormer by proposing a novel Document Summary Transformer (DSTransformer). The DSTransformer encodes each document into a fixed set of summary tokens, allowing us to scale to the more challenging setting of ImageNet-scale zero-shot learning. We demonstrate that the resulting model, I2DFormer+, significantly surpasses baseline models on large-scale zero-shot learning benchmarks.

Part Three of the Thesis: Large language models (LLMs) trained on web-scale text show impressive abilities to apply their learned knowledge to a wide range of tasks. We propose a new perspective: using LLMs to generate descriptions of categories so that zero-shot generalization emerges from classifying against these descriptions. We introduce I2MVFormer, a novel model that can leverage multiple LLM-generated descriptions to understand each category. I2MVFormer further advances the state of the art in zero-shot image classification with unsupervised semantic embeddings.

Final Part of the Thesis: We construct an open set computer vision model aimed at all image-based tasks. Thanks to the success of CLIP and its variants, image-text pre-training on web-scale image-text datasets has become the default recipe for open set classification and retrieval models. We introduce SILC, a novel vision-language pre-training framework that augments image-text contrastive learning with self-distillation on local-to-global correspondences. SILC models set a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation. We further demonstrate significant advantages of SILC in open-vocabulary detection, image captioning, and visual question answering. Overall, this thesis proposes language guidance as a powerful signal for open set computer vision across image-based tasks.

“What does a tiger look like?” It is a fierce animal that looks like a scary big cat with stripes. Tigers are not native to Japan, yet descriptions brought by travelers from China, framed in terms of familiar local animals, inspired a long tradition of tiger paintings in Japanese history. Humans possess an impressive ability to imagine and recognize unseen objects from language descriptions alone. We can easily navigate complex environments and generalize zero-shot to new visual challenges, such as previously unseen categories. Human perception is compositional [13, 56]: we build our understanding of the surrounding world from basic concepts that can be composed for complex reasoning.

Developing machine learning systems that can match and ultimately surpass human abilities has long been a goal of the research community. Early computer vision systems combined hand-designed feature extractors [86, 7] with learnable classifiers. Several works improved on this foundation by learning feature extraction directly from data in an unsupervised manner [69, 105], followed by a second stage that learned classification boundaries. In 2012, modern deep learning saw its first practical breakthrough with AlexNet [70], paving the way for today’s systems. AlexNet demonstrated that feature extractors and classifiers could be learned fully end to end given large datasets and sufficient computation, jointly training a convolutional neural network (CNN) feature extractor with a multi-layer perceptron (MLP) classifier.
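
As a toy illustration of this end-to-end design (not AlexNet itself; the layer sizes below are arbitrary assumptions), a CNN feature extractor and an MLP classifier can be stacked and trained jointly from pixels to class scores:

```python
import torch.nn as nn

# Minimal CNN feature extractor followed by an MLP classifier, trained end to end.
# Illustrative only; expects 3x224x224 inputs.
model = nn.Sequential(
    # convolutional feature extractor
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    # MLP classifier on top of the learned features
    nn.Linear(64 * 56 * 56, 512), nn.ReLU(),
    nn.Linear(512, 1000),  # e.g. 1000 output classes
)
```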

Building on the success of AlexNet, the community developed increasingly powerful models in the following years. Key advances include deeper networks that yield larger models [136, 128, 55, 189], better optimizers and normalization techniques for stable learning [68, 134, 59, 6], and, most importantly, the invention of skip connections [55]. Deep neural networks suffer from vanishing gradients: as depth increases, the learning signal fades because gradients are repeatedly multiplied by small values during backpropagation. ResNet [55] addressed this by introducing skip connections, which let the signal bypass individual modules and provide an alternative path for learning. More recently, convolutional neural networks (CNNs) have been displaced by transformer models [141], which rely on attention and learn their features, and much of their inductive bias, directly from data. Originally proposed in natural language processing (NLP), transformers have had a major impact on computer vision since the introduction of the Vision Transformer [38]. Thanks to these advances, computer vision systems can now match or exceed human-level performance on several tasks, including classification, detection, and segmentation [35, 9, 25, 120]. However, current systems still follow a closed-set paradigm: they are trained on a predefined set of categories and can only handle new instances of those same categories at test time.
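
To make the skip-connection idea concrete, here is a minimal residual block, a sketch for illustration only and not tied to any particular architecture discussed here: the block’s input is added back to its transformed output, giving gradients a direct path around the transformation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = ReLU(input + F(input))."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                          # identity path (the "skip")
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)      # addition lets gradients flow around the block
```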

We believe these closed-set systems are severely limited for practical applications. The real world is vast and constantly evolving, and computer vision systems deployed in it will inevitably face visual concepts not covered by their training data. Closed-set systems cannot generalize to such new concepts without retraining [150]. Achieving open set capability, that is, being able to generalize when new visual concepts arise, is therefore crucial for real-world computer vision. For example, an autonomous robot may encounter unknown objects in its environment, and a medical imaging system may need to recognize previously unknown diseases as they are discovered.

Zero-shot learning (ZSL) aims to generalize a model trained on seen categories to a set of unseen categories through shared auxiliary information [150]. This shared information is most often realized as human-annotated class attributes [151, 112, 142, 44, 101]. Rather than simply labeling images with their noun entities (e.g., dog), attribute-based methods also identify and label basic semantic features shared among classes (e.g., legs, fur, habitat). Once these class attributes are annotated, ZSL methods learn to classify with respect to the attribute vectors. At test time, such models generalize to unseen categories by predicting the shared attributes and selecting the category whose attribute vector is most compatible with the prediction. Despite their power, these methods trade one costly form of annotation (labeling images of new categories) for another (fine-grained attribute annotation). Moreover, attribute-based methods rely on manually identifying a set of attributes that can discriminate between classes in order to learn class boundaries. As we scale up to ImageNet-scale zero-shot learning, the cost of this approach becomes prohibitive.
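
A minimal sketch of this attribute-based inference step, assuming an already trained attribute predictor and a given class-attribute matrix (all names here are illustrative, not a specific published method):

```python
import torch

def zero_shot_classify(image_features: torch.Tensor,
                       attribute_predictor: torch.nn.Module,
                       class_attributes: torch.Tensor) -> torch.Tensor:
    """Attribute-based ZSL inference (illustrative sketch).

    image_features:      (N, D) features from a visual backbone
    attribute_predictor: maps (N, D) -> (N, A) predicted attribute scores
    class_attributes:    (C, A) annotated attribute vectors, one per unseen class
    Returns predicted class indices of shape (N,).
    """
    pred_attrs = attribute_predictor(image_features)       # (N, A)
    # Compatibility between predicted attributes and each class's attribute vector.
    compatibility = pred_attrs @ class_attributes.t()      # (N, C)
    return compatibility.argmax(dim=1)
```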

Another class of methods represents categories with semantic embeddings extracted solely from language encoders [123, 104, 23, 91, 1, 173, 150, 92]. These methods encode class names into embedding vectors using models such as GloVe [113], which then take the place of manual annotations. Because similar classes lie close together in the semantic space of the language encoder, a certain degree of zero-shot generalization becomes possible. However, since these language encoders are never trained jointly with the visual modality, they suffer from common failure cases caused by ambiguous class names [102]. These methods are also inherently limited by the quality of the semantic embeddings produced by the pre-trained encoders [102, 103, 100].
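
For illustration, class-name embeddings can be obtained from off-the-shelf word vectors and substituted for the `class_attributes` matrix in the attribute-based sketch above, assuming the gensim package and its hosted GloVe vectors are available; the class names below are hypothetical:

```python
# Sketch only: assumes the gensim package with its downloadable GloVe vectors.
import gensim.downloader as api
import numpy as np
import torch

glove = api.load("glove-wiki-gigaword-300")             # pre-trained GloVe word vectors

unseen_classes = ["zebra", "dolphin", "raccoon"]         # hypothetical unseen class names
class_name_embeddings = torch.from_numpy(
    np.stack([glove[name] for name in unseen_classes])   # (C, 300) class representations
)

# These vectors can replace the human-annotated class_attributes used above,
# provided the model is trained to predict into the same semantic space.
```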

Another subfield of zero-shot learning focuses specifically on the compositionality of machine learning models. Early attempts in this literature trace back to Hoffman [56], who proposed modeling objects by decomposing them into lines and edges. Biederman [13] built on this idea, achieving compositionality by decomposing visual concepts into parts and treating these parts as shared primitives. Modern machine learning systems such as CNNs and transformers do possess a certain degree of compositionality: multiple studies have shown that early layers learn raw features such as edges, much like Hoffman’s conception [56], while later layers use these basic features to construct larger object parts, akin to Biederman’s conception [13]. However, these models share only low-level features and primitives; they cannot compose them into new semantic object categories. Compositional zero-shot learning (CZSL) tackles this by asking models to generalize to novel combinations of primitives, defined as state-object pairs (e.g., composing ‘wet’ and ‘dog’ into ‘wet dog’), after training on a limited set of such pairs. Methods in this direction learn a composition function between states and objects. However, many studies have shown that, even though all primitives are observed during training, these methods still struggle to combine them into new composite classes [104, 91, 65, 64, 101], owing to inherent difficulties of the state-object formulation in its current form [101, 104]. We take a different approach to compositionality: rather than relying on a rigid vocabulary of states and object classes, we leverage natural language descriptions and compose new categories from primitives learned directly from language.
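
A minimal sketch of the kind of composition function such CZSL methods learn; the MLP and embedding dimensions are illustrative assumptions, not a specific published model:

```python
import torch
import torch.nn as nn

class StateObjectComposer(nn.Module):
    """Illustrative CZSL-style composition function g(state, object) -> class embedding."""
    def __init__(self, state_dim: int, object_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + object_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, state_emb: torch.Tensor, object_emb: torch.Tensor) -> torch.Tensor:
        # Compose a state embedding (e.g., "wet") with an object embedding (e.g., "dog")
        # into an embedding for the composite class ("wet dog"). Image features are then
        # matched against composed embeddings for both seen and novel pairs.
        return self.mlp(torch.cat([state_emb, object_emb], dim=-1))
```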

This thesis aims to advance towards “open set” computer vision, in which systems not only recognize previously seen visual concepts but can also handle entirely new concepts they have never encountered. Rather than relying on human-annotated attributes or on class names encoded by pre-trained encoders, we take an orthogonal direction. Natural language descriptions provide a rich source of semantic information that can bridge the gap between the low-level features learned from images and high-level semantic understanding. We propose that if a machine learning system can learn about the world through language descriptions of categories, it can acquire the simple compositional concepts from which more complex categories are built. New categories can then be introduced purely through a language description, and the model can generalize to them zero-shot. For example, we can add a new “zebra” category to an animal classifier by describing it as “A zebra is an animal like a horse, with black and white stripes.”

We explore three paths to introduce language into computer vision systems for open set recognition.

  • I2DFormer and I2DFormer+: In Chapters 3 and 4, we propose learning classification models from unstructured human-written text such as Wikipedia articles. Because our I2DFormer framework learns to classify in the context of language descriptions, the model can generalize zero-shot to new categories given their descriptions. Wikipedia text is readily available on the internet, so querying it when new categories arise is cheap. I2DFormer is also inherently interpretable: it can show which words in a document and which regions in an image contributed to a decision. We further demonstrate that the framework scales to ImageNet-scale zero-shot learning and achieves state-of-the-art performance.

  • I2MVFormer: In Chapter 5, we extend the I2DFormer framework to combine multiple complementary text sources. We propose a new perspective: using large language models (LLMs) as annotators that generate class descriptions conditioned on different k-shot examples. We show that, when prompted with different k-shot examples, an LLM can simulate multiple annotators, each revealing complementary information about a category. We introduce I2MVFormer, a new model that leverages multiple descriptions per class to learn robust zero-shot generalization. We demonstrate that LLM-generated text provides rich semantic information, giving the model a more comprehensive understanding of classes than I2DFormer. I2MVFormer significantly improves performance on zero-shot classification benchmarks.

  • SILC: In Chapter 6, we pursue a parallel thread building on the open set modeling paradigm popularized by CLIP [116]. CLIP learns a powerful open set model through contrastive learning on web-scale image-text data. Because this vast data covers most of the visual primitives and categories a model is likely to encounter downstream, CLIP transfers robustly zero-shot to a wide range of computer vision datasets. Moreover, CLIP provides a stronger initialization for training specialized models than the previous gold standard of ImageNet pre-training. However, CLIP features perform poorly on dense prediction tasks that require understanding the local context of an image, such as segmentation and detection [110]. We address this limitation by introducing local-to-global consistency learning as an additional objective alongside the contrastive loss (see the sketch after this list). Our SILC model improves significantly over CLIP and similar models across open set computer vision tasks, and the gains are especially pronounced on tasks that require capturing the local semantics of images, such as segmentation, detection, and captioning.
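
The sketch below illustrates this kind of combined objective: a CLIP-style symmetric contrastive loss over matched image-text pairs plus a local-to-global self-distillation term. The function names, loss weighting, and exact formulation are assumptions for illustration, not the SILC implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched image-text pairs (CLIP-style)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def local_to_global_loss(student_local_logits: torch.Tensor,
                         teacher_global_logits: torch.Tensor) -> torch.Tensor:
    """Self-distillation: predictions from a local crop (student) should match the
    teacher's distribution over the full image (the teacher receives no gradient)."""
    teacher_probs = F.softmax(teacher_global_logits.detach(), dim=-1)
    student_logp = F.log_softmax(student_local_logits, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def combined_objective(img_emb, txt_emb, student_local_logits, teacher_global_logits,
                       alpha: float = 1.0) -> torch.Tensor:
    # Image-text contrastive term plus a weighted local-to-global consistency term.
    return (clip_contrastive_loss(img_emb, txt_emb)
            + alpha * local_to_global_loss(student_local_logits, teacher_global_logits))
```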

Together, these three contributions explore the power of human language, in textual form, as a strong generalization signal for open set computer vision. They demonstrate how text descriptions and large language models can enhance open set generalization, enabling models to extend to new categories without any labeled images of them. By bridging the gap between vision and language, this research has the potential to significantly advance open set computer vision, and we hope it paves the way for more capable and intelligent systems that can navigate an ever-changing visual world.

