DeepSeek Janus Series: Unified Multimodal Understanding and Generation Models

DeepSeek released the multimodal model Janus-Pro this morning. The Janus series comprises three models: Janus, JanusFlow, and Janus-Pro, which together significantly advance multimodal understanding and visual generation. The code and models are fully open. I tried them today and found the results mediocre (details below); there is still a clear gap compared with the closed- and open-source alternatives available now.

Introduction to the Three Models

Janus: Decoupled Visual Multimodal Understanding and Generation

Janus is an innovative autoregressive framework aimed at unifying multimodal understanding and generation. Its unique advantages include:

  • Decoupled Visual Encoding: By separating the visual encoding paths, it effectively alleviates the conflicts between generation and understanding while still adopting a unified Transformer architecture.
  • Flexibility and Generality: The decoupled design enhances the framework’s flexibility, allowing it to surpass traditional unified models in multimodal tasks and rival task-specific models.

Technical Features:

  • Simplified design: Reduces architectural complexity.
  • Efficiency: Performs excellently across multiple tasks, making it a strong candidate for the next generation of multimodal models.
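To make "decoupled visual encoding" concrete, here is a toy sketch. This is my own illustration, not the actual Janus architecture: the dimensions and the plain linear maps are made up. The point is that two independent encoders, one per task, emit tokens in the single embedding space consumed by one shared transformer.

```python
import random

random.seed(0)
D = 8  # shared embedding width (arbitrary for this toy)

def linear(in_dim, out_dim):
    """A random linear map standing in for a trained encoder."""
    W = [[random.gauss(0, 1) for _ in range(out_dim)] for _ in range(in_dim)]
    def apply(x):  # x: list of length in_dim
        return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(out_dim)]
    return apply

# Decoupled paths: a semantic encoder for understanding and a separate
# code-based encoder for generation, each with its own input width.
encode_und = linear(32, D)   # stands in for the SigLIP-style understanding path
encode_gen = linear(16, D)   # stands in for the generation tokenizer path

und_token = encode_und([0.1] * 32)
gen_token = encode_gen([0.5] * 16)

# Both paths land in the same D-dimensional space that one shared
# transformer consumes; that is the core of the decoupled design.
print(len(und_token), len(gen_token))  # 8 8
```

Because the two paths never share weights, the conflicting demands of semantic understanding and pixel-level generation no longer pull on the same encoder.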

Model Download Links and Paper Addresses:

  • Janus-1.3B: https://huggingface.co/deepseek-ai/Janus-1.3B
  • Paper: https://arxiv.org/abs/2410.13848


JanusFlow: Harmonious Unity of Autoregression and Rectified Flow

JanusFlow is another important component of the Janus framework, combining autoregressive language models with the latest generative modeling method Rectified Flow. Its technical highlights include:

  • Simplified Architecture: Allows training of rectified flow directly within the large language model framework without complex architectural modifications.
  • Efficient Generation: Achieves performance comparable to or even better than specialized models in visual and language tasks.
  • Cross-task Superiority: Significantly surpasses the benchmark performance of existing unified methods.

Model Download Links and Paper Addresses:

  • JanusFlow-1.3B: https://huggingface.co/deepseek-ai/JanusFlow-1.3B
  • Paper: https://arxiv.org/abs/2411.07975


Janus-Pro: A Unified Framework for Data and Model Expansion

Janus-Pro is an advanced version of Janus, integrating the following core improvements:

  • Optimized Training Strategy: Significantly enhances generalization ability and stability.
  • Expanded Training Data: Enriches the scenes and diversity of multimodal inputs.
  • Model Scale Expansion: Enhances the model’s understanding and generation stability.

These improvements enable Janus-Pro to excel in the following areas:

  • Multimodal Understanding: Achieves industry-leading performance on complex tasks.
  • Text-to-Image Generation: Follows complex instructions more efficiently and generates stable, high-quality images.

Model Download Links and Paper Addresses:

  • Janus-Pro-1B: https://huggingface.co/deepseek-ai/Janus-Pro-1B
  • Janus-Pro-7B: https://huggingface.co/deepseek-ai/Janus-Pro-7B
  • Paper: https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf


Try Image Understanding

Task 1: Understand this Meme

[Image: the OpenAI vs. DeepSeek meme]

Janus

This meme is a humorous comparison between two different AI assistants. The left side shows a person using an OpenAI assistant, which is depicted as having a camera mounted on its head, suggesting it is a tool for surveillance or monitoring. The right side shows a person using a DeepSee assistant, which is depicted as a person pointing a camera at a screen, suggesting it is used for visual analysis or data collection. The meme is a play on the idea that OpenAI is more focused on surveillance and monitoring, while DeepSee is more focused on visual analysis and data collection.

JanusFlow

OpenAI and DeepSeek are two AI models that have been developed to assist humans in various tasks. OpenAI is a large language model that can be used for a wide range of applications, including chatbots, language translation, and text generation. DeepSeek, on the other hand, is a more specialized AI model that is designed to search for specific information within large datasets. Both models have their own strengths and weaknesses, and they are often used in conjunction with each other to achieve more complex tasks.

Janus-Pro

This image is a humorous meme comparing two AI systems, “OpenAI” and “DeepSeek,” using a visual metaphor.

  1. Left Side (OpenAI):

     • The person is wearing a headset with multiple cameras or lenses, aiming a gun. This represents the idea of OpenAI being “focused” or “targeted” on a specific task or goal.

  2. Right Side (DeepSeek):

     • The person is holding a gun, aiming at a target, but the target is not visible. This suggests that DeepSeek is “aiming” or “targeting” at something, but the specific goal or outcome is not clear.

  3. Overall Meaning:

     • The meme humorously implies that OpenAI is more focused and directed towards a specific goal, while DeepSeek is more ambiguous or uncertain in its approach.

This is a playful comparison, likely intended to highlight the differences in how these AI systems are perceived or used.

Task 2: Spot the Differences

[Image: spot-the-differences puzzle]

Correct Answer:

[Image: answer key]

Janus

No difference (Incorrect ❌)

JanusFlow

The image shows a girl sitting at a table with a bowl of ice cream and an ice cream cone. There are two differences between the two images:

1. The girl’s hair is slightly different in the second image. In the first image, her hair is straight, while in the second image, her hair is slightly wavy. (Incorrect ❌)

2. The girl is wearing a pink bow in her hair in the second image, which is not present in the first image. (Correct ✅)

Janus-Pro

The difference between the two images is the color of the ice cream scoops. In the left image, the ice cream scoops are green, while in the right image, they are white. (Incorrect ❌)

Task 3: Solve the Math Problem

[Image: math problem]

Correct Answer: ChatGPT o1 and o1 Pro can solve it.
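For reference, none of the three transcripts below reaches an answer. The problem itself is only shown as an image, so this is an assumption: judging from the fragments the models echo, it appears to ask for h(1) given h′(t) = 24t² − 48t + 2 and h(−2) = −9. Under that (possibly wrong) reading, a quick check:

```python
# Assumed problem (reconstructed from the model transcripts; the original
# is an image, so this reading may be wrong):
#   h'(t) = 24t^2 - 48t + 2,  h(-2) = -9,  find h(1).
# Antiderivative: h(t) = 8t^3 - 24t^2 + 2t + C.

def h(t, C):
    return 8 * t**3 - 24 * t**2 + 2 * t + C

# Fix the integration constant from the condition h(-2) = -9.
C = -9 - (8 * (-2)**3 - 24 * (-2)**2 + 2 * (-2))
print(C, h(1, C))  # 155 141
```

All three models instead treat the derivative formula as an equation to solve for t, which is why their outputs below go nowhere.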

    Janus

To solve this math problem, we need to determine the value of ( h_1(-2) ).

First, we need to find the value of ( h_1(-2) ).

Given the equation:

[ A^2 = 24^2 - 48t + 2, h_1(-2) = -9 ]

We can substitute ( A^2 = 24^2 - 48t ) into the given equation:

[ 24^2 - 48t + 2 = -9 ]

Next, we need to solve the equation for ( t ):

[ 24^2 - 48t + 2 = -9 ]

[the model repeats this line over and over until the output is cut off]

(Incorrect ❌)

    JanusFlow

    To solve this math problem, we need to find the value of ( h_1(1) ) given that ( h_1′(1) = 24t^2 – 48t + 2 ) and ( h_1(-2) = -9 ).
    First, we need to find ( h_1′(1) ):
    [ h_1′(1) = 24t^2 – 48t + 2 ]
    Next, we need to find ( h_1(-2) ):
    [ h_1(-2) = -9 ]
    Now, we can substitute ( h_1′(1) ) and ( h_1(-2) ) into the equation:
    [ 24t^2 – 48t + 2 – 9 = -9 ]
    Simplifying the equation:
    [ 24t^2 – 47t + 1 = -9 ]
    Subtract 1 from both sides:
    [ 24t^2 – 47t + 1 – 1 = -9 ]
    [ 24t^2 – 46t + 2 = -9 ]
    Now, we need to find ( h_1(1) ):
    [ h_1(1) = 24t^2 – 46t + 2 ]
    Substitute ( h_1(1) ) back into the equation we derived:
    [ 24t^2 – 46t + 2 = 24t^2 – 46t + 2 ]
    Simplify the equation:
    [ 24t^2 – 46t + 2 = 24t^2 – 46t + 2 ]
    This equation is already correct, so there is no need to solve it further. The solution is:
    [ h_1(1) = 24t^2 – 46t + 2 ]

    (Incorrect ❌)

    Janus-Pro

We have the equation (24^2 - 48t + 2h = -9) and (h(-2) = -4).
First, let’s simplify the equation (24^2 - 48t + 2h = -9).
[
\begin{align*}
24^2 - 48t + 2h &= -9 \\
(24)^2 - 48t + 2h &= -9 \\
(24)^2 - 48t + 2h + 9 &= 0
\end{align*}
]

[the model repeats the last line over and over until the output is cut off]

(Incorrect ❌)

Task 4: Count the Cars in the Parking Lot

[Image: parking lot]

Correct Answer: 18 cars (ChatGPT o1 Pro previously also answered this incorrectly)

Janus

There are 20 cars in this parking lot.

(Incorrect ❌)

JanusFlow

There are 21 cars in the parking lot.

(Incorrect ❌)

Janus-Pro

There are 14 cars in the parking lot.

(Incorrect ❌)

Try Image Generation

Prompt:

The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.

Above the eye, there’s a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail.

Overall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component—from the intricate designs framing the eye to the ancient-looking stone piece above—contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.

Janus

[Generated image]

JanusFlow

The demo has a bug and did not run successfully.

Janus-Pro

[Generated images]

Prompt:

A charming, cute girl with big sparkling eyes, soft pastel-colored hair (e.g., pink, lavender, or mint green), wearing a stylish outfit with subtle frills and bows, standing in a dreamy background filled with soft lighting, cherry blossoms, and gentle gradients. The atmosphere is cheerful and heartwarming, with warm, glowing highlights and delicate details in the surroundings, anime-inspired style.

Janus

[Generated image]

JanusFlow

The demo has a bug and did not run successfully.

Janus-Pro

[Generated images]

More Image Examples

Janus

[Generated image]

JanusFlow

[Generated image]

Janus-Pro

[Generated images]

Evaluation

Janus

Benchmark Performance

[Benchmark chart]

Visual Generation Results

[Sample images]

JanusFlow

Benchmark Performance

[Benchmark chart]

Visual Generation Results

[Sample images]

Janus-Pro

Average performance on four multimodal understanding benchmarks

[Benchmark chart]

Performance on text-to-image instruction-following benchmarks

[Benchmark chart]

Technical Framework

Janus

Unlike previous methods that typically assume visual understanding and generation share the same visual encoder, Janus decouples visual encoding into independent modules for visual understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “understanding encoder” and “generation encoder,” respectively.

[Architecture diagram]

Janus’s three-stage training process:

[Training pipeline diagram]

  • Stage 1: Train Adapters and Image Heads. The goal is to establish conceptual connections between vision and language in the embedding space, giving the model preliminary visual generation capabilities. The visual encoder and LLM are frozen in this stage; only the understanding adapter, generation adapter, and image head are updated.

  • Stage 2: Unified Pre-training. Train on a multimodal corpus so the model learns both multimodal understanding and generation. The LLM is unfrozen and trained on pure text data, multimodal understanding data, and visual generation data, starting visual generation training from ImageNet-1k and then expanding to open-domain text-to-image data.

  • Stage 3: Supervised Fine-tuning (SFT). Fine-tune the model on instruction tuning data to enhance its instruction-following and conversational capabilities. All parameters except the generation encoder are unfrozen. The data mixture includes pure text dialogue, multimodal understanding, and visual generation to ensure the model’s versatility in various scenarios.
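The stage-wise freezing schedule above can be summarized as a small configuration sketch. This is my own illustration; the module names are informal labels taken from the description, not identifiers from the Janus codebase, and the Stage 2 trainable set is my reading of "unfreeze the LLM":

```python
# Toy sketch of Janus's stage-wise freezing schedule (illustrative only).
MODULES = ["visual_encoder", "llm", "und_adapter", "gen_adapter",
           "image_head", "gen_encoder"]

STAGES = {
    # Stage 1: visual encoder and LLM frozen; only adapters + image head train.
    1: {"und_adapter", "gen_adapter", "image_head"},
    # Stage 2: the LLM is unfrozen as well (my reading of the description).
    2: {"llm", "und_adapter", "gen_adapter", "image_head"},
    # Stage 3 (SFT): everything except the generation encoder.
    3: set(MODULES) - {"gen_encoder"},
}

def trainable(stage):
    """Map each module to whether it receives gradient updates in `stage`."""
    return {m: (m in STAGES[stage]) for m in MODULES}

print(trainable(3))
```

In a real training loop this table would drive `requires_grad` flags per parameter group; the point here is just how the trainable set grows across stages.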

JanusFlow

JanusFlow architecture: In visual understanding, the LLM generates responses through autoregressive prediction; in image generation, starting from Gaussian noise (t=0), the LLM iteratively updates z_t by predicting velocity vectors until t=1. For simplicity, the VAE encoder, the skip connections used in generation, and the linear layers after f_enc are omitted from the diagram.

[Architecture diagram]

JanusFlow’s three-stage training process:

[Training pipeline diagram]

  • Stage 1: Adapt Randomly Initialized Components. Train the randomly initialized linear layers, generation encoder, and generation decoder, so that these new modules work collaboratively with the pre-trained LLM and SigLIP encoder.

  • Stage 2: Unified Pre-training. Train the entire model (with the visual encoder frozen) on three types of data: multimodal understanding data, image generation data, and pure text data. Training initially focuses on multimodal understanding, later increasing the proportion of image generation data to meet the convergence needs of diffusion models.

  • Stage 3: Supervised Fine-tuning (SFT). Fine-tune the model on instruction tuning data (dialogue, task-specific dialogue, and high-quality text-to-image generation examples). The SigLIP encoder is unfrozen to enhance instruction response in both multimodal understanding and image generation tasks.
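The sampling procedure described above, Euler integration of predicted velocity vectors from t=0 to t=1, can be sketched in one dimension. This toy example is my own illustration of the rectified-flow idea, not JanusFlow code: with straight-line paths between noise x0 and data x1, the ideal velocity is the constant x1 − x0, so Euler steps land exactly on the data point.

```python
import random

random.seed(0)

def sample(x0, x1, steps=10):
    """Euler-integrate dz/dt = v(z, t) from t=0 (noise) to t=1 (data).

    For straight-line (rectified) paths the oracle velocity is constant,
    v = x1 - x0, so the ODE lands exactly on the data point x1. In
    JanusFlow, a trained model predicts v instead of this oracle.
    """
    z, dt = x0, 1.0 / steps
    for _ in range(steps):
        v = x1 - x0          # oracle velocity; the LLM head predicts this
        z = z + v * dt       # Euler update of z_t
    return z

noise = random.gauss(0, 1)   # z at t=0
data = 3.0                   # target sample at t=1
print(round(sample(noise, data), 6))  # 3.0
```

With a learned, time-dependent velocity field the paths are no longer exactly straight, but the same few-step Euler loop is the entire sampler, which is what makes rectified flow easy to host inside an autoregressive LLM.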

Janus-Pro

Janus-Pro architecture: decouples visual encoding into multimodal understanding (Understanding Encoder, abbreviated “Und. Encoder”) and visual generation (Generation Encoder, abbreviated “Gen. Encoder”).

[Architecture diagram]

Compared to Janus, Janus-Pro’s three-stage optimization is as follows:

  • Stage I: Extend Training Time. Increase the number of training steps in Stage I, training fully on the ImageNet dataset. Even with the LLM parameters fixed, the model can still effectively model pixel dependencies and generate reasonable images from category names.

  • Stage II: Focused Training. Drop the ImageNet data and train directly on ordinary text-to-image data, generating images from dense descriptions, which improves training efficiency and overall performance.

  • Stage III: Adjust Data Ratios. In supervised fine-tuning, adjust the ratio of multimodal data, pure text data, and text-to-image data from 7:3:10 to 5:1:4. Slightly lowering the proportion of text-to-image data improves multimodal understanding while maintaining strong visual generation capabilities.
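The Stage III ratio change is easier to read as sampling probabilities over the three data sources. A quick sketch (my own illustration, not code from the paper):

```python
from fractions import Fraction

def mixture(ratio):
    """Convert a data ratio such as 5:1:4 into exact sampling probabilities."""
    total = sum(ratio)
    return [Fraction(r, total) for r in ratio]

names = ["multimodal", "pure text", "text-to-image"]

for label, ratio in [("Janus (7:3:10)", (7, 3, 10)),
                     ("Janus-Pro (5:1:4)", (5, 1, 4))]:
    probs = mixture(ratio)
    print(label, {n: f"{float(p):.0%}" for n, p in zip(names, probs)})
# Text-to-image drops from 50% of the mixture to 40%, while multimodal
# understanding rises from 35% to 50%.
```

Framed this way, the tweak reads less like "less generation data" and more like a shift of the SFT budget toward understanding.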
