This information is drawn from company announcements, related news, public research reports, and social media, and does not constitute investment advice regarding the industries and stocks mentioned in the text. If any content infringes copyright or violates information-disclosure rules, please leave a message via the public account backend and it will be removed.
New Year’s Eve Meeting Highlights
1. Development of DeepSeek and its multimodal large model Janus Pro
Rapid growth of DeepSeek: DeepSeek has overtaken ChatGPT to become the fastest-growing AI application globally, with daily active users (DAU) reported in the range of 2 to 2.5 million.
Foundation of Janus Pro: Janus Pro is not a brand new model; it is based on Janus, released in October last year, with Janus-Pro and Janus-Flow launched on January 28.
Core of Janus model: Decouples the encoding tasks for understanding and generating images, allowing them to be executed by different encoders.
Training process of Janus model: Divided into three stages; the first stage trains the model’s adapter and image head; the second stage conducts unified pre-training; the third stage performs supervised fine-tuning (SFT).
Versions and performance of Janus Pro: Janus Pro is available in 1B and 7B versions, offering the best capabilities among models of similar parameter count; its performance advantage stems largely from the use of more high-quality synthetic data.
Model architecture of Janus Pro: the two tasks are handled by two separate encoders, and their outputs are processed by a single unified autoregressive Transformer.
Processing of image understanding and generation in Janus Pro: image understanding uses a CLIP-style encoder (Contrastive Language–Image Pretraining) trained with a contrastive loss, while image generation uses a separate encoder; a minimal code sketch of this decoupled design follows this list.
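To make the decoupled design concrete, below is a minimal PyTorch sketch of the idea described above. It is an illustrative toy, not DeepSeek's implementation: the module names, dimensions, and codebook size are assumptions, the understanding encoder is a simple stand-in for a CLIP-style vision tower, and causal masking and loss functions are omitted.

```python
import torch
import torch.nn as nn

class JanusStyleModel(nn.Module):
    """Toy sketch: separate encoders for understanding vs. generation,
    one shared autoregressive Transformer, and task-specific output heads."""
    def __init__(self, d_model=1024, text_vocab=32000, image_codebook=16384):
        super().__init__()
        # Understanding branch: patch embedding as a stand-in for a CLIP-style vision encoder
        self.und_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.und_adaptor = nn.Linear(d_model, d_model)
        # Generation branch: discrete image tokens (e.g. from a VQ tokenizer) -> embeddings
        self.gen_embed = nn.Embedding(image_codebook, d_model)
        self.gen_adaptor = nn.Linear(d_model, d_model)
        # Shared text embedding and autoregressive backbone (causal mask omitted for brevity)
        self.text_embed = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Separate heads: text head for understanding, image head for generation
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, image_codebook)

    def forward(self, text_ids, image=None, image_token_ids=None, task="understand"):
        parts = [self.text_embed(text_ids)]                             # (B, T, d)
        if task == "understand":
            if image is not None:
                feats = self.und_encoder(image).flatten(2).transpose(1, 2)  # (B, N, d)
                parts.insert(0, self.und_adaptor(feats))                # image context before the question
        else:  # "generate": predict the next image token from the prompt + image tokens so far
            if image_token_ids is not None:
                parts.append(self.gen_adaptor(self.gen_embed(image_token_ids)))
        h = self.backbone(torch.cat(parts, dim=1))
        return self.text_head(h) if task == "understand" else self.image_head(h)
```

The point of the decoupling is that the understanding path can stay semantic (CLIP-style features) while the generation path works on discrete image tokens, with only the shared Transformer tying the two together.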
2. Optimization Strategies for Janus Pro
Extended training: train the model longer on ImageNet in the first stage so that it makes fuller use of that data.
Improved pre-training: In the first stage, train the adapter and head; in the second stage, perform unified pre-training, using long text-to-image data for training.
Adjusting data ratios: continuously tune the sampling ratios of the different datasets during supervised fine-tuning (SFT); a schematic of the staged recipe and data mixtures follows this list.
Expanding the training data: on top of the roughly 160 million samples used for Janus, about 72 million synthetic samples have been added, along with new data for both the image-understanding and image-generation tasks.
Increasing model parameters: Janus Pro comes in 1B and 7B versions, corresponding to roughly 1 billion and 7 billion parameters. The two differ in vocabulary size and embedding size, and the 7B version has more attention heads, more layers, and a longer context window; the larger parameter count gives the 7B version better performance.
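Putting the three-stage recipe and the data-ratio tuning together, here is a schematic Python sketch of how such a training schedule might be organized. It assumes a model like the `JanusStyleModel` toy above; the stage names, dataset labels, and sampling ratios are illustrative placeholders rather than DeepSeek's actual values, and `build_loader` / `run_stage` are hypothetical helpers.

```python
def set_trainable(model, train_all):
    """Stage 1 updates only the adaptors and the image head; later stages train everything."""
    for name, param in model.named_parameters():
        param.requires_grad = train_all or "adaptor" in name or "image_head" in name

# (stage, train full model?, data mixture with sampling ratios) -- all values illustrative
STAGES = [
    ("stage1_adaptors_and_image_head", False, {"imagenet_extended": 1.0}),
    ("stage2_unified_pretraining",     True,  {"text": 0.3,
                                               "multimodal_understanding": 0.3,
                                               "text_to_image_long_captions": 0.4}),
    ("stage3_sft",                     True,  {"instruction_understanding": 0.5,
                                               "text_to_image_incl_synthetic": 0.5}),
]

def train(model, build_loader, run_stage):
    for stage_name, train_all, mixture in STAGES:
        set_trainable(model, train_all)
        loader = build_loader(mixture)   # samples batches according to the mixture ratios
        run_stage(model, loader)         # standard autoregressive training loop for this stage
```

The levers the summary highlights, longer stage-one training on ImageNet, unified pre-training in stage two, and SFT data-ratio tuning in stage three, all show up here as changes to what is trainable and how the data mixture is sampled.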
3. Training Costs of Janus Pro and Comparison with Other Models
Training costs: Janus Pro was trained on clusters of 16 to 32 nodes, each equipped with 8 NVIDIA A100 (40 GB) GPUs. The 1B version was trained for about seven days on 128 cards, and the 7B version for about 14 days on 256 cards.
Comparison with other models: GPT-4 reportedly took 90 to 100 days of training on about 25,000 A100 cards, while Llama 3.1 used roughly 15,000 H100 cards; publicly available data suggests that general-purpose frontier models are trained on around 15-20T tokens. A back-of-the-envelope GPU-hour comparison is sketched below.
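Taking the quoted figures at face value, the rough GPU-hour arithmetic looks like this (the GPT-4 numbers are the report's own estimate, not an official disclosure; the snippet just spells out the multiplication):

```python
def gpu_hours(gpus, days):
    """Total accelerator-hours for a run: number of GPUs x days x 24 hours."""
    return gpus * days * 24

janus_pro_1b = gpu_hours(128, 7)        # 21,504 A100-hours (~7 days on 16 nodes x 8 GPUs)
janus_pro_7b = gpu_hours(256, 14)       # 86,016 A100-hours (~14 days on 32 nodes x 8 GPUs)
gpt4_estimate = gpu_hours(25_000, 95)   # ~57,000,000 A100-hours (midpoint of 90-100 days)

print(f"Janus-Pro-1B : {janus_pro_1b:>12,.0f} GPU-hours")
print(f"Janus-Pro-7B : {janus_pro_7b:>12,.0f} GPU-hours")
print(f"GPT-4 (est.) : {gpt4_estimate:>12,.0f} GPU-hours")
print(f"GPT-4 estimate / Janus-Pro-7B ≈ {gpt4_estimate / janus_pro_7b:,.0f}x")
```

Even with all the caveats (different model sizes, token counts, and hardware generations), the several-hundred-fold gap in raw accelerator-hours is what the comparison above is highlighting.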
4. Comparison of Janus Pro with Other Visual Large Models
Comparison with Imagen and Stable Diffusion: relative to visual generation models such as Imagen and Stable Diffusion, Janus Pro's parameter count is reasonable, but its real-world capabilities still need to be tested; each model has its own strengths and design differences.
5. Reflections on Model Training and the AI Industry
The importance of model slimming: once large models pass a certain parameter count, the returns diminish and the costs become too high. Slimming models down is something the major players will need to address going forward, and it requires updating and upgrading every part of the model architecture.
Understanding post-training: post-training includes learning and search; learning extracts patterns from data, while search derives reasoning through computation. The full training process of a large model is akin to a human lifetime, moving from pre-training to post-training, and within post-training from learning to search. Reinforcement learning (RL) holds significant potential for enhancing model capabilities and is a part of the AI industry's cadence worth learning from.
Impact on the AI industry and the re-rating of Chinese AI assets: reports on DeepSeek's impact on each layer of the AI value chain (compute, models, and application/device endpoints) will follow over the next one to two weeks. Many Chinese AI assets are worth re-rating at this point, but re-rating is a medium- to long-term industry thesis that must be paired with short-term timing; progress in the US AI industry also remains worth tracking and learning from.