Introduction
Competition Link: Code Qwen AI Challenge - Algorithm Track (Alibaba Cloud Tianchi: competition problems and data)
Code is one of the highest-quality languages created by humans: it replaces diverse natural language with a highly abstract form that is ultimately turned into concrete programs that carry out tasks on our behalf. It is precise, logical, and executable. Coding ability has therefore become a core capability of large language models (LLMs): we expect LLMs to assist humans with programming, bug fixing, and eventually fully automated code generation.
How to enhance the coding ability of a foundation language model through fine-tuning on high-quality data remains an open and challenging problem. The Qwen AI Challenge, hosted by Alibaba Cloud and NVIDIA and jointly organized by the Tianchi platform and the ModelScope (Modao) community, focuses on fine-tuning the Tongyi Qianwen (Qwen) large model. Its goal is to explore and push the limits of the coding capability of the open-source Qwen 1.8B and Qwen 72B models through high-quality data.
The article is organized into a preliminary-round plan and a final-round plan, each with a problem analysis and the corresponding solution.
Preliminary model open-source address: https://modelscope.cn/models/JMXGODLZ/CodeQWen_1.8B/summary
Preliminary Plan
Problem Analysis
The preliminary evaluation data is divided into two types:
- Code Completion (humanevalsynthesize / MBPP)
- Code Correction (humanevalfixtests)
The data format is as follows:
humanevalsynthesize:
{
"input":"Write a Python function `has_close_elements(numbers: List[float], threshold: float) -> bool` to solve the following problem:
Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
",
"output":" for idx, elem in enumerate(numbers):
for idx2, elem2 in enumerate(numbers):
if idx != idx2:
distance = abs(elem - elem2)
if distance < threshold:
return True
return False
"
}
MBPP:
{
"input":"""\nWrite a python function to remove first and last occurrence of a given character from the string.\nassert remove_Occ("hello","l") == "heo"\n""",
"output":"def remove_Occ(s,ch): \n for i in range(len(s)): \n if (s[i] == ch): \n s = s[0 : i] + s[i + 1:] \n break\n for i in range(len(s) - 1,-1,-1): \n if (s[i] == ch): \n s = s[0 : i] + s[i + 1:] \n break\n return s "
}
humanevalfixtests:
{
"input":"from typing import List\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = elem - elem2;\n if distance < threshold:\n return True\n\n return False\n\n\ndef check(has_close_elements):\n assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n assert has_close_elements([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n\ncheck(has_close_elements)\nFix bugs in has_close_elements.\n\nfrom typing import List\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:",
"output":" for idx, elem in enumerate(numbers):\n for idx2, elem2 in enumerate(numbers):\n if idx != idx2:\n distance = abs(elem - elem2);\n if distance < threshold:\n return True\n\n return False\n"
}
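For illustration, a minimal sketch of how records in this input/output format could be wrapped into chat-style fine-tuning samples (the file names and the empty system prompt are assumptions; the target schema matches the messages format used later in this write-up):

import json

def to_chat_sample(record: dict) -> dict:
    """Wrap an input/output record into a chat-style SFT sample."""
    return {
        "messages": [
            {"role": "system", "content": ""},
            {"role": "user", "content": record["input"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }

# Assumed layout: one JSON object per line with "input"/"output" keys.
with open("code_sft_raw.jsonl", encoding="utf-8") as fin, \
     open("code_sft_chat.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(json.dumps(to_chat_sample(json.loads(line)), ensure_ascii=False) + "\n")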
Key Points of the Problem:
- The preliminary round covers multiple languages: Python / Java / JavaScript / C++ / Go / Rust
- Training data leakage detection: the training set must not contain the evaluation data or near-duplicates of it
Reference Solutions
Related open-source data:
https://huggingface.co/datasets/nampdn-ai/tiny-codes
https://huggingface.co/datasets/jinaai/code_exercises?row=20
https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1
https://huggingface.co/datasets/codefuse-ai/Evol-instruction-66k
https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k/tree/main
https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k/viewer/default/train?p=200
https://huggingface.co/datasets/BAAI/COIG/tree/main
Reference Paper 1:
https://arxiv.org/pdf/2306.11644v2.pdf

phi-1 achieves about 51% accuracy on HumanEval; its core method is summarized below.
Datasets:
- Filtered Code-Language Dataset: Python code with related natural-language text, filtered from The Stack and StackOverflow; roughly 6 billion (6B) tokens.
- Synthetic Textbook Dataset: textbook-style natural-language text interleaved with Python code snippets, synthesized by GPT-3.5; fewer than 1 billion (1B) tokens.
- CodeExercises Dataset: Python programming exercises, each given as a function signature with a docstring, together with the corresponding solutions; roughly 180 million (180M) tokens.
Training Method:
- Pre-training: phi-1-base was trained for about 8 epochs on the CodeTextbook dataset (the Filtered Code-Language Dataset plus the Synthetic Textbook Dataset), seeing roughly 50B tokens in total.
- Fine-tuning: phi-1 was obtained by fine-tuning phi-1-base on the CodeExercises dataset.
The authors combined these three datasets to give phi-1 different types of training signal: the Filtered Code-Language Dataset for pre-training, the Synthetic Textbook Dataset to further strengthen the model, and finally fine-tuning on the CodeExercises Dataset so that the model generates code better aligned with concrete programming tasks.
The use of these different types of datasets during the training process helps the model learn different levels and types of knowledge, thus improving the model’s performance and generality. Additionally, this training method emphasizes the importance of data quality, especially the role of high-quality textbook data in enhancing model performance.
High-quality textbook data has some significant differences from traditional code data:
1. Clarity and Self-Containment: High-quality textbook data has clear expression and self-containment, allowing the content to be understood independently without additional context. In contrast, traditional code data may contain dependencies on other modules or files, making it difficult to understand.
2. Instructiveness: High-quality textbook data should have a teaching nature, effectively teaching basic coding concepts. However, traditional code data may contain a lot of non-instructive content, such as overly simple or boilerplate code that does not involve any meaningful computation.
3. Algorithmic Logic: High-quality textbook data should cover algorithmic logic, not just provide simple code snippets like constant definitions or parameter settings. In contrast, traditional code data may hide algorithmic logic within complex or poorly documented functions.
4. Balance: High-quality textbook data should cover a wide range of coding concepts and skills, without bias towards specific topics or use cases. Traditional code data may exhibit bias towards certain themes or use cases, leading to uneven distribution of coding concepts and skills in the dataset.
The authors believe the importance of high-quality textbook data for model training lies in:
1. Improving Signal Quality: High-quality textbook data can provide clearer and more effective signals, helping the model map natural language to code, thereby enhancing model performance.
2. Reducing Noise and Incompleteness: Traditional code data may contain a lot of noise, ambiguity, and incomplete information that can interfere with the model’s learning process. High-quality textbook data can reduce these issues, allowing the model to receive more accurate and complete training.
3. Enhancing Model Generality: By using high-quality textbook data, the model can learn broader and more diverse coding concepts and skills, thereby improving the model’s generality and robustness across different tasks.
In summary, high-quality textbook data can provide the model with more effective, clear, and instructive training signals, enabling it to achieve better performance in code generation tasks.
Reference Paper 2:
https://arxiv.org/pdf/2312.02120.pdf

MagicoderS-CL-7B substantially improves the base model's performance on code generation tasks through SFT on the high-quality OSS-Instruct dataset.
The prompt template used by OSS-Instruct (Figure 2 in the paper):

Core Idea of OSS-Instruct: Use open-source code data as seed code snippets to generate high-quality code generation tasks and solutions.
80K initial seed snippets were collected from 80K code documents: 40K in Python and 5K each in C++, Java, TypeScript, Shell, C#, Rust, PHP, and Swift. Each seed snippet is inserted into the prompt template shown in Figure 2, and a teacher model takes the filled prompt as input and outputs a coding problem together with its solution.
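For intuition, a minimal sketch of this seed-to-prompt step (the template wording and helper name are assumptions for illustration; the paper's actual template is the one in its Figure 2, and the teacher-model call and response parsing are omitted):

import random

# Illustrative wording only; the real template is shown in Figure 2 of the paper.
PROMPT_TEMPLATE = (
    "Please gain inspiration from the following random code snippet to create a "
    "high-quality programming problem.\n\n"
    "Code snippet for inspiration:\n{snippet}\n\n"
    "Present a [Problem Description] and a self-contained [Solution]."
)

def build_oss_instruct_prompts(seed_snippets, n_samples):
    """Fill the template with randomly drawn seed snippets; each filled prompt is sent
    to a teacher model, whose reply is parsed into a coding problem plus its solution."""
    chosen = random.sample(seed_snippets, min(n_samples, len(seed_snippets)))
    return [PROMPT_TEMPLATE.format(snippet=s) for s in chosen]

# Toy usage with a single seed snippet:
print(build_oss_instruct_prompts(["def add(a, b):\n    return a + b"], 1)[0])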
Data Generation Example:

Model Training:
- Stage 1: train on the 75K OSS-Instruct dataset for 2 epochs with learning rate 5e-5
- Stage 2: continue training on the 110K evol-codealpaca-v1 dataset with the same configuration
Overall, OSS-Instruct is a novel data generation method that uses large language models to create low-bias, high-quality coding problems from open-source code snippets. It substantially improves the base LLM behind Magicoder: although it has no more than 7B parameters, Magicoder outperforms all evaluated LLMs with at most 16B parameters, including the 15B-parameter WizardCoder. The enhanced MagicoderS models, built by combining OSS-Instruct with Evol-Instruct, achieve strong results against leading models such as ChatGPT on the HumanEval benchmark.
Solution
The overall idea comes from "Textbooks Are All You Need". Training is divided into two stages:
1. Stage 1: one epoch of training on 16K open-source textbook-style samples, learning rate 1e-6
2. Stage 2: a second fine-tuning stage on other open-source code data together with CodeExercises-style data and rewritten-language data, learning rate 5e-7
- Due to competition time constraints, the OSS-Instruct dataset was not evaluated for comparison.
- Training used full-parameter fine-tuning; with sufficient compute, direct SFT outperforms LoRA training (see https://arxiv.org/pdf/2304.08109.pdf for the reasoning).
- Training data leakage detection: offline leakage checks were run on the training data, including similarity-based deduplication of input/output pairs and removal of functions whose names match evaluation functions (see the sketch after this list).
- The preliminary-round model has been open-sourced: https://modelscope.cn/models/JMXGODLZ/CodeQWen_1.8B/summary
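A minimal sketch of the offline leakage check described above, assuming training and evaluation samples are dicts with "input"/"output" fields (the 0.9 threshold and the use of difflib are illustrative choices; a MinHash/LSH index would be needed at larger scale):

import re
from difflib import SequenceMatcher

def function_names(code: str) -> set:
    """Extract the names of functions defined in a code string."""
    return set(re.findall(r"def\s+([A-Za-z_]\w*)\s*\(", code))

def too_similar(a: str, b: str, threshold: float) -> bool:
    """Character-level similarity between two texts (slow but simple)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def filter_leakage(train_samples, eval_samples, threshold: float = 0.9):
    """Drop training samples that overlap with the evaluation set: either highly
    similar input/output text, or a function sharing a name with an evaluation sample."""
    eval_names, eval_texts = set(), []
    for e in eval_samples:
        text = e["input"] + e["output"]
        eval_names |= function_names(text)
        eval_texts.append(text)

    kept = []
    for s in train_samples:
        text = s["input"] + s["output"]
        if function_names(text) & eval_names:
            continue
        if any(too_similar(text, et, threshold) for et in eval_texts):
            continue
        kept.append(s)
    return kept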
Final Plan
Problem Analysis
The format of the final competition data:
{
"task_id":"Python/88",
"title":"maximum-product-subarray",
"question_id":"152",
"type":"medium",
"prompt":"Implement in python3:
Given an integer array nums, please find the contiguous subarray (containing at least one number) which has the largest product, and return the product corresponding to that subarray.
The answer for the test case is a 32-bit integer.
A subarray is a contiguous subsequence of an array.
Example 1:
Input: nums = [2,3,-2,4]
Output: 6
Explanation: Subarray [2,3] has maximum product 6.
Example 2:
Input: nums = [-2,0,-1]
Output: 0
Explanation: The result cannot be 2, because [-2,-1] is not a subarray.
Constraints:
1 <= nums.length <= 2 * 10^4
-10 <= nums[i] <= 10
The product of any prefix or suffix of nums is guaranteed to be a 32-bit integer
class Solution:
    def maxProduct(self, nums: List[int]) -> int:",
"Solution":"class Solution:
def maxProduct(self, nums: List[int]) -> int:
prefix, suffix, res = 0, 0, -float('inf')
for i in range(len(nums)):
prefix = nums[i] if not prefix else nums[i] * prefix
suffix = nums[-i - 1] if not suffix else nums[-i - 1] * suffix
res = max(res, prefix, suffix)
return res",
"test":"class TestSolution(unittest.TestCase):
def test_example(self):
a = Solution()
nums = [2, 3, -2, 4]
self.assertEqual(a.maxProduct(nums), 6)
"}
Based on the problem statement and Q&A in the organizer’s group, the key points of the problem can be summarized as follows:
- The final-round code language is Python only.
- The final-round questions are based on LeetCode data.
- The final program must run without errors; package imports, indentation, runtime results, and time complexity must all meet the problem requirements.
- Based on the compute provided by the competition, LoRA training of the 72B Qwen model on roughly 10,000 samples is estimated to take about 4.5 hours. We therefore prioritize data relevance and quality over sheer volume to keep training efficient and effective.
Solution
- Collect open-source Python data related to LeetCode.
- Collect LeetCode problem-solving write-ups to serve as code explanations.
LeetCode related open-source data links:
https://github.com/itcharge/LeetCode-Py/tree/main/Solutions
https://huggingface.co/datasets/RayBernard/leetcode1000
https://modelscope.cn/datasets/AI-ModelScope/leetcode-solutions-python/summary
Training data format:
{
"messages":[
{
"role":"system",
"content":""
},
{
"role":"user",
"content":"Implement in Python3 the following requirement: Given an integer array `nums` and a target value `target`, return the indices of the two numbers such that they add up to `target`.
You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order.
For example, given nums = [2,7,11,15], target = 9, the answer is [0,1]. Because nums[0] + nums[1] == 9, we return [0, 1].
For nums = [3,2,4], target = 6, the answer is [1,2].
For nums = [3,3], target = 6, the answer is [0,1].
Constraints:
* 2 <= nums.length <= 10^4
* -10^9 <= nums[i] <= 10^9
* -10^9 <= target <= 10^9
* There is exactly one valid answer.
Bonus: Can you come up with an algorithm that runs in less than O(n^2) time? The solution code must contain detailed comments explaining the algorithm."
},
{
"role":"assistant",
"content":"class Solution:
def twoSum(self, nums: List[int], target: int) -> List[int]:
map = {} # Use a dictionary as a hash table to store visited elements
for i, num in enumerate(nums): # Iterate through the array while getting both the value and index of elements
complement = target - num # Calculate the difference between target value and current element
if complement in map: # If the difference is in the hash table, a solution has been found
return [map[complement], i] # Return the index of the complement element and the current element
map[num] = i # If the difference is not in the hash table, add the current element to the hash table
return [] # If no solution is found, return an empty list
# This algorithm has a time complexity of O(n) and a space complexity of O(n) since in the worst case we need to store n elements in the hash table."
}
]
}
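Before training, it is worth checking that the rendered samples fit in the 4096-token window used below; a minimal sketch, assuming the Qwen/Qwen-1_8B tokenizer and a file named leetcode_train.jsonl (both names are assumptions for illustration):

import json
from transformers import AutoTokenizer

MAX_LEN = 4096  # matches the sequence_length setting below
MODEL_NAME = "Qwen/Qwen-1_8B"  # assumption; any compatible tokenizer works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

def approx_length(sample: dict) -> int:
    """Rough token count of a chat sample: sum of its message contents."""
    text = "\n".join(m["content"] for m in sample["messages"])
    return len(tokenizer(text)["input_ids"])

with open("leetcode_train.jsonl", encoding="utf-8") as f:  # assumed file name
    too_long = sum(approx_length(json.loads(line)) > MAX_LEN for line in f)
print(f"samples exceeding {MAX_LEN} tokens: {too_long}")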
Hyperparameter Settings:
- batch_size: 4
- sequence_length: 4096
- learning_rate: 3e-5
- epochs: 3
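For reference, a minimal sketch of a LoRA setup that plugs in these hyperparameters (the rank, alpha, target modules, and model name are illustrative assumptions not specified in this write-up; the library calls are standard transformers/peft APIs):

from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B", trust_remote_code=True, device_map="auto"
)

# LoRA hyperparameters below are illustrative defaults, not values from this write-up.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in first-generation Qwen
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="qwen72b-leetcode-lora",
    per_device_train_batch_size=4,   # batch_size
    learning_rate=3e-5,              # learning_rate
    num_train_epochs=3,              # epochs
    bf16=True,
    logging_steps=10,
)
# A Trainer / SFT framework of choice would consume `model`, `training_args`,
# and the 4096-token chat samples prepared above.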
Challenges:
While executing the validation code for the final competition locally, the following issues occurred:
- Missing package imports in the executed code, such as math, heapq, and other libraries.
- Generated content includes incorrect comments and test cases, causing the tests to fail.
- Problems whose functions share the same name led the model to generate solutions that did not meet the problem requirements or the required time complexity.
To resolve the first two issues, I cleaned the training data, collected the commonly used external libraries, removed unused imports, and generated test data; a sketch of this post-processing is shown below. The last and most challenging issue has not been attempted due to time constraints; the planned fix is to refactor conflicting function names.
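A minimal sketch of such a cleaning/post-processing step (the import whitelist and the markers used to strip trailing test code are assumptions; the actual rules were tuned on the training data):

import re

# Commonly needed imports for LeetCode-style solutions (assumed whitelist).
COMMON_IMPORTS = {
    "math": "import math",
    "heapq": "import heapq",
    "collections": "import collections",
    "itertools": "import itertools",
    "List": "from typing import List",
    "Optional": "from typing import Optional",
}

def add_missing_imports(code: str) -> str:
    """Prepend imports for names the generated code uses but never imports."""
    needed = []
    for name, stmt in COMMON_IMPORTS.items():
        if re.search(rf"\b{name}\b", code) and stmt not in code:
            needed.append(stmt)
    return "\n".join(needed + [code]) if needed else code

def strip_trailing_tests(code: str) -> str:
    """Drop generated test/demo code that follows the Solution class (assumed markers)."""
    for marker in ("\nclass TestSolution", "\nif __name__", "\nassert "):
        idx = code.find(marker)
        if idx != -1:
            code = code[:idx]
    return code.rstrip() + "\n"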
References
https://arxiv.org/pdf/2306.11644v2.pdf
https://arxiv.org/pdf/2312.02120.pdf
https://arxiv.org/pdf/2304.08109.pdf
Acknowledgments
Heartfelt thanks to the organizers for their enthusiastic responses in the group and for hosting such a practically oriented code generation competition. It has given me a deeper understanding of the field of code generation. I wish the Qwen model continued success, and I hope for more opportunities like this to learn and improve together. Once again, thank you to the organizers for their hard work and full support.
About AINLP
AINLP is an interesting AI natural language processing community, focusing on sharing related technologies such as AI, NLP, machine learning, deep learning, recommendation algorithms, etc. Topics include LLM, pre-trained models, automatic generation, text summarization, intelligent Q&A, chatbots, machine translation, knowledge graphs, recommendation systems, computational advertising, recruitment information, and job experience sharing. Welcome to follow! To join the technical exchange group, please add the AINLP assistant WeChat (id: ainlp2), specifying your work/research direction + purpose of joining the group.