Exploring Open Source Copilot: Do It Yourself

Author | Jin Shu Cheng Se

Link | https://lowin.li/2022/06/27/pan-dian-kai-yuan-copilot/

1. Background

GitHub Copilot is about to start charging:
GitHub recently announced the end of Copilot’s technical preview; billing starts on August 22, 2022, at $10 per month or $100 per year. Students and maintainers of popular open-source projects can use it for free.
Programmers can hardly do without Copilot:
GitHub claims that about one-third of the code on the site is now written with Copilot’s help. After using Copilot for half a year, I find it hard to work without it; it has taken a lot of repetitive programming off my hands.
Open-source code generation models:
The Hugging Face Model Hub hosts many open-source models that can be downloaded directly, including several code generation models. So why not do it yourself?
Private deployment of a Copilot:
If we deploy a code generation service built on an open-source model and pair it with an editor/IDE plugin, we get a Copilot-like service for ourselves and our colleagues. This has further advantages:
Avoid the occasional network instability of Copilot
Avoid the security risk of uploading code to Copilot
Customize a model that understands your coding habits and existing code by retraining the open-source model

2. Overview

In this post, we first evaluate the open-source code generation models currently available from a user’s perspective, then set up a code generation service and build a VS Code plugin, giving ourselves a private “Copilot”.

3. Review of Open Source Code Generation Models

3.1. Model List

[image: model list]

The list covers models found by searching the Hugging Face Model Hub for the keyword ‘code’, after filtering out open-source models with fewer than 100 monthly downloads and no introduction.
Most of these models focus on generating Python code.
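
The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the author's script: the records, the field names `id`, `downloads` and `description`, and the reading of the rule (a model is dropped only when it has both fewer than 100 monthly downloads and no introduction) are assumptions, and fetching the records from the Hub is left out.

```python
from typing import Dict, List

MIN_MONTHLY_DOWNLOADS = 100  # threshold stated in the text


def filter_models(models: List[Dict], keyword: str = "code") -> List[Dict]:
    """Keep models whose id mentions the keyword, dropping obscure ones."""
    kept = []
    for model in models:
        if keyword not in model["id"].lower():
            continue
        too_few_downloads = model.get("downloads", 0) < MIN_MONTHLY_DOWNLOADS
        no_introduction = not model.get("description", "").strip()
        # Assumed reading of the rule: drop only if both conditions hold.
        if too_few_downloads and no_introduction:
            continue
        kept.append(model)
    return kept


# Hypothetical records for illustration; the download counts are made up.
sample = [
    {"id": "Salesforce/codegen-350M-mono", "downloads": 5000, "description": "CodeGen model"},
    {"id": "someone/tiny-code-model", "downloads": 12, "description": ""},
    {"id": "bert-base-uncased", "downloads": 900000, "description": "BERT"},
]
selected = [m["id"] for m in filter_models(sample)]
print(selected)
```

If the stricter reading is intended (drop a model when either condition holds), changing `and` to `or` in the marked line is enough.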
3.2. Model Testing
Next, we feed each model a code prompt to see what it generates and which pre-trained model understands my intent best.
All models use the same generation configuration:

[image: generation configuration]
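
Sharing one configuration across models might look like the sketch below. The original settings were shown only as an image, so every value here is a hypothetical placeholder, and the helper name `complete` is made up for illustration. The `model` and `tokenizer` arguments are assumed to be a Hugging Face transformers causal language model and its tokenizer, loaded elsewhere.

```python
from typing import Any, Dict

# Hypothetical values; the settings actually used in the tests are not known.
GENERATION_CONFIG: Dict[str, Any] = {
    "max_new_tokens": 64,  # cap on the number of generated tokens
    "do_sample": False,    # greedy decoding, so repeated runs are comparable
    "num_beams": 1,        # no beam search
}


def complete(model: Any, tokenizer: Any, prompt: str) -> str:
    """Generate a completion for `prompt` using the shared configuration."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **GENERATION_CONFIG)
    # Drop the prompt tokens so that only the newly generated code remains.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Keeping the settings in one dictionary means every model is called with identical decoding parameters, which is what makes the outputs comparable.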

3.2.1. Python Code Generation Test 1
Input: [image]
Output (one image per model):
1. code-autocomplete-distilgpt2-python
2. code-autocomplete-gpt2-base
3. CodeGPT-small-py-adaptedGPT2
5. incoder-6B
6. incoder-1B
7. codegen-350M-mono
8. codegen-2B-mono
9. codegen-6B-mono
11. codegen-350M-multi
12. codegen-2B-multi
13. codegen-6B-multi
15. gpt-neo-125M-code-search-py
16. gpt-neo-125M-code-clippy
17. GPT2-python-code-generator
18. codeparrot
19. codeparrot-small

3.2.2. Python Code Generation Test 2
Input: [image]
Output (one image per model):
1. code-autocomplete-distilgpt2-python
2. code-autocomplete-gpt2-base
3. CodeGPT-small-py-adaptedGPT2
5. incoder-6B
6. incoder-1B
7. codegen-350M-mono
8. codegen-2B-mono
9. codegen-6B-mono
11. codegen-350M-multi
12. codegen-2B-multi
13. codegen-6B-multi
15. gpt-neo-125M-code-search-py
16. gpt-neo-125M-code-clippy
17. GPT2-python-code-generator
18. codeparrot
19. codeparrot-small

3.2.3. Python Code Generation Test 3
Input: [image]
Output (one image per model):
1. code-autocomplete-distilgpt2-python
2. code-autocomplete-gpt2-base
3. CodeGPT-small-py-adaptedGPT2
5. incoder-6B
6. incoder-1B
7. codegen-350M-mono
8. codegen-2B-mono
9. codegen-6B-mono
11. codegen-350M-multi
12. codegen-2B-multi
13. codegen-6B-multi
15. gpt-neo-125M-code-search-py
16. gpt-neo-125M-code-clippy
17. GPT2-python-code-generator
18. codeparrot
19. codeparrot-small

3.2.4. Vue.js Code Generation Test 4
Input: [image]
Output (one image per model):
5. incoder-6B
6. incoder-1B
11. codegen-350M-multi
12. codegen-2B-multi
13. codegen-6B-multi
16. gpt-neo-125M-code-clippy

3.2.5. JavaScript Code Generation Test 5
Input: [image]
Output (one image per model):
5. incoder-6B
6. incoder-1B
11. codegen-350M-multi
12. codegen-2B-multi
13. codegen-6B-multi
16. gpt-neo-125M-code-clippy

3.3. Highlights
codegen-6B-mono wrote the TextCNN structure perfectly in Test 1.
3.4. Conclusion
The larger the model, the stronger its capability, as the performance of codegen-6B-mono shows.
Download counts on the Model Hub are often inflated and do not reliably reflect quality, as the performance of code-autocomplete-distilgpt2-python shows.
Domain-specific models are very useful: on Python tasks, codegen-6B-mono outperforms codegen-6B-multi.
Salesforce’s codegen series is a level above the other open-source code generation models.
Most open-source code generation models focus on Python; only a few cover other languages across the stack.

4. Building a Private Code Generation Service

4.1. ONNX Quantization and Compression
Privately deployed models usually run on CPUs, where ONNX Runtime quantization can speed up inference significantly.
The fastgpt library is recommended for ONNX quantization and loading of transformers’ GPT models.
For the codegen series, which transformers did not support at the time of writing, fastgpt also provides a codegen example covering ONNX quantization and code generation.
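
The idea behind weight quantization can be illustrated with a little arithmetic. The following is a simplified sketch of symmetric per-tensor int8 quantization in plain Python; it is not ONNX Runtime's or fastgpt's implementation, and the function names and sample weights are made up for illustration.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half the scale. Smaller weights and integer arithmetic are where the CPU speed-up comes from, at the cost of a small loss in precision.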
4.1.1. FastGPT Installation Method

[image]

4.1.2. FastGPT Quick Usage

[image]

4.2. Private Web Service
In the fastgpt repository’s codegen example, codegen-350M-mono and codegen-350M-multi have already been quantized with ONNX, packaged into Docker images, and uploaded to Docker Hub.
4.2.1. Docker-Compose Startup

[image]

4.2.2. Testing
codegen-350M-multi
[image]
codegen-350M-multi
[image]
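
Testing a service like this amounts to sending it a code prompt over HTTP and reading back the completion. The sketch below uses only the Python standard library. The URL, the route and the payload fields `prompt` and `max_new_tokens` are hypothetical; the real ones depend on how the fastgpt codegen example exposes the model, which is not shown here.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the address the service actually uses.
SERVICE_URL = "http://localhost:8000/generate"


def build_request(prompt: str, max_new_tokens: int = 32) -> urllib.request.Request:
    """Build a JSON POST request carrying the code prompt."""
    # Hypothetical payload fields, chosen only for illustration.
    payload = {"prompt": prompt, "max_new_tokens": max_new_tokens}
    return urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_completion(prompt: str, timeout: float = 10.0) -> str:
    """Send the prompt and return the raw response body as text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request needs no running service; sending it does.
request = build_request("def hello_world():")
print(request.get_method(), json.loads(request.data))
```

An editor plugin would do the same thing: take the text before the cursor as the prompt, call the service, and offer the returned text as a suggestion.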

5. Creating a Private VS Code Plugin
See the VS Code plugin, which has been adapted for the latest VS Code version (1.68.1).
6. Enjoy Coding

[image]

Appendix:
1. Inference Computing Resources
CPU: Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz