The MLNLP (Machine Learning Algorithms and Natural Language Processing) community is a well-known natural language processing community at home and abroad, covering NLP master's and doctoral students, university professors, and corporate researchers.
The community's vision is to promote communication between academia, industry, and enthusiasts in machine learning and natural language processing, and especially to help beginners progress.
Author | Jin Shu Cheng Se
Link | https://lowin.li/2022/06/27/pan-dian-kai-yuan-copilot/
GitHub Copilot is about to start charging:
GitHub recently announced that Copilot's technical preview is over and that it will start charging on August 22, 2022: $10 per month or $100 per year, free for students and maintainers of popular open-source projects.
Programmers can hardly do without Copilot:
GitHub claims that one-third of the new code on the site is now written with Copilot's assistance. After using Copilot for half a year, I find it hard to work without it; it has taken a lot of repetitive programming off my hands.
Open-source code generation models:
The HuggingFace Model Hub hosts many open-source models available for direct download, including several open-source code generation models. So why not do it ourselves?
Private deployment of a Copilot:
If we deploy a code generation service using open-source code generation models, supplemented by an editor/IDE plugin, we can simulate Copilot to provide code generation services for ourselves and our colleagues. Additionally, there are the following advantages:
Avoid occasional network instability issues with Copilot
Avoid security issues of uploading code to Copilot
Customize a model that understands your coding habits and existing code by retraining the open-source model
In this blog, we will first evaluate the current performance of open-source code generation models from the user’s perspective; then we will set up a code generation service and build a Vscode plugin to provide ourselves with a private “Copilot”.
3. Review of Open-Source Code Generation Models


Here we list the models returned by searching the keyword "code" on the HuggingFace Model Hub, filtering out models with fewer than 100 monthly downloads or without a model card.
As can be seen, most of the models focus on generating Python code.
Next, we will try inputting code to test what the code generation models can output and see which pre-trained model understands me better.
Generation configuration is unified as follows:
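The concrete settings did not survive in this copy of the post. As a placeholder, a unified setup typically pins the decoding arguments passed to transformers' `generate()`; the values below are illustrative assumptions, not the author's originals:

```python
# Illustrative unified generation settings (the original values were lost
# in this copy); these are standard transformers generate() arguments.
generation_config = {
    "max_new_tokens": 64,      # cap the length of each completion
    "do_sample": False,        # greedy decoding, for reproducible comparisons
    "num_beams": 1,            # no beam search
    "repetition_penalty": 1.0, # leave repetition unpenalized
}

# The same dict would then be unpacked for every model under test, e.g.:
# output_ids = model.generate(input_ids, **generation_config)
```

Fixing one configuration across all models keeps the comparison fair: any difference in output quality comes from the pre-trained weights, not the decoding strategy.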

3.2.1. Python Code Generation Test 1

1. code-autocomplete-distilgpt2-python

2. code-autocomplete-gpt2-base

3. CodeGPT-small-py-adaptedGPT2









15. gpt-neo-125M-code-search-py

16. gpt-neo-125M-code-clippy

17. GPT2-python-code-generator



3.2.2. Python Code Generation Test 2

1. code-autocomplete-distilgpt2-python

2. code-autocomplete-gpt2-base

3. CodeGPT-small-py-adaptedGPT2









15. gpt-neo-125M-code-search-py

16. gpt-neo-125M-code-clippy

17. GPT2-python-code-generator



3.2.3. Python Code Generation Test 3

1. code-autocomplete-distilgpt2-python

2. code-autocomplete-gpt2-base

3. CodeGPT-small-py-adaptedGPT2









15. gpt-neo-125M-code-search-py

16. gpt-neo-125M-code-clippy

17. GPT2-python-code-generator



3.2.4. Vue.js Code Generation Test 4






16. gpt-neo-125M-code-clippy

3.2.5. JavaScript Code Generation Test 5






16. gpt-neo-125M-code-clippy

codegen-6B-mono perfectly wrote the structure of TextCNN in test 1.
The larger the model, the stronger its capability, as seen in the performance of codegen-6B-mono.
The download volume on Model Hub is often inflated, as seen in the performance of code-autocomplete-distilgpt2-python.
Domain-specific models are very useful, as seen in the performance of codegen-6B-mono compared to codegen-6B-multi on Python tasks.
Salesforce’s codegen series is a level above other open-source code generation models.
Most open-source code generation models focus on Python; only the occasional model covers other full-stack languages.
4. Building a Private Code Generation Service
4.1. ONNX Quantization and Compression
Deployed models generally run on CPUs, where ONNX Runtime quantization can significantly speed up inference.
For ONNX quantization and loading of transformers GPT models, the fastgpt library is recommended.
For the codegen series, which transformers did not yet support at the time, fastgpt also provides a codegen example covering ONNX quantization and code generation.
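For intuition, dynamic int8 quantization replaces float32 weight matrices with int8 values plus a scale factor, cutting weight storage and memory traffic roughly 4x at a small accuracy cost. A minimal pure-Python sketch of the symmetric per-tensor scheme (simplified relative to what ONNX Runtime actually does):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single scale factor. A simplified sketch of the scheme ONNX
    Runtime's dynamic quantization applies to weight tensors."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# int8 storage is 4x smaller than float32, and the reconstruction error
# per weight is bounded by half a quantization step (scale / 2).
```

Activations are quantized on the fly at run time (hence "dynamic"), so no calibration dataset is needed, which is why this mode is the usual first choice for CPU deployment of transformer models.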
4.1.1. FastGPT Installation Method
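The installation command was lost in this copy. Assuming the library is the `fastgpt` package published on PyPI by the author, installation would be:

```shell
# Assumed install command -- verify the package name against the
# fastgpt repository's README before relying on it.
pip install fastgpt
```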

4.1.2. FastGPT Quick Usage

4.2. Deploying the Code Generation Service with Docker
In the codegen example of the fastgpt repository, codegen-350M-mono and codegen-350M-multi have already been ONNX-quantized, packaged into images, and uploaded to Docker Hub.
4.2.1. Docker-Compose Startup
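The compose file itself did not survive in this copy. Once the container is up, an editor plugin talks to it over HTTP; the sketch below shows a minimal client, where the port, endpoint path, and JSON field names are all illustrative assumptions rather than the actual fastgpt API:

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:8080/generate"  # hypothetical endpoint

def build_payload(prompt, max_new_tokens=64):
    """Assemble the request body; the field names here are assumptions."""
    return {"text": prompt, "max_new_tokens": max_new_tokens}

def complete(prompt):
    """POST the prompt to the local code-generation service and return
    the parsed JSON response (requires the container to be running)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        SERVICE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

Keeping the service behind a plain HTTP interface is what makes the editor side simple: the plugin only needs to send the text before the cursor and splice the returned completion back in.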



5. Creating a Private Vscode Plugin
See the accompanying Vscode plugin, adapted for the latest Vscode version at the time of writing (1.68.1).

1. Inference Computing Resources
CPU: Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz
