With the new fill-in-the-middle paradigm, GitHub engineers have improved the way GitHub Copilot contextualizes your code. And by continuing to develop and test advanced retrieval algorithms, they are working to make the AI tool even more sophisticated.

To make collaborating with GitHub Copilot feel like a meeting of the minds between a developer and a pair programmer, GitHub's machine learning experts have been busy researching, developing, and testing new features, many of which focus on improving the AI pair programmer's contextual understanding. That's because good communication is key to pair programming, and inferring context is critical to effective communication.
To uncover the details, we asked GitHub's researchers and engineers what they are doing to help GitHub Copilot improve its contextual understanding. Here's what we found.
From OpenAI's Codex Model to GitHub Copilot
When OpenAI released GPT-3 in June 2020, GitHub knew that developers would benefit from a product that leveraged that model specifically for coding. So we provided feedback to OpenAI as it built Codex, a descendant of GPT-3 and the LLM that would power GitHub Copilot. The pair programmer launched as a technical preview in June 2021 and became generally available in June 2022 as the world's first at-scale generative AI coding tool.
To ensure the model has the best information for making good predictions quickly, GitHub's machine learning (ML) researchers have done a lot of work called prompt engineering (explained in more detail below) so that the model delivers low-latency, contextually relevant responses.
Although GitHub is always experimenting with new models, Codex was the first truly powerful generative AI model available, said GitHub ML engineer David Slater. "The real-world experience we gained from iterating on the model and making timely improvements has been invaluable."
All of these experiments feed into the pair programming experience, ultimately allowing developers to spend more time on more fulfilling work. GitHub machine learning researcher Alice Li noted that the tool is often especially helpful for starting a new project or file from scratch, because it gives developers a starting point they can adjust and refine as needed.
Why Context Matters
Developers use details from pull requests, folders in a project, open issues, and more to contextualize their code. When it comes to a generative AI coding tool, we need to teach it what information to use to do the same.
Transformer LLMs excel at connecting the dots and thinking big picture. Generative AI coding tools are powered by large language models (LLMs), sets of algorithms trained on vast amounts of code and human language. Today's most advanced LLMs are transformers, which makes them excel at drawing connections between the text a user inputs and the output the model has already generated. This is why today's generative AI tools provide responses that are more contextually relevant than those of previous AI models.
But they still need to be told what information is relevant to your code. Currently, the transformer that powers GitHub Copilot can process about 6,000 characters at a time. While that is enough to advance and accelerate tasks like code completion and summarizing code changes, the limited character count means that not all of a developer's code can be used as context.
Therefore, our challenge is not only to figure out what data to feed the model, but also how best to order and enter that data so it yields the best suggestions for developers.
How GitHub Copilot Understands Your Code
This all comes down to prompts, which are compilations of IDE code and relevant context that are fed into the model. Prompts are generated by algorithms in the background, at any point in your coding. That's why GitHub Copilot generates coding suggestions whether you're in the middle of writing code, have just finished a comment, or are working through some complex logic.
Here's how a prompt is created: a series of algorithms first select relevant code snippets or comments from your current file and other sources (which we'll dig into below). Those snippets and comments are then prioritized, filtered, and assembled into the final prompt.
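To make that select, prioritize, filter, and combine flow concrete, here is a minimal sketch in Python. It is purely illustrative: the Snippet type, the relevance score, and the 6,000-character budget are assumptions for the example, not GitHub's actual prompt library.

```python
# A hypothetical sketch of prompt assembly, assuming a fixed character
# budget and a precomputed relevance score. Not GitHub's actual prompt library.
from dataclasses import dataclass

@dataclass
class Snippet:
    source: str   # e.g., "current_file" or "open_tab:utils.py"
    text: str
    score: float  # relevance score assigned by a retrieval heuristic

def build_prompt(snippets: list[Snippet], prefix: str, budget: int = 6000) -> str:
    """Prioritize, filter, and combine snippets with the code before the cursor."""
    # Prioritize: most relevant snippets first.
    ranked = sorted(snippets, key=lambda s: s.score, reverse=True)

    # Filter: keep only what fits in the remaining character budget.
    chosen, used = [], len(prefix)
    for snippet in ranked:
        if used + len(snippet.text) > budget:
            continue
        chosen.append(snippet)
        used += len(snippet.text)

    # Combine: context snippets (as comments) followed by the developer's code.
    context = "\n".join(f"# from {s.source}\n{s.text}" for s in chosen)
    return f"{context}\n{prefix}"

prompt = build_prompt(
    snippets=[Snippet("open_tab:utils.py", "def slugify(title): ...", 0.82)],
    prefix="def make_url(title):\n    ",
)
```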
GitHub Copilot's contextual understanding has matured over time. The first version could only treat the file you were working on in your IDE as contextually relevant. But we knew that context goes far beyond that. Now, just a year later, we are experimenting with algorithms that consider your entire codebase to generate tailored suggestions.
Let’s take a look at how we achieve this:
- Prompt engineering is the art of crafting prompts so that the model makes the most useful predictions for the user. A prompt tells an LLM, including the one behind GitHub Copilot, what data to process and in what order to process it in order to contextualize your code. Most of this work happens in what's called a prompt library, where our in-house ML experts use algorithms to extract and prioritize a variety of sources of information about the developer's context, creating prompts that will be processed by the GitHub Copilot model.
- Adjacent tabs is what we call the technique that lets GitHub Copilot process all of the files open in a developer's IDE, not just the single file being worked on. By opening every file relevant to their project, developers automatically invoke GitHub Copilot to sift through all of that data, find matching snippets between the code in the open files and the code around the cursor, and add those matches to the prompt.
When developing adjacent tabs, the GitHub Next team and in-house ML researchers ran A/B tests to find the best parameters for identifying matches between the code in the IDE and the code in open tabs. They found that setting a very low bar for what counts as a match actually produced the best coding suggestions.
By including every bit of context, adjacent tabs helped increase the rate at which users accept GitHub Copilot's suggestions by 5%.
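As a rough illustration of what this kind of matching can look like, the sketch below slides a window over every open file and keeps any snippet whose token overlap with the code around the cursor clears a deliberately low threshold. The window size, the Jaccard similarity, and the 0.1 threshold are all assumptions for the example, not Copilot's actual parameters.

```python
# Hypothetical "adjacent tabs"-style snippet matching: compare the code around
# the cursor with windows of code from other open files using token-set
# (Jaccard) similarity, keeping a deliberately low matching threshold.
def tokens(code: str) -> set[str]:
    return set(code.split())

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def matching_snippets(cursor_context: str, open_files: dict[str, str],
                      window: int = 10, threshold: float = 0.1):
    """Return (filename, snippet, score) for windows similar to the cursor context."""
    query = tokens(cursor_context)
    matches = []
    for name, text in open_files.items():
        lines = text.splitlines()
        for i in range(max(1, len(lines) - window + 1)):
            snippet = "\n".join(lines[i:i + window])
            score = jaccard(query, tokens(snippet))
            if score >= threshold:  # a low bar keeps more candidate context
                matches.append((name, snippet, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)
```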
- The fill-in-the-middle (FIM) paradigm widens the context aperture even further. Before FIM, only the code before the cursor was included in the prompt, and the code after the cursor was ignored. (At GitHub, we refer to the code before the cursor as the prefix and the code after it as the suffix.) With FIM, we can tell the model which part of the prompt is the prefix and which part is the suffix.
Even if you're starting a file from scratch and only have its skeleton, we know that coding isn't linear or sequential. So as you bounce around a file, FIM helps GitHub Copilot provide better suggestions for the spot in the file where your cursor sits, or for the code that should go between the prefix and the suffix.
According to A/B testing, FIM delivered a 10% relative improvement in performance, meaning the rate at which developers accepted the completions shown to them rose by 10%. And thanks to optimized caching, adjacent tabs and FIM work in the background without adding any latency.
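The schematic example below shows what a FIM-style prompt can look like. The sentinel tokens follow a convention used by some open code models; Copilot's internal prompt format is not public, so treat the exact markers as placeholders.

```python
# A schematic fill-in-the-middle prompt. The <fim_*> markers are placeholders
# borrowed from open code models, not Copilot's actual format.
prefix = "def fahrenheit_to_celsius(f):\n    "
suffix = "\n    return c\n"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# The model is asked to generate the code that belongs between the prefix and
# the suffix, here something like: c = (f - 32) * 5 / 9
print(fim_prompt)
```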
Improving Semantic Understanding
Today, we are experimenting with vector databases that can create a customized coding experience for developers working in private repositories or with proprietary code. Generative AI coding tools use something called embeddings to retrieve information from a vector database.
- What is a vector database? It is a database that indexes high-dimensional vectors.
- What are high-dimensional vectors? They are mathematical representations of objects, and because they can model an object across many dimensions, they can capture its complexity. When used properly to represent code snippets, they can capture both the semantics and even the intent of the code, not just its syntax.
- What are embeddings? In the context of coding and LLMs, embeddings are representations of pieces of code as high-dimensional vectors. Because an LLM has "knowledge" of both programming and natural language, it can capture both the syntax and the semantics of the code in those vectors.
Here’s how they work together:
- Algorithms create embeddings for all of the code snippets in the repository (potentially billions of them) and store them in the vector database.
- Then, as you code, algorithms create embeddings for the snippets in your IDE.
- Algorithms then make approximate matches, in real time, between the embeddings created for your IDE snippets and the embeddings stored in the vector database. The vector database is what allows algorithms to quickly search the vectors it stores for approximate matches (rather than only exact ones), even when it holds billions of embedded code snippets.
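To make the retrieval step concrete, here is a minimal sketch of embedding-based lookup using brute-force cosine similarity over a small in-memory matrix. It is an assumption-laden illustration: the embed function is a random stand-in for a real embedding model, and production vector databases use approximate nearest-neighbor indexes to stay fast at the scale of billions of vectors.

```python
import numpy as np

# Stand-in embedding function: a real system would call an embedding model.
def embed(snippet: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(snippet)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# "Vector database": one embedding per repository snippet.
repo_snippets = [
    "def parse_config(path): ...",
    "class RetryPolicy: ...",
    "def to_celsius(f): return (f - 32) * 5 / 9",
]
index = np.stack([embed(s) for s in repo_snippets])

def nearest(query: str, k: int = 2):
    """Return the k repository snippets whose embeddings are closest to the query."""
    q = embed(query)
    scores = index @ q  # cosine similarity, since all vectors are unit-length
    top = np.argsort(scores)[::-1][:k]
    return [(repo_snippets[i], float(scores[i])) for i in top]

print(nearest("convert fahrenheit to celsius"))
```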
GitHub senior ML researcher Alireza Goudarzi explains that developers are used to retrieving data with hashes, which typically look for exact, character-by-character matches. "But because embeddings come from LLMs trained on vast amounts of data, they create a sense of semantic closeness between code snippets and natural-language prompts."
Read the three sentences below and identify which two sentences are semantically most similar.
- Sentence A: The king moves and captures a piece.
- Sentence B: The king is crowned at Westminster Abbey.
- Sentence C: The two white rooks are still in play.
The answer is sentences A and C, because both are about chess. While sentences A and B are similar in syntax or structure, since both have the king as the subject, they are semantically different: "king" is used in two different contexts.
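You can try this comparison yourself with the open-source sentence-transformers library, as in the sketch below. This is not the embedding model behind GitHub Copilot; it only illustrates how embeddings turn "semantic closeness" into a number, and a reasonably good semantic model tends to score A and C as the closer pair despite the shared "king" subject.

```python
# Illustration only: semantic similarity with the open-source
# sentence-transformers library (not the model behind GitHub Copilot).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The king moves and captures a piece.",       # A
    "The king is crowned at Westminster Abbey.",  # B
    "The two white rooks are still in play.",     # C
]
emb = model.encode(sentences, convert_to_tensor=True)

print("A vs B:", util.cos_sim(emb[0], emb[1]).item())
print("A vs C:", util.cos_sim(emb[0], emb[2]).item())
```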
As mentioned above, we are still experimenting with retrieval algorithms. We are designing this feature with enterprise customers in mind, particularly those who want a customized coding experience with their private repositories and who explicitly opt in to use the feature.
Conclusion
Last year, we conducted quantitative research on GitHub Copilot and found that developers code 55% faster when using the pair programmer. That means developers feel more productive, complete repetitive tasks more quickly, and can focus more on satisfying work. But our work doesn't stop there.
GitHub's product and R&D teams, including GitHub Next, have been collaborating with Microsoft Azure AI Platform to keep improving GitHub Copilot's contextual understanding. Much of the work that helps GitHub Copilot contextualize your code happens behind the scenes. As you write and edit your code, GitHub Copilot responds in real time by generating prompts (in other words, by prioritizing and sending relevant information to the model based on your actions in the IDE) to keep giving you the best possible coding suggestions.

