GitHub Copilot

GitHub Copilot introduces several risks common to AI systems, including risks of model bias, privacy issues, compliance issues, and environmental impact. GitHub and OpenAI have implemented some mitigation measures to address certain risks. These include a content filter to block offensive language and personally identifiable information, purchasing of carbon offsets to achieve carbon neutrality, and internal testing to evaluate accessibility. However, the tool lacks explainability into how it generates suggestions, visibility into how it is used, and configurability of its controls.

Product Description

GitHub Copilot is an AI-powered pair programmer designed to help developers write code faster and with less effort. It uses context from comments and code to suggest individual lines and whole functions instantly. GitHub Copilot is powered by OpenAI Codex, a generative pretrained language model created by OpenAI. It is available as an extension for several popular integrated development environments (IDEs) [1].

The Codex model underlying GitHub Copilot was trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub. It is designed to generate the best code possible given the context it has access to, but it does not test the code it suggests, so the code may not always work or even make sense. Developers are advised to carefully test, review, and vet the code before pushing it to production [1]. In February 2023, GitHub began relying on an updated Codex model [12]. The differences between the original Codex model and the one currently in use are unclear.

GitHub, the company providing GitHub Copilot, operates a well-established platform for software development and version control using Git. Founded in 2008 and acquired by Microsoft in 2018, GitHub has grown to become one of the largest code repositories in the world, with millions of users and repositories. The company offers various support options, including documentation, community forums, and direct support for enterprise customers [1].

Profile last updated: July 13, 2023

Intended Use Case

GitHub Copilot is designed to be an AI pair programmer that assists developers in writing code more efficiently and with less effort. It is intended to be used within a customer's organization to improve developer productivity, satisfaction, and overall code quality [1].

GitHub Copilot provides code suggestions based on the context of the developer's current work, including comments and code. It can suggest individual lines of code or entire functions, helping developers navigate unfamiliar libraries or frameworks. Developers can save time, reduce mental effort, and focus on more meaningful tasks. GitHub Copilot is not intended to replace developers but rather to augment their capabilities and enable them to be more productive [1].
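
As a hedged illustration of this comment-to-code workflow, consider the sketch below. The comment, function name, and suggested completion are invented for this profile, not actual Copilot output; they show only the general shape of the interaction.

```python
# Developer-written comment describing the desired behavior:
# return the n largest values in a list, sorted in descending order

# A plausible tool-suggested completion (illustrative only):
import heapq

def n_largest_descending(values, n):
    return heapq.nlargest(n, values)
```

As GitHub's documentation notes, the developer remains responsible for reviewing and testing any such suggestion before use [1].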

Risk and Mitigation Summary

The following table provides a brief summary of which common genAI-driven risks are present in the GitHub Copilot product and which risks have been addressed (but not necessarily eliminated) by deliberate mitigative measures provided with the tool.

For a definition of each risk, see below.

| Risk | Present | Built-in Mitigation |
| --- | --- | --- |
| Abuse & Misuse | ⚠️ | |
| Compliance | ⚠️ | ✅ |
| Environmental & Societal Impact | ⚠️ | ✅ |
| Explainability & Transparency | ⚠️ | |
| Fairness & Bias | ⚠️ | ✅ |
| Long-term & Existential Risk | – | N/A |
| Performance & Robustness | ⚠️ | |
| Privacy | ⚠️ | ✅ |
| Security | ⚠️ | ✅ |

Abuse & Misuse

Pertains to the potential for AI systems to be used maliciously or irresponsibly, including for creating deepfakes, automated cyber attacks, or invasive surveillance systems. Abuse specifically denotes the intentional use of AI for harmful purposes.

Arbitrary code generation

  • Because GitHub Copilot is capable of generating arbitrary code, it could be used to generate code used in cyber attacks. For instance, a malicious user could use Copilot to generate code for orchestrating a bot network. A successful cyber attack would likely require additional hacking expertise; GitHub Copilot alone is unlikely to enable a malicious actor to carry out a cyber attack but it could lower the barrier for a less sophisticated hacker.

Compliance

Involves the risk of AI systems violating laws, regulations, and ethical guidelines (including copyright risks). Non-compliance can lead to legal penalties, reputational damage, and loss of user trust.

Copyright infringement

  • GitHub Copilot is trained on publicly available code [1]. GitHub's documentation does not disclose the composition of this training data. It is possible that Copilot will provide (i.e. reproduce) this code in response to user prompts and the reproduced code may not be approved for all uses. For instance, code published with a "copyleft" license could appear in the training data for Copilot and GitHub does not guarantee that, if it is reproduced by the Copilot tool, it will include the appropriate attribution. The legality of the use of the reproduced code is subject to ongoing litigation [2].

Regulatory compliance

  • Because GitHub Copilot is capable of generating arbitrary code, it could be used to build systems that violate laws and regulations. For instance, Copilot could be used to generate software applications that track usage and user engagement in ways that violate regulations in the jurisdiction in which the application is deployed. This risk is generally contingent on the GitHub Copilot user prompting the tool for code that delivers a specific functionality. Nevertheless, it is not impossible for code with illicit functionality to be generated inadvertently.
  • GitHub does not store prompts or model responses sent to and returned from the Codex model. Nevertheless, some uses may violate data security laws. For instance, use of Copilot may not be compliant for projects requiring government security clearance, even if the content exchanged with the product is innocuous.

GitHub Copilot employs a content filter targeted at addressing the risk of copyright infringement. See the Mitigations section below for more details.

Environmental & Societal Impact

Concerns the broader changes AI might induce in society, such as labor displacement, mental health impacts, or the implications of manipulative technologies like deepfakes. It also includes the environmental implications of AI, particularly the strain on natural resources and carbon emissions caused by training complex AI models, balanced against the potential for AI to help mitigate environmental issues.

Carbon footprint

  • According to OpenAI, the developer of the Codex model that underlies GitHub Copilot, Codex is a fine-tuned version of the 12 billion parameter variant of the GPT-3 model. It was trained in either late 2020 or early 2021. A direct estimate of the training compute and carbon footprint is not possible [3]. For reference, a similar model, Meta's LLaMA-13B, was trained in late 2022 or early 2023 using approximately 59 MWh of energy [4], roughly equivalent to the annual consumption of 5 U.S. households [8] (a rough back-of-the-envelope check of this comparison appears after this list). Since the Meta model was trained with the benefit of more efficient hardware (two years of hardware advancement) and more efficient training regimes (less experimentation required due to two years of expertise gained in the field), and did not undergo fine-tuning, Credo AI believes that the metrics for Meta's model should be viewed as a lower bound on the emissions and energy draw of the Codex model's training.
  • Using Meta's LLaMA-13B model as a reference, models of a size similar to the Codex model can be run on the NVIDIA V100 GPU [4] at inference time, drawing a maximum of 300 W [6] per instance. It is not possible to derive a reliable third-party estimate of GitHub Copilot's ongoing carbon footprint; doing so would require data on the average number of users at any given time and information on Microsoft's Azure configuration (i.e., the number of users that can be served by a single model instance).
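
The household comparison above can be checked with simple arithmetic using the EIA figure of about 886 kWh per month for an average U.S. household [8]; the sketch below restates that calculation.

```python
# Back-of-the-envelope check of the household comparison above.
llama_13b_training_mwh = 59              # reported training energy for LLaMA-13B [4]
household_kwh_per_year = 886 * 12        # ~10,632 kWh/year per average U.S. household [8]
household_mwh_per_year = household_kwh_per_year / 1000

equivalent_households = llama_13b_training_mwh / household_mwh_per_year
print(f"{equivalent_households:.1f}")    # ~5.5, consistent with "roughly 5 households" above
```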

Software developer disruption

  • The impact on developers is uncertain. GitHub's research [7], and anecdotal evidence shared on social media, suggest that Copilot is effective in increasing developer productivity. This could have the effect of driving down market demand for skilled programmers -- if existing programmers are able to fill 100% of market demand for programming tasks, any efficiency gains would reduce the number of programmers needed by the market. Alternatively, it could increase demand for programming -- as efficiency increases, more tasks become feasible leading to a net increase in need for skilled programmers.
  • The tool (and tools like it) has the potential to create a two-tiered environment. If the efficiency gains touted by GitHub [7] are real, market demand for programmers who do not use and are not skilled in the use of Copilot may decrease.
  • Use of GitHub Copilot may lead to reliance on the tool. It is possible that users of GitHub Copilot will develop skill in "prompt engineering" to elicit more useful model outputs, while placing less emphasis on practicing certain programming skills that are tangential to a Copilot-assisted workflow.

Microsoft, GitHub's parent organization, claims net-neutral emissions through purchase of carbon credits. See the Mitigations section below for more details.

Explainability & Transparency

Refers to the ability to understand and interpret an AI system's decisions and actions, and the openness about the data used, algorithms employed, and decisions made. Lack of these elements can create risks of misuse, misinterpretation, and lack of accountability.

Model outputs

  • GitHub Copilot provides no explanations of how it arrived at its outputs given the user's prompts.

Training and evaluation data

  • Information about the data used to train the Codex model is limited. The data are drawn from "54 million public software repositories hosted on GitHub" and underwent filtering on the basis of file size [3]. Information on the licenses or quality of the code in the training set is limited.

Design decisions

  • The Codex model is fine-tuned from the original GPT-3 model [3]. The academic paper on Codex [3] describes the training methods, with justification, in detail.

Fairness & Bias

Arises from the potential for AI systems to make decisions that systematically disadvantage certain groups or individuals. Bias can stem from training data, algorithmic design, or deployment practices, leading to unfair outcomes and possible legal ramifications.

Language

  • "Given public sources are predominantly in English, GitHub Copilot will likely work less well in scenarios where natural language prompts provided by the developer are not in English and/or are grammatically incorrect. Therefore, non-English speakers might experience a lower quality of service." [1]

Accessibility

  • At this writing, GitHub Copilot does not have any advertised features targeted at improving accessibility for a broad class of users.

Offensive outputs or biased language

  • GitHub Copilot's outputs are a function of the data it was trained on. Since it was trained on millions of public GitHub repositories, without (publicly known) filtering for offensive language, it is possible that Copilot could output offensive language in the form of generated code comments or variable names.

GitHub advertises in the Copilot product documentation that they are working with developers with disabilities to ensure accessibility. See the Mitigations section below for more details.

GitHub Copilot employs a content filter targeted at addressing the risk of outputting offensive language. See the Mitigations section below for more details.

Long-term & Existential Risk

Considers the speculative risks posed by future advanced AI systems to human civilization, either through misuse or due to challenges in aligning their objectives with human values.

N/A

Performance & Robustness

Pertains to the AI's ability to fulfill its intended purpose accurately and its resilience to perturbations, unusual inputs, or adverse situations. Failures of performance are fundamental to the AI system performing its function. Failures of robustness can lead to severe consequences, especially in critical applications.

  • GitHub Copilot may output code containing bugs or security vulnerabilities [1]. GitHub's documentation provides no evaluation of the model's performance on common programming tasks nor its robustness to a variety of prompting strategies. GitHub has performed research on user experiences, finding that approximately 46% of model-proposed completions are accepted by developers [12]. See the [Formal Evaluations](#formal-evaluations) section for more details.
  • OpenAI evaluated the Codex model independently of GitHub. They report 28.8% pass@1 performance on HumanEval, a benchmark of realistic programming problems. The pass@k metric tasks the model with generating k independent outputs in response to a prompt and counts a problem as solved if any of the k outputs passes the problem's pre-defined unit tests (see the sketch below). If users only prompt the model once, choosing to accept or reject the single suggestion and move on regardless of the result, this 28.8% success rate may be reasonably representative of real-world performance [3]. These assessments likely do not reflect the updated Codex model introduced in February 2023 [12].
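
The pass@k metric can be made concrete with the unbiased estimator described in the Codex paper [3]: generate n samples per problem, count how many (c) pass the problem's unit tests, and estimate the probability that at least one of k samples would pass. The sketch below restates that estimator; the sample counts in the usage example are invented for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper [3].

    n: total samples generated for a problem
    c: samples that pass the problem's unit tests
    k: evaluation budget (number of samples the user is assumed to try)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples for one problem, 58 of which pass its tests.
print(pass_at_k(200, 58, 1))    # 0.29 -- pass@1 reduces to c/n
print(pass_at_k(200, 58, 100))  # probability that at least one of 100 samples passes
```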

Privacy

Refers to the risk of AI infringing upon individuals' rights to privacy, through the data they collect, how they process that data, or the conclusions they draw.

  • GitHub Copilot may output names, contact information, or other personally identifiable information from its training data [1]. For instance, identifying information encoded in code comments in the training data could be reproduced by the model. The frequency of this phenomenon is unknown -- GitHub describes it as "very rare".
  • GitHub Copilot for Business does not log user prompts or model outputs. There is limited risk of sensitive data "leaking" to GitHub or its model provider OpenAI via direct interaction with the product.

GitHub Copilot employs a content filter targeted at addressing the risk of outputting PII. See the Mitigations section below for more details.

Security

Encompasses potential vulnerabilities in AI systems that could compromise their integrity, availability, or confidentiality. Security breaches could result in significant harm, from incorrect decision-making to privacy violations.

Vulnerable code generation

  • GitHub Copilot may output code containing vulnerabilities. GitHub does not provide estimates of how frequently this occurs and places responsibility on users to verify code safety [1].

Other security

  • As GitHub Copilot for Business does not store user prompts or model outputs, the risks of using the service are on par with those of using a non-AI, internet-connected application. Copilot's source code (including extensions meant to interface with third-party IDEs) may contain vulnerabilities, which could lead to security exposure for users.

**As of the February 2023 update to Copilot [12], the product uses an LLM-based vulnerability scanner to identify some security vulnerabilities in generated code. We discuss this mitigation measure further in the [Mitigations](#mitigation-measures) section.**

Mitigation Measures

In this section, we discuss mitigation measures that are built into the product (regardless of whether they are enabled by default). We also comment on the feasibility of a procuring organization governing the use of the tool by its employees.

**Content Filter.** GitHub Copilot has a content filter to address several common risks of genAI systems. The filter is described in the GitHub Copilot FAQs page [1]. Its functionality is as follows:

  • It "blocks offensive language in the prompts and to avoid synthesizing suggestions in sensitive contexts". No details are provided regarding the effectiveness, performance, or robustness of this feature. This feature appears to be enabled by default and does not appear, from available documentation, to be configurable [1].
  • It "checks code suggestions with their surrounding code of about 150 characters against public code on GitHub. If there is a match or near match, the suggestion will not be shown to [the user]." GitHub does not provide details about the effectiveness, performance, or robustness of this feature. This feature is configurable for Organization customers. It is not documented whether the feature is enabled by default in Organization accounts [9].
  • It "blocks emails when shown in standard formats". According to GitHub, "it’s still possible to get the model to suggest this sort of content if you try hard enough." No details are provided regarding the effectiveness, performance, or robustness of this feature. This feature appears to be enabled by default and does not appear, from available documentation, to be configurable [1].

**Carbon Neutrality.** Microsoft, GitHub's parent, claims to be carbon neutral [10]. They achieve this through the purchase of carbon credits and offsets. They have publicly committed to reaching net-zero emissions by 2030. Because of GitHub's status as a Microsoft subsidiary, it is likely that all systems relevant to GitHub Copilot (including the Codex model) are deployed on Microsoft's Azure cloud platform and thus are included in Microsoft's broader carbon accounting.

**Accessibility Testing.** GitHub is "conducting internal testing of GitHub Copilot’s ease of use by developers with disabilities" [1]. The company encourages users who identify usability issues to reach out to a dedicated email address. No details are provided about the status of these tests.

**Vulnerability Filter.** As of the February 2023 update to Copilot, the service includes a "vulnerability prevention system" which uses large language models to analyze generated code with the goal of identifying and blocking common security vulnerabilities, such as SQL injection, path injection, and hardcoded credentials [12]. Credo AI was unable to find details on the performance or effectiveness of this mitigation measure. The vulnerability filter is unlikely to identify and block all possible security vulnerabilities.

Governability

For an organization to govern its development or use of an AI system, two functionalities are key: the ability of the organization to observe usage patterns among its employees and the ability of the organization to implement and configure controls to mitigate risk. Credo AI assesses systems on these two dimensions.

GitHub Copilot does not provide usage visibility for its Copilot for Business customers. Enterprises have no mechanism to observe inputs to or outputs from the underlying Codex model, nor can enterprises view statistical summaries of usage.

GitHub Copilot provides limited configurability of controls (see above). GitHub Copilot does not permit enterprises to implement and configure their own (or third party) technical controls to manage risks.

Formal Evaluations & Certifications

Evaluations

The developers of GitHub Copilot and the Codex model each performed research studies to assess the performance and efficacy of using the tool [7, 3].

GitHub's study focused on developer experiences using the tool. The primary results come from a survey sent to developers enrolled in GitHub's technical preview. The survey had approximately 2,000 respondents (11.7% response rate) and may be subject to significant response bias. The survey found that "between 60-75%" of users felt more fulfilled in their job and, in a controlled study of 95 developers, found that Copilot increased the speed of carrying out a specific task by 55% [7].

More recently, the updated Codex model yielded a 19 percentage point increase in the proportion of developers' code files being generated by Copilot: from 27% to 46%. It is unclear whether this statistic refers to files entirely generated by Copilot or simply files containing Copilot-generated code [12]. The February 2023 update also introduced a new model, separate from Codex, to predict whether a user will accept a particular suggestion. GitHub claims this model yielded a 4.5% reduction in the rate of unwanted suggestions [12].

OpenAI's study proposed a new benchmark, HumanEval, consisting of 164 novel programming problems specifically crafted to avoid testing the model on problems that may appear in its training data because they exist on the internet. The problems are accompanied by unit tests to enable automatic, objective assessment in lieu of human judgement. The model achieves a 28.8% pass rate when generating a single output per problem. When permitted to generate 100 outputs per problem, and deemed correct if any one output passes, the model achieves a 72.3% pass rate. OpenAI argues that this is representative of the write-test-debug process; however, they advocate for using output ranking to choose a single best output, by which metric the model achieves a 44.5% pass rate [3].

Several independent studies have assessed GitHub Copilot's performance and efficacy. Credo AI judged relatively few of these to constitute high-quality academic research. One paper [11] attempted to rigorously assess the correctness of Copilot on fundamental programming problems (sorting, search, etc.) and compared Copilot's solutions to a benchmark dataset of human solutions to programming tasks. The authors found that, based on human judgement of correctness, Copilot produced correct solutions ranging from 0% of the time (across prompt samples) for some sorting and graph algorithms to 100% of the time for some search algorithms. Repeating their trials across a 30-day window, they also found the model's outputs to be inconsistent over time, though this may be due to the temperature settings of the underlying Codex model. The authors do not report an overall correctness rate for the fundamental algorithms task. The benchmark dataset contains both correct and buggy human solutions for several programming problems; the authors report a correctness ratio (the proportion of correct solutions for a given task in the benchmark) and compare it to Copilot's performance. Copilot outperforms the benchmark on 3 of 5 tasks, based on pass@1 accuracy. The authors suggest that Copilot is best leveraged by "expert developers" and may not be suitable for novices, who may be unable to detect buggy or non-optimal solutions.

Certifications

Credo AI has identified the following regulations and standards as relevant to the privacy, security, and compliance requirements of our customers. GitHub's advertised compliance is detailed below. For more details, see https://github.com/security

Conclusion

GitHub Copilot is an AI-based tool designed to assist software developers in writing code. It is powered by OpenAI's Codex model, a pretrained language model trained on millions of lines of open-source code. Copilot provides code suggestions and completions based on the context of the code the developer is currently writing. It is intended to increase developer productivity, satisfaction, and code quality. However, it also introduces several risks common to AI systems, including risks of model bias, privacy issues, compliance issues, and environmental impact.

GitHub and OpenAI have implemented some mitigation measures to address certain risks. These include a content filter to block offensive language and personally identifiable information, purchasing of carbon offsets to achieve carbon neutrality, and internal testing to evaluate accessibility. However, the tool lacks explainability into how it generates suggestions, visibility into how it is used, and configurability of its controls. Formal evaluations of the tool have found it can increase developer speed and satisfaction but that it struggles with some complex programming tasks, achieving a wide variance of correctness results across evaluations.

Although a useful productivity tool, GitHub Copilot introduces risks that require governance to address. The lack of visibility and configurability poses challenges for organizations aiming to manage risks from the tool and ensure compliant and ethical use. Additional research into the tool’s abilities, limitations, and best practices for oversight would benefit users and stakeholders. With proper governance, Copilot could become an asset, but without it, it risks becoming a liability.

References

[1] GitHub Copilot FAQs - https://github.com/features/copilot

[2] GitHub Copilot Class Action Lawsuit - https://githubcopilotinvestigation.com/

[3] Evaluating Large Language Models Trained on Code - https://arxiv.org/pdf/2107.03374.pdf

[4] LLaMA: Open and Efficient Foundation Language Models - https://arxiv.org/pdf/2302.13971.pdf

[5] GitHub Copilot adds 400K subscribers in first month - https://www.ciodive.com/news/github-copilot-microsoft-software-developer/628587/

[6] NVidia V100 - https://www.nvidia.com/en-us/data-center/v100/

[7] GitHub Copilot Research Blogpost - https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

[8] U.S. Energy Information Administration - https://www.eia.gov/tools/faqs/faq.php?id=97&t=3#:~:text=In%202021%2C%20the%20average%20annual,about%20886%20kWh%20per%20month.

[9] Configuring GitHub Copilot settings on GitHub.com - https://docs.github.com/en/copilot/configuring-github-copilot/configuring-github-copilot-settings-on-githubcom

[10] Microsoft will be carbon negative by 2030 - https://blogs.microsoft.com/blog/2020/01/16/microsoft-will-be-carbon-negative-by-2030/

[11] GitHub Copilot AI pair programmer: Asset or Liability? - https://www.sciencedirect.com/science/article/pii/S0164121223001292

[12] GitHub Copilot now has a better AI model and new capabilities - https://github.blog/2023-02-14-github-copilot-now-has-a-better-ai-model-and-new-capabilities/


Notes

Italics denote Credo AI definitions of key concepts.

AI Disclosure: The "Product Description", "Intended Use Case", and "Conclusion" sections were generated with assistance from chat-tuned large language models. For the former two sections, Credo AI fed official product documentation to OpenAI's ChatGPT and prompted the model to generate text with the relevant information. For the latter section, Credo AI fed the remainder of this risk profile as background reference information and prompted Anthropic's Claude chatbot to summarize the information. The final text was reviewed for accuracy and suitability and underwent human editing by Credo AI.