Risk and Mitigation Summary
The following table provides a quick summary of which common genAI-driven risks are present in Claude and which risks have been addressed by deliberate mitigative measures provided with the product. Mitigation of risk does not necessarily equate to elimination of risk.
For a definition of each risk and details of how these risks arise in the context of Claude, see below. These risk details are non-exhaustive.
Many of the details about Claude's development are derived from technical reports authored and self-published by Anthropic's research team. Most of these reports have not undergone scientific peer review. Additionally, the final Claude model likely differs in critical ways from the models discussed in Anthropic’s research reports. For example, as detailed in the model card published at the time of the Claude 2 announcement, Claude 2’s performance and alignment are significantly stronger than those of the models detailed in Anthropic’s research work dating to 2021.
Where recent, publicly available information exists, we use this as the source of truth. Information from older sources should be viewed with a greater level of scrutiny; the need to rely on it points to a potential disconnect in transparency between the Claude product offering and Anthropic's research agenda.
Abuse & Misuse
Pertains to the potential for AI systems to be used maliciously or irresponsibly, including for creating deepfakes, automated cyber attacks, or invasive surveillance systems. Abuse specifically denotes the intentional use of AI for harmful purposes.
Arbitrary code generation
- Because Claude is capable of generating arbitrary code, it could be used to generate code used in cyber attacks. For instance, a malicious user could use the model to generate code for orchestrating a bot network. A successful cyber attack would likely require additional hacking expertise; simply having access to Claude is unlikely to enable a malicious actor to carry out a cyber attack, but the product could lower the barrier for a less sophisticated hacker.
Arbitrary, programmatic text generation
- Because Claude is capable of generating arbitrary text, it could be used to generate text used for misinformation campaigns, generating deepfakes (e.g., text in the style of a public figure), social engineering in phishing attacks, and more. Additionally, when coupled with other generative AI technologies, such as text-to-speech synthesis models capable of mimicking public figures, Claude could be used to perpetrate highly sophisticated misuses. Any misuse would require some expertise in prompt engineering and in other accompanying tools. Nevertheless, given the human-like quality of the model's outputs, the existence of the model dramatically lowers the barrier for both sophisticated and unsophisticated malicious actors.
Generation of text describing harmful and/or illegal activities
- Because Claude is trained on, among other data sources, text from the internet, it is capable of generating descriptions of harmful and/or illegal activities.
For more details on Anthropic's research into abuse and misuse, see https://www.anthropic.com/research.
The versions of the Claude model that Anthropic makes available through its web-app and API have undergone substantial alignment-oriented fine-tuning targeted at addressing the potential for misuse. See the Mitigations section below for more details.
Compliance
Involves the risk of AI systems violating laws, regulations, and ethical guidelines (including copyright risks). Non-compliance can lead to legal penalties, reputational damage, and loss of user trust.
- Claude was trained on, among other sources, publicly available internet data. Anthropic's public documentation does not provide sufficient detail on its data sources to determine the copyright protection status of each training sample. Assuming the training data does contain copyright-protected content, it is possible that the model will reproduce text identical to or substantially similar to copyright-protected text. This applies analogously to code generated by the model in response to user prompts asking for code. The legality of the use of generated text and code is subject to ongoing public debate [6-7]. A similar model, developed by OpenAI, is subject to ongoing litigation.
- Because Anthropic's Claude is capable of generating arbitrary text and code, it could be used in the service of activities that violate laws and regulations in the user's jurisdiction. For instance, the model could be used by a company's HR employee to screen resumes and aid in hiring decisions. Doing so could violate anti-discrimination laws and AI-specific laws in ways that are not easy to detect.
- Use of the model could also violate data security and data privacy laws. For instance, the product is not FERPA compliant, meaning that if an individual uses the model for specific education-related work tasks -- say, a college admissions officer using the chatbot or a tool built on the Claude API to analyze application essays -- the use could violate FERPA.
- The Claude model is not innately aware of a particular developer or user organization's internal policies regarding the use of generative AI tools and their outputs. Without specifically imposing controls, organizations' employees could inadvertently or deliberately use the model, through the chat web app or through the API, to violate organization policy.
Anthropic's fine-tuning process, in particular its Constitutional AI paradigm, addresses some risks of violating applicable laws and regulations by curbing certain problematic topics. See the Mitigations section for more details.
Anthropic offers enterprise customers the option to pursue HIPAA compliance. See  for more details.
Organizations using the Claude API to develop downstream applications have the ability to track all inputs to and outputs from the model. As a consequence, the API provides flexibility to pursue additional "after-market" mitigation strategies to address compliance risks that Anthropic's direct offering does not address. See the Mitigations section for more details.
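As a rough illustration of the input/output tracking described above, the sketch below wraps a model-calling function so that every prompt and response is recorded and exchanges touching organization-defined terms are flagged for review. It is a minimal sketch under stated assumptions, not Anthropic's API: `make_audited_client`, `fake_model`, and the `review_terms` list are hypothetical names introduced for this example, and a real deployment would substitute its own API call and durable storage.

```python
import time
from typing import Callable, Dict, List, Sequence


def make_audited_client(
    call_model: Callable[[str], str],
    audit_log: List[Dict],
    review_terms: Sequence[str] = (),
) -> Callable[[str], str]:
    """Wrap a model-calling function so each exchange is recorded.

    `call_model` is a hypothetical stand-in for whatever function an
    organization uses to send a prompt to the model and get text back.
    """

    def audited(prompt: str) -> str:
        response = call_model(prompt)
        combined = (prompt + " " + response).lower()
        audit_log.append({
            "timestamp": time.time(),
            "prompt": prompt,
            "response": response,
            # Flag exchanges that mention terms the organization wants reviewed.
            "needs_review": any(term.lower() in combined for term in review_terms),
        })
        return response

    return audited


def fake_model(prompt: str) -> str:
    """Placeholder model used only to make the example runnable."""
    return "echo: " + prompt


if __name__ == "__main__":
    log: List[Dict] = []
    client = make_audited_client(fake_model, log, review_terms=["resume"])
    client("Summarize this memo")
    client("Rank this resume")
    for record in log:
        print(record["needs_review"], record["prompt"])
```

Keyword matching of this kind is deliberately simple; it shows where a compliance check could sit in the request path rather than how such a check should be designed.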
Environmental & Societal Impact
Concerns the broader changes AI might induce in society, such as labor displacement, mental health impacts, or the implications of manipulative technologies like deepfakes. It also includes the environmental implications of AI, particularly the strain on natural resources and carbon emissions caused by training complex AI models, balanced against the potential for AI to help mitigate environmental issues.
Labor market disruption
- Because of the strong performance demonstrated by the latest GPT-style models on analysis tasks, there is significant concern that the models could substantially disrupt "white collar", cognitive-task labor markets. Among the formal analyses on this topic, work from OpenAI estimates that "around 80% of the U.S. workforce could have at least 10% of their work tasks affected" by LLMs and "19% of workers may see at least 50% of tasks impacted". The ultimate effect of this disruption is uncertain. It is possible that "disruption" will correspond purely to efficiency gains, enabling workers to focus time on more difficult tasks. It is also conceivable that "disruption" will entail displacement, forcing workers to retrain and/or leading to a substantial increase in unemployment. That analysis highlights that the most likely outcome is some combination of these two varieties of "disruption" and that the impacts will be realized unevenly across economic sectors. While this analysis was performed primarily with OpenAI's ChatGPT and GPT-4 models in mind, Claude is sufficiently powerful that it should be viewed as having the same potential for disruption.
- Anthropic has not published sufficient information about its models to estimate the energy consumption and carbon footprint of training, nor the consumption and emissions of ongoing use. Several Anthropic papers discuss the use of a 52 billion parameter model, which was pre-trained on 850B tokens. Extrapolating from Table 1 in , this implies a compute budget of 2.55e23 FLOPs. Information has been published on other foundation models' energy consumption and carbon footprint. Meta estimates its 65B parameter LLaMA model used just over 1 million GPU hours on NVIDIA's A100-80GB GPU. Assuming 156 TFLOPS performance (the theoretical peak at 32-bit TensorFloat precision), the LLaMA-65B training run amounted to roughly 5.74e23 FLOPs, implying Claude may have used about 1/2 the compute.
- Meta estimates its 65B parameter LLaMA model consumed 449 MWh of power during its training run, approximately equivalent to the annual power consumption and emissions of 42 U.S. households. They further estimate that, due to experimentation and creation of smaller models (steps Anthropic also followed, with a range of model sizes), the overall energy consumption associated with creating the LLaMA model family was 2.64 GWh, approximately equivalent to the consumption of 248 U.S. households. Altogether, assuming Claude constitutes 1/2 the footprint, this suggests Claude's development consumed ~1.3 GWh of energy, on par with the annual consumption of ~125 U.S. households.
- Estimates of daily ongoing consumption are impossible, as Anthropic does not publish user count data.
- Emissions of the models and API are a function of where the models are run - some geographies use more renewable energy than others and thus have lower emissions for the same compute load. It is not publicly known where, geographically, Claude was trained.
- Recent research estimates that training GPT-style models, like Claude, consumes hundreds of thousands of liters of fresh water for data center cooling. Estimating ongoing water consumption, like ongoing energy consumption and emissions, is challenging without knowing the compute required to serve the API's many millions of users.
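The compute and energy arithmetic in the bullets above can be reproduced with a short calculation. This is a sketch under stated assumptions rather than a measurement: the GPU-hour figure of 1.022e6 is an assumed value consistent with "just over 1 million", the per-household consumption is derived from the document's own equivalence of 449 MWh to 42 households, and the 1/2 scaling factor is the rounded ratio used in the text.

```python
SECONDS_PER_HOUR = 3600

# Figures quoted in the text above.
claude_flops = 2.55e23            # estimated compute budget, 52B parameters on 850B tokens
a100_peak_flops_per_s = 156e12    # assumed sustained rate: theoretical TF32 peak
llama_run_mwh = 449.0             # LLaMA-65B training run
llama_family_mwh = 2640.0         # whole LLaMA model family (2.64 GWh)
households_for_llama_run = 42

# Assumption: "just over 1 million" GPU hours taken as 1.022e6.
llama_gpu_hours = 1.022e6

# Compute used by the LLaMA-65B run if every GPU hour ran at peak throughput.
llama_flops = llama_gpu_hours * SECONDS_PER_HOUR * a100_peak_flops_per_s

# Ratio of the estimated Claude budget to the LLaMA-65B budget.
compute_ratio = claude_flops / llama_flops

# Annual consumption of one household, implied by 449 MWh ~ 42 households.
mwh_per_household = llama_run_mwh / households_for_llama_run

# The text rounds the compute ratio to 1/2 before scaling the family footprint.
rounded_ratio = 0.5
claude_mwh = llama_family_mwh * rounded_ratio
claude_households = claude_mwh / mwh_per_household

if __name__ == "__main__":
    print(f"LLaMA-65B compute: {llama_flops:.3g} FLOPs")
    print(f"Claude / LLaMA-65B compute ratio: {compute_ratio:.2f}")
    print(f"Estimated Claude development energy: {claude_mwh / 1000:.2f} GWh")
    print(f"Equivalent U.S. households: {claude_households:.0f}")
```

Because the calculation assumes peak GPU throughput and a linear relationship between compute and energy, the result is best read as an order-of-magnitude estimate.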
User interaction and dependence, including potential for harm to children
- Claude is explicitly designed to be harmless and non-manipulative. Nevertheless, it may be possible to circumvent this. In particular, developers using the Claude API to build downstream applications can potentially, through prompting and filtering techniques, induce the model to manipulate users or breed emotional dependence on the model.
- In professional contexts, use of Claude or downstream tools built using the Claude API may lead to technical reliance on the tool for completing work tasks. In particular, as workers "assign" more labor to the model, they may lose proficiency in skills traditionally associated with these tasks through lack of practice.
- Large language models like Claude are prone to "confabulate" during the text generation process. They can produce factually incorrect information and make reasoning errors, including errors of omission. Societally, as these models proliferate, there is a risk of confabulations proliferating as well. The confabulation phenomenon could contribute to misinformation spread and a general erosion of trust.
Google, Anthropic's cloud partner, claims historically net-neutral emissions through purchase of carbon credits and claims 100% renewable energy matching. See the Mitigations section below for more details.
Mitigations to address the prevalence of confabulations exist and are an area of active research. See the Mitigations section below for more details.
Explainability & Transparency
Refers to the ability to understand and interpret an AI system's decisions and actions, and the openness about the data used, algorithms employed, and decisions made. Lack of these elements can create risks of misuse, misinterpretation, and lack of accountability.
- Information on the training data used to train the Claude model is limited. According to Anthropic’s model card for Claude 2, “Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that [they] license from third party businesses, and data that [their] users affirmatively share or that crowd workers provide. Some of the human feedback data used to finetune Claude was made public alongside [their] RLHF and red-teaming research. Claude 2’s training data cuts off in early 2023, and roughly 10 percent of the data included was non-English.” Anthropic’s fine-tuning approach is detailed in  and other Anthropic-published papers. The exact composition of the final fine-tuning data is unavailable; however, Anthropic has published the "constitution" used to align the model.
Explainability of model outputs
- Explanations of Claude's outputs are not generally available. Claude may be prompted to explain its reasoning, but such explanations are not guaranteed to reflect Claude's true "thought process" and are not guaranteed to be logical or cogent. Anthropic has made "mechanistic interpretability", which focuses on explaining large neural network models mathematically, a key focus area of its research efforts.
- Anthropic has disclosed, through a sequence of research papers published between December 2021 and the present, many details about the development of its large language models. Per the July 2023 model card, the final Claude model was largely designed according to the descriptions provided in these papers. It is, nevertheless, possible that the in-deployment Claude models differ from the models studied at a fixed point in time in Anthropic's research. In particular, Anthropic periodically updates the Claude chatbot and API services to prevent identified misuses. This may include changing the model itself in subtle ways. Such updates likely make the model safer in general, but they may also negatively impact performance.
As discussed above, prompting strategies exist to address the non-explainability of model outputs. Credo AI does not have further comment on this mitigation beyond what is described above.
Fairness & Bias
Arises from the potential for AI systems to make decisions that systematically disadvantage certain groups or individuals. Bias can stem from training data, algorithmic design, or deployment practices, leading to unfair outcomes and possible legal ramifications.
- Claude's training set includes data from a large number of languages, with approximately 10% constituting non-English languages (this percentage also includes computer code). The model is therefore capable of performing some tasks regardless of the prompt language and is capable of generating text in a variety of languages [1, 23]. The capability of the model on a given task is generally lower for less-represented languages [1, 23]. Exact per-language performance is calculated using the Flores benchmark and reported in Anthropic’s Claude 2 model card.
- According to Anthropic, many of the evaluations run on Claude and the mitigations built into it are targeted at American English. As a consequence, mitigative effects are likely lower for non-English languages (i.e., the models may be more likely to confabulate and produce offensive content when prompted in languages or dialects other than American English).
Offensive or biased outputs
- Claude is known to occasionally output profanity, sexual content, stereotypes, and other types of biased or offensive language.
- Because of the potential for Claude to generate offensive language, the model is also capable of adopting biased personas and reasoning when prompted to make decisions (e.g. when prompting the model to compare candidate profiles of two job applicants). The prevalence of this behavior, including the prompting techniques involved in inducing the behavior, is subject to ongoing academic research.
Anthropic's fine-tuning process and content filters address some risks of perpetuating biases or behaving unfairly by curbing certain problematic topics. See the Mitigations section for more details.
Long-term & Existential Risk
Considers the speculative risks posed by future advanced AI systems to human civilization, either through misuse or due to challenges in aligning their objectives with human values.
- Anthropic found that large language models similar to Claude exhibit sycophantic behavior and, as they grow, show increasing desire to acquire resources and preserve problematic goals. These behaviors are not believed to be imminently risky when confined to the model itself (i.e., running the model without providing it access to external resources). Nevertheless, there is potential for system-system and human-system feedback loops in the current generation of models. For instance, during formal evaluations of the Claude competitor, GPT-4, the model was used to trick a human (a TaskRabbit worker) into taking actions on its behalf. The feasibility of performing similar actions with Claude is uncertain, but the potential for this level of misuse points to the need for individual developers to be aware of this risk as they build with Claude.
Anthropic's Constitutional AI approach represents the company's primary strategy for mitigating this risk [12, 13].
Performance & Robustness
Pertains to the AI's ability to fulfill its intended purpose accurately and its resilience to perturbations, unusual inputs, or adverse situations. Failures of performance are fundamental to the AI system performing its function. Failures of robustness can lead to severe consequences, especially in critical applications.
- Claude, like all large language models, is known to "confabulate" facts and information. It is also known to make errors in reasoning, including basic arithmetic errors. The frequency of this behavior depends on the task given to the model.
Code bugginess and vulnerabilities
- Claude is able to generate arbitrary code, including code containing bugs and security vulnerabilities, sub-optimal code, and code that is not fit for purpose. The frequency of this behavior depends on the specific coding task asked of the model.
- Claude's performance on a given task, as with all large language models, is a function of the prompt and any other inputs provided to the model. Benchmarks exist to measure performance and robustness on a fixed set of tasks. (See the evaluations section for details and citations.) The degree to which benchmarks are representative of real world performance, especially when prompt engineering techniques have been implemented, is limited.
Anthropic claims substantial mitigation of these risks through its fine-tuning procedures. For specific tasks, some targeted mitigation strategies exist. No mitigation strategy is 100% effective. Please see the Mitigations section for more details.
Privacy
Refers to the risk of AI infringing upon individuals' rights to privacy, through the data they collect, how they process that data, or the conclusions they draw.
Data collection and re-use
- By default, prompts and responses submitted to both the Claude web-app and the Claude API are not used for downstream model training. This mitigates the risk of data submitted to the model being leaked to other individuals (e.g., by a future Anthropic model). Users should note that if they provide feedback to Anthropic about particular model results, Anthropic may use the result, feedback, and associated prompts for future model training. Anthropic automatically deletes prompts and responses to the Claude API 30 days after submission, unless a user submits a separate data deletion request, which may lead to earlier deletion. The same policy applies to the Claude web-app, except where retention is expected by the user as part of the chatbot service (e.g., retaining chat logs so a user can resume a "conversation" at a later date).
Reproduction of PII from training data
- Because Claude is trained on a large corpus of text data, potentially including publicly available personal information, the model may occasionally generate (i.e., regurgitate) information about individuals. When augmented with outside data, Claude could possibly be used to identify individuals, since the model has strong geographic knowledge and reasoning abilities -- a behavior highlighted by OpenAI with respect to the GPT-4 model.
Security
Encompasses potential vulnerabilities in AI systems that could compromise their integrity, availability, or confidentiality. Security breaches could result in significant harm, from incorrect decision-making to privacy violations.
Vulnerable code generation
- As with any foundation model capable of generating arbitrary code, Claude may output code with security vulnerabilities. No estimates of how frequently this occurs are known.
- At this time, Anthropic does not publicly advertise the option to purchase access to a "sequestered" (i.e., virtual private cloud) tenant or an on-premises deployment. Claude is available through the Amazon AWS Bedrock service.
- Claude is susceptible to "prompt injection" attacks, whereby a malicious user enters a particular style of instruction to encourage the model to misbehave in ways advantageous to that user. This misbehavior can include circumventing any and all safety precautions "built-in" to the model through fine-tuning.
- Applications built on the Claude API (i.e. that call the API as part of the regular functioning of the application), such as custom chatbots and analysis engines, are also susceptible to this attack vector if they do not adopt specific risk mitigation steps. Developers are encouraged to take risk mitigation measures on top of those provided by Anthropic.
Access to external systems
- Claude and its API do not have access to external systems by default.
- Through the prompt injection attack vector (see above), applications which access external systems (e.g., third party API access to document-backed search systems, personal assistants which are given access to email or other personal accounts, auto-trading finance bots, etc.) may be subject to additional risk. For instance, a prompt injection attack which circumvents Anthropic's safety controls could be leveraged to "instruct" the model to take actions which go against the wishes of the user who has granted the bot access.
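One way an application with external-system access might limit the damage from a successful prompt injection is to treat every model-proposed action as untrusted and pass it through an allowlist, with explicit user confirmation for sensitive actions. The sketch below illustrates that idea only; `ProposedAction`, `ActionGate`, and the action names are hypothetical and are not part of any Anthropic interface.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class ProposedAction:
    """An action the model has asked the application to perform."""
    name: str
    argument: str


class ActionGate:
    """Decide whether a model-proposed action may be executed."""

    def __init__(
        self,
        allowed: FrozenSet[str],
        needs_confirmation: FrozenSet[str],
        confirm: Callable[[ProposedAction], bool],
    ) -> None:
        self.allowed = allowed
        self.needs_confirmation = needs_confirmation
        self.confirm = confirm

    def authorize(self, action: ProposedAction) -> bool:
        # Anything outside the allowlist is rejected, whatever the model says.
        if action.name not in self.allowed:
            return False
        # Sensitive actions require an explicit decision from the user.
        if action.name in self.needs_confirmation:
            return self.confirm(action)
        return True


if __name__ == "__main__":
    gate = ActionGate(
        allowed=frozenset({"search_docs", "send_email"}),
        needs_confirmation=frozenset({"send_email"}),
        confirm=lambda action: False,  # stand-in for a real user prompt
    )
    for proposed in (
        ProposedAction("search_docs", "quarterly report"),
        ProposedAction("send_email", "to: someone@example.com"),
        ProposedAction("transfer_funds", "all"),
    ):
        print(proposed.name, gate.authorize(proposed))
```

The gate sits outside the model, so it still applies when an injected instruction has bypassed the model's own safety training; it does not, however, prevent misuse of actions that are on the allowlist.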
Anthropic's Constitutional AI paradigm represents a mitigative measure to address some security risks. Anthropic does not clarify whether it employs additional, non-Claude measures to address prompt injection risks. We provide more details in the Mitigations section.