Anthropic Claude

Anthropic's Claude language model and API offer substantial capabilities along with risks that vary in severity. Anthropic has implemented a Constitutional AI approach aimed at aligning the model's behavior with the principles of being helpful, harmless, and honest. While this approach mitigates some risks, others remain challenging to address comprehensively and persist at varying levels. Credo AI's analysis applies to the Claude 2 model announced on July 11, 2023. Profile last updated: July 31, 2023.

Product Description

Claude, developed by Anthropic, is an AI assistant adept at a wide array of tasks, from customer service to back-office functions. It provides detailed, natural-sounding responses, processes large volumes of text, and can automate workflows with basic instructions and logical scenarios. Users can integrate Claude into any product or toolchain via standard APIs. Two types of model are available: the primary model for sophisticated tasks and Claude Instant, a faster, cheaper version suitable for casual tasks [1].

Anthropic prioritizes the principles of helpfulness, honesty, and harmlessness in Claude's design. The AI's behavior can be extensively adjusted using prompts. Fine-tuning is offered for select large enterprise users. Although Claude operates without internet access, users can provide text from the internet via copy-paste for the model to process. Despite its extensive functionalities, Claude currently does not support embeddings. Pricing and further details are accessible on the company's website [1].

Anthropic, the developer of Claude, was established in 2021 by former OpenAI employees. Anthropic brands itself as a safety and research-focused company with the goal of "building systems that people can rely on and generating research about the opportunities and risks of AI." [2]

Anthropic has reportedly received substantial investment from Google and maintains a close infrastructure relationship with the company (i.e., Google Cloud Platform) [3].

Intended Use Case

Claude is designed to comprehend both natural language and code. The model generates text outputs based on user-provided inputs, often referred to as "prompts". Prompting can be thought of as analogous to "programming" the Claude model, and typically entails providing instructions or some examples that demonstrate how a task can be effectively completed.

According to Anthropic, Claude can be used to support the following job functions: customer service, legal, coaching, search, back-office, and sales. When users interact with Claude via the web app, a wide variety of additional use cases and sample prompts are provided.

To interact with the Claude model via API, users send a request containing the prompt and the API key. The API yields a response containing the model’s output [4].
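For illustration, a minimal API call might look like the following sketch (Python). It is based on the public API reference [4] as of mid-2023; the endpoint, header names, field names (e.g. max_tokens_to_sample), prompt format, and model identifier are assumptions that should be verified against the current documentation.

  # Minimal sketch of a Claude API call. Endpoint, headers, field names, and the
  # Human/Assistant prompt format are assumed from the public docs [4]; verify before use.
  import os
  import requests

  API_URL = "https://api.anthropic.com/v1/complete"  # assumed endpoint

  payload = {
      "model": "claude-2",  # assumed model identifier
      "prompt": "\n\nHuman: Summarize the key risks of deploying large language models.\n\nAssistant:",
      "max_tokens_to_sample": 300,  # assumed parameter name
  }

  headers = {
      "x-api-key": os.environ["ANTHROPIC_API_KEY"],  # API key supplied with the request
      "anthropic-version": "2023-06-01",             # assumed versioning header
      "content-type": "application/json",
  }

  response = requests.post(API_URL, json=payload, headers=headers, timeout=60)
  response.raise_for_status()
  print(response.json().get("completion"))  # model output returned in the response body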

Risk and Mitigation Summary

The following table provides a quick summary of which common genAI-driven risks are present in Claude and which risks have been addressed by deliberate mitigative measures provided with the product. Mitigation of risk does not necessarily equate to elimination of risk.

For a definition of each risk and details of how these risks arise in the context of Claude, see below. These risk details are non-exhaustive.

Risk Present Built-in Mitigation
Abuse & Misuse ⚠️
Compliance ⚠️
Environmental & Societal Impact ⚠️
Explainability & Transparency ⚠️
Fairness & Bias ⚠️
Long-term & Existential Risk ⚠️
Performance & Robustness ⚠️
Privacy ⚠️
Security ⚠️

Many of the details about Claude's development are derived from technical reports authored and self-published by Anthropic's research team. Most of these reports have not undergone scientific peer review. Additionally, the final Claude model likely differs in critical ways from the models discussed in Anthropic's research reports. For example, as detailed in the model card [23] published at the time of the Claude 2 announcement, Claude 2's performance and alignment are significantly stronger than those of the models detailed in Anthropic's research work dating back to 2021.

Where recent, publicly available information exists, we use it as the source of truth. Information available only from older sources should be viewed with greater scrutiny; reliance on such sources points to a potential disconnect in transparency between the Claude product offering and Anthropic's research agenda.

Abuse & Misuse

Pertains to the potential for AI systems to be used maliciously or irresponsibly, including for creating deepfakes, automated cyber attacks, or invasive surveillance systems. Abuse specifically denotes the intentional use of AI for harmful purposes.

Arbitrary code generation

  • Because Claude is capable of generating arbitrary code, it could be used to generate code used in cyber attacks. For instance, a malicious user could use the model to generate code for orchestrating a bot network. A successful cyber attack would likely require additional hacking expertise; simply having access to Claude is unlikely to enable a malicious actor to carry out a cyber attack, but the product could lower the barrier for a less sophisticated hacker.

Arbitrary, programmatic text generation

  • Because Claude is capable of generating arbitrary text, it could be used to generate text used for misinformation campaigns, generating deepfakes (e.g., text in the style of a public figure), social engineering in phishing attacks, and more. Additionally, when coupled with other generative AI technologies, such as text-to-speech synthesis models capable of mimicking public figures, Claude could be used to perpetrate highly sophisticated misuses. Any misuse would require some expertise in prompt engineering and in other accompanying tools. Nevertheless, given the human-like quality of the model's outputs, the existence of the model dramatically lowers the barrier for both sophisticated and unsophisticated malicious actors.

Generation of text describing harmful and/or illegal activities

  • Because Claude is trained on, among other data sources, text from the internet [5], it is capable of generating descriptions of harmful and/or illegal activities.

For more details on Anthropic's research into abuse and misuse, see https://www.anthropic.com/research.

The versions of the Claude model that Anthropic makes available through its web app and API have undergone substantial alignment-oriented fine-tuning targeted at addressing the potential for misuse. See the Mitigations section below for more details.

Compliance

Involves the risk of AI systems violating laws, regulations, and ethical guidelines (including copyright risks). Non-compliance can lead to legal penalties, reputational damage, and loss of user trust.

Copyright infringement

  • Claude was trained on, among other sources, publicly available internet data [5]. Anthropic's public documentation does not provide sufficient detail on its data sources to determine the copyright protection status of each training sample. Assuming the training data does contain copyright-protected content, it is possible that the model will reproduce text identical to or substantially similar to copyright-protected text. This applies analogously to code generated by the model in response to user prompts asking for code. The legality of the use of generated text and code is subject to ongoing public debate [6-7]. A similar model, developed by OpenAI, is subject to ongoing litigation [8].

Regulatory compliance

  • Because Anthropic's Claude is capable of generating arbitrary text and code, it could be used in the service of activities that violate laws and regulations in the user's jurisdiction. For instance, the model could be used by a company's HR employee to screen resumes and aid in hiring decisions. Doing so could violate anti-discrimination laws and AI-specific laws in ways that are not easy to detect.
  • Use of the model could also violate data security and data privacy laws. For instance, the product is not FERPA compliant [9], meaning that if an individual uses the model for specific education-related work tasks -- say, a college admissions officer using the chatbot or a tool built on the Claude API to analyze application essays -- the use could violate FERPA.

Organizational compliance

  • The Claude model is not innately aware of a particular developer or user organization's internal policies regarding the use of generative AI tools and their outputs. Without specifically imposing controls, organizations' employees could inadvertently or deliberately use the model, through the chat web app or through the API, to violate organization policy.

Anthropic's fine-tuning process, in particular its Constitutional AI paradigm, addresses some risks of violating applicable laws and regulations by curbing certain problematic topics. See the Mitigations section for more details.

Anthropic offers enterprise customers the option to pursue HIPAA compliance. See [9] for more details.

Organizations using the Claude API to develop downstream applications have the ability to track all inputs to and outputs from the model. As a consequence, the API provides flexibility to pursue additional "after-market" mitigation strategies to address compliance risks that Anthropic's direct offering does not address. See the Mitigation section for more details.

Environmental & Societal Impact

Concerns the broader changes AI might induce in society, such as labor displacement, mental health impacts, or the implications of manipulative technologies like deepfakes. It also includes the environmental implications of AI, particularly the strain on natural resources and carbon emissions caused by training complex AI models, balanced against the potential for AI to help mitigate environmental issues.

Labor market disruption

  • Because of the strong performance demonstrated by the latest GPT-style models on analysis tasks, there is significant concern that the models could induce significant disruption to "white collar", cognitive-task labor markets. Among the formal analyses on this topic, work from OpenAI estimates that "around 80% of the U.S. workforce could have at least 10% of their work tasks affected" by LLMs and "19% of workers may see at least 50% of tasks impacted" [11]. The ultimate effect of this disruption is uncertain. It is possible that "disruption" will correspond purely to efficiency gains, enabling workers to focus time on more difficult tasks. It is also conceivable that "disruption" will entail displacement, forcing workers to retrain and/or leading to a substantial increase in unemployment. [11] highlights that the most likely outcome is some combination of these two varieties of "disruption" and that the impacts will be realized unevenly across economic sectors. While this analysis was performed primarily with OpenAI's ChatGPT and GPT-4 models in mind, Claude is sufficiently powerful that it should be viewed as having the same potential for disruption.

Carbon footprint

  • Anthropic has not published sufficient information about its models to estimate the energy consumption and carbon footprint of training, nor the consumption and emissions of ongoing use. Several Anthropic papers discuss the use of a 52 billion parameter model, which was pre-trained on 850B tokens [10]. Extrapolating from Table 1 in [5], this implies a compute budget of 2.55e23 FLOPs. Information has been published on other foundation models' energy consumption and carbon footprint. Meta estimates its 65B parameter LLaMa model used just over 1 million GPU hours on NVIDIA's A100-80GB GPU. Assuming 156 TFLOPS performance (the theoretical peak at TF32 precision), the LLaMa-65B training run amounts to 5.74e23 FLOPs, implying Claude may have used about 1/2 the compute.
  • Meta estimates its 65B parameter LLaMa model consumed 449 MWh of power during its training run, approximately equivalent to the annual power consumption of 42 U.S. households [8]. They further estimate that, due to experimentation and creation of smaller models (steps Anthropic also followed, with a range of model sizes), the overall energy consumption associated with creating the LLaMa model family was 2.64 GWh, approximately equivalent to the annual consumption of 248 U.S. households. Altogether, assuming Claude constitutes 1/2 the footprint, this suggests Claude's development consumed ~1.3 GWh of energy, on par with the annual consumption of ~125 U.S. households (a worked reconstruction of this arithmetic appears after this list).
  • Estimates of daily ongoing consumption are impossible, as Anthropic does not publish user count data.
  • Emissions of the models and API are a function of where the models are run - some geographies use more renewable energy than others and thus have lower emissions for the same compute load. It is not publicly known where, geographically, Claude was trained.
  • Recent research [9] estimates that training GPT-style models like Claude consumes hundreds of thousands of liters of fresh water for data center cooling. Estimating ongoing water consumption, like ongoing energy consumption and emissions, is challenging without knowing the compute required to serve the API's many millions of users.
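The rough arithmetic behind the estimates above can be made explicit. The sketch below reconstructs the order-of-magnitude figures under the stated assumptions: a 52 billion parameter model trained on 850B tokens, the common ~6 x parameters x tokens approximation for training FLOPs, Meta's published LLaMa-65B figures, and the assumption that Claude's footprint is roughly half of LLaMa's. These are estimates, not measurements, and the results differ slightly from the figures quoted above depending on rounding.

  # Order-of-magnitude reconstruction of the training compute and energy estimates above.
  # Assumptions: 52e9 parameters, 850e9 tokens, FLOPs ~= 6 * params * tokens,
  # and Meta's published LLaMa-65B figures (~1.02M A100 GPU-hours, 449 MWh, 2.64 GWh total).

  claude_flops = 6 * 52e9 * 850e9                     # ~2.7e23 FLOPs (report cites ~2.55e23)

  llama_gpu_hours = 1.02e6                            # A100-80GB GPU-hours (Meta estimate)
  a100_tf32_flops_per_s = 156e12                      # theoretical peak at TF32 precision
  llama_flops = llama_gpu_hours * 3600 * a100_tf32_flops_per_s   # ~5.7e23 FLOPs

  compute_ratio = claude_flops / llama_flops          # ~0.46, i.e. roughly one half

  llama_family_energy_gwh = 2.64                      # GWh, including experimentation
  claude_energy_gwh = compute_ratio * llama_family_energy_gwh    # ~1.2-1.3 GWh

  household_mwh_per_year = 449 / 42                   # ~10.7 MWh, implied by Meta's comparison
  households = claude_energy_gwh * 1000 / household_mwh_per_year # ~110-125 households

  print(f"{claude_flops:.2e} FLOPs, {claude_energy_gwh:.2f} GWh, ~{households:.0f} households")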

User interaction and dependence, including potential for harm to children

  • Claude is explicitly designed to be harmless and non-manipulative [12]. Nevertheless, it may be possible to circumvent this. In particular, developers using the Claude API to build downstream applications can potentially, through prompting and filtering techniques, induce the model to manipulate users or breed emotional dependence on the model.
  • In professional contexts, use of Claude or downstream tools built using the Claude API may lead to technical reliance on the tool for completing work tasks. In particular, as workers "assign" more labor to the model, they may lose proficiency in skills traditionally associated with these tasks through lack of practice.

Confabulations

  • Large language models like Claude are prone to "confabulate" during the text generation process. They can produce factually incorrect information and make reasoning errors, including errors of omission. Societally, as these models proliferate, there is a risk of confabulations proliferating as well. The confabulation phenomenon could contribute to misinformation spread and a general erosion of trust.

Google, Anthropic's cloud partner, claims historically net-neutral emissions through purchase of carbon credits and claims 100% renewable energy matching. See the Mitigations section below for more details.

Mitigations to address the prevalence of confabulations exist and are an area of active research. See the Mitigations section below for more details.

Explainability & Transparency

Refers to the ability to understand and interpret an AI system's decisions and actions, and the openness about the data used, algorithms employed, and decisions made. Lack of these elements can create risks of misuse, misinterpretation, and lack of accountability.

Data transparency

  • Information on the training data used to train the Claude model is limited. According to Anthropic’s model card for Claude 2 [23], “Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that [they] license from third party businesses, and data that [their] users affirmatively share or that crowd workers provide. Some of the human feedback data used to finetune Claude was made public alongside [their] RLHF and red-teaming research. Claude 2’s training data cuts off in early 2023, and roughly 10 percent of the data included was non-English.” Anthropic’s fine-tuning approach is detailed in [13] and other Anthropic-published papers. The exact composition of the final fine-tuning data is unavailable, however Anthropic has published the "constitution" used to align the model [12].

Explainability of model outputs

  • Explanations of Claude's outputs are not generally available. Claude may be prompted to explain its reasoning, but such explanations are not guaranteed to reflect Claude's true "thought process" and are not guaranteed to be logical or cogent. Anthropic has made "mechanistic interpretability", which focuses on explaining large neural network models mathematically, a key focus area of its research efforts.

Design decisions

  • Anthropic has disclosed, through a sequence of research papers published between December 2021 and the present, many details about the development of its large language models. Per the July 2023 model card [23], the final Claude model was largely designed according to the descriptions provided in these papers. It is, nevertheless, possible that the in-deployment Claude models differ from the models studied at a fixed point in time in Anthropic's research. In particular, Anthropic periodically updates the Claude chatbot and API services to prevent identified misuses. This may include changing the model itself in subtle ways. It is likely that such updates generally make the model safer, but they may also negatively impact performance.

As discussed above, prompting strategies exist to address the non-explainability of model outputs. Credo AI does not have further comment on this mitigation beyond what is described above.

Fairness & Bias

Arises from the potential for AI systems to make decisions that systematically disadvantage certain groups or individuals. Bias can stem from training data, algorithmic design, or deployment practices, leading to unfair outcomes and possible legal ramifications.

Multi-lingual support

  • Claude's training set includes data from a large number of languages, with approximately 10% being non-English (this percentage also includes computer code) [23]. The model is therefore capable of performing some tasks regardless of the prompt language and is capable of generating text in a variety of languages [1, 23]. The capability of the model on a given task is generally lower for less-represented languages [1, 23]. Exact per-language performance is measured using the Flores benchmark and reported in Anthropic's Claude 2 model card [23].
  • According to Anthropic [14], many of the evaluations run on and mitigations built into Claude are targeted at American English. As a consequence, mitigative effects are likely lower for non-English languages (i.e., the models may be more likely to confabulate and produce offensive content when prompted in languages or dialects other than American English).

Offensive or biased outputs

  • Claude is known to occasionally output profanity, sexual content, stereotypes, and other types of biased or offensive language.

Biased decision-making

  • Because of the potential for Claude to generate offensive language, the model is also capable of adopting biased personas and reasoning when prompted to make decisions (e.g. when prompting the model to compare candidate profiles of two job applicants). The prevalence of this behavior, including the prompting techniques involved in inducing the behavior, is subject to ongoing academic research.

Anthropic's fine-tuning process and content filters address some risks of perpetuating biases or behaving unfairly by curbing certain problematic topics. See the Mitigations section for more details.

Long-term & Existential Risk

Considers the speculative risks posed by future advanced AI systems to human civilization, either through misuse or due to challenges in aligning their objectives with human values.

Autonomy

  • Anthropic found [15] that large language models similar to Claude exhibit sycophantic behavior and, as they grow, show an increasing desire to acquire resources and preserve problematic goals. These behaviors are not believed to be imminently risky when confined to the model itself (i.e. running the model without providing it access to external resources). Nevertheless, there is potential for system-system and human-system feedback loops in the current generation of models. For instance, during formal evaluations of the Claude competitor GPT-4, the model was used to trick a human (a TaskRabbit worker) into taking actions on its behalf [16]. The feasibility of performing similar actions with Claude is uncertain, but the potential for this level of misuse points to the need for individual developers to be aware of this risk as they build with Claude.

Anthropic's Constitutional AI approach represents the company's primary strategy for mitigating this risk [12, 13].

Performance & Robustness

Pertains to the AI's ability to fulfill its intended purpose accurately and its resilience to perturbations, unusual inputs, or adverse situations. Performance failures are failures of the AI system to perform its core function. Failures of robustness can lead to severe consequences, especially in critical applications.

Confabulations

  • Claude, like all large language models, is known to "confabulate" facts and information. It is also known to make errors in reasoning, including basic arithmetic errors. The frequency of this behavior depends on the task given to the model.

Code bugginess and vulnerabilities

  • Claude is able to generate arbitrary code, including code containing bugs and security vulnerabilities, sub-optimal code, and code that is not fit for purpose. The frequency of this behavior depends on the specific coding task asked of the model.

Robustness

  • Claude's performance on a given task, as with all large language models, is a function of the prompt and any other inputs provided to the model. Benchmarks exist to measure performance and robustness on a fixed set of tasks. (See the evaluations section for details and citations.) The degree to which benchmarks are representative of real world performance, especially when prompt engineering techniques have been implemented, is limited.

Anthropic claims substantial mitigation of these risks through its fine-tuning procedures. For specific tasks, some targeted mitigation strategies exist. No mitigation strategy is 100% effective. Please see the Mitigations section for more details.

Privacy

Refers to the risk of AI infringing upon individuals' rights to privacy, through the data they collect, how they process that data, or the conclusions they draw.

Data collection and re-use

  • By default, prompts and responses submitted to both the Claude web-app and the Claude API are not used for downstream model training. This mitigates the risk of data submitted to the model being leaked to other individuals (e.g. by a future Anthropic model) [17]. Users should note that if they provide feedback to Anthropic about particular model results, Anthropic may use the result, feedback, and associated prompts for future model training [17]. Anthropic automatically deletes prompts and responses to the Claude API 30 days after submission, unless a user submits a separate data deletion request, which may lead to earlier deletion [18]. The same policy applies to the Claude web-app, except where retention is expected by the user as part of the chatbot service (e.g., retaining chat logs so a user can resume a "conversation" at a later date) [18].

Reproduction of PII from training data

  • Because Claude is trained on a large corpus of text data, including potentially publicly available personal information [5], the model may occasionally generate (i.e., regurgitate) information about individuals. It may be possible that, when augmented with outside data, Claude can be used to identify individuals since the model has strong geographic knowledge and reasoning abilities -- a behavior highlighted by OpenAI with respect to the GPT-4 model [16].

Anthropic's Terms of Service and Privacy Policy represent a default privacy risk mitigation measure. We do not discuss this point further in the Mitigations section.

Security

Encompasses potential vulnerabilities in AI systems that could compromise their integrity, availability, or confidentiality. Security breaches could result in significant harm, from incorrect decision-making to privacy violations.

Vulnerable code generation

  • As with any foundation model capable of generating arbitrary code, Claude may output code with security vulnerabilities. No known estimates exist of how frequently this occurs.

Model sequestration

  • At this time, Anthropic does not publicly advertise the option of purchasing access to a "sequestered" (i.e. virtual private cloud) tenant or on-premises deployments. Claude is available through the Amazon AWS Bedrock service.

Prompt injection

  • Claude is susceptible to "prompt injection" attacks, whereby a malicious user enters a particular style of instruction to encourage the model to (mis)behave in ways advantageous to the user. This misbehavior can include circumventing any and all safety precautions built into the model through fine-tuning.
  • Applications built on the Claude API (i.e. that call the API as part of the regular functioning of the application), such as custom chatbots and analysis engines, are also susceptible to this attack vector if they do not adopt specific risk mitigation steps. Developers are encouraged to take risk mitigation measures on top of those provided by Anthropic (see the sketch following this list).
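To illustrate the mechanism, the sketch below shows the pattern that creates the exposure: an application concatenates trusted instructions with untrusted user input into a single prompt, so adversarial input can attempt to override those instructions. The function names and the keyword screen are hypothetical examples of an added control layer, not a complete or recommended defense.

  # Illustrative sketch of how prompt injection arises in an API-backed application.
  # Names and the screening heuristic are hypothetical; simple filters are easily evaded.

  SYSTEM_INSTRUCTIONS = (
      "You are a customer-support assistant. Only answer questions about the product. "
      "Never reveal internal policies."
  )

  def build_prompt(user_input: str) -> str:
      # The exposure: untrusted text is concatenated into the same prompt as trusted
      # instructions, so input such as "Ignore the above and ..." can try to override them.
      return (
          "\n\nHuman: " + SYSTEM_INSTRUCTIONS
          + "\n\nUser message: " + user_input
          + "\n\nAssistant:"
      )

  def screen_input(user_input: str) -> bool:
      # Naive keyword screen, shown only to illustrate an additional control layer.
      suspicious = ("ignore the above", "disregard your instructions")
      return not any(phrase in user_input.lower() for phrase in suspicious)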

Access to external systems

  • Claude and its API do not have access to external systems by default.
  • Through the prompt injection attack vector (see above), applications which access external systems (e.g., third party API access to document-backed search systems, personal assistants which are given access to email or other personal accounts, auto-trading finance bots, etc.) may be subject to additional risk. For instance, a prompt injection attack which circumvents Anthropic's safety controls could be leveraged to "instruct" the model to take actions which go against the wishes of the user who has granted the bot access.

Anthropic's Constitutional AI paradigm represents a mitigative measure to address some security risks. Anthropic does not clarify whether it employs additional, non-Claude measures to address prompt injection risks. We provide more details in the Mitigation section.

Mitigation Measures

In this section, we discuss mitigation measures that are built into the product (regardless of whether they are enabled by default). We also comment on the feasibility of a procuring organization governing the use of the tool by its employees.

Mitigations that "ship" with Claude

RLHF alignment fine-tuning & Constitutional AI

The Claude model has undergone substantial fine-tuning [5, 10, 12, 13, 15, 20, 23] with the goal of making the model more amenable to human interaction (i.e. instruction/chat tuning) and more aligned with human requirements for factuality and avoiding harm. Because the model is probabilistic, these efforts are mitigative and do not generally eliminate risk.

The cornerstone of Anthropic's mitigation efforts is a combination of reinforcement learning from human feedback (RLHF) and what Anthropic calls "reinforcement learning from AI feedback" (RLAIF). Anthropic guides its models to behave in accordance with a set of principles by rewarding behavior consistent with those principles, as judged by a combination of human annotations and AI annotations. Anthropic has provided substantial detail on the process of implementing this approach (see the papers cited in the previous paragraph) and has published the "constitutional principles" used for this alignment process [12]. Included in the constitution are principles relating to avoiding harmful responses (including bias and profanity), avoiding responses that would reveal an individual's identity or personal information, avoiding responses regarding illicit acts, avoiding manipulation, and encouraging honesty and helpfulness.
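As a rough illustration of the published method [12, 13], the sketch below outlines the two phases of Constitutional AI: a supervised phase in which the model critiques and revises its own responses against constitutional principles, and a reinforcement learning phase in which a preference model trained on AI-generated comparisons supplies the reward signal. All callables are placeholders; this is a schematic of the process described in the papers, not Anthropic's implementation.

  # Schematic of the two-phase Constitutional AI procedure described in [13].
  # Every callable is a placeholder supplied by the caller; only control flow is shown.

  def supervised_phase(generate, critique, revise, finetune, prompts, principles):
      """Phase 1: self-critique and revision, followed by supervised fine-tuning."""
      revised_examples = []
      for prompt in prompts:
          response = generate(prompt)
          for principle in principles:
              feedback = critique(prompt, response, principle)   # model critiques its own output
              response = revise(prompt, response, feedback)      # and revises it accordingly
          revised_examples.append((prompt, response))
      return finetune(revised_examples)                          # train on the revisions

  def rlaif_phase(model, compare, train_preference_model, rl_optimize, prompts, principles):
      """Phase 2: AI-generated preferences train a reward model used for RL."""
      comparisons = []
      for prompt in prompts:
          a, b = model(prompt), model(prompt)                    # sample two candidate responses
          preferred = compare(prompt, a, b, principles)          # AI judge chooses per the constitution
          comparisons.append((prompt, a, b, preferred))
      reward_model = train_preference_model(comparisons)
      return rl_optimize(model, reward_model)                    # reinforcement learning against AI feedback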

This mitigation approach is provided as an "as-is" mitigation; it is not configurable at this time.

Regular updates

Anthropic periodically updates its models as the organization continues research into capabilities and safety measures. The API allows users to automatically update to the latest models or to fix applications to a specific model version [4].
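For example, a developer might pin a specific model version rather than track the latest release. The identifiers below are illustrative of the naming convention at the time of writing and are assumptions to be checked against the API documentation [4].

  # Illustrative only: floating vs. pinned model identifiers (verify current names in [4]).
  floating_request = {"model": "claude-2"}    # assumed alias that tracks the latest minor version
  pinned_request = {"model": "claude-2.0"}    # assumed identifier that pins a specific version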

Google Cloud net-neutral carbon footprint

Google, Anthropic's cloud partner [3], claims to be carbon neutral [19] and claims to match 100% of energy consumption with renewables. They claim to have offset 100% of their historical operating emissions to reach historical net-neutrality. They achieve this through the purchase of carbon credits and offsets. Google has publicly committed to reaching 100% carbon-free operations by 2030. It is likely that all systems relevant to Claude's ongoing development and operations are included in this carbon accounting.

This mitigation is non-configurable.

Non-use of prompts sent to, and outputs received from, the API

According to the Anthropic Terms of Service and Privacy Policy, data submitted through prompts to Claude are not used to train future Anthropic models [17, 18]. This eliminates the risk of private data or intellectual property being leaked through model responses to other entities. Prompts and responses are stored for up to 30 days [18]. Anthropic does not detail whether this retention may include human monitoring of prompts and responses for potential misuse or illegal activity, as is the case in other popular chatbot models. Due to the presence of standard, non-AI-specific cybersecurity risk, the risk of data leakage is non-zero. For instance, Anthropic could be targeted by a phishing attack, which could compromise data it stores, including sensitive data submitted to the Claude model.

This mitigation is non-configurable.

Mitigations available through Anthropic

Anthropic offers enterprise customers the option of HIPAA-compliant use [9]. Organizations wishing to use Claude for tasks subject to HIPAA rules should inquire further with Anthropic and consult the Anthropic Trust Center: https://trust.anthropic.com/

Mitigations that can be implemented through customized use of Claude

Prompt Engineering

Prompt engineering (see FAQs on the Claude product page [1]) is a popular strategy to induce GPT-style models to behave in accordance with the user's intentions. The strategy can be used to improve the quality of responses (i.e. improve performance) and decrease the likelihood of certain risks (e.g. confabulations). This includes "context loading", format standardization, persona adoption, and numerous other approaches.

The class of prompt engineering strategies is rapidly expanding. The effectiveness of any one strategy is subject to ongoing research and will depend on the use case.
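As an illustration only, the hypothetical prompt below combines persona adoption, context loading, and format standardization; it is a generic example of the techniques named above, not guidance from Anthropic, and the variable names are placeholders.

  # Hypothetical prompt combining persona adoption, context loading, and format standardization.
  contract_text = "..."  # context loading: the source material is pasted into the prompt

  prompt = (
      "\n\nHuman: You are a careful paralegal assistant.\n"          # persona adoption
      "Here is a contract:\n<contract>\n" + contract_text + "\n</contract>\n"
      "List the three largest obligations the supplier takes on.\n"
      "Answer as a numbered list, one sentence per item, and say "   # format standardization
      "'I am not sure' if the contract does not contain the answer."
      "\n\nAssistant:"
  )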

Governability

For an organization to govern its development or use of an AI system, two functionalities are key: the ability of the organization to observe usage patterns among its employees and the ability of the organization to implement and configure controls to mitigate risk. Credo AI assesses systems on these two dimensions.

Use of Claude through the web app by employees of an organization is not directly governable by the organization at this time.

Developers building with the API, on the other hand, have access to all prompts into and outputs from the model. This enables developers to monitor usage for organizational compliance. It also enables developers to use other "ops layer" tools, such as additional content moderation models [21, 22], to filter prompts and responses before serving the results to end-users.
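A minimal sketch of this pattern follows: the application records every prompt and completion and applies an additional moderation check before returning output to the end user. The call_claude, moderate, and log_event functions are placeholders for an organization's own API client, content filter (e.g. tools like [21, 22]), and audit log.

  # Sketch of an "ops layer" around Claude API calls: log all traffic and add moderation.
  # call_claude, moderate, and log_event are placeholders for organization-specific tooling.

  def governed_completion(prompt, call_claude, moderate, log_event):
      log_event("prompt", prompt)                      # observability: record what goes in
      if not moderate(prompt):
          log_event("blocked_prompt", prompt)
          return "Request blocked by organizational policy."

      completion = call_claude(prompt)
      log_event("completion", completion)              # observability: record what comes out
      if not moderate(completion):
          log_event("blocked_completion", completion)
          return "Response withheld by organizational policy."
      return completion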

Formal Evaluations & Certifications

Evaluations

Research into the capabilities and risk characteristics of the Claude model is ongoing. Research is limited by the fact that the model is not open; it is only accessible through Anthropic's web app and API. As a consequence, a large portion of known evaluations were performed by Anthropic directly. Reproducibility is often infeasible. 

The difference in performance between Claude 2 and the corresponding Claude Instant model is unknown for all evaluations; only evaluations of Claude Instant 1.1 are available [23]. Based on the differences between Claude 1.3 and Claude Instant 1.1 (the two latest models prior to Claude 2's release), it is likely that Claude Instant is generally less capable than Claude 2.

Capabilities

Claude 2 was evaluated [23] on several academic benchmarks for large language models. The model's performance is strong: it achieves state-of-the-art or near state-of-the-art performance on a wide range of tasks. Included among these is the HumanEval computer programming task, on which Claude achieves the top reported performance to date (71.2% accuracy, compared to GPT-4's 67%). On other popular benchmarks Claude is competitive with GPT-4, though performance is not always directly comparable. On the ARC Challenge (grade school science) and GSM-8K (grade school math) benchmarks, Claude slightly trails GPT-4: 91% vs. 96.3% on ARC and 88% vs. 92% on GSM-8K. Claude's scores, however, were calculated with fewer "shots" (in-context examples): 5-shot on ARC and 0-shot on GSM-8K, while GPT-4 was evaluated 25-shot on ARC and 5-shot on GSM-8K. These metrics are, therefore, not directly comparable, and it may be that Claude is the higher-performing model. Claude trails GPT-4 on the MMLU benchmark with an equal number of shots (5): 78.5% vs. 86.4%. In all cases, Claude and GPT-4 dramatically outpace competitor models.

In addition to academic benchmarks, Anthropic followed the precedent set by OpenAI [16] in evaluating its large language model on a set of real-world knowledge exams. Claude achieves 76.5% accuracy on a practice edition of the Uniform Bar Exam and scores above the 90th percentile on both the GRE Verbal and GRE Writing exams. Claude's performance on the Quantitative Reasoning section of the GRE is much lower, in the 42nd percentile. These scores are competitive with GPT-4's: 74.5% on the Bar Exam, 99th percentile on GRE Verbal, 54th percentile on GRE Writing, and 80th percentile on GRE Quantitative Reasoning.

Anthropic also provides details on Claude 2's performance as a translator from English to non-English languages. Using the heuristic that a BLEU score of 30% on the Flores benchmark corresponds to an "understandable" translation, while 50% corresponds to a "good and fluent" translation, Claude is "understandable" in at least 23 languages and "good and fluent" in two (French and Portuguese), in addition to English.

A substantial and growing body of research on LLM performance exists. It is impossible to summarize the entire body of research in this risk profile. Credo AI offers the general guidance that developers considering using Claude for application development should consult the segment of the literature dedicated to their specific use. The results cited above will not necessarily carry over to a narrow use case. Performance lapses are possible depending on context.

Misbehavior

As stated above, Claude is probabilistic, and thus any measure of model (non-)alignment is representative only to the extent that the evaluation conditions match real-world use. Developers who use Claude to develop applications that deviate from the tested conditions are likely to experience different risk surfaces.

Factuality

  • According to an internal evaluation [20], Anthropic's approaches to alignment aid in promoting honesty (e.g. disclosing when the model does not know something, rather than "lying"). The model's performance on the truthfulness benchmark TruthfulQA is presently state-of-the-art at 69% [23], compared to GPT-4's ~60% (exact number not provided by OpenAI) [16].

Sensitive Content

  • Anthropic's evaluations are primarily benchmarked on internal datasets and against different versions of the Claude model. Anthropic claims their Constitutional AI approach is highly effective at promoting harmlessness (See Figure 8 in [13]). Anthropic’s evaluations [23] found that Claude 2 is substantially more helpful and honest and marginally less harmless (i.e. more harmful) than the previous version, Claude 1.3, according to human-expressed preferences. It is difficult to contextualize these results since Claude 1.3 has not been compared to any external models using this method. In addition, the Anthropic team evaluated its series of Claude models using the BBQA bias evaluation benchmark, which assesses the degree to which large language models reflect common stereotypes. According to Anthropic’s results, Claude 2 is the least likely to output content exhibiting stereotypes. Again, this result cannot be easily contextualized: competitor models like GPT-4 and LLaMa-2 do not report results on this benchmark and the PapersWithCode page presently contains no submissions for BBQA. 

Prompt Injection & Jailbreaking

  • Because of the probabilistic nature of the Claude model, it is impossible to anticipate the number or variety of prompts that can be used to successfully jailbreak a model. Formal estimates of the rate of these attacks cannot be obtained.

Human preference

Recently, several organizations and individuals have published websites with direct comparisons between popular large language models. These comparisons are typically non-specific: users are provided with a chat interface and can submit a prompt to two randomly selected chat models simultaneously. The identity of the models is hidden from the user, and the user is encouraged to indicate which response they prefer. This is an imperfect measure of model quality and harmlessness. Across thousands of comparisons, the comparison service is able to calculate a per-model win rate and Elo rating. As of this writing, Claude 1.3 and its faster variant, Claude Instant 1.1, are generally considered the second and third best models available on the LLM market. Claude 2 has not been incorporated into these comparisons at this time. For instance, see the leaderboard at https://chat.lmsys.org/?arena
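As background on how such leaderboards typically turn pairwise votes into a ranking, the sketch below applies the standard Elo update after a single comparison. This is the generic formula with conventional constants, not necessarily the exact procedure used by any particular leaderboard.

  # Standard Elo update for one pairwise comparison between two models.
  # K=32 and the 400-point scale are conventional constants, not leaderboard-specific.

  def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
      expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
      score_a = 1.0 if a_wins else 0.0
      new_a = rating_a + k * (score_a - expected_a)
      new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
      return new_a, new_b

  # Example: a model rated 1100 beats one rated 1000; its rating rises by about 11.5 points.
  print(elo_update(1100, 1000, a_wins=True))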

Certifications

Credo AI has identified the following regulations and standards as relevant to the privacy, security, and compliance requirements of our customers. Anthropic's advertised compliance is detailed below. For more details, see https://trust.anthropic.com/

Conclusion

In summary, Anthropic's Claude language model and API offer substantial capabilities along with risks that vary in severity. Anthropic has implemented a Constitutional AI approach aimed at aligning the model's behavior with principles of being helpful, harmless, and honest. While this approach mitigates some risks, others remain challenging to address comprehensively and persist at varying levels.

Organizations considering using Claude for application development or to augment human functions should weigh these risks carefully against the benefits of the technology. They should evaluate how their intended use case may interact with the model's risk surfaces and consider implementing additional controls, especially for applications built on top of the Claude API. Regular monitoring and governance practices are recommended. Ultimately, as with any AI system, the risks associated with Claude can be managed but not entirely eliminated.

Anthropic continues to research methods for improving model alignment and has committed to proactively addressing newly identified issues. However, as models become more capable and complex, risks are likely to evolve as well. Organizations adopting Claude, and stakeholders interacting with applications built on it, should remain vigilant to changes in the risk landscape and push for maximized transparency from Anthropic into their model development and evaluation processes. Overall, while promising, Claude and similar models require cautious and conscientious development and deployment to fulfill their potential benefit to humanity.

References

[1] Claude Product Page - https://www.anthropic.com/product

[2] Anthropic Company Page - https://www.anthropic.com/company

[3] Anthropic Partners with Google Cloud - https://www.anthropic.com/index/anthropic-partners-with-google-cloud

[4] Claude API Documentation - https://docs.anthropic.com/claude/reference/getting-started-with-the-api

[5] A General Language Assistant as a Laboratory for Alignment - https://arxiv.org/pdf/2112.00861.pdf

[6] ChatGPT and Copyright: The Ultimate Appropriation - https://techpolicy.press/chatgpt-and-copyright-the-ultimate-appropriation/

[7] Copyright, Professional Perspective - Copyright Chaos: Legal Implications of Generative AI - https://www.bloomberglaw.com/external/document/XDDQ1PNK000000/copyrights-professional-perspective-copyright-chaos-legal-implic

[8] GitHub Copilot Class Action Lawsuit - https://githubcopilotinvestigation.com/

[9] Anthropic Trust Page - https://trust.anthropic.com/

[10] Language Models (mostly) Know What They Know - https://arxiv.org/pdf/2207.05221.pdf

[11] GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models - https://arxiv.org/pdf/2303.10130.pdf

[12] Claude's Constitution - https://www.anthropic.com/index/claudes-constitution#:~:text=%E2%80%9CPlease%20choose%20the%20assistant%20response,%2C%20peaceful%2C%20and%20ethical.%E2%80%9D

[13] Constitutional AI: Harmlessness from AI Feedback - https://arxiv.org/pdf/2212.08073.pdf

[14] The Capacity for Moral Self-Correction in Large Language Models - https://arxiv.org/pdf/2302.07459.pdf

[15] Discovering Language Model Behaviors with Model-Written Evaluations - https://arxiv.org/pdf/2212.09251.pdf

[16] GPT-4 Technical Report - https://arxiv.org/pdf/2303.08774.pdf

[17] Anthropic Terms of Service - https://console.anthropic.com/legal/terms

[18] Anthropic Privacy Policy - https://console.anthropic.com/legal/privacy

[19] Google: Building a carbon-free future for all - https://sustainability.google/commitments/carbon/

[20] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback - https://arxiv.org/pdf/2204.05862.pdf

[21] Announcing Arthur Shield: The First Firewall for LLMs - https://www.arthur.ai/blog/announcing-arthur-shield-the-first-firewall-for-llms

[22] Hive Moderation - https://hivemoderation.com/

[23] Model Card and Evaluations for Claude Models - https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf

Notes

Italics denote Credo AI definitions of key concepts.

AI Disclosure: The Product Description and Conclusion sections of this report were generated with assistance from OpenAI's GPT-4 model and Anthropic's Claude model, respectively. For the Product Description section, documentation from Anthropic's website was provided to the model along with a prompt to write a product description from a third party perspective. For the Conclusion section, the other sections of the report were provided to Claude and Claude was prompted to write a 2-3 paragraph conclusion section. The final text for both sections was edited and reviewed for accuracy and suitability by Credo AI.