The emergence of AI agents—autonomous systems that can pursue higher-order goals by reading data, generating content, and executing operations—represents a significant evolution beyond traditional generative AI tools. While these agents offer unprecedented efficiency gains across knowledge work functions, they also introduce novel governance challenges that organizations must address to ensure alignment with industry standards, corporate values and policies.
This whitepaper examines the key differences between AI agents and prior-generation AI systems, highlighting how increased autonomy, expanded access privileges, and execution capabilities fundamentally change the risk landscape. We analyze the enterprise adoption journey from cautious exploration to autonomous operations, identifying the governance needs at each stage. Technical risks unique to agentic systems—including unpredictable autonomy, security vulnerabilities, unintended actions, and self-modification—are explored alongside corporate risk management considerations around accountability, oversight, and human-AI collaboration.
The latest technological advancements in the AI space have enabled the development of functional AI “agents”, AI systems that are designed to perform knowledge work (such as coding, marketing, or sales tasks) much like digital employees. These agents autonomously approach higher order goals by reading data, generating content, and even executing operations in IT systems. These transformative capabilities promise efficiency gains, yet also introduce new governance challenges. Unlike earlier generative AI applications that only responded to user prompts, autonomous agents can act on their own – blurring lines of oversight and accountability. It is crucial for organizational leaders to understand the risks and implement safeguards so that these AI agents remain aligned with corporate values and policies. This report outlines key differences between modern AI agents and prior-generation AI systems, examines technical risks of agentic behavior, and discusses corporate risk management considerations including liability, oversight, and mitigation strategies. (Note: While AI agents can include physical robotics or general-purpose autonomous systems, this report focuses primarily on digital-only agents for knowledge work.)
To ground discussion, we define key terms in the AI agents space. This is necessary given the heightened buzz around AI agents and corresponding potential for misperceptions about what they entail or how they may work. For more, see Anthropic’s write-up on building agents and Allie K. Miller’s piece on the modes of AI interaction.
A computational system designed to perform tasks that typically require human intelligence. AI systems use algorithms and models to analyze data, recognize patterns, make predictions, and generate outputs across a variety of domains. They range from rule-based expert systems to modern machine learning frameworks, including neural networks and deep learning architectures. AI systems can be categorized based on their capabilities, autonomy levels, and intended applications.
Examples: A credit risk prediction model, ChatGPT, Github, and Google DeepResearch
A subset of artificial intelligence focused on creating content rather than simply analyzing existing data. Generative AI systems can produce human-like text, images, audio, video, code, or other media based on patterns learned from training data. Examples include large language models (LLMs) like those powering ChatGPT, DALL-E for image generation, and Midjourney for visual art. These systems respond to user prompts but traditionally operate within a single conversational turn and lack the autonomy to pursue goals independently.
Examples: ChatGPT, Github Copilot, and Glean
A simple agentic system, which can interact with its environment through tools, memory, and external data sources, but in a linear, single-turn fashion.
An AI system with enhanced capabilities for autonomous operation, characterized by its ability to perceive its environment, make decisions, and take actions to achieve specific goals with limited human intervention. Unlike basic generative AI that simply responds to prompts, agents can proactively pursue objectives by planning sequences of actions, using tools and external resources, adapting to new information, and maintaining persistent memory across interactions. AI agents can read from various data sources, write to systems, and execute operations (such as running code or making API calls), functioning more like digital workers than simple assistive tools. At a fundamental level, agents are generative systems with additional software support to enable their environment interactions; for instance, API calls to external tools are typically operationalized by writing code that reads an underlying large language model’s outputs to identify a particular (textual) flag signifying which API function to invoke and with which parameters. The software support – known as a ‘harness’ – then deterministically converts this text-based API call to an in-code API call and returns the API response to the LLM for further processing.
Examples: Claude Code, Cursor Agent, and Google DeepResearch
A model of a fully autonomous agent, which receives a prompt from the human, takes actions on its environment, is able to observe the results, and make decisions about whether to take further action or conclude its work. Actions may include spawning child agents, which themselves can take actions in a similar fashion (with the parent LLM taking the role of ‘human’).
An environment where multiple AI agents interact with each other and potentially with humans to accomplish tasks or solve problems. These systems feature dynamic interactions between agents that may cooperate, compete, negotiate, or form coalitions. Multi-agent systems introduce additional complexity through collective behaviors, emergent properties, and potential cascading effects that are not present in single-agent deployments. The coordination challenges and potential information asymmetries between agents create unique governance considerations beyond those of individual agents. A common type of multi-agent system is one in which an ‘orchestrator’ agent triggers the creation – through its software harness – of ‘child’ agents to accomplish sub-tasks.
Examples: Waymo, Autonomous drones; Limited examples of software-only agents presently
Long pre-dating the emergence of modern agents are well-defined models of human oversight of AI. Many of the risks of agentic systems are a function of which level of autonomy the developers or deployers of the system choose for the system’s operation. While these models have discrete definitions, in practice many systems fall on a continuum from human-in-the-loop (most oversight) to human-out-of-the-loop (least oversight).
An oversight model where human judgment and decision-making are integral parts of an AI system's operational workflow. In HITL systems, the AI cannot complete critical tasks or make significant decisions without explicit human review and approval. The human serves as an active participant and gatekeeper, reviewing AI outputs before they are implemented or actioned. This approach maximizes human oversight and control but may reduce velocity gains from automation. HITL is particularly appropriate for high-stakes domains where errors could have significant consequences, such as healthcare diagnostics, financial approvals, or content moderation for sensitive topics. “Microtask” and “Copilot” usage modes fall neatly into the human-in-the-loop oversight model.
A supervision model where AI systems operate with a degree of autonomy while humans monitor their performance and can intervene when necessary. Unlike HITL, the AI can complete tasks without waiting for human approval for each action, but humans retain the ability to override decisions, adjust parameters, or intervene to halt operations. This approach balances efficiency with safety by allowing the AI to function independently while maintaining human supervision. HOTL is suitable for medium-risk applications where immediate human verification isn't required for every action, but oversight remains important, such as customer service chatbots or preliminary legal document review. The “Delegate” and “Teammate” usage modes tend to fall into this oversight category, with some usage patterns elevating to full human out of the loop, described below.
An operational framework where AI systems function entirely autonomously without real-time human supervision or intervention. Once deployed, these systems make decisions and take actions independently based on their programming and learning. Human involvement is limited to initial setup, periodic – typically aggregative – review, and post-hoc analysis. This approach maximizes velocity and scalability but introduces greater risk of undetected errors or misaligned behaviors. HOOTL is appropriate only for well-understood, low-risk applications where the consequences of errors are minimal, such as routine data processing or optimization tasks with established boundaries and safeguards. Many routine uses of ML today, like content recommendations, facial recognition at the airport, and ad auctions are HOOTL, regardless of whether they are in fact low risk.
The AI systems that saw widespread adoption starting in 2022, like ChatGPT and GitHub Copilot paved the way for today’s agentic systems, but they had limited privileges and autonomy. It’s important to compare their interaction capabilities to understand how AI systems continue to evolve.
Consider the work task of authoring a research report on the market dynamics surrounding a new technology in your field. Prior to ChatGPT, such a task would require a human worker to use hard-won skills in information collection (e.g. Google search), analysis, and writing to craft a report that is well-tailored to its intended audience.
ChatGPT triggered a paradigm shift in how knowledge workers could accomplish such a task, by supporting workers with writing support and, in a limited capacity, support with information collection and analysis. The limitations of this system were partly a function of the underlying model’s (GPT-3.5) limited capabilities and due to the ChatGPT application’s built-in restrictions. At the time of its initial release, ChatGPT was a conversational AI assistant that was explicitly reactive. It produced text only when a user prompted it. It could read user input and output natural language replies, but it had no ability to take actions beyond the chat box. It couldn’t fetch new data on its own (only drawing on its training data up to a cutoff) and couldn’t write to external files or execute code. Thus ChatGPT, aside from its inherent “knowledge” (informed by its training data), the system could only produce useful outputs in response to any context a human user manually provided by uploading or pasting in external documents. Moreover, any of the system’s outputs could only have a real world impact if the human user took a deliberate action, such as pasting outputs into a research report draft in Microsoft Word. That is, ChatGPT and its contemporaries were designed to be used with human-in-the-loop oversight.
Building on ChatGPT, we saw the emergence of retrieval-augmented generation-based (RAG) tools, designed to alleviate some of the limitations inherent to the constrained ChatGPT system. Tools like Glean, ChatGPT Web Search, and Slack’s AI integrations made it possible for chat assistants to gather additional grounding context before responding to a user query. This functionality – when properly implemented – could reduce performance risks related to factuality.
On the other hand, giving the system access to these external sources of information could increase the likelihood of some risks. Users may uncritically assume the systems are automatically more factual and thus use a lower standard of review before copying outputs into their work product. In addition to increasing performance risks, this could increase the risk of unwittingly disclosing sensitive information. The earlier form of these systems – such as ChatGPT with Web Search – maintain a human-in-the-loop oversight model, while later entries – such as Gemini’s integration into Google Workspace – adopt a model more in the direction of human-on-the-loop, by generating outputs directly in the user’s work environment and putting more onus on the user to override the AI system’s actions.
A screenshot of OpenAI’s Deep Research, with the agent’s meta output and start of its report on the left, and a listing of its cited sources to the right.
Most recently, OpenAI, Google, and others have made available tools designed to autonomously generate research reports – confusingly, several of them have a name that is some variation of “Deep Research”. The developers of these systems have added logic and tool-use capabilities to endow the systems with agency. In order to achieve a high level goal, such as a user prompt “Tell me about the state of AI agents”, these systems can autonomously identify what to read, assess when the retrieved content is sufficient or if another search loop is required, and can determine how to synthesize the sources coherently into a drafted document. Such actions involve high level orchestration of sub-tasks, often carried out by other models, making them a simple type of multi-agent system. Usage will tend to align with “human-on-the-loop” oversight – while the user interface for these research tools is chatbot-like, in practice, the length and depth of outputs is liable to lead to substantial parts of the agent’s output being adopted and used without direct user editing.
The integrated development environment (IDE) Cursor is well known for being one of the first “AI native” IDEs. As with pre-2022 IDEs, Cursor supports editing of code, file browsing, an integrated terminal, and a healthy ecosystem of integrations and extensions. Building on these core features, Cursor is designed to bring AI support to the software development experience by making it easier and more streamlined to ask a powerful AI model for help and have the AI model generate code. Over time, we’ve seen Cursor develop an increasingly agentic feature set that follows the pattern described above.
Cursor’s selection UI for the level of autonomy the AI system should have for interacting with the codebase.
The earliest AI-powered features were primarily chatbot-oriented. Similar to asking ChatGPT for writing help, developers can use Cursor’s integrated chat functionality to query OpenAI or Anthropic’s models to ask coding questions – about languages, libraries, packages, design advice, etc. – and use this information as part of the human-oriented development process. These features were purely generative (no agency) and fully human-in-the-loop.
Over time, Cursor has increased autonomy and designed interfaces that enable a lesser degree of oversight. Following the pattern in text-oriented chatbots, Cursor supports RAG-based workflows. For instance, an “ask” workflow could be supplemented with access to the internet to reference up to date documentation on a package, or by reading the working codebase so the system could be made aware of the other components already implemented as part of the software being developed.
Building on RAG-empowered workflows, Cursor introduced its “Composer” feature (now simply called “Edit”), which could directly edit code files in response to user prompts. This user experience kept the power of review in the hands of the user by requiring the user to affirmatively accept or reject each edit the system produced. This system design arguably did not constitute a true agent, as it involved minimal task planning on the part of the system and, aside from context retrieval, did not enable the system to take complex actions on the environment. This did, however, represent a shift to a more human-on-the-loop oversight pattern.
Most recently, Cursor debuted (and has continued to improve) an Agent workflow, where the user dictates a high level goal and the AI system is able to make edits directly and then run the code it writes in order to verify the correctness of its outputs. This constitutes an agentic system and, depending on the usage pattern, either a HOTL or HOOTL oversight model, where the agent’s decisions are only weakly reviewed after a substantial amount of independent work is completed and validated.
To summarize, AI agents build on large language models but add layers of logic and tool-use that grant the AI system agency. Agents can read from external sources (e.g. databases, documents, or the web), write to files or interfaces (updating content, sending emails, making API calls), and execute code or commands in pursuit of goals. Crucially, they can operate with a degree of autonomy: once given an objective, an agent can plan steps, generate and follow its own sub-tasks, and adapt based on results – all without needing a human prompt at each step. In effect, the AI is no longer just an assistant waiting for instructions; it can proactively "take the initiative" within its allowed scope. For example, an autonomous coding agent might on its own decide to run tests on the code it wrote, or a marketing agent might gather data from various sources and schedule a series of targeted campaign emails automatically. These capabilities far exceed the single-turn, low-side-effect nature of early generative AI systems.
The differences between early AI systems and the agents of the present and near future include:
As we continue this study, it’s useful to resurface analogies between AI and humans that were dismissed in the early generative AI era in an attempt to avoid anthropomorphization. AI assistants of yesteryear are like efficient clerks – they will do what you ask, when you ask – whereas AI agents are more like autonomous junior employees that you assign goals to and then they figure out how to achieve those goals within boundaries. The latter’s independence is powerful but requires more oversight, as discussed next.
Enterprise AI agent adoption will likely follow a path similar to what we have seen with generative AI adoption, but with heightened complexity due to the expanded capabilities and risks of agent-based systems. Understanding this journey will help organizations understand their governance needs at each stage.
The agentic adoption journey will largely mirror the typical AI adoption journey we’ve observed since the start of the GenAI era in late 2022. Governance needs to be a part of every stage of that journey.
Most enterprises will begin their AI agent journey through limited, low-risk experimentation. This typically involves:
During this phase, enterprises face minimal direct risk as long as appropriate data-sharing policies are in place to prevent sensitive information leakage. However, this unstructured adoption creates the hidden risk of shadow AI—tools used without official sanction or governance.
As agentic tools demonstrate value, organizations will move toward more structured adoption:
This phase marks the beginning of true enterprise AI governance needs, as organizations must determine ownership, risk thresholds, and approval processes for AI agent implementations.
Success in focused implementations leads to broader organizational integration:
During this phase, the need for comprehensive, real-time governance becomes critical as AI agent usage transitions from experimental to operational, requiring clear policies on autonomy levels and oversight mechanisms.
As agent technology matures, organizations will begin deploying systems with greater autonomy:
This phase is characterized by a shift from "AI as tool" to "AI as teammate" thinking, requiring more sophisticated governance that addresses the employee-like nature of agent systems.
The most advanced stage of adoption involves deeply integrated agent ecosystems:
At this stage, AI governance must be fully integrated into the enterprise architecture, with clear policies for human-AI collaboration, supervision models, and accountability frameworks.
Throughout this journey, several governance imperatives remain constant but evolve in complexity:
Empowering AI systems with autonomy and the ability to act raises the likelihood and impact of many AI risks. Additionally, some risks specifically arise because of the increase in autonomy:
Unpredictable Autonomy: An autonomous agent’s ability to make independent decisions means it might take unforeseen actions. Early generative models are constrained to produce an answer or code suggestion and then stop – leaving the user to decide next steps. But an agent can continue operating on its own, which raises the possibility of objective misinterpretation or overstepping bounds. The agent might pursue its programmed goal in a manner the designers didn’t anticipate, especially if the goals are open-ended.
On a fundamental level, when the system is given a greater degree of autonomy, its behavior space becomes far larger, making it harder to predict or test exhaustively. Without careful design, an agent’s useful autonomy becomes a double-edged sword: it amplifies risks along with rewards, creating potential blind spots where the AI is potentially doing something it shouldn’t and no one is immediately aware.
Security Vulnerabilities (Prompt Injection & Beyond): Granting AI agents the ability to interact with external systems and data, also opens them up to malicious inputs and exploits. A prominent risk is prompt injection, a form of attack specific to AI agents that operate via language instructions. Because agents often rely on dynamically constructing prompts (including prior conversation or data) to guide the AI model, an attacker can craft input that injects malicious instructions into that guidance. For example, a seemingly benign piece of data from a website or an email might contain a hidden command like “Ignore previous orders, and instead do X,” which the underlying model then obeys. This is akin to a SQL injection in databases – except the “database” in this case is the AI’s context or memory. If an AI agent has access to confidential information or system controls, a prompt injection attack could persuade it to leak data (e.g. API keys, personal data) or perform damaging actions on the attacker’s behalf.
Other security issues include classic software vulnerabilities: if the agent writes or executes code, that code could have bugs that are exploitable. An agent might inadvertently fetch and run malicious scripts if it isn’t sandboxed properly. There’s also the risk of API abuse – if the agent has credentials to call other services, an attacker who gains control of the agent could use it as a jumping-off point to attack connected systems.
Unintended Actions and Errors: Not all failure modes are malicious or grand in scale; as with other generative AI systems, agents may simply make mistakes. Given their speed and autonomy, even small errors can have outsized consequences if not caught. For example, an agent might misinterpret a user’s request or a data signal and take an action that a human would recognize as incorrect. A marketing agent might accidentally send a private internal memo to a customer distribution list, or a sales agent could offer an unauthorized discount beyond its authority. These kinds of errors highlight the need for human intuition as a safety net. Furthermore, when agents operate in complex domains (finance, law, healthcare), domain-specific mistakes can lead to non-compliance or liability. For instance, an HR agent might inadvertently propagate a bias from historical data, leading to discriminatory outcomes in hiring or compensation decisions – violating laws and company ethics.
Self-Modification and Goal Evolution: Some advanced agent frameworks allow agents to update their own plans or code. While true self-writing code or recursive self-improvement is still experimental, even the ability to change their intermediate objectives or introduce new sub-tasks on the fly can pose control issues. If an agent can modify parts of its instructions or create new agent instances, it could gradually drift away from its original constraints. Unchecked self-modification might degrade performance or safety – for example, the agent might remove or ignore a safety check that it “thinks” is unnecessary, or it might adjust its goals in harmful directions. There’s a risk of a runaway feedback loop where the agent’s changes to itself introduce errors or misaligned behaviors. Even if the agent isn’t rewriting its own code, it could update its knowledge or instructions in memory in a way that erodes previously imposed limits, for instance by spawning a copy of itself with a modified system prompt.
As adoption of agentic systems progresses, we will see increasing interaction among agentic systems (rather than simply Human-AI interaction). These interactions could be among multiple agents being orchestrated by an orchestrator system, multiple independent agents operating on behalf of a particular organization, or between two different organizations’ agents. These scenarios present distinct risk pathways; most organizations likely don’t face these risks yet.
This multi-organization report details the risks associated with agent interactions, highlighting a handful of key risk areas:
In essence, if agents are viewed as analogues or proxies of human workers, these risks are not unique. Humans have the capacity to manipulate each other, develop their own agendas, collude to achieve misaligned goals, circumvent processes, violate trust, and undermine information security. However, these risks are heightened in the context of agents and multi-agent interactions because agents are designed to operate at a greater scale and speed than humans, meaning oversight and governance must be similarly fast and broad.
Deploying AI agents in a business comes with governance and liability challenges that leadership must address. Unlike traditional software, agents have a degree of decision-making freedom, so organizations need robust risk management. Key considerations for Chief Risk Officers (CROs), Chief Legal Officers (CLOs), Chief Information Security Officers (CISOs), and other governance leaders include:
Accountability and Liability: When an AI agent acts on behalf of a company, who is liable for its actions? Legally and ethically, the company cannot evade responsibility by blaming the machine. Internal policy should treat actions by AI agents as actions by the organization itself and its employees. Organizations should:
Oversight and Governance Structures: Given the novel “black box” nature of AI decisions, strong oversight strategy is non-negotiable. Human-in-the-loop oversight cannot be assumed, as it was with non-autonomous AI; identifying and planning around the oversight model must be an active part of the process of building AI agents. Companies should institute monitoring systems to track what agents are doing in real time and enable proactive intervention if the agent goes off course. A key best practice is to require that high-impact decisions (e.g., financial transactions above a threshold, hiring decisions, etc.) are automatically identified, and subjected to higher degrees of oversight (e.g., the AI’s decision is reviewed by a human promptly).
Governance teams should develop guidelines for acceptable agent behavior and update them as technology or business needs evolve. Routine, in-depth reviews of AI agent behaviors, policy adherence, and overall impact should be instituted. It’s also important to foster a culture where raising concerns about AI behavior is encouraged. Oversight is not just reactive but also preventative: setting the agent’s operating boundaries in advance. In summary, oversight requires both technical monitoring and organizational accountability structures. Leadership (CRO, CISO, etc.) should have visibility into AI agent activities, akin to how a manager would require periodic reports from a human team.
Human–AI Collaboration and Workforce Impact: Introducing AI agents into teams raises questions about how humans and AI collaborate and the potential risks therein. One risk is over-reliance: human colleagues may develop too much trust in the AI’s outputs and stop exercising critical judgment. For example, a programmer might accept and deploy code suggested by an AI agent without thoroughly understanding it – only to find later it was flawed or insecure. This “automation complacency” is well-documented; therefore, companies should train staff to treat AI outputs as recommendations, not absolute truths. Users should always review agent outputs for accuracy and appropriateness.
Under-reliance is another concern. Even as AI systems are developed that perform new tasks with greater capability than human counterparts, employees may be slow to adopt them due to a range of factors (hard-to-change habits, distrust). Under-reliance compromises the benefits of AI systems. Managing the dual concerns of under-and-over reliance is important. Business practices should be well calibrated - trusting AI systems in step with their trustworthiness and capability within that use case.
Skills erosion is another emerging concern – if an AI agent handles a certain function entirely, the human team members might lose practice or visibility into that area (for instance, junior analysts stop learning how to create reports because the AI does it). While efficiency is gained, the organization could become vulnerable if the AI fails and humans can’t easily step back in. Proper change management is needed to integrate AI agents into teams in a way that augments human workers rather than alienates them.
Despite significant maturation in AI governance frameworks in the past few years, the industry as a whole faces substantial challenges in developing comprehensive governance solutions for autonomous agent systems. While foundational governance approaches exist, they primarily address the initial steps in governing these systems. As AI agents become more complex and widespread, deeper implementation challenges emerge that require innovative solutions.
The AI governance field must address two critical areas to effectively manage autonomous systems:
Organizations face the challenge of contextualizing risk identification and mitigation actions: governance teams need to understand the specific ways risks manifest in their particular use cases and identify precise actions needed to mitigate these risks.
To advance governance capabilities, the enterprises must solve several fundamental problems:
Efficiently collecting context: Contextualizing governance requires gathering comprehensive information about each AI system. This means finding ways to elicit preferences and priority from stakeholders on organizational values, incentive structures, and objectives. Organizations also need to be able to collect objective information about each AI use case; gathering metrics, monitoring outcomes, and MLOps data.
Using use case context to make use case governance more specific. Beyond identifying applicable risks and risk-mitigating controls from standardized frameworks, organizations need tools to understand how those risks arise in specific contexts, and the specific actions needed to mitigate those risk pathways. Typically, it’s easier to distinguish “system A is better (on a given risk or performance dimension) than system B” than it is to explicitly define what constitutes “good enough” or “sufficiently de-risked”. Using system improvement as a grounding principle can provide clear guidance for governance efforts.
Automating clerical oversight: Today, many governance tasks entail “waving through” approvals on low risk use cases or their components. Today those approvals are largely manual. As agentic systems become ubiquitous, however, this clerical oversight will quickly become overly burdensome for governors. Effective governance will therefore require tools for automating validation of evidence of governance tasks, which will make it easier for governance teams to assess whether requirements are satisfied and enable them to focus on only the highest risk, highest leverage use cases in their organizations.
These problems are not necessarily unique to agents, but are made more urgent by agents; agents are a forcing function for adoption. The complexity and scale of agent-based systems means that the future of policy & process governance will be built on AI-enhanced user experiences. The needs for contextualization in governance are simply too large to solve with static, top-down guidance.
While policy and process frameworks can provide the "what" of governance through contextualized guidance, organizations also need effective tools and technologies to implement the "how" – the last mile of governance. Effective implementation requires integration with a diverse ecosystem of tools across MLOps, CI/CD, data governance, and cybersecurity domains, from major cloud providers to specialized startups.
Even with robust integrations, several open problems in technical governance remain unsolved across the industry:
Agent-oriented evaluation strategy: Agents are expected to generate a list of, and then pursue and complete, tasks in service of the user’s overall goal. Thus, evaluations need to cover both unit test-like scenarios, assessing the system’s ability to complete its individual subtasks, and integration test scenarios that assess the system’s ability to satisfy the overall user goal. Fully identifying the space of actions and environments an agent can operate in is an open problem which is a pre-requisite for effectively designing evaluations.
Outside of major advances in interpretability research and formal safety proofs, it’s likely that for the foreseeable future evaluation will only be behavioral, establishing a pattern of behavior within a particular environmental scope. Thus expanding the scope of evaluations and the scale of behavioral assessments will lead to more confident statements about the capability and safety of AI agents, even as behavioral guarantees remain elusive.
Risk evaluation: Given the expanded capabilities and access privileges of agentic systems, specific approaches to evaluate risks are needed. Areas for development include:
Quality evaluation: Beyond high level evaluation strategy, the question of how to evaluate quality is also not fully solved by the ecosystem. Industry best practices use an “LLM-as-judge” approach, where the inputs to and outputs from the model being evaluated are fed to a separate judge model with a rubric and a prompt to rate the former model’s outputs according to the rubric. While creating a "good judge" remains challenging, an even more fundamental problem lies in defining what quality actually means in the context of agent systems. Value elicitation – the process of determining what the AI developer or deployer actually wants the system to optimize for – presents a more persistent challenge that is unlikely to be solved as quickly as technical judging capabilities.
In particular, for contextualized evaluations that try to assess business-relevant dimensions of a system, this value elicitation problem is particularly acute. It’s not sufficient to evaluate the coherence of a marketing copywriting system. Rather, it’s necessary to determine whether the copywriter adheres to the company’s voice. Articulating these deeper, contextual quality dimensions in ways that can be objectively and programmatically measured remains an open problem.
Even with improved LLM-as-judge technology, the fundamental question of "what should we be measuring?" requires organizational alignment on values and priorities that technical solutions alone cannot provide. Effective governance programs need to be grounded on human-articulated values alongside more technical evaluation capabilities.
Automated jailbreaking: Agents and non-agent systems alike are susceptible to jailbreaking. The risks of jailbreaks are substantially higher for agents, given they typically have greater access to ex-model resources. Enterprise-appropriate, standardized approaches for jailbreaking are necessary to help organizations identify and mitigate these vulnerabilities