Last week, the first-ever public generative AI red team challenge took place at the AI Village at DEF CON 31, where over 2,200 participants tested models provided by some of the biggest Large Language Model (LLM) providers, including: Anthropic, Google, Hugging Face, NVIDIA, OpenAI and Stability.
Attracting a vibrant community of hackers and security experts - whose lives are devoted to understanding and combating vulnerabilities inherent to our digital infrastructure - DEFCON-31 was brimming with valuable lessons for the AI community. Credo AI participated in the Generative AI Red Teaming challenge to gain on-the-ground insights about “what has changed” for risks and vulnerabilities with the advent of these powerful LLMs.
In addition to red-teaming in the “AI Village,” the Credo AI team spent time with representatives from the White House Office of Science and Technology Policy (OSTP), Chief Digital and Artificial Intelligence Office (CDAO), Office of Management and Budget (OMB) and policymakers that also want to understand and combat new risks inherent to AI and LLMs. Opportunities like “Policy Village” at DEFCON continue to further bridge the gap between technologists and policymakers.
As The Dark Tangent wrote in the conference attendee guides for #defcon-31,
“Hackers need to have a seat at the policy table as technology becomes more complicated and important. Hackers see things differently than industry trade groups, lobbyists, and rent seeking commercial entities…this year we have increased the opportunities to engage with policy makers and those that support them. They are as interested in us as we are them.”
This statement is more true in 2023 than ever before, when the focus of global regulators has turned to the unique risks and challenges associated with AI Foundation Models.
Key Insights from DEF CON 31:
- It takes a village: AI Trust is composed of many complicated puzzles, including security. No one company is going to “solve” AI Trust alone, but a vibrant ecosystem of startups can create AI Trust together.
- Translation failures are security failures: AI systems can be vulnerable in many ways. Comprehensive safeguards require regulatory, human and technological coordination and translation to create controls that ensure coverage and redundancy.
- It’s good to break things together, in the open: while there is healthy debate about the extent to which models should be open-sourced, it’s clear that AI systems are made safer when vulnerabilities are openly identified. The benefits of AI red teaming are clear… however, it’s important to note that red-teaming alone is not a scalable solution to surfacing vulnerabilities in LLMs.
- Apply cybersecurity risk management to AI: a chain of reporting called “third party risk management” - where both the application builder and upstream vendors are required to provide reports - is very common in the cybersecurity space, and it can (and should) be applied to safeguarding the AI ecosystem.
It takes a village
DEF CON is composed of a diverse mix of "villages," which highlight how security emerges from the intersection of many specialized domains. “Policy Village,” “Voting Village,” “BioHacking Village” and more are just a few examples of the different but interrelated subcomponents of security. AI Trust also “takes a village.” Trustworthy AI is a multifaceted goal with many complications, including technical puzzles and transparency requirements for a variety of different audiences.
Credo AI witnessed first-hand the deep expertise needed to address the security concerns that underpin Trustworthy AI - problems of misinformation, prompt injection, bias, misuse and model steganography (the process of embedding a malware payload into a model’s weights) to name a few. No single organization or discipline possesses expertise across all these areas, let alone the components of AI Trust beyond security. The necessary innovation is therefore distributed, happening in a vibrant marketplace of startups and small/medium enterprises, which have become essential to safeguarding this ecosystem. This distributed innovation ensures that best practices are more easily accessed and transferred, and prevents consolidation of power in the security of the ecosystem at large.
Credo AI is part of the solution for SMEs in AI Governance.
All members of this fast changing ecosystem can sometimes feel overwhelmed - but solving these giant problems does not need to reside in any one sector or type of person. Security is a fundamental layer to the development of Trustworthy AI - and just like anything else - we need a policy to code translation.
As a Responsible AI Governance Platform, Credo AI contributes a holistic perspective spanning regulation, ethics, and technical implementation. We provide oversight while partnering with domain specialists tackling individual trust puzzles. Ultimately, trust emerges from an open community applying expertise and oversight in a coordinated way. Like an ethical security culture, it requires leadership embracing accountability and transparency across the ecosystem.
Translation failures are security failures
Just like any “game,” understanding the powers and abilities of other “players” enables all players to play the game more effectively. All “players” or individuals who interact with an AI system - engineers writing code, businesses developing applications on that code, and policymakers mandating what security measures should be in place for that AI system to enter the market - must understand each other. Failing to translate regulation or legislation accurately to the coder of an AI system will create security risks and vulnerabilities (e.g. compliance checklists that are technically followed to the letter but do not address risks to the end user for which the policymaker designed the checklist). Conversely, a failure to understand the design of an AI system will cause policymakers to require the implementation of policies that fail to address real security concerns (e.g. overly burdensome regulations that require arbitrary checks and limitations that negatively affect the security of an AI system, or have no effect at all).
Policies can fall short of their intended aims if downstream implementations reveal unanticipated dynamics and consequences. For example, compliance checklists may technically be followed to the letter while missing the spirit of the law.
As DEF CON demonstrated through its social engineering village, humans themselves can be a “security vulnerability,” and understanding each other will be key to “leveling up” LLM security. Psychological biases and social pressures can open attack vectors no policy can completely mitigate, and regulations and processes are only as strong as the code that ultimately executes the system's behaviors.
Our participation in the Generative AI Red Teaming event, as well as our exploration of and discussion with the team behind the excellent security game Gandalf (designed by new AI Security start-up Lakera), as well as conversations with other AI and LLM security startups, demonstrated how much innovation remains to be explored in AI security.
“We created Gandalf and Mosscap (a DEFCON special edition spin-off of Gandalf) because we recognize that the key to effectively integrating LLM models at scale lies in understanding their vulnerabilities and limitations.The insights gathered from initiatives like GRT at DEFCON will likely play a pivotal role in shaping AI regulations for LLM technology in the foreseeable future. As an AI security company, we are proud to be part of this collective effort alongside numerous organizations and individuals, all united under the shared mission of ensuring the safety and security of the AI systems we're implementing.” - David Haber, CEO of Lakera
These multiple points of attack and control underscore why security must be viewed holistically, with redundancy built-in using a coordinated system of controls. Perhaps more critical, however, is a security culture driven by incentives at least as much as controls. Thoughtfully crafted regulations - based on technical understanding - are important not just for what they immediately require, but for the norms and practices they nurture over time. Genuine security emerges from a commitment to the principles behind policies, not just adherence to the letter. Realizing security from policy intentions to implemented code requires diligence at each layer. Breaking things together, in the open.
It’s good to break things together, in the open
DEF CON 31 held the largest AI red teaming event, where 1000’s of conference attendees volunteered to attack LLMs on multiple dimensions. “Red teaming” for AI refers to the practice of subjecting artificial intelligence systems, algorithms, models, and technologies to adversarial testing in order to identify vulnerabilities, weaknesses, and potential risks. Just as traditional red teaming aims to assess the security and effectiveness of various systems, red teaming for AI focuses on evaluating the robustness, reliability, and potential failures of AI systems in different scenarios.
Approaches to red teaming have evolved (and will continue to evolve) over time. Foundation model developers have been doing private red teaming of their systems for years. Since the end of 2022, inspired by ChatGPT’s release, a kind of distributed, directionless “red-teaming” has been happening on social media, where individuals find “jail-breaks” in the form of specific prompts that compromise model safeguards. For more specialized risks, like outputting information relevant to bioweapons, or misaligned agentic behavior, expert red teams are necessary.
AI Village brought a new flavor to this enterprise at this year’s DEF CON - encouraging security-minded “hackers” to compromise LLMs on particular dimensions. The data generated at DEFCON-31 will assuredly showcase new attack vectors and support the next iteration of safeguards.
However, it is also important to note that this period of human-based red teaming is only a start, and will not be sufficient for discovering the vulnerabilities of current AI systems or keeping pace with the evolving capabilities of future systems. For instance, in the weeks since Llama2’s release, researchers have already developed automated approaches to compromising LLMs that go far beyond what a human can do on their own. The red teaming of the future will see specialized teams using a range of approaches to compromise model security and safeguards. It’s also possible that the “red teams” themselves are increasingly composed of autonomous systems, due to the vast range of vulnerabilities and capabilities that need to be tested. This is the iterative, adversarial nature of safety. We should recognize that “red teaming” is a practice, not a specific approach: as our safeguards and attacks improve, and AI systems change, so must the approach.
Of course, identifying vulnerabilities is only the first step and doesn’t yet identify a plan for mitigation. The point of red-teaming is to proactively harden AI systems against future attacks, and, more generally, make these systems more trustworthy and beneficial for humanity. Ultimately, the proof of success for a safety initiative is born out by improvements in the system themselves, which must be tracked and understood.
This requires evaluation of the AI systems and open communication of the evolving risks and mitigation approaches developed. Thoughtful implementation of openness principles enables continuous improvement driven by a shared culture of trust. Providers receive specific guidance to strengthen systems, while the public gains confidence through evidence of accountability. Just as open coordination made DefCon attendees collectively smarter, cooperation driven by mutual commitment to safety promises the most secure path forward for AI.
Lessons from cybersecurity: third-party risk management
The cybersecurity industry has developed robust practices for managing third party risk that can serve as a model for the AI ecosystem. Supply chain transparency and security audits are commonplace requirements in cybersecurity contracts today. Customers expect detailed reporting on vulnerabilities identified and remediations taken. Providers operate with the understanding that their systems will be continuously tested by customers and independent auditors. This chain of accountability creates strong incentives for all parties to build and maintain secure systems.
Applying similar third party risk management practices to AI systems would significantly improve transparency and trust. AI providers could be required to share details on their model training processes, documentation, and ongoing monitoring. Customers deploying AI could mandate security audits and red team testing, sharing results back with providers to enable ongoing model improvements. Independent auditors could also be engaged to evaluate systems. This web of reporting distributed across the AI value chain reinforces accountability and provides assurance to the public. As in cybersecurity, embracing transparency as a cultural norm, not just a compliance checklist, is key to realizing robust, trustworthy AI.
The insights and lessons from DEF CON 31 highlight that realizing trustworthy AI is a complex challenge requiring diligence across many dimensions. Policy intentions must translate into robust technical implementations, with humans thoughtfully stewarding AI development each step of the way. No one organization can address this challenge alone. Instead, an ecosystem of contributors is needed, collaborating openly to identify vulnerabilities, and drive continuous improvement. Credo AI is excited to support this ecosystem as a governance keystone, ensuring that the contributions of these diverse actors result in more broadly distributed, trustworthy AI.